Design and Implementation of Regular Expressions for Version Number Parsing

Keywords: regular expression | version number parsing | wildcard

Abstract: This paper explores the design of regular expressions for parsing version numbers in the format version.release.modification, where each component can be digits or the wildcard '*', and parts may be missing. It analyzes the regex ^(\d+\.)?(\d+\.)?(\*|\d+)$ for validation, with code examples for extraction. Alternative approaches using non-capturing groups and string splitting are discussed, highlighting the balance between regex simplicity and extraction accuracy in software versioning.

In software development, version numbers are commonly used to identify release stages. This paper addresses a specific problem: parsing version numbers in the format version.release.modification, where each component is a run of digits, the final component present may instead be the wildcard *, and trailing components (together with their preceding dots) may be omitted. For example, 1.23.456 parses to version 1, release 23, modification 456, while 1.* indicates version 1, any release, and any modification. Inputs such as *.12 or 12.*.34 are invalid, because a wildcard may appear only as the last component.

To validate and extract version number components, we design a regular expression: ^(\d+\.)?(\d+\.)?(\*|\d+)$. It allows up to two optional leading components, each a run of digits followed by a dot, and then requires a final component that is either digits or the wildcard. For instance, for input 1.23.456, the regex matches the entire string, with capture groups holding 1., 23., and 456. For 1.*, the second group does not participate in the match and is left undefined. Because the first two capture groups include the trailing dot, the captured values need additional processing before they can be used as component values.
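
The behaviour of the capture groups can be checked with a short Python sketch. The pattern is the one from the text; the names VERSION_RE and raw_groups are hypothetical and serve only for illustration.

```python
import re

# Pattern from the text: two optional "digits + dot" groups, then digits or '*'.
VERSION_RE = re.compile(r'^(\d+\.)?(\d+\.)?(\*|\d+)$')

def raw_groups(text):
    """Return the three raw capture groups, or None if the text is invalid."""
    match = VERSION_RE.match(text)
    return match.groups() if match else None

print(raw_groups("1.23.456"))  # groups still carry their trailing dots
print(raw_groups("1.*"))       # the unmatched optional group is None
print(raw_groups("*.12"))      # invalid: wildcard is not the last component
```

Note that in Python `$` also matches just before a trailing newline; if that matters, `VERSION_RE.fullmatch(text)` is the stricter check.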

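To improve the extraction process, we can wrap each optional component in a non-capturing group and capture only the digits inside it: ^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$. Simply making the outer groups non-capturing, as in ^(?:\d+\.)?(?:\d+\.)?(\*|\d+)$, is not enough, because the first two components are then not captured at all. The nested form keeps the dots out of the captures but is harder to read. Another approach is to keep the simpler regex for validation and clean up the captured groups in code afterwards. For example, in Perl, we can implement it as follows: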

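my ($major, $minor, $mod);
if ($input =~ /^(\d+\.)?(\d+\.)?(\*|\d+)$/) {
    # Copy the captures; optional groups that did not match are undef
    my @groups = ($1, $2, $3);
    my @version;
    foreach (@groups) {
        next if !defined;    # skip groups that did not participate
        s/\.//;              # strip the trailing dot
        push @version, $_;
    }
    # Pad missing trailing components with wildcards
    ($major, $minor, $mod) = (@version, "*", "*");
}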

This code first validates the input, then processes the capture groups: removes dots, stores results in an array, and finally pads missing parts with wildcards. Although this increases code volume, it offers more flexible extraction logic.
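
For comparison, the dot-free capture idea can be sketched in Python. This is a minimal sketch under stated assumptions: the pattern nests a capturing group for the digits inside each non-capturing group, missing components are padded with '*' as in the Perl code above, and the names EXTRACT_RE and parse_version are hypothetical.

```python
import re

# Assumed variant: (?:...) wraps "digits + dot", the inner (...) captures digits only.
EXTRACT_RE = re.compile(r'^(?:(\d+)\.)?(?:(\d+)\.)?(\*|\d+)$')

def parse_version(text):
    """Return (version, release, modification) as strings, or None if invalid.

    Missing trailing components are reported as the wildcard '*'.
    """
    match = EXTRACT_RE.match(text)
    if match is None:
        return None
    # Drop groups that did not participate, then pad on the right with wildcards.
    present = [g for g in match.groups() if g is not None]
    return tuple((present + ['*', '*'])[:3])

print(parse_version("1.23.456"))
print(parse_version("1.*"))
print(parse_version("12.*.34"))
```

With this pattern no dot stripping is needed, at the cost of a denser expression.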

Alternatively, the input can be split on dots directly, checking that every part is digits and that only the last part may be the wildcard. This is straightforward and avoids regex complexity, but the positional rule for the wildcard must then be enforced by hand. For example, in Python:

parts = input_str.split('.')
# Digits are allowed anywhere; '*' is allowed only as the final component
if (len(parts) <= 3 and all(p.isdigit() for p in parts[:-1])
        and (parts[-1].isdigit() or parts[-1] == '*')):
    major, minor, mod = (parts + ['*', '*'])[:3]

This rejects inputs such as *.12 and 12.*.34, and pads missing trailing components with wildcards.
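
To see whether the split-based check and the regex accept the same inputs, the two can be compared on a handful of samples. The following is an illustrative sketch only; the function names and the sample list are assumptions, not part of the original discussion.

```python
import re

VERSION_RE = re.compile(r'^(\d+\.)?(\d+\.)?(\*|\d+)$')

def valid_by_regex(text):
    """Validate with the regex from the text."""
    return VERSION_RE.match(text) is not None

def valid_by_split(text):
    """Validate by splitting on dots; '*' is allowed only as the last part."""
    parts = text.split('.')
    if len(parts) > 3:
        return False
    *leading, last = parts
    return all(p.isdigit() for p in leading) and (last.isdigit() or last == '*')

SAMPLES = ['1.23.456', '1.*', '1.2.*', '*', '7',
           '*.12', '12.*.34', '1..2', '', '1.2.3.4']

def disagreements(samples):
    """Return the samples on which the two validators differ."""
    return [s for s in samples if valid_by_regex(s) != valid_by_split(s)]

print(disagreements(SAMPLES))
```

Agreement on these samples does not imply agreement on all input: for instance, `str.isdigit` accepts some non-ASCII digit characters that `\d` does not match, and `$` tolerates a trailing newline that the split version rejects.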

In summary, the regular expression ^(\d+\.)?(\d+\.)?(\*|\d+)$ provides a concise validation solution, but its capture groups include the dots and so need extra processing for extraction. Capturing only the digits inside non-capturing groups, or post-processing the captures in code, addresses this at the cost of a denser pattern or a few more lines. String splitting is a reasonable alternative as long as the wildcard position is checked explicitly. In practice, the choice between the two depends on the scenario and on which version is easier to read and maintain.

Copyright Notice: All rights in this article are reserved by the operators of DevGex. Reasonable sharing and citation are welcome; any reproduction, excerpting, or re-publication without prior permission is prohibited.
