Perl - Regular Expression


http://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf
https://ocppreparation.wordpress.com/regex/regular-expression/
http://www.sitepoint.com/demystifying-regex-with-practical-examples/

Look-ahead:

(?=pattern)

Negative look-ahead:

(?!pattern)

Look-behind:

(?<=pattern)

Negative look-behind:

(?<!pattern)

Non-capture grouping:

(?:pattern)
(?imsx-imsx:pattern)
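
A minimal sketch of the look-around and non-capturing constructs listed above. The sample strings and variable names are illustrative assumptions, not part of the original notes.

```perl
use strict;
use warnings;

my $price = "cost: 100USD, 200EUR";

# Look-ahead: capture digits only when "USD" follows; "USD" is not consumed.
my ($usd) = $price =~ /(\d+)(?=USD)/;

# Negative look-ahead: digits followed by neither another digit nor "USD".
my ($not_usd) = $price =~ /(\d+)(?!\d|USD)/;

# Look-behind: digits that come right after ", ".
my ($after_comma) = $price =~ /(?<=, )(\d+)/;

# Negative look-behind: offset of the first "bar" not preceded by "foo".
my $offset = "foobar barbar" =~ /(?<!foo)bar/ ? $-[0] : -1;

# Non-capturing group: (?:...) groups the alternation without filling $1.
my @captures = "xy-cd" =~ /(?:ab|xy)-(\w+)/;

print "$usd $not_usd $after_comma $offset @captures\n";
```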

How can we insert a forward slash every 2 characters?

$str =~ s/([^\/]{2})/$1\//g;
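
As a quick check of what this substitution produces, here is a small sketch. The second variant, which avoids the trailing slash, is an assumption about what a caller might want and is not part of the original answer.

```perl
use strict;
use warnings;

my $str = "abcdef";

# The substitution from the notes: a slash after every pair of non-slash characters.
(my $with_trailing = $str) =~ s/([^\/]{2})/$1\//g;

# Variant: the look-ahead (?=.) requires one more character after the pair,
# so no slash is appended at the very end of the string.
(my $without_trailing = $str) =~ s/([^\/]{2})(?=.)/$1\//g;

print "$with_trailing\n$without_trailing\n";
```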

What are some best practices when working with regular expressions?

  • Put a complicated regular expression into a named subroutine. Give the subroutine a meaningful name, and document the purpose of both the subroutine and the regular expression.
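
One way this advice might look in practice is sketched below. The subroutine name looks_like_iso_date and the date format it checks are hypothetical choices made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $text has the shape YYYY-MM-DD, else 0.
# It checks the shape only, not whether the day exists in that month.
sub looks_like_iso_date {
    my ($text) = @_;
    return $text =~ /
        \A
        \d{4}                            # four-digit year
        - (?: 0[1-9] | 1[0-2] )          # month 01-12
        - (?: 0[1-9] | [12]\d | 3[01] )  # day 01-31
        \z
    /x ? 1 : 0;
}

print looks_like_iso_date("2006-10-21"), looks_like_iso_date("2006-13-01"), "\n";
```

Callers then read as `if (looks_like_iso_date($input))`, and the pattern is documented in exactly one place.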

Arbitrary bytes are represented by octal escape sequences, e.g., \033 , or hexadecimal escape sequences, e.g., \x1B

Regexes are treated mostly as double quoted strings, so variable substitution works:

    $foo = 'house';
    'cathouse' =~ /cat$foo/;   # matches
    'housecat' =~ /${foo}cat/; # matches

a{n,m} = match at least n times, but not more than m times.
a{n,} = match at least n times
a{n} = match exactly n times
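
The three forms can be compared side by side with a short sketch. The word list is made up for the example, and the patterns are anchored so that the counts apply to the whole string.

```perl
use strict;
use warnings;

my @words = qw(a aa aaa aaaa);

my @two_to_three = grep { /^a{2,3}$/ } @words;   # aa aaa
my @two_or_more  = grep { /^a{2,}$/  } @words;   # aa aaa aaaa
my @exactly_two  = grep { /^a{2}$/   } @words;   # aa

print "@two_to_three | @two_or_more | @exactly_two\n";
```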

Modifiers:
i case insensitive
o compile this regular expression once

The global modifier //g allows the matching operator to match within a string as many times as possible. In scalar context, successive matches against a string will have //g jump from match to match, keeping track of position in the string as it goes along. You can get or set the position with the pos() function. For example,

    $x = "cat dog house"; # 3 words
    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

prints

    Word is cat, ends at position 3
    Word is dog, ends at position 7
    Word is house, ends at position 13

A failed match or changing the target string resets the position. If you don't want the position reset after a failed match, add the //c modifier, as in /regex/gc .
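
The effect of //c on pos() can be observed with a small sketch. The string and the variable names are assumptions for illustration. Each match is assigned to a scalar to force scalar context.

```perl
use strict;
use warnings;

my $x = "cat dog";
my $ok;

$ok = $x =~ /\w+/g;            # matches "cat"
my $after_first = pos $x;      # 3

$ok = $x =~ /\d+/g;            # fails: the position is reset
my $after_fail = pos $x;       # undef

$ok = $x =~ /\w+/g;            # starts over and matches "cat" again
$ok = $x =~ /\d+/gc;           # fails, but //c keeps the position
my $after_fail_c = pos $x;     # still 3

$ok = $x =~ /\w+/g;            # continues with "dog"
my $after_dog = pos $x;        # 7

pos($x) = 4;                   # the position can also be set by hand
my ($word) = $x =~ /\G(\w+)/;  # \G anchors at offset 4: "dog"
```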

In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex: @words = ($x =~ /(\w+)/g);

With the s/// operator, the matched variables $1 , $2 , etc. are immediately available for use in the replacement expression.

With the global modifier, s///g will search and replace all occurrences of the regex in the string.

The evaluation modifier s///e wraps an eval{…} around the replacement string and the evaluated result is substituted for the matched substring. Some examples:

$x =~ s/(\w+)/reverse $1/ge;
$x =~ s!(\d+)%!$1/100!e;

The last example shows that s/// can use other delimiters, such as s!!! and s{}{}, and even s{}//. If single quotes are used, as in s''', then the regex and replacement are treated as single-quoted strings and no variables are interpolated.

In scalar context, $time =~ /(\d\d):(\d\d):(\d\d)/ returns a true or false value. In list context, however, it returns the list of matched values ($1,$2,$3) . So we could write the code more compactly as:

($hours, $minutes, $second) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
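
A failed match returns the empty list in list context, which makes the idiom easy to wrap. The sketch below uses a hypothetical helper named parse_time and adds anchors to the pattern, which the original line does not have.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (hours, minutes, seconds), or the empty
# list if $time does not have the form HH:MM:SS.
sub parse_time {
    my ($time) = @_;
    return $time =~ /^(\d\d):(\d\d):(\d\d)$/;
}

my ($hours, $minutes, $second) = parse_time("12:34:56");
my @nothing = parse_time("noon");

print "$hours h, $minutes min, $second s; failed match gave ",
      scalar(@nothing), " values\n";
```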

For convenience, Perl sets $+ to the string held by the highest numbered $1 , $2 ,… that got assigned.

Back references:
Closely associated with the matching variables $1 , $2 , … are the backreferences \1 , \2 ,… Backreferences are simply matching variables that can be used inside a regexp. This is a really nice feature — what matches later in a regexp is made to depend on what matched earlier in the regexp. Suppose we wanted to look for doubled words in a text, like 'the the'. The following regexp finds all 3-letter doubles with a space in between:

/\b(\w\w\w)\s\1\b/;

The grouping assigns a value to \1, so that the same 3 letter sequence is used for both parts.
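
The same idea extends to words of any length by replacing \w\w\w with \w+ and allowing any run of whitespace. The sketch below is an illustration with a made-up sentence, not the original example.

```perl
use strict;
use warnings;

my $text = "Paris in the the spring, and and so on";

# \1 must repeat exactly what group 1 captured; //g in list context
# returns every captured word.
my @doubles = $text =~ /\b(\w+)\s+\1\b/g;

print "Doubled words: @doubles\n";
```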

Counting the opening parentheses to get the correct number for a backreference is error-prone as soon as there is more than one capturing group. A more convenient technique became available with Perl 5.10: relative backreferences. To refer to the immediately preceding capture group you may now write \g{-1} , the one before that is available via \g{-2} , and so on.
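
Relative backreferences matter most when a pattern is embedded in a larger one, because absolute group numbers shift. A minimal sketch, with assumed variable names:

```perl
use strict;
use warnings;

my $double_rel = qr/(\w+)\s+\g{-1}/;   # refers to the group just before it
my $double_abs = qr/(\w+)\s+\1/;       # refers to group 1 of the final pattern

# Embedded after another capturing group, (\w+) becomes group 2.
my $rel_ok = "42-hey hey" =~ /(\d+)-$double_rel/ ? 1 : 0;   # still works
my $abs_ok = "42-hey hey" =~ /(\d+)-$double_abs/ ? 1 : 0;   # \1 is now "42"

print "relative: $rel_ok, absolute: $abs_ok\n";
```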

Perl 5.10 also introduced named capture buffers and named backreferences. To attach a name to a capturing group, you write either (?<name>…) or (?'name'…). The backreference may then be written as \g{name} . It is permissible to attach the same name to more than one group, but then only the leftmost one of the eponymous set can be referenced. Outside of the pattern a named capture buffer is accessible through the %+ hash.

    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
        if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
            print "day=$+{d} month=$+{m} year=$+{y}\n";
        }
    }

If any of the alternatives matches, the hash %+ is bound to contain the three key-value pairs.

Position information:

In addition to what was matched, Perl (since 5.6.0) also provides the positions of what was matched as contents of the @- and @+ arrays. $-[0] is the position of the start of the entire match and $+[0] is the position of the end. Similarly, $-[n] is the position of the start of the $n match and $+[n] is the position of the end. If $n is undefined, so are $-[n] and $+[n]

Even if there are no groupings in a regexp, it is still possible to find out what exactly matched in a string. If you use them, Perl will set $` to the part of the string before the match, will set $& to the part of the string that matched, and will set $' to the part of the string after the match.

It is important to note that, in Perl versions before 5.20, using $` and $' slows down regexp matching quite a bit, while $& slows it down to a lesser extent, because if they are used in one regexp in a program, they are generated for all regexps in the program. So if raw performance is a goal of your application, they should be avoided. If you need to extract the corresponding substrings, use @- and @+ instead.
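
Extracting the three substrings from @- and @+ can be sketched with substr as follows. The sample string and the variable names are assumptions.

```perl
use strict;
use warnings;

my $s = "abc 123 def";
my ($pre, $match, $post) = ('', '', '');

if ($s =~ /\d+/) {
    $pre   = substr($s, 0, $-[0]);               # text before the match
    $match = substr($s, $-[0], $+[0] - $-[0]);   # the matched text
    $post  = substr($s, $+[0]);                  # text after the match
}

print "[$pre][$match][$post]\n";
```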

Possessive quantifiers:

With the introduction of the possessive quantifiers in Perl 5.10, we have a way of instructing the regex engine not to backtrack, with the usual quantifiers with a + appended to them. This makes them greedy as well as stingy; once they succeed they won't give anything back to permit another solution. They have the following meanings:

a{n,m}+ means: match at least n times, not more than m times, as many times as possible, and don't give anything up. a?+ is short for a{0,1}+

a{n,}+ means: match at least n times, but as many times as possible, and don't give anything up. a*+ is short for a{0,}+ and a++ is short for a{1,}+

a{n}+ means: match exactly n times. It is just there for notational consistency.

These possessive quantifiers represent a special case of a more general concept, the independent subexpression.
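
The difference between a greedy and a possessive quantifier shows up when the rest of the pattern needs a character back. A small sketch with an assumed test string:

```perl
use strict;
use warnings;

my $s = "aaa";

my $greedy     = $s =~ /^a+a$/     ? 1 : 0;   # a+ gives back one "a": match
my $possessive = $s =~ /^a++a$/    ? 1 : 0;   # a++ keeps all three: no match
my $atomic     = $s =~ /^(?>a+)a$/ ? 1 : 0;   # independent subexpression: same

print "greedy=$greedy possessive=$possessive atomic=$atomic\n";
```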

Commenting your regular expression:

   /^
            [+-]?         # first, match an optional sign
            (             # then match integers or f.p. mantissas:
                    \d+\.\d+  # mantissa of the form a.b
                  |\d+\.     # mantissa of the form a.
                  |\.\d+     # mantissa of the form .b
                  |\d+       # integer of the form a
            )
            ([eE][+-]?\d+)?  # finally, optionally match an exponent
      $/x;

If whitespace is mostly irrelevant, how does one include space characters in an extended regexp? The answer is to backslash it '\ ' or put it in a character class [ ] . The same thing goes for pound signs, use \# or [#].
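
A short sketch of both escapes under //x follows. The patterns and strings are invented for the example.

```perl
use strict;
use warnings;

# "\ " and "[ ]" each match one literal space, "\#" a literal pound sign.
my $strict = qr/ ^ hello \  world [ ] \# \d+ $ /x;

# Without the escape, the space between the words is ignored.
my $sloppy = qr/ ^ hello world $ /x;

my $strict_ok   = "hello world #42" =~ $strict ? 1 : 0;
my $sloppy_ok   = "helloworld"      =~ $sloppy ? 1 : 0;
my $sloppy_fail = "hello world"     =~ $sloppy ? 1 : 0;

print "$strict_ok $sloppy_ok $sloppy_fail\n";
```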

We pointed out earlier that variables in regexps are substituted before the regexp is evaluated.

Global matching:

The modifier g stands for global matching and allows the matching operator to match within a string as many times as possible. In scalar context, successive invocations against a string will have //g jump from match to match, keeping track of position in the string as it goes along. You can get or set the position with the pos() function.

To extract all words, form the simple regexp (\w+) and loop over all matches with /(\w+)/g :

    while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

A failed match or changing the target string resets the position. If you don't want the position reset after a failed match, add the //c modifier, as in /regexp/gc . The current position in the string is associated with the string, not the regexp. This means that different strings have different positions and their respective positions can be set or read independently.

In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regexp. So if we wanted just the words, we could use:

    @words = ($x =~ /(\w+)/g);

Closely associated with the //g modifier is the \G anchor. The \G anchor matches at the point where the previous //g match left off. \G allows us to easily do context-sensitive matching:

$x =~ /^([+-]?\d+)\s*/g;
$x =~ /\G\s+(widget|sprocket)/;

The combination of //g and \G allows us to process the string a bit at a time and use arbitrary Perl logic to decide what to do next. Currently, the \G anchor is only fully supported when used to anchor to the start of the pattern.
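
A common use of this combination is a small tokenizer. The sketch below is one possible arrangement. The token names and the input string are assumptions, and //gc keeps the position when an alternative fails so the next one can be tried at the same offset.

```perl
use strict;
use warnings;

my $src = "x = 42 + y1";
my @tokens;

pos($src) = 0;
while (pos($src) < length $src) {
    if    ($src =~ /\G\s+/gc)             { next }                      # skip blanks
    elsif ($src =~ /\G(\d+)/gc)           { push @tokens, "NUM:$1" }
    elsif ($src =~ /\G([A-Za-z_]\w*)/gc)  { push @tokens, "ID:$1" }
    elsif ($src =~ /\G([=+])/gc)          { push @tokens, "OP:$1" }
    else  { die "Unexpected character at offset " . pos($src) . "\n" }
}

print join(" ", @tokens), "\n";
```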

\G is also invaluable in processing fixed length records with regexps (for example, matching DNA sequences):

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

If there is a match, s/// returns the number of substitutions made, otherwise it returns false.

With the s/// operator, the matched variables $1 , $2 , etc. are immediately available for use in the replacement expression. With the global modifier, s///g will search and replace all occurrences of the regexp in the string.

A modifier available specifically to search and replace is the s///e evaluation modifier. s///e wraps an eval{…} around the replacement string and the evaluated result is substituted for the matched substring. s///e is useful if you need to do a bit of computation in the process of replacing text.

As with the match m// operator, s/// can use other delimiters, such as s!!! and s{}{}, and even s{}//. If single quotes are used, as in s''', then the regexp and replacement are treated as single-quoted strings and there are no substitutions. s/// in list context returns the same thing as in scalar context, i.e., the number of matches.
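
The return value, the evaluation modifier, and an alternative delimiter can be tried together in a short sketch. The strings are made up for the example.

```perl
use strict;
use warnings;

my $text = "one two two three";
my $changed = ($text =~ s/two/2/g);    # number of substitutions made
my $missed  = ($text =~ s/nine/9/);    # false: nothing was replaced

my $pct = "50%";
$pct =~ s!(\d+)%!$1/100!e;             # replacement is evaluated as Perl code

my $sum = "a-b";
$sum =~ s{-}{+};                       # bracketing delimiters

print "$text ($changed changes), $pct, $sum\n";
```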

Power tools:

\l and \u convert the next character to lower or upper case, respectively:

    $x = "perl";
    $string =~ /\u$x/;  # matches 'Perl' in $string
    $x = "M(rs?|s)\\."; # note the double backslash
    $string =~ /\l$x/;  # matches 'mr.', 'mrs.', and 'ms.'

A \L or \U indicates a lasting conversion of case, until terminated by \E or thrown over by another \U or \L :

    $x = "This word is in lower case:\L SHOUT\E";
    $x =~ /shout/;       # matches
    $x = "I STILL KEYPUNCH CARDS FOR MY 360";
    $x =~ /\Ukeypunch/;  # matches punch card string

If there is no \E , case is converted until the end of the string. The regexps \L\u$word or \u\L$word convert the first character of $word to uppercase and the rest of the characters to lowercase.
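
The \u\L combination works in double-quoted strings as well as in regexps. A minimal sketch, with an assumed sample word:

```perl
use strict;
use warnings;

my $word = "pERL";

my $title = "\u\L$word";          # first character upper case, rest lower case
my $same  = ucfirst lc $word;     # the equivalent built-in functions

my $found = "I like Perl" =~ /\u\L$word/ ? 1 : 0;

print "$title $same $found\n";
```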

Control characters can be escaped with \c , so that a control-Z character would be matched with \cZ . The escape sequence \Q …\E quotes, or protects most non-alphabetic characters. For instance,

    $x = "\QThat !^*&%~& cat!";
    $x =~ /\Q!^*&%~&\E/;  # check for rough language

It does not protect $ or @ , so that variables can still be substituted.
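
Because variables are still interpolated, \Q is a convenient way to match the literal content of a variable. The sketch below uses invented strings, and quotemeta is the function form of \Q.

```perl
use strict;
use warnings;

my $literal = "a.b*c";

my $quoted_hit  = "xa.b*cx"  =~ /\Q$literal\E/ ? 1 : 0;   # literal text found
my $quoted_miss = "xaXbbbcx" =~ /\Q$literal\E/ ? 1 : 0;   # not the literal text
my $unquoted    = "xaXbbbcx" =~ /$literal/     ? 1 : 0;   # "." and "*" act as metacharacters

my $escaped = quotemeta $literal;

print "$quoted_hit $quoted_miss $unquoted $escaped\n";
```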

Unicode characters greater than chr(255) are represented using the \x{hex} notation, because the \0 octal and \x hex (without curly braces) don't go further than 255.

    /\x{263a}/;  # match a Unicode smiley face :)

Another way to specify Unicode characters is to use the named character escape sequence \N{name} . name is a name for the Unicode character, as specified in the Unicode standard. For instance, if we wanted to represent or match the astrological sign for the planet Mercury, we could use:

    use charnames ":full"; # use named chars with Unicode full names
    $x = "abc\N{MERCURY}def";
    $x =~ /\N{MERCURY}/;   # matches

One can also use short names or restrict names to a certain alphabet:

    use charnames ':full';
    print "\N{GREEK SMALL LETTER SIGMA} is called sigma.\n";

    use charnames ":short";
    print "\N{greek:Sigma} is an upper-case sigma.\n";

    use charnames qw(greek);
    print "\N{sigma} is Greek sigma\n";

A list of full names is found in the file NamesList.txt in the lib/perl5/X.X.X/unicore directory (where X.X.X is the perl version number as it is installed on your system).

Just as with Unicode characters, there are named Unicode character classes represented by the \p{name} escape sequence. Closely associated is the \P{name} character class, which is the negation of the \p{name} class. For example, to match lower and uppercase characters,

    use charnames ":full"; # use named chars with Unicode full names
    $x = "BOB";
    $x =~ /^\p{IsUpper}/;   # matches, uppercase char class
    $x =~ /^\P{IsUpper}/;   # doesn't match, char class sans uppercase
    $x =~ /^\p{IsLower}/;   # doesn't match, lowercase char class
    $x =~ /^\P{IsLower}/;   # matches, char class sans lowercase

The Perl-style class names include: IsAlpha, IsAlnum, IsASCII, IsCntrl, IsBlank, IsDigit, IsGraph, IsLower, IsPrint, IsPunct, IsSpace, IsSpacePerl, IsUpper, IsWord, IsXDigit

You can also use the official Unicode class names with the \p and \P , like \p{L} for Unicode 'letters', or \p{Lu} for uppercase letters, or \P{Nd} for non-digits. If a name is just one letter, the braces can be dropped. For instance, \pM is the character class of Unicode 'marks', for example accent marks. For the full list see perlunicode.

Unicode has also been separated into various sets of characters which you can test with \p{…} (in) and \P{…} (not in). To test whether a character is (or is not) an element of a script you would use the script name, for example \p{Latin} , \p{Greek} , or \P{Katakana} .
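
A few of these properties in action, as a sketch. The strings are assumptions, and the Greek letters are written as \x{...} escapes so the source file stays plain ASCII.

```perl
use strict;
use warnings;

my $s = "abc \x{3b1}\x{3b2} 42";    # Latin letters, Greek alpha and beta, digits

my ($greek)    = $s =~ /(\p{Greek}+)/;            # run of Greek-script characters
my @upper      = "Hello World" =~ /(\p{Lu})/g;    # uppercase letters
my ($letters)  = "x1y" =~ /^(\pL+)/;              # one-letter name, braces dropped
my @non_digits = "a1b2" =~ /(\P{Nd})/g;           # everything but decimal digits

print scalar(@upper), " uppercase letters, leading letters: $letters\n";
```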

Perl also defines POSIX style character classes. These have the form [:name:], with name the name of the POSIX class. The POSIX classes are alpha , alnum , ascii , cntrl , digit , graph , lower , print, punct , space , upper , and xdigit , and two extensions, word (a Perl extension to match \w ), and blank (a GNU extension). If utf8 is being used, then these classes are defined the same as their corresponding Perl Unicode classes: [:upper:] is the same as \p{IsUpper} , etc. The POSIX character classes, however, don't require using utf8 . The [:digit:], [:word:], and [:space:] correspond to the familiar \d , \w , and \s character classes. To negate a POSIX class, put a ^ in front of the name, so that, e.g., [:^digit:] corresponds to \D and under utf8 , \P{IsDigit} . The Unicode and POSIX character classes can be used just like \d , with the exception that POSIX character classes can only be used inside of a character class:

    /\s+[abc[:digit:]xyz]\s*/;  # match a,b,c,x,y,z, or a digit
    /^=item\s[[:digit:]]/;      # match '=item',
                                # followed by a space and a digit
    use charnames ":full";
    /\s+[abc\p{IsDigit}xyz]\s+/;  # match a,b,c,x,y,z, or a digit
    /^=item\s\p{IsDigit}/;        # match '=item',
                                  # followed by a space and a digit

Compiled regular expressions:

The regexp quote qr// does exactly that: qr/string/ compiles the string as a regexp and transforms the result into a form that can be assigned to a variable:

    $str = "foo+";
    $reg = qr/$str/;  # reg contains a compiled regexp
    $x =~ $reg;
    $x =~ /$reg/;   # same thing
    $x =~ /(abc)?$reg/; # $reg  can also be interpolated into a larger regexp

As with the matching operator, the regexp quote can use different delimiters, e.g., qr!!, qr{} or qr~~. Apostrophes as delimiters (qr'') inhibit any interpolation.

@compiled = map qr/$_/, @regexp;
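
A list of precompiled patterns like @compiled can then be applied in a loop without recompiling each one. The sketch below is an illustration. The helper name matches_any and the two patterns are assumptions.

```perl
use strict;
use warnings;

my @regexp   = ('^\d+$', '(?i)^yes$');
my @compiled = map qr/$_/, @regexp;      # compile each pattern once

# Hypothetical helper: true if any compiled pattern matches $string.
sub matches_any {
    my ($string) = @_;
    for my $re (@compiled) {
        return 1 if $string =~ $re;
    }
    return 0;
}

print matches_any("123"), matches_any("YES"), matches_any("no"), "\n";
```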

(?#text). This embeds a comment into the regular expression without affecting its meaning. The comment should not have any closing parentheses in the text. An example is:

    /(?# Match an integer:)[+-]?\d+/;

This style of commenting has been largely superseded by the raw, freeform commenting that is allowed with the //x modifier.

The modifiers //i , //m , //s and //x (or any combination thereof) can also be embedded in a regexp using (?i), (?m), (?s), and (?x). For instance,

    /(?i)yes/;  # match 'yes' case insensitively
    /yes/i;     # same thing
    /(?x)(          # freeform version of an integer regexp
                          [+-]?  # match an optional sign
                          \d+    # match a sequence of digits
                  )
        /x;

Embedded modifiers can have two important advantages over the usual modifiers. Embedded modifiers allow a custom set of modifiers to each regexp pattern. This is great for matching an array of regexps that must have different modifiers:

    $pattern[0] = '(?i)doctor';
    $pattern[1] = 'Johnson';
    ...
    while (<>) {
        foreach $patt (@pattern) {
            print if /$patt/;
        }
    }

The second advantage is that embedded modifiers (except //p, which modifies the entire regexp) only affect the regexp inside the group the embedded modifier is contained in. So grouping can be used to localize the modifier's effects:

    /Answer: ((?i)yes)/;  # matches 'Answer: yes', 'Answer: YES', etc.

Embedded modifiers can also turn off any modifiers already present by using, e.g., (?-i). Modifiers can also be combined into a single expression, e.g., (?s-i) turns on single line mode and turns off case insensitivity.

Embedded modifiers may also be added to a non-capturing grouping. (?i-m:regexp) is a non-capturing grouping that matches regexp case insensitively and turns off multi-line mode.
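
Both directions, switching a modifier on for one group and off for the rest of a pattern, can be seen in a small sketch. The patterns and strings are made up.

```perl
use strict;
use warnings;

# (?i:...) is non-capturing and case-insensitive for its contents only.
my $answer = qr/Answer: (?i:yes|no)/;

# //i applies to "a"; (?-i) switches it off for what follows.
my $mixed = qr/a(?-i)b/i;

my @results = (
    "Answer: YES" =~ $answer ? 1 : 0,   # 1
    "ANSWER: yes" =~ $answer ? 1 : 0,   # 0: "Answer" is still case-sensitive
    "Ab"          =~ $mixed  ? 1 : 0,   # 1
    "AB"          =~ $mixed  ? 1 : 0,   # 0: "b" must be lower case
);

print "@results\n";
```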

Using independent subexpressions to prevent backtracking:

Independent subexpressions are regular expressions, in the context of a larger regular expression, that function independently of the larger regular expression. That is, they consume as much or as little of the string as they wish without regard for the ability of the larger regexp to match. Independent subexpressions are represented by (?>regexp)

    $x = "ab";
    $x =~ /a*ab/;  # matches
    $x =~ /(?>a*)ab/;  # doesn't match!

The independent subexpression (?>a*) doesn't care about the rest of the regexp, so it sees an a and grabs it. Then the rest of the regexp ab cannot match. Because (?>a*) is independent, there is no backtracking and the independent subexpression does not give up its a . Thus the match of the regexp as a whole fails.

Conditional expressions:

(?(condition)yes-regexp) and (?(condition)yes-regexp|no-regexp).

The condition can have several forms. The first form is simply an integer in parentheses (integer) . It is true if the corresponding backreference \integer matched earlier in the regexp. The same thing can be done with a name associated with a capture buffer, written as (<name>) or ('name') . The second form is a bare zero width assertion (?…), either a lookahead, a lookbehind, or a code assertion. The third set of forms provides tests that return true if the expression is executed within a recursion ((R) ) or is being called from some capturing group, referenced either by number ((R1) , (R2) ,…) or by name ((R&name) ).

The integer or name form of the condition allows us to choose, with more flexibility, what to match based on what matched earlier in the regexp.
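
A typical use of the integer form is to require a closing delimiter only if the opening one was present. A minimal sketch, with an assumed pattern for optionally parenthesized integers:

```perl
use strict;
use warnings;

# Group 1 captures an optional "("; (?(1)\)) demands ")" only if group 1 matched.
my $maybe_parens = qr/^(\()?\d+(?(1)\))$/;

my @accepted = grep { $_ =~ $maybe_parens } ("(42)", "42", "(42", "42)");

print "Accepted: @accepted\n";
```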

The lookbehind condition allows, along with backreferences, an earlier part of the match to influence a later part of the match. For instance,

    /[ATGC]+(?(?<=AA)G|C)$/;

matches a DNA sequence such that it either ends in AAG , or some other base pair combination and C . Note that the form is (?(?<=AA)G|C) and not (?((?<=AA))G|C); for the lookahead, lookbehind or code assertions, the parentheses around the conditional are not needed.

Defining named patterns:

Starting with Perl 5.10, it is possible to define named subpatterns in a section of the pattern so that they can be called up by name anywhere in the pattern. The syntax for this definition group is (?(DEFINE)(?<name>pattern)…). An insertion of a named pattern is written as (?&name).

The example below illustrates this feature using the pattern for floating point numbers that was presented earlier on. The three subpatterns that are used more than once are the optional sign, the digit sequence for an integer and the decimal fraction. The DEFINE group at the end of the pattern contains their definition. Notice that the decimal fraction pattern is the first place where we can reuse the integer pattern.

   /^ (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
            (?: [eE](?&osg)(?&int) )?
        $
        (?(DEFINE)
            (?<osg>[-+]?)         # optional sign
            (?<int>\d++)          # integer
            (?<dec>\.(?&int))     # decimal fraction
        )/x
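
The same technique applied to a different, smaller problem, as a sketch: a dotted-quad pattern that defines one octet subpattern and reuses it four times. The name octet and the test strings are assumptions.

```perl
use strict;
use warnings;

my $dotted_quad = qr/
    ^ (?&octet) (?: \. (?&octet) ){3} $
    (?(DEFINE)
        (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )   # 0 to 255
    )
/x;

my @valid = grep { $_ =~ $dotted_quad }
            qw(192.168.0.1 10.0.0.1 256.1.1.1 1.2.3 01.2.3.4);

print "Valid: @valid\n";
```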

Recursive patterns:

This feature (introduced in Perl 5.10) significantly extends the power of Perl's pattern matching. By referring to some other capture group anywhere in the pattern with the construct (?group-ref), the pattern within the referenced group is used as an independent subpattern in place of the group reference itself. Because the group reference may be contained within the group it refers to, it is now possible to apply pattern matching to tasks that hitherto required a recursive parser.

To illustrate this feature, we'll design a pattern that matches if a string contains a palindrome. (This is a word or a sentence that, while ignoring spaces, punctuation and case, reads the same backwards as forwards.) We begin by observing that the empty string or a string containing just one word character is a palindrome. Otherwise it must have a word character up front and the same at its end, with another palindrome in between.

    my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;

In (?…) both absolute and relative backreferences may be used. The entire pattern can be reinserted with (?R) or (?0). If you prefer to name your buffers, you can use (?&name) to recurse into that buffer.
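
The palindrome pattern can be exercised as follows. The candidate strings are illustrative, and the loop simply records which of them match.

```perl
use strict;
use warnings;

my $pp = qr/^(\W* (?: (\w) (?1) \g{-1} | \w? ) \W*)$/ix;

my @palindromes;
for my $s ("saippuakauppias", "A man, a plan, a canal: Panama!", "hello") {
    push @palindromes, $s if $s =~ $pp;
}

print "'$_' is a palindrome\n" for @palindromes;
```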

Executing Perl code in a regular expression:

Normally, regexps are a part of Perl expressions. Code evaluation expressions turn that around by allowing arbitrary Perl code to be a part of a regexp. A code evaluation expression is denoted (?{code}), with code a string of Perl statements.

Code expressions are zero-width assertions, and the value they return depends on their environment. There are two possibilities: either the code expression is used as a conditional in a conditional expression (?(condition)…), or it is not. If the code expression is a conditional, the code is evaluated and the result (i.e., the result of the last statement) is used to determine truth or falsehood. If the code expression is not used as a conditional, the assertion always evaluates true and the result is put into the special variable $^R . The variable $^R can then be used in code expressions later in the regexp. Here are some silly examples:

    $x = "abcdef";
    $x =~ /abc(?{print "Hi Mom!";})def/; # matches,
                                         # prints 'Hi Mom!'
    $x =~ /aaa(?{print "Hi Mom!";})def/; # doesn't match,
                                         # no 'Hi Mom!'

If the regexp backtracks over a code expression and if the variables used within are localized using local, the changes in the variables produced by the code expression are undone!

The result $^R is automatically localized, so that it will behave properly in the presence of backtracking.
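
The undoing of localized changes on backtracking can be observed with a counter, in a sketch modelled on the well-known perlretut illustration. The variables are declared with our so that local can be applied to them under strict.

```perl
use strict;
use warnings;

our $count = 0;        # receives the final count
our $c     = "bob";    # should be untouched after the match
my  $x     = "aaaa";

$x =~ /(?{ local $c = 0; })           # start counting at zero
       ( a                            # match an 'a'
         (?{ local $c = $c + 1; })    # count it
       )*                             # as often as possible ...
       aa                             # ... but leave 'aa' for the end
       (?{ $count = $c; })            # keep the final value
      /x;

print "'a' count is $count, \$c variable is '$c'\n";
```

The group first matches all four characters, then backtracks twice so that aa can match. The two localized increments are undone along the way, which is why the count ends at 2 rather than 4.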

Note that the syntax here is (?(?{…})yes-regexp|no-regexp), not (?((?{…}))yes-regexp|no-regexp). In other words, in the case of a code expression, we don't need the extra parentheses around the conditional.

[0-9]    // any single digit
[[a]    // a or [
[]a]    // a or ]
[0-9\-]    // digit or hyphen
[^\^]    // anything except ^
\d    // [0-9]
\w    // [A-Za-z0-9_]
\s    // [ \r\t\n\f]

^    // Beginning of the line
$    // End of line
\A    // Beginning of string
\Z    // End of string, or just before a trailing newline (\z is the absolute end)
\b    // Word boundary

{m}    // match previous item exactly m times
{m,}    // match previous item at least m times
{0,m}    // match previous item at most m times ({,m} is a quantifier only in Perl 5.34 and later)
*    // same as {0,}
+    // same as {1,}
?    // same as {0,1}

m/...regex.../modifiers
$x =~ m/.../modifiers;
$x !~ m/.../modifiers;

$`: pre-matched. store the string before the match
$&: matched.  store the matched string
$': post-matched.  store the string after the match

tr/.../.../modifiers

s/regex/string/modifiers    // in scalar context, returns the number of substitutions made, or false if there were none

split /pattern/,string,limit;    // In list context, split returns the list of substrings found.
                    // In scalar context, split returns the number of substrings found.
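
A few split calls, as a sketch of the limit argument and of the two contexts. The input strings are made up. The scalar-context count assumes a reasonably modern Perl.

```perl
use strict;
use warnings;

my @limited = split /,/, "a,b,c,d", 2;         # at most 2 fields: "a" and "b,c,d"
my @words   = split ' ', "  hello   world ";   # ' ' splits on runs of whitespace
my @trimmed = split /,/, "a,b,,,";             # trailing empty fields are dropped
my @kept    = split /,/, "a,b,,,", -1;         # negative limit keeps them
my $count   = split /,/, "a,b,c";              # scalar context: number of fields

print scalar(@trimmed), " vs ", scalar(@kept), " fields; count is $count\n";
```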

$name =~ s/\W.*//;        // \W stands for a non-word character (anything besides a letter, digit, or underscore)

$name =~ tr/A-Z/a-z/;

\n    newline
\r    carriage return
\t    tab
\f    formfeed
\b    backspace
\U    uppercase until next \E
\\    backslash
\"    double quote
\l    lower-case letter L.  Lowercase the next character
\a    bell
\e    escape
\007    bell
\x7f    delete
\cz    control character
\Q    backslash all nonalphanumeric
\L    Lowercase until \E
\u    Uppercase next character
\E    terminate \L, \U, and \Q

What are the modifiers?

  1. g: global
  2. i: ignore case
  3. m: treat the string as multi-line
  4. o: omit recompilation for interpolated variables
  5. s: single line
  6. x: extended, use white space freely in the regex
  7. e: expression (evaluate the replacement as Perl code; used with the substitution operator s///)

What are the four categories of operators?

  1. Sequencing operator: nothing
  2. Alternating operator: |
  3. Repeating operator: {m,n}
  4. Grouping operator: (…)

How can we make greedy repeating operator lazy?

Repeating operators are greedy by default, but they can be made lazy by adding a question mark (?) after the repeating operator. A lazy repeating operator tries to match as few characters as it can.
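
The difference is easiest to see on a string with several possible end points. A minimal sketch with an invented HTML fragment:

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

my ($greedy) = $html =~ /<(.+)>/;    # runs to the last ">"
my ($lazy)   = $html =~ /<(.+?)>/;   # stops at the first ">"

print "greedy: $greedy\nlazy: $lazy\n";
```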

How can we cancel the effect of grouping?

The grouping operator (…), besides grouping, creates buffers, and whatever is matched inside the parentheses is stored in those buffers. We can cancel this capturing effect by using the non-capturing form (?:…).

What are the modifiers for the tr statement?

  1. c: complement the search list
  2. d: delete found characters that have no corresponding replacement
  3. s: squash runs of the same replaced character into one
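
The tr modifiers can be tried out with a short sketch. The strings are assumptions. Each result is built on a copy so the original stays unchanged.

```perl
use strict;
use warnings;

my $name = "Hello World";

(my $lower = $name) =~ tr/A-Z/a-z/;                    # plain translation
my $vowels = ($name =~ tr/aeiouAEIOU//);               # empty replacement: just count
(my $no_vowels = $name) =~ tr/aeiou//d;                # d: delete the listed characters
(my $letters = "hello, world 42") =~ tr/a-zA-Z//cd;    # c + d: delete all but letters
(my $squeezed = "bookkeeper") =~ tr/a-z//s;            # s: squash repeated characters

print "$lower | $vowels | $no_vowels | $letters | $squeezed\n";
```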