Java - Regular Expression


http://www.regexplanet.com/advanced/java/index.html - Online Java regular expression evaluation tool
http://ocpsoft.org/tutorials/regular-expressions/java-visual-regex-tester/ - Online Java regular expression evaluation tool

https://zeroturnaround.com/rebellabs/java-regular-expressions-cheat-sheet/

http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html - reference material

http://www.vogella.com/articles/JavaRegularExpressions/article.html - done reading
http://www.tutorialspoint.com/java/java_regular_expressions.htm - done reading
http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/ - done reading

http://www.journaldev.com/634/java-regular-expression-tutorial-with-examples - done reading
http://docs.oracle.com/javase/tutorial/essential/regex/matcher.html
http://docs.oracle.com/javase/tutorial/essential/regex/
http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html
http://tutorials.jenkov.com/java-regex/matcher.html
http://www.javaworld.com/article/2073192/core-java/regular-expressions-simplify-pattern-matching-code.html
http://www.regular-expressions.info/java.html
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#matches%28java.lang.String%29

Look ahead and look behind

How can we use java.util.regex.Pattern and java.util.regex.Matcher?

For advanced regular expressions the java.util.regex.Pattern and java.util.regex.Matcher classes are used. You first create a Pattern object which defines the regular expression. This Pattern object allows you to create a Matcher object for a given string. This Matcher object then allows you to do regex operations on a String.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("\\w+");
String EXAMPLE_TEST = "...";
Matcher matcher = pattern.matcher(EXAMPLE_TEST);
while (matcher.find()) {
  System.out.print("Start index: " + matcher.start());
  System.out.print(" End index: " + matcher.end() + " ");
  System.out.println(matcher.group());
}

In the above code, the regular expression (the needle) is specified with Pattern.compile("\\w+"), and the string to search (the haystack) is passed to pattern.matcher(EXAMPLE_TEST), which creates the Matcher object. Each call to matcher.find() looks for the next match and returns true if one was found. matcher.start() returns the index of the first character of the match, and matcher.end() returns the index just past its last character. matcher.group() returns the text of the match; it is equivalent to matcher.group(0), because group 0 represents the entire match. The example above has no capturing groups, but if the pattern had any, matcher.group(1) would return the text captured by the first one.
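
To make the group numbering concrete, here is a small sketch with two capturing groups. The class name GroupDemo, the method describeMatches, and the sample input are made up for illustration; only the Pattern and Matcher calls come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns one line per match: "start-end whole=<group 0> key=<group 1> value=<group 2>".
    static List<String> describeMatches(String input) {
        Pattern pattern = Pattern.compile("(\\w+)=(\\d+)");
        Matcher matcher = pattern.matcher(input);
        List<String> lines = new ArrayList<>();
        while (matcher.find()) {
            lines.add(matcher.start() + "-" + matcher.end()
                    + " whole=" + matcher.group()      // same as matcher.group(0)
                    + " key=" + matcher.group(1)       // first capturing group
                    + " value=" + matcher.group(2));   // second capturing group
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints:
        // 0-8 whole=width=10 key=width value=10
        // 10-19 whole=height=20 key=height value=20
        describeMatches("width=10, height=20").forEach(System.out::println);
    }
}
```

Note that the end index is exclusive: the first match covers indexes 0 through 7.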

How can we use Pattern and Matcher to manipulate a string?

Pattern pattern = Pattern.compile("\\s+");
Matcher matcher = pattern.matcher(EXAMPLE_TEST);
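String result = matcher.replaceAll("\t"); // returns a new string; EXAMPLE_TEST itself is unchanged

In the above example, we replace each run of whitespace with a single tab. The result must be assigned, because strings in Java are immutable.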

String pattern = "(\\w)(\\s+)([\\.,])";
String result = EXAMPLE_TEST.replaceAll(pattern, "$1$3");

In the above code, we remove the whitespace between a word character and a following . or ,.

// Replace each title element with the text between its opening and closing tags
pattern = "(?i)(<title.*?>)(.+?)(</title>)";
String updated = EXAMPLE_TEST.replaceAll(pattern, "$2");

In the above code, each title element is replaced with the text it contains ($2), so the tags are stripped and only the title text remains.

str = str.replaceAll("[0-9]", "X"); // replaces all instances of digits with a letter X
str = str.replaceAll(" {2,}", " "); // replaces all instances of multiple spaces with a single space
str = str.replaceAll("<b>([^<]*)</b>", "$1"); // replaces each <b>...</b> element with the text inside it

Why do we have to escape the dollar sign in the replacement string?

The dollar sign has a special meaning in the replacement string. For example, $1 stands for the text captured by the first group.

To actually include a dollar in the replacement string, we need to put a backslash before the dollar symbol:

str = str.replaceAll("USD", "\\$");

The static method Matcher.quoteReplacement() will replace instances of dollar signs and backslashes in a given string with the correct form to allow them to be used as literal replacements:

str = str.replaceAll("USD", Matcher.quoteReplacement("$"));

If there is a chance that the replacement string will include a dollar sign or a backslash character, then you should wrap it in Matcher.quoteReplacement(). The replaceAll() method is suitable for cases where the replacement string is fixed or of a fixed format. For more flexibility, the Matcher.find() method can be used.
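
As a minimal sketch of that advice, the helper below inserts a caller-supplied price into a template. The {price} placeholder, the class name, and the method name fillPrice are assumptions for illustration only.

```java
import java.util.regex.Matcher;

class QuoteReplacementDemo {
    // Replaces every "{price}" placeholder with the given text, taken literally.
    static String fillPrice(String template, String price) {
        // Without quoteReplacement, a '$' followed by a digit in price would be
        // read as a group reference, and a backslash would be read as an escape.
        return template.replaceAll("\\{price\\}", Matcher.quoteReplacement(price));
    }

    public static void main(String[] args) {
        System.out.println(fillPrice("Total: {price}", "$15.00")); // prints "Total: $15.00"
    }
}
```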

How can we use regular expression to extract matched sub-strings?

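// Group 1 captures the first run of digits that follows the leading letters
Pattern pattern = Pattern.compile("^[a-zA-Z]+([0-9]+).*");
Matcher matcher = pattern.matcher("Testing123Testing");

if (matcher.find()) {
    System.out.println(matcher.group(1)); // prints 123
}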

What is the meaning of \1 through \9?

A backreference recalls the text captured by an earlier group: write a backslash (\) followed by the number of the group. Backreferences \1 through \9 are used inside the pattern itself:

System.out.println(Pattern.matches("(\\w\\d)\\1", "a2a2")); //true

In a Java string literal the backslash must itself be escaped, so \1 is written as "\\1".

At runtime, the first capturing group (\w\d) matches "a2" in the input string "a2a2", and that text is remembered. The backreference \1 then has to match the same text, "a2", again.
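
A common use of backreferences is spotting an accidentally doubled word. The sketch below assumes a hypothetical helper named firstRepeatedWord; the pattern itself uses only standard constructs (\b, \w, \s and \1).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // Returns the first word that is immediately repeated, or null if there is none.
    static String firstRepeatedWord(String text) {
        // \b(\w+) captures a whole word, \s+ skips the gap, \1 demands the same word again
        Matcher matcher = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // prints "is"
    }
}
```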

How can we programmatically determine the number of capturing groups in a particular pattern?

To find out how many groups are present in the expression, call the groupCount method on a matcher object. The groupCount method returns an int showing the number of capturing groups present in the matcher's pattern. There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount.
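
A short sketch of groupCount follows; the class and method names are illustrative. The matcher does not need to have matched anything, because the count depends only on the pattern.

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups in the pattern; group 0 is not counted.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(countGroups("(\\d{4})-(\\d{2})-(\\d{2})")); // prints 3
        System.out.println(countGroups("(?:a|b)(c)"));                 // prints 1
        System.out.println(countGroups("\\w+"));                       // prints 0
    }
}
```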

What is the problem with using the .matches static method from the Pattern class?

The Pattern class defines a convenient static matches method that lets you quickly check whether an input string matches a pattern. It takes both the regular expression and the input string. As with all public static methods, you should invoke matches by its class name, such as Pattern.matches("\\d","1");. In this example, the method returns true, because the digit "1" matches the regular expression \d.

Pattern.matches("\\d","1");

When using this method, keep in mind that the regular expression must match the whole string; otherwise it returns false. The above example returns true, but Pattern.matches("\\d","123") returns false, because \d matches only a single digit. To make it work, change the pattern so that it covers the whole input: Pattern.matches("\\d+","123"). The method also compiles the regular expression on every call, so a pattern that is used repeatedly is better compiled once with Pattern.compile and reused.
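
One way to sidestep both the whole-string surprise and repeated compilation is to compile the pattern once and choose between matches() and find() explicitly. This is only a sketch; DigitChecker and its method names are invented for the example.

```java
import java.util.regex.Pattern;

class DigitChecker {
    // Compiled once and reused; Pattern objects are immutable and thread-safe.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // True only if the whole input consists of digits.
    static boolean isAllDigits(String input) {
        return DIGITS.matcher(input).matches();
    }

    // True if the input contains at least one digit anywhere.
    static boolean hasDigits(String input) {
        return DIGITS.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("123"));    // prints true
        System.out.println(isAllDigits("abc123")); // prints false
        System.out.println(hasDigits("abc123"));   // prints true
    }
}
```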

What is the problem with using the matches method from the String class?

s.matches("regex")

This method returns true only if the entire string matches the regular expression from start to end, which is easy to forget when you only want to know whether the string contains a match. To get around this, put .* at the start and at the end of the regular expression, or use Matcher.find() instead.
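
The .* workaround has a catch worth seeing in code: by default the dot does not match line terminators, so it stops working on multi-line input unless (?s) is added. The helper names below are hypothetical.

```java
import java.util.regex.Pattern;

class StringMatchesDemo {
    // Whole-string match, exactly what String.matches does.
    static boolean wholeMatch(String input, String regex) {
        return input.matches(regex);
    }

    // Partial match: true if the regex matches anywhere in the input.
    static boolean partialMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("a cat sat", "cat"));           // false: not the whole string
        System.out.println(wholeMatch("a cat sat", ".*cat.*"));       // true
        System.out.println(wholeMatch("a cat\nsat", ".*cat.*"));      // false: dot stops at the newline
        System.out.println(wholeMatch("a cat\nsat", "(?s).*cat.*"));  // true: dot-all mode
        System.out.println(partialMatch("a cat\nsat", "cat"));        // true
    }
}
```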

When may it be handy to use the Matcher class?

Using the Matcher class may come in handy when we need to loop through all the matches and do additional processing on each match rather than processing them all at once. See the while (matcher.find()) loop in the first example on this page.

How can we use multiple flags?

pattern = Pattern.compile("[az]$", Pattern.MULTILINE | Pattern.UNIX_LINES);
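
Flags are int constants that are combined with the bitwise OR operator. As a sketch, the hypothetical helper below counts the lines that start with a given word, ignoring case; it combines Pattern.CASE_INSENSITIVE with Pattern.MULTILINE.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultipleFlagsDemo {
    // Counts the lines of text that start with word, ignoring case.
    static int countLinesStartingWith(String word, String text) {
        Pattern pattern = Pattern.compile(
                "^" + Pattern.quote(word),                      // ^ anchors at each line start
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);  // two flags OR-ed together
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String log = "Error: disk full\ninfo: retrying\nERROR: gave up";
        System.out.println(countLinesStartingWith("error", log)); // prints 2
    }
}
```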

See http://docs.oracle.com/javase/tutorial/essential/regex/pattern.html

Inline configuration flags in Java patterns look very similar to look-ahead constructs: they also start with (?, but they configure how the pattern is matched. Flags are used to configure case sensitivity, multi-line matching, and more. Several flags can be combined in one construct, such as (?si), or written as separate constructs. Like look-aheads, these constructs do not match literal text and do not capture values.

(?idmsux-idmsux)      Turns match flags on - off for entire expression
(?idmsux-idmsux:X)       X, as a non-capturing group with the given flags on – off

(?i)              Enables case-insensitive matching (default: off; turn off with (?-i))
            It applies from this point to the end of the enclosing group.
(?d)              Enables UNIX lines mode (default: off; turn off with (?-d))
            In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $
(?m)              Enables multi-line mode (default: off; turn off with (?-m))
            In this mode, ^ and $ match just after and just before each line terminator,
            as well as at the beginning and end of the input. By default, they match only
            at the beginning and end of the entire input sequence/string.
(?s)              Enables dot-all mode: the dot ‘.’ matches any character (default: off; turn off with (?-s))
            Normally, the dot character will match everything except line terminators.
(?u)              Enables Unicode-aware case folding (default: off; turn off with (?-u))
            By default, case-insensitive matching assumes that only characters
            in the US-ASCII charset are being matched.
(?x)              Allows whitespace and comments in the pattern (default: off; turn off with (?-x))
            In this mode, whitespace is ignored, and embedded comments starting with '#'
            are ignored until the end of a line.

Global toggle. In order to toggle flags for the entire expression, the statement must be at the head of the expression:

"(?idx)^I\s lost\s my\s .+     #this comment and all spaces will be ignored"

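The above expression will ignore case (i), use UNIX lines mode (d), and allow whitespace and comments in the pattern (x). It does not make the dot ‘.’ match newlines; that would require the s flag.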

Local toggle. In order to toggle flags for a single non-capturing group, the group must adhere to the following syntax:

"(?idx:Cars)[a-z]+"

The above expression will ignore case within the group, but adhere to case beyond.
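
A small sketch of the local toggle follows. To keep it focused it uses only the i flag, written (?i:cars); the class and method names are made up.

```java
class LocalFlagDemo {
    // "cars" may be in any case, but the letters after it must be lower case.
    static boolean isCarsWord(String input) {
        return input.matches("(?i:cars)[a-z]+");
    }

    public static void main(String[] args) {
        System.out.println(isCarsWord("CARSharing")); // true: flag covers "CARS", rest is lower case
        System.out.println(isCarsWord("carsHARING")); // false: [a-z]+ is still case-sensitive
    }
}
```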

public class ConfigurationDemo {
    public static void main(String[] args) {
        String input = "My dog is Blue.\n" +
                "He is not red or green.";

        Boolean controlResult = input.matches("(?=.*Green.*).*Blue.*");
        Boolean caseInsensitiveResult = input.matches("(?i)(?=.*Green.*).*Blue.*");
        Boolean dotallResult = input.matches("(?s)(?=.*Green.*).*Blue.*");
        Boolean configuredResult = input.matches("(?si)(?=.*Green.*).*Blue.*");

        System.out.println("Control result was: " + controlResult);
        System.out.println("Case ins. result was: " + caseInsensitiveResult);
        System.out.println("Dot-all result was: " + dotallResult);
        System.out.println("Configured result was: " + configuredResult);
    }
}

Dissecting the pattern:

"(?si)(?=.*Green.*).*Blue.*"
(?si)            turn on case insensitivity and dotall modes
(?=.*Green.*)         ‘Green’ must be found somewhere to the right of this look-ahead
.*Blue.*        ‘Blue’ must be found somewhere in the input

We had to enable both dot-all and case-insensitive modes for our pattern to match: dot-all so that .* can cross the newline between the two lines, and case-insensitive so that ‘Green’ matches ‘green’. The look-ahead in this example is very similar to the pattern itself, and in this case, the pattern could be substituted for another look-ahead. Because we don’t care in which order we find these two items, substituting “(?=.*Blue.*)” for “.*Blue.*” would be an acceptable change, as long as a trailing .* remains so that matches() can still consume the whole input. If we did care about the order, for example to ensure that ‘Green’ comes after ‘Blue’, we would need to place the look-ahead after ‘Blue’ in the pattern.

How can we search across multiple lines?

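Add (?s) (dot-all mode) so that the dot also matches line terminators; a construct such as .* can then span multiple lines. If you instead want ^ and $ to match at the start and end of every line, add (?m).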
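
For example, extracting a block comment that spans several lines only works when the dot can cross line terminators. In this sketch, the class name, the method firstBlockComment and its dotAll parameter are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns the first /* ... */ comment in source, or null if none is found.
    // With dotAll == false the dot cannot cross a line terminator.
    static String firstBlockComment(String source, boolean dotAll) {
        String regex = (dotAll ? "(?s)" : "") + "/\\*.*?\\*/";
        Matcher matcher = Pattern.compile(regex).matcher(source);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String source = "int a; /* first\nsecond */ int b;";
        System.out.println(firstBlockComment(source, true));  // prints the two-line comment
        System.out.println(firstBlockComment(source, false)); // prints null
    }
}
```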

How can we use alternatives?

"I lost my (wallet|car|cell phone|marbles)"

How can we use alternatives without capturing?

"I lost my (?:wallet|car|cell phone|marbles)"

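The difference between the two forms shows up in the group numbering. In the sketch below (class and method names are made up), the capturing form lets us read the lost item from group 1, while the non-capturing form matches the same text but defines no group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternativesDemo {
    private static final Pattern CAPTURING =
            Pattern.compile("I lost my (wallet|car|cell phone|marbles)");
    private static final Pattern NON_CAPTURING =
            Pattern.compile("I lost my (?:wallet|car|cell phone|marbles)");

    // Returns the item matched by the alternation, or null if nothing matched.
    static String lostItem(String sentence) {
        Matcher matcher = CAPTURING.matcher(sentence);
        return matcher.find() ? matcher.group(1) : null;
    }

    // True if the sentence mentions a lost item; no group is captured.
    static boolean mentionsLoss(String sentence) {
        return NON_CAPTURING.matcher(sentence).find();
    }

    public static void main(String[] args) {
        System.out.println(lostItem("I lost my cell phone yesterday"));  // prints "cell phone"
        System.out.println(mentionsLoss("I lost my marbles"));           // prints true
        System.out.println(NON_CAPTURING.matcher("").groupCount());      // prints 0
    }
}
```
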
How can we use an independent, non-capturing group?

(?>X)
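
In java.util.regex the independent (atomic) non-capturing group is written (?>X). Once such a group has matched, the engine will not backtrack into it. The sketch below contrasts it with an ordinary non-capturing group; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Ordinary group: a+ first takes "aaa", then gives one 'a' back so that "ab" can match.
        System.out.println(wholeMatch("(?:a+)ab", "aaab")); // prints true
        // Independent group: a+ takes "aaa" and keeps it, so the following "ab" cannot match.
        System.out.println(wholeMatch("(?>a+)ab", "aaab")); // prints false
    }
}
```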

How can we specify flags?

(?idmsux-idmsux)      Turns match flags on - off for entire expression
(?idmsux-idmsux:X)       X, as a non-capturing group with the given flags on – off

To apply flags to the entire expression, put them at the beginning of the regular expression. A flag construct placed later takes effect only from that point onward.

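What is the flag that specifies case insensitivity?

(?i)

Enables case-insensitive matching (default: off; use (?-i) to turn it off again). Written as (?i), it applies from that point to the end of the enclosing group; written as (?i:X), it applies to X only.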

What is the flag that specifies UNIX lines mode?

(?d)

Enables UNIX line mode (default: off, (?-d)). In this mode, only the '\n' line terminator is recognized in the behavior of ., ^, and $

What is the flag that specifies multi-line mode?

(?m)

Enables multi-line mode (default: off; use (?-m) to turn it off again). In this mode, ^ and $ match just after and just before each line terminator, as well as at the beginning and end of the input. By default, they match only at the beginning and end of the entire input sequence/string.

What is the flag that specifies the dot-all mode?

(?s)

Makes the dot ‘.’ match any character, including line terminators (default: off; use (?-s) to turn it off again). Normally, the dot character will match everything except line terminators.

What is the flag that specifies the Unicode-case mode?

(?u)

Enables Unicode-aware case folding (default: off; use (?-u) to turn it off again). It has an effect only together with (?i). By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched.
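
The effect is easiest to see with an accented letter. This sketch uses Unicode escapes for é (\u00e9) and É (\u00c9) so that it does not depend on the source file encoding; the helper name matchesWithFlags is made up.

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // Prepends the given inline flags to the regex and does a whole-string match.
    static boolean matchesWithFlags(String inlineFlags, String regex, String input) {
        return Pattern.matches(inlineFlags + regex, input);
    }

    public static void main(String[] args) {
        String lower = "\u00e9t\u00e9"; // "été"
        String upper = "\u00c9T\u00c9"; // "ÉTÉ"
        System.out.println(matchesWithFlags("(?i)", lower, upper));  // false: only ASCII is folded
        System.out.println(matchesWithFlags("(?iu)", lower, upper)); // true: Unicode-aware folding
    }
}
```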

What is the flag that specifies comment mode?

(?x)

Allows whitespace and comments in the pattern (default: off; use (?-x) to turn it off again). In this mode, whitespace is ignored, and embedded comments starting with '#' are ignored until the end of a line.
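
Comment mode pays off when a pattern is spread over several lines. The following sketch documents a simple, made-up phone number format; the class name, the constant PHONE, and the method prefixOf are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentModeDemo {
    // Each line of the pattern carries its own comment; whitespace is ignored.
    private static final Pattern PHONE = Pattern.compile(
            "(?x)        # enable comment mode\n"
          + "(\\d{3})    # group 1: three-digit prefix\n"
          + "-           # literal hyphen\n"
          + "(\\d{4})    # group 2: four-digit line number\n");

    // Returns the three-digit prefix, or null if the input is not of the form ddd-dddd.
    static String prefixOf(String phone) {
        Matcher matcher = PHONE.matcher(phone);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("555-1234")); // prints 555
    }
}
```

Because whitespace is ignored in this mode, a literal space has to be written as an escaped space or as \s.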

How can we specify flags for a particular non-capturing group?

In order to toggle flags for a single non-capturing group, the group must adhere to the following syntax:

"(?idx:Cars)[a-z]+"

The above expression will ignore case within the group, but adhere to case beyond.

How can we split a string into an array?

String[] splitArray = input.split("\\s+");
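
One detail to watch for: if the input starts with whitespace, split returns an empty string as the first element. A minimal sketch that trims first is shown below (the class and method names are made up).

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on runs of whitespace; trimming first avoids an empty leading element.
    static String[] words(String input) {
        return input.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(words("  one \t two\nthree "))); // prints [one, two, three]
        System.out.println("  a b".split("\\s+").length);                   // prints 3: "", "a", "b"
    }
}
```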

What is the purpose of the \Q and \E escape pattern?

There are two ways to use metacharacters as ordinary characters in regular expressions:

  1. Precede the metacharacter with a backslash (\).
  2. Enclose the metacharacters between \Q (which starts the quote) and \E (which ends it).
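
Both approaches can be sketched as follows. The example also uses Pattern.quote, a JDK method that returns a literal pattern string for arbitrary text; the class name and the helper countLiteral are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Counts how often literal occurs in haystack, treating every character literally.
    static int countLiteral(String haystack, String literal) {
        Matcher matcher = Pattern.compile(Pattern.quote(literal)).matcher(haystack);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));        // false: + is a quantifier
        System.out.println(Pattern.matches("1\\+1=2", "1+1=2"));      // true: + escaped with a backslash
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2"));  // true: quoted with \Q ... \E
        System.out.println(countLiteral("a.b a.b axb", "a.b"));       // prints 2: the dot is literal
    }
}
```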

What other methods are available via the matcher object?

  • appendReplacement
  • appendTail
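
These two methods work together when the replacement has to be computed for each match. The sketch below doubles every number in a string; the class and method names are made up. It uses StringBuffer, which these methods accept in every Java version.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Returns text with every run of digits replaced by twice its value.
    static String doubleNumbers(String text) {
        Matcher matcher = Pattern.compile("\\d+").matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            int doubled = Integer.parseInt(matcher.group()) * 2;
            // Copies the text before the match, then the computed replacement.
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(doubled)));
        }
        matcher.appendTail(out); // copies whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears")); // prints "6 apples and 24 pears"
    }
}
```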