All overlapping substrings matching a Java regex
Is there an API method that returns all (possibly overlapping) substrings that match a regular expression?
For example, I have a text string: String t = "04/31 412-555-1235";
and I have a pattern: Pattern p = Pattern.compile("\\d\\d+");
that matches strings of two or more digits.
The matches are: 04, 31, 412, 555, 1235.
How do I get the overlapping matches?
I want the code to return: 04, 31, 41, 412, 12, 55, 555, 55, 12, 123, 1235, 23, 235, 35.
Theoretically it should be possible: there is an obvious O(n^2) algorithm that enumerates all substrings and checks each one against the pattern.
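The brute-force idea described above could be sketched roughly as follows. This is only an illustration, not an existing API: the class name OverlappingMatches and the method allSubstringMatches are made up for this example, and each candidate is tested with substring() plus Matcher.matches().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class OverlappingMatches {
    // Hypothetical helper: collects every substring of text that the
    // pattern matches in full, ordered by start index and then by end index.
    static List<String> allSubstringMatches(String text, Pattern p) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ++i) {
            for (int j = i + 1; j <= text.length(); ++j) {
                String candidate = text.substring(i, j);
                // matches() requires the whole candidate to match, not just a part of it.
                if (p.matcher(candidate).matches()) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d\\d+");
        System.out.println(allSubstringMatches("04/31 412-555-1235", p));
    }
}
```

There are O(n^2) candidate substrings, and each one still has to be run through the regex engine, so the total cost is higher than O(n^2). Because every candidate is cut out of its context, patterns that look at surrounding text can behave differently here than they would in the full string.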
Edit
Rather than enumerating all substrings, it is safer to use the region(int start, int end) method in Matcher. Checking the pattern against a separate, extracted substring might change the result of the match (e.g. if there is a non-capturing group or a word boundary check at the start/end of the pattern).
Edit 2
Actually, it's unclear whether region() does what you would expect for zero-width matches. The specification is vague, and experiments yield disappointing results.
For example:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String line = "xx90xx";
    String pat = "\\b90\\b";
    System.out.println(Pattern.compile(pat).matcher(line).find()); // prints false
    for (int i = 0; i < line.length(); ++i) {
        for (int j = i + 1; j <= line.length(); ++j) {
            // Restrict the search to [i, j) and require the match to fill the whole region.
            Matcher m = Pattern.compile(pat).matcher(line).region(i, j);
            if (m.find() && m.group().length() == (j - i)) {
                System.out.println(m.group() + " (" + i + ", " + j + ")"); // prints 90 (2, 4)
            }
        }
    }
I'm not sure what the most elegant solution is. One approach would be to take a substring of line and pad it with the appropriate boundary characters before checking whether pat matches.
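Another option that may be worth trying, offered as a sketch rather than a confirmed fix: Matcher provides useTransparentBounds(true), which lets lookarounds and \b see the text outside the region, and useAnchoringBounds(false), which stops ^ and $ from matching at the region edges. The class name RegionMatches and the method allRegionMatches are hypothetical, and only a handful of inputs have been considered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionMatches {
    // Hypothetical helper: reports every region [i, j) of text that the
    // regex matches in full, while still seeing the surrounding context.
    static List<String> allRegionMatches(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        m.useTransparentBounds(true);  // \b and lookarounds may look past the region edges
        m.useAnchoringBounds(false);   // ^ and $ refer to the real input, not the region
        for (int i = 0; i < text.length(); ++i) {
            for (int j = i + 1; j <= text.length(); ++j) {
                // region() resets the matcher but keeps the two bounds settings above.
                m.region(i, j);
                if (m.matches()) {
                    result.add(m.group() + " [" + i + ", " + j + ")");
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allRegionMatches("xx90xx", "\\b90\\b"));
        System.out.println(allRegionMatches("a 90 b", "\\b90\\b"));
    }
}
```

With these settings the "xx90xx" case should no longer report a match for \b90\b, because the word boundary check sees the neighbouring x characters. The loops start at j = i + 1, so empty (zero-length) matches are not reported.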
Edit 3
Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position, by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried, although I haven't done extensive testing. It is certainly less efficient than it could be.
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static void allMatches(String text, String regex) {
        for (int i = 0; i < text.length(); ++i) {
            for (int j = i + 1; j <= text.length(); ++j) {
                // Pin the match to [i, j): exactly i characters before it and text.length() - j after it.
                String positionSpecificPattern = "((?<=^.{" + i + "})(" + regex + ")(?=.{" + (text.length() - j) + "}$))";
                Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);
                if (m.find()) {
                    System.out.println("Match found: \"" + m.group() + "\" at position [" + i + ", " + j + ")");
                }
            }
        }
    }
Edit 4
Here's a better way of doing this: https://stackoverflow.com/a/11372670/244526
Edit 5
The JRegex library supports finding all overlapping substrings matching a Java regex (although it appears not to have been updated in a while). Specifically, the documentation on non-breaking search specifies:
Using non-breaking search you can find all possible occurrences of a pattern, including those that are intersecting or nested. This is achieved by using the Matcher's method proceed() instead of find().
I faced a similar situation and tried the above answers, but in my case it took a lot of time setting the start and end index of the matcher. I think I've found a better solution, so I'm posting it here for others. Below is my code snippet.
    if (textToParse != null) {
        Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
        // Keep searching until the matcher reports that it hit the end of the input.
        while (!matcher.hitEnd()) {
            boolean result = matcher.find();
            int count = matcher.groupCount();
            System.out.println("result " + result + " count " + count);
            if (result && count == 1) {
                mergeFieldName = matcher.group(1);
                mergeFieldNames.add(mergeFieldName);
            }
        }
    }
I have used the matcher.hitEnd() method to check whether I have reached the end of the text.
Hope this helps. Thanks!
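For readers who want to run the hitEnd() loop above, here is a self-contained sketch. The answer does not show PLACEHOLDER_PATTERN, so the ${name} pattern used here is purely an assumption, as are the class name PlaceholderScan and the method mergeFieldNames.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderScan {
    // Assumed pattern (not from the answer): placeholders of the form ${name}.
    static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("\\$\\{(\\w+)\\}");

    // Hypothetical helper: collects group 1 of every match found by the hitEnd() loop.
    static List<String> mergeFieldNames(String textToParse) {
        List<String> names = new ArrayList<>();
        if (textToParse == null) {
            return names;
        }
        Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
        // Loop until the matcher reports that a search ran into the end of the input.
        while (!matcher.hitEnd()) {
            if (matcher.find()) {
                names.add(matcher.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(mergeFieldNames("Hello ${first} and ${last}!"));
    }
}
```

Note that each find() call resumes after the end of the previous match, so this loop walks through successive, non-overlapping matches.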