Regex: Captures versus Groups - what's the deal? (Part 2)

Published: Fri 14 July 2017
By EWS


In Part 1 I chatted about using regular expressions in .NET and grumbled a bit about how they're just not as "easy" as in, say, Ruby, sed, awk or even Bash:

#!/bin/bash

if [[ "The quick ..." =~ q[[:alpha:]]+ ]]; then
    echo "Matched ${BASH_REMATCH[0]}"
else
    echo "No match"
fi
$ ./content/code/bash-regex.sh
Matched quick

To recap: at first glance, the following seems intuitive ...

using System.Text.RegularExpressions;

foreach (var m in Regex.Matches(@"The quick", @"q\w+"))
    Console.WriteLine(m);
$ csharp content/code/t-d-w-c-a-g/not-int.cs
quick

The issue is that the value yielded on each iteration isn't a string; it's a Match object. A Match gives us a lot more than a string does, but for "quick and dirty" matching where a string is all we want, we need to do something along these lines:

if (Regex.Match("The quick", @"q[^\s]*").ToString() == "quick")
    Console.WriteLine("Yes, it matches...");

i.e., add (an unwieldy) ToString(). That said, the extra hustle needed to get a plain string could arguably be placed in the box labelled "nit-pick". The less-than-desirable discoverability is something else: once you've got the Match, what do you do with it if you want more than just its string representation? Visual Studio IntelliSense offers two properties, Captures and Groups, whose names both suggest that they can be used to extract the results of the match:

Which one?

The question is: which one do we use?

The trick is to understand that Captures and Groups are only relevant when we're applying a regular expression that contains groups. For example (in Bash for now, so as not to jump the gun):

$ [[ "The 3:10pm to yuma" =~ ([0-9]+):([0-9]+)(am|pm) ]]
$ echo "${BASH_REMATCH[@]}"
3:10pm 3 10 pm

Picking out groups is very easy: just enclose what you want captured in parentheses, and refer to the relevant index in the shell variable BASH_REMATCH. In the above Bash example, the first index (that is, 0) of the BASH_REMATCH array is the whole match, and subsequent indices are the individual groups, numbered in the order in which they appear in the pattern.

The C# equivalent:

using System.Text.RegularExpressions;

foreach (var g in Regex.Match(
                  "The 3:10pm to yuma", @"([0-9]+):([0-9]+)(am|pm)").Groups)
    Console.WriteLine(g);
$ sed -n '71,75p' content/\*pt2.rst | csharp --
3:10pm
3
10
pm

So the .NET Groups property follows the same convention as the BASH_REMATCH shell variable, i.e., the first entry is the whole match, and subsequent entries are the individual groups. If we were to match against a pattern that does not contain any groups, we'd still expect to have exactly one entry in the Groups property, that is, the whole match (assuming, of course, that the match succeeded):

foreach (var g in Regex.Match(
                  "The 3:10pm to yuma", @"[0-9]+:[0-9]+(?:am|pm)").Groups)
    Console.WriteLine(g);
$ sed -rn -e '71p' -e '94,+2p' content/\*pt2.rst | csharp --
3:10pm
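
As a quick check of that convention, the sketch below compares Groups.Count for a pattern with three groups against one with none. It assumes a C# version that allows top-level statements (or the csharp REPL used elsewhere in this post); the variable names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var input = "The 3:10pm to yuma";

// Three capturing groups: Groups holds the whole match plus one entry per group.
var grouped = Regex.Match(input, @"([0-9]+):([0-9]+)(am|pm)");

// No capturing groups ((?:...) is non-capturing): only the whole match remains.
var ungrouped = Regex.Match(input, @"[0-9]+:[0-9]+(?:am|pm)");

Console.WriteLine(grouped.Groups.Count);    // 4
Console.WriteLine(ungrouped.Groups.Count);  // 1
Console.WriteLine(ungrouped.Groups[0]);     // 3:10pm
```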

Things get interesting, though, when groups are used in a compositional fashion, for example:

var matches = Regex.Matches("3:10pm is the time...", @"((\d)+):((\d)+)(pm|am)");
foreach (var m in matches.Cast<Match>())
{
    Console.WriteLine("Match: \"" + m + "\"");
    var gnum = 0;
    foreach (var g in m.Groups)
        Console.WriteLine("       Group[" + gnum++ + "]: " + g);
    Console.WriteLine();
}
$ sed -rn -e '1iusing System.Text.RegularExpressions;' -e '112,+8p' \
     content/\*pt2.rst | csharp --
Match: "3:10pm"
       Group[0]: 3:10pm
       Group[1]: 3
       Group[2]: 3
       Group[3]: 10
       Group[4]: 0
       Group[5]: pm

So, what happened to the 1? Well, if we look at the regular expression that was used, we've got a) nested groups and b) repeated groups: ((\d)+). The Groups property on a Match is a collection of Group objects, each one corresponding to a set of parentheses in the original regular expression, so what happens when a group matches more than once? In that case a Group holds only the last substring it captured, and that is what ToString() returns: the inner (\d) matched 1 and then 0, so Groups[4] reports 0.
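
If all you want is the hour and the minutes, one way to sidestep the surprise is to put the quantifier inside the parentheses, as the earlier ([0-9]+) pattern did. The following is a minimal sketch of the difference, not the only way to write it; it assumes top-level statements are available, and the variable names are mine.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "3:10pm is the time...";

// (\d)+ repeats the group itself: each iteration overwrites the group's value.
var repeated = Regex.Match(input, @"(\d)+:(\d)+(pm|am)");

// (\d+) repeats inside the group: the group matches once and keeps every digit.
var single = Regex.Match(input, @"(\d+):(\d+)(pm|am)");

var repeatedValues = repeated.Groups.Cast<Group>().Select(g => g.Value).ToArray();
var singleValues = single.Groups.Cast<Group>().Select(g => g.Value).ToArray();

Console.WriteLine(string.Join(" | ", repeatedValues));  // 3:10pm | 3 | 0 | pm
Console.WriteLine(string.Join(" | ", singleValues));    // 3:10pm | 3 | 10 | pm
```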

This is where Captures comes in; it's a means of "getting at" the intermediate matches for a single group:

 1 var matches = Regex.Matches("3:10pm is the time...", @"((\d)+):((\d)+)(pm|am)");
 2 foreach (var m in matches.Cast<Match>())
 3 {
 4     Func<int,string> rps = (c) => new String(' ', c);
 5     Console.WriteLine("Match: \"" + m + "\"");
 6     var gnum = 0;
 7     foreach (var g in m.Groups.Cast<Group>())
 8     {
 9         Console.WriteLine(rps(7)+"Groups[" + gnum++ + "]: " + g);
10 
11         if (gnum == 5)
12             Console.WriteLine(rps(18)+"Captures: " +
13                 string.Join(", ", g.Captures.Cast<Capture>()));
14 
15     }
16     Console.WriteLine();
17 }

The above code snippet is very similar to the previous one; we've just added lines 11 to 13, which expose the captures within the group when we reach the fifth group (index 4; gnum has already been incremented to 5 by the time it is tested).

$ sed -rn -e '1iusing System.Text.RegularExpressions;' -e '151,+16p' \
  content/\*pt2.rst  | csharp --
Match: "3:10pm"
       Groups[0]: 3:10pm
       Groups[1]: 3
       Groups[2]: 3
       Groups[3]: 10
       Groups[4]: 0
                  Captures: 1, 0
       Groups[5]: pm
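
To see how Captures relates to every group rather than just one, the sketch below lists each group's value next to all of its captures. It is an illustrative variation on the snippet above, assuming top-level statements; the "<-" output format is made up for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var m = Regex.Match("3:10pm is the time...", @"((\d)+):((\d)+)(pm|am)");

// One line per group: the group's value, then every capture it made, in order.
var lines = m.Groups.Cast<Group>()
    .Select((g, i) => "Groups[" + i + "]: " + g.Value + " <- " +
        string.Join(", ", g.Captures.Cast<Capture>().Select(c => c.Value)))
    .ToList();

foreach (var line in lines)
    Console.WriteLine(line);

// Prints:
// Groups[0]: 3:10pm <- 3:10pm
// Groups[1]: 3 <- 3
// Groups[2]: 3 <- 3
// Groups[3]: 10 <- 10
// Groups[4]: 0 <- 1, 0
// Groups[5]: pm <- pm
```

Only Groups[4] has more than one capture here, and its value is the last of them.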

For most regular expression matching, it isn't necessary to lean on Captures; Match and Match.Groups should be entirely sufficient. In short, if a match is successful, then match.Groups[0] always holds the whole match, so the following is always true:

match.Groups[0].ToString() == match.ToString()

If you want to pull groups out of a match then, obviously, use match.Groups; otherwise, match.ToString() should be just fine. It is a little disappointing that one has to keep calling Cast<T> in order to benefit from Linq on matches, groups and captures.

var m = Regex.Match("The train has arrived ... 3:10pm", @"((\d)+):((\d)+)(pm|am)");
Console.WriteLine(string.Format("{0} == {1}", m, m.Groups[0]));
$ sed -rn -e '1iusing System.Text.RegularExpressions;' -e '201,+1p' \
  content/\*pt2.rst  | csharp --

3:10pm == 3:10pm
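
As a closing sketch: Match, Group and Capture all expose a Value property holding the matched text, which can stand in for the ToString() calls used throughout this post, and a Cast<Group>() followed by Skip(1) is one way to get at just the individual groups with Linq. The snippet assumes top-level statements and uses a simplified, un-nested pattern; the variable names are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var m = Regex.Match("The train has arrived ... 3:10pm", @"(\d+):(\d+)(pm|am)");

// Value holds the matched text, so no ToString() call is needed.
var whole = m.Value;

// Skip(1) drops Groups[0] (the whole match) and keeps the individual groups.
var parts = m.Groups.Cast<Group>().Skip(1).Select(g => g.Value).ToArray();

Console.WriteLine(m.Success);                   // True
Console.WriteLine(whole == m.Groups[0].Value);  // True
Console.WriteLine(string.Join(" ", parts));     // 3 10 pm
```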
