A few days ago I posted to the list a regular expression that matches a balanced nest of brackets. The OP expressed curiosity as to how it works, and since he probably isn't the only one who found it confusing, I thought I'd post an explanation to the list.
If you have ever studied automata theory, perhaps you remember that nested brackets are the classic example of something that *can't* be matched by a regular expression. So you might be surprised (or disbelieving) to see a regular expression that does exactly that. The explanation, of course, is that theory and practice are closer in theory than in practice: Onigurama regular expressions, in common with many other flavours, are more powerful than the things that computer scientists call "regular expressions".
The key in this case is the recursion (or "subexpression call") feature. (For those of you who know about such things, this feature makes it possible to represent any context-free language — a step up the Chomsky hierarchy.) I like to think that I invented this feature; certainly I wrote the initial implementation for the PCRE engine, which as far as I know is the first place that it appeared. However, the Onigurama implementation may be a case of independent rediscovery: the manual describes it as "Tanaka Akira special".
Because the expressions quickly become complicated, it is very helpful to use the (?x: ... ) container, within which whitespace and comments are ignored, so that the expression can be laid out more clearly. (That is why you cannot put a newline *after* the closing parenthesis: outside the ?x group, whitespace is matched literally.)
For example, here is an expression to match a parenthesised string (which may have (nested) parentheses inside it). I hope the comments make it self-explanatory: please ask, if not!
(?x: ( # match the initial opening parenthesis
# Now make a named group 'balanced' which matches # a balanced substring. (?<balanced>
# A balanced substring is either something that is not a parenthesis: [^()]
| # …or a parenthesised string:
( # A parenthesised string begins with an opening parenthesis \g<balanced>* # …followed by a sequence of balanced substrings ) # …and ends with a closing parenthesis
)* # Look for a sequence of balanced substrings
) # Finally, the outer closing parenthesis
)
(Remember not to put a newline after that final parenthesis!)
It's brain-twisting in the same way that recursive functions are brain-twisting when you first encounter them; but once you've got the hang of it, it is quite easy to do and very useful.
Robin