[TxMt] Search and replace matching brackets?

Robin Houston robin.houston at gmail.com
Sun Sep 2 12:02:17 UTC 2007


A few days ago I posted to the list a regular expression that matches
a balanced nest of brackets. The OP expressed curiosity as to how it
works, and since he probably isn't the only one who found it
confusing, I thought I'd post an explanation to the list.

If you have ever studied automata theory, perhaps you remember that
nested brackets are the classic example of something that *can't* be
matched by a regular expression. So you might be surprised (or
disbelieving) to see a regular expression that does exactly that. The
explanation, of course, is that theory and practice are closer in
theory than in practice: Onigurama regular expressions, in common with
many other flavours, are more powerful than the things that computer
scientists call "regular expressions".

The key in this case is the recursion (or "subexpression call")
feature. (For those of you who know about such things, this feature
makes it possible to represent any context-free language — a step up
the Chomsky hierarchy.) I like to think that I invented this feature;
certainly I wrote the initial implementation for the PCRE engine,
which as far as I know is the first place that it appeared. However,
the Onigurama implementation may be a case of independent rediscovery:
the manual describes it as "Tanaka Akira special".

Because the expressions quickly become complicated, it is very helpful
to use the (?x: ... ) container, within which whitespace and comments
are ignored, so that the expression can be laid out more clearly.
(That is why you cannot put a newline *after* the closing parenthesis:
outside the ?x group, whitespace is matched literally.)

For example, here is an expression to match a parenthesised string
(which may have (nested) parentheses inside it). I hope the comments
make it self-explanatory: please ask, if not!

(?x:
	\(  # match the initial opening parenthesis

	# Now make a named group 'balanced' which matches
	# a balanced substring.
	(?<balanced>

		# A balanced substring is either something that is not a parenthesis:
		[^()]

		| # …or a parenthesised string:

		\( # A parenthesised string begins with an opening parenthesis
			\g<balanced>* # …followed by a sequence of balanced substrings
		\) # …and ends with a closing parenthesis

	)*  # Look for a sequence of balanced substrings

	\)  # Finally, the outer closing parenthesis

)

(Remember not to put a newline after that final parenthesis!)

It's brain-twisting in the same way that recursive functions are
brain-twisting when you first encounter them; but once you've got the
hang of it, it is quite easy to do and very useful.

Robin



More information about the textmate mailing list