I recently read this post [1], which shows how to define a regular expression (using PCRE syntax) that looks very much like a proper grammar. A reduced example from the post looks like this:
    /
    (?(DEFINE)
      (?<addr_spec> (?&local_part) @ (?&domain) )
      (?<local_part> (?&dot_atom) | (?&quoted_string) | (?&obs_local_part) )
      (?<domain> (?&dot_atom) | (?&domain_literal) | (?&obs_domain) )
    )
    ^(?&addr_spec)$
    /x
The three capture groups “addr_spec”, “local_part” and “domain” would be the grammar rules. The (?&name) syntax refers to another named group. TextMate does not support that syntax but does support \g<name>, which the documentation refers to as a subexp call [2]. This syntax seems to have the same semantics. (DEFINE) seems to be PCRE-specific and basically means that the enclosed patterns are never matched directly; it just provides a place to define subpatterns. I didn’t find anything corresponding in the TextMate regular expression syntax, but wrapping the definitions in an optional group can be used as a workaround.
Here’s an example where I tried this technique to match a module declaration in the D language:
    (?:
      (?<module_declaration>(?<module>module)\s+\g<module_fully_qualified_name>\s*;)
      (?<module_fully_qualified_name>\g<module_name>|\g<packages>\.\g<module_name>)
      (?<module_name>\g<identifier>)
      (?<packages>\g<package_name>|\g<package_name>\.\g<packages>)
      (?<package_name>\g<identifier>)
      (?<identifier>\w+)
    )?
    \g<module_declaration>
This is exactly according to the specified grammar [3] and it seems to be working as expected. Not sure if the optional group workaround causes some performance implications.
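For quickly checking which inputs such a rule set should accept, the same rules can be approximated outside TextMate. The following is only a rough Python sketch, not the TextMate pattern itself: Python's `re` module has no subexpression calls, so each rule is composed as a string, and the right-recursive packages rule is rewritten as a repetition. The variable names mirror the group names above; `is_module_declaration` is a hypothetical helper.

```python
import re

# Each grammar rule becomes a named string; rules are composed bottom-up
# because Python's `re` has no equivalent of \g<name> subexpression calls.
identifier = r"\w+"
package_name = identifier
module_name = identifier
# The right-recursive rule "packages := package_name | package_name . packages"
# is rewritten as a repetition, since string composition cannot recurse.
packages = rf"{package_name}(?:\.{package_name})*"
module_fully_qualified_name = rf"(?:{packages}\.)?{module_name}"
module_declaration = rf"module\s+{module_fully_qualified_name}\s*;"

MODULE_RE = re.compile(module_declaration)


def is_module_declaration(line: str) -> bool:
    """Return True if the whole line is a D module declaration."""
    return MODULE_RE.fullmatch(line) is not None
```

With this, `is_module_declaration("module foo.bar.baz;")` is true, while a doubled dot as in `"module foo..bar;"` is rejected.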
This technique seems like it could be a viable alternative to supporting variables in the TextMate grammar as has been discussed before. What’s missing from this to make it really useful would be something like (DEFINE) in PCRE and a place in the TextMate grammar to place generic patterns used in multiple rules, like a pattern for identifiers.
[1] https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
[2] https://macromates.com/manual/en/regular_expressions
[3] https://dlang.org/spec/grammar.html#ModuleDeclaration
On 17 Oct 2018, at 17:36, Jacob Carlborg wrote:
> This is exactly according to the specified grammar [3] and it seems to be working as expected. Not sure if the optional group workaround causes some performance implications.
> This technique seems like it could be a viable alternative to supporting variables in the TextMate grammar as has been discussed before.
Just to be clear, you are talking about variables from the parsed language and highlighting them later in the scope, right?
So something like: `let foo = 42 in … something_with foo …` and here the latter instance of `foo` would be marked as a variable?
To be honest, I don’t follow this 100%. It seems highly impractical to have one big regular expression for the language, because then we cannot save the parser state and restart the parser on the line that got edited, which would be a major performance issue.
I am also not sure how this variable thing would work in a typical language where you can define an arbitrary number of variables on basically any given line.
On 21 Oct 2018, at 08:54, Allan Odgaard <mailinglist@textmate.org> wrote:
> On 17 Oct 2018, at 17:36, Jacob Carlborg wrote:
>> This is exactly according to the specified grammar [3] and it seems to be working as expected. Not sure if the optional group workaround causes some performance implications.
>> This technique seems like it could be a viable alternative to supporting variables in the TextMate grammar as has been discussed before.
> Just to be clear, you are talking about variables from the parsed language and highlighting them later in the scope, right?
> So something like: `let foo = 42 in … something_with foo …` and here the latter instance of `foo` would be marked as a variable?
No, I’m talking about having variables in the grammar with reusable regular expression snippets. Something like:
    {   patterns = (
            {   name = 'meta.definition.class.d';
                match = 'class $identifier'; // the “identifier” variable referenced/interpolated
            }
        );
        repository = {
            variables = {
                identifier = '(\w+)';
            };
        };
    }
In the above example “identifier” would be a variable that can be interpolated in other rules.
The above example would be one approach. Another approach to support variables is what I showed in my original post but using already existing features of regular expression.
    {   patterns = (
            {   name = 'meta.definition.class.d';
                // this is standard regular expression syntax that works today in TextMate; we just need somewhere to define “class_declaration”
                match = '\g<class_declaration>';
            },
            {   name = 'meta.definition.struct.d';
                // reusing the “identifier” subexpression in another pattern
                match = 'struct\s+\g<identifier>';
            }
        );
        repository = {
            // here subexpressions can be defined which can later be used in “patterns”
            defines = '(?x)
                (?<class_declaration>class\s+\g<identifier>)
                (?<identifier>\w+)
            ';
        };
    }
This is all about how one can write/implement a TextMate grammar. This approach would make it possible to write a TextMate grammar that closely follows a “real” grammar for a programming language, like a BNF grammar. I hope this makes it a bit clearer.
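To illustrate what a `defines` entry could mean in practice, here is a small Python sketch of one possible interpretation: a loader that textually inlines every \g<name> call from a table of definitions before the pattern reaches the regex engine. This is an assumption for illustration only, not how TextMate works; `DEFINES` and `expand` are hypothetical names, and recursive definitions (which a real subexp call can handle) are rejected here.

```python
import re

# Hypothetical "defines" table, mirroring the repository entry above.
DEFINES = {
    "class_declaration": r"class\s+\g<identifier>",
    "identifier": r"\w+",
}

# Matches a subexpression call such as \g<identifier>.
CALL = re.compile(r"\\g<(\w+)>")


def expand(pattern: str, defines: dict[str, str], depth: int = 0) -> str:
    """Inline every \\g<name> call with its definition in a non-capturing group."""
    if depth > 20:
        # Textual inlining never terminates for self-referencing rules.
        raise ValueError("recursive definitions cannot be inlined")

    def replace(match: re.Match) -> str:
        body = expand(defines[match.group(1)], defines, depth + 1)
        return "(?:" + body + ")"

    return CALL.sub(replace, pattern)
```

Under these assumptions, `expand(r"struct\s+\g<identifier>", DEFINES)` yields a pattern that matches `struct Point`.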
-- /Jacob Carlborg
On 21 Oct 2018, at 22:32, Jacob Carlborg wrote:
> No, I’m talking about having variables in the grammar with reusable regular expression snippets. Something like: […]
I see. Well, for that, I think it would be more consistent (with the existing feature set) to treat regular expressions as format strings, as we already use format strings in other places related to grammars. Since variables can be referenced as `${…}`, it should not really cause an issue with escaping, as the only thing that should meaningfully go after `$` in a regexp is `\n` (or possibly '(').
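As a rough illustration of why the `${…}` form avoids escaping trouble, here is a minimal Python sketch that expands only `${name}` references and leaves a bare `$` anchor untouched. It is not TextMate's format string implementation; `VARIABLES` and `interpolate` are hypothetical names chosen for the example.

```python
import re

# Hypothetical variable table; in a grammar this might live in the repository.
VARIABLES = {"identifier": r"(\w+)"}

# Only the ${name} form counts as a variable reference, so a bare `$`
# keeps its usual meaning as an end-of-line anchor.
REFERENCE = re.compile(r"\$\{(\w+)\}")


def interpolate(pattern: str, variables: dict[str, str]) -> str:
    """Replace each ${name} with its definition; raises KeyError if unknown."""
    return REFERENCE.sub(lambda m: variables[m.group(1)], pattern)


# The trailing `$` is not followed by `{`, so it survives as an anchor.
expanded = interpolate(r"^class\s+${identifier}$", VARIABLES)
```

Here `expanded` is `^class\s+(\w+)$`, which can be passed to the regex engine unchanged.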