Re: [TxMt] TM tokenizer is taking 100% CPU for long while using 100-200KB text files with single line

27 Feb 2008


You missed the "line-based" part. This means that each regular
expression must be checked against each entire line. Unless I'm
mistaken, the time taken by any regexp mechanism is going to be much
more than linear in the length of the string, probably exponential in
the worst case. On the other hand, if you test it on strings of fixed
length, it will be linear in the number of strings tested, which is
the second case you described. In other words, when using a
line-based system it takes longer to process a single long string than
to process many strings of small length.
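
To put rough numbers on that, here is a toy cost model in Python. It is
only a sketch under an assumption of mine, not a measurement of TextMate:
suppose that at every position in a line some rule has to scan to the end
of that line, so a line of length n costs n(n+1)/2 character checks. The
function names are made up for the illustration.

```python
def scan_cost(line_length: int) -> int:
    """Character checks for one line, ASSUMING that at each position
    a rule scans all the way to the end of the line (quadratic model)."""
    return sum(line_length - pos for pos in range(line_length))


def document_cost(line_lengths) -> int:
    """Total cost of a document: lines are processed independently."""
    return sum(scan_cost(n) for n in line_lengths)


# The two files from the question: same number of characters,
# split very differently into lines.
one_long_line = document_cost([200_000])
many_short_lines = document_cost([10] * 20_000)

print(one_long_line, many_short_lines, one_long_line / many_short_lines)
```

Under this assumed model the single 200,000-character line costs roughly
18,000 times as much as 20,000 lines of 10 characters. A real engine may
do better or worse than quadratic, so only the shape of the comparison
matters here.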

So it is not so much a question of compiling. The timing is very
different because newlines do matter: the parser matches single lines
via regular expressions. The longer the line, the harder it is to
determine whether one of the regular expressions matches the whole
line or not.
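
For what it's worth, a line-based regexp tokenizer can be sketched in a few
lines of Python. This is a minimal sketch with three made-up rules, not
TextMate's actual grammar engine: every rule is tried at the current
position of one line, and no match ever crosses a newline.

```python
import re

# Hypothetical rules, for illustration only; tried in this order.
RULES = [
    ("number", re.compile(r"\d+")),
    ("word", re.compile(r"[A-Za-z_]\w*")),
    ("space", re.compile(r"[ \t]+")),
]


def tokenize_line(line):
    """Tokenize one line by trying each rule at the current position."""
    tokens = []
    pos = 0
    while pos < len(line):
        for name, rx in RULES:
            m = rx.match(line, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched here: emit the single character as-is.
            tokens.append(("other", line[pos]))
            pos += 1
    return tokens


def tokenize_document(text):
    """Each line is handled on its own; newlines bound every match."""
    return [tokenize_line(line) for line in text.split("\n")]
```

With this structure the work per line depends only on that line, so a
document made of one enormous line gets none of the benefit of being
split into small independent pieces.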

If the parser were ignoring newlines, every regexp would need to be
tested against the entire document every single time. This would be
an even worse slowdown.

Allan is certainly aware of this slowdown on long lines; it is
definitely in the top 5 complaints. If there were an easy solution
that didn't sacrifice the richness of the highlighting or the
flexibility of user customization, it would probably have already
been implemented.

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College
On Feb 27, 2008, at 9:59 AM, Adam Strzelecki wrote:
...
Hello,
...
But in TextMate the syntax highlighter (and more) is line-based and
works with regular expressions and not a precompiled lexer/parser,  
so,
yes, the line length does matter.
I thought length doesn't matter ;)
Anyway, I'm not convinced :) I agree that a precompiled lexer/parser
is simply faster, but I don't see why a regexp tokenizer should work
1000x slower on a file that has a single 200'000-character line than
on one with 20000 x 10-character lines (while both files are exactly
the same size... ouch, the faster one is slightly bigger because of
the extra \n ;P)
I hope TM compiles its regular expressions just once and, moreover,
executes a single regexp merged from all of them, rather than running
all the regular expressions separately.
With this belief, and the fact that a compiled regexp is an automaton
similar to the one in precompiled parsers & lexers, line length
shouldn't IMHO matter.
So if it does, there's a place for optimization.
Cheers,
Adam Strzelecki |: nanoant.com :|
