Hello,
I've sent a mail to Allan regarding the problem, but I also want to share it with all of you subscribed to the TM list. Maybe you have encountered this too?
I'm working with a few auto-generated FO (XML) files that have no line breaks.
You can reproduce such a dummy file with:
$ cd /tmp
$ perl -e '{print "<html></html>" x 20000}' > test.html
$ mate test.html
As you will see, TM takes 100% CPU for a really long while until syntax highlighting appears. Any modification (such as inserting a line break) causes TM to take 100% CPU again for a while.
I'm aware that working with such files is a rare case, but is there any remedy for this problem? I have soft wrap turned on, and using System Monitor I can see that it is the tokenizer thread that is taking the whole CPU.
Regards,
On Wed, Feb 27, 2008 at 12:51 PM, Adam Strzelecki ono@java.pl wrote:
Hello,
I've sent a mail to Allan regarding the problem, but I also want to share it with all of you subscribed to the TM list. Maybe you have encountered this too?
I'm working with a few auto-generated FO (XML) files that have no line breaks.
You can reproduce such a dummy file with:
$ cd /tmp
$ perl -e '{print "<html></html>" x 20000}' > test.html
$ mate test.html
As you will see, TM takes 100% CPU for a really long while until syntax highlighting appears. Any modification (such as inserting a line break) causes TM to take 100% CPU again for a while.
Interesting -- TM should get optimized for two cores… most of the workload is done by one core, and from time to time one sees peaks in the other core, but never anything beyond 100% (I would have expected nearly 200% CPU load).
Niels
On 27 Feb 2008, at 12:51, Adam Strzelecki wrote:
I've sent a mail to Allan regarding the problem, but I also want to share it with all of you subscribed to the TM list. Maybe you have encountered this too?
I'm working with a few auto-generated FO (XML) files that have no line breaks.
You can reproduce such a dummy file with:
$ cd /tmp
$ perl -e '{print "<html></html>" x 20000}' > test.html
$ mate test.html
If I do this, test.html pops up in a second. My settings: Soft Wrap is on AND Check Spelling as You Type is off. But then you get into trouble while editing. In such a case, where an XML file contains only one long line, I use a tiny Perl script that replaces every closing tag with the closing tag itself plus '\n' before I pipe the text to mate.
--Hans
If I do this, test.html pops up in a second. My settings: Soft Wrap is on AND Check Spelling as You Type is off. But then you get into trouble while editing.
This is exactly what I'm trying to emphasize.
When I use:
$ perl -e '{print "<html></html>" x 20000}' > test.html
the file loads immediately, but TM takes 100% CPU for a few minutes before syntax highlighting appears.
When I use:
$ perl -e '{print "<html></html>\n" x 20000}' > test.html
(note the \n) the file loads immediately, together with syntax highlighting and no 100% CPU.
So I think there's definitely something wrong with the syntax highlighter (tokenizer). I remember the compiler & parser construction lessons at my university, and lexer & parser performance shouldn't depend on line breaks.
Regards,
On 27/02/2008, Adam Strzelecki ono@java.pl wrote:
When I use:
$ perl -e '{print "<html></html>\n" x 20000}' > test.html
(note the \n) the file loads immediately, together with syntax highlighting and no 100% CPU.
So I think there's definitely something wrong with the syntax highlighter (tokenizer). I remember the compiler & parser construction lessons at my university, and lexer & parser performance shouldn't depend on line breaks.
But in TextMate the syntax highlighter (and more) is line-based and works with regular expressions, not a precompiled lexer/parser, so yes, the line length does matter.
Hello,
But in TextMate the syntax highlighter (and more) is line-based and works with regular expressions, not a precompiled lexer/parser, so yes, the line length does matter.
I thought length doesn't matter ;)
Anyway, I'm not convinced :) I agree that a precompiled lexer/parser is simply faster, but I don't see why a regexp tokenizer should work 1000x slower on a file with a single 200,000-character line than on 20,000 lines of 10 characters each (while both files are exactly the same size... ouch, the faster one is slightly bigger because of the extra \n's ;P)
I hope TM compiles its regular expressions just once, and moreover executes a single regexp merged from all the regexps rather than all the regular expressions separately. With this belief, and the fact that a compiled regexp is an automaton similar to the one in precompiled parsers & lexers, line length shouldn't IMHO matter. So if it does, there's room for optimization.
Cheers,
You missed the "line-based" part. This means that each regular expression must be checked against each entire line. Unless I'm mistaken, the time taken by any regexp mechanism is going to be much more than linear in the length of the string, probably exponential in the worst case. On the other hand, if you test it on strings of fixed length, it will be linear in the number of strings tested, which is the second case you described. In other words, with a line-based system it takes longer to process a single long string than to process many short strings of the same total length.
So it is not so much a question of compiling; the timing is very different because newlines do matter: the parser matches single lines via regular expressions. The longer the line, the harder it is to determine whether one of the regular expressions matches the whole line or not.
If the parser ignored newlines, every regexp would need to be tested against the entire document on every change. That would be an even worse slowdown.
Allan is certainly aware of this slowdown on long lines; it is definitely in the top 5 complaints. If there were an easy solution that didn't sacrifice the richness of the highlighting and the flexibility of its user customization, it would probably already have been implemented.
Haris Skiadas Department of Mathematics and Computer Science Hanover College
On Feb 27, 2008, at 9:59 AM, Adam Strzelecki wrote:
Hello,
But in TextMate the syntax highlighter (and more) is line-based and works with regular expressions, not a precompiled lexer/parser, so yes, the line length does matter.
I thought length doesn't matter ;)
Anyway, I'm not convinced :) I agree that a precompiled lexer/parser is simply faster, but I don't see why a regexp tokenizer should work 1000x slower on a file with a single 200,000-character line than on 20,000 lines of 10 characters each (while both files are exactly the same size... ouch, the faster one is slightly bigger because of the extra \n's ;P)
I hope TM compiles its regular expressions just once, and moreover executes a single regexp merged from all the regexps rather than all the regular expressions separately. With this belief, and the fact that a compiled regexp is an automaton similar to the one in precompiled parsers & lexers, line length shouldn't IMHO matter. So if it does, there's room for optimization.
Cheers,
Adam Strzelecki |: nanoant.com :|