On Nov 16, 2007, at 8:00 AM, Hans-Jörg Bibiko wrote:
On 16.11.2007, at 13:29, Thomas Aylott - subtleGradient wrote:
This runs into the problem I'd been having for 3 years. How do you get it to work when you have a tag nested inside the same kind of tag? How do you keep it from matching the first close tag it finds, or the very last one?
<div> <div> <div> TEXT </div> </div> </div>
Of course, you're right. That is THE problem! And I don't have a regexp-only solution for it either.
One way I have in mind is to write a character-by-character parser. Once one knows the tag name (e.g. 'p'), it should be possible to go step by step from the caret's position to the right, looking for '</p>'. If one finds '<p...>' on the way, a counter is incremented; if one finds '</p>', the counter is decremented; as soon as the counter drops below 0, I have found my closing tag (that is, its index). Next, the same from the caret's position to the left. If one writes this in perl/ruby/... and the entire text is stored as a character array, I can slice the array and end up with the desired string. With that string I can execute a normal findNext and findPrevious macro.
I don't know whether it works, but ... maybe I'll find some time to try it out. The advantage would be that I don't have to parse the entire document. Or one could write it in Objective-C as a plug-in, or Allan has a nice idea for it ;)
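To make the counter idea concrete, here is a rough, untested-in-TextMate sketch in Python (chosen only for illustration; the function name enclosing_tag_span is made up). It assumes the caret sits in text content rather than inside a tag, and it ignores comments, CDATA and attribute values that contain '>'. Instead of literally counting below zero, it stops at the first unmatched tag in each direction, which amounts to the same thing.

```python
import re

def enclosing_tag_span(text, caret, tag):
    """Return (start, end) of the innermost <tag>...</tag> around caret, or None.

    Hypothetical helper: scans outward from the caret with a depth counter,
    so only the text between the matching tags is ever looked at.
    """
    token = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(tag), re.IGNORECASE)

    # Scan to the right: nested openers raise the depth, closers lower it.
    depth, end = 0, None
    for m in token.finditer(text, caret):
        if m.group(0).endswith("/>"):
            continue                      # self-closing, does not nest
        if m.group(1):                    # a closing tag
            if depth == 0:
                end = m.end()
                break
            depth -= 1
        else:
            depth += 1
    if end is None:
        return None

    # Scan to the left: the roles of openers and closers are swapped.
    depth, pos = 0, caret
    while True:
        pos = text.rfind("<", 0, pos)
        if pos == -1:
            return None
        m = token.match(text, pos)
        if not m or m.group(0).endswith("/>"):
            continue
        if m.group(1):
            depth += 1
        elif depth == 0:
            return (pos, end)
        else:
            depth -= 1

doc = "<div> <div> <div> TEXT </div> </div> </div>"
start, end = enclosing_tag_span(doc, doc.index("TEXT"), "div")
print(doc[start:end])
```

The returned slice is the string one would then hand to a normal find, as described above.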
On the other hand, I thought about using an external HTML parser. This works, but the parser is also very slow if one has a large HTML file. One could think about restricting the area to be parsed (say, 100 lines above and below the current line), but this is also tricky.
Cheers,
--Hans
One idea is to remove the problem of all the nested identical tags by using 1 pass to make all tag names unique. Something like what you said, with a counter that goes up and down as it hits a duplicate tag name:
<div1> <div2> <div3> TEXT </div3> </div2> </div1>
Then you could do a simpler regex to find the balance of the tags.
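A rough sketch of that renumbering pass plus the simpler regex, in Python just for illustration. The names number_tags and balanced and the short VOID list are made up for this example. It assumes reasonably well-formed markup, and tag names that already end in a digit (h1 and friends) would need a separator before the counter if you want to undo the numbering cleanly afterwards.

```python
import re

# Matches an opening or closing tag: slash, name, and the rest up to '>'.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>")
# A few elements that never get a closing tag (assumed, not exhaustive).
VOID = {"br", "hr", "img", "input", "meta", "link"}

def number_tags(html):
    """Append the current nesting depth to every tag name, per tag name."""
    depth = {}

    def repl(m):
        closing, name, rest = m.groups()
        if rest.endswith("/") or name.lower() in VOID:
            return m.group(0)             # leave self-closing tags alone
        if closing:
            n = depth.get(name, 0)
            depth[name] = max(n - 1, 0)   # counter goes down on a close tag
            return "</%s%d%s>" % (name, n, rest)
        depth[name] = depth.get(name, 0) + 1   # and up on an open tag
        return "<%s%d%s>" % (name, depth[name], rest)

    return TAG.sub(repl, html)

def balanced(numbered, name):
    """Find the first <name>...</name> pair with a plain non-greedy regex."""
    pattern = r"<(%s)\b[^>]*>.*?</\1\s*>" % re.escape(name)
    m = re.search(pattern, numbered, re.DOTALL)
    return m.group(0) if m else None

numbered = number_tags("<div> <div> <div> TEXT </div> </div> </div>")
print(numbered)
print(balanced(numbered, "div2"))
```

Siblings at the same depth get the same number, which is fine here: the first close tag with that number after an open tag is always its partner.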
Then it's just a matter of wrapping the selection with something unique... Fixing the document again... And then finding your selection again... And then removing that unique wrapper.
We'd have to come up with a nice way to limit the scope initially so you don't have to parse the whole document every time.
I'm sure there's a simple way to do it that we're just not seeing.
—Thomas Aylott – subtleGradient—