[TxMt] [ANN] Select Balanced HTML Tag!!!1!

Alex Ross alex.j.ross at gmail.com
Fri Nov 16 17:30:01 UTC 2007


On Nov 16, 2007, at 8:02 AM, Thomas Aylott - subtleGradient wrote:
> On Nov 16, 2007, at 8:00 AM, Hans-Jörg Bibiko wrote:
>
>> On 16.11.2007, at 13:29, Thomas Aylott - subtleGradient wrote:
>>> This runs into the problem I'd been having for 3 years.
>>> How do you get it to work when you have a tag nested inside the  
>>> same kind of tag?
>>> Keeping it from matching the first close tag it finds, or the very  
>>> last one.
>>> <div>
>>> 	<div>
>>> 		<div>
>>> 			TEXT
>>> 		</div>
>>> 	</div>
>>> </div>
>>
>> Of course, you're right. That is THE problem! And I also have no  
>> solution for it by using regexp.
>>
>> One way I have in mind is to write a character-by-character  
>> parser. Once one knows the tag name (e.g. 'p'), it should be  
>> possible to go from the caret's position step by step to the  
>> right to look for '</p>'. If one finds '<p...>' while doing this, a  
>> counter is incremented (counter+1); if one finds '</p>', the  
>> counter is decremented (counter-1); once counter < 0, I have found  
>> my closing tag (meaning its index). Next, do the same from the  
>> caret's position to the left. If one writes this in perl/ruby/...  
>> and the entire text is stored as a character array, I can splice  
>> the array and finally I have the desired string. With that string  
>> I can execute a normal findNext and findPrevious macro.
>>
>> I don't know whether it works but ...
>> Maybe I find some time to try it out. The advantage would be that I  
>> don't have to parse the entire document.
>> Or one would write it in Objective-C as plug-in, or Allan has a  
>> nice idea for it ;)
>>
>> On the other hand, I thought about using an external HTML parser.  
>> This works, but the parser is also very slow if one has a large  
>> HTML file. One could think about restricting the area to parse  
>> (100 lines above and below the current line), but this is also  
>> tricky.
>>
>>
>> Cheers,
>>
>> --Hans
>
>
> One idea is to remove the problem of all the nested identical tags  
> by using 1 pass to make all tagnames unique.
> Something like what you said with a counter that goes up and down as  
> it hits a duplicate tagname:
>
> <div1>
> 	<div2>
> 		<div3>
> 			TEXT
> 		</div3>
> 	</div2>
> </div1>
>
> Then you could do a simpler regex to find the balance of the tags.
>
> Then it's just a matter of wrapping the selection with something  
> unique...
> Fixing the document again...
> And then finding your selection again...
> And then removing that unique wrapper.
>
> We'd have to come up with a nice way to limit the scope initially so  
> you don't have to parse the whole document every time.
>
> I'm sure there's a simple way to do it that we're just not seeing.
>
> —Thomas Aylott – subtleGradient—

I don't mean to get all nerdy on you guys, but the problem that you're
running into here is that nested HTML tags form a context-free
language, not a regular one.  Regular expressions, in the strict
sense, can't match arbitrarily deep nesting.  There are mathematical
proofs of this (the pumping lemma for regular languages)... so what
you're trying to do with a plain regex is really, truly impossible.

What you need is some way to remember state, which is either the
counter that Hans mentioned or a transformation of the tags so that
each has a unique (ordered) identifier.  Also, you can match it if you
have recursive subexpression calls like the ones in Oniguruma.  But
you don't.  My uh... suggestion is to give up using plain-old regular
expressions :).  They can never work for this.
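
To make the counter idea concrete, here is a rough sketch in Python.
It is only an illustration, not something from TextMate or from the
macro being discussed: the function name select_balanced is made up,
and it assumes the tag name is already known, the caret sits in text
content (not inside a tag), and the markup has no comments, CDATA, or
'>' characters inside attribute values.  A regex is used only to
recognise one tag at a time; the nesting is tracked by the counter.

```python
import re

def select_balanced(text, caret, name):
    """Return (start, end) of the innermost <name>...</name> element
    that encloses the caret offset, or None if there is none.

    Hypothetical helper: assumes the caret is not inside a tag.
    """
    # Matches a single opening, closing, or self-closing tag called `name`.
    # group(1) is '/' for a closing tag, group(2) is '/' for <name ... />.
    tag = re.compile(r"<(/?)%s(?=[\s/>])[^>]*?(/?)>" % re.escape(name),
                     re.IGNORECASE)

    # Scan right from the caret: +1 on an opening tag, -1 on a closing tag.
    # The first time the counter drops below zero we have our closing tag.
    depth = 0
    end = None
    for m in tag.finditer(text, caret):
        if m.group(2):          # self-closing tags do not change nesting
            continue
        depth += -1 if m.group(1) else 1
        if depth < 0:
            end = m.end()
            break
    if end is None:
        return None

    # Scan left from the caret, one '<' at a time, with the signs swapped:
    # +1 on a closing tag, -1 on an opening tag.
    depth = 0
    pos = caret
    while True:
        pos = text.rfind("<", 0, pos)
        if pos < 0:
            return None
        m = tag.match(text, pos, caret)
        if m is None or m.group(2):
            continue
        depth += 1 if m.group(1) else -1
        if depth < 0:
            return pos, end

# Usage with the nested <div> example from earlier in the thread.
doc = "<div>\n\t<div>\n\t\t<div>\n\t\t\tTEXT\n\t\t</div>\n\t</div>\n</div>"
start, end = select_balanced(doc, doc.index("TEXT"), "div")
print(repr(doc[start:end]))  # the innermost <div>...</div> only
```

Both scans stop as soon as the counter goes negative, so only the
text between the matching pair is examined, not the whole document.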

–Alex





