OT: Perl Compatible Regex problem I can't solve

Mats Persson

7 Apr 2005

7:26 p.m.

OK, for way too many hours I have been trying to get this seemingly simple Perl compatible regex to work out, but I just can't seem to do it. I'm about to give up, move to a dark cave and shun computers for life. :(

The problem:

I have a string that looks something like this: /a/b/c/d/. It can be as short as /a/ or as long as /a/b/..../z/.

I would like to catch all the various bits in this string [ /a/b/c/d/ ] as follows:

$1 = /a/
$2 = b/
$3 = c/
$4 = d/

and so on for each added bit. The bits in between the "/" contain mainly alphanumerics.

I've tried every regex version of this that I can think of and most don't return a damn thing, and others return the wrong things. A few days ago I thought I got this regex stuff, but now I'm in serious doubt.

I know that I can work around the problem by doing other things, but it's become a bit of a burden on my mind. I'd like to know where I'm going wrong 'cause I can't see it at the moment and that drives me mad.

Extremely over the top grateful for any help. :)

Kind regards,

Mats

---- "TextMate, coding with an incredible sense of joy and ease" - www.macromates.com -

Jonathan Chaffer

7 Apr 2005

7:46 p.m.

I don't think you can do this, as I mentioned on IRC. You can easily match an entire string of that form: /([^/]*/)+

The problem comes in getting the captures you want. Each set of parens will get you *one* capture, not many. As TextMate's help puts it: "On repetition, the last string captured is reported."
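A minimal sketch of that behaviour, using Python's `re` module as a stand-in for PCRE (an assumption on my part; on this particular point the two engines behave the same way). The variable names are only for illustration.

```python
import re

# The whole-string pattern from above: a leading slash,
# then one or more "segment/" groups.
pattern = re.compile(r'/([^/]*/)+')

match = pattern.fullmatch('/a/b/c/d/')
last_capture = match.group(1)   # only the final repetition survives
all_groups = match.groups()     # a 1-tuple, however many segments matched

print(last_capture)
print(all_groups)
```

The pattern matches the entire string, yet the single set of parens reports only the last segment it saw.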

If you want to get multiple captures, you have to have multiple parens, which necessitates knowing how many (or at least the max number of) captures you have. For example:

/([^/]*/)([^/]*/)?([^/]*/)?([^/]*/)?([^/]*/)?([^/]*/)?([^/]*/)?([^/]*/)*
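To see what the fixed-count pattern actually returns, here is a rough sketch in Python (again standing in for PCRE). `MAX_OPTIONAL` and `fixed_captures` are hypothetical names, and the bound of six optional groups simply mirrors the pattern above.

```python
import re

MAX_OPTIONAL = 6  # assumed upper bound on extra segments, as in the pattern above

# One mandatory group, six optional ones, and a repeated catch-all group at the end.
pattern = re.compile('/([^/]*/)' + '([^/]*/)?' * MAX_OPTIONAL + '([^/]*/)*')

def fixed_captures(path):
    """Return the groups that took part in the match, in order."""
    m = pattern.fullmatch(path)
    if m is None:
        return []
    # Groups that did not participate are reported as None; drop them.
    return [g for g in m.groups() if g is not None]

print(fixed_captures('/a/b/c/d/'))
print(fixed_captures('/a/b/c/d/e/f/g/h/i/j/'))
```

With ten segments the final catch-all group again keeps only its last repetition, so the eighth and ninth segments are lost. That is the limitation this approach cannot escape.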

Typically you'd solve this by having a regex that matches one piece at a time, and reporting multiple matches. In PHP, this would mean preg_match_all(), though you could solve this problem more easily with explode().
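The "one small match, reported many times" idea might look like this in Python, where `re.finditer` plays roughly the role that preg_match_all() plays in PHP (that correspondence is my assumption, not something from the PHP docs quoted here):

```python
import re

path = '/a/b/c/d/'

# One small pattern, applied repeatedly: each match is a single "segment/" piece.
segments = [m.group(0) for m in re.finditer(r'[^/]+/', path)]

print(segments)
```

Each match stands on its own, so there is no limit on how many pieces come back.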

Mats Persson

11 Apr 2005

11:47 a.m.

Thanks everyone for your replies. Much appreciated!

Even though I've been working with regex on and off for many years I still find myself communicating with it in gestures and facial expressions rather than fluent talk. ;-)

Yes, I am well aware of the various alternatives, such as splitting on "/" or explode() in PHP, but they are all slightly imperfect workarounds that I (falsely, it turns out) assumed could be replaced with a single line of clever regex. Hah! How stupid one can be! Or is it the tools I'm working with? ;-)

The strings I'm working with follow this general format:

/a/b/c/d/ => (/a/)(b/)(c/)(d/) --> captures all of them
/a/b/c/d => (/a/)(b/)(c/) --> ignores the last bit [ d ] as it does not have an ending slash

So the regex I would have wanted was something like this (simplified in my logic):

(/\w+/)(\w+/)*?

returning an array like this:

( 1 => "/a/", 2 => "b/", 3 => "c/" ) (skipping /d )

I got most of the above working through PHP's preg_match_all() albeit slightly clumsily.
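For readers wondering what such a two-step approach could look like, here is a sketch in Python rather than PHP. It is not the code actually used above; `capture_segments` is a hypothetical helper that matches the leading "/a/" first and then collects every following "segment/" piece.

```python
import re

def capture_segments(path):
    """Return e.g. ['/a/', 'b/', 'c/'] for '/a/b/c/d' (hypothetical helper)."""
    first = re.match(r'/\w+/', path)
    if first is None:
        return []
    # Collect every remaining "word/" piece; a trailing piece
    # without a closing slash is simply not matched.
    rest = re.findall(r'\w+/', path[first.end():])
    return [first.group(0)] + rest

print(capture_segments('/a/b/c/d/'))
print(capture_segments('/a/b/c/d'))
```

Note that `re.findall` silently skips text that does not match, so this sketch assumes well-formed input.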

However, I realised - which is far too often the case when I'm encountering any problems - that my initial approach was wrong, and worked around the whole issue, so what I was trying to do is now done in less code just as I wanted in the first place. :)

The quote of the month was provided by Don Kalar on 7 Apr 2005, at 20:46:

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems." --Jamie Zawinski, in comp.lang.emacs

Kind regards,

Mats

Erich Ocean

7 Apr 2005

8:23 p.m.

Why are you using a regex for this? Why not simply split the string using '/' as the delimiter, and get the results back in an array? Then you can use indexed access and get the "captures" that way. Or perhaps I'm missing something here...

Best, Erich
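The split-and-index suggestion, sketched in Python for concreteness (the thread's own languages are Perl and PHP, so treat the exact calls as an illustration only):

```python
path = '/a/b/c/d/'

# Splitting on the delimiter yields empty strings for the leading
# and trailing slash; drop them.
parts = [p for p in path.split('/') if p]

first = parts[0]   # indexed access replaces $1, $2, ...
last = parts[-1]

print(parts, first, last)
```

The pieces come back without their slashes, which is one reason the split was described elsewhere in the thread as a slightly imperfect workaround.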

For new threads USE THIS: textmate@lists.macromates.com (threading gets destroyed and the universe will collapse if you don't) http://lists.macromates.com/mailman/listinfo/textmate

Don Kalar

9:46 p.m.

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

--Jamie Zawinski, in comp.lang.emacs

I think Erich is right - a simple parser seems like the better answer for you in this case.

cheers,

-don

David Lee

8 Apr 2005

7:12 a.m.

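    #!/usr/bin/perl
    # Strip the leading segment off each input line until nothing matches.
    while (<>) {
        chomp;  # otherwise the trailing newline is captured as a segment
        while ($_ =~ s/^\/?([^\/]+)\/?//) {
            print $1, "\n";
        }
    }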

It doesn't match them all in one match, but loops destructively against each line. It's also less preferable than a split, unless you're expecting input which contains slashes and doesn't otherwise match your criteria.
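The same destructive loop, sketched in Python for comparison. `consume_segments` and `SEGMENT` are hypothetical names; slicing the string plays the part of the Perl substitution.

```python
import re

SEGMENT = re.compile(r'/?([^/]+)/?')

def consume_segments(line):
    """Repeatedly strip the leading segment off the line, collecting each capture."""
    found = []
    remaining = line.strip()
    while True:
        m = SEGMENT.match(remaining)   # anchored at the start, like ^ in the Perl
        if m is None:
            break
        found.append(m.group(1))
        remaining = remaining[m.end():]  # "destructive": shorten the string
    return found

print(consume_segments('/a/b/c/d/\n'))
```

Because every pass is anchored at the start of what is left, the loop stops at the first piece that does not fit the pattern instead of skipping over it, which is the difference from a plain split.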

I realised when attempting this that I have no idea how to make captured matches not overwrite each other - e.g. running /^(?:(\w+) ?)+$/ against 'foo bar baz woz' returns only $1=woz; previous matches (foo etc) are overwritten.

If anyone knows how to alleviate that, it's driving me mental ..

Cheers

Paul McCann

8:11 a.m.

Dave wrote:

> It doesn't match them all in one match, but loops destructively against each line. It's also less preferable than a split, unless you're expecting input which contains slashes and doesn't otherwise match your criteria.

split is definitely the way to go for this problem (if shelling out isn't against the rules!).

> I realised when attempting this that I have no idea how to make captured matches not overwrite each other - e.g. running /^(?:(\w+) ?)+$/ against 'foo bar baz woz' returns only $1=woz; previous matches (foo etc) are overwritten.
>
> If anyone knows how to alleviate that, it's driving me mental ..

There is no way: the variables $1, $2 etc are reset on every successful use of a regex, even when the regex does not capture output via parentheses. This is as it should be (otherwise you'd get all sorts of leftover nonsense down the track). Just grab what you want at the time the regex runs.

Trying to mimic your solution with some personal preferences mixed in produced:

    #!/usr/bin/perl -w
    my $string = '/a/b/c/d/efgh/joke/345/';
    my @found = ($string =~ m|/?([^/]+)/?|g);
    print join "\n", @found;

(I've used | | as the regex delimiters to stop all the leaning toothpicks, and have used a global match rather than a substitution. In list context (as here) all the matches are returned, so I've captured them in the array @found.)
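For anyone following along outside Perl, a rough Python counterpart of the global match in list context is `re.findall`. With a single capturing group it returns the captured text of every match. The variable names below are mine.

```python
import re

string = '/a/b/c/d/efgh/joke/345/'

# With one capturing group, re.findall returns just the captured text of every match.
found = re.findall(r'/?([^/]+)/?', string)

# Same idea applied to the 'foo bar baz woz' example from earlier in the thread.
words = re.findall(r'(\w+) ?', 'foo bar baz woz')

print("\n".join(found))
print(words)
```

Nothing is overwritten here because each word is a separate match, not a repetition inside one match.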

Best wishes, Paul

Sune Foldager

9:01 a.m.

No regex query will match stuff like this (since the number of captures is potentially infinite), short of recursive regex stuff (which I think PCRE might support, but I am not sure how it relates to captures). But like other people suggested, using split or similar would be much better in this case, I think.

-- Sune.
