[TxMt] Better URL detection pattern

Juande Santander Vela juandesant at gmail.com
Wed Jul 28 08:58:05 UTC 2010


Cheers everyone!

I do not know whether regular expressions are involved in the way TextMate detects URLs in text, but I'd guess they are. In that case, John Gruber has just published a regexp that seems to do an even better job of finding URLs embedded in plain text (even when surrounded by parentheses or LaTeX code). The first link contains the description; the second is a test-case page:

http://daringfireball.net/2010/07/improved_regex_for_matching_urls
http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text

The only problem I find with it is that references to LaTeX parts, sections, chapters, etc., generated from the LaTeX templates would be matched as well. So, taking the last expression he offers (which handles only http/https links) as inspiration, I have generalised it to also include ftp, sftp, smb, afp, and telnet:

(?xi)
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |                       #   or
    s?ftps?://              # ftp, sftp or ftps protocol (also matches sftps)
    |                       #   or
    smb://                  # smb protocol
    |                       #   or
    afp://                  # Apple file sharing protocol
    |                       #   or
    telnet://               # telnet protocol
    |                       #   or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |                           #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                               #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)
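
As a quick sanity check outside TextMate, the pattern above can be exercised with a minimal Python sketch. Python's re module is only an assumption for testing purposes (TextMate's own engine may behave differently), the verbose and case-insensitive flags are passed as arguments instead of the leading (?xi), and the names URL_PATTERN and find_urls are made up for illustration:

```python
import re

# Same pattern as above; re.VERBOSE | re.IGNORECASE stand in for "(?xi)".
# URL_PATTERN and find_urls are hypothetical names used only in this sketch.
URL_PATTERN = re.compile(r"""
\b
(                       # Capture 1: entire matched URL
  (?:
    https?://               # http or https protocol
    |
    s?ftps?://              # ftp, sftp or ftps protocol
    |
    smb://                  # smb protocol
    |
    afp://                  # Apple file sharing protocol
    |
    telnet://               # telnet protocol
    |
    www\d{0,3}[.]           # www., www1., www2. and so on
    |
    [a-z0-9.\-]+[.][a-z]{2,4}/  # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+                  # Run of non-space, non-()<>
    |
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)
""", re.VERBOSE | re.IGNORECASE)


def find_urls(text):
    """Return every URL-like substring matched by URL_PATTERN, in order."""
    return [match.group(1) for match in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = ("See (http://example.com/foo_(bar)) and "
              "ftp://ftp.example.org/pub/file.txt, or sftp://host/path.")
    for url in find_urls(sample):
        print(url)
```

The sample text is invented; it is only meant to show that the surrounding parentheses and the trailing comma and full stop are left out of the matches.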

Can this be built into TextMate, or what should I change if I wanted it just for personal use?

Thanks!

--
Juande Santander Vela
Applied Scientist, Archive Management Group
Archive Department, Data Management & Operations Division
European Southern Observatory (Germany)

Felix Klein: Everyone knows what a curve is, until they have studied enough mathematics to become confused by the countless possible exceptions.



