[TxMt] onigrep : Help wanted (a bit off-topic)

Hans-Joerg Bibiko bibiko at eva.mpg.de
Fri Jun 29 08:36:40 UTC 2007


Dear all,

I know it is a bit off-topic but I believe it could also be  
interesting for some TM users ;)

I'm currently writing a grep-like command line tool based on the
Oniguruma library to work with UTF-8 data.
It works perfectly, and in many cases it's faster than grep ;)

To be sure that this command line tool, written in pure C, works on
other Macs as well, I'd appreciate it if someone with a bit of time and
a bit of free hard disk space could check whether it runs for her/him
too, especially on an Intel Mac.

To run onigrep it is necessary to install the Oniguruma dylib
beforehand. To do this, simply:

- download the source code from
  http://www.geocities.jp/kosako3/oniguruma/archive/onig-5.8.0.tar.gz
- untar it
- cd into that folder
- execute:
./configure
make
sudo make install

That's it.
Normally the Oniguruma dylib is installed in /usr/local/lib.
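
A quick way to check that the install step worked is to ask the system
whether it can find and load the library. The following is only a rough
sketch in Python, not part of onigrep: the helper name
oniguruma_version is made up for illustration, and it assumes the
library is installed under the name "onig" and exports Oniguruma's
onig_version() function.

```python
import ctypes
import ctypes.util


def oniguruma_version():
    """Return the installed Oniguruma version string, or None if the
    library cannot be found or loaded (hypothetical helper)."""
    # Assumption: the shared library is registered under the name "onig".
    path = ctypes.util.find_library("onig")
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        # onig_version() returns a C string (const char *).
        lib.onig_version.restype = ctypes.c_char_p
        return lib.onig_version().decode("ascii", "replace")
    except (OSError, AttributeError):
        # Library present but not loadable, or symbol missing.
        return None


if __name__ == "__main__":
    version = oniguruma_version()
    if version is None:
        print("Oniguruma not found; check /usr/local/lib")
    else:
        print("Oniguruma", version)
```

If this reports that the library was not found even though 'sudo make
install' succeeded, the install prefix is probably not on the loader's
search path.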

[I believe using the external dylib is the best choice because
Oniguruma will keep improving, so you only have to upgrade the dylib
and not onigrep.]

Now you can run onigrep. For help type 'onigrep --help'. So far it
only reads UTF-8 data from stdin.
[Please note: if you didn't copy onigrep into a folder listed in $PATH,
you have to type the entire path to onigrep, or, if you're in the
folder where onigrep is located, just type ./onigrep]

Some features in brief:
- utf-8 support (that means a '.' is really one Unicode character)
- ignore case also works for all Unicode characters, not only for ASCII
- you can search across \n; multi-line mode
- ignore combining diacritics (for that you have to decompose
accented characters according to the Unicode canonical decomposition
algorithm)
   (I attached such a tool. It is called 'unorm'. For help run 'unorm
--help'.)
    example:
    echo "Ag̀nes" | ./onigrep -id -i -o "a(.)n"

    will output 'g̀'

    echo "Ag̀nes" | ./onigrep -i -o "a(.)n"

    will output nothing because g̀ is written with two Unicode
characters (g plus a combining grave accent), so '.' matches only the g
- it is faster than grep in many cases:

   try:
   cat /usr/share/dict/web2 | ./onigrep "y$" -c
   cat /usr/share/dict/web2 | grep "y$" -c

- option -cl counts the matches per line
   example:
   onigrep "\w+" -cl -n
   How many words per line?

- you can write '(', ')', etc. in the regexp without backslashes,
unlike grep's default basic regular expressions
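
To make the diacritics and per-line counting features more concrete,
here is a small sketch in Python. It is not how onigrep is implemented
(onigrep is C on top of Oniguruma); it uses Python's own re and
unicodedata modules, and the function names are invented for this
example. It assumes the pattern contains one capture group, as in the
"a(.)n" example above.

```python
import re
import unicodedata


def search_ignoring_diacritics(pattern, text, flags=0):
    """Search with combining marks ignored; return the text captured by
    group 1, including its original combining marks, or None."""
    # Canonical decomposition: base letters followed by combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = []
    index_map = []  # position in stripped text -> position in decomposed
    for i, ch in enumerate(decomposed):
        if unicodedata.combining(ch) == 0:  # 0 means "not a combining mark"
            stripped.append(ch)
            index_map.append(i)
    index_map.append(len(decomposed))  # sentinel for end-of-text slices
    match = re.search(pattern, "".join(stripped), flags)
    if match is None:
        return None
    start, end = match.span(1)
    # Slice the decomposed text so the marks after the base letter survive.
    return decomposed[index_map[start]:index_map[end]]


def count_matches_per_line(pattern, text):
    """Count non-overlapping matches on each line (similar in spirit to -cl)."""
    return [len(re.findall(pattern, line)) for line in text.splitlines()]


if __name__ == "__main__":
    sample = "Ag\u0300nes"  # 'g' followed by U+0300 COMBINING GRAVE ACCENT
    # Plain search fails: '.' consumes 'g', then 'n' meets the accent.
    print(re.search("a(.)n", sample, re.IGNORECASE))
    # Ignoring combining marks finds the 'g' and returns it with its accent.
    print(ascii(search_ignoring_diacritics("a(.)n", sample, re.IGNORECASE)))
    print(count_matches_per_line(r"\w+", "one two\nthree\n"))  # [2, 1]
```

The index map is what lets the match be reported with its original
accents even though the search itself ran on the stripped text.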

Please note, onigrep is still a work in progress.

Many thanks in advance, and any feedback (suggestions, bugs, wishes)
is welcome!

Hans

-------------- next part --------------
A non-text attachment was scrubbed...
Name: onigrep
Type: application/octet-stream
Size: 23268 bytes
Desc: not available
URL: <http://lists.macromates.com/textmate/attachments/20070629/d7a8f852/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unorm
Type: application/octet-stream
Size: 27132 bytes
Desc: not available
URL: <http://lists.macromates.com/textmate/attachments/20070629/d7a8f852/attachment-0001.obj>
-------------- next part --------------


PS  onigrep and unorm will be available for free.
PPS One possible meaning of the Japanese word "Oniguruma" is "Devil's
wheel", like TextMate's icon ;)



