[TxMt] Word Count

Allan Odgaard throw-away-2 at macromates.com
Wed May 28 17:30:08 UTC 2008


On 28 May 2008, at 18:55, Jonas Steverud wrote:

> [...]
> Yes, but I am not interested in the number of bytes, I would like to  
> know the number of characters, which is not the same thing.  
> Räksmörgås is ten characters but is reported as 13 bytes since åäö  
> are stored as multi-byte characters. I use the Statistics for  
> Document  / Selection (word count) command from the Text Bundle and  
> the ruby script uses wc -l for statistics, which is not Unicode  
> aware AFAIK.

Actually ‘wc’ _is_ multi-byte (encoding) aware. But for that, one has  
to use -m (which counts characters) instead of -c (which counts bytes).

So for a quick fix, change ‘wc -lwc’ in the command to ‘wc -lwm’ and  
it should work as you expect.
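
To illustrate the byte/character difference outside of ‘wc’, here is a minimal sketch in Python (not the Ruby script the bundle command actually uses). The variable names are made up for the example, and the comparison with ‘wc -m’ assumes a UTF-8 locale.

```python
# Precomposed spelling written with escapes, so the normalization of this
# source file cannot change the result.
word = "R\u00e4ksm\u00f6rg\u00e5s"  # Räksmörgås

# Bytes in the UTF-8 encoding: this is what 'wc -c' counts.
byte_count = len(word.encode("utf-8"))

# Unicode code points: this is what 'wc -m' counts (assuming a UTF-8 locale).
char_count = len(word)

print(byte_count, char_count)  # prints: 13 10
```

Note that ‘wc’ also counts a trailing newline if the input has one, so its numbers can be one higher than these.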

Of course this does not handle all the complex issues of precomposed  
vs. decomposed Unicode, diacritics, and ligatures that Hans-Joerg mentioned.
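
As a rough illustration of why counting code points is not the whole story, the following Python sketch compares the precomposed and decomposed forms of the same word. The helper count_user_characters is hypothetical and only an approximation: it normalizes to NFC and skips any remaining combining marks, which is not full grapheme-cluster segmentation.

```python
import unicodedata

composed = "R\u00e4ksm\u00f6rg\u00e5s"                # a-umlaut etc. as single code points
decomposed = unicodedata.normalize("NFD", composed)  # base letter + combining mark

def count_user_characters(text):
    """Approximate count of user-perceived characters (hypothetical helper).

    Normalizes to NFC, then ignores combining marks that could not be
    composed. This is a simplification, not a full grapheme segmentation.
    """
    nfc = unicodedata.normalize("NFC", text)
    return sum(1 for ch in nfc if not unicodedata.combining(ch))

print(len(composed))                      # 10 code points
print(len(decomposed))                    # 13 code points for the same visible text
print(count_user_characters(decomposed))  # 10

# Ligatures: U+FB01 (the "fi" ligature) is one code point, but two letters
# after compatibility normalization (NFKC).
ligature = "\ufb01"
print(len(ligature), len(unicodedata.normalize("NFKC", ligature)))  # 1 2
```

Whether a ligature should count as one character or two depends on what the word count is for, so the sketch only shows the difference and does not pick one.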



