[TxMt] Word Count

Hans-Jörg Bibiko bibiko at eva.mpg.de
Wed May 28 17:49:08 UTC 2008


On 28.05.2008, at 19:38, Hans-Jörg Bibiko wrote:

>
> On 28.05.2008, at 19:30, Allan Odgaard wrote:
>
>> On 28 May 2008, at 18:55, Jonas Steverud wrote:
>>
>>> [...]
>>> Yes, but I am not interested in the number of bytes, I would like  
>>> to know the number of characters, which is not the same thing.  
>>> Räksmörgås is ten characters but is reported as 13 bytes since  
>>> åäö are stored as multi-byte characters. I use the Statistics for  
>>> Document  / Selection (word count) command from the Text Bundle  
>>> and the ruby script uses wc -l for statistics, which is not  
>>> Unicode aware AFAIK.
>>
>> Actually ‘wc’ _is_ multi-byte (encoding) aware. But for that, one  
>> has to use -m (count characters) instead of -c (count bytes).
>>
>> So for a quick fix, change ‘wc -lwc’ in the command to ‘wc -lwm’  
>> and it should work as you expect.
>
> Not for me on Mac ppc 10.4.11.

Oops. Of course, one has to set LC_ALL in the Ruby script, so that wc  
runs in a UTF-8 locale and -m counts characters rather than bytes.

In the bundle command 'Statistics for Document / Selection (Word Count)'  
one should write:

...
counts = `export LC_ALL=en_GB.UTF-8;wc -lwm`.scan(/\d+/)
...
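
As a cross-check on the byte/character difference discussed above, here is a minimal sketch that does the counting in pure Ruby, so the result does not depend on LC_ALL at all. It assumes Ruby 1.9 or later (where String#length counts characters, not bytes) and UTF-8 input; text_stats is a hypothetical helper of mine, not something that exists in the Text bundle.

```ruby
# Hypothetical helper, not part of the Text bundle: count lines, words,
# characters and bytes without shelling out to wc.
# Assumes Ruby >= 1.9 and a UTF-8 encoded string.
def text_stats(text)
  {
    :lines => text.count("\n"),  # newline characters, like wc -l
    :words => text.split.size,   # whitespace-separated words, like wc -w
    :chars => text.length,       # characters, like wc -m in a UTF-8 locale
    :bytes => text.bytesize      # bytes, like wc -c
  }
end

# "Räksmörgås" followed by a newline, written with escapes so the
# example does not depend on the encoding of the source file.
sample = "R\u00E4ksm\u00F6rg\u00E5s\n"
stats = text_stats(sample)
puts "lines=#{stats[:lines]} words=#{stats[:words]} " \
     "chars=#{stats[:chars]} bytes=#{stats[:bytes]}"
```

Without the trailing newline the word is 10 characters and 13 bytes, which matches the numbers Jonas reported.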


--Hans

