[TxMt] Word Count

Hans-Jörg Bibiko bibiko at eva.mpg.de
Tue May 27 23:13:50 UTC 2008


On 28.05.2008, at 00:50, Hans-Jörg Bibiko wrote:

> On 28.05.2008, at 00:34, Hans-Jörg Bibiko wrote:
>> On 27.05.2008, at 23:24, Mark Eli Kalderon wrote:
>>> On May 27, 2008, at 10:11 PM, Jonas Steverud wrote:
>>>> 20 maj 2008 kl. 20.07 skrev Patrick McElhaney:
>>>>> CTRL+SHIFT+N. It's in the "Text" bundle.
>>>>
>>>> One should make a note though that C-S-N doesn't return the  
>>>> number of characters, but the number of bytes. This is only an  
>>>> issue if you use multi-byte character, which is commonly enough  
>>>> to make the C-S-N command a bit broken IMHO.
>>
>> Firstly only a short answer to count Unicode characters:
>>
>> cat | ruby -e 'print STDIN.read.split(//u).size'
>>
>> input: selection or doc
>> output: Show Tooltip
>
> Maybe better:
>
> #!/usr/bin/ruby
>
> bytes=chars=words=lines=0
>
> STDIN.read.each_line { |l|
> 	lines+=1
> 	bytes+=l.split(//).size
> 	chars+=l.split(//u).size
> 	words+=l.split(/ +/).size
> }
>
> puts("Bytes:      #{bytes}")
> puts("Characters: #{chars}")
> puts("Words:      #{words}")
> puts("Lines:      #{lines}")

That script leaves three Unicode problems unsolved:
1) combining diacritics
E.g. é can be written either as the single code point U+00E9 or as
e followed by a combining acute accent (U+0065 U+0301).
This can be solved by ignoring combining diacritics when counting.

2) digraphs (n-grams)
Some single code points represent one phoneme but are written as
two letters.
E.g. ǳ U+01F3 (d + z)

3) ligatures
E.g. the ligature ﬁ U+FB01 (f + i)

2) and 3) could be solved by Unicode's compatibility decomposition
(NFKD).
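
To make the three cases concrete, here is a minimal Python 3 sketch, not
a finished command. It assumes that "character" means a code point left
after decomposition once combining marks are dropped; the function name
count_characters is made up for illustration.

```python
import unicodedata

def count_characters(text):
    """Count code points after NFKD, skipping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for non-combining code points
    return sum(1 for ch in decomposed if not unicodedata.combining(ch))

precomposed = "\u00e9"   # é as the single code point U+00E9
combined = "e\u0301"     # e followed by combining acute accent
digraph = "\u01f3"       # ǳ, decomposes to d + z
ligature = "\ufb01"      # ﬁ, decomposes to f + i

print(count_characters(precomposed), count_characters(combined))  # 1 1
print(count_characters(digraph), count_characters(ligature))      # 2 2
```

Both spellings of é count as one character, while the digraph and the
ligature each count as two letters.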

One could write such a script, but I guess Ruby can't do NFKD out of
the box; I mean one would have to install a separate library. It
should work with Python, though. Maybe I'll find a bit of time to
write such a script tomorrow, because it's late ;)
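
A rough sketch of how such a script might look in Python 3, under a few
assumptions: the input is UTF-8, words are split on any whitespace
(unlike the / +/ split in the Ruby version), and the name text_stats is
only illustrative.

```python
import unicodedata

def text_stats(text):
    """Return byte, character, word and line counts for a string."""
    # Decompose digraphs, ligatures and precomposed letters first.
    decomposed = unicodedata.normalize("NFKD", text)
    return {
        # Assumption: the document is stored as UTF-8.
        "bytes": len(text.encode("utf-8")),
        # Combining marks are ignored; line breaks are still counted,
        # as in the Ruby script.
        "chars": sum(1 for ch in decomposed
                     if not unicodedata.combining(ch)),
        "words": len(text.split()),
        "lines": len(text.splitlines()),
    }

stats = text_stats("caf\u00e9 \ufb01n\n")
print("Bytes:      %d" % stats["bytes"])
print("Characters: %d" % stats["chars"])
print("Words:      %d" % stats["words"])
print("Lines:      %d" % stats["lines"])
```

As a bundle command one would pass sys.stdin.read() to text_stats
instead of the sample string, with input set to selection or document
and output to Show Tooltip.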

Cheers,

--Hans



