[TxMt] Word Count
Hans-Jörg Bibiko
bibiko at eva.mpg.de
Tue May 27 23:13:50 UTC 2008
On 28.05.2008, at 00:50, Hans-Jörg Bibiko wrote:
> On 28.05.2008, at 00:34, Hans-Jörg Bibiko wrote:
>> On 27.05.2008, at 23:24, Mark Eli Kalderon wrote:
>>> On May 27, 2008, at 10:11 PM, Jonas Steverud wrote:
>>>> 20 maj 2008 kl. 20.07 skrev Patrick McElhaney:
>>>>> CTRL+SHIFT+N. It's in the "Text" bundle.
>>>>
>>>> One should make a note though that C-S-N doesn't return the
>>>> number of characters, but the number of bytes. This is only an
>>>> issue if you use multi-byte characters, which is common enough
>>>> to make the C-S-N command a bit broken IMHO.
>>
>> First, just a short answer for counting Unicode characters:
>>
>> cat | ruby -e 'print STDIN.read.split(//u).size'
>>
>> input: selection or doc
>> output: Show Tooltip
>
> Maybe better:
>
> #!/usr/bin/ruby
>
> bytes=chars=words=lines=0
>
> STDIN.read.each_line { |l|
> lines+=1
> bytes+=l.bytesize           # raw bytes, independent of encoding
> chars+=l.split(//u).size    # Unicode code points
> words+=l.split.size         # split on any whitespace; an empty line adds 0
> }
>
> puts("Bytes: #{bytes}")
> puts("Characters: #{chars}")
> puts("Words: #{words}")
> puts("Lines: #{lines}")
That script still leaves three Unicode problems unsolved:
1) combining diacritics
E.g. é can be written as the single code point U+00E9 or as e +
combining ´ (U+0065 U+0301).
This can be solved by ignoring the combining diacritics when counting.
2) n-grams
There are some glyphs which represent one phoneme but are
written as two characters.
E.g. ǳ U+01F3 (dz)
3) ligatures
E.g. the ligature ﬁ U+FB01 (fi)
2) and 3) could be solved by Unicode's compatibility decomposition
(NFKD), which maps such glyphs to their component letters.
One could write such a script, but I guess Ruby can't do NFKD out of
the box, I mean one has to install a separate library. But it should
work with Python. Maybe I'll find a bit of time to write such a script
tomorrow, because it's late ;)
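The Python route could look roughly like this. It is only a sketch,
assuming Python 3 and its standard unicodedata module; the function
name count_characters is made up for illustration. It applies NFKD and
then skips every code point with a non-zero combining class, so é
counts as one character whether it is precomposed or not, while ǳ and
ﬁ each count as two.

```python
import unicodedata


def count_characters(text: str) -> int:
    """Count characters after NFKD, ignoring combining marks.

    Hypothetical helper, not part of TextMate or any bundle.
    """
    # NFKD splits precomposed letters (é -> e + U+0301) and maps
    # compatibility glyphs to plain letters (ǳ -> d z, ﬁ -> f i).
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns the canonical combining class; 0 means the
    # code point is not treated as a combining mark here.
    return sum(1 for ch in decomposed if unicodedata.combining(ch) == 0)
```

For a command one would presumably pass it sys.stdin.read(). Marks
whose combining class is 0 (some Indic vowel signs, for instance) are
still counted, so this is an approximation and not a full grapheme
count.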
Cheers,
--Hans