Re: [TxMt] Word Count

28 May 2008


      On 28.05.2008, at 00:50, Hans-Jörg Bibiko wrote:
...
On 28.05.2008, at 00:34, Hans-Jörg Bibiko wrote:
...
On 27.05.2008, at 23:24, Mark Eli Kalderon wrote:
...
On May 27, 2008, at 10:11 PM, Jonas Steverud wrote:
...
On 20 May 2008, at 20:07, Patrick McElhaney wrote:
...
CTRL+SHIFT+N. It's in the "Text" bundle.
One should note, though, that C-S-N doesn't return the
number of characters but the number of bytes. This is only an
issue if you use multi-byte characters, which is common enough
to make the C-S-N command a bit broken IMHO.
First, just a short answer for counting Unicode characters:
cat | ruby -e 'print STDIN.read.split(//u).size'
input: selection or doc
output: Show Tooltip
Maybe better:
#!/usr/bin/ruby
bytes=chars=words=lines=0
STDIN.read.each_line { |l|
   lines+=1
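   bytes+=l.bytesize            # raw bytes; split(//) yields characters on Ruby 1.9+
   chars+=l.split(//u).size     # UTF-8 aware split: one element per code point
   words+=l.split.size          # split on any whitespace; blank lines add 0 words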
}
puts("Bytes:      #{bytes}")
puts("Characters: #{chars}")
puts("Words:      #{words}")
puts("Lines:      #{lines}")
Three Unicode problems aren't solved with that script.
1) combining diacritics
E.g. é can be written either as the single code point U+00E9 or as
e + combining ´ (U+0065 U+0301).
This can be solved by ignoring these combining diacritics.
2) n-grams
Some glyphs represent one phoneme but are written as two
characters.
E.g. ǳ U+01F3 (dz)
3) ligatures
E.g. the ligature ﬁ (fi)
2) and 3) could be solved by Unicode's compatibility decomposition, NFKD.
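For problem 1), ignoring the combining diacritics could look roughly like the sketch below. It assumes a Ruby version with Unicode-aware regular expressions (1.9 or later) and UTF-8 input. The helper name count_base_chars is made up for illustration and is not part of the Text bundle.

```ruby
# Sketch: count characters while ignoring combining marks.
# count_base_chars is a hypothetical helper name.
def count_base_chars(text)
  # \p{M} matches the Unicode "Mark" categories (Mn, Mc, Me),
  # which include combining diacritics such as U+0301.
  text.gsub(/\p{M}/, '').length
end

precomposed = "\u00E9"    # é as a single code point
decomposed  = "e\u0301"   # e followed by a combining acute accent

puts precomposed.length             # 1
puts decomposed.length              # 2
puts count_base_chars(precomposed)  # 1
puts count_base_chars(decomposed)   # 1
```

Note that dropping every mark is a blunt policy: in scripts where marks carry vowels, it will undercount what a reader would call characters.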
One could write a script that applies NFKD, but I guess Ruby is not
able to do NFKD out of the box; I mean one has to install a separate
library. It should work with Python, though. Maybe I'll find a bit of
time to write such a script tomorrow, because it's late ;)
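As a rough sketch of such a script: Ruby 2.2 and later ship String#unicode_normalize, so on those versions no separate library is needed. The helper name count_nfkd_chars is made up for illustration, and stripping all combining marks after decomposition is just one possible policy. The input is assumed to be UTF-8.

```ruby
# Sketch: count characters after NFKD, ignoring combining marks.
# Assumes Ruby 2.2 or newer, where String#unicode_normalize is built in.
# count_nfkd_chars is a hypothetical helper name.
def count_nfkd_chars(text)
  # NFKD splits compatibility characters (ǳ becomes "dz", ﬁ becomes "fi")
  # and separates base letters from their combining marks.
  decomposed = text.unicode_normalize(:nfkd)
  # Drop the combining marks so "é" counts once in either spelling.
  decomposed.gsub(/\p{M}/, '').length
end

puts count_nfkd_chars("\u01F3")   # ǳ  => 2
puts count_nfkd_chars("\uFB01")   # ﬁ  => 2
puts count_nfkd_chars("\u00E9")   # é  => 1
puts count_nfkd_chars("e\u0301")  # e + combining acute => 1
```

With this policy the digraph and the ligature each count as two characters, the same as if they had been typed as separate letters.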
Cheers,
--Hans
