On 28.05.2008, at 00:50, Hans-Jörg Bibiko wrote:
On 28.05.2008, at 00:34, Hans-Jörg Bibiko wrote:
On 27.05.2008, at 23:24, Mark Eli Kalderon wrote:
On May 27, 2008, at 10:11 PM, Jonas Steverud wrote:
On 20 May 2008, at 20:07, Patrick McElhaney wrote:
CTRL+SHIFT+N. It's in the "Text" bundle.
One should note, though, that C-S-N doesn't return the number of characters but the number of bytes. This is only an issue if you use multi-byte characters, which is common enough to make the C-S-N command a bit broken IMHO.
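To illustrate the difference between bytes and characters mentioned above, here is a minimal sketch in Python. The sample string is only an assumption for illustration; any text containing a multi-byte UTF-8 character shows the same gap.

```python
# Hypothetical sample text: "naïve", with ï written as the single code point U+00EF.
text = "na\u00efve"

chars = len(text)                   # number of Unicode code points
octets = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8

print("Characters:", chars)  # Characters: 5
print("Bytes:", octets)      # Bytes: 6
```

The ï takes two bytes in UTF-8, so a byte-based count reports one more than the number of characters.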
First, just a short answer on how to count Unicode characters:
cat | ruby -e 'print STDIN.read.split(//u).size'
Input: selection or document
Output: Show Tooltip
Maybe better:
#!/usr/bin/ruby
bytes = chars = words = lines = 0
STDIN.read.each_line { |l|
  lines += 1
  bytes += l.bytesize          # raw bytes (String#bytesize, Ruby 1.8.7 or later)
  chars += l.split(//u).size   # UTF-8 code points
  words += l.split.size        # split on any whitespace; blank lines add 0
}
puts("Bytes: #{bytes}")
puts("Characters: #{chars}")
puts("Words: #{words}")
puts("Lines: #{lines}")
Three Unicode problems aren't solved by that script:
1) Combining diacritics: e.g. é can be written as the single code point U+00E9 or as e + combining ´ (U+0065 U+0301). This can be solved by ignoring these combining diacritics.
2) n-grams: some glyphs represent one phoneme but are written as two characters, e.g. ǳ U+01F3 (dz).
3) Ligatures: e.g. the ligature ﬁ (fi).
2) and 3) could be solved by Unicode's compatibility decomposition, NFKD.
One could write such a script, but I guess Ruby is not able to do NFKD out of the box; I mean one has to install a separate library. It should work with Python, though. Maybe I'll find a bit of time to write such a script tomorrow; it's late now ;)
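A rough sketch of what such a Python script might look like, using the standard `unicodedata` module. The function name `count_characters` is hypothetical, and skipping every code point with a non-zero combining class is just one way to "ignore combining diacritics"; this is an illustration under those assumptions, not a finished command.

```python
import unicodedata


def count_characters(text):
    """Count code points after NFKD, ignoring combining marks.

    Hypothetical helper: NFKD splits precomposed letters, digraphs and
    ligatures into their parts; combining marks are then skipped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return sum(1 for ch in decomposed if not unicodedata.combining(ch))


print(count_characters("\u00e9"))   # precomposed é      -> 1
print(count_characters("e\u0301"))  # e + combining ´    -> 1
print(count_characters("\u01f3"))   # digraph U+01F3     -> 2
print(count_characters("\ufb01"))   # fi ligature U+FB01 -> 2
```

With this approach both spellings of é count as one character, while the dz digraph and the fi ligature each count as two, since NFKD expands them to their component letters.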
Cheers,
--Hans