Word Count

Marc Chanliau

20 May 2008

7:49 p.m.

I have a page of text edited in TextMate. I want to know the number of characters in a specific paragraph (by highlighting that paragraph). Is this possible in TextMate and, if so, how? Thanks in advance.

Patrick McElhaney

20 May

8:07 p.m.

CTRL+SHIFT+N. It's in the "Text" bundle.

For new threads USE THIS: textmate@lists.macromates.com (threading gets destroyed and the universe will collapse if you don't) http://lists.macromates.com/mailman/listinfo/textmate

-- Patrick McElhaney 704.560.9117

Marc Chanliau

11:42 p.m.

Great! Thanks for the fast response.

Jonas Steverud

27 May

11:11 p.m.

On 20 May 2008, at 20:07, Patrick McElhaney wrote:

CTRL+SHIFT+N. It's in the "Text" bundle.

One should note, though, that C-S-N doesn't return the number of characters but the number of bytes. This only matters if you use multi-byte characters, which is common enough to make the C-S-N command a bit broken IMHO.

I would be very grateful if anyone could point to a function that does the equivalent of C-S-N but returns the proper number of characters and not bytes (the ideal would be "full" statistics: words, characters and bytes). I made a quick hack but realised that I did not know how to tell Perl what the character encoding was, i.e. whether it was UTF-8 or Latin-1.

Thanks.

/Jonas

Mark Eli Kalderon

11:24 p.m.

The command in the Text bundle does report the full statistics: lines, words, bytes. Perhaps you are using a modified word count command that uses the same key binding.

Best, Mark

Hans-Jörg Bibiko

28 May

12:34 a.m.

First, just a short answer for counting Unicode characters:

cat | ruby -e 'print STDIN.read.split(//u).size'

Input: selection or document
Output: Show Tooltip

--Hans

Hans-Jörg Bibiko

12:50 a.m.

Maybe better:

#!/usr/bin/ruby

bytes=chars=words=lines=0

STDIN.read.each_line { |l|
  lines+=1
  bytes+=l.split(//).size
  chars+=l.split(//u).size
  words+=l.split(/ +/).size
}

puts("Bytes: #{bytes}")
puts("Characters: #{chars}")
puts("Words: #{words}")
puts("Lines: #{lines}")

One could make the output much prettier ;)
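
For readers on a current Ruby (1.9 or later), where every string carries an encoding, the same statistics can be sketched without the regex-split trick. This is only an illustration, not the bundle's actual command: the method name text_stats is made up, and the input is assumed to be UTF-8.

```ruby
# Hypothetical helper (not part of the Text bundle): counts bytes,
# characters (Unicode code points), words and lines in a string.
# Assumes the text is UTF-8 encoded.
def text_stats(text)
  text = text.dup.force_encoding(Encoding::UTF_8)
  {
    bytes: text.bytesize,        # raw storage size
    chars: text.length,          # code points, not bytes
    words: text.split.size,      # runs of non-whitespace characters
    lines: text.each_line.count  # a final line without "\n" still counts
  }
end

sample = "R\u00E4ksm\u00F6rg\u00E5s \u00E4r gott\nfi\n"
text_stats(sample).each { |name, value| puts "#{name.capitalize}: #{value}" }
```

In a TextMate command one would presumably call text_stats(STDIN.read) instead of using the sample string.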

--Hans

Hans-Jörg Bibiko

1:13 a.m.

Three Unicode problems aren't solved by that script:

1) Combining diacritics. E.g. é can be written as the single code point U+00E9 or as e + combining ´ (U+0065 U+0301). This can be solved by ignoring these combining diacritics.

2) n-grams. Some glyphs represent one phoneme but are written as two characters, e.g. ǳ U+01F3 (dz).

3) Ligatures, e.g. the ligature ﬁ (fi).

2) and 3) could be solved by Unicode's compatibility decomposition, NFKD.

One could write such a script, but I guess Ruby is not able to do NFKD; I mean, one has to install a separate library. But it should work with Python. Maybe I will find a bit of time to write such a script tomorrow, because it's late ;)
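
A note for later readers: since Ruby 2.2 the core String class has unicode_normalize, and since Ruby 2.5 it has each_grapheme_cluster, so a separate library is no longer needed. The sketch below illustrates the three cases listed above. The method names are invented for the example, and folding everything through NFKD and then dropping combining marks is only one possible counting policy.

```ruby
# Count user-perceived characters: a base letter plus its combining
# marks forms one grapheme cluster, whether precomposed or not.
def grapheme_count(text)
  text.each_grapheme_cluster.count
end

# Expand compatibility characters (digraphs, ligatures) via NFKD, then
# drop combining marks (category Mn) so only base characters remain.
def base_char_count(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "").length
end

decomposed = "e\u0301"                 # e + combining acute accent

puts decomposed.length                 # code points in the decomposed form
puts grapheme_count(decomposed)        # grapheme clusters
puts base_char_count("\u00E9")         # precomposed é
puts base_char_count("\u01F3")         # ǳ digraph
puts base_char_count("\uFB01")         # ﬁ ligature
```

Whether ǳ or ﬁ should count as one character or two depends on what the count is for, so both helpers are worth keeping side by side.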

Cheers,

--Hans

Jonas Steverud

6:55 p.m.

On 27 May 2008, at 23:24, Mark Eli Kalderon wrote:

The command in the Text bundle does report the full statistics: lines, words, bytes. Perhaps you are using a modified word count command that uses the same key binding.

Yes, but I am not interested in the number of bytes; I would like to know the number of characters, which is not the same thing. Räksmörgås is ten characters but is reported as 13 bytes, since å, ä and ö are stored as multi-byte characters. I use the Statistics for Document / Selection (Word Count) command from the Text bundle, and the Ruby script uses wc -l for its statistics, which is not Unicode-aware AFAIK.
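
The difference is easy to reproduce. A minimal sketch for a current Ruby (1.9 or later), assuming UTF-8 text; the method name is made up and this is not what the bundle command runs:

```ruby
# Returns [characters, bytes] for a UTF-8 string.
def char_and_byte_count(text)
  [text.length, text.bytesize]
end

word = "R\u00E4ksm\u00F6rg\u00E5s"   # "Räksmörgås" with precomposed letters

chars, bytes = char_and_byte_count(word)
puts "#{chars} characters, #{bytes} bytes"

# List the letters that need more than one byte in UTF-8.
word.each_char do |c|
  puts "#{c}: #{c.bytesize} bytes" if c.bytesize > 1
end
```

Each of the three non-ASCII letters takes two bytes in UTF-8, which accounts for the three extra bytes.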

/Jonas

Allan Odgaard

7:30 p.m.

Actually, ‘wc’ _is_ multi-byte (encoding) aware. But for that, one has to use -m (count characters, honouring multi-byte encodings) instead of -c (count bytes).

So for a quick fix, change ‘wc -lwc’ in the command to ‘wc -lwm’ and it should work as you expect.

Of course, this does not handle all the complex issues of precomposed/decomposed Unicode, diacritics, and ligatures that Hans-Joerg mentioned.

Hans-Jörg Bibiko

7:38 p.m.

Not for me on Mac ppc 10.4.11.

--Hans

Hans-Jörg Bibiko

7:49 p.m.

Oops. Of course, one has to set LC_ALL in the Ruby script.

In the bundle command 'Statistics for Document / Selection (Word Count)' one should write:

...
counts = `export LC_ALL=en_GB.UTF-8;wc -lwm`.scan(/\d+/)
...
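
An alternative sketch that avoids the shell export is to hand the variable to the child process directly through IO.popen. The method name wc_counts and the default locale are assumptions for illustration; the locale must be installed on the system, otherwise wc will most likely fall back to counting bytes.

```ruby
# Run `wc -lwm` on the given text with an explicit locale and return
# [lines, words, characters] as integers.
# The default locale name is an assumption; adjust it to one that exists.
def wc_counts(text, locale = "en_GB.UTF-8")
  output = IO.popen({ "LC_ALL" => locale }, ["wc", "-lwm"], "r+") do |io|
    io.write(text)
    io.close_write   # signal end of input so wc can finish
    io.read
  end
  output.scan(/\d+/).map(&:to_i)
end

lines, words, chars = wc_counts("R\u00E4ksm\u00F6rg\u00E5s \u00E4r gott\n")
puts "Lines: #{lines}, words: #{words}, characters: #{chars}"
```

Passing the environment as a hash affects only that one child process and leaves the rest of the script's environment untouched.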

--Hans

Allan Odgaard

7:59 p.m.

TextMate sets LC_CTYPE for the programs it executes.

So on a normal system it should not be necessary to set this up in the script. However, other locale variables take precedence over LC_CTYPE, so most likely you have another one set (to something other than UTF-8).
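
For reference, in the POSIX locale model the variable that outranks LC_CTYPE is LC_ALL, while LANG is only a fallback. The lookup can be sketched as follows; effective_ctype is a made-up helper name, and real C libraries do this internally in setlocale.

```ruby
# Returns the locale name that governs character classification (LC_CTYPE)
# for a given environment, following the POSIX order: LC_ALL, then
# LC_CTYPE, then LANG. Empty values count as unset.
def effective_ctype(env = ENV)
  %w[LC_ALL LC_CTYPE LANG].each do |name|
    value = env[name]
    return value unless value.nil? || value.empty?
  end
  "C"   # typical default when nothing is set
end

puts effective_ctype({ "LC_CTYPE" => "en_US.UTF-8", "LANG" => "C" })
puts effective_ctype({ "LC_ALL" => "C", "LC_CTYPE" => "en_US.UTF-8" })
```

Calling effective_ctype with no argument inspects the environment of the running script, which is a quick way to check what a command launched from it would see.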

Hans-Jörg Bibiko

8:20 p.m.

On 28.05.2008, at 19:59, Allan Odgaard wrote:

TextMate sets LC_CTYPE for the programs it executes.

Yes, I know, but I do not know whether a system command that Ruby itself starts via `` inherits the locale settings.

...

So on a normal system it should not be necessary to set this up in the script. However, other locale variables take precedence over LC_CTYPE, so most likely you have another one set (to something other than UTF-8).

Actually no.

--Hans

Hans-Jörg Bibiko

8:32 p.m.

If I write a tmcommand with:

- a shell script:

env | sort

I see LC_CTYPE.

- a Ruby/Perl script:

#!/usr/bin/ruby
print `env|sort`

I do not see LC_CTYPE.

Can you confirm this behaviour? Or am I wrong?

--Hans

Allan Odgaard

9:34 p.m.

I actually thought I set LC_CTYPE in code, but it turns out it is only set in Support/lib/bash_init.sh.

It worked for me because I also set LC_CTYPE in ~/.MacOSX/environment.plist.

So yes, for this to work, the Ruby script will have to set the encoding variable.
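
Setting the variable inside the Ruby script can be as small as one assignment, because commands started with backticks inherit Ruby's ENV. A minimal sketch; the locale name is an assumption and child_locale_vars is a made-up helper that just shows what the child process sees:

```ruby
# Child processes started with backticks inherit Ruby's ENV, so setting
# the variable once at the top of the script is enough.
# The locale name is an assumption; use one that exists on your system.
ENV["LC_CTYPE"] = "en_US.UTF-8"

# Returns the LC_* variables as seen by a child process.
def child_locale_vars
  `env`.lines.grep(/\ALC_/).map(&:chomp).sort
end

puts child_locale_vars
```

If LC_ALL is also set to something else in the environment, it will still win over LC_CTYPE, so it may be necessary to override that one instead.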
