Hello All,
I am writing a document in LaTeX and I would like to make a word count. I searched the online manuals but did not find any word count function... does it exist?
Thanks
Francois
Sorry Francois, I should have tested this first before sending it to you. I got the conditional construction wrong (don't know what I was thinking)---now you get a nice error message if there is no pdf in the directory. Best, Mark
On 25 Nov 2006, at 11:32, Mark Eli Kalderon wrote:
On Nov 29, 2006, at 6:29 PM, Michael Sheets wrote:
I assume he only wants to count *typeset* words, not latex commands.
Probably the easiest thing to do is make a command that runs detex on the document (which strips the commands out) and pipe that through the word count.
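The real pipeline would simply be `detex file.tex | wc -w`. For anyone without detex installed, here is a rough sed-based sketch of the same idea (the sample file is made up, and sed handles far fewer cases than detex actually does):

```shell
# Make a tiny sample document (illustrative only).
printf '%s\n' \
  '\documentclass{article}' \
  '\begin{document}' \
  'Hello world, this is a \emph{small} test.' \
  '\end{document}' > sample.tex

# Drop pure-command lines, strip remaining \commands and braces,
# then count what survives. detex does this much more carefully.
sed -e '/^\\/d' -e 's/\\[a-zA-Z]*//g' -e 's/[{}]//g' sample.tex | wc -w
# prints 7
```

The sed version over- and under-counts in ways detex does not (environments, math, \input files), so treat it purely as an illustration of the strip-then-count approach.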
On 11/29/06, Mark Eli Kalderon eli@markelikalderon.com wrote:
On Nov 29, 2006, at 7:33 PM, Tim Lahey wrote:
That assumes one is not using commands that automatically generate text. E.g., what happens if I do:
\newcommand{\this}{Hi there! }
and then the document contains:
\this\this\this
I would guess detex will not count those 6 words, will it?
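To make this worry concrete: any source-stripping approach sees zero words here, while the typeset output contains six. A quick sketch, using sed as a stand-in for the command-stripping step:

```shell
# \this expands to "Hi there! " at typesetting time, so the typeset
# output has six words -- but stripping the source leaves nothing.
printf '%s\n' '\this\this\this' | sed 's/\\[a-zA-Z]*//g' | wc -w
# prints 0
```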
Haris
No,
I just checked and you're right. Plus, some things are still there that one wouldn't want. I did however find,
http://catdvi.sourceforge.net/
which is a DVI to plaintext translator. I don't know how well it works though.
Cheers,
Tim.
On 11/29/06, Charilaos Skiadas skiadas@hanover.edu wrote:
Haris wrote...
I would guess detex will not count those 6 words, will it?
No: but the alternative (typesetting and counting using something like ps2ascii) is unworkably slow on large documents, so it's going to depend on what sort of accuracy you're seeking (and what each method considers to be a "word"). It'd be nice to find something fast-ish and not-too-inaccurate, but I'm not holding my breath!
Re: speed. I just tried on my brother's thesis, a 250-page lump of French that I typeset in LaTeX. Going the ps2ascii route (i.e., using the Word Count command Mark posted) took about 60 seconds on my slow-but-serviceable eMac, with ye olde spinning wheel in the meantime. Not nice. The detex route was pretty much instant. Word counts differ by about 1-2%, so "choose your poison". In particular, test what comes out the other end and see if it matches your definition of "word". That is, try a few simple documents and use the commands without the pipe to "wc -w" on the end.
Cheers, Paul
I guess a couple of commands could be added to the LaTeX bundle. One would give a word count "estimate" using detex and wc while the other would give a more accurate count using ps2ascii and wc.
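A sketch of what that pair of commands might look like as a plain shell script (detex, ps2ascii, and the .tex-to-.pdf naming convention are all assumptions here; neither tool ships with the OS):

```shell
# Hypothetical two-mode word-count script: the first count uses detex
# (fast, low estimate); the second extracts text from the typeset PDF
# via ps2ascii (slow, closer to the typeset word count).
cat > wordcount.sh <<'EOF'
#!/bin/sh
tex="$1"
# Fast, low estimate: strip LaTeX commands from the source.
detex "$tex" | wc -w
# Slower, closer to the typeset text: needs the PDF to exist.
pdf="${tex%.tex}.pdf"
if [ -f "$pdf" ]; then
    ps2ascii "$pdf" | wc -w
else
    echo "No PDF found for $tex -- typeset first." >&2
fi
EOF
chmod +x wordcount.sh
```

In a TextMate bundle these would naturally be two separate commands rather than one script, but the pipelines would be the same.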
On 11/29/06, Mark Eli Kalderon eli@markelikalderon.com wrote:
Mark Eli Kalderon wrote:
Indeed: but all that work has already been done, as the command is operating on the pdf file. So the real question becomes: why is ps2ascii (aka ghostscript) so slow? (Just checked on my work machine, a 2 GHz Intel iMac with 2 GB of memory, and it's still about 30 seconds on a 250 page, 1.2MB pdf file.)
I guess it's just an irreducibly difficult procedure... Moral of this story? Don't count words very often!
Cheers, Paul
On Nov 30, 2006, at 1:23 AM, Paul McCann wrote:
I guess the question is how you are going to get the words out of the pdf or ps file. If you look at the pdf/ps source file, it is filled with special commands and things. I suppose if you could export the pdf file to a txt file, then you could count the words there with ease. Otherwise, we are talking about parsing what seems to me to be code even more complicated than LaTeX. You are better off with the small error from counting words in the latex source instead.
Unless I am much mistaken.
I guess it's just an irreducibly difficult procedure... Moral of this story? Don't count words very often!
Or ever, I would say. Why is it important how many words you have? I guess some things have word limits, but surely this is not a check you would have to do too often.
Cheers, Paul
Haris
Hi again,
Believe me, I've had a good look inside pdf files. Not somewhere I'd like to live! Re exporting: nothing obvious rears its head here, and Preview only allows you to export to graphical formats from pdf.
You are better off with the small error from counting words in the latex source instead.
Agreed: that's definitely what I would do if faced with such a task, so a simple command along those lines would suffice. Thankfully I've rarely encountered a task in which the exact number of words is going to be crucial, but I imagine some essays/projects might have strict limits. Still, if you submit a pdf file the assessor is going to have to find a way to determine the word count!
Just for completeness' sake I tried Adobe's online pdf conversion tool on the same file I mentioned earlier (250 page pdf).
http://www.adobe.com/products/acrobat/access_onlinetools.html
It's been literally 5 minutes now and the thing is still spinning its wheels ("Please wait while the requested PDF file is being converted"), so I don't think that's going to come charging in to rescue the situation!
Cheers, Paul
On 30 Nov 2006, at 06:23, Paul McCann wrote:
Sorry, Paul, I thought you were gesturing toward something that would work directly on the LaTeX source.
Perhaps there are faster ways to extract text from a PDF, but I don't know what they would be. There are Java libraries for doing this (I think it's called jPDFText or something close to it). But as I know no Java I can't write and test such a script against ps2ascii. Now if only the Ruby PDF::Reader module (which was meant to have text extraction capabilities) would materialize...
Accurate word counts may not matter to all but they do matter to some. I am constantly working against editorially imposed word limits. Most journals in the Humanities have word limits that are strictly enforced---unless you're famous. ;) But I do take your point that for many use cases ps2ascii is way too slow.
Best, Mark
I rewrote the command with pdftotext, an open source text extraction utility that is included with xpdf.[1] It is *significantly* faster than the command using ps2ascii. However, it returns a different word count from the previous command. I compared the text generated by both ps2ascii and pdftotext. Neither will be the basis of an accurate word count. ps2ascii includes page numbers and inserts whitespace between the two parts of hyphenated words. pdftotext tends to substitute whitespace for ligatures. There are other anomalies with each as well. Based on a small amount of experimentation with some of my documents, it seems that pdftotext gives a slightly better estimate. So the situation is this: with detex you get a low estimate and with pdf text extraction you get a high and slightly more accurate estimate. Disappointing, but perhaps there is a better way. All the best, Mark
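For reference, the heart of such a command is just pdftotext with "-" as the output file, which sends the extracted text to stdout. A hedged sketch (the script name and error wording are made up; it assumes the xpdf tools are installed, and the no-PDF check mirrors the error message mentioned earlier in the thread):

```shell
cat > pdfwc.sh <<'EOF'
#!/bin/sh
pdf="$1"
if [ ! -f "$pdf" ]; then
    echo "No PDF found: $pdf -- typeset the document first." >&2
    exit 1
fi
# "-" writes the extracted text to stdout rather than to a .txt file.
pdftotext "$pdf" - | wc -w
EOF
```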
[1] Mac OS X binary for xpdf (xpdf-tools-3.dmg) can be found here: http://users.phg-online.de/tk/MOSXS/