Hello All,
I am writing a document in LaTeX and I would like to make a word count. I searched the online manuals but did not find any word count function... does it exist?
Thanks
Francois
Sorry Francois, I should have tested this first before sending it to you. I got the conditional construction wrong (don't know what I was thinking)---now you get a nice error message if there is no pdf in the directory. Best, Mark
On 25 Nov 2006, at 11:32, Mark Eli Kalderon wrote:
On Nov 29, 2006, at 6:29 PM, Michael Sheets wrote:
I assume he only wants to count *typeset* words, not latex commands.
Probably the easiest thing to do is make a command that runs detex on the document (which strips the commands out) and pipe that through the word count.
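The real pipeline would simply be `detex file.tex | wc -w`. For anyone without detex installed, here is a rough sed-based sketch of the same idea (the sample file is made up, and sed handles far fewer cases than detex actually does):

```shell
# Make a tiny sample document (illustrative only).
printf '%s\n' \
  '\documentclass{article}' \
  '\begin{document}' \
  'Hello world, this is a \emph{small} test.' \
  '\end{document}' > sample.tex

# Drop pure-command lines, strip remaining \commands and braces,
# then count what survives. detex does this much more carefully.
sed -e '/^\\/d' -e 's/\\[a-zA-Z]*//g' -e 's/[{}]//g' sample.tex | wc -w
# prints 7
```

The sed version over- and under-counts in ways detex does not (environments, math, \input files), so treat it purely as an illustration of the strip-then-count approach.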
On 11/29/06, Mark Eli Kalderon eli@markelikalderon.com wrote:
On Nov 29, 2006, at 7:33 PM, Tim Lahey wrote:
That assumes one is not using commands that automatically generate text. E.g., what happens if I do:
\newcommand{\this}{Hi there! }
and then the document contains:
\this\this\this
I would guess detex will not count those 6 words, will it?
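To make this worry concrete: any source-stripping approach sees zero words here, while the typeset output contains six. A quick sketch, using sed as a stand-in for the command-stripping step:

```shell
# \this expands to "Hi there! " at typesetting time, so the typeset
# output has six words -- but stripping the source leaves nothing.
printf '%s\n' '\this\this\this' | sed 's/\\[a-zA-Z]*//g' | wc -w
# prints 0
```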
Haris
No,
I just checked and you're right. Plus, some things are still there that one wouldn't want. I did however find,
http://catdvi.sourceforge.net/
which is a DVI to plaintext translator. I don't know how well it works though.
Cheers,
Tim.
On 11/29/06, Charilaos Skiadas skiadas@hanover.edu wrote:
Haris wrote...
I would guess detex will not count those 6 words, will it?
No: but the alternative (typesetting and counting using something like ps2ascii) is unworkably slow on large documents, so it's going to depend on what sort of accuracy you're seeking (and what each method considers to be a "word"). It'd be nice to find something fast-ish and not-too-inaccurate, but I'm not holding my breath!
Re: speed. I just tried on my brother's thesis, a 250-page lump of French that I typeset in LaTeX. Going the ps2ascii route (i.e., using the Word Count command Mark posted) took about 60 seconds on my slow-but-serviceable eMac, with ye olde spinning wheel in the meantime. Not nice. The detex route was pretty much instant. Word counts differ by about 1-2%, so "choose your poison". In particular, test what comes out the other end and see if it matches your definition of "word". That is, try a few simple documents and use the commands without the pipe to "wc -w" on the end.
Cheers, Paul
I guess a couple of commands could be added to the LaTeX bundle. One would give a word count "estimate" using detex and wc while the other would give a more accurate count using ps2ascii and wc.
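A sketch of what that pair of commands might look like as a plain shell script (detex, ps2ascii, and the .tex-to-.pdf naming convention are all assumptions here; neither tool ships with the OS):

```shell
# Hypothetical two-mode word-count script: the first count uses detex
# (fast, low estimate); the second extracts text from the typeset PDF
# via ps2ascii (slow, closer to the typeset word count).
cat > wordcount.sh <<'EOF'
#!/bin/sh
tex="$1"
# Fast, low estimate: strip LaTeX commands from the source.
detex "$tex" | wc -w
# Slower, closer to the typeset text: needs the PDF to exist.
pdf="${tex%.tex}.pdf"
if [ -f "$pdf" ]; then
    ps2ascii "$pdf" | wc -w
else
    echo "No PDF found for $tex -- typeset first." >&2
fi
EOF
chmod +x wordcount.sh
```

In a TextMate bundle these would naturally be two separate commands rather than one script, but the pipelines would be the same.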
On 11/29/06, Mark Eli Kalderon eli@markelikalderon.com wrote:
Mark Eli Kalderon wrote:
Indeed: but all that work has already been done, as the command is operating on the pdf file. So the real question becomes: why is ps2ascii (aka ghostscript) so slow? (Just checked on my work machine, a 2 GHz Intel iMac with 2 GB of memory, and it's still about 30 seconds on a 250 page, 1.2MB pdf file.)
I guess it's just an irreducibly difficult procedure... Moral of this story? Don't count words very often!
Cheers, Paul
On Nov 30, 2006, at 1:23 AM, Paul McCann wrote:
I guess the question is how you are going to get the words out of the pdf or ps file. If you look at the pdf/ps source file, it is filled with special commands and things. I suppose if you could export the pdf file to a txt file, then you could count the words there with ease. Otherwise, we are talking about parsing what seems to me to be code even more complicated than LaTeX. You are better off with the small error from counting words in the latex source instead.
Unless I am much mistaken.
I guess it's just an irreducibly difficult procedure... Moral of this story? Don't count words very often!
Or ever, I would say. Why is it important how many words you have? I guess some things have word limits, but surely this is not a check you would have to do too often.
Cheers, Paul
Haris
Hi again,
Believe me, I've had a good look inside pdf files. Not somewhere I'd like to live! Re exporting: nothing obvious rears its head here, and Preview only allows you to export to graphical formats from pdf.
You are better off with the small error from counting words in the latex source instead.
Agreed: that's definitely what I would do if faced with such a task, so a simple command along those lines would suffice. Thankfully I've rarely encountered a task in which the exact number of words is going to be crucial, but I imagine some essays/projects might have strict limits. Still, if you submit a pdf file the assessor is going to have to find a way to determine the word count!
Just for completeness' sake I tried Adobe's online pdf conversion tool on the same file I mentioned earlier (250 page pdf).
http://www.adobe.com/products/acrobat/access_onlinetools.html
It's been literally 5 minutes now and the thing is still spinning its wheels ("Please wait while the requested PDF file is being converted"), so I don't think that's going to come charging in to rescue the situation!
Cheers, Paul
On 30 Nov 2006, at 06:23, Paul McCann wrote:
Sorry, Paul, I thought you were gesturing toward something that would work directly on the LaTeX source.
Perhaps there are faster ways to extract text from a PDF, but I don't know what they would be. There are Java libraries for doing this (I think it's called jPDFText or something close to it). But as I know no Java I can't write and test such a script against ps2ascii. Now if only the Ruby PDF::Reader module (which was meant to have text extraction capabilities) would materialize...
Accurate word counts may not matter to all but they do matter to some. I am constantly working against editorially imposed word limits. Most journals in the Humanities have word limits that are strictly enforced---unless you're famous. ;) But I do take your point that for many use cases ps2ascii is way too slow.
Best, Mark
I rewrote the command with pdftotext, an open source text extraction utility that is included with xpdf.[1] It is *significantly* faster than the command using ps2ascii. However, it returns a different word count from the previous command. I compared the text generated by both ps2ascii and pdftotext. Neither will be the basis of an accurate word count. ps2ascii includes page numbers and inserts whitespace between the two parts of hyphenated words. pdftotext tends to substitute whitespace for ligatures. There are other anomalies with each as well. Based on a small amount of experimentation with some of my documents, it seems that pdftotext gives a slightly better estimate. So the situation is this: with detex you get a low estimate and with pdf text extraction you get a high and slightly more accurate estimate. Disappointing, but perhaps there is a better way. All the best, Mark
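For reference, the heart of such a command is just pdftotext with "-" as the output file, which sends the extracted text to stdout. A hedged sketch (the script name and error wording are made up; it assumes the xpdf tools are installed, and the no-PDF check mirrors the error message mentioned earlier in the thread):

```shell
cat > pdfwc.sh <<'EOF'
#!/bin/sh
pdf="$1"
if [ ! -f "$pdf" ]; then
    echo "No PDF found: $pdf -- typeset the document first." >&2
    exit 1
fi
# "-" writes the extracted text to stdout rather than to a .txt file.
pdftotext "$pdf" - | wc -w
EOF
```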
[1] Mac OS X binary for xpdf (xpdf-tools-3.dmg) can be found here: http://users.phg-online.de/tk/MOSXS/