On 30 Nov 2006, at 06:23, Paul McCann wrote:
Mark Eli Kalderon wrote:
There is a limit to how fast an accurate utility could be if it is to accommodate the kind of procedural aspects of LaTeX syntax that Haris was referring to.
Indeed: but all that work has already been done, as the command is operating on the PDF file. So the real question becomes: why is ps2ascii (aka ghostscript) so slow? (Just checked on my work machine, a 2GHz Intel iMac with 2GB of memory, and it's still about 30 seconds on a 250 page, 1.2MB PDF file.)
I guess it's just an irreducibly difficult procedure... Moral of this story? Don't count words very often!
Cheers,
Sorry, Paul, I thought you were gesturing toward something that would work directly on the LaTeX source.
Perhaps there are faster ways to extract text from a PDF, but I don't know what they would be. There are Java libraries for doing this (I think one is called jPDFText or something close to it). But as I know no Java, I can't write a script and test it against ps2ascii. Now if only the Ruby PDF::Reader module (which was meant to have text extraction capabilities) would materialize...
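For anyone who wants to compare extractors, here is a rough Python sketch of the kind of test I have in mind. It assumes that ps2ascii is on the PATH and writes the extracted text to standard output when given only an input file; the function names (extract_text, count_words, timed_word_count) are made up for illustration, and "word" here just means a whitespace-separated token, which may not match what a journal counts.

```python
import subprocess
import sys
import time


def extract_text(command, pdf_path):
    """Run an external extractor (assumed to print text to stdout) on pdf_path."""
    result = subprocess.run(
        [*command, pdf_path],
        capture_output=True,
        text=True,
        errors="replace",  # tolerate odd bytes in the extractor's output
        check=True,        # raise if the extractor exits with an error
    )
    return result.stdout


def count_words(text):
    """Count whitespace-separated tokens (a deliberately naive definition)."""
    return len(text.split())


def timed_word_count(command, pdf_path):
    """Return (word_count, elapsed_seconds) for one extractor on one file."""
    start = time.perf_counter()
    text = extract_text(command, pdf_path)
    elapsed = time.perf_counter() - start
    return count_words(text), elapsed


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python wordcount.py paper.pdf
    words, seconds = timed_word_count(["ps2ascii"], sys.argv[1])
    print(f"{words} words in {seconds:.1f} s")
```

Swapping in another command-line extractor should only require changing the command list, provided it also takes the PDF path as its last argument and prints to standard output.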
Accurate word counts may not matter to all, but they do matter to some. I am constantly working against editorially imposed word limits. Most journals in the Humanities have word limits that are strictly enforced---unless you're famous. ;) But I do take your point that for many use cases ps2ascii is way too slow.
Best, Mark