[TxMt] Word Count ?
Mark Eli Kalderon
eli at markelikalderon.com
Thu Nov 30 16:59:46 UTC 2006
On 30 Nov 2006, at 06:23, Paul McCann wrote:
> Mark Eli Kalderon wrote:
>
>> There is a limit to how fast an accurate utility could be if it is
>> to accommodate the kind of procedural aspects of LaTeX syntax that
>> Haris was referring to.
>
> Indeed: but all that work has already been done, as the command is
> operating on the pdf file. So the real question becomes: why is
> ps2ascii (aka ghostscript) so slow? (Just checked on my work
> machine, a 2GHz intel iMac with 2G of memory, and it's still about
> 30 seconds on a 250 page, 1.2MB pdf file.)
>
> I guess it's just an irreducibly difficult procedure... Moral of
> this story? Don't count words very often!
>
> Cheers,
Sorry, Paul, I thought you were gesturing toward something that would
work directly on the LaTeX source.
Perhaps there are faster ways to extract text from a PDF, but I
don't know what they would be. There are Java libraries for doing
this (I think it's called jPDFText or something close to it). But as I
know no Java, I can't write a script and test it against ps2ascii. Now
if only the Ruby PDF::Reader module (which was meant to have text
extraction capabilities) would materialize...
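For anyone who wants to compare extractors, here is a minimal sketch
of the kind of timing test I mean, in Python rather than Java or Ruby.
It assumes ps2ascii is on the PATH and prints the extracted text to
standard output when given only an input file. The names count_words
and pdf_word_count are my own, purely for illustration, and the
definition of a "word" (any whitespace-separated token containing a
letter, digit, or underscore) is an assumption that an editor may not
share.

```python
import re
import subprocess
import sys
import time


def count_words(text):
    """Count whitespace-separated tokens containing at least one word character."""
    # Tokens made only of punctuation (e.g. "---") are not counted as words.
    return sum(1 for token in text.split() if re.search(r"\w", token))


def pdf_word_count(pdf_path, extractor=("ps2ascii",)):
    """Run a text extractor on pdf_path and return (word_count, seconds).

    Assumes the extractor command accepts the input file as its last
    argument and writes the extracted text to standard output.
    """
    start = time.perf_counter()
    result = subprocess.run(
        [*extractor, pdf_path], capture_output=True, text=True, check=True
    )
    elapsed = time.perf_counter() - start
    return count_words(result.stdout), elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    words, seconds = pdf_word_count(sys.argv[1])
    print(f"{words} words in {seconds:.1f}s")
```

Passing a different command tuple as extractor would let the same
script time another tool against ps2ascii on the same file, provided
that tool follows the same calling convention.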
Accurate word counts may not matter to all, but they do matter to
some. I am constantly working against editorially imposed word
limits. Most journals in the Humanities have word limits that are
strictly enforced---unless you're famous. ;) But I do take your point
that for many use cases ps2ascii is way too slow.
Best, Mark