[TxMt] Word Count ?

Mark Eli Kalderon eli at markelikalderon.com
Thu Nov 30 16:59:46 UTC 2006


On 30 Nov 2006, at 06:23, Paul McCann wrote:

> Mark Eli Kalderon wrote:
>
>> There is a limit to how fast an accurate utility could be if it is  
>> to accommodate the kind of procedural aspects of LaTeX syntax that  
>> Haris was referring to.
>
> Indeed: but all that work has already been done, as the command is  
> operating on the pdf file. So the real question becomes: why is  
> ps2ascii (aka ghostscript) so slow? (Just checked on my work  
> machine, a 2GHz intel iMac with 2G of memory, and it's still about  
> 30 seconds on a 250 page, 1.2MB pdf file.)
>
> I guess it's just an irreducibly difficult procedure... Moral of  
> this story? Don't count words very often!
>
> Cheers,

Sorry, Paul, I thought you were gesturing toward something that would  
work directly on the LaTeX source.

Perhaps there are faster ways to extract text from a PDF, but I
don't know what they would be. There is a Java library for doing
this (I think it's called jPDFText or something close to it). But as I
know no Java, I can't write a script and test it against ps2ascii. Now
if only the Ruby PDF::Reader module (which was meant to have text
extraction capabilities) would materialize...
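
For reference, the ps2ascii approach under discussion could be sketched
roughly as follows, in Python rather than Ruby or Java. This is only an
illustration under stated assumptions: it presumes Ghostscript's ps2ascii
is on the PATH, the function names count_words and pdf_word_count are
made up for this example, and "word" simply means a whitespace-separated
token, which may not match what a journal counts.

```python
import subprocess
import sys
import time


def count_words(text):
    """Count whitespace-separated tokens (a rough notion of 'word')."""
    return len(text.split())


def pdf_word_count(pdf_path):
    """Extract text from a PDF with ps2ascii and count the words.

    Assumption: the ps2ascii command from Ghostscript is installed and
    writes the extracted text to standard output.
    """
    result = subprocess.run(
        ["ps2ascii", pdf_path],
        capture_output=True,
        text=True,
        errors="replace",  # tolerate bytes that are not valid in the locale encoding
        check=True,        # raise if ps2ascii exits with an error
    )
    return count_words(result.stdout)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Time the whole extraction, since speed is the point at issue.
    start = time.perf_counter()
    words = pdf_word_count(sys.argv[1])
    elapsed = time.perf_counter() - start
    print(f"{words} words ({elapsed:.1f} s)")
```

Run against a compiled PDF, this reports both the count and the elapsed
time, so a faster extractor could be swapped in for ps2ascii and compared
on the same file.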

Accurate word counts may not matter to all, but they do matter to
some. I am constantly working against editorially imposed word
limits. Most journals in the Humanities have word limits that are
strictly enforced---unless you're famous. ;) But I do take your point
that for many use cases ps2ascii is way too slow.

Best, Mark


