[TxMt] Word Count ?
Mark Eli Kalderon
eli at markelikalderon.com
Fri Dec 1 22:05:41 UTC 2006
I rewrote the command to use pdftotext, an open source text extraction
utility that is included with xpdf.[1] It is *significantly* faster
than the command using ps2ascii. However, it returns a different word
count from the previous command, so I compared the text generated by
ps2ascii and pdftotext. Neither can serve as the basis of an accurate
word count. ps2ascii includes page numbers and inserts whitespace
between the two parts of hyphenated words. pdftotext tends to
substitute whitespace for ligatures. Each has other anomalies as
well. Based on a small amount of experimentation with some of my
documents, pdftotext seems to give a slightly better estimate. So the
situation is this: detex gives a low estimate, and PDF text
extraction gives a high but slightly more accurate estimate.
Disappointing, but perhaps there is a better way.
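
For anyone who wants to try the pdftotext approach outside TextMate,
here is a minimal Python sketch. It is not the attached command, only
an illustration of the same idea. The function names count_words and
pdf_word_count are hypothetical, and the sketch assumes pdftotext is
on your PATH. Passing "-" as the output file makes pdftotext write to
stdout, and "-enc UTF-8" selects the output encoding.

```python
import subprocess


def count_words(text: str) -> int:
    """Count whitespace-separated tokens, roughly what `wc -w` reports."""
    return len(text.split())


def pdf_word_count(pdf_path: str) -> int:
    """Extract text from a PDF with pdftotext and count the words.

    Assumes the pdftotext binary (from xpdf) is installed and on PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,            # raise if pdftotext exits with an error
        capture_output=True,   # collect stdout instead of printing it
    )
    # Replace undecodable bytes rather than failing on odd glyphs.
    text = result.stdout.decode("utf-8", errors="replace")
    return count_words(text)


# Why the estimate runs high: if a ligature comes out as whitespace,
# one word is split into two tokens.
intact = count_words("an efficient workflow")
broken = count_words("an e cient workflow")  # "ffi" replaced by a space
```

Here intact is 3 and broken is 4, which is the kind of inflation
described above. Calling pdf_word_count("paper.pdf") on a real file
(the file name is only an example) would give the same sort of high
estimate.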
All the best, Mark
[1] A Mac OS X binary for xpdf (xpdf-tools-3.dmg) can be found here:
<http://users.phg-online.de/tk/MOSXS/>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Word Count (pdftotext).tmCommand.zip
Type: application/zip
Size: 794 bytes
Desc: not available
URL: <http://lists.macromates.com/textmate/attachments/20061201/a0a86578/attachment.zip>