[TxMt] Word Count ?

Mark Eli Kalderon eli at markelikalderon.com
Fri Dec 1 22:05:41 UTC 2006


I rewrote the command with pdftotext, an open source text extraction
utility that is included with xpdf.[1] It is *significantly* faster
than the command using ps2ascii. However, it returns a different word
count from the previous command. I compared the text generated by
ps2ascii and pdftotext, and neither can be the basis of an accurate
word count. ps2ascii includes page numbers and inserts whitespace
between the two parts of hyphenated words. pdftotext tends to
substitute whitespace for ligatures. Each has other anomalies as
well. Based on a small amount of experimentation with some of my
documents, it seems that pdftotext gives a slightly better estimate.
So the situation is this: with detex you get a low estimate, and with
PDF text extraction you get a high but slightly more accurate
estimate. Disappointing, but perhaps there is a better way.
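
For illustration only, here is a minimal Python sketch of the
pdftotext approach. It is not the attached command. It assumes
pdftotext is on the PATH; the function names extract_text and
count_words are made up for this example, and the page-number filter
is a guess at a heuristic, not something either tool provides.

```python
import re
import subprocess


def extract_text(pdf_path):
    """Return the text pdftotext extracts from a PDF.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def count_words(text):
    """Count whitespace-separated tokens in extracted text.

    Assumed heuristic: a line that holds nothing but digits is treated
    as a page number and skipped.
    """
    total = 0
    for line in text.splitlines():
        if re.fullmatch(r"\s*\d+\s*", line):
            continue  # looks like a bare page number
        total += len(line.split())
    return total


# Hypothetical usage ("paper.pdf" is a placeholder file name):
# print(count_words(extract_text("paper.pdf")))
```

Nothing in this sketch repairs split hyphenated words or ligatures
replaced by whitespace, so a fragment such as "well- known" still
counts as two tokens and the total will still run high.
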
All the best, Mark

[1] A Mac OS X binary for xpdf (xpdf-tools-3.dmg) can be found here:
<http://users.phg-online.de/tk/MOSXS/>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Word Count (pdftotext).tmCommand.zip
Type: application/zip
Size: 794 bytes
Desc: not available
URL: <http://lists.macromates.com/textmate/attachments/20061201/a0a86578/attachment.zip>

