Hi,
I often have project folders with multiple HTML files.
I need a shell script to concatenate them all into one single file, to use in a command.
Possibly it should also strip the header and body tags. The file names follow a consistent naming convention.
Any ideas?
regards, marios
On 5 Jun 2008, at 17:13, marios wrote:
Hi,
I often have project folders with multiple HTML files.
I need a shell script to concatenate them all into one single file, to use in a command.
% cat file1 file2 file3 > targetfile
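Since the file names follow a consistent convention, a glob can stand in for listing every file by hand. The following is only a minimal sketch: the `page-NN.html` naming scheme, the `project_dir` variable, and the `combined.html` target name are assumptions made for illustration, and the sample files are created just so the snippet runs on its own.

```shell
#!/bin/sh
set -eu

# Hypothetical project folder with consistently named pages (assumed naming scheme).
project_dir=$(mktemp -d)
printf '<p>one</p>\n' > "$project_dir/page-01.html"
printf '<p>two</p>\n' > "$project_dir/page-02.html"

# The shell expands the glob in sorted order, so zero-padded names are
# concatenated in sequence. The target name must not match the glob,
# otherwise cat would try to read its own output file.
cat "$project_dir"/page-*.html > "$project_dir/combined.html"

cat "$project_dir/combined.html"
```

Zero-padding the numbers matters here: without it, `page-10.html` would sort before `page-2.html`.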
Possibly it should also strip the header and body tags.
This one is a bit trickier. There may be scripts out there that can process HTML files and do this kind of filtering, but if we assume, for the sake of argument, that you only want to remove the '<head>' and '<body>' tags, then this should do it (N.B. the apostrophes are important, -E is needed so that ?, ( ) and | are treated as regex operators rather than literal characters, and this assumes that the tags carry no attributes and are the only content on those lines):
% grep -Ehiv '</?(head|body)>' file1 file2 file3 > targetfile
When grep is given several files, it normally prefixes each output line with the name of the file it was found in; the -h option suppresses that, so only the non-matching lines themselves end up in targetfile.
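To see what the grep approach does on real input, here is a small self-contained sketch. The sample files `a.html` and `b.html` and the `workdir` variable are made up for the demonstration. It uses `-E` so the pattern is read as an extended regular expression and `-h` (available in both GNU and BSD grep) so that no file names are prefixed to the output.

```shell
#!/bin/sh
set -eu

# Two made-up sample files; each tag sits alone on its own line (the stated assumption).
workdir=$(mktemp -d)
printf '<html>\n<head>\n<title>A</title>\n</head>\n<body>\n<p>one</p>\n</body>\n</html>\n' > "$workdir/a.html"
printf '<HTML>\n<BODY>\n<p>two</p>\n</BODY>\n</HTML>\n' > "$workdir/b.html"

# -E  extended regex, so ? ( ) | act as operators
# -h  do not prefix output lines with the file name
# -i  match tags regardless of case
# -v  print only the lines that do NOT match
grep -Ehiv '</?(head|body)>' "$workdir/a.html" "$workdir/b.html" > "$workdir/combined.html"

cat "$workdir/combined.html"
```

Note that only the lines holding the head and body tags themselves are dropped: the `<html>` tags and everything inside `<head>` (here the `<title>` line) are still in the result.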
HTH.
/Jonas
Hey marios,
I wish I had more time to investigate this (I'm already up way, way too late) but this sounded like a great excuse for me to check out Ruby's Hpricot gem: http://code.whytheluckystiff.net/hpricot/
Install with: sudo gem install hpricot
If you're comfortable with Ruby, you can write a script that loops through each of your HTML files and prints out their content. You can easily do some nifty stuff with the Hpricot gem. For starters, I wrote this Ruby script, which you can open in TextMate, change the DIR_PATH variable, and run with Command+R to see what kind of output you would get: http://pastie.textmate.org/private/fvwjqdr42iwlreqsgxqoa
Note that my script doesn't strip out a footer from each HTML file; that would take a pinch more work. Take a look at the (hpricot/path).remove function: if the files all have a #footer, you can remove it before printing the body.
There are some issues you'll have to look out for to end up with a "valid" HTML document. Make sure you don't reuse ids and don't have duplicate <script>s near the bottom of the page; the list goes on. But I think you are aware of these issues.
If you had a shell script in mind, I guess you could string the files together with cat, but that won't strip out the headers/footers. You could write a quick awk/sed script, but this seems much less effective than using an HTML parser like Hpricot.
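For comparison, a quick awk script of the kind mentioned above might look like the sketch below. It keeps only the lines between the opening and closing body tags of each file, so the head section is dropped as well. It assumes the body tags each sit on a line of their own (attributes on the opening tag are tolerated); the sample files and the `workdir` variable are invented for the demonstration. Anything less regular than that is where a real HTML parser earns its keep.

```shell
#!/bin/sh
set -eu

# Made-up sample files for the demonstration.
workdir=$(mktemp -d)
printf '<html>\n<head>\n<title>A</title>\n</head>\n<body class="x">\n<p>one</p>\n</body>\n</html>\n' > "$workdir/a.html"
printf '<html>\n<BODY>\n<p>two</p>\n</BODY>\n</html>\n' > "$workdir/b.html"

awk '
  FNR == 1                  { inside = 0 }  # reset the state for every new input file
  tolower($0) ~ /<\/body>/  { inside = 0 }  # the closing tag ends the kept region
  inside                    { print }       # print only lines inside the body
  tolower($0) ~ /<body[ >]/ { inside = 1 }  # the opening tag (with or without attributes) starts it
' "$workdir/a.html" "$workdir/b.html" > "$workdir/combined.html"

cat "$workdir/combined.html"
```

The order of the rules is deliberate: the closing-tag check runs before the print rule and the opening-tag check after it, so neither tag line is printed itself.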
I hope this helps, - Joe P