[TxMt] stripping time stamps out of YouTube transcripts

Gradivus gradivus at optonline.net
Fri Jul 22 01:01:05 UTC 2016


Hi guys

I wanted to know if there is a way in TextMate to do a find and replace on text files generated as YouTube transcripts. These are downloaded text files containing the closed-captioning text.

The time stamp lines are formatted as a quasi timecode in the form start,end, for example: 0:00:10.100,0:00:11.191

So there is a line of timecode, then one or more lines of text, then a blank line, and then the pattern starts over on a new line with the next timecode.
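
To make the layout concrete, here is a minimal sketch in Python of the kind of match I have in mind. The pattern is only my guess from the one sample line above (hours without a leading zero, three decimal places, a comma between start and end), and the name strip_timestamps is just made up for illustration. A similar expression could probably be tried in a regular-expression find and replace with an empty replacement, though I have not checked that.

```python
import re

# Assumed shape of a time stamp line, based on the sample 0:00:10.100,0:00:11.191.
# The whole line (plus its line ending) must match, so ordinary caption text
# that merely contains a time is left alone.
TIMESTAMP_LINE = re.compile(
    r"^\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}[ \t]*(?:\r?\n|\Z)",
    re.MULTILINE,
)


def strip_timestamps(text: str) -> str:
    """Delete every line that holds only a start,end time stamp pair."""
    return TIMESTAMP_LINE.sub("", text)


# Made-up sample in the layout described above.
sample = (
    "0:00:10.100,0:00:11.191\n"
    "Hello, and welcome\n"
    "\n"
    "0:00:11.191,0:00:13.500\n"
    "to the show.\n"
)

print(strip_timestamps(sample))
```

The blank lines between caption blocks are kept on purpose, so they can still serve as separators when splitting paragraphs by hand.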

Also, if there is a way to remove white space after commas but keep sentences intact, that would save heaps of time.
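
Roughly what I mean, as a sketch in Python: I am assuming that "white space after commas" covers any run of spaces, tabs, or line breaks that follows a comma, and that collapsing it to a single space is the right result. The name tidy_commas is hypothetical. Periods and other sentence endings are not touched.

```python
import re


def tidy_commas(text: str) -> str:
    """Collapse any run of white space after a comma into one space.

    Assumption: a line break right after a comma means the sentence
    continues on the next caption line, so the two lines are joined.
    """
    return re.sub(r",\s+", ", ", text)


# Made-up caption text with extra spaces and a line break after commas.
messy = "Hello,   and welcome,\nto the show. Next sentence."

print(tidy_commas(messy))
```

Note that this also joins across a blank line when a caption block ends in a comma, which may or may not be wanted.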

These transcript files are long, taken from videos that run from 25 minutes to 1 hour, so doing it manually would be hell. If there is an easy way to strip this stuff out, manually separating paragraphs afterwards would be pretty fast.

Any advice is welcome.

thanks

