[TxMt] Re: stripping time stamps out of YouTube transcripts

Graham Heath graham.p.heath at gmail.com
Fri Jul 22 02:04:59 UTC 2016


It took me a while to get a sample that matched what you’re describing, so
I’ll share it here in case anyone else wants to help:

Sample Transcript:
https://gist.github.com/loadedsith/15b87f873a5abe0546c095874051d195

I’ve created a macro to help save you steps. With any luck you’ll be able
to add this macro and use it to accomplish what you want.

StripTimestamps.tmMacro:
https://gist.github.com/loadedsith/5add3a739777ee11aa20c8656d9b515e

The macro is, like all macros, simply a replay of my commands. Those steps
are:

A) Remove the time codes

   1. Open the find window (Command + F)
   2. Check "Regular Expression"
   3. *Set Find to '\d+:\d+:\d+.\d+,\d+:\d+:\d+.\d+'
   4. Set Replace to nothing, just an empty textbox
   5. Click Replace All
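
If you would rather script step A than click through the Find dialog, here
is a rough Python equivalent. It is only a sketch: the function name and the
sample text are made up for illustration, and I have escaped the dots (\.)
so they match a literal period, whereas the Find pattern above leaves them
unescaped and so matches any character in that position.

```python
import re

# Hypothetical helper name; the pattern mirrors step A-3, with the dots escaped.
TIMESTAMP = re.compile(r"\d+:\d+:\d+\.\d+,\d+:\d+:\d+\.\d+")

def strip_timestamps(text: str) -> str:
    """Replace every start,end timecode pair with nothing."""
    return TIMESTAMP.sub("", text)

# Made-up sample in the format described in the original question.
sample = (
    "0:00:10.100,0:00:11.191\n"
    "Hello there\n"
    "\n"
    "0:00:11.191,0:00:13.000\n"
    "Second caption\n"
)
print(repr(strip_timestamps(sample)))
```

The timecode lines become empty lines rather than disappearing, which is why
a blank-line cleanup is still needed afterwards.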

B) Remove the extra lines

   1. Open the find window (Command + F)
   2.
   3. "Regular Expression" should still be checked
   4. *Set Find to: '\n{2,}'
   5. *Set Replace to: '\n'
   6. Click Replace All
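
The same blank-line cleanup can be sketched in Python, for anyone scripting
this instead of using the Find dialog. The function name is hypothetical and
the input string is an invented example of what the text looks like once the
timecodes are gone.

```python
import re

def collapse_blank_lines(text: str) -> str:
    """Turn any run of two or more newlines into a single newline."""
    return re.sub(r"\n{2,}", "\n", text)

# Invented example: text left behind after the timecodes were removed.
leftover = "\nHello there\n\n\nSecond caption\n"
print(repr(collapse_blank_lines(leftover)))
```

A single newline at the very start of the file is not a run of two, so it is
left alone.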

C) Remove whitespace after commas

   1. Open the find window (Command + F)
   2. "Regular Expression" should still be checked
   3. *Set Find to ',\s+'
   4. *Set Replace to ','
   5. Click Replace All
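
A rough Python version of step C, again only as a sketch with a made-up
function name and sample string. One thing worth knowing before running it:
\s matches newlines as well as spaces, so a comma at the end of a line will
pull the following line up onto it.

```python
import re

def tighten_commas(text: str) -> str:
    """Remove all whitespace (spaces, tabs, newlines) that follows a comma."""
    return re.sub(r",\s+", ",", text)

# Invented example showing both the space case and the end-of-line case.
caption = "Hello,   world,\nnext line"
print(repr(tighten_commas(caption)))
```

If joining lines is not wanted, ',[ \t]+' is one possible alternative pattern
that only removes spaces and tabs.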

*: In each of these steps the regular expressions are wrapped in single
quotes. The quotes are not part of the expression; they simply mark where
it starts and ends.

Regex explained:
  Step A-3: https://regex101.com/r/mC8kU6/1
  Step B-4: https://regex101.com/r/iT4jD2/1
  Step C-3: https://regex101.com/r/aS9rE8/1

Good luck!

Graham Heath


On July 21, 2016 at 6:05:45 PM, Gradivus (gradivus at optonline.net) wrote:

Hi guys

I wanted to know if there was a way in TextMate to do a find and replace on
text files generated as a YouTube transcript. These are downloaded text
files containing the closed captioning text.

The time stamp lines are formatted with quasi timecode as start,end:
0:00:10.100,0:00:11.191

So there would be a line of timecode, then 1 or more lines of text, then a
blank line, and then it starts over again on a new line with the next
timecode start.

Also, if there is a way to remove white space after commas, but keep
sentences intact, that would save heaps of time.

These transcript files are long, and are taken from videos that are
25 minutes to 1 hour in duration, so doing it manually would be hell. At least
if there is an easy way to strip out this stuff, manually separating
paragraphs would be pretty fast.

Any advice is welcomed.

thanks

_______________________________________________
textmate mailing list
textmate at lists.macromates.com
http://lists.macromates.com/listinfo/textmate