Hi guys
I wanted to know if there is a way in TextMate to do a find and replace that strips the time stamps out of text files generated as a YouTube transcript. These are downloaded text files containing the closed captioning text.
The time stamp lines are formatted with quasi timecode as start,end: 0:00:10.100,0:00:11.191
So there would be a line of timecode, then 1 or more lines of text, then a blank line, and then it starts over again on a new line with the next timecode start.
Also, if there is a way to remove white space after commas, but keep sentences intact, that would save heaps of time.
These transcript files are long, taken from videos that run from 25 minutes to 1 hour, so doing it manually would be hell. At least if there is an easy way to strip out this stuff, manually separating paragraphs would be pretty fast.
Any advice is welcome.
thanks
It took me a while to get a sample that matched what you’re describing, so I’ll share it here in case anyone else wants to help:
Sample Transcript: https://gist.github.com/loadedsith/15b87f873a5abe0546c095874051d195
I’ve created a macro to help save you steps. With any luck you’ll be able to add this macro and use it to accomplish what you want.
StripTimestamps.tmMacro: https://gist.github.com/loadedsith/5add3a739777ee11aa20c8656d9b515e
The macro is, like all macros, simply a replay of my commands. Those steps are:
A) Remove the time codes
1. Open the find window (Command + F)
2. Check "Regular Expression"
3. *Set Find to '\d+:\d+:\d+\.\d+,\d+:\d+:\d+\.\d+'
4. Set Replace to nothing, just an empty textbox
5. Click Replace All
B) Remove the extra lines
1. Open the find window (Command + F)
2. "Regular Expression" should still be checked
3. *Set Find to '\n{2,}'
4. *Set Replace to '\n'
5. Click Replace All
C) Remove whitespace after commas
1. Open the find window (Command + F)
2. "Regular Expression" should still be checked
3. *Set Find to ',\s+'
4. *Set Replace to ','
5. Click Replace All
*: In each of these steps the regular expressions are wrapped in single quotes. The quotes are not part of the expression; they simply mark its start and end.
Regex explained:
Step A-3: https://regex101.com/r/mC8kU6/1
Step B-3: https://regex101.com/r/iT4jD2/1
Step C-3: https://regex101.com/r/aS9rE8/1
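If you would rather script this than click through the Find dialog, the same three replacements can be sketched in Python. This is only a rough sketch and not part of the macro: the function name strip_transcript and the sample text are made up for illustration, and it assumes every time stamp looks like 0:00:10.100,0:00:11.191.

```python
import re

# Assumed time stamp shape: start,end such as 0:00:10.100,0:00:11.191
TIMECODE = re.compile(r"\d+:\d+:\d+\.\d+,\d+:\d+:\d+\.\d+")


def strip_transcript(text):
    """Apply steps A, B and C to a transcript string (hypothetical helper)."""
    text = TIMECODE.sub("", text)          # A) remove the time codes
    text = re.sub(r"\n{2,}", "\n", text)   # B) collapse runs of newlines into one
    text = re.sub(r",\s+", ",", text)      # C) remove whitespace after commas
    return text


# Made-up sample in the layout described in the question
sample = (
    "0:00:10.100,0:00:11.191\n"
    "Hello there, and welcome\n"
    "\n"
    "0:00:11.191,0:00:13.000\n"
    "to the show,\n"
    "everyone.\n"
)
cleaned = strip_transcript(sample)
print(cleaned)
```

In Python, \s also matches line breaks, so step C joins a line that ends in a comma with the line after it. The sketch also leaves a single empty line at the very top, where the first time code used to be.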
Good luck!
Graham Heath
In reply to Gradivus (gradivus@optonline.net), who wrote on July 21, 2016 at 6:05:45 PM.