30 Days of Code #2

Made a translation/subtitling problem into a coding problem.

I have been trying out different ways to do subtitling, because putting subtitles on speech-heavy videos longer than 3 minutes with Sony Vegas is s l o w. I searched around a bit and found a way to hardcode subtitles onto the video using a combination of YouTube’s fabulous subtitling tools and the free software HandBrake. I uploaded my transcribed Spanish text to YouTube, and it automatically assigned timestamps.

I had only one problem – I needed to translate the subtitles from Spanish to English. I downloaded the subtitle file from YouTube and pasted it into Google Translate to quickly get a (very) rough translation, which I then thoroughly reviewed for errors.

The biggest error was something odd that happened to the formatting in the translation process. The timestamps in the Spanish version look like this:

1
00:00:00,770 --> 00:00:08,490
Muchas gracias, pastor Coby y toda la congregación
y liderazgo de tan distinguida iglesia. Le

2
00:00:08,490 --> 00:00:15,029
agradezco sinceramente que me permita dirigirme
a su congregación para compartir nuestras

When I translated this document into English using Google Translate, they looked like this:

1
00: 00: 00,770 -> 00: 00: 08,490
Thank you very much Pastor Coby and the entire congregation
and leadership of such a distinguished church.

2
00: 00: 08,490 -> 00: 00: 15,029
I sincerely thank you for allowing me to address
to your congregation to share our

After a few lines of manually fixing the format, I had the idea that this would be so much easier if I could just use code to reformat everything.

And so I did! Worked like a charm!

=begin
Input: text
Output: text
Rules: 
Problem: break up the text based on newlines. Every time there is a new line that starts with 0, fix the formatting.

DS: strings, array

Algo: 
define a method called format_timestamps, takes one parameter, text
split text into array of strings based on newline characters
iterate over strings in the array and if the string starts with '0',
  remove all spaces
  replace the '-' character with ' --'
  replace the '>' character with '> '

join the array back together as a string
=end

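def format_timestamps(text)
  arr_text = text.split("\n")

  arr_text.map do |str|
    if str[0] == '0'
      str.gsub(' ', '')     # strip every space from the timestamp line
         .gsub('-', ' --')  # turn the single dash back into ' --'
         .gsub('>', '> ')   # add the space that follows the arrow
    else
      str
    end
  end.join("\n")
end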

My initial algorithm called for removing spaces after colons and then adding an extra dash before the arrow, but then I thought about removing all spaces and then adding spaces back in where needed. Much easier!
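For comparison, here is a rough sketch of what that first idea might have looked like: fix only the spaces after colons, then repair the arrow. The method name format_timestamps_targeted is made up for this illustration, and the sketch assumes the translated arrow always appears as ' -> ' with one space on each side.

```ruby
# Hypothetical name; a sketch of the "targeted fixes" idea, not the code I ran.
def format_timestamps_targeted(text)
  text.split("\n").map do |str|
    if str[0] == '0'
      str.gsub(': ', ':')       # "00: 00: 00,770" becomes "00:00:00,770"
         .sub(' -> ', ' --> ')  # restore the two-dash SRT arrow
    else
      str
    end
  end.join("\n")
end

puts format_timestamps_targeted("1\n00: 00: 00,770 -> 00: 00: 08,490\nThank you")
```

This works on the sample above, but it depends on knowing exactly where the stray spaces are, which is why stripping every space and adding two back felt simpler.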

I seem to remember that it is not a great idea to pack so many steps into one block, like I did in lines 6-8 of my format_timestamps method. It would probably be better to write some helper methods and call them inside the block. If I were taking this program further, that would also improve readability, because I could give the helpers descriptive names like #remove_spaces or #fix_arrow.
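A minimal sketch of how that refactor could look. The names remove_spaces and fix_arrow come from the paragraph above, but their bodies are my guess at a reasonable split, and timestamp_line? is an extra hypothetical helper added for readability.

```ruby
# Strip every space from a line.
def remove_spaces(str)
  str.delete(' ')
end

# Assumes spaces are already removed, so the arrow reads "->".
def fix_arrow(str)
  str.sub('->', ' --> ')
end

# Hypothetical helper: same test as the original, a line starting with '0'.
def timestamp_line?(str)
  str.start_with?('0')
end

def format_timestamps(text)
  text.split("\n").map do |str|
    timestamp_line?(str) ? fix_arrow(remove_spaces(str)) : str
  end.join("\n")
end

puts format_timestamps("1\n00: 00: 00,770 -> 00: 00: 08,490\nThank you very much")
```

Each helper returns a new string instead of changing its argument, so the block only has to decide which lines need fixing.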

Another fun thing I want to practice again is opening, editing, and closing the .srt file from within my program, instead of copy-pasting all the text in. For today, though, my text was short enough that it didn’t really matter.
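When I get around to it, the file version could look something like this sketch. The method name reformat_srt_file and the file names in the comment are placeholders, and I am assuming the file is UTF-8 with plain "\n" line endings. File.read and File.write open and close the file on their own, so there is no handle to clean up.

```ruby
# Hypothetical helper: read an .srt file, fix the timestamp lines, write a new file.
def reformat_srt_file(input_path, output_path)
  text = File.read(input_path, encoding: 'UTF-8')

  fixed = text.split("\n").map do |line|
    # Same rule as before: lines starting with '0' are treated as timestamps.
    line.start_with?('0') ? line.delete(' ').sub('->', ' --> ') : line
  end.join("\n")

  # split drops the trailing newline, so add one back when writing.
  File.write(output_path, fixed + "\n")
end

# Example call (placeholder file names):
# reformat_srt_file('captions_en.srt', 'captions_en_fixed.srt')
```

Writing to a second file rather than overwriting the original keeps the download from YouTube intact in case the reformatting goes wrong.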