Google's Next Big Service: Plain-Text Transcriptions of Phone Calls, Podcasts, and YouTube Videos


Google provides some very nice services; Google Books is useful and technically interesting. They started with photographs of printed books, and figured out how to de-warp those page images to produce nice flat pages.

Then they used Optical Character Recognition (OCR) to convert the photographs of text into actual computer-manageable text. This brings image data into the domain of structured data, and it makes the text searchable. On top of that, it’s effectively 500-to-1 compression of the text data (assuming that a 1 MB photograph of a document compresses to about 2 kB of text).

Google is poised to make a similar transformation by converting audio to text. Witness:

  • Google purchased “GrandCentral” and renamed it “Google Voice”. Each Google Voice user gets voicemail service, and incoming voicemails are automatically transcribed to text and emailed.
  • Many people complained about the transcription quality, but last month, in August 2009, Google Voice announced an improved transcription service. Therefore, Google has the technology to automatically transcribe spoken-word audio, and is improving it.
  • Google Voice functions as a type of telephone service, allowing you to place and receive calls. All of the call audio can route through Google Voice’s system. This gives Google access to your call audio when you use Google Voice.
  • Google owns YouTube, which hosts a great deal of useful and educational spoken-word material.
  • Google Books has demonstrated that Google is interested in converting from analog-domain data (printed books) to structured textual data.
  • Lots of useful, unique spoken-word information is presented in podcast and video form. A few podcasts are recordings of other written material (e.g., the Economist Magazine’s podcast and audio edition), but most are probably never published in written form.
  • Spoken-word audio is not directly searchable by Google. Further, audio recordings are quite large; the number of bytes per word is very large. (The previous sentence in written form requires about 200 bytes of storage, including all the formatting around it. A 64 kbps MP3/Ogg Vorbis audio stream of the same sentence would require around 80,000 bytes of storage, which is about ten seconds of audio. The ratio is 400:1, similar to the ratio of photographed book pages to plain text.)
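The storage arithmetic in the last bullet can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the ten-second speaking time is an assumption (it is simply what 80,000 bytes at 64 kbps works out to).

```python
def audio_bytes(bitrate_kbps: float, seconds: float) -> int:
    """Size in bytes of a constant-bitrate audio stream."""
    bytes_per_second = bitrate_kbps * 1000 / 8  # kilobits/s -> bytes/s
    return int(bytes_per_second * seconds)


def size_ratio(raw_bytes: float, text_bytes: float) -> float:
    """How many times larger the raw media is than its plain-text form."""
    return raw_bytes / text_bytes


# Assumption: the sentence takes roughly 10 seconds to say aloud.
spoken_sentence = audio_bytes(bitrate_kbps=64, seconds=10)

# Audio vs. text, using the ~200-byte estimate from the bullet above.
audio_ratio = size_ratio(spoken_sentence, 200)

# Photographed page vs. OCR text, using the 1 MB and 2 kB figures from earlier.
ocr_ratio = size_ratio(1_000_000, 2_000)

print(spoken_sentence, audio_ratio, ocr_ratio)
```

Both ratios come out in the hundreds, which is the point: in either case the text carries the same words in a tiny fraction of the space.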

Therefore, we should fully expect Google to implement new services:

  • Podcasts can be transcribed. This would unlock lots of useful information that is currently inaccessible without a huge commitment of human time.
  • YouTube videos can be transcribed. For ordinary, spoken-word presentation material, the benefits would be huge; the spoken-word material could be readily indexed, searched, and read. Google could use temporal redundancy clues in the video (i.e., whether each frame looks a lot like the previous frame; if there is motion in the video, the frames have little temporal redundancy) to determine whether the picture is changing often. If it’s just a speaker with some text beside his head, they could even show us what’s on the screen as the speaker speaks: at the beginning of each sentence, an individual frame can be extracted and shown along with the text.
  • Google Voice telephone calls can be recorded and transcribed. This would be a great way to record notes on work-related phone calls, and then go back and review notes from old calls. Recordings of phone calls are tedious to review, but transcriptions of those recordings would be fabulously useful.
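To make the temporal-redundancy idea from the YouTube bullet concrete, here is a minimal sketch in Python. It is not how Google or any video codec actually does it; the flattened-grayscale frame representation, the threshold value, and all function names are assumptions chosen for illustration. It compares consecutive frames pixel by pixel, reports how much of the video is essentially still, and picks the frame to show next to each transcribed sentence.

```python
from typing import List, Sequence

# Hypothetical representation: one frame = flattened grayscale pixels, 0-255.
Frame = Sequence[int]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two same-sized frames."""
    if not a or len(a) != len(b):
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def static_fraction(frames: List[Frame], threshold: float = 4.0) -> float:
    """Fraction of consecutive frame pairs that barely change.

    A value near 1.0 suggests a talking head or slides (high temporal
    redundancy); a value near 0.0 suggests constant motion.
    The threshold is an arbitrary illustrative choice.
    """
    if len(frames) < 2:
        return 1.0
    pairs = list(zip(frames, frames[1:]))
    still = sum(1 for a, b in pairs if mean_abs_diff(a, b) < threshold)
    return still / len(pairs)


def frames_for_sentences(sentence_starts_s: List[float], fps: float) -> List[int]:
    """Index of the frame to display alongside each sentence's text."""
    return [int(start * fps) for start in sentence_starts_s]


# Tiny example: three identical "slide" frames, then a cut to a new slide.
slide = [10] * 16
video = [slide, slide, slide, [200] * 16]
print(static_fraction(video))
print(frames_for_sentences([0.0, 2.5], fps=30))
```

A real system would presumably work on decoded video and get sentence start times from the speech recognizer, but the same comparison is enough to tell a lecture with slides apart from an action clip.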