
Adobe demos “Photoshop for audio,” lets you edit speech as easily as text

VoCo tech ingests speech, deconstructs it, then creates new words from scratch.

Adobe's VoCo tech demonstrated.

Adobe has demonstrated tech that lets you edit recorded speech, so that you can alter what the speaker said or create an entirely new sentence in their voice. It seems inevitable that it will eventually be referred to as "Photoshop, but for audio."

VoCo lets you copy and paste existing words...
The tech, dubbed VoCo (voice conversion), presents the user with a text box. Initially the text box shows the spoken content of the audio clip. You can then move the words around, delete fragments, or type in entirely new words. When you type in a new word, there's a small pause while the word is constructed—then you can press play and listen to the new clip.

VoCo works by ingesting a large amount of voice data (about 20 minutes right now, but that'll be improved), breaking it down into phonemes (the distinct sounds that make up a spoken language), and then attempting to create a voice model of the speaker. That model presumably captures stuff like cadence, stresses, and quirks, but Adobe hasn't provided much detail yet.

... or just write whole new words.
Then, when you edit someone's speech, VoCo either finds that word somewhere within the 20-minute clip or, if the word hasn't been uttered, constructs it out of raw phonemes. At around the 4:30 mark in the video you can hear the phrase "three times" being constructed from scratch; if you listen carefully, it does sound a bit synthetic, but it's not awful. Copying and pasting existing words sounds better.
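Adobe hasn't said how VoCo is implemented, but the "reuse the word if it exists, otherwise build it from phonemes" decision can be sketched in a few lines of Python. Everything here is an assumption for illustration only: the `RECORDED_WORDS` index, the `LEXICON` of ARPAbet-style phonemes, the timings, and the `plan_edit` function are hypothetical names and data, not anything taken from Adobe's prototype. The sketch only plans which words would be copied and which would be synthesized; it doesn't touch any audio.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical index of words already present in the source recording,
# mapped to (start_seconds, end_seconds). Real timings would come from
# aligning a transcript to the audio.
RECORDED_WORDS = {
    "we": (0.00, 0.18),
    "met": (0.18, 0.47),
    "twice": (0.47, 0.95),
}

# Hypothetical pronunciation lexicon using ARPAbet-style phoneme symbols.
LEXICON = {
    "three": ["TH", "R", "IY"],
    "times": ["T", "AY", "M", "Z"],
}


@dataclass
class Step:
    word: str
    action: str  # "copy" or "synthesize"
    detail: Union[Tuple[float, float], List[str]]


def plan_edit(text: str) -> List[Step]:
    """Decide, word by word, whether to reuse recorded audio or build from phonemes."""
    steps = []
    for word in text.lower().split():
        if word in RECORDED_WORDS:
            # The speaker already said this word: reuse that slice of audio.
            steps.append(Step(word, "copy", RECORDED_WORDS[word]))
        elif word in LEXICON:
            # Never uttered: fall back to assembling it from phonemes.
            steps.append(Step(word, "synthesize", LEXICON[word]))
        else:
            raise ValueError(f"no recording or pronunciation for {word!r}")
    return steps


if __name__ == "__main__":
    for step in plan_edit("We met three times"):
        print(step.word, step.action, step.detail)
```

In this toy version, "we" and "met" come back as copies of existing audio while "three" and "times" are marked for synthesis, which mirrors why pasted words sound more natural than constructed ones in the demo.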

VoCo reminds me of an image editing tech that Adobe first demonstrated many moons ago: content-aware fill, which also creates something out of nothing. At the time, I remember people being concerned about the nefarious possibilities of almost flawlessly adding or removing details from images. With VoCo, it seems Adobe is being a bit more conscientious: even though it's just prototype tech, the company is already talking about "watermarking and detection" to prevent fraudulent use.

VoCo was demonstrated at Adobe Max 2016, where the company usually shows off new tech a year or two before it's commercialised. If VoCo does make it out of the prototype stage, it would probably be added to Adobe Audition, where you could use it to edit podcasts and voiceovers, and, more importantly, to create really funny celebrity audio clips that you can share on Reddit. And to circumvent voice recognition tech currently being rolled out by banks. And to bombard your archenemy with heinous voicemails from their loved ones that sound legitimate, but were actually just made on your PC at home...
