I hate labeling this a “cool stuff found.” I mean, it’s cool as &@#%, but it’s scary as all get-out, too. It’s a demonstration by Adobe of Project VoCo, which is sort of like a Photoshop for speech. Watch the video below, but the short version is that Project VoCo can analyze speech and turn it into text that you can then rearrange, edit, or even use to make up new stuff entirely. Seriously, the Adobe engineer typed in words that comedian Jordan Peele didn’t say, and it sounded like he said them. That last feature requires about 20 minutes of recorded speech to analyze, but it’s the truly scary part of this software. Talk about not being able to trust video or audio ever again! The engineer demoing the software, which isn’t even close to a shipping product yet, said Adobe was looking for ways to digitally watermark altered recordings so they can be identified, but come on, that’ll get cracked faster than you can say “All hell broke loose.” Anyway, it’s fascinating, impressive, cool, scary, and it’s certainly black magic. I’d love to hear your thoughts.

Check It Out: Adobe Demonstrates Project VoCo, Photoshop for the Voice

8 Comments

  1. Scott B in DC

    At least you can tell it’s been edited… that is until they learn how to change the person’s natural rhythm and smooth out the cuts.

  2. BurmaYank

    Perhaps the zen in me would like this turning of cultural history; if every treasured trusted souvenir of my cherished past existence(s) I’d keep must automatically now be expected to be bogus(able) & phony, perhaps we must finally get more accustomed to expecting all (our) reality(ies) to be as intrinsically ephemerally non-capturable/non-graspable as they always really were (before the advent of photography so greatly enabled our deluded attachments to our pipedreams of “the Past”, & its counterpart myth of “the Future”). Perhaps this loss of trustworthy keepsakes will make it easier for the next generations to carpe diem.

  3. BurmaYank

    How long will it be before any form of identity verification becomes utterly impossible (if such spoofing software as VoCo is given an adequate sampling of the original):
    – no more iris scanning?
    – no more Touch ID?
    – no more 2-point (or 3-point, or 10,000-point) verification, if such spoofing can be set up to occur remotely simultaneously?

  4. Anna Lamont

    This feels very ‘Minority Report’, scary times indeed. We use voice as an identifier far more than most people realise, since it has previously been very difficult to modify your own voice or emulate someone else’s. Even the best impressionists are recognisable as not being the same person, and some banks have incorporated voice into their telephone security. My fear with this is the obvious integration of voice-to-text and then back again as another voice; with increasing computing power and optimised software, the potential here is horrendous.
    I found myself shuddering as I watched the video of the engineer being cheered and laughing. Some things shouldn’t be made.

  5. Bryan Chaffin

    palmac, AutoTune or Melodyne would do that once you had the words spoken. Of course, the phrasing and intonation of singing are different from speaking, but still…

    Jamie, I hear you. And thanks for the thanks. I love our readers and listeners.

    Marcus, thanks for the thoughtful comment. The ripples from this stuff will be enormous.

  6. Jamie

    Sigh. It would seem to me that we have officially entered an era where we can’t trust anything at first glance. Beyond that, and to say stay vigilant, I don’t know what to say, other than bless you guys for keeping a real comments section going.

  7. palmac

    Now if only they could make the “new” voice sing, that would be a lot of fun. You could have Winston Churchill singing “Eleanor Rigby,” remake “The Marriage of Figaro” with the cast of Futurama, or take a tone-deaf someone who has no sense of rhythm (like me) and have them “sing” a medley of Weird Al songs.

  8. MarcusNewton

    I agree, that is both very cool and very scary at the same time. I can understand people working with audio interviews, audio books, podcasts, etc., wanting a tool to help them edit audio segments without having to actually go out and re-record something, which might not be possible in some cases. As a light editing tool to make small edits, I imagine VoCo would be fine, but it is too easy to come up with scary scenarios.

    The first thing that comes to mind is political attack ads. Current attack ads already combine all kinds of splices and edits to make a particular candidate sound like they are saying something they are not, but with VoCo they could actually type out whatever the attack ad wanted to say. Once any politician or candidate did one speech, then there would be the 20 minutes of audio necessary for VoCo to make the edits.

    The other thing that came to mind was the recent strike by video game voice over actors & actresses. They were already complaining that the game industry is having them go well beyond the role of normal voice acting. Now with VoCo, potentially, all the video game maker would need is 20 minutes of audio, and then they would not need the actor or actress anymore. Not being called back in to do retakes and edits would mean loss of income for voice over professionals.

    When I hear the words “Adobe Security” I translate that to “Adobe Flash”. Also, back when Adobe sold perpetual software licenses, there was a whole cottage industry dedicated to cracking the software codes. So watermarks in the waveform give me zero confidence.

    Finally, the scariest scenario I think would be what hackers could do with social engineering (the hacking of humans). Instead of calling and pretending to be someone else, they could get a recording of an individual they wanted to impersonate and then use VoCo to actually type out what they wanted to happen.

    For example, record a CEO during a conference call, and then call that person’s assistant or other employee with the spoofed recording saying something like, “Hey, it’s me, I lost my keys, can you let me into the building?” VoCo seems like it is fast enough that a hacker could probably type out the conversation live, so it would sound more natural instead of a recorded conversation out of sync.
