Meta has created a system that can embed hidden signals, known as watermarks, in AI-generated audio clips, which could later help detect AI-generated content online.
The tool, called AudioSeal, is the first that can pinpoint which portions of a longer clip, such as an hourlong podcast, might have been generated by AI. It could help to tackle the growing problem of misinformation and scams that use voice-cloning tools, says Hady Elsahar, a research scientist at Meta. Malicious actors have used generative AI to create audio deepfakes of President Joe Biden, and scammers have used deepfakes to blackmail their victims. Watermarks could in theory help social media companies detect and remove unwanted content.
However, there are some big caveats. Meta says it has no plans yet to apply the watermarks to AI-generated audio created with its tools. Audio watermarks are not yet widely adopted, and there is no single agreed industry standard for them. And watermarks for AI-generated content tend to be easy to tamper with, for example by removing or forging them.
Fast detection, and the ability to pinpoint which elements of an audio file are AI-generated, will be critical to making the system useful, says Elsahar. He says the team achieved between 90% and 100% accuracy in detecting the watermarks, much better results than in previous attempts at watermarking audio.
AudioSeal is available on GitHub for free. Anyone can download it and use it to add watermarks to AI-generated audio clips. It could eventually be layered on top of AI audio generation models, so that a watermark is automatically applied to any speech generated with them. The researchers who created it will present their work at the International Conference on Machine Learning in Vienna, Austria, in July.
AudioSeal relies on two neural networks. One generates watermarking signals that can be embedded into audio tracks. These signals are imperceptible to the human ear but can be detected quickly by the other network. Currently, if you want to spot AI-generated audio in a longer clip, you have to comb through the entire thing in second-long chunks to see whether any of them contain a watermark. This is a slow and laborious process, and it is not practical on social media platforms with millions of minutes of speech.
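The division of labor between the two networks can be sketched in a few lines of code. This is an illustrative toy, not Meta's actual architecture: the model names, layer sizes, and the fixed watermark amplitude below are all assumptions. The sketch only shows the pattern the article describes: one network produces a faint additive signal, and a second network scores every sample for its presence.

```python
# Toy sketch of the two-network watermarking pattern (an assumption-laden
# illustration, not Meta's actual AudioSeal code).
import torch
import torch.nn as nn

class WatermarkGenerator(nn.Module):
    """Maps a waveform to a faint additive watermark signal."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Tanh(),  # bound the raw watermark to [-1, 1]
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # The small scale keeps the added signal far below audible levels.
        return 1e-3 * self.net(wav)

class WatermarkDetector(nn.Module):
    """Scores every audio sample with a watermark probability."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),  # per-sample probability in [0, 1]
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        return self.net(wav)

wav = torch.randn(1, 1, 16000)               # one second of 16 kHz audio
generator, detector = WatermarkGenerator(), WatermarkDetector()

watermarked = wav + generator(wav)           # embed: original + hidden signal
sample_scores = detector(watermarked)        # detect: one score per sample
clip_score = sample_scores.mean()            # whole-clip verdict

print(sample_scores.shape)                   # torch.Size([1, 1, 16000])
```

In a real system of this kind, the two networks would presumably be trained jointly, so that the detector fires on watermarked audio and stays quiet on clean audio, under a perceptual constraint that keeps the added signal inaudible.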
AudioSeal works differently: it embeds a watermark throughout the entire audio track. This allows the watermark to be “localized,” meaning the detector can flag individual samples as watermarked, so the mark can still be found even if the audio is cropped or edited.
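To see why per-sample scores make the watermark localizable, consider a toy post-processing step (again an illustration, not Meta's code): threshold the detector's score track and read off the watermarked spans. Cropping the audio shifts those spans but does not erase them.

```python
# Illustrative post-processing: turn per-sample detector scores into
# localized, crop-resistant watermark spans.
import torch

def watermarked_spans(scores: torch.Tensor, threshold: float = 0.5):
    """Return (start, end) sample indices of spans scored as watermarked."""
    mask = (scores > threshold).long()
    pad = torch.tensor([0])
    # +1 marks where a watermarked span opens, -1 where it closes.
    edges = torch.diff(mask, prepend=pad, append=pad)
    starts = (edges == 1).nonzero(as_tuple=True)[0]
    ends = (edges == -1).nonzero(as_tuple=True)[0]
    return list(zip(starts.tolist(), ends.tolist()))

# Fake score track: samples 4000-12000 of a 16000-sample clip are marked.
scores = torch.zeros(16000)
scores[4000:12000] = 0.95

print(watermarked_spans(scores))           # [(4000, 12000)]
# Cropping the first 6000 samples shifts the span but does not destroy it.
print(watermarked_spans(scores[6000:]))    # [(0, 6000)]
```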
Ben Zhao, a computer science professor at the University of Chicago, says this ability, and the near-perfect detection accuracy, makes AudioSeal better than any previous audio watermarking system he’s come across.
“It’s meaningful to explore research improving the state of the art in watermarking, especially across mediums like speech that are often harder to mark and detect than visual content,” says Claire Leibowicz, head of AI and media integrity at the nonprofit Partnership on AI.
But there are some major flaws that need to be overcome before these sorts of audio watermarks can be adopted en masse. Meta’s researchers tested different attacks to remove the watermarks and found that the more information is disclosed about the watermarking algorithm, the more vulnerable it is. The system also requires people to voluntarily add the watermark to their audio files.
This places some fundamental limitations on the tool, says Zhao. “Where the attacker has some access to the [watermark] detector, it’s pretty fragile,” he says. Keeping the detector out of attackers’ hands would mean keeping it private, so in practice only Meta would be able to verify whether audio content is AI-generated or not.
Leibowicz says she remains unconvinced that watermarks will actually build public trust in the information people see or hear, despite their popularity as a solution in the tech sector. That’s partly because watermarks themselves are so open to abuse.
“I’m skeptical that any watermark will be robust to adversarial stripping and forgery,” she adds.