Microsoft’s new AI can simulate anyone’s voice with 3 seconds of audio

Text-to-speech model can preserve speaker's emotional tone and acoustic environment.

Ars Technica

@drewharwell "it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E."

I wish they had said clearly that they were going to build that detector and require that the output stay detectable. All the banks wish that too.