Microsoft’s new AI can simulate anyone’s voice with 3 seconds of audio

Text-to-speech model can preserve speaker's emotional tone and acoustic environment.

Ars Technica

@drewharwell "it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E."

I wish they had said clearly that they were going to build that detector and require that the output stay detectable. All the banks wish that too.