The artificial intelligence that manages to replicate your voice after just three seconds of listening

It’s called VALL-E and was created by Microsoft: it is able to preserve the timbre and emotional nuances of the original voice. For text-to-speech, it was trained on a huge database of English audio

Just three seconds of audio of a person are enough to allow an artificial intelligence to recreate a realistic simulation of their voice and make it speak a new sentence. Microsoft’s VALL-E model could represent a huge step forward compared to today’s text-to-speech technologies, but it also raises some troubling questions about the uses that could be made of it.

As the Redmond company itself explains, VALL-E is a neural codec language model, built on the EnCodec technology that Meta announced only last October. This system relies on a more complex mechanism than the traditional one: it breaks the voice down into chunks of information called tokens, from which waveforms are then synthesized to reconstruct a very similar artificial voice.
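The pipeline described above can be sketched in three conceptual stages: quantize the enrollment audio into discrete codec tokens, predict new acoustic tokens conditioned on the text and the speaker prompt, and decode those tokens back into a waveform. The toy Python below only illustrates this flow; every function here is an illustrative stand-in, not Microsoft’s code (a real system uses EnCodec as the tokenizer and a trained Transformer as the token predictor).

```python
# Toy illustration of the neural-codec language-model idea behind VALL-E.
# All names and logic are simplified stand-ins for the real components.

from typing import List

VOCAB = 16  # toy codec vocabulary (real codecs use ~1024 codes per codebook)

def encode_audio(samples: List[float]) -> List[int]:
    """Quantize a waveform in [-1, 1] into discrete tokens (stand-in for EnCodec)."""
    return [int((s + 1.0) / 2.0 * (VOCAB - 1)) for s in samples]

def predict_tokens(text: str, prompt_tokens: List[int]) -> List[int]:
    """Stand-in for the language model: one token per character,
    'conditioned' on the speaker prompt via its mean token value."""
    bias = sum(prompt_tokens) // max(len(prompt_tokens), 1)
    return [(ord(c) + bias) % VOCAB for c in text]

def decode_tokens(tokens: List[int]) -> List[float]:
    """Map tokens back to waveform samples (stand-in for the codec decoder)."""
    return [t / (VOCAB - 1) * 2.0 - 1.0 for t in tokens]

# 1) A short enrollment clip becomes tokens that capture the speaker.
prompt = encode_audio([0.1, -0.4, 0.7, 0.0])
# 2) New acoustic tokens are predicted from the text, conditioned on the prompt.
new_tokens = predict_tokens("hello", prompt)
# 3) The tokens are decoded into the synthesized waveform.
waveform = decode_tokens(new_tokens)
```

The key design point carried over from the real system is that the hard problem (speech synthesis) is recast as next-token prediction over a discrete vocabulary, which is what lets a language-model architecture be applied to audio at all.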

A very realistic voice

One of the characteristics that distinguishes this artificial intelligence is its ability to faithfully reproduce the timbre of the voice and the emotions of the speaker. Further increasing the degree of realism, VALL-E is also capable of preserving the acoustic peculiarities of the context in which the sentence is pronounced, for example inside a room with a strong echo or during a phone call.

These elements allow the model developed by Microsoft to overcome one of the main limitations of today’s text-to-speech technologies, namely that they sound too metallic and artificial. On GitHub you can hear this difference for yourself thanks to the examples provided, which allow a comparison between a voice synthesized by VALL-E and one produced by a, so to speak, classical system.

The database for speech synthesis

So far we have discussed the technique used to reconstruct the voice artificially and the elements that make it more realistic. To give VALL-E its speech synthesis capabilities at the level of sentence content, on the other hand, the artificial intelligence was trained on LibriLight, the audio library released by Meta, which contains more than 60,000 hours of English speech from seven thousand speakers. At the moment the model is therefore only able to produce artificial voices that speak English.


Opportunities and fears

From an application standpoint, VALL-E could be used in services that require high-quality text-to-speech conversion, but it could equally be an excellent solution for editing existing vocal content.

However, this capability also opens up a whole series of worrying questions about the growth of the vocal deepfake phenomenon. With a model that synthesizes such a realistic voice from just three seconds of speech, one could very easily create speeches from scratch and attribute them to a person who never uttered those words.

Microsoft is obviously aware of these possible downsides of the technology, and not surprisingly included a clear warning in the paper dedicated to VALL-E. As the research states, one hypothesis under study is the possibility of watermarking the audio produced by VALL-E, in order to make any vocal artifact immediately recognizable. In parallel, a system could be introduced that requires the consent of the person whose voice is being used.

January 13, 2023 (edited January 13, 2023 | 3:04 pm)
