In a technological breakthrough that blurs the line between machine and human speech, Google has introduced Tacotron 2, a text-to-speech system that produces speech nearly indistinguishable from a real person's voice.
The new system consists of two deep neural networks and is Google’s second generation of AI speech technology. The first network translates text into a spectrogram, a visual representation of audio frequencies over time. The spectrogram is then fed to WaveNet, which interprets the chart and generates the corresponding audio. WaveNet is a raw-audio-generating AI developed by DeepMind, which is owned by Google’s parent company, Alphabet Inc.
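To make the two-stage design concrete, here is a minimal toy sketch of the pipeline in Python. The function names, the hash-based "spectrogram", and the sine-wave "vocoder" are all illustrative stand-ins invented for this example; the real Tacotron 2 and WaveNet stages are large trained neural networks, not these mocks.

```python
import numpy as np

def text_to_spectrogram(text, n_mels=8):
    """Stage 1 (Tacotron-like, mocked): map each character to a column
    of frequency intensities, yielding an (n_mels, time) matrix."""
    cols = []
    for ch in text:
        rng = np.random.default_rng(ord(ch))  # deterministic per character
        cols.append(rng.random(n_mels))
    return np.stack(cols, axis=1)

def spectrogram_to_audio(spec, samples_per_frame=4):
    """Stage 2 (WaveNet-like vocoder, mocked): turn each spectrogram
    frame into a few raw audio samples (here, a crude sine wave)."""
    audio = []
    for frame in spec.T:
        freq = frame.mean()  # crude 'pitch' taken from the frame
        t = np.arange(samples_per_frame)
        audio.append(np.sin(2 * np.pi * freq * t / samples_per_frame))
    return np.concatenate(audio)

spec = text_to_spectrogram("hello")   # one column per character
audio = spectrogram_to_audio(spec)    # raw samples, frame by frame
print(spec.shape)   # (8, 5)
print(audio.shape)  # (20,)
```

The point of the split is that the first stage decides *what* the speech should sound like (frequencies over time), while the second turns that plan into an actual waveform.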
What makes Tacotron 2 special is that it can pronounce even extremely difficult words and alter its pronunciation just as a human would. If a particular word in a sentence needs to be stressed, capitalizing that word will make the AI stress it when converting the text into speech.
Tacotron 2 is currently in its test phase. Once ready for production, the system will bring vast improvements to Google’s voice assistant. For now, the AI can mimic only one specific female voice; to reproduce male speech or a different female voice, Google must retrain the system.
The technology of WaveNet
WaveNet uses a deep convolutional neural network (CNN) consisting of several layers of interconnected nodes that are similar to the neurons of the human brain. The CNN receives a raw waveform as input and produces the output one sample at a time. “As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music,” according to DeepMind.
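The "one sample at a time" generation can be sketched as follows. This toy loop captures only the core autoregressive idea, assuming a single fixed causal filter: each new sample is computed from a window of past samples only, then fed back in as input for the next step. The weights here are arbitrary; a real WaveNet uses many stacked dilated convolution layers with weights learned from recorded speech.

```python
import numpy as np

def generate(n_samples, weights, seed_samples):
    """Autoregressive generation with a causal filter: each output
    depends only on the k most recent samples, never on the future."""
    k = len(weights)                       # causal receptive field size
    history = list(seed_samples[-k:])
    out = []
    for _ in range(n_samples):
        nxt = np.tanh(np.dot(weights, history[-k:]))
        out.append(nxt)
        history.append(nxt)                # feed the output back in
    return np.array(out)

w = np.array([0.2, -0.5, 0.9])             # toy filter, not learned
audio = generate(16, w, seed_samples=[0.0, 0.1, -0.1])
print(audio.shape)  # (16,)
```

Generating audio this way is slow (44,100 steps per second of 44.1 kHz audio), which is one reason the original WaveNet was so computationally expensive to run.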
WaveNet can also accurately model different accents of a language. If it is fed British English, the system will speak with a British accent; if it is fed English spoken by someone from Russia, it will output speech with a Russian accent.
While the technology is exciting, some tech experts raise ethical objections to the use of machine speech technology. For instance, the ability to reproduce any human voice might one day enable kidnappers to entrap children by using the AI to mimic a parent’s voice.
Voysis embedded solutions
Although WaveNet was developed by DeepMind, it is the Irish tech start-up Voysis that has developed the technology to run the audio-generating AI on smartphones.
“We believe that WaveNet will fundamentally change the way voice experiences are delivered… Unlike Google’s WaveNet that depends on custom TPU chips [created for machine learning], ViEW runs on commodity hardware, and it’s so small that it can even run natively on a mobile device,” David Andreasson, Head of Finance and Operations at Voysis, told Extra.Ie.
Google had earlier claimed that it would need double its existing number of data centers to ensure that every Android device could run at least three minutes of WaveNet queries each day. But Voysis’s ViEW, built by a team of just 35 people, requires only 50 MB of space on an Android device to function, and it doesn’t have to connect to any data center. As such, many speech technology experts see ViEW as a game changer in the industry.