Unless you’ve been living under a rock, you’re probably familiar with Google Assistant at this point. Google has made a massive push into artificial intelligence and machine learning. It even states at its events that it has moved from a mobile-first strategy to an AI-first strategy. That means that it wants to train computers to always be delivering relevant and helpful information to you before you even know you need it.
You may have noticed a difference in Google Assistant the last few days. That’s because Google has started using a technology called WaveNet from the DeepMind team. The goal of the new WaveNet technology is to move Assistant from synthesized speech to a more natural speech pattern. Synthesized speech like you’d get from Google Assistant or Apple’s Siri is normally stitched together using small bits of recorded speech. This is called “concatenative text-to-speech” and it’s why some answers can sound a bit off when they’re read back to you.
Since bits of speech are essentially glued together, it’s hard to account for emotion or inflection. To get around that, most voice models are trained with samples that have as little variance as possible. That lack of variance in the speech pattern is why the result can sound a bit robotic, and it’s the problem Google and the DeepMind team are trying to solve with WaveNet.
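To make the “glued together” problem a little more concrete, here is a toy Python sketch of the concatenative idea. Everything in it is an assumption for illustration: the `UNITS` inventory, the fragment names, and the use of short sine tones as stand-ins for recorded speech are hypothetical, and no real text-to-speech engine is this simple. It only shows that when separately recorded fragments are joined end to end, the waveform can jump abruptly at the seams.

```python
import math

UNIT_LEN = 80  # samples per fragment (assumed value for this toy example)

def record_unit(freq_hz, n_samples=UNIT_LEN, rate=8000):
    """Stand-in for a pre-recorded speech fragment: a short sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n_samples)]

# Hypothetical unit inventory; a real system stores a large library of recordings.
UNITS = {
    "he": record_unit(220),
    "llo": record_unit(330),
    "wor": record_unit(180),
    "ld": record_unit(260),
}

def concatenate(unit_names):
    """Build an utterance by gluing stored fragments end to end."""
    out = []
    for name in unit_names:
        out.extend(UNITS[name])
    return out

def max_step(signal, indices):
    """Largest sample-to-sample change at the given positions."""
    return max(abs(signal[i] - signal[i - 1]) for i in indices)

utterance = concatenate(["he", "llo", "wor", "ld"])

# Positions where one fragment ends and the next begins.
seams = [i for i in range(1, len(utterance)) if i % UNIT_LEN == 0]
# Every other position lies inside a single fragment.
interior = [i for i in range(1, len(utterance)) if i % UNIT_LEN != 0]

seam_jump = max_step(utterance, seams)
interior_jump = max_step(utterance, interior)
print("largest jump at a seam:", round(seam_jump, 3))
print("largest jump inside a fragment:", round(interior_jump, 3))
```

In this toy, the jump at the seams is noticeably larger than anything inside a fragment, which is a rough analogue of why stitched speech can sound slightly off.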
WaveNet is a completely different approach. Instead of recording hours of words, phrases, and fragments and then linking them together, the technology uses real speech to train a neural network. WaveNet learned the underlying structure of speech, such as which tones follow others and which waveforms are realistic and which aren’t. Using what it learned, the network was then able to synthesize audio one sample at a time, with each new sample taking into account the samples before it. By being aware of the waveform that came before, WaveNet was able to create speech patterns that sound more natural.
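The “one sample at a time” idea can be illustrated with a very small Python sketch. This is not WaveNet: the real system is a deep neural network, while the stand-in below is a two-coefficient linear predictor, and the function names `fit_ar2` and `generate` are made up for this example. What it does share with the description above is the overall shape: learn from a real waveform, then produce new audio sample by sample, with each sample computed from the ones before it.

```python
import math

def fit_ar2(signal):
    """Learn a and b so that x[t] is approximately a*x[t-1] + b*x[t-2].

    Solves the 2x2 least-squares normal equations directly.
    """
    s11 = s12 = s22 = r1 = r2 = 0.0
    for t in range(2, len(signal)):
        x1, x2, y = signal[t - 1], signal[t - 2], signal[t]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x1 * y
        r2 += x2 * y
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (r2 * s11 - r1 * s12) / det
    return a, b

def generate(a, b, seed, n):
    """Generate n samples, each one predicted from the two before it."""
    out = list(seed)
    while len(out) < n:
        out.append(a * out[-1] + b * out[-2])
    return out

# "Training data": a clean tone standing in for recorded speech.
N = 200
OMEGA = 2 * math.pi * 5 / N
training = [math.sin(OMEGA * t) for t in range(N)]

a, b = fit_ar2(training)

# Start from the first two real samples and let the model continue on its own.
generated = generate(a, b, training[:2], N)
worst_error = max(abs(g - x) for g, x in zip(generated, training))
print("learned coefficients:", a, b)
print("worst mismatch against the original tone:", worst_error)
```

Because every new sample depends on what was just generated, the output flows continuously instead of being assembled from separate pieces. A simple tone is easy to learn this way; speech is far more complex, which is why WaveNet relies on a neural network rather than two numbers.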
The advantages of this new system are subtle, but you can definitely hear them. When speaking to another human, you’ll pick up on when they’re coming to the end of a thought because their voice drops at the end of a sentence. If you ever sit and watch the news for a few minutes, you can usually tell when a story is about to end because the anchor will start to slow down and the volume or tone of their voice lowers. Part of the reason that concatenative text-to-speech sounds less natural is that it misses subtleties like that. That’s a huge part of where the new WaveNet technology improves on the current system.