In this post, we compare Text to Speech voices from the two leading providers of synthetic voices, Amazon and Google. We also show how you can get started using these voices to create audio.
Text to speech is a technology that allows computers to speak: you write text and the computer reads it out. Historically, the voices sounded robotic and monotonous, which made them generally unsuitable for anything other than accessibility applications.
This is no longer the case. The application of Machine Learning to Text to Speech has transformed the way computers speak, making the voices remarkably realistic and opening the door to countless applications. Examples include listening to news and articles, podcasts, gaming, public announcement systems, e-learning, telephony, IoT apps and devices, and personal assistants.
Polly is Amazon’s Text to Speech offering, which Amazon describes as “life-like”, and Wavenet is Google’s Text to Speech offering.
Let’s look at how they compare.
1. Voices & Languages
Amazon Polly offers 74 voices in 29 languages.
Of the 74 voices, 60 are standard voices and 14 are NTTS (Neural Text to Speech) voices, which are powered by machine learning algorithms and are superior in quality to the standard ones.
Here are some voice samples of standard and NTTS voices:
Joanna Standard
Joanna NTTS
Matthew Standard
Matthew NTTS
Google, on the other hand, offers 90 voices in 20 languages.
Google also offers standard and neural voices, but it uses a different technology to create the neural ones. Google calls it Wavenet, and it is based on DeepMind’s research.
Here are some voice samples from Google Wavenet:
Standard US English
Neural US English
Google Wavenet offers more voices than Amazon Polly, but when it comes to Speaking Styles, Amazon Polly is a complete game-changer and a major breakthrough in Text to Speech technology.
Check out this complete list of languages and voices with samples: https://play.ht/voices/
2. Amazon Polly Speaking Styles
Amazon offers two speaking styles: Newscaster, which synthesizes speech in the manner of a TV or radio newscaster, and Conversational, which simulates the tone of a friendly conversation. Google, in contrast, relies on DeepMind’s research to generate speech that sounds more like human interaction.
The ability to change the tonality of the audio is highly useful to news publishers, product reviewers, and bloggers, who can make a recording sound more conversational or more like a newscast and therefore more engaging.
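In Polly, a speaking style is requested through SSML rather than through a separate voice. The sketch below wraps plain text in Polly's `amazon:domain` tag. The helper name `polly_style_ssml` and the style keys are our own choices for illustration, and the styles are only available with the neural engine on certain voices, so treat this as a starting point and not a complete integration.

```python
from xml.sax.saxutils import escape

# Maps a friendly style name (our own naming) to the value Polly expects
# in the name attribute of its <amazon:domain> SSML tag.
POLLY_STYLES = {"newscaster": "news", "conversational": "conversational"}


def polly_style_ssml(text: str, style: str) -> str:
    """Wrap plain text in SSML that asks Polly for a speaking style."""
    if style not in POLLY_STYLES:
        raise ValueError(f"unknown style: {style!r}")
    domain = POLLY_STYLES[style]
    # escape() protects &, < and > so the text cannot break the SSML markup.
    return (
        f'<speak><amazon:domain name="{domain}">'
        f"{escape(text)}"
        "</amazon:domain></speak>"
    )


print(polly_style_ssml("Markets rallied today & closed higher.", "newscaster"))
```

The resulting string would be sent to Polly as the text to synthesize, with the text type set to SSML and the engine set to neural.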
Here is a sample of the Newscaster and Conversational styles:
3. Synchronized Speech for an Enhanced Visual Experience
Amazon Polly provides Speech Marks, an additional stream of metadata that describes when particular sentences, words, and sounds are pronounced.
This data can be used to sync the audio with facial animation or with Karaoke-style word highlighting. Video and infographics creators and animators can use it to synchronize their audio and video and provide a better visual experience.
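Polly returns speech marks as line-delimited JSON, one object per mark, with a `time` offset in milliseconds, a `type`, the `start` and `end` offsets in the input text, and a `value`. As a rough sketch of Karaoke-style highlighting, the code below parses such a stream and looks up the word being spoken at a given playback position. The helper names and the sample data are illustrative assumptions, not real Polly output.

```python
import json
from typing import List, NamedTuple


class WordMark(NamedTuple):
    time_ms: int  # offset from the start of the audio, in milliseconds
    start: int    # offset of the word's first character in the input text
    end: int      # offset just past the word's last character
    value: str    # the word itself


def parse_word_marks(stream: str) -> List[WordMark]:
    """Parse line-delimited JSON speech marks and keep the word-level entries."""
    marks = []
    for line in stream.splitlines():
        line = line.strip()
        if not line:
            continue
        mark = json.loads(line)
        if mark.get("type") == "word":
            marks.append(
                WordMark(mark["time"], mark["start"], mark["end"], mark["value"])
            )
    return marks


def word_at(marks: List[WordMark], playback_ms: int) -> str:
    """Return the word to highlight at a playback position ('' before the first word)."""
    current = ""
    for mark in marks:  # assumes the marks are in chronological order
        if mark.time_ms > playback_ms:
            break
        current = mark.value
    return current


# Illustrative sample data written by hand for this example.
SAMPLE = """\
{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
{"time":604,"type":"word","start":9,"end":10,"value":"a"}
{"time":643,"type":"word","start":11,"end":17,"value":"little"}
{"time":882,"type":"word","start":18,"end":22,"value":"lamb"}
"""

marks = parse_word_marks(SAMPLE)
print(word_at(marks, 700))  # prints "little"
```

A player would call `word_at` with its current playback time on every frame and use the `start` and `end` offsets of the matching mark to highlight the word in the displayed text.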
Google Wavenet does not offer Speech Marks.
4. Custom Lexicons for custom word pronunciations
Custom Lexicons can help content creators create and reuse the right pronunciation of certain words such as company names, acronyms, or foreign words which otherwise would be mispronounced by the default Text to Speech voices.
Although SSML provides a standard way of changing the pronunciation of words in Text to Speech, Amazon goes a step further with Polly and offers a dedicated way of creating and managing a “dictionary” of words with their own pronunciations.
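Polly lexicons are written in the W3C Pronunciation Lexicon Specification (PLS) format, an XML document that pairs each written form (a grapheme) with a spoken replacement. The sketch below builds a minimal lexicon from a Python dictionary. The function name `build_pls_lexicon` and the sample entry are assumptions for illustration, and only the simple alias form is covered, not phonetic transcriptions.

```python
from typing import Dict
from xml.sax.saxutils import escape, quoteattr

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_pls_lexicon(aliases: Dict[str, str], lang: str = "en-US") -> str:
    """Build a PLS lexicon that maps written forms to spoken replacements."""
    lexemes = []
    for grapheme, alias in aliases.items():
        # escape() keeps characters such as & from breaking the XML.
        lexemes.append(
            "  <lexeme>\n"
            f"    <grapheme>{escape(grapheme)}</grapheme>\n"
            f"    <alias>{escape(alias)}</alias>\n"
            "  </lexeme>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}" '
        f'alphabet="ipa" xml:lang={quoteattr(lang)}>\n'
        + "\n".join(lexemes)
        + "\n</lexicon>\n"
    )


print(build_pls_lexicon({"W3C": "World Wide Web Consortium"}))
```

The generated document is what would be uploaded to Polly as a named lexicon and then referenced by name in later synthesis requests, so the same pronunciations are reused without touching the input text.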
Google, on the other hand, does not offer custom lexicons and instead relies on the standard SSML tags to change pronunciation.
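With plain SSML, a reusable pronunciation dictionary has to be applied to the text before each request. The sketch below does this with the standard `<sub alias="...">` tag, which tells the voice to speak the alias in place of the written word. The function name `apply_pronunciations` and the matching rule (case-sensitive, whole words only) are our own assumptions and do not describe how any particular product implements this.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape, quoteattr


def apply_pronunciations(text: str, aliases: Dict[str, str]) -> str:
    """Return SSML in which each listed word is wrapped in a <sub alias="..."> tag."""
    if not aliases:
        return f"<speak>{escape(text)}</speak>"
    # Try longer entries first so that a phrase wins over a word it contains.
    words = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")

    parts = []
    last = 0
    for match in pattern.finditer(text):
        word = match.group(1)
        # Text between matches is escaped so it stays valid inside SSML.
        parts.append(escape(text[last:match.start()]))
        parts.append(f"<sub alias={quoteattr(aliases[word])}>{escape(word)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(apply_pronunciations(
    "The W3C publishes SSML.",
    {"W3C": "World Wide Web Consortium"},
))
```

Because `<sub>` is part of standard SSML, the same output could in principle be sent to either provider, which is the main difference from a provider-managed lexicon.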
Seeing how important custom lexicons can be, we built our own lexicon system called “Global Pronunciations”. It allows users to store key-value pairs of words and their respective audio and to use them across any voice from Amazon and Google.
5. Creating a unique custom voice – aka Voice Cloning
Voice cloning is arguably at the forefront of Text to Speech technology. A machine learning algorithm listens to a couple of voice samples of a person and learns to speak like them, which makes it possible to create a unique branded voice that stands out from the default voices.
A brand voice can help you create unique audio for all your blogs, content, or e-learning videos that resonates with the brand and its values.
Amazon Polly offers an invite-only option for creating a Brand Voice where you can work with the Amazon Polly team of AI research scientists and linguists to build an exclusive, high-quality, Neural Text-to-Speech voice that represents your brand’s persona.
Here are a couple of samples from companies that have created their own branded voices:
KFC Voice Sample
NAB Voice Sample
Unfortunately, Google does not provide a Voice Cloning solution as of now.
6. How to start creating audio with Amazon Polly and Google Wavenet
Both Amazon Polly and Google Wavenet offer excellent APIs for creating audio, but Amazon goes a step further and lets you link an S3 bucket in which to store the audio, which makes creating and storing audio more convenient than with Google.
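To make the difference concrete, here is a minimal sketch of the two requests. The helper names, the default voices, and any bucket name you pass in are assumptions for illustration. `start_polly_task` also assumes that `boto3` is installed, that AWS credentials are configured, and that the neural engine is available for the chosen voice in your region.

```python
from typing import Any, Dict


def polly_s3_task_params(text: str, bucket: str, voice_id: str = "Joanna") -> Dict[str, Any]:
    """Keyword arguments for boto3's polly.start_speech_synthesis_task()."""
    return {
        "Text": text,
        "VoiceId": voice_id,
        "Engine": "neural",
        "OutputFormat": "mp3",
        "OutputS3BucketName": bucket,  # Polly writes the finished audio here
    }


def google_synthesize_body(
    text: str,
    voice_name: str = "en-US-Wavenet-D",
    language_code: str = "en-US",
) -> Dict[str, Any]:
    """JSON body for the Cloud Text-to-Speech REST method text:synthesize."""
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def start_polly_task(text: str, bucket: str) -> str:
    """Start an asynchronous synthesis task and return the S3 URI of the output.

    Makes a real AWS call, so it needs credentials and a writable bucket.
    """
    import boto3  # imported lazily so the helpers above work without it

    polly = boto3.client("polly")
    response = polly.start_speech_synthesis_task(**polly_s3_task_params(text, bucket))
    return response["SynthesisTask"]["OutputUri"]
```

With Polly, the task writes the audio file to the bucket and the response points to its location. With Google, the response carries the audio as base64-encoded `audioContent`, so decoding it and storing it somewhere is left to the caller.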
But the best way to get started with creating audio using Polly and Wavenet is Play.ht. Our Dashboard gives you access to all the voices from these providers. It allows you to create audio using an audio editor by simply copy-pasting text or fetching from a URL. You can then embed the audio in an article or download it and use it as a voiceover in your video. You can also distribute the audio as podcasts on iTunes, Spotify, and Soundcloud.
Conclusion
Even though Google Wavenet provides more voices than Amazon Polly, we think Amazon Polly wins. This is because of the distinct Speaking Styles it provides with some of its voices, the option to create a Brand Voice, and the ability to create and reuse custom pronunciations throughout the text.
We are certainly in an era where AI voices have a great role to play. Beyond the obvious benefits of Text to Speech, such as accessibility, enhanced learning, mobility and freedom, and speed and affordability, AI voices will make a deeper impact on Audio Publishing, E-learning and Training, Customer Service, and Media and Entertainment in the coming years, with many new use cases and applications.