A text-to-speech tool using AWS Polly
Last year I wrote about converting books to speech, where I investigated using open source and free tools to scan textbooks and convert them into audio files.
At that time, the weakest part of the process was the text-to-speech step itself. Festival, the open source solution, doesn’t have great voices, hasn’t been updated in years, and is hard to use. I ended up using the Cepstral software, which works fine, but it has a graphical interface and is mainly for Windows and OS X. What if I want to automate the process completely from the Linux command line interface (CLI)?
Last month, Amazon debuted Polly, the latest in its long line of web services. Using the AWS API, you can convert a snippet of text into speech in seconds. Like most AWS products, it is on-demand and low-cost: you get 5 million characters per month free for the first 12 months, and pay $4.00 per million characters after that.
Submitting text to Polly is pretty easy, using the AWS CLI tool:
$ aws polly synthesize-speech --output-format mp3 --text "Here is my text" --voice-id Joanna output.mp3
The tricky part is that the API only allows ~1,500 characters of text per request. How do you convert a large amount of text, such as a book? Not finding a tool that would do this for me, I decided to create one.
aws-tts is a CLI tool that converts a text file into an audio file using AWS Polly. It’s designed to be simple to use and completely hands-off. All you need to do is specify the input file, and where you want the resulting audio file to be saved:
$ aws-tts my-test-file.txt resulting-speech.mp3
That’s it! If you want something other than an MP3 or the default voice (Joanna), you can specify those options on the command line.
All in all, the tool is only about 200 lines of code. The process is straightforward, but uses a few interesting libraries.
First, it splits the text. I used the textchunk module (which is basically sbd) to produce pieces small enough for AWS without breaking in the middle of a word or a sentence.
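To make the splitting step concrete, here is a minimal sketch in plain JavaScript. It is not the textchunk or sbd API: the function names (chunkText, splitOnWords) are hypothetical, the sentence detection is a naive regex, and the 1,500-character default is just the approximate limit mentioned earlier. It only illustrates the idea of packing whole sentences into pieces that stay under a size limit.

```javascript
// Hypothetical helpers: a plain-JS sketch, NOT the textchunk or sbd API.
const MAX_CHARS = 1500; // assumed limit, taken from the ~1,500 figure in the text

// Pack whole sentences into chunks no longer than maxChars.
function chunkText(text, maxChars = MAX_CHARS) {
  // Naive sentence split: break on whitespace that follows ., ! or ?
  const sentences = text.trim().split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    // A sentence longer than the limit falls back to word boundaries.
    const parts = sentence.length > maxChars
      ? splitOnWords(sentence, maxChars)
      : [sentence];
    for (const part of parts) {
      const candidate = current ? current + ' ' + part : part;
      if (current && candidate.length > maxChars) {
        chunks.push(current); // current chunk is full, start a new one
        current = part;
      } else {
        current = candidate;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Split one over-long sentence between words, never inside a word.
function splitOnWords(sentence, maxChars) {
  const pieces = [];
  let current = '';
  for (const word of sentence.split(/\s+/)) {
    const candidate = current ? current + ' ' + word : word;
    if (current && candidate.length > maxChars) {
      pieces.push(current);
      current = word; // a single over-long word is kept whole
    } else {
      current = candidate;
    }
  }
  if (current) pieces.push(current);
  return pieces;
}
```

A real sentence-boundary library handles abbreviations, quotes, and decimals far better than this regex, which is a good reason to use one instead of rolling your own.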
Even a moderate amount of text can be split into hundreds of pieces (my textbook chapters divided into ~180 parts), so sending every request at once risks exceeding Amazon’s rate limits. To throttle the requests, I used the popular async module, specifically
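The throttling idea can be sketched without any library. The snippet below is a stand-in written with plain Promises, not the async module’s API and not the tool’s actual code. The name mapWithLimit is hypothetical. It runs at most `limit` workers at a time and returns results in input order, which matters here because the audio parts must be stitched together in sequence.

```javascript
// Hypothetical helper: a generic concurrency limiter, standing in for the
// async module's limit-style helpers. Not the tool's actual code.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner pulls items until none are left, so at most
  // `limit` workers are in flight at any moment.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results; // same order as `items`
}
```

In use, the worker would be whatever function sends one piece of text to Polly and saves the returned audio, for example `mapWithLimit(chunks, 5, synthesizeChunk)`, where both `synthesizeChunk` and the limit of 5 are placeholders.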
Once all of the audio files are received, they are stitched together into one file using ffmpeg’s “concat” demuxer. I tried using the Emscripten port of ffmpeg so that users wouldn’t need ffmpeg installed on their machines, but I couldn’t get it to find the temporary files — it kept giving “file not found” errors, even when the normal ffmpeg binary worked fine. (If you have an idea of how to fix this, let me know.)
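For reference, the concat demuxer reads a small text file with one `file '<path>'` line per input, and is invoked with `-f concat`. The sketch below shows one way to drive it from Node with the regular ffmpeg binary. The function names are hypothetical and this is not necessarily how aws-tts does it. The `-safe 0` flag is included on the assumption that the temporary files may be referenced by absolute path, and `-y` overwrites an existing output file without prompting.

```javascript
const { spawn } = require('child_process');
const fs = require('fs');

// Build the concat list: one "file '<path>'" line per part, in order.
// Single quotes inside a path are escaped as '\'' per ffmpeg's quoting rules.
function buildConcatList(paths) {
  return paths
    .map((p) => `file '${p.replace(/'/g, "'\\''")}'`)
    .join('\n') + '\n';
}

// Arguments for the concat demuxer; "-c copy" joins without re-encoding.
function buildConcatArgs(listPath, outputPath) {
  return ['-y', '-f', 'concat', '-safe', '0', '-i', listPath, '-c', 'copy', outputPath];
}

// Write the list file, then run ffmpeg (must be installed and on the PATH).
function concatAudio(paths, listPath, outputPath) {
  fs.writeFileSync(listPath, buildConcatList(paths));
  return new Promise((resolve, reject) => {
    const child = spawn('ffmpeg', buildConcatArgs(listPath, outputPath));
    child.on('error', reject); // e.g. ffmpeg binary not found
    child.on('close', (code) => {
      if (code === 0) resolve(outputPath);
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
  });
}
```

Copying the streams instead of re-encoding keeps the join fast, but it assumes every part was produced with the same format and settings, which holds when they all come from the same Polly request options.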
Finally, I used the nice ora spinner to provide feedback to the user while the process is running.
AWS is a great platform, and the Polly service has several advantages for converting text to speech:
- The speech quality is as good as anything else out there, and will only get better.
- It is much faster than the Cepstral software; a book chapter will encode in about a minute, whereas Cepstral takes almost an hour.
- It provides a cross-platform, command-line way to convert text to speech.
- It encodes to MP3 out of the box — no need to convert WAV/PCM files yourself.
The main advantage of Cepstral is that it has a relatively small, one-time cost. If you plan on converting a lot of text over time, especially for personal use, then Cepstral’s added inconvenience may be worth it to avoid the per-character costs you would accrue with AWS.
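To put rough numbers on that trade-off, here is a small sketch based only on the $4.00 per million characters rate quoted earlier. The function names are hypothetical, the free tier is left as an optional parameter, and the one-time licence price is something you would need to supply yourself.

```javascript
// USD per million characters, the post-free-tier rate quoted in this post.
const PRICE_PER_MILLION_CHARS = 4.0;

// Estimated Polly cost for a given number of characters.
// freeChars is whatever free allowance still applies (0 once it has expired).
function estimatePollyCost(charCount, freeChars = 0) {
  const billable = Math.max(0, charCount - freeChars);
  return (billable / 1e6) * PRICE_PER_MILLION_CHARS;
}

// Number of characters at which Polly's running cost matches a one-time price.
function breakEvenChars(oneTimeCost) {
  return (oneTimeCost / PRICE_PER_MILLION_CHARS) * 1e6;
}
```

For example, `estimatePollyCost(750000)` comes to $3.00 for a 750,000-character book (a made-up size), and `breakEvenChars(price)` tells you how many characters you would have to convert before a one-time purchase at `price` becomes the cheaper option.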