A text-to-speech tool using AWS Polly

Posted on January 30, 2017

Last year I wrote about converting books to speech, where I investigated using open source and free tools to scan textbooks and convert them into audio files.

At that time, the weakest part of the process was the actual text-to-speech part. Festival, the open source solution, doesn’t have great voices, hasn’t been updated in years, and is hard to use. I ended up using the Cepstral software, which works fine, but it has a graphical interface and is mainly for Windows and OSX. What if I want to automate the process completely from the Linux command line interface (CLI)?

Enter Polly

Last month, Amazon debuted Polly, the latest in its long line of web services. Using the AWS API, you can convert a snippet of text into speech in seconds. Like most AWS products, it is on-demand and low-cost — you get 5 million characters per month free for the first 12 months, and a million characters for $4.00 after that.

Submitting text to Polly is pretty easy, using the AWS CLI tool:

$ aws polly synthesize-speech --output-format mp3 --text "Here is my text" --voice-id Joanna output.mp3

The tricky part is that the API only allows ~1,500 characters of text per request. How do you convert a large amount of text, such as a book? Not finding a tool that would do this for me, I decided to create one.

aws-tts

aws-tts is a CLI tool that converts a text file into an audio file using AWS Polly. It’s designed to be simple to use and completely hands-off. All you need to do is specify the input file, and where you want the resulting audio file to be saved:

$ aws-tts my-test-file.txt resulting-speech.mp3

[Image: aws-tts screen capture]

That’s it! If you want something other than an MP3 or the default voice (Joanna), you can specify those options on the command line.

Technical Details

All in all, the tool is only about 200 lines of code. The process is straightforward, but uses a few interesting libraries.

First, it splits the text into chunks. I used the textchunk module (which is basically a wrapper around sbd) to produce pieces small enough for AWS, without breaking the text in the middle of a word or a sentence.
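The chunking idea can be sketched in a few lines. This is only an illustration and not the textchunk or sbd API: the function names are hypothetical, the sentence detection is a naive regular expression, and the 1,500-character figure is the approximate limit mentioned earlier.

```javascript
// Hypothetical helpers: an illustration only, not the textchunk/sbd API.
const MAX_CHARS = 1500; // approximate per-request limit mentioned above

// Naive sentence splitter: breaks after ., !, or ? followed by whitespace.
function splitSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Break one over-long sentence at word boundaries.
function splitLongSentence(sentence, maxLen) {
  const pieces = [];
  let current = '';
  for (const word of sentence.split(/\s+/)) {
    const candidate = current ? current + ' ' + word : word;
    if (candidate.length <= maxLen || current === '') {
      current = candidate; // a single word longer than maxLen is kept whole
    } else {
      pieces.push(current);
      current = word;
    }
  }
  if (current) pieces.push(current);
  return pieces;
}

// Greedily pack whole sentences into chunks of at most maxLen characters.
function chunkText(text, maxLen = MAX_CHARS) {
  const chunks = [];
  let current = '';
  for (const sentence of splitSentences(text)) {
    const units = sentence.length > maxLen
      ? splitLongSentence(sentence, maxLen)
      : [sentence];
    for (const unit of units) {
      const candidate = current ? current + ' ' + unit : unit;
      if (candidate.length <= maxLen) {
        current = candidate;
      } else {
        if (current) chunks.push(current);
        current = unit;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Calling chunkText(bookText) returns an array of strings, each of which could be sent as one request. A real sentence-boundary library handles abbreviations and quotations far better than the regular expression used here.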

Because even a moderate amount of text can be split into hundreds of pieces (my textbook chapters each divided into ~180 parts), we don’t want to exceed Amazon’s rate limits. To throttle the requests, I used the popular async module, specifically eachOfLimit(). The requests use the official AWS SDK for JavaScript. Each resulting audio clip is stored in a temporary file using tempfile.
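To show what limiting the number of in-flight requests looks like, here is a small hand-rolled sketch using plain promises. It is not the async module's eachOfLimit() and not the tool's actual code; mapWithLimit and fakeSynthesize are hypothetical names, and fakeSynthesize merely stands in for a real Polly request.

```javascript
// Run `worker(item, index)` for every item, with at most `limit` calls in
// flight at once. Resolves to an array of results in the original order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  // Each runner keeps pulling the next unclaimed item until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Hypothetical stand-in for a Polly request: resolves after a short delay.
function fakeSynthesize(chunk) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`audio for "${chunk}"`), 10);
  });
}
```

For example, mapWithLimit(chunks, 5, fakeSynthesize) would process every chunk while never having more than five requests outstanding. The right limit depends on the service's rate limits, so the value 5 is only a placeholder.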

Once all of the audio files are received, they are stitched together into one file using ffmpeg’s “concat” demuxer. I tried using the Emscripten port of ffmpeg so that users wouldn’t need ffmpeg installed on their machines, but I couldn’t get it to find the temporary files — it kept giving “file not found” errors, even when the normal ffmpeg binary worked fine. (If you have an idea of how to fix this, let me know.)
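For readers unfamiliar with the concat demuxer: it reads a small text file that lists the parts in order, one "file" directive per line. The sketch below builds that list and a matching argument array. The function names are hypothetical, and the choice of "-safe 0" (to allow absolute paths) and "-c copy" (to avoid re-encoding) is an assumption rather than a description of what aws-tts passes.

```javascript
// Build the text of a concat list file: one `file '<path>'` line per part.
// A single quote inside a path is written as '\'' (close quote, escaped
// quote, reopen quote), following ffmpeg's quoting rules.
function buildConcatList(files) {
  return files
    .map((f) => `file '${f.replace(/'/g, "'\\''")}'`)
    .join('\n') + '\n';
}

// Arguments for: ffmpeg -f concat -safe 0 -i <list> -c copy <output>
function buildFfmpegArgs(listFile, outputFile) {
  return ['-f', 'concat', '-safe', '0', '-i', listFile, '-c', 'copy', outputFile];
}

// Example with made-up paths.
console.log(buildConcatList(['/tmp/part-000.mp3', '/tmp/part-001.mp3']));
console.log(['ffmpeg', ...buildFfmpegArgs('/tmp/parts.txt', 'book.mp3')].join(' '));
```

The list text would be written to a temporary file, and that file's path passed as the "-i" argument.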

The asynchronous parts (the API requests and the ffmpeg process) are handled with JavaScript promises. The fs-extra module provides extra filesystem functionality seamlessly on top of Node’s built-in filesystem functions.
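Wrapping an external process in a promise generally looks something like the following. This is a sketch using Node's built-in child_process module, with runCommand as a hypothetical name; it is not copied from the tool's source.

```javascript
const { spawn } = require('child_process');

// Run a command; resolve when it exits with code 0, reject otherwise.
function runCommand(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    // 'error' fires when the process cannot be started (e.g. binary missing).
    child.on('error', reject);
    // 'close' fires once the process has exited and its streams are closed.
    child.on('close', (code) => {
      if (code === 0) {
        resolve();
      } else {
        reject(new Error(`${command} returned an error (${code})`));
      }
    });
  });
}
```

A caller could then write runCommand('ffmpeg', args).then(...) and chain it after the API requests, so that a non-zero exit code surfaces as an ordinary promise rejection.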

Finally, I used the nice ora spinner to provide feedback to the user while the process is running.

Results

AWS is a great platform and the Polly service provides a great way to convert text to speech.

  • The speech quality is as good as anything else out there, and will only get better.
  • It is much faster than the Cepstral software; a book chapter will encode in about a minute, whereas Cepstral takes almost an hour.
  • It provides a cross-platform, command-line way to convert text to speech.
  • It encodes to MP3 out of the box — no need to convert WAV/PCM files yourself.

The main advantage of Cepstral is that it has a relatively small, one-time cost. If you plan on converting a lot of text over time, especially for personal use, then Cepstral’s added inconvenience may be worth putting up with to avoid the costs you would accrue with AWS.
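To put a rough number on the AWS side, here is a back-of-the-envelope estimate using the figures quoted earlier in this post: $4.00 per million characters after the free tier, about 1,500 characters per request, and roughly 180 requests per chapter. It gives an upper bound, since not every chunk is full, and it assumes the pricing has not changed.

```javascript
const PRICE_PER_MILLION_CHARS = 4.0; // USD, the rate quoted earlier in this post
const CHARS_PER_REQUEST = 1500;      // approximate per-request limit

// Estimated cost in USD once the free tier no longer applies.
function estimateCost(characters) {
  return (characters * PRICE_PER_MILLION_CHARS) / 1e6;
}

// A chapter that splits into ~180 requests is at most about 270,000 characters.
const chapterChars = 180 * CHARS_PER_REQUEST;
console.log(`~$${estimateCost(chapterChars).toFixed(2)} per chapter`);
// prints "~$1.08 per chapter"
```

At that rate, a 20-chapter textbook would come to about $21.60 once the free tier runs out.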

Get my TTS tool here: github.com/eheikes/aws-tts. It’s brand new, so please submit bugs & features to the GitHub repository or in the comments to this post.

14 Responses to A text-to-speech tool using AWS Polly

  2. Hi Eric, thanks a lot for your magic codes. I don’t have much experience in coding. Could you please take a look at the error codes I had below? Thanks.

    aws-tts test.txt test.mp3
    ✔ Reading text
    ✔ Splitting text
    ✖ Convert to audio (4/282)
    HTTPError: Response code 403 (Forbidden)
    at EventEmitter.ee.on.res (/usr/local/lib/node_modules/aws-tts/node_modules/got/index.js:182:24)
    at emitOne (events.js:96:13)
    at EventEmitter.emit (events.js:188:7)
    at Immediate.setImmediate (/usr/local/lib/node_modules/aws-tts/node_modules/got/index.js:61:8)
    at runCallback (timers.js:672:20)
    at tryOnImmediate (timers.js:645:5)
    at processImmediate [as _immediateCallback] (timers.js:617:5)

  3. Hi, have you figured out a way to overcome the 1,500-character limit? Can I make a loop in PHP? Suggestions or code examples appreciated.

      • Hi again… As a workaround for the character limit, I was able to use another TTS interface, provided by IBM. With this second option I got access to additional voices and freed myself from the limit currently imposed by AWS. I will be holding an online meeting with them next week, so I will bring up the subject and ask them to either create that feature or offer an API option or workaround. I hope this post is useful to you and other users who may have a similar scenario.

  4. Hi, great tool you created. I wonder if we can compress the MP3 file size more.
    One minute of MP3 takes 442 KB, which is an awful lot.

  5. Hi,

    I’m trying to use your nice software but when I try to convert even a simple text file I always get this error:
    >aws-tts prova.txt prova.mp3
    V Reading text
    V Splitting text
    V Convert to audio (1/1)
    × Combine audio
    Error: ffmpeg returned an error (1)
    at ChildProcess.ffmpeg.on.code (C:\Users\Danilo\AppData\Roaming\npm\node_modules\aws-tts\lib.js:160:25)
    at emitTwo (events.js:106:13)
    at ChildProcess.emit (events.js:194:7)
    at maybeClose (internal/child_process.js:899:16)
    at Socket.<anonymous> (internal/child_process.js:342:11)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:191:7)
    at Pipe._handle.close [as _onclose] (net.js:511:12)
    Could you help me, please?
    Thanks

    • Hi Danilo,

      Thanks for reporting this. I’ll see if I can reproduce it myself. I think I know what the problem may be.

      In the meantime, can you verify that ffmpeg is working on your computer? Running `ffmpeg.exe -version` on the command line should print the ffmpeg details.

      Thanks,
      Eric

      • Hi Eric,

        I think it’s working. This is the version:

        ffmpeg version 3.2.4 Copyright (c) 2000-2017 the FFmpeg developers
        built with gcc 6.3.0 (GCC)
        configuration: --enable-gpl --enable-version3 --enable-d3d11va --enable-dxva2
          --enable-libmfx --enable-nvenc --enable-avisynth --enable-bzlib --enable-fontconfig
          --enable-frei0r --enable-gnutls --enable-iconv --enable-libass --enable-libbluray
          --enable-libbs2b --enable-libcaca --enable-libfreetype --enable-libgme --enable-libgsm
          --enable-libilbc --enable-libmodplug --enable-libmp3lame --enable-libopencore-amrnb
          --enable-libopencore-amrwb --enable-libopenh264 --enable-libopenjpeg --enable-libopus
          --enable-librtmp --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora
          --enable-libtwolame --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis
          --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265
          --enable-libxavs --enable-libxvid --enable-libzimg --enable-lzma --enable-zlib
        libavutil 55. 34.101 / 55. 34.101
        libavcodec 57. 64.101 / 57. 64.101
        libavformat 57. 56.101 / 57. 56.101
        libavdevice 57. 1.100 / 57. 1.100
        libavfilter 6. 65.100 / 6. 65.100
        libswscale 4. 2.100 / 4. 2.100
        libswresample 2. 3.100 / 2. 3.100
        libpostproc 54. 1.100 / 54. 1.100

        Thank you