A text-to-speech tool using AWS Polly

Posted on January 30, 2017

Last year I wrote about converting books to speech, where I investigated using open source and free tools to scan textbooks and convert them into audio files.

At that time, the weakest part of the process was the actual text-to-speech part. Festival, the open source solution, doesn’t have great voices, hasn’t been updated in years, and is hard to use. I ended up using the Cepstral software, which works fine, but it has a graphical interface and is mainly for Windows and OSX. What if I want to automate the process completely from the Linux command line interface (CLI)?

Enter Polly

Last month, Amazon debuted Polly, the latest in its long line of web services. Using the AWS API, you can convert a snippet of text into speech in seconds. Like most AWS products, it is on-demand and low-cost: you get 5 million characters per month free for the first 12 months, and pay $4.00 per million characters after that.
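
As a rough illustration of that pricing, the arithmetic can be sketched like this. The function name estimateCostUSD and its freeCharacters parameter are hypothetical, and the $4.00 rate is only the figure quoted above, which may change over time.

```javascript
// Sketch: estimate the Polly cost from the pricing quoted in this post.
// estimateCostUSD is a hypothetical helper; the rate is an assumption
// taken from the text, not read from any AWS API.
function estimateCostUSD(characters, freeCharacters = 0) {
  const PRICE_PER_MILLION = 4.0; // USD per one million characters
  const billable = Math.max(0, characters - freeCharacters);
  return (billable / 1e6) * PRICE_PER_MILLION;
}
```

Under these numbers, 500,000 characters beyond any free allowance would come to about $2.00.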

Submitting text to Polly is pretty easy, using the AWS CLI tool:

$ aws polly synthesize-speech --output-format mp3 --text "Here is my text" --voice-id Joanna output.mp3

The tricky part is that the API only allows ~1,500 characters of text per request. How do you convert a large amount of text, such as a book? Not finding a tool that would do this for me, I decided to create one.

aws-tts

aws-tts is a CLI tool that converts a text file into an audio file using AWS Polly. It’s designed to be simple to use and completely hands-off. All you need to do is specify the input file, and where you want the resulting audio file to be saved:

$ aws-tts my-test-file.txt resulting-speech.mp3

[Image: aws-tts screen capture]

That’s it! If you want something other than an MP3 or the default voice (Joanna), you can specify those options on the command line.

Technical Details

All in all, the tool is only about 200 lines of code. The process is straightforward, but uses a few interesting libraries.

First, it splits the text into pieces. I used the textchunk module (which is basically sbd) so that each piece is small enough for AWS, without breaking the text in the middle of a word or a sentence.
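
The idea can be sketched as follows. This is a simplified stand-in, not the textchunk or sbd code: the hypothetical chunkText function finds sentence boundaries with a naive regular expression (real sentence detection handles abbreviations and similar cases), then packs whole sentences into chunks up to a maximum length. The 1,500-character default simply mirrors the limit mentioned earlier.

```javascript
// Simplified sketch: pack whole sentences into chunks of at most maxLen
// characters. chunkText is a hypothetical helper, not part of textchunk or sbd.
function chunkText(text, maxLen = 1500) {
  // Naive sentence split: break on whitespace that follows ., ! or ?
  const sentences = text.trim().split(/(?<=[.!?])\s+/).filter(s => s.length > 0);
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    const candidate = current ? current + ' ' + sentence : sentence;
    if (candidate.length <= maxLen) {
      current = candidate; // the sentence still fits in the current chunk
      continue;
    }
    if (current) chunks.push(current);
    // Start a new chunk. If the sentence alone is longer than maxLen,
    // fall back to splitting it at word boundaries.
    let piece = '';
    for (const word of sentence.split(/\s+/)) {
      const next = piece ? piece + ' ' + word : word;
      if (next.length <= maxLen) {
        piece = next;
      } else {
        if (piece) chunks.push(piece);
        piece = word; // a single word longer than maxLen is kept whole
      }
    }
    current = piece;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each element of the returned array could then be sent to Polly as one request.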

Because even a moderate amount of text can be split into hundreds of pieces (my textbook chapters divided into ~180 parts each), sending every request at once risks exceeding Amazon’s rate limits. To throttle the requests, I used the popular async module, specifically eachOfLimit(). The requests use the official AWS SDK for JavaScript. The resulting audio files are stored in temporary files using tempfile.
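
To show what limiting concurrency means in practice, here is a minimal sketch using plain promises. It is a stand-in for the async module, not the aws-tts source: mapWithLimit is a hypothetical helper, and the worker argument represents whatever function sends one chunk to Polly and saves the result.

```javascript
// Sketch of limited concurrency with plain promises (a stand-in for the
// async module's eachOfLimit). mapWithLimit is a hypothetical helper name.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;
  // Each runner keeps claiming the next unprocessed item until none remain,
  // so at most `limit` workers are in flight at any time.
  async function runner() {
    while (nextIndex < items.length) {
      const i = nextIndex++;
      results[i] = await worker(items[i], i);
    }
  }
  const runners = [];
  for (let n = 0; n < Math.min(limit, items.length); n++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results; // results are in the same order as items
}
```

With a limit of, say, 5, a chapter split into 180 parts would never have more than 5 requests pending at once.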

Once all of the audio files are received, they are stitched together into one file using ffmpeg’s “concat” demuxer. I tried using the Emscripten port of ffmpeg so that users wouldn’t need ffmpeg installed on their machines, but I couldn’t get it to find the temporary files — it kept giving “file not found” errors, even when the normal ffmpeg binary worked fine. (If you have an idea of how to fix this, let me know.)
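
For readers unfamiliar with the concat demuxer: it reads a small text file listing the inputs, one per line, and joins them in order. A sketch of preparing that list and the ffmpeg arguments might look like this. The helper names buildConcatList and buildFfmpegArgs are hypothetical, and the exact arguments aws-tts passes may differ.

```javascript
// Sketch: build the list file contents and arguments for ffmpeg's concat
// demuxer. Both helper names are hypothetical, not taken from aws-tts.
function buildConcatList(paths) {
  // Each line has the form: file '<path>'
  // A single quote inside a path is written as '\'' in the list file.
  return paths
    .map(p => "file '" + p.replace(/'/g, "'\\''") + "'")
    .join('\n') + '\n';
}

function buildFfmpegArgs(listFile, outputFile) {
  return [
    '-f', 'concat', // treat the input as a concat list
    '-safe', '0',   // allow absolute paths in the list
    '-i', listFile,
    '-c', 'copy',   // copy the audio streams without re-encoding
    outputFile
  ];
}
```

The list text would be written to a temporary file, and the arguments passed to the ffmpeg binary.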

The asynchronous parts (the API requests and the ffmpeg process) are handled with JavaScript promises. The fs-extra module provides extra filesystem functionality on top of Node’s built-in filesystem functions.
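
As an illustration of wrapping an external process in a promise, here is a minimal sketch built on Node’s child_process module. The runCommand helper is hypothetical and is not the code used in aws-tts.

```javascript
const { spawn } = require('child_process');

// Sketch: run a command and settle a promise when it exits.
// runCommand is a hypothetical helper name.
function runCommand(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: 'ignore' });
    // 'error' fires if the process could not be started at all,
    // for example when the binary is not installed.
    child.on('error', reject);
    child.on('close', code => {
      if (code === 0) resolve();
      else reject(new Error(`${command} returned an error (${code})`));
    });
  });
}
```

A caller could then write something like runCommand('ffmpeg', args).then(...) and chain it after the API requests.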

Finally, I used the nice ora spinner to provide feedback to the user while the process is running.

Results

AWS is a great platform and the Polly service provides a great way to convert text to speech.

  • The speech quality is as good as anything else out there, and will only get better.
  • It is much faster than the Cepstral software; a book chapter will encode in about a minute, whereas Cepstral takes almost an hour.
  • It provides a cross-platform, command-line way to convert text to speech.
  • It encodes to MP3 out of the box — no need to convert WAV/PCM files yourself.

The main advantage of Cepstral is that it has a relatively small, one-time cost. If you plan on converting a lot of text over time, especially for personal use, then Cepstral’s added inconvenience may be worth it to avoid the costs you would accrue with AWS.

Get my TTS tool here: github.com/eheikes/aws-tts. It’s brand new, so please submit bug reports and feature requests to the GitHub repository.

25 Responses to A text-to-speech tool using AWS Polly

  2. Hi Eric,

    Thank you for a wonderful tool. By the way, Google has just launched its own version of AWS Polly called Google Cloud Text-to-Speech API. From my point of view, voices with WaveNet are better than Polly. However, as far as I know, there is no command-line tool for converting books to speech by using Google Cloud Platform.

    I’m wondering if you’re planning to develop something similar to your aws-tts tool for the Google Cloud Text-to-Speech platform.

    Kind regards,
    Serge

    • Hi Serge,

      I haven’t given any thought to Google Cloud’s TTS. I only created this project to satisfy my own TTS needs, so I haven’t had to explore beyond Polly :)

      However, I understand how having a Google Cloud option would be useful to others. If people are interested in this option, I could certainly look into it. And the project is open source, so anyone can submit a patch to add Google support.

      I’ll take a look at their API in the next day or so to get an idea of the scope. Thanks for bringing it to my attention.

      Eric

  3. Hi,
    Thank you for developing this. Once I get it working, it will really hit the spot.
    I have essentially no Homebrew experience but really want to have my school materials transcribed using Polly. I’d appreciate any help. Thanks.

    I installed Homebrew, Node.js, and ffmpeg.
    macOS High Sierra 10.13.3

    Running aws-tts seems to successfully read text, split text, and convert to audio (99/99). However, it fails to combine audio.

    I think this is the pertinent error:

    [mp3 @ 0x7f8796008600] Format mp3 detected only with low score of 1, misdetection possible!

    [mp3 @ 0x7f8796008600] Failed to read frame size: Could not seek to 1030.
    [concat @ 0x7f8796000000] Impossible to open '/var/folders/xl/mwfv7m896z1g2fnr6c3mrgy00000gn/T/bc1cd827-1d0c-419a-876d-c632eb7ca72a.mp3'

    /var/folders/xl/mwfv7m896z1g2fnr6c3mrgy00000gn/T/0b32618a-458b-47ea-92d2-7d18b0c67c9c.txt: Invalid argument

    I’ll appreciate any guidance. Thanks,
    AV

    • Hi AV!

      That’s an interesting error… It’s coming from ffmpeg. Can you provide me with any of the following info?

      1) Run `node -v; ffmpeg -version` (without the quotes) in the terminal and paste the output here.

      2) Run `file /var/folders/xl/mwfv7m896z1g2fnr6c3mrgy00000gn/T/bc1cd827-1d0c-419a-876d-c632eb7ca72a.mp3` and `cat /var/folders/xl/mwfv7m896z1g2fnr6c3mrgy00000gn/T/0b32618a-458b-47ea-92d2-7d18b0c67c9c.txt` in the terminal and paste the output here.

      3) What is the exact aws-tts command you are running?

      4) If possible, upload the text file to pastebin.com or somewhere and link to it here.

      Thanks,
      Eric

      • Thanks for the help. Here you go.

        1)
        v9.4.0
        ffmpeg version N-89776-gb94cd55155-tessus Copyright (c) 2000-2018 the FFmpeg developers
        built with Apple LLVM version 9.0.0 (clang-900.0.39.2)
        configuration: --cc=/usr/bin/clang --prefix=/opt/ffmpeg --extra-version=tessus --enable-avisynth --enable-fontconfig --enable-gpl --enable-libass --enable-libbluray --enable-libfreetype --enable-libgsm --enable-libmodplug --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopus --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libx264 --enable-libx265 --enable-libxavs --enable-libxvid --enable-libzmq --enable-libzvbi --enable-version3 --pkg-config-flags=--static --disable-ffplay
        libavutil 56. 7.100 / 56. 7.100
        libavcodec 58. 9.100 / 58. 9.100
        libavformat 58. 3.100 / 58. 3.100
        libavdevice 58. 0.100 / 58. 0.100
        libavfilter 7. 11.101 / 7. 11.101
        libswscale 5. 0.101 / 5. 0.101
        libswresample 3. 0.101 / 3. 0.101
        libpostproc 55. 0.100 / 55. 0.100

        2)
        /var/folders/xl/mwfv7m896z1g2fnr6c3mrgy00000gn/T/bc1cd827-1d0c-419a-876d-c632eb7ca72a.mp3: cannot open `/var/folders/xl/mwfv7m896z1g2fnr6c3mrgy00000gn/T/bc1cd827-1d0c-419a-876d-c632eb7ca72a.mp3' (No such file or directory)

        cat: /var/folders/xl/mwfv7m896z1g2fnr6c3mrgy00000gn/T/0b32618a-458b-47ea-92d2-7d18b0c67c9c.txt: No such file or directory

        3) I used the command `aws-tts AuthorDatetxt2Polly.txt AuthorDate.mp3 --access-key [KEY] --format mp3`

        4) https://pastebin.com/SP8d3PXk

  4. Hi, this is an amazing tool. I’ve been trying to find a reliable way to convert text into quality TTS for a while, mainly so I can review my own writing on the drive to and from work.

    I was able to install the components and connect to my AWS account without trouble, and can generate the audio from a standard text file. However, I’m having a hard time getting it to parse SSML. It keeps returning an “Invalid SSML Request” error.

    I’ve tried many variations of the markup inside the file, and the file itself is a .xml file (I also tried a .txt file, but neither worked).

    Do you have an example of a valid SSML file with text in it that I could use as a template? I’d love to be able to utilize tags. Thank you again – this is great, either way.

  5. Hi Eric, thanks a lot for your magic code. I don’t have much experience in coding. Could you please take a look at the error I got below? Thanks.

    aws-tts test.txt test.mp3
    ✔ Reading text
    ✔ Splitting text
    ✖ Convert to audio (4/282)
    HTTPError: Response code 403 (Forbidden)
    at EventEmitter.ee.on.res (/usr/local/lib/node_modules/aws-tts/node_modules/got/index.js:182:24)
    at emitOne (events.js:96:13)
    at EventEmitter.emit (events.js:188:7)
    at Immediate.setImmediate (/usr/local/lib/node_modules/aws-tts/node_modules/got/index.js:61:8)
    at runCallback (timers.js:672:20)
    at tryOnImmediate (timers.js:645:5)
    at processImmediate [as _immediateCallback] (timers.js:617:5)

      • Hi Eric,

        Incidentally, I got the same error as above earlier today, when I tried using a lexicon that didn’t exist (well, I had the lexicon in a European region, and was using the default US region). Once I added a --region option to the correct place, it all worked again.

        Gareth

  6. Hi, have you figured out a way to overcome the 1,500-character limit? Can I make a loop in PHP? Suggestions or code examples appreciated.

      • Hi again… As a way around the character limit, I was able to use another TTS interface provided by IBM. With this second option, I was able to use additional new voices and free myself from the limit imposed by AWS. I will be holding an online meeting with them next week, so I will bring up the subject of either creating that feature or having an API option or workaround to make it work. I hope this post is useful to you and other users who may have a similar scenario.

  7. Hi, great tool you created. I wonder if we can reduce the MP3 file size further.
    One minute of MP3 takes up 442 KB, which is an awful lot.

  8. Hi,

    I’m trying to use your nice software but when I try to convert even a simple text file I always get this error:
    >aws-tts prova.txt prova.mp3
    V Reading text
    V Splitting text
    V Convert to audio (1/1)
    × Combine audio
    Error: ffmpeg returned an error (1)
    at ChildProcess.ffmpeg.on.code (C:\Users\Danilo\AppData\Roaming\npm\node_modules\aws-tts\lib.js:160:25)
    at emitTwo (events.js:106:13)
    at ChildProcess.emit (events.js:194:7)
    at maybeClose (internal/child_process.js:899:16)
    at Socket.<anonymous> (internal/child_process.js:342:11)
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:191:7)
    at Pipe._handle.close [as _onclose] (net.js:511:12)
    Could you help me, please?
    Thanks

    • Hi Danilo,

      Thanks for reporting this. I’ll see if I can reproduce it myself. I think I know what the problem may be.

      In the meantime, can you verify that ffmpeg is working on your computer? Running `ffmpeg.exe -version` on the command line should print the ffmpeg details.

      Thanks,
      Eric

      • Hi Eric,

        I think it’s working. This is the version:

        ffmpeg version 3.2.4 Copyright (c) 2000-2017 the FFmpeg developers
        built with gcc 6.3.0 (GCC)
        configuration: --enable-gpl --enable-version3 --enable-d3d11va --enable-dxva2 --enable-libmfx --enable-nvenc --enable-avisynth --enable-bzlib --enable-fontconfig --enable-frei0r --enable-gnutls --enable-iconv --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libfreetype --enable-libgme --enable-libgsm --enable-libilbc --enable-libmodplug --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenh264 --enable-libopenjpeg --enable-libopus --enable-librtmp --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs --enable-libxvid --enable-libzimg --enable-lzma --enable-zlib
        libavutil 55. 34.101 / 55. 34.101
        libavcodec 57. 64.101 / 57. 64.101
        libavformat 57. 56.101 / 57. 56.101
        libavdevice 57. 1.100 / 57. 1.100
        libavfilter 6. 65.100 / 6. 65.100
        libswscale 4. 2.100 / 4. 2.100
        libswresample 2. 3.100 / 2. 3.100
        libpostproc 54. 1.100 / 54. 1.100

        Thank you