Converting books to MP3 audio (text-to-speech)
My wife recently (re)started school at Iowa State University, which is an hour-long drive from where we live. To make the commute more productive, I looked into converting her paper and digital textbooks into audiobooks.
The first part is the most annoying, since most of us do not have an easy way to scan bound books. A flatbed scanner provides the best quality, but you can try a phone camera, as long as the photo is high resolution and in focus and the page lies fairly flat. (In my experiments, my Samsung Galaxy S5 phone took a perfectly fine photo at its highest camera setting, 5312×2988 pixels and around 4.5MB, but the light and focus weren’t as good as a scanner’s.) Save the scanned pages as image files; the software that comes with your scanner should make this easy.
Converting images to text
Tesseract is great OCR software that will convert your images into text. For Windows, you can download the installer from their homepage. OSX and Linux usually have packages available in Homebrew, apt, etc.
To convert an image file, run the following in the command line:
tesseract imagefile.jpg output -psm 1
Here, imagefile.jpg is your image file, and output is your desired name for the text file (without the file extension; it will be added automatically).
Tesseract understands many image formats, so you can convert JPG, PNG, TIF, and more. Its character recognition is very good; the only time I run into problems is with text that is very close to the book binding (and hard to scan), laid over background images, or italicized.
Tesseract has multiple “page segmentation modes” (PSMs), which control how it detects and segments the page. In the command above, I’m specifying a PSM of 1, which automatically detects the page orientation and segments the page into its constituent blocks of text. This is great if your page was scanned upside-down, or if it contains columns or other visually separate blocks of text.
Tesseract’s page segmentation is pretty great, but be warned that even if it detects the sidebars, captions, and other areas of text, it will probably insert them in the middle of the main text; you may want to leave out that supplemental content by covering it up or deleting it from the image.
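To OCR a whole chapter of scanned pages in one go, the single-file command above can be wrapped in a loop. This is a minimal sketch, assuming the pages are named so that they sort in reading order (page-001.tif, page-002.tif, and so on). The function name ocr_pages and the PSM_FLAG variable are made up for illustration and are not part of Tesseract. Tesseract 4 and later spell the option --psm, so the flag is left overridable.

```shell
# ocr_pages: run Tesseract on each image, in the order given, and
# concatenate the results into one text file.
# Hypothetical helper. PSM_FLAG is an assumption to cope with "-psm"
# (as used above) versus "--psm" (Tesseract 4 and later).
PSM_FLAG="${PSM_FLAG:--psm}"

ocr_pages() {
    out="$1"; shift
    : > "$out"                       # start with an empty output file
    for img in "$@"; do
        base="${img%.*}"             # strip the image extension
        tesseract "$img" "$base" "$PSM_FLAG" 1 || return 1
        cat "$base.txt" >> "$out"    # Tesseract wrote $base.txt
    done
}
```

A call might look like `ocr_pages chapter2.txt page-*.tif`, or `PSM_FLAG=--psm` can be set beforehand on newer Tesseract releases.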
Converting PDFs to text
If you have e-books and other digital materials, you’ll want to convert those files into plain text (i.e., without all the layout and text formatting).
You can always “select all” and copy+paste into a document, but why go to all that work? A free tool called “pdftotext” can do that for you:
pdftotext -enc UTF-8 -nopgbrk -layout chapter2.pdf chapter2.txt
The pdftotext tool is part of the Xpdf software suite. If you are using OSX or Linux, it is also usually available in a utility package (search for “pdftotext”, “poppler”, or “xpdf”).
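If a book is split into one PDF per chapter, the same command can be run over all of them. The sketch below also warns when a PDF yields no text at all, which is what typically happens with scanned, image-only PDFs (those need the OCR route instead). The function name pdf_to_text is a made-up helper; the flags are the ones used above.

```shell
# pdf_to_text: convert each PDF to a .txt file next to it and warn
# when nothing was extracted.
# Hypothetical helper built around the pdftotext command shown above.
pdf_to_text() {
    for pdf in "$@"; do
        txt="${pdf%.pdf}.txt"
        pdftotext -enc UTF-8 -nopgbrk -layout "$pdf" "$txt" || return 1
        # grep succeeds only if the file has a non-whitespace character
        if ! grep -q '[^[:space:]]' "$txt"; then
            echo "warning: no text in $pdf; it may need OCR instead" >&2
        fi
    done
}
```

For example: `pdf_to_text chapter*.pdf`.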
Side note: Textract
If you want an all-in-one solution for extracting text from files, take a look at textract. You’ll need NodeJS to run it, but once you have that, you can install textract globally so you can run it from the command line. It uses the same tools under the hood, such as Tesseract and pdftotext, but also supports extraction from HTML, Word documents, CSV, PowerPoint, and more, wrapping it all into a single tool. Unfortunately, there is little ability to customize the underlying tools if you want to tweak them.
Fixing the text
Unfortunately, Tesseract and pdftotext will generate some special characters that text-to-speech programs won’t understand, along with other cruft. We need to fix up the text so it creates the best audio.
First, use uni2ascii to convert fancy characters to their simple equivalents (7-bit ASCII, if you’re curious):
uni2ascii -B chapter2.txt > chapter2-fixed.txt
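To check whether any non-ASCII characters survived the conversion, the output can be searched for bytes outside the printable ASCII range. This is a small sketch; list_non_ascii is a made-up name, and the -a option (treat the file as text) assumes GNU or BSD grep.

```shell
# list_non_ascii: print "line-number:line" for every line that still
# contains a byte outside printable ASCII and whitespace.
# Hypothetical helper. LC_ALL=C makes grep compare byte by byte.
# Exit status is 1 when nothing is found (grep's usual behavior).
list_non_ascii() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1"
}
```

Running `list_non_ascii chapter2-fixed.txt` should print nothing if every character was converted.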
Finally, you’ll probably want to scan the text (with your eyes) just to catch any mistakes made by the OCR software. Some fonts, like heavily italicized text, are more difficult for it to recognize. Other things, like headings, benefit from adding a period or other punctuation to separate them from paragraphs. Most of these can’t be automated, but here are some common fixes that can be run on the command line or in a script (these require sed and Perl):
# OCR often interprets "w" as "vv",
# and "vv" is an uncommon combination in English.
sed -ri 's/vv/w/ig' chapter2.txt
# OCR sometimes interprets "&" as other characters.
sed -ri 's/:5:/\&/ig' chapter2.txt
sed -ri 's/8c/\&/ig' chapter2.txt
# Some observed PDFs use "greek question mark" (U+037E) instead of semicolon.
# Match its UTF-8 bytes (0xCD 0xBE); the ";" must be quoted from the shell.
perl -i -pe 's/\xCD\xBE/;/g' chapter2.txt
# Combine words that are broken (with hyphens) across lines.
perl -0777 -i -pe 's/(\S+)-\n\s*(\S+)/$1$2/ig' chapter2.txt
Converting text to speech
Festival is the de facto standard among free tools for converting text into speech. You can download it and some voices from festvox.org. It’s not the easiest software to install. If you’re using Linux, using a package manager is the easiest method. Installing voices is even more difficult; I used this old forum post to install some extra ones on my Linux machine.
Once you have Festival installed, you can convert the text file to speech using this command (using the “voice_us2_mbrola” voice):
text2wave chapter2.txt -o chapter2.wav -eval "(voice_us2_mbrola)"
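With several chapters, the text2wave command can be looped over the text files. This is a minimal sketch; speak_files and the VOICE variable are made-up names, and the default voice is simply the one used above, so it must be installed on your system.

```shell
# speak_files: turn each text file into a WAV file with text2wave.
# Hypothetical helper. VOICE must name an installed Festival voice.
VOICE="${VOICE:-voice_us2_mbrola}"

speak_files() {
    for txt in "$@"; do
        wav="${txt%.txt}.wav"          # chapter2.txt -> chapter2.wav
        text2wave "$txt" -o "$wav" -eval "($VOICE)" || return 1
    done
}
```

For example: `speak_files chapter*.txt`.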
Unfortunately, even the best free voices will still sound like Stephen Hawking. If you want something that sounds even close to Siri, you’ll have to pay for it. I use Cepstral, which for $35-$45 will get you a nice-sounding voice of your choice for use with their Windows/OSX/Linux software.
Encoding to MP3
Festival and Cepstral encode the speech as uncompressed WAV files, which are quite large. To make them more portable, you’ll want to convert them to MP3. Luckily there’s a free tool to do that called LAME. There are pre-built packages available for OSX and Linux; Windows users or those wanting a graphical interface (GUI) can check out their list of software links. (I use WinLAME myself.)
Because speech needs less fidelity than music, you can use fairly low-quality settings. Mono (instead of stereo) should be fine too. LAME’s -V5 option uses a variable bit rate quality of “medium”, or around 130kbps, which should be fine for speech:
lame -V5 chapter2.wav chapter2.mp3
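The command above does not force mono output. LAME's -m m option selects mono mode, so a batch version that applies both settings might look like the sketch below. The function name encode_speech is made up for illustration.

```shell
# encode_speech: encode each WAV file to a mono, variable-bit-rate MP3.
# Hypothetical helper. "-m m" asks LAME for mono output; -V5 is the
# quality level used above.
encode_speech() {
    for wav in "$@"; do
        lame -V5 -m m "$wav" "${wav%.wav}.mp3" || return 1
    done
}
```

For example: `encode_speech chapter*.wav`.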
Putting it all together
Depending on exactly which tools you use, your usage may vary from mine, but here’s the full process that you can customize.
- Linux (Debian/Ubuntu): apt-get install tesseract-ocr poppler-utils perl sed festival festvox-us2 lame
- OSX: brew install tesseract poppler perl speech-tools lame, then install Festival, the festlex packages, and any voices from the download page.
- Windows: Most of the tools mentioned have Windows installers.
# (1) Scan your images. Remove sidebars and supplemental text.
# (2) Run the textify script:
./textify.sh *.tif # assumes TIF images
# (3) Look over the text files. Make any necessary fixes.
# (4) Run the tts script:
./tts.sh *.txt
After the tts.sh script runs, you’ll have your final out.mp3 audio file with your text as speech.
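The textify.sh and tts.sh scripts themselves are not shown here. As a rough idea of what they could contain, here is a sketch built only from the commands used earlier. It is a guess at their shape, not the original scripts; the function names and the temporary file names are assumptions. On Tesseract 4 and later, -psm would need to be spelled --psm.

```shell
# textify: OCR each image to a text file with the same base name.
# Hypothetical stand-in for textify.sh.
textify() {
    for img in "$@"; do
        tesseract "$img" "${img%.*}" -psm 1 || return 1
    done
}

# tts: join the text files, synthesize speech, and encode out.mp3.
# Hypothetical stand-in for tts.sh; intermediate files live in a
# temporary directory that is removed afterwards.
tts() {
    work=$(mktemp -d) || return 1
    cat "$@" > "$work/all.txt" &&
    text2wave "$work/all.txt" -o "$work/out.wav" -eval "(voice_us2_mbrola)" &&
    lame -V5 "$work/out.wav" out.mp3
    status=$?                          # result of the chain above
    rm -rf "$work"
    return $status
}
```

Saved as scripts, each function body would be followed by a call such as `textify "$@"` or `tts "$@"`.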