Getting Google Play Books (or any other non-OCR’ed PDF) Ready To Be Read on Kindle (or any other PDF reader) with Text Manipulation Functions

Skip the explanation parts if you know what you are doing; I have written this manual for intermediate users of UNIX-like systems.

Brief Explanation

I like my Kindle Voyage and its PDF display capabilities, but once in a while I have a hard time enjoying the content because the PDF has not been OCR’ed. I recently bought a 1950 edition of an old book from Google Play for 40 USD (the hard copy costs around 100 USD, and I can’t take the risk of getting it through customs in Turkey), and it’s a shame I couldn’t read it on my Kindle without pulling all kinds of stunts.

For starters, OCR (optical character recognition) is what makes a scanned document more than just an image of the page: it detects the text on it and turns a simple photocopy into an actual text document. Why is OCR important? If you want to search the text, copy a passage, annotate, or look up the meaning of a word inside your PDF, OCR is the only solution for older documents (or even newer ones when the document is intentionally left as a scan).

Unfortunately, commercial OCR software is expensive, starting at around ten dollars and going up to thousands of dollars. The best part is that many OCR tools use the same free, open-source back end: Tesseract. Tesseract was started at HP decades ago and is now sponsored by Google. As far as I know, the famous Google Docs uses it for the PDFs you have on your Google Drive or even in Gmail. The problem with Tesseract for regular users is that it is too powerful to be user-friendly. In fact, it does not even have a GUI; it is a command-line tool. So the following gives brief and (as far as possible) simple instructions on how to tame it on OS X.

Install Homebrew

Homebrew is an indispensable package manager for power users of OS X, similar to apt-get or yum on Linux distros. In fact, it is a bit like Gentoo’s package manager, since it can compile packages directly on your computer instead of downloading binary executables.

To install it from the Terminal, which is actually the only way, run:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install ImageMagick

The almighty ImageMagick has more juice and tricks than Photoshop if you know how to use it. No wonder even Hollywood studios use it (Shrek, Toy Story, Stuart Little and The Incredibles are just a few of the movies it was used in). Honestly, I don’t know how to do anything with it beyond simple image manipulation.

Run this:

brew install --with-libtiff --with-ghostscript imagemagick

Install Tesseract

Time to unleash the monster in all its glory. This step will use a lot of CPU power and memory, so leave your computer be while the fans cry their lungs out:

brew install --with-all-languages tesseract
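Before going further, it may be worth checking that the three tools installed so far are actually reachable from the terminal. The following is only a minimal sketch: the helper name missing_tools is made up for this example, and it assumes that ImageMagick provides the convert command used later in this post.

```shell
# Print the name of every command in the argument list that is not on PATH.
# (missing_tools is a hypothetical helper, not part of any of the tools.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools brew convert tesseract)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Not found on PATH:" $missing
else
  echo "All tools found."
fi
```

If something is reported as missing, rerunning the matching brew install command is the first thing to try.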

One necessary modification: the latest version of Tesseract (3.04.00) has a glitch when generating the fonts in the final PDF. I have contacted the development team, and it seems that a fix is on the way; however, for now you have to use an alternative font. Ken Sharp saves the day with his font. Click here to download it. Then run this in the terminal to open the tessdata folder, and replace the pdf.ttf file in it with Sharp’s font:

open /usr/local/Cellar/tesseract/3.04.00/share/tessdata
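If you would rather not drag files around in Finder, the same swap can be sketched in the terminal. This is an illustration under assumptions, not part of the original procedure: the function name swap_pdf_font is made up, and the downloaded font's file name in the usage line is hypothetical, so adjust it to wherever you saved Sharp’s font.

```shell
# Replace pdf.ttf in a tessdata directory, keeping a backup of the original.
#   $1: tessdata directory
#   $2: path to the replacement font (file name is up to you)
swap_pdf_font() {
  tessdata_dir=$1
  new_font=$2
  [ -f "$tessdata_dir/pdf.ttf" ] || { echo "no pdf.ttf in $tessdata_dir" >&2; return 1; }
  [ -f "$new_font" ] || { echo "replacement font not found: $new_font" >&2; return 1; }
  # Back up only once, so running this twice does not overwrite the original font.
  [ -f "$tessdata_dir/pdf.ttf.bak" ] || cp "$tessdata_dir/pdf.ttf" "$tessdata_dir/pdf.ttf.bak"
  cp "$new_font" "$tessdata_dir/pdf.ttf"
}

# Usage (hypothetical download location; uncomment and adjust):
# swap_pdf_font /usr/local/Cellar/tesseract/3.04.00/share/tessdata ~/Downloads/replacement-pdf.ttf
```

Keeping pdf.ttf.bak around makes it easy to restore the stock font once the fix lands upstream.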

Scanning and OCR’ing:

Run this (replacing the INPUT part) to create TIFF versions of your original PDF:

convert -density 300 INPUT.pdf -type Grayscale -compress lzw -background white +matte -depth 32 page_%05d.tif
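A note on the page_%05d.tif part: it is a printf-style pattern, so each page gets a five-digit, zero-padded number. The small sketch below only reproduces the naming with the shell's own printf (the helper name page_name is made up for this example); it does not call ImageMagick.

```shell
# Build the file name that a page_%05d.tif pattern yields for a page index.
# (page_name is a hypothetical helper, used here only to show the naming.)
page_name() {
  printf 'page_%05d.tif\n' "$1"
}

# ImageMagick starts counting at 0 by default, so a 3-page PDF would give:
for n in 0 1 2; do
  page_name "$n"
done
```

The zero padding matters later: page_*.tif and page_*.pdf expand in alphabetical order, and with padding that order is also the page order (page_00010 comes after page_00009, whereas an unpadded page_10 would sort before page_9).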

OCR the tiffs with:

for i in page_*.tif; do echo "$i"; tesseract "$i" "$(basename "$i" .tif)" pdf; done
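OCR on a whole book can take a long time, and if the loop gets interrupted you probably don’t want to start from page one. Here is a hedged sketch of a resumable variant: pending_pages is a made-up helper that lists only the TIFFs that do not have a matching PDF yet, and the loop runs only if tesseract is on your PATH.

```shell
# Print every page_*.tif in the current directory that has no matching .pdf yet.
# (pending_pages is a hypothetical helper name.)
pending_pages() {
  for tif in page_*.tif; do
    [ -e "$tif" ] || continue                 # the glob matched nothing
    [ -e "${tif%.tif}.pdf" ] || printf '%s\n' "$tif"
  done
}

# OCR only the pages that are still missing their PDF.
if command -v tesseract >/dev/null 2>&1; then
  pending_pages | while IFS= read -r tif; do
    echo "$tif"
    tesseract "$tif" "${tif%.tif}" pdf       # writes e.g. page_00000.pdf
  done
fi
```

Note that a page whose OCR was killed halfway may leave a broken PDF behind; deleting the most recent page_*.pdf before resuming is a cheap precaution.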

Now merge the single-page PDFs with CPDF, which is available here (CPDF was an expensive PDF toolkit before it was made open source; the back-end binaries are now free to download, but you still have to pay thousands of dollars for the GUI if you are a company):

./cpdf page_*.pdf -o MERGED.pdf
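Since the merge simply takes whatever page_*.pdf files exist, a page that failed OCR would silently go missing from MERGED.pdf. A quick sanity check like the one below, run in the same folder before the merge, compares the number of TIFFs with the number of PDFs. It is only a sketch, and count_pages is a made-up helper name.

```shell
# Count the files in the current directory named page_*.<extension>.
# (count_pages is a hypothetical helper name.)
count_pages() {
  n=0
  for f in page_*."$1"; do
    if [ -e "$f" ]; then n=$((n + 1)); fi
  done
  echo "$n"
}

tifs=$(count_pages tif)
pdfs=$(count_pages pdf)
if [ "$tifs" -eq "$pdfs" ]; then
  echo "OK: $pdfs PDFs for $tifs TIFF pages"
else
  echo "Warning: $tifs TIFF pages but $pdfs PDFs" >&2
fi
```

If the numbers differ, rerun the OCR step for the missing pages before merging.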

That’s it! You now have a searchable PDF which you can use both on your computer and on your Kindle.
