The Nameless Horror

Character Recognition: Lessons In OCR

The other week I got back rights on my last three Penguin books; while the first one’s still available, and was digitised early, the others were all more or less abandoned not long after (or, in the case of The Darkness Inside, before) launch and fell out of print.

Hooray.

Unfortunately, clever chap that I am, I discovered some time ago that I’d managed to fail to keep my draft files of the second book, The Touch Of Ghosts, in the backups made as I’ve swapped from computer to computer down the years (which, given that it was written in… oooh… 2002/3, was about 5 computers ago). Which means either (a) typing out 85,000 words by hand, or (b) employing a camera/scanner and OCR (optical character recognition) software to turn images into editable text again. A year or two ago, a mate worked at a place with a mighty machine for such a thing - give it a stack of book pages and it would scan and OCR them for you, job done. Sadly he doesn’t work there any more. I don’t have a scanner, but I’ve got a phone with a good camera and a couple of promising OCR apps available, and a DSLR and a laptop on which other options abound. Which is good, because using a phone to photograph a 280-page novel one page at a time, then upload and email the results to yourself, is a bastard of a chore. Especially when the better of the two apps I’ve tried does its processing server-side and only allows you to upload 10 images per hour. Give me something with batch processing and make me a cup of tea, right?

Or so I thought. Y’see, should you find yourself in my situation, you’ll discover that OCR is less than entirely reliable and the best option isn’t necessarily the big, fat, commercial one.

I used my Canon to photograph every page of the book. It’s not ideal - a phone gives you a better image because you’re closer (shorter focal length = wider field of view = closer you can get) - but it makes some spankingly sharp images which can be ensharpenated and clarified nicely before you run them, and it’s a lot, lot, lot faster. Set it up on a tripod. Click. Turn page. Click. Turn page. Click. Etc.

Our contenders are:

  • ImageToText: An iOS app, free. The server-side, 10 images max per hour one. Email the results to yourself or send to Evernote. Grindingly sluggish due to all the faff.
  • TextGrabber: iOS again, $3 as I recall. Processes on-phone, so no limit. Similar buggering around to get the results back onto your actual computer. Bit quicker but not so’s you’d really notice.
  • OCRTools: Mac App Store, $3, a shot in the dark (commercial OCR is normally very expensive). No faff ‘cos you’re on your computer already.
  • ABBYY FineReader Express: This puppy retails for $100 and has a good rep. It also has a 15-day free trial period, which is what I’m using because damned if I have a hundred Yanqui Dollars to throw at photos of a book I should’ve backed up properly. It’s the heavy-hitter in our study.
  • Google Docs: Yes, GDocs can read text from PDFs or images (convert on upload, or after). I didn’t know that.

I used the same sample image - from the Canon - with all of them. Page 31, as it happens. (I’d already done 1-30 in ImageToText the night before, so that was where I picked up with the camera.) We’ll be looking at the first few sentences. Here they are (slightly downsized to fit the blog):

[Image: the first few sentences of page 31]

And here are the results, starting from the bottom of the list above:

GDocs:

I can taste spruce bark in the air — dry, earthy, bitter. The scent is strong but not overpowering, mingling with those ofa hundred other plants, trees and flowers without desthem. The mixture reminds me of old winc casks, troymg ulcul.

Yes, it looks OK, but the later part of the page, after our sample, is incredibly garbled (“troymg ulcul” is a real harbinger of things to come). It also tries to put everything in rich text rather than unformatted and… well, the results are very strange to see.

ABBYY FineReader Express:

an taste spruce bark in the air - dry, earthy, bitter. The •cm is strong hut not overpowering, mingling wirh those o/a hundred other plants, trees and flowers without des- troying them. The mixture reminds me of old wine casks, though 1 have no clear idea why.

Sure, miss off the first couple of words for no reason. It’s better than GDocs, but not astounding, and like GDocs it gets worse further down. The bottom of the page was just squiggles.

OCRTools:

gym-Powering, mingling with those pggms, trees and flowers without des ’ . reminds me of old wine casks, idea why.

Three quid suddenly seems like a bit of a con. This is like reading the output of Word’s ‘Outline’ feature for a document. Or auto-scraped spam comment text. Speak to me, OCRTools, I know you’re trying to tell me something!

TextGrabber:

J can taste spruce bark in the air - dry, earthy, bitter. The is strong but not overpowering, mingling with those of a hiuxired other plants, trees and flowers wkhouE desthem. ‘1’he mixture reminds me of old wine casks, I have no clear idea why.

Not bad, if a bit ropy in the middle. Unlike some of its desktop brethren, TextGrabber didn’t have the same tendency to start strong and get wonkier; the mid section of the page was good and it only got a bit more like this again towards the end (where curvature and shadowing affect the image most). Still, bear in mind this is cheaper than OCRTools…

ImageToText:

J can taste spruce bark in the air - dry, earthy, bitter. The scent is strong but not overpowering, mingling with those of a hundred other plants, trees and flowers without des- troying them. The mixture reminds me of old wine casks, [hmuih I have no clear idea why.

Swaps J for I and messes up one word only. Mostly consistent further down, too.
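That "des- troying" in the ImageToText and FineReader output looks like a line-break hyphen carried over from the printed page rather than a misread, which is presumably why I don't count it as an error. If you wanted to tidy those up automatically, a minimal Python sketch might look like the following. The function name `rejoin_hyphenated` is made up for illustration, and the rule it uses (fragment, hyphen, whitespace, lowercase continuation) is my assumption about what a broken word looks like, not anything these apps do themselves.

```python
import re

# Hypothetical pattern: a word fragment ending in a hyphen, then whitespace
# (a space or a line break), then a continuation starting with a lowercase letter.
_BROKEN_WORD = re.compile(r"(\w+)-\s+([a-z]\w*)")


def rejoin_hyphenated(text):
    """Glue words split across a line break back together.

    A spaced dash such as "air - dry" is left alone, because the hyphen
    there is not attached to the preceding word.
    """
    return _BROKEN_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    sample = "in the air - dry, earthy, bitter ... flowers without des- troying them."
    print(rejoin_hyphenated(sample))
```

It's a blunt instrument: a genuine compound that happened to break at the end of a line ("well- known") would come out as "wellknown", so the result still needs a proofread.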

So there you have it. Bearing in mind that the last two work better when you use the camera on the actual phone rather than the external sample image (and I’m sure the desktop ones would too, but if I’m sodding around with the phone I might as well use it for everything rather than faff even more moving photos from one place to another), ImageToText is a clear winner. I just wish (a) there were some way of removing the images/hour limit (I would happily pay, but sadly it’s free-only) and (b) it were possible to set it up to auto-send the results to email rather than having it open Mail for you to do by hand each time.
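If you'd rather put a number on "clear winner" than trust my eyeballs, here's a rough Python sketch that scores each output against the correct text, word by word, using the standard library's `difflib`. Big caveat: `REFERENCE` is my own reconstruction of the passage pieced together from the outputs above, so treat it as an assumption, and the names `word_score` and `rank` are invented for this example. One sample sentence is also a very small test.

```python
from difflib import SequenceMatcher

# ASSUMPTION: the correct text, reconstructed from the OCR outputs themselves.
REFERENCE = (
    "I can taste spruce bark in the air - dry, earthy, bitter. The scent is "
    "strong but not overpowering, mingling with those of a hundred other "
    "plants, trees and flowers without destroying them. The mixture reminds "
    "me of old wine casks, though I have no clear idea why."
)

# The five outputs quoted in this post (\u2014 is the long dash GDocs produced).
OUTPUTS = {
    "GDocs": "I can taste spruce bark in the air \u2014 dry, earthy, bitter. The scent is strong but not overpowering, mingling with those ofa hundred other plants, trees and flowers without desthem. The mixture reminds me of old winc casks, troymg ulcul.",
    "ABBYY FineReader Express": "an taste spruce bark in the air - dry, earthy, bitter. The •cm is strong hut not overpowering, mingling wirh those o/a hundred other plants, trees and flowers without des- troying them. The mixture reminds me of old wine casks, though 1 have no clear idea why.",
    "OCRTools": "gym-Powering, mingling with those pggms, trees and flowers without des ’ . reminds me of old wine casks, idea why.",
    "TextGrabber": "J can taste spruce bark in the air - dry, earthy, bitter. The is strong but not overpowering, mingling with those of a hiuxired other plants, trees and flowers wkhouE desthem. ‘1’he mixture reminds me of old wine casks, I have no clear idea why.",
    "ImageToText": "J can taste spruce bark in the air - dry, earthy, bitter. The scent is strong but not overpowering, mingling with those of a hundred other plants, trees and flowers without des- troying them. The mixture reminds me of old wine casks, [hmuih I have no clear idea why.",
}


def word_score(reference, candidate):
    """Return a 0-1 similarity, comparing whitespace-separated words in order."""
    matcher = SequenceMatcher(None, reference.split(), candidate.split(), autojunk=False)
    return matcher.ratio()


def rank(outputs, reference):
    """Return (name, score) pairs sorted from best match to worst."""
    scores = {name: word_score(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank(OUTPUTS, REFERENCE):
        print(f"{name:26s} {score:.2f}")
```

On this one sample it should put ImageToText at the top and OCRTools at the bottom, which squares with the verdict above. It's a harsh measure, mind: a word with a single wrong letter counts as a complete miss, and punctuation stuck to a word counts as part of it.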

If you find yourself in the same ridiculous position as I am, get a goddamn scanner and do it properly. If you can’t, your best bet is probably not your computer but an iPhone. And the patience of a motherfucking saint.