The Nameless Horror

Character Recognition: Lessons In OCR

The other week I got back the rights to my last three Penguin books; while the first one’s still available, and was digitised early, the other two were more or less abandoned not long after (or, in the case of The Darkness Inside, before) launch and fell out of print.

Hooray.

Unfortunately, clever chap that I am, I discovered some time ago that I’d failed to keep my draft files of the second book, The Touch Of Ghosts, in the backups made as I’ve swapped from computer to computer down the years (which, given that it was written in… oooh… 2002/3, was about 5 computers ago). Which means either (a) typing out 85,000 words by hand, or (b) employing a camera/scanner and OCR (optical character recognition) software to turn images into editable text again.

A year or two ago, a mate worked at a place with a mighty machine for such a thing - give it a stack of book pages and it would scan and OCR them for you, job done. Sadly he doesn’t work there any more. I don’t have a scanner, but I’ve got a phone with a good camera and a couple of promising OCR apps available, and a DSLR and a laptop on which other options abound. Which is good, because using a phone to photograph a 280-page novel, upload it and email the results to yourself, one page at a time, is a bastard of a chore. Especially when the better of the two apps I’ve tried does its processing server-side and only allows you to upload 10 images per hour. Give me something with batch processing and make me a cup of tea, right?

Or so I thought. Y’see, should you find yourself in my situation, you’ll discover that OCR is less than entirely reliable and the best option isn’t necessarily the big, fat, commercial one.

I used my Canon to photograph every page of the book. It’s not ideal - a phone gives you a better image because you’re closer (shorter focal length = wider viewport = closer you can get) - but it makes some spankingly sharp images which can be ensharpenated and clarified nicely before you run them, and it’s a lot, lot, lot faster. Set it up on a tripod. Click. Turn page. Click. Turn page. Click. Etc.

Our contenders are:

  • ImageToText: An iOS app, free. The server-side, 10 images max per hour one. Email the results to yourself or send to Evernote. Grindingly sluggish due to all the faff.
  • TextGrabber: iOS again, $3 as I recall. Processes on-phone, so no limit. Similar buggering around to get the results back onto your actual computer. Bit quicker but not so’s you’d really notice.
  • OCRTools: Mac App Store, $3, a shot in the dark (commercial OCR is normally very expensive). No faff ‘cos you’re on your computer already.
  • ABBYY FineReader Express: This puppy retails for $100 and has a good rep. It also has a 15-day free trial period, which is what I’m using because damned if I have a hundred Yanqui Dollars to throw at photos of a book I should’ve backed up properly. It’s the heavy-hitter in our study.
  • Google Docs: Yes, GDocs can read text from PDFs or images (convert on upload, or after). I didn’t know that.

I used the same sample image - from the Canon - with all of them. Page 31, as it happens. (I’d already done 1-30 in ImageToText the night before, so that was where I picked up with the camera.) We’ll be looking at the first few sentences. Here they are (slightly downsized to fit the blog):

And here are the results, starting from the bottom of the list above:

GDocs:

I can taste spruce bark in the air — dry, earthy, bitter. The scent is strong but not overpowering, mingling with those ofa hundred other plants, trees and flowers without desthem. The mixture reminds me of old winc casks, troymg ulcul.

Yes, it looks OK, but the later part of the page, after our sample, is incredibly garbled (“troymg ulcul” is a real harbinger of things to come). It also tries to put everything in rich text rather than unformatted and… well, the results are very strange to see.

ABBYY FineReader Express:

an taste spruce bark in the air - dry, earthy, bitter. The •cm is strong hut not overpowering, mingling wirh those o/a hundred other plants, trees and flowers without des- troying them. The mixture reminds me of old wine casks, though 1 have no clear idea why.

Sure, miss off the first couple of characters for no reason. It’s better than GDocs, but not astounding, and like GDocs it gets worse further down. The bottom of the page was just squiggles.

OCRTools:

gym-Powering, mingling with those pggms, trees and flowers without des ’ . reminds me of old wine casks, idea why.

Three quid suddenly seems like a bit of a con. This is like reading the output of Word’s ‘Outline’ feature for a document. Or auto-scraped spam comment text. Speak to me, OCRTools, I know you’re trying to tell me something!

TextGrabber:

J can taste spruce bark in the air - dry, earthy, bitter. The is strong but not overpowering, mingling with those of a hiuxired other plants, trees and flowers wkhouE desthem. ‘1’he mixture reminds me of old wine casks, I have no clear idea why.

Not bad, if a bit ropy in the middle. Unlike some of its desktop brethren, TextGrabber didn’t have the same tendency to start strong and get wonkier; the mid section of the page was good and it only got a bit more like this again towards the end (where curvature and shadowing affect the image most). Still, bear in mind this is cheaper than OCRTools…

ImageToText:

J can taste spruce bark in the air - dry, earthy, bitter. The scent is strong but not overpowering, mingling with those of a hundred other plants, trees and flowers without des- troying them. The mixture reminds me of old wine casks, [hmuih I have no clear idea why.

Swaps J for I and messes up one word only. Mostly consistent further down, too.
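For anyone who would rather put a number on "how wrong is it" than eyeball it, here is a minimal sketch using Python's standard difflib. The reference passage is an assumption: it is pieced together from the samples above rather than copied from the printed page, and the function names are made up for illustration. Only three of the five outputs are included to keep it short.

```python
from difflib import SequenceMatcher

# ASSUMPTION: reference text reconstructed from the OCR samples,
# not transcribed from the actual page.
REFERENCE = (
    "I can taste spruce bark in the air - dry, earthy, bitter. "
    "The scent is strong but not overpowering, mingling with those of "
    "a hundred other plants, trees and flowers without destroying them. "
    "The mixture reminds me of old wine casks, though I have no clear "
    "idea why."
)

def normalise(text):
    """Collapse runs of whitespace so line wrapping doesn't skew the score."""
    return " ".join(text.split())

def similarity(ocr_text, reference=REFERENCE):
    """Return a 0..1 character-level similarity between OCR output and reference."""
    # autojunk=False: the default heuristic ignores frequent characters in
    # strings of 200+ items, which would distort a comparison of prose.
    matcher = SequenceMatcher(
        None, normalise(ocr_text), normalise(reference), autojunk=False
    )
    return matcher.ratio()

samples = {
    "ImageToText": (
        "J can taste spruce bark in the air - dry, earthy, bitter. The scent "
        "is strong but not overpowering, mingling with those of a hundred "
        "other plants, trees and flowers without des- troying them. The "
        "mixture reminds me of old wine casks, [hmuih I have no clear idea why."
    ),
    "TextGrabber": (
        "J can taste spruce bark in the air - dry, earthy, bitter. The is "
        "strong but not overpowering, mingling with those of a hiuxired other "
        "plants, trees and flowers wkhouE desthem. ‘1’he mixture reminds me "
        "of old wine casks, I have no clear idea why."
    ),
    "OCRTools": (
        "gym-Powering, mingling with those pggms, trees and flowers without "
        "des ’ . reminds me of old wine casks, idea why."
    ),
}

# Print the tools from most to least faithful.
for name in sorted(samples, key=lambda n: similarity(samples[n]), reverse=True):
    print(f"{name}: {similarity(samples[name]):.2f}")
```

A ratio like this only measures these few sentences; it says nothing about how each tool copes with the garbled lower half of the page.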

So there you have it. Bearing in mind that the last two work better when you use the camera on the actual phone rather than the external sample image (and I’m sure the desktop ones would too, but if I’m sodding around with the phone I might as well use it for everything rather than faff even more moving photos from one place to another), ImageToText is a clear winner. I just wish (a) there was some way of removing the images/hour limit (I would happily pay, but sadly it’s free-only) and (b) it was possible to set it up to auto-send the results to email rather than having it open Mail for you to do by hand each time.
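Even the best output above keeps the printed page's end-of-line hyphenation ("des- troying"), which is the sort of thing worth fixing in bulk rather than by hand. A minimal sketch of one way to do that, assuming the split always shows up as letter, hyphen, whitespace, lowercase letter; the function name is hypothetical and this isn't a feature of any of the apps tested.

```python
import re

# Letter, hyphen, whitespace, lowercase letter: the shape left behind when
# a word was broken across two printed lines ("des- troying").
_SPLIT_WORD = re.compile(r"([A-Za-z])-\s+([a-z])")

def rejoin_hyphenated(text):
    """Glue words split by end-of-line hyphenation back together."""
    return _SPLIT_WORD.sub(r"\1\2", text)

print(rejoin_hyphenated("trees and flowers without des- troying them"))
```

A spaced dash such as "air - dry" is left alone because there is no letter directly before the hyphen. The catch is that a genuinely hyphenated compound broken at a line end ("well- known") would be glued into one word, so the result still wants a read-through.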

If you find yourself in the same ridiculous position as I am, get a goddamn scanner and do it properly. If you can’t, your best bet is probably not your computer but an iPhone. And the patience of a motherfucking saint.

Crawling Back

A year or so ago I delisted everything self-published I had on Amazon because I didn’t like what they were doing to the publishing industry as a whole (the drive-prices-down issue, etc. etc.), and stuck primarily to the direct market on principle.

I still think all that, but with Submission Thing looking increasingly unlikely to find a home, I also have - to borrow a quote from Serenity - a powerful need to eat, and pragmatically I can’t afford to ignore the 95% of the market that Amazon represents if I’m going to be relying increasingly on self-published sales. (Assuming, that is, I manage to continue scraping a living as a writer in the first place.)

So stuff is up over there too, with just HBJC waiting in review still. If it shifts a little, we might, as the saying goes, be OK. (It’s also around on Kobo, and will eventually appear elsewhere via Smashwords when I summon up the enthusiasm to turn my perfectly good epub files into Word documents so they can be converted to epub files. I also need to update my own store versions with the new editions, and that too involves Work, but it’ll happen in the end.)

To Charity And Beyond!

There are nearly 50 writers contributing to this, all of whom are on Twitter and Facebook and InternationalPigFanciers.com and all the rest, so I imagine many of you will have seen a billion posts on this today - sorry - but all-proceeds-to-charity antho Off The Record 2: At The Movies is out today. According to Luca it clocks in at nearly 120,000 words so it’s a proper e-brick of a thing for a mere couple of quid, and features stories from a raft of awesome writers including but not limited to Steve Mosby, Will Carver, Claire McGowan, Matt Hilton, Helen Fitzgerald, Stav Sherez, Andrez Bergen and some chancer going under the name Sean Cregan. Every story, so the theme goes, has to take its title from a movie, and there are some cool ones picked.

We got copies to proof-read a few days ago and the stuff I’ve read so far has been absolutely top notch. There’s some really strong work in there. And mine.

My contribution - which will obviously be THE BEST ONE - is The City Of Lost Children (I was tempted by Surf Nazis Must Die but went all serious instead). The opening paragraph is:

It is 11:05. Jenny stands at the junction near the little row of empty cafés. The big clock on the tower across Evergreen Park tells her it is 11:05, and since neither she nor most of the other kids in the City have a watch, she has come to rely on it. As she does at 11:05 every day, in this place without true days, she stands there and watches the ghosts, hoping with all her heart, as she does at 11:05 every day, that this time she will see her parents.

To read the rest, buy the book. Everything it makes after the distributor’s cut goes to two children’s literacy charities. And it’s good. Relentlessly grim, but good.

Yet, although the literary community – in the broadest sense – is part of this paradigm shift, it is odd, and slightly baffling, how little reference is made to it in poetry, drama or fiction. Jeanette Winterson published The Powerbook in 2000, exploiting emails as a genre. In India, Chetan Bhagat (One Night @ the Call Center) and Aravind Adiga (The White Tiger) have flirted with the socio-economic impact of the new technology on Indian life. Otherwise, I cannot think (perhaps readers can help out here) of a contemporary scene or character whose narrative or development owes much, if anything, to the new technology.

Apparently, Robert McCrum is reading very different books than I am.

And ignoring the here-today-gone-tomorrow dynamic of online services, which makes any mention of them dated overnight (MySpace, anyone?).

And quoting Baroness Greenfield on computers, which is a bit like asking a frothing Puritan to give sensible, considered opinion about fornication.