Saturday, March 31, 2012

Optical Character Recognition with Linux

My company currently uses ABBYY products to perform OCR on PDF documents.  I thought it would be fun to see what open source alternatives exist for this purpose.  After a little searching, I found that most people seem to agree that Tesseract is one of the more accurate open source OCR programs available.  I decided to give it a try on a Debian virtual machine.

I first tried using Tesseract on a TIFF file with cursive handwriting.  This failed miserably and just gave me a bunch of garbage as text output.  Maybe if I put in the time and effort to "train" Tesseract, I could get this to work somewhat.  But then again, everybody's cursive handwriting is different, so I can't see this ever working reliably.

I then tried a PDF document with typed text.  Tesseract wouldn't even read the PDF as is.  I first had to convert it into an image file (I chose TIFF again).  This worked much better; I'd say the output was about 90 to 95% accurate.  I then tried the same PDF in Windows using ABBYY Corporate Edition version 10.  ABBYY had much better results.  In fact, it was almost 100% accurate.  I have to give the point to ABBYY on this one.
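For anyone who wants to repeat the PDF test, here is a minimal sketch of the convert-then-OCR step in Python.  It is not the exact procedure used above: the post doesn't say which tool did the PDF-to-TIFF conversion, so this assumes pdftoppm (from poppler-utils) and the tesseract command are both installed and on the PATH.  The function names, the 300 DPI setting, and the "page" file prefix are all made up for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_cmd(pdf_path, out_prefix, dpi=300):
    """Command to render each PDF page as a TIFF image.

    Assumption: pdftoppm is the converter. It writes one file per page,
    named <out_prefix>-<page number>.tif.
    """
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def build_ocr_cmd(image_path, out_base, lang="eng"):
    """Command to OCR one image. Tesseract appends ".txt" to out_base itself."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def page_number(image_path):
    """Extract the page number from a name like page-07.tif."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, work_dir):
    """Convert a PDF to per-page TIFFs, OCR each page, return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / "page"  # hypothetical prefix for the page images

    # Step 1: PDF -> TIFF, since Tesseract does not accept the PDF directly.
    subprocess.run(build_convert_cmd(pdf_path, prefix), check=True)

    # Step 2: run Tesseract on each page image, in page order.
    texts = []
    for image in sorted(work_dir.glob("page-*.tif"), key=page_number):
        out_base = image.with_suffix("")  # e.g. work_dir/page-1
        subprocess.run(build_ocr_cmd(image, out_base), check=True)
        texts.append(Path(str(out_base) + ".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling ocr_pdf("document.pdf", "ocr_work") would leave the page images and text files in the ocr_work directory and return the recognized text as one string.  The file names here are placeholders.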

Finally, I tried scanning a gas station receipt and OCR-ing it.  Tesseract produced more garbage.  This time, though, ABBYY also produced garbage.  To be fair, the receipt was very crumpled and its text was printed very lightly.  I had to strain my eyes to read it myself, so I can't blame the two pieces of software for not being able to properly OCR it.

The fact is that I don't really have any personal uses for OCR software.  That makes it difficult for me to think up more testing scenarios.  My results seem to show that if I needed to OCR documents for a business, I'd probably put my trust in the commercial ABBYY product line.  The Tesseract project does show a lot of promise, though.  Tesseract did seem to run much faster and use far fewer resources than ABBYY.  With some improved out-of-the-box accuracy, my recommendation could certainly shift in favor of Tesseract.
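The accuracy percentages in this post are rough estimates.  One way to put an actual number on a comparison like this is to type up the correct text by hand and score each program's output against it.  The sketch below does that with Python's standard difflib module.  The function name is made up, and the score (the fraction of reference characters that show up, in order, in the OCR output) is just one simple measure, not a standard OCR metric.

```python
from difflib import SequenceMatcher


def char_accuracy(reference, ocr_output):
    """Fraction of reference characters matched, in order, by the OCR output.

    Runs of whitespace are collapsed first so that line-wrapping
    differences are not counted as errors.
    """
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    if not ref:
        # Nothing to recognize: perfect only if the output is empty too.
        return 1.0 if not out else 0.0

    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in long strings.
    matcher = SequenceMatcher(None, ref, out, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)
```

With a hand-typed reference in one string and each program's output in another, char_accuracy(reference, tesseract_text) and char_accuracy(reference, abbyy_text) give two numbers that can be compared directly.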
