OCR in BlitzMax

Brucey(Posted 2008) [#1]
Well, it's that time of the year (month/week/day) again!!
And here we are with another (not so) little module which adds OCR (Optical Character Recognition) capability to BlitzMax.

I know, I know... it's nothing to do with Games, but there just aren't that many game-related libraries out there... :-p

The module is BaH.tesseract, and it uses a lovely open-source library called Tesseract. (Thanks to xlsior for the heads-up, and nudging to get it done :-)

It claims OCR support for 6 languages as standard: English, German, French, Italian, Dutch and Spanish. I've only tried English so far, and it seems to work as intended.

It doesn't do page-layout recognition at all, so you'll need to sort that out yourself. Fortunately, you can choose a specific section of the page to process.

It takes a TPixmap as an image source, so you can use whatever image loader you like. The 2 examples provided use BaH.FreeImage to load some .tif images.
At its most basic, you can do this :
Local s:String = Tess.Rect(pix)


Still working on the documentation, but the engine itself is running on all three platforms!
(Was a right bugger to get it compiling in MinGW... yes, really!)

Currently available via SVN, at the usual place, until I get the docs finished. (after which time I will make a proper release)

As usual, any feedback is appreciated - Good/Bad/Ugly etc.


** Always on the lookout for any interesting non-GPL'd open-source libraries that might be fun for BlitzMax. **

:o)


plash(Posted 2008) [#2]
Both examples work for me, and at very high accuracy.

Unusual module, but cool nonetheless.


jkrankie(Posted 2008) [#3]
How well would this work for handwriting recognition? or wouldn't it?

Cheers
Charlie


Brucey(Posted 2008) [#4]
How well would this work for handwriting recognition?

Not very. It's more for reading printed text.

Which is a shame, as I have some late 18th/early 19th century documents that need transcribed... and they are bloody hard to read. Ho hum.


byo(Posted 2008) [#5]
Top notch!


jkrankie(Posted 2008) [#6]
Ah, shame that.

Cheers
Charlie


xlsior(Posted 2008) [#7]
How well would this work for handwriting recognition? or wouldn't it?


Not at all, but that's pretty much standard for most OCR libs. Handwriting is much harder to decipher than printed text, since it is so much less consistent.

Tesseract is a few years old now, but when it first came out it scored very high among its competition. It's also pretty much the only open-source one I've come across.

In my experience, 200 dpi scans give the best results and 300 dpi is still pretty decent, but at anything higher than that accuracy goes down big time. So if you have huge scans, make sure to scale them down first. :-?
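
For what it's worth, the scaling-down step could be sketched roughly like this in BlitzMax. ScaleToDpi is a made-up helper name, not part of BaH.tesseract, and it assumes you already know the dpi the page was scanned at (a TPixmap doesn't carry that information). It only relies on ResizePixmap, PixmapWidth and PixmapHeight from BRL.Pixmap. The 200 dpi default is just the figure mentioned above, not a documented Tesseract limit.

```blitzmax
SuperStrict

Import BRL.Pixmap

' Hypothetical helper (not part of BaH.tesseract).
' Returns a resized copy of pix, scaled from srcDpi down to targetDpi.
' If the scan is already at or below the target, pix is returned as-is.
Function ScaleToDpi:TPixmap(pix:TPixmap, srcDpi:Float, targetDpi:Float = 200.0)
	If srcDpi <= targetDpi Then Return pix

	' Same factor on both axes so the aspect ratio is preserved.
	Local factor:Float = targetDpi / srcDpi
	Local w:Int = Max(1, Int(PixmapWidth(pix) * factor))
	Local h:Int = Max(1, Int(PixmapHeight(pix) * factor))

	Return ResizePixmap(pix, w, h)
End Function
```

So a 600 dpi scan gets a factor of 200/600 and comes out at about a third of its original width and height. The result is still a plain TPixmap, so it can go straight into the Tess.Rect(pix) call from the first post.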

Anyway: Thanks again Brucey, this will definitely come in handy.