[Apc-foss] Finally, it got going... Tesseract

Frederick Noronha [फ़रेदरिक नोरोनया] fred at bytesforall.org
Mon Dec 17 18:23:25 GMT 2007


Derek was about to leave this afternoon. In typically Derekesque
style, he started off sounding very reluctant and diffident. When I
kept insisting, he agreed to be dragged back to my comp. After a while,
I was seated before the comp, scanning the document (as if I knew
everything about it)... and he was actually helping me.

"Don't do it in JPG. Use TIFF instead. Take the higher resolution. 300
dpi," he kept guiding me.

Before we knew it, we found we were on track. The first time, the
emerging characters seemed mostly junk. At a higher resolution, it
worked.
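
A rough way to see why the higher resolution probably helped (this is
my own back-of-the-envelope sketch, not something from the Tesseract
documentation): the same printed character covers more pixels when
scanned at a higher dpi. The 10-point type size and the 72 and 150 dpi
comparison values are assumptions picked purely for illustration.

```python
# Back-of-the-envelope sketch: how tall, in pixels, is a printed character
# at a given scan resolution? (1 point = 1/72 inch.)
# The 10-point size and the comparison resolutions are illustrative assumptions.

POINTS_PER_INCH = 72


def char_height_px(point_size, dpi):
    """Approximate pixel height of type of `point_size` when scanned at `dpi`."""
    return point_size * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    for dpi in (72, 150, 300):
        height = char_height_px(10, dpi)
        print(f"{dpi:>3} dpi: 10-point type is about {height:.0f} px tall")
```

At 300 dpi the engine gets roughly four times as many pixels per
character height as at 72 dpi, which fits with what we saw.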

Finally, OCR was working for me... on GNU/Linux (I've actually never
tried it out on any other operating system). Yay!

For details about what I'm talking about, see
http://en.wikipedia.org/wiki/Tesseract_%28software%29 :

OPENQUOTE: In computer software, Tesseract is a free optical character
recognition engine. It was originally developed at Hewlett-Packard
from 1985 until 1995. After ten years with no development, Hewlett
Packard and UNLV released it in 2005. Tesseract is currently developed
by Google and released under the Apache License, Version 2.0. The
current version of Tesseract is 2.01, released August 30, 2007.
CLOSEQUOTE

Now, I'm making my way through piles of neatly-OCRed text, in jstar
(the WordStar clone I still love using)... and it works fine. So
here's another tool some of you word-oriented guys might find useful.

Very simple to use....

QUOTE:

Tesseract is only the OCR engine, not a standalone application.
Tesseract runs from the command line and the usage of both Windows and
Linux versions is the same. Tesseract may be called from command line
using the following format:

tesseract <image.tif> <output> batch

The image file requires the extension .tif for its type to be
recognized correctly. If a file exists with the .tif extension
replaced by .uzn, then it will be interpreted as a UNLV-style zone
file. (See ISRI at UNLV (Information Science Research Institute at the
University of Nevada, Las Vegas) for details of the zone files.)

ENDQUOTE (again, Wikipedia on Tesseract)
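
For anyone who wants to script this, here is a minimal sketch of a
wrapper around the command line quoted above. It is only an
illustration: the function names (zone_file_for, build_command,
run_ocr) and the check-before-running logic are my own assumptions,
not part of Tesseract, and it simply assumes the tesseract binary is
installed and on the PATH.

```python
# Sketch of a wrapper around the usage quoted above:
#   tesseract <image.tif> <output> batch
# The helper names and the pre-flight checks are hypothetical, for illustration.
import subprocess
import sys
from pathlib import Path


def zone_file_for(image_path):
    """Where a UNLV-style zone file would sit: same name, .uzn extension."""
    return Path(image_path).with_suffix(".uzn")


def build_command(image_path, output_base):
    """Build the argument list for the quoted usage; reject non-.tif input."""
    if Path(image_path).suffix != ".tif":
        raise ValueError(f"expected a .tif image, got: {image_path}")
    return ["tesseract", str(image_path), str(output_base), "batch"]


def run_ocr(image_path, output_base):
    """Run Tesseract on one image (requires tesseract on the PATH)."""
    zone_file = zone_file_for(image_path)
    if zone_file.exists():
        # Per the quoted text, a matching .uzn file is read as a zone file.
        print(f"note: {zone_file} will be interpreted as a zone file")
    subprocess.run(build_command(image_path, output_base), check=True)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        run_ocr(sys.argv[1], sys.argv[2])
    else:
        print("usage: python ocr_sketch.py <image.tif> <output>")
```

Building the argument list separately from running it means the .tif
check can be tried out without Tesseract installed.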

Some links:

* Tesseract OCR (project page on Google Code)
  http://code.google.com/p/tesseract-ocr/

* Information Science Research Institute at the University of Nevada, Las Vegas
  http://www.isri.unlv.edu/

* http://www.ocropus.org/ - A high-performance handwriting recognizer developed
  in the mid-90's and deployed by the US Census Bureau, plus a novel
  high-performance layout analysis framework, currently using Tesseract as
  the OCR plugin.

* http://tesseract-ocr.repairfaq.org/ - C/C++ structure of Tesseract extracted
  from Doxyfied source code (based on Tesseract V1.03)

* Archivista Box - A complete GPL document management system
  based on Tesseract and Linux.
  http://sourceforge.net/projects/archivista

* Some patches for training on a 64-bit machine.
  http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html

Thanks to all those who made this tool available. It means so much
(and so many words) to us. FN
--
Frederick Noronha http://fn.goa-india.org Ph +91-832-2409490
Links from Goa: http://goalinks.livejournal.com/

* Derek is a young engineering student... we have a symbiotic
relationship. I try to give him as much as I can to read about FLOSS,
and he maintains my comps superbly. Sometimes (often) we can't
understand each other's languages... my needs being very practical and
real-world, and his knowledge being hi-tech. He's from my village,
Saligao.

