Simple OCR
2009-04-22 03:22:07.003959+00 by
Dan Lyke
3 comments
Okay, got some older PDF files that are scans of paper documents. Tried:
convert /home/danlyke/Desktop/fund-summaries-part-1-thru-s-68.pdf fundsummaries.pnm
gocr fundsummaries.pnm
and got some leet speak-ish text out of it, but nothing great. Anyone done this?
comments in ascending chronological order:
#Comment Re: made: 2009-04-22 09:30:38.216725+00 by:
DaveP
Have you tried using Google's OCR? It's not super-speedy (you have to wait for googlebot to spider your
PDFs), but the quality seems to be pretty good:
http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/
#Comment Re: made: 2009-04-22 15:02:37.649896+00 by:
Dan Lyke
Aha! Thanks, Dave. I'll need to dig a bit to try to find the particular documents I'm interested in (they've already been indexed), and I did find the "Comprehensive Annual Financial Report" for the "City of pßtalurr^", but other than that the first few I pulled up do seem a bit better.
Also, it seems like I need a -density 300 -units PixelsPerInch in there somewhere, but my first pass created a 3 gig file that gocr wouldn't read, so if I go down that route there's clearly some fine-tuning that needs to happen.
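One possible way around the giant single file, sketched here under assumptions rather than tested against these particular PDFs: have convert write one grayscale image per page, then feed the pages to gocr one at a time. The function name ocr_pdf, the output directory, and the page-NNN file names are all made up for illustration. The -density setting goes before the input file so it applies when the PDF is rasterized.

```shell
#!/bin/sh
# Sketch only: OCR a scanned PDF page by page instead of as one huge PNM.
# Assumes ImageMagick's `convert` and `gocr` are installed and on PATH.
# ocr_pdf and the page-NNN naming are hypothetical, not from the original post.

ocr_pdf() {
    pdf=$1      # input PDF
    outdir=$2   # directory for per-page images and text

    mkdir -p "$outdir" || return 1

    # -density before the input sets the rasterization resolution;
    # the %03d in the output name asks for one numbered file per page.
    convert -density 300 -units PixelsPerInch "$pdf" \
        -colorspace Gray "$outdir/page-%03d.pgm" || return 1

    # Run gocr on each page image, writing the text next to it.
    for page in "$outdir"/page-*.pgm; do
        [ -e "$page" ] || continue
        gocr -i "$page" -o "${page%.pgm}.txt"
    done
}

# Example call (hypothetical paths):
# ocr_pdf fund-summaries.pdf ocr-out
```

Grayscale pages are roughly a third the size of RGB ones at the same resolution, which may help with memory, though whether gocr's output improves at 300 dpi on these scans is something to check rather than assume.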
#Comment Re: made: 2009-04-23 09:12:24.830951+00 by:
DaveP
Hey, whenever there's a "let someone else do the work" solution, I'm all over it. :)