
Better Method of OCRing Old Documents

ldkraemer

Veteran Member
Joined
Mar 14, 2013
Messages
2,632
Location
Chaffee, MO
Try as I might, I can't OCR a PDF of an old manual directly and get results as good as I get by printing it first and scanning the printout.

My Method:

Print all the PDF pages you wish to OCR at HIGH QUALITY. Scan each printed page at 600 DPI
as a color document, then save it as a .tiff file. This creates a file that is
5048 x 7019 pixels, 24 Bit, 600 DPI.

Open each .TIFF document in RasterVect (www.rastervect.com) and do the 1BPP conversion at -30 correction.
The file is now 5048 x 7019 pixels, 1 Bit, 600 DPI.
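
For a sense of what those numbers mean, here is a small arithmetic sketch. It only uses the figures quoted above (5048 x 7019 pixels, 600 DPI, 24 Bit versus 1 Bit); the function name raw_bytes is made up for the example, and the sizes are uncompressed, so real .tiff files will differ depending on compression.

```shell
#!/usr/bin/env bash
# Scan geometry quoted in the post.
width=5048 height=7019 dpi=600

# Uncompressed image size in bytes at a given bit depth
# (each row padded up to a whole number of bytes).
raw_bytes() {
    local bits=$1
    echo $(( (width * bits + 7) / 8 * height ))
}

# Physical page size implied by the pixel dimensions and the DPI.
awk -v w="$width" -v h="$height" -v d="$dpi" \
    'BEGIN { printf "page size: %.2f x %.2f in\n", w / d, h / d }'

echo "24-bit colour: $(raw_bytes 24) bytes uncompressed"
echo "1-bit mono:    $(raw_bytes 1) bytes uncompressed"
```

The 1 Bit image carries 24 times less raw data than the 24 Bit scan, which is one reason the 1BPP conversion is a convenient input for the OCR step.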

Save the document as a .TIF, repeat for all other .TIFF files.

Open each TIF in TextBridge Classic 2.0 and OCR the complete document using the Newspaper article type,
then save the .txt file.

The OCR should come out at 94% accuracy or better. Correct any errors, then reformat as needed.
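
One reformatting chore that can be scripted is rejoining words that the manual hyphenates across line ends (such as "oscil-" / "lator" in the sample page later in this thread). The following is only a sketch: the function name dehyphenate is made up, and it assumes that every line ending in a letter followed by "-" is a split word, which will wrongly join genuine hyphenated terms like "fold-out" if they happen to break at the hyphen.

```shell
#!/usr/bin/env bash
# dehyphenate [FILE...]
# Rejoin words split by a trailing hyphen at the end of a line.
# The fragment before the hyphen is moved down onto the next line.
dehyphenate() {
    awk '
        {
            line = carry $0          # prepend any word fragment carried over
            carry = ""
            if (line ~ /[A-Za-z]-$/) {
                start = match(line, /[^ \t]+-$/)
                carry = substr(line, start, RLENGTH - 1)   # fragment without "-"
                line = substr(line, 1, start - 1)
                sub(/[ \t]+$/, "", line)
                if (line != "") print line
            } else {
                print line
            }
        }
        END { if (carry != "") print carry "-" }
    ' "$@"
}

# Example: dehyphenate 14a.txt > 14a-joined.txt
```

Check the result by eye afterwards, since the script cannot tell a split word from a real hyphen.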

Code:
cat {14..50}a.txt > Techref1978.txt
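
If one of the page files is missing, the cat one-liner above prints an error for it but still writes the rest, so the combined file can end up short without being obvious. A slightly more careful sketch follows; the function name concat_pages and its argument order are made up for the example.

```shell
#!/usr/bin/env bash
# concat_pages FIRST LAST SUFFIX OUTPUT
# Concatenate <n><SUFFIX> for n = FIRST..LAST in numeric order,
# refusing to write OUTPUT if any page file is missing.
concat_pages() {
    local first=$1 last=$2 suffix=$3 output=$4
    local n missing=0 files=()
    for (( n = first; n <= last; n++ )); do
        if [[ -f "${n}${suffix}" ]]; then
            files+=("${n}${suffix}")
        else
            echo "missing: ${n}${suffix}" >&2
            missing=1
        fi
    done
    # Stop before writing anything if a page is absent or the range is empty.
    (( missing == 0 && ${#files[@]} > 0 )) || return 1
    cat -- "${files[@]}" > "$output"
}

# Same result as the one-liner when every page exists:
# concat_pages 14 50 a.txt Techref1978.txt
```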


Larry
 
You'll probably find the print+scan process is effectively adjusting the contrast and reducing the noise in the image. If you have a good image editor you can almost certainly find settings that will do the same.

Personally I use ImageMagick as it's command-line based, so once you find the right settings you can automate the process, taking an input PDF, applying all your filters, spitting out an "OCR-ready" PDF, then continuing with the rest of your process.
 
I'll second ImageMagick. I use it to pre-process scans before converting to an OCRed PDF. It is a very complicated program but it is also very powerful. It can automatically adjust highlight/midtone (they have some other term for it), deskew, remove noise, grayscale, crop... there is probably an option to make toast in there somewhere.
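
As a rough illustration of the kind of ImageMagick pre-processing described in the two replies above, here is a sketch of a one-page clean-up command. The function name prep_for_ocr, the DRY_RUN switch, the option order and the 40% and 60% values are all assumptions made for the example, not settings anyone in the thread posted; expect to tune them per document. On ImageMagick 7 the command is normally magick rather than convert.

```shell
#!/usr/bin/env bash
# prep_for_ocr INPUT.pdf OUTPUT.tif
# Rasterize one PDF page and clean it up for OCR.
# Set DRY_RUN=1 to print the command instead of running it.
prep_for_ocr() {
    local input=$1 output=$2
    local cmd=(
        convert
        -density 300        # rasterize the PDF at 300 DPI
        "$input"
        -colorspace Gray    # drop colour information
        -normalize          # stretch contrast to the full range
        -despeckle          # reduce isolated noise pixels
        -deskew 40%         # straighten slightly rotated pages
        -threshold 60%      # assumed black/white cut-off, tune per document
        "$output"
    )
    if [[ ${DRY_RUN:-0} == 1 ]]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Example: DRY_RUN=1 prep_for_ocr 14.pdf 14y.tif
```

Keeping the options in one array makes it easy to try a different setting and rerun the whole batch.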
 
The problem with using a PDF is that most PDFs are 72 DPI, but Tesseract and TextBridge Classic need at least 300 DPI to get a good conversion.

Convert (an ImageMagick program) will make a 300 DPI TIF file that Tesseract can handle.

Code:
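# Rasterize the PDF page at 300 DPI as a 1-bit image
convert -density 300 -monochrome 14.pdf 14y.tif
# -psm needs a mode number; 3 (fully automatic) is assumed here. Tesseract 4+ spells it --psm
tesseract 14y.tif test-14y -psm 3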

But as you can see, its OCR isn't as accurate as it should be:

9%

System Clock

The System Clock is shown on Sheet 2 of the
foid-out Schematics at the hack of this book.
YT is a “23.6445 iVin. fundamentaicut crystai.
it is in a series resonant circuit consisting of two
inverters. 242. pins 3 and 2, and 3 and 4. form
two inverting amplifiers. Feedback between the
inverters is suppiied by C43. 3 47 pF capacitor.
R45 and R52 force the inverters used in the
osciilator to operate in their linear region.

The waveform at pin 5 of 242 wiil resemble a
Sine wave at 19.6445 MHZ. The osciilator shouid
not be measured at this pornt. however. due to
the ioading effects test equipment wouid have at
this node. Z42, pin 6, is the output of the oscil-
lato:r buffer. Clock measurements may be made
at this pornt. The output of the buffer is applied
to three main sections: the CPU timing circuit,
the video divider chain. and the video processmg
circuit.

My posted method, using TextBridge Classic 2.0, produced:

System Clock

The System Clock is shown on Sheet 2 of the
fold-out Schematics at the back of this book.
Y1 is a 10.6445 MHz. fundamental-cut crystal.
It is in a series resonant circuit consisting of two
inverters. 242. pins I and 2. and 3 and 4. form
two inverting amplifiers. Feedback between the
inverters is supplied by C43, a 47 pF capacitor.
846 and 852 force the inverters used in the
oscillator to operate in their linear region.

The waveform at pin S of 742 will resemble a
sine wave at 10.6445 MHz. The oscillator should
not be measured at this point, however, due to
the loading effects test equipment would have at
this node. 742. pin 6! is the output of the oscil-
lator buffer. Clock measurements may be made
at this point, The output of the buffer is applied
to three main sections: the CPU timing circuit.
the video divider chain, and the video processing
circuit,


IrfanView can also be used with the CAD Plugin (OCR_KADMOS) to make a good OCR'd text document.


For anyone interested, here is my method of creating a .DOC from a multipage PDF using IrfanView with OCR_KADMOS:

1. Save the multipage PDF to a subdirectory, then burst the PDF into single pages with:

Code:
cd /path/to/subdir
pdftk m1ps.pdf burst

2. Convert the PDF pages to .tif format using the ImageMagick convert program:

Code:
convert -density 300 pg_0001.pdf pg_0001.tif
convert -density 300 pg_0002.pdf pg_0002.tif

3. Start IrfanView (with the CAD plugin OCR_KADMOS installed) and open the .tif.
Draw a box around both sections of text on the page to be OCR'd, extending the bottom of the
box about two lines further down.

4. Start the OCR module.
Draw a box around the left half of the text to be OCR'd and save the .txt file, then repeat for the right half and exit the OCR plugin.

5. Combine the two .txt files and clean them up.
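
Steps 1, 2 and 5 above can be scripted; steps 3 and 4 are done by hand in the IrfanView window. The sketch below assumes pdftk's default burst naming (pg_0001.pdf, pg_0002.pdf, ...), as used in the commands above. The function names, the DRY_RUN switch and the _left/_right text file names are made up for the example, so substitute whatever names you saved from the OCR plugin.

```shell
#!/usr/bin/env bash
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [[ ${DRY_RUN:-0} == 1 ]]; then echo "$*"; else "$@"; fi
}

# Steps 1 and 2: burst the PDF, then rasterize every page at 300 DPI.
burst_and_convert() {
    local pdf=$1 page
    run pdftk "$pdf" burst
    for page in pg_*.pdf; do
        [[ -e $page ]] || continue    # glob matched nothing
        run convert -density 300 "$page" "${page%.pdf}.tif"
    done
}

# Step 5: join the left-half and right-half text for one page.
# BASE_left.txt and BASE_right.txt are hypothetical file names.
combine_halves() {
    local base=$1
    cat -- "${base}_left.txt" "${base}_right.txt" > "${base}.txt"
}

# Example:
#   cd /path/to/subdir
#   burst_and_convert m1ps.pdf
#   ... OCR each .tif in IrfanView, saving the two halves ...
#   combine_halves pg_0001
```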

The OCR_KADMOS plugin module is about 93% accurate, as shown here:

System Clock
The Svstem Clock is shown on Sheet 2 of the
fold-out Schematlcs at the back of this book.
Yl is a 10.6445 MHz. fundamental-cut crvstal.
It is in a series resonant circuit consisting of two
inverters. Z42. pins 1 and 2. and 3 and 4. form
tvvo jnverting amplifiers. Feedback between the
inverters is supplied by C43, a 47 pF capacitor.
R46 and R52 force the inverters used in the
oscillator to operate tn their Iinear region.

The Wave~Orm at pin 5 of Z42 will resemble a
sine wave at 10.6445 MHz. The oscillator should
not be measured at this point. however. due to
the Ioading effects test equipment would have at
this node. Z42, p~n 6. is the output of the oscil-
Iator buffer. Clock measurements may be made
at this po~nt. The output of the birffer is applied
to three main sections: the CPU timing cjrcu~t,
the video divider chain„ and the video processlng
circuit~
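
To put a number on how the different OCR outputs in this thread compare, one option is to count how many words of a hand-corrected copy survive unchanged in the raw OCR text. This is a sketch of one possible measure, not how the percentages quoted above were worked out; the function names are made up, and it ignores extra words that the OCR inserted.

```shell
#!/usr/bin/env bash
# Print the words of a file, one per line.
words() {
    tr -s '[:space:]' '\n' < "$1" | grep .
}

# word_accuracy OCR.txt CORRECTED.txt
# Percentage of words in the corrected text that the OCR got right,
# judged by a word-by-word diff (changed or dropped words count as wrong).
word_accuracy() {
    local ocr=$1 ref=$2 total wrong
    total=$(words "$ref" | wc -l)
    (( total > 0 )) || { echo "reference is empty" >&2; return 1; }
    wrong=$(diff <(words "$ref") <(words "$ocr") | grep -c '^<')
    awk -v t="$total" -v w="$wrong" \
        'BEGIN { printf "%.1f\n", 100 * (t - w) / t }'
}

# Example: word_accuracy kadmos-raw.txt corrected.txt
```

A word-level score is harsher than a character-level one, because a single wrong letter spoils the whole word.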


Larry
 