PDF/TIFF/OCR Text Search Best Practices

Cecil · Oct 30, 2012

I have a business requirement to be able to search PDF documents from within our web application. I'm using Google's command-line utility called tesseract, which extracts text from a PDF/TIFF and produces a simple text file (which could be as large as 137KB for a 200-page PDF document). What I would like to know is: what is the best way to store the text results in the database?

Ideas I've had are the following:

Try to store 32K chunks of the text file and have a word index on the field. CLOB data type fields can't be indexed.
Read each line of the text file and store it as a separate record. One record per line with word-index enabled.
Read each word of the text file and store each word as a separate record. One record per word.
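
For illustration only, the three ideas above could be sketched roughly as follows. This is Python rather than ABL, and every function name here is hypothetical; the 32,000-byte limit is an assumption standing in for whatever the real character-field limit is. It only shows how the OCR text might be shaped into records, not how OpenEdge would store or index them.

```python
import re

# Assumed limit, just under 32K; adjust to the real field limit.
MAX_CHUNK_BYTES = 32_000


def chunk_text(text, max_bytes=MAX_CHUNK_BYTES):
    """Idea 1: split text into chunks of at most max_bytes (UTF-8),
    breaking only on whitespace so no word is cut in half.
    Line breaks are collapsed to single spaces. A single word longer
    than max_bytes would still end up in an oversized chunk."""
    chunks, current, size = [], [], 0
    for word in text.split():
        word_bytes = len(word.encode("utf-8"))
        # One extra byte for the joining space when the chunk is non-empty.
        extra = word_bytes + (1 if current else 0)
        if current and size + extra > max_bytes:
            chunks.append(" ".join(current))
            current, size = [], 0
            extra = word_bytes
        current.append(word)
        size += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


def line_records(text):
    """Idea 2: one record per non-empty line, keeping the original line number."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]


def word_records(text):
    """Idea 3: one record per word occurrence as (position, lowercased word)."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    return list(enumerate(words, start=1))
```

The trade-off is mostly record count versus precision: chunks give the fewest records but only locate a hit to within a chunk, lines give a usable snippet per hit, and per-word records are the most granular but multiply the row count by the number of words in the document.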

Something else to consider is that the information needs to be secure, so the data will be encrypted using OpenEdge TDE.

INFO:
OS: Linux CentOS 6.2
OE: 11.1

arronlee · May 5, 2014

Hi, Cecil.
Thanks for sharing. I wonder whether the OCR toolkit I am using now could do the PDF/TIFF search work for me. Do I need something else, like a third-party PDF SDK, to help me with that? I am almost a complete beginner with this problem. Any suggestion will be appreciated. Thanks in advance.

Best regards,
Arron

Cecil · May 6, 2014

arronlee said:
Hi, Cecil.
Thanks for sharing. I wonder whether the OCR toolkit I am using now could do the PDF/TIFF search work for me. Do I need something else, like a third-party PDF SDK, to help me with that? I am almost a complete beginner with this problem. Any suggestion will be appreciated. Thanks in advance.

Best regards,
Arron

This feature of the project was cancelled, and no development was done.
