PDF/TIFF/OCR Text Search Best Practices

Cecil

19+ years progress programming and still learning.
I have a business requirement to search PDF documents from within our web application. I'm using Google's command-line utility called tesseract, which extracts text from a PDF/TIFF and produces a simple text file (which could be as large as 137KB for a 200-page PDF document). What I would like to know is: what is the best way to store the text results in the database?

Ideas I've had are the following:
  1. Try to store 32K chunks of the text file and have a word index on the field. CLOB data type fields can't be indexed.
  2. Read each line of the text file and store it as a separate record. One record per line with word-index enabled.
  3. Read each word of the text file and store each word as a separate record. One record per word.
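A rough sketch of how the three options might split the OCR output before it is loaded into the database. It is written in Python purely for illustration (not ABL), and the function names, the record shapes, and the 32,000-byte limit are all assumptions rather than anything OpenEdge prescribes.

```python
MAX_CHUNK_BYTES = 32_000  # assumed limit, a little under 32K


def chunk_text(text, max_bytes=MAX_CHUNK_BYTES):
    """Idea 1: yield chunks of whole words, each at most max_bytes in UTF-8.

    A single word longer than max_bytes is still emitted as its own
    (oversized) chunk; real code would need to handle that case.
    """
    current = []
    size = 0
    for word in text.split():
        word_bytes = len(word.encode("utf-8"))
        # +1 accounts for the space joining this word to the previous one
        extra = word_bytes + (1 if current else 0)
        if current and size + extra > max_bytes:
            yield " ".join(current)
            current, size = [], 0
            extra = word_bytes
        current.append(word)
        size += extra
    if current:
        yield " ".join(current)


def line_records(text):
    """Idea 2: one (line_number, line) record per non-blank line."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]


def word_records(text):
    """Idea 3: one (position, lowercased word) record per word."""
    return [(pos, word.lower()) for pos, word in enumerate(text.split(), start=1)]
```

Splitting on word boundaries matters for option 1: a chunk cut in the middle of a word would leave two fragments that a word index could never match. Options 2 and 3 keep a line number or word position so a hit can be traced back to roughly where it occurred in the document.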

Something else to consider is that the information needs to be secure, so the data will be encrypted using OpenEdge Transparent Data Encryption (TDE).


INFO:
OS Linux CentOS 6.2
OE: 11.1
 

arronlee

New Member
Hi, Cecil.
Thanks for sharing. I wonder whether the OCR toolkit I am using now could do the PDF/TIFF search work for me. Do I need something else, like a third-party PDF SDK, to help with that? I am almost a beginner with this problem. Any suggestion would be appreciated. Thanks in advance.

Best regards,
Arron
 

Cecil

19+ years progress programming and still learning.
This feature of the project was cancelled, and no development was done.
 