Cecil
19+ years of Progress programming and still learning.
I have a business requirement to search PDF documents from within our web application. I'm using Google's command-line utility called Tesseract, which extracts text from a PDF/TIFF and produces a simple text file (which could be as large as 137 KB for a 200-page PDF document). What I would like to know is: what is the best way to store the text results in the database?
Ideas I've had are the following:
- Try to store 32K chunks of the text file and have a word index on the field. CLOB data type fields can't be indexed.
- Read each line of the text file and store it as a separate record. One record per line with word-index enabled.
- Read each word of the text file and store each word as a separate record. One record per word.
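
To make the first idea concrete, here is a rough sketch of splitting the extracted text into chunks that stay under a size limit without cutting a word in half, so each chunk could be stored in its own word-indexed record. It is written in Python purely for illustration rather than ABL; the function name `chunk_text` and the default `max_bytes` value are assumptions, and the real limit should leave headroom under whatever the field or record size allows.

```python
def chunk_text(text, max_bytes=32000):
    """Split text into chunks of at most max_bytes UTF-8 bytes,
    breaking only at whitespace so no word is split across chunks.

    Note: a single word longer than max_bytes ends up alone in an
    oversized chunk; this sketch does not try to split such a word.
    """
    chunks = []        # finished chunks
    current = []       # words collected for the chunk being built
    current_size = 0   # byte size of " ".join(current)

    for word in text.split():
        word_size = len(word.encode("utf-8"))
        # A joining space is only needed when the chunk already has a word.
        extra = word_size + (1 if current else 0)

        if current and current_size + extra > max_bytes:
            # Adding this word would exceed the limit: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
            current_size = word_size
        else:
            current.append(word)
            current_size += extra

    if current:
        chunks.append(" ".join(current))
    return chunks
```

The same loop covers the other two ideas with small changes: iterating over `text.splitlines()` gives one record per line, and iterating over `text.split()` gives one record per word. Note that this sketch collapses runs of whitespace and line breaks into single spaces, which should not matter for word searching but does mean the stored text is not a byte-for-byte copy of the Tesseract output.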
Something else to consider is that the information needs to be secure, so the data will be encrypted using OpenEdge TDE (Transparent Data Encryption).
INFO:
OS: Linux CentOS 6.2
OE: 11.1