Question Remove Non Iso-8859-1 Characters

Hello Everyone,
Let's say we are getting a comma separated feed file on one of our NR job (Existing Program) and we are reading it line by line by storing it to a character variable and processing the entries. We are receiving non iso-8859-1 characters through this feed file which doesn't hurt our application but hurts our downstream application. We are planning to fix the program by just stripping off non iso-8859-1 characters. I am sure there should be a simple way to remove non iso-8859-1 characters. Please advise.
 

RealHeavyDude

Well-Known Member
I'll take it that you recieve a text file. Most likely your text file is not encoded in iso8859-1. Unfortunately there is no way to determine the code page of a text file that holds water unless you know the code page in which it was created. If you know the code page in which the file is encoded and that code page is compatible with iso8859-1 then you could use the convert option in your input from statement.

Do you know in which code page the file is encoded?

Stripping characters which are not present in iso8859-1 can also screw your data by removing important content. I would not recommend you to just remove that charaters. Instead I would recommend you to either insist that whoever produces the file does so using iso8859-1 or a compatible code page. Otherwise you will lose data.
 

RealHeavyDude

Well-Known Member
On Unix/Linux I am aware of a technique to determine whether a file is UTF-8 encoded or not. It doesn't hold water but so far it didn't fail on me. But, as far as I know that's the only code page you can check.

Code:
iconv -f utf-8 -t ucs-4 ${p_file} > /dev/null 2>&1
  returncode=$?
  if [ ${returncode} -eq 0 ] ; then
    logMessage "File ${p_file} is encoded in utf-8."
  else
    logMessage "File ${p_file} is encoded in iso8859-1."
  fi
 
Awesome, I will use this to determine if a file is UTF8 or not. In continuation I have one more question - how do I clean up the existing mess? - meaning if i query couple of fields in a table (values were populated in this fields from that feed file) I see junk characters along with the actual string in multiple rows - fixing this manually is going to be a pain. Is there a better way I can correct it via fix program? Please advise.
 

RealHeavyDude

Well-Known Member
Frankly, I don't have a solution at hand. The best solution would be to remove the data completely from the database and reload it with the correct code page setting - have a look at the convert option to the input from statement.

But, matters might get even worse - this is what happend to us and that's why I introduced the check for UTF-8: As some of the fields in our database were indexed we could neither update them or remove the records. We got a cryptic error message stating something like too many characters in an indexed field - don't remember the Progress error number. Nevertheless, we had to deactivate the affected indexes, remove the offending records and activate the indexes again.
 
Top