Code-page conversion table from UTF-8 to single-byte code pages: description or example

#1
Hello,
I am facing the following problem:
1. An AppServer communicating with a third-party web service receives special (multi-byte) characters in UTF-8 encoding that cannot be converted by Progress to the internal code page 1252.
2. The AVM throws error 11395.
3. I am aware of the concept code-page conversion tables and that in special cases it is possible to define own code-page conversion tables.
It appears, though, that the documentation for Progress 11.7.3 contains only a hint that if you want to create a code-page conversion table from UTF-8 to a single-byte code page, you should set the "TYPE" value to 20. The "TYPE" statement is also described too briefly to be understood: "The TYPE statement specifies a conversion algorithm. For a conversion between two single-byte code pages, set TYPE to 1..." What this "conversion algorithm" is, how it works, or at least a description of the possible values is missing.
4. The documentation also does not clearly state that it is impossible to create a code-page conversion table that converts from UTF-8 to a single-byte code page.
I would like to ask for help with creating a code-page conversion table for conversion from UTF-8 to 1252/ISO-8859-1, either with instructions or an example, or both.
Thanks in advance!
 
#2
You cannot universally convert from UTF-8 to a "single-byte code page". You could convert a subset, but there are obviously going to be square pegs that will not fit into round holes. For instance, what single-byte character would you like the unicorn emoji 🦄 (aka U+1F984, or 0xF0 0x9F 0xA6 0x84 in UTF-8) to map to?
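Tom's point can be demonstrated outside of Progress. The sketch below is Python, used here purely as an illustration of why a strict UTF-8 to cp1252 conversion must fail for characters the target code page lacks; the behavior mirrors what Progress reports as error 11395:

```python
# U+1F984 (the unicorn emoji) has no slot in the cp1252 code page,
# so a strict encode raises rather than silently guessing a mapping.
s = "\U0001F984"
try:
    s.encode("cp1252")  # strict error handling by default
except UnicodeEncodeError as e:
    print("cannot encode:", e.reason)  # → cannot encode: character maps to <undefined>
```

Any conversion table from UTF-8 to a single-byte code page can therefore only ever cover a chosen subset of Unicode, plus an explicit rule for everything else (fail, drop, or substitute).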
 
#3
Hello Tom, thank you for the answer. My intention is not to define a proper mapping for every UTF-8 character, but rather to have a workaround for the rare cases when I have to convert something like "ș" to 1252. In this concrete example I would like to map the "ș" character to a plain "s".
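For this specific kind of fallback (an accented letter to its base letter), Unicode decomposition already does most of the work. A minimal Python sketch, again only as an illustration of the idea rather than a Progress conversion table: "ș" (U+0219) decomposes under NFD into "s" plus a combining comma below, and dropping the combining mark leaves the plain "s".

```python
import unicodedata

def fold_to_base(text: str) -> str:
    """Decompose accented characters (NFD) and drop combining marks,
    so e.g. 'ș' (U+0219) falls back to plain 's'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(fold_to_base("ș"))         # → s
print(fold_to_base("Timișoara"))  # → Timisoara
```

Note this only helps for characters that decompose into base letter + mark; symbols like emoji still need a separate substitution rule.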
 

RealHeavyDude
#4
Please don't get me wrong, but I doubt you will find anybody who has rolled their own code-page conversion.
I did have a look into it some 10 years ago, but in the end I did not roll my own: back then I decided to convert the backend to UTF-8, and it was a good decision.

The question is whether the data containing UTF-8 characters is significant, or only appears in comments or descriptions.

When it is significant (for example, a name that identifies a record; I remember an EU law requiring public registrations to store, display and print names as they are actually written, which in Austria, for example, required support for Eastern European code pages, or UTF-8. Have a look at a phone book from Vienna ...), then you will most likely also have the requirement to store, display and print it correctly. Your only option that holds water then is to convert your backend to UTF-8.

When it is non-significant, you might be able to just strip unsupported characters or replace them. I know this is not a perfect solution, but it might be good enough. You could make the web service call from a special AppServer instance that runs with code page UTF-8, so that you are able to receive the data, and then run your string manipulation on the UTF-8 strings so that they contain only characters supported by the 1252 code page you are using.
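The "receive as UTF-8, then scrub before handing over to the 1252 side" approach can be sketched as follows. This is Python for illustration only (the real implementation would be ABL running on the UTF-8 AppServer); the function name and the "?" replacement are assumptions, not anything from the Progress docs:

```python
def scrub_for_cp1252(text: str, replacement: str = "?") -> str:
    """Keep characters that cp1252 can represent; substitute the rest.
    A trial encode per character decides whether it survives."""
    out = []
    for ch in text:
        try:
            ch.encode("cp1252")
            out.append(ch)
        except UnicodeEncodeError:
            out.append(replacement)
    return "".join(out)

print(scrub_for_cp1252("naïve ș 🦄"))  # → naïve ? ?
```

The same effect can be had in one step with `text.encode("cp1252", errors="replace").decode("cp1252")`, which also substitutes "?"; the explicit loop just makes the per-character decision visible.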
 
#5
Hi, thank you for the tips. At the moment I cannot consider changing the code page of the backend. The characters are significant, but they may be replaced with other (supported) characters. I already have a mapping solution where I take the data with special characters, map them to supported characters, and pass the processed data on. However, this solution depends on the software implementing it, so there is a risk of bugs and I will have to maintain it. That is why I had the idea of using Progress's code-page conversion table mechanism instead.
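The table-driven mapping described above can be kept small and declarative, which reduces the maintenance risk somewhat. A hedged Python sketch of the idea (the entries shown are assumed examples for Romanian comma-below letters, not a complete or authoritative table):

```python
# Hypothetical fallback table: each code point that 1252 cannot hold
# is mapped to an approximation that it can.
FALLBACK = str.maketrans({
    "\u0219": "s",   # ș -> s
    "\u021B": "t",   # ț -> t
    "\u0218": "S",   # Ș -> S
    "\u021A": "T",   # Ț -> T
})

def apply_fallback(text: str) -> str:
    """Apply the substitution table in one pass."""
    return text.translate(FALLBACK)

print(apply_fallback("București"))  # → Bucuresti
```

Keeping the mapping in one data structure (rather than scattered string replacements) makes it the application-level analogue of the conversion table the original question asked for.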
 