Answered D/L Unicode databases.

ron

Member
AIX 7.1 -- OE10.2B05

In the next few weeks I have to migrate a DB to be Unicode. I am aware that "-cpinternal UTF-8" is required when re-indexing a Unicode DB. Is there anything else that is different when a Unicode DB is dumped/reloaded?

Second question: the Progress docs say: "When an existing database is converted to UTF-8, the amount of storage required by each non-ASCII character increases. Roughly, each non-ASCII Latin-alphabet character converted to UTF-8 tends to require two bytes, while each double-byte Chinese, Japanese, or Korean character converted to UTF-8 tends to require three bytes." How should I interpret that? Does it mean that each character in a char field in a single-byte DB (isi8859-1) becomes two bytes after Unicode conversion? Or does it become three bytes?

Third question: the size limit of a field is 32K. Is this limit 32K physical bytes? Or 32K logical (i.e., Unicode) characters?

Ron.
 

Stefan

Well-Known Member
2. I would guess it varies depending on the content

Code:
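/* Compare the byte length and the character length of a UTF-8 string. */
DEFINE VARIABLE lcc AS LONGCHAR NO-UNDO.

/* Set the code page before assigning a value. */
FIX-CODEPAGE( lcc ) = "utf-8".

lcc = "hëllö".

MESSAGE
   LENGTH( lcc, "raw" ) SKIP        /* size in bytes */
   LENGTH( lcc, "character" )       /* size in characters */
VIEW-AS ALERT-BOX.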

3. I would guess 32k bytes
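
If the limit is indeed counted in bytes, as guessed above, then the number of characters that fit depends on the content. Here is a rough sketch in Python (used only because it runs without a Progress install). The 32000-byte figure and the max_chars helper are assumptions for illustration, not the documented Progress limit or any Progress API:

```python
# Sketch: how many characters of a given kind fit in a byte-based limit.
# LIMIT_BYTES is an assumed illustrative value, not the exact Progress limit.
LIMIT_BYTES = 32000


def max_chars(sample_char: str, limit_bytes: int = LIMIT_BYTES) -> int:
    """Return how many copies of sample_char fit in limit_bytes as UTF-8."""
    bytes_per_char = len(sample_char.encode("utf-8"))
    return limit_bytes // bytes_per_char


# ASCII, accented Latin, and CJK characters need 1, 2 and 3 bytes each.
for ch in ("a", "ë", "中"):
    print(repr(ch), max_chars(ch))
```

So a field holding only ASCII text would fit the full number of characters, while a field full of accented or CJK characters would hit a byte limit much sooner.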
 

RealHeavyDude

Well-Known Member
The code page (-cpstream) is very much relevant when you ASCII dump & load a database. But since you mention the index rebuild, I am guessing that you are planning to binary dump & load the database. If that's the case, the binary dump & load don't do anything with code pages; they dump and load the data as is. Therefore you need to ensure that the source and target databases have the same code page so that you don't corrupt your data.
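
To illustrate why moving bytes "as is" between code pages damages data, here is a small Python sketch (not Progress code). It only shows what happens when the same bytes are interpreted under a different code page; the sample string is made up:

```python
# Sketch: the same bytes read under the wrong code page give wrong characters.
text = "hëllö"

latin1_bytes = text.encode("latin-1")  # ISO8859-1: one byte per character
utf8_bytes = text.encode("utf-8")      # UTF-8: two bytes for each of ë and ö

# Reading UTF-8 bytes as ISO8859-1 "works" but silently produces garbage.
misread = utf8_bytes.decode("latin-1")

# Reading ISO8859-1 bytes as UTF-8 fails: 0xEB starts a multi-byte
# sequence that the following byte does not complete.
try:
    latin1_bytes.decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False

print(misread)
print(latin1_as_utf8_ok)
```

A character-based (ASCII) dump and load avoids this because the text is converted on the way out and on the way in, rather than copied byte for byte.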

How many bytes are needed per character very much depends on the character itself. You might want to have a look yourself here: http://en.wikipedia.org/wiki/UTF-8.
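
As a quick check of how the byte count varies, this Python sketch (an illustration outside Progress; utf8_len is a hypothetical helper name) prints the UTF-8 size of a few sample characters:

```python
# Sketch: UTF-8 byte count for a few sample characters.
samples = ["a", "ë", "€", "中"]


def utf8_len(ch: str) -> int:
    """Number of bytes needed to store ch in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_len(ch)} byte(s)")
```

ASCII stays at one byte, accented Latin letters take two, and the euro sign and common CJK characters take three.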

Heavy Regards, RealHeavyDude.
 

ron

Member
Thanks a lot! I didn't appreciate that UTF-8 preserves single-byte encoding for ASCII and uses 2 or 3 bytes for (for example) Chinese characters. That makes me feel a lot happier, because it means the conversion should not materially change the size of the database at all. It will only be a future issue as the database collects new data in other languages.
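
One way to sanity-check that expectation is to compare the encoded size of a sample of field values before and after conversion. This is only a Python sketch with made-up values and a hypothetical growth helper. Note that any accented ISO8859-1 characters already in the data do grow from one byte to two, so the result depends on how much non-ASCII text the database holds:

```python
# Sketch: estimate storage growth when ISO8859-1 text is re-encoded as UTF-8.
# The field values below are made-up sample data.
values = ["hello", "hëllö", "Zürich", "plain ASCII text"]


def growth(values):
    """Return total byte size as (ISO8859-1, UTF-8) for the given strings."""
    before = sum(len(v.encode("latin-1")) for v in values)
    after = sum(len(v.encode("utf-8")) for v in values)
    return before, after


before, after = growth(values)
print(f"ISO8859-1: {before} bytes, UTF-8: {after} bytes, "
      f"ratio: {after / before:.2f}")
```

Running the same comparison over a real character dump would give a rough growth estimate for the character data.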

From the looks of things I need to include -cpinternal, -cpstream, -cpcoll and -cpcase as startup parameters.

Ron.
 

urgent

New Member
I remember those days; I had to do a D/L to a different language.
This is what I had to do. In my example I'll refer to the old and new code pages as oldcode and newcode:
  • Copy the source database over to the location of the target db
  • Restore the database in a separate subdirectory using the new language code
  • Now start the database using the switches for the original db ("oldcode" in this example)
  • Perform a .df and .d dump of everything (*), using the new target db name for the .df file name and the UNDEFINED code page
  • Once all the steps above are complete, delete the db that you used for the D&L
  • Still in the same directory, I created a new empty db using procopy from empty8.db in the prolang directory for newcode
NOTE: At this point I have a new empty "newcode" database, plus a data dump and a .df dump from the "oldcode" database (marked UNDEFINED for now, since I selected undefined).
  • Procopy the database and move it to its final location using the "newcode" prolang codes
procopy example.db newdb.db -cpinternal newcode -cpstream newcode -cpcoll OPTION
Then
prorest newdb.db example.db -cpinternal newcode -cpstream newcode -cpcoll OPTION
 