Large Table Dump & Load

Hello Everyone,
I have a large table which is about 300 GB in size - like an audit table. We want to do a D&L for this table separately, not as part of our migration. I assume we will not be able to do the D&L in parallel with users writing data to this table.

What we are trying to do here is:
1. Load the schema which will include this large table
2. D&L for all tables other than the large one
3. Index rebuild for all tables
4. Start the DB
5. Users will connect to the application and start using it
6. Data will also be written to the large table, which is empty at this point
7. In parallel, I have to load the 300 GB of data into that large table

Will table partitioning help? Or is there a better way to achieve this kind of scenario?

If we don't have a better way then we are planning to follow the plan below:
1. We can rename the large table (Let's say tbl will be renamed as artbl)
2. We will now have artbl which is 300 GB and tbl which is size 0
3. Users will be writing data to tbl
4. Users will be reading data from both artbl and tbl to fetch the right data

Note - to get this working we would need to modify 45 programs, so rather than going with the above approach we wanted to see if there is a better solution. (A rough sketch of the two-table read is shown below.)
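For what it's worth, a minimal ABL sketch of what that two-table read might look like, assuming a hypothetical key field order-num and a couple of made-up audit fields (the real schema and program logic will differ):

/* Hypothetical sketch: fetch audit rows for one key from both the      */
/* renamed 300 GB table (artbl) and the new, empty replacement (tbl).   */
/* Field names are assumptions, not the real schema.                    */
DEFINE INPUT PARAMETER ipcOrderNum AS CHARACTER NO-UNDO.

/* historical rows live in the renamed table */
FOR EACH artbl NO-LOCK
    WHERE artbl.order-num = ipcOrderNum:
    DISPLAY artbl.order-num artbl.audit-date artbl.audit-note.
END.

/* rows created after the cutover live in the replacement table */
FOR EACH tbl NO-LOCK
    WHERE tbl.order-num = ipcOrderNum:
    DISPLAY tbl.order-num tbl.audit-date tbl.audit-note.
END.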
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
I feel like you have described two possible solutions without fully articulating either the business activity or the problem you are trying to solve. It seems you are trying to avoid D&L of your large table. Why is that exactly? What do you need to accomplish? What is your downtime window? What has your testing demonstrated so far?

I have a large table which is about 300 GB in size - like an audit table. We want to do a D&L for this table separately, not as part of our migration.
Can you give more details about the "migration"? Is this a move from one server to another? Is it also a change of OS type or version? Which Progress/OE version is in use in the source and target environments? Are the servers at the same physical location?

I assume we will not be able to do the D&L in parallel with users writing data to this table.
You can't give users access to the database during the D&L, otherwise you lose the ability to validate record counts. Also, if they can do writes on the source during the dump then some of those records may never make it to the target, causing data loss.

However, someone may decide, based on a data-retention policy or their knowledge of the application, that you no longer require, or don't immediately require, all of the data in a given table. You could dump a subset of that data and load it. Or archive/purge some of that data prior to the dump and compact/rebuild the indexes on that table, and then dump all of what remains. This is the type of scenario where table partitioning could be useful.

Will table partitioning help?
Possibly. But it is really a question for a conversation with both a knowledgeable application developer and a DBA. Table partitioning allows you to transform a table from one storage object, residing in one storage area, to multiple storage objects, one per partition, potentially spanning several areas, each of which can be dumped and loaded independently. Partitioning also allows for maintenance activities like marking a partition read-only, pruning a partition, etc. So instead of having to reduce the size of a table by programmatically exporting/deleting individual records in some date range and then rebuilding or compacting the indexes, you can instead just remove a partition and all of the data in it, in one operation.

Of course that requires preparation. You must have purchased and installed the TP license (OE 11.4+), created your partition policies, and partitioned your data in whatever manner makes sense for your data and your application, and based on the benefit(s) you believe partitioning will provide (e.g. maintenance benefit, improved query performance, reduced I/O contention, etc.). There are several white papers and conference presentations on progress.com and pugchallenge.org that cover partitioning types and strategies in some detail.

I have gone through that material but so far I haven't used TP myself. Be aware though that unlike TDE, it isn't fully transparent to the application. Implementing it for a given table, once the partitioning scheme has been chosen, requires a review of the code that does creates and updates on that table. A transaction that does a create must assign values to all of the fields that make up the partition key, otherwise the client will throw a runtime error, as the storage engine won't have enough information to know where to store the record. So, for example, if your application assigns a value to one of the two key fields in the create and then assigns a value to the other in a later update, it will have to be refactored to populate them both during the create.
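To make that concrete, here is a hedged ABL sketch; the table (audittbl) and partition key fields (region-cd, audit-date) are hypothetical, not taken from the original post:

/* With partitioning on the (hypothetical) key fields region-cd and      */
/* audit-date, every partition key field must be assigned within the     */
/* creating transaction; deferring one of them to a later update would   */
/* hit the runtime error described above, because the storage engine     */
/* cannot tell which partition should hold the record.                   */
DO TRANSACTION:
    CREATE audittbl.
    ASSIGN
        audittbl.region-cd  = "EMEA"   /* partition key field 1 */
        audittbl.audit-date = TODAY    /* partition key field 2 */
        audittbl.audit-note = "created by nightly batch".
END.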

I think that in theory, TP can be a very useful feature for some scenarios. But you should test with your data and application to prove to yourself that it will provide the benefit you believe it will.

One other point: you didn't mention backups or AI in your high-level steps. Maybe that's on purpose, as they are high-level. But sometimes detailed procedures get written from high-level outlines and important steps are forgotten. Always take a full backup after a load and before the idxbuild. If idxbuild crashes, restoring from backup is your only option. And idxbuild requires you to disable AI, so don't forget to re-enable it (and replication and AI daemon, if applicable) afterwards.
 

cj_brandt

Active Member
List your version of OE and the version you are migrating to. Are you going from one server to another?
Do you know the amount of time required to dump, load and rebuild the indexes on the large table?

We had a similar issue about 15 years ago on 9.1a when we were moving from one server to another and didn't have enough time to move one table. It was an audit table that was rarely read, and read performance didn't matter.

First weekend
Took an outage and installed a 4GL trigger on the table. Any write to the table - create or update - was logged to a file (a rough sketch of such a trigger is shown after the second-weekend steps below).
Allowed users back into the db.
Started dumping the table using 4GL scripts based on date, dumping all data older than the time we installed the trigger. Those scripts ran for 3 or 4 days.
Started loading the large table into the new database as the dump scripts finished.

Second weekend
Ran the D&L for the rest of the tables.
Loaded the data from the files created by the trigger.
One mistake in our logic: we had SQL that would write to the large table, which our 4GL triggers didn't catch, so we had to go back and track that data down.
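For reference, a rough ABL sketch of that kind of logging trigger; the table name, field layout and output path are invented, and the actual 9.1 trigger may well have looked different:

/* audittbl-write.p - schema WRITE trigger (fires on both create and update). */
/* Appends the written record to a flat file so those rows can be replayed    */
/* into the new database once the big historical load has finished.           */
TRIGGER PROCEDURE FOR WRITE OF audittbl.

OUTPUT TO VALUE("/dumpdir/audittbl-delta.d") APPEND.
EXPORT audittbl.
OUTPUT CLOSE.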

With OE 10 and 11 there are other options now to increase performance on a D&L.
 
- Progress 9.1E [Unix - Sun Solaris] to OE 11.6 [Linux].
- Server locations are different (pardon me for not giving more details on this, as I'm not supposed to)
- This is more like an audit table (only create and read)
- We purged the tables that we could. On this table, the client has confirmed they can't allow us to do any purge activity.

We split the tables across 4 files and started the dumps in parallel. That took 8 hrs on average, and we are planning to split the tables across 8 files for the next round of migration testing; based on the current results we expect the dump to complete in under 4 hrs. The one large table by itself easily takes 7 hrs - and we have a similar kind of issue with the load as well.
 

TheMadDBA

Active Member
If you know for a fact that existing records will not be updated, you can (very carefully) allow access to the database while you load in the historical records. Testing and validating the process is key. You also have to know that the application will behave properly without those historical records loaded. So.. proceed with extreme caution if you take this approach.

Otherwise.. you should look into your dump and load procedures to make sure you are using the most efficient process for your data/system. Are you doing proutil (binary) dumps or ASCII dumps? Single or multi threaded loads? Are you using the new idxbuild parameters to make sure that part of the load runs as fast as possible? Etc.

Table Partitioning is a waste of time and money in OE. You don't get the bulk operations like you would with other databases.
 

cj_brandt

Active Member
If there is a date field you can use to identify the records, and existing records are not updated, then just dump the records up to a certain date ahead of time. For example, dump all records more than 2 weeks old and start loading them into the new environment. Then, on the migration weekend, dump the last couple of weeks of records.
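A minimal ABL sketch of that kind of date-based pre-dump, assuming a hypothetical table audittbl with a date field audit-date (the names and the output path are made up):

/* Dump all history records older than a cutoff date to a .d file;    */
/* the remaining couple of weeks are dumped on the migration weekend. */
DEFINE VARIABLE dtCutoff AS DATE NO-UNDO.
dtCutoff = TODAY - 14.   /* e.g. everything more than 2 weeks old */

OUTPUT TO VALUE("/dumpdir/audittbl-history.d").
FOR EACH audittbl NO-LOCK
    WHERE audittbl.audit-date < dtCutoff:
    EXPORT audittbl.
END.
OUTPUT CLOSE.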
 
Thanks Everyone!

Single or multi threaded loads?
The load is multi-threaded and completes in 4 to 5 hrs, which is on par with the rest of the sessions that I have kicked off in parallel. The issue is mainly on the dump side: the dump of this table takes 15 to 16 hrs, which doesn't fit within the downtime window.

As @cj_brandt said, we have a date field on this history table. We are sure that we don't have any updates to this table, only create and read operations. We are thinking of taking a snapshot of the live db, pre-dumping/loading the large history table, and then moving the rest of the records during off-peak hours or weekends.
 