Better way to delete 5 million rows

ricardos

New Member
Hi,
I am looking to delete 5 million rows (50% of a table). I tried a 4GL "delete where" and the process exited with:

[2024/11/24@00:53:26.491-0400] P-31364 T-0 F SRV 1: (854) ** Insufficient disk space to extend the before-image file

I already had 4 .bi files; we then added 3 more .bi files and ran a truncate for each one, but could not start up the DB. After adding another .bi extent, proserve finally started.
After several attempts we decided to search Google and found several suggestions, such as dump and load, or using -i. But what about using export/import to extract only the needed rows, then deleting and recreating the table and importing the rows we need back? What do you think?

Thanks a lot
 
You might start by rethinking the delete. What are you hoping to accomplish? Disk space is cheap. Data deleted is data lost.
 
NO! YOU DO NOT WANT TO USE THE -i PARAMETER!

That is suicidal when updating a production database. If *anything* goes wrong - like the error above - you are in an UNRECOVERABLE state if you used -i.

You might also want to post a bit more than "delete where" as a code snippet.

Having said that -- running out of space in the BI file strongly suggests that you have coded this as a single gigantic transaction. If you did that deliberately you should revisit that decision. If you did it accidentally then posting complete sample code would help us to identify the source of the problem.
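
For example, something like this (a hypothetical sketch, not your actual code) wraps the entire purge in one explicit transaction, so every deleted record must be held in the BI file until the whole thing commits:

Code:
do transaction:   /* anti-pattern: one giant transaction around the whole purge */
  for each customer exclusive-lock where custNum >= 100 and custNum < 200:
    delete customer.
  end.
end.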

You have additionally failed to specify what release of OpenEdge you are using, which can actually make a difference when deleting large amounts of data.

When deleting lots of data on an unknown release it is generally fastest to commit the transaction one record at a time. Be very careful with record and transaction scope. If you mistakenly scope the transaction to a larger block than expected (perhaps by way of a "free reference" outside of your loop) then you will blow up the BI file as above. It will also slow you down.

Something like this:

Code:
for each customer exclusive-lock where custNum >= 100 and custNum < 200:
  delete customer.
end.
Code:
compile delcust.p listing "delcust.lis"

delcust.p                             11/26/2024 09:46:21   PROGRESS(R) Page 1   

{} Line Blk
-- ---- ---
      1     for each customer exclusive-lock where custNum >= 100 and custNum < 
      1   1 200:
      2   1   delete customer.
      3     end.

delcust.p                             11/26/2024 09:46:21   PROGRESS(R) Page 2   

     File Name       Line Blk. Type   Tran            Blk. Label            
-------------------- ---- ----------- ---- --------------------------------
delcust.p               0 Procedure   No                                    
delcust.p               1 For         Yes                                   
    Buffers: s2k.Customer

Note that the procedure block does NOT have a transaction scoped to it. That is critical -- you want the transaction to be on each iteration of the FOR EACH.

If you are running a release that supports TABLE-SCAN then you may benefit by running this instead:

Code:
for each customer table-scan exclusive-lock where custNum >= 100 and custNum < 200:
  delete customer.
end.

Regarding startup parameters -- DO NOT USE -i. (Yes, I know I am repeating myself - it cannot be said enough when someone hints that they are thinking about it.) You may, however, consider using -r. You get almost all of the same benefits but without the problem of there being no recovery from any little thing. You can still shoot yourself in the foot, but it takes more work. So you really should also have a backup first, along with a robust implementation of after-imaging.
 
You might start by rethinking the delete. What are you hoping to accomplish? Disk space is cheap. Data deleted is data lost.

One potentially valid reason to delete data is that retaining records beyond your required legal responsibility may expose you to a requirement to produce those records in legal proceedings. If your company has a record retention policy for that reason then you should certainly follow that policy.

But deleting data in the expectation that it is going to make processing go faster indicates that you probably have a poor design with inadequate indexing.
 
Regarding startup parameters -- DO NOT USE -i.
Seconded. Do not use -i, ever. I wouldn't even mess around with -r if I were you. You need to know what you're doing and understand the consequences of your actions to be using such things in production.

But deleting data in the expectation that it is going to make processing go faster indicates that you probably have a poor design with inadequate indexing.
I have seen this many times in the past. Someone decides to delete a large number of rows from a large table as a proposed solution for some business problem, like a database file being too large, or a query taking too long to run.

Bad news: deleting records won't ever shrink database files, and it may make query performance worse.

The lesson: talk to knowledgeable people about the actual business problem you are trying to solve ("I can't do X", "Y takes too long", etc.), not about how to execute the imagined solution to the problem.

As Tom said, it is a good idea to run a compile listing of your procedure before running it again. If you had rapid BI growth the last time you ran this mass delete then one of two things happened:
  • The purge procedure had improper transaction scope and deleted everything in one big transaction, thus causing BI growth. Or the session was already in a transaction when that procedure ran.
  • The purge procedure is designed properly, with small transaction scope, but some other database user was already in a long-running transaction, preventing BI cluster re-use when the purge program filled all of the existing BI clusters with transaction notes, thus causing BI growth.
We aren't talking about a lot of data here. You should be able to delete five million records without restarting your database with special settings before and after the purge, assuming the database is even moderately well configured.
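
If you prefer to make the transaction boundaries explicit rather than relying on the FOR EACH iteration, one common pattern is to commit a fixed number of deletes per transaction. A minimal sketch, reusing Tom's customer example (the batch size of 500 is arbitrary):

Code:
/* sketch: delete in small explicit transactions so no single
   transaction pins the BI file */
define variable iInTxn as integer no-undo.
define variable lDone  as logical no-undo.

do while not lDone:
  lDone = true.
  do transaction:
    iInTxn = 0.
    for each customer exclusive-lock where custNum >= 100 and custNum < 200:
      delete customer.
      iInTxn = iInTxn + 1.
      if iInTxn >= 500 then
      do:
        lDone = false.
        leave.  /* leave the for each; the transaction commits and the outer loop starts a new batch */
      end.
    end.
  end.
end.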

For an investigation of the performance of any mass-write scenario in the database, it is important to know:
  • The full OpenEdge release number (from the version file in the OpenEdge install directory).
  • The database license (Workgroup or Enterprise). You can get this from the output of a showcfg command.
  • The BI block size and BI cluster size and AI block size (from proutil dbname -C describe).
  • Whether BIW, APW, and AIW processes are running (from promon dbname, R&D, 1, 4, 7).
  • The sizes of the BI and AI buffer pools (from promon dbname, R&D, 1, 9 and R&D, 1, 10).
  • Whether after imaging is enabled.
  • Whether OpenEdge Replication is enabled.
  • The structure of the database, and particularly the BI and AI areas and the areas that contain the table being purged and the index(es) involved in the query (from dbname.st).
Other information about the configuration and state of the database might be required for a more in-depth look at performance issues, but these are the basics for looking at mass writes.

If you do go ahead with the purge of this table, be sure to compact or rebuild the indexes on this table afterward. Index compaction can be run online or offline. Rebuild generally isn't required and is more invasive: it must be done offline and it requires disabling after imaging and reseeding replication.
 
Hi Thomas, thanks for the advice. We want to archive the old rows as history (before 2010).

Thanks Tom,
The -i option was something I saw at community.progress.com/s/article/P27293, and I agree not to use it.
Agreed, we have to handle the transaction properly in the code.
The code used by the developer was:
FOR EACH Historico-Pre-Pauta WHERE Fecha-Pauta < wkFecha .
ASSIGN wkCantidad[4] = wkCantidad[4] + 1.
DELETE FROM Historico-Pre-Pauta.
END.

The triggers were disabled beforehand, and the Fecha-Pauta index was used.
What do you think about using export and import? Is that not a better way?
The problem is that sometimes the users complain about application speed, so we decided to take out the old data; in addition we will restructure the areas, because they have not been changed since the DB was created more than 10 years ago. But there also seems to be another factor in the slowness, such as a network issue, because some days the users don't complain.

Thanks Rob too,
The full OpenEdge release number (from the version file in the OpenEdge install directory).
Progress OpenEdge Release 10.1B build 1216 on Linux tlvn-xca 2.6.9-55.ELsmp #1 SMP Fri Apr 20 17:03:35 EDT 2007.

The database license (Workgroup or Enterprise). You can get this from the output of a showcfg command.
Workgroup

The BI block size and BI cluster size and AI block size (from proutil dbname -C describe).
OpenEdge Release 10.1B as of Thu Dec 7 23:30:30 EST 2006
OpenEdge Database Description
Database Name : /disco2/vulcano/vulcano
Version : 150.0
Block Size : 1024
Largest Cluster : 0
Create Date : Sat May 28 12:43:23 2005
Last Open Date : Tue Nov 26 01:08:02 2024
Prior Open Date : Tue Nov 26 01:08:02 2024
Schema Change Date : Mon Nov 18 07:46:21 2024
Before Imaging information
Block Size : 8192
Cluster Size (16K Units) : 32
Last Open Date : Tue Nov 26 01:08:03 2024
Backup Information
Last Full Backup Date : Tue Nov 26 12:00:01 2024
Last Incremental Backup : *** Not yet performed ***
Database Features
ID Feature Active Details
---- --------------------------------- ------ -------
9 64 Bit DBKEYS Yes

Whether BIW, APW, and AIW processes are running (from promon dbname, R&D, 1, 4, 7).
Usr Name Type Start time Pid
0 root BROK 11/26/24 01:08 21312
1 root SERV 11/26/24 08:16 28359
2 root SERV 11/26/24 08:16 28361
3 root SERV 11/26/24 08:23 28502
4 root SERV 11/26/24 08:28 28583
5 root MON 11/26/24 15:08 2096

The sizes of the BI and AI buffer pools (from promon dbname, R&D, 1, 9 and R&D, 1, 10).
11/26/24 Status: BI Log
Before-image cluster age time: 0 seconds
Before-image block size: 8192 bytes
Before-image cluster size: 512 kb (524288 bytes)
Number of before-image extents: 8
Before-image log size (kb): 11246896
Bytes free in current cluster: 291719 (56 %)
Last checkpoint was at: 11/26/24 14:31
Number of BI buffers: 20
Full buffers: 0
11/26/24 Status: AI Log
15:12:20
*** After-image logging is not enabled. ***

Whether after imaging is enabled.
No
Whether OpenEdge Replication is enabled.
No

The structure of the database, and particularly the BI and AI areas and the areas that contain the table being purged and the index(es) involved in the query (from dbname.st).
The structure is the same as it was at the date of creation.
[root@tlvn-xca vulcano]# /usr/dlc/bin/prostrct list /disco2/vulcano/vulcano
Area Name: Control Area, Type 6, Block Size 1024, Extents 1, Records/Block 32, Cluster Size 1
Ext # 1, Type VARIABLE, Size 32 KByte, Name: /disco2/vulcano/vulcano.db
Area Name: Primary Recovery Area, Type 3, Block Size 8192, Extents 8
Ext # 1, Type FIXED , Size 150000 KByte, Name: /disco2/vulcano/vulcano.b1
Ext # 2, Type FIXED , Size 2096768 KByte, Name: /disco2/vulcano/vulcano.b2
Ext # 3, Type FIXED , Size 1500032 KByte, Name: /disco2/vulcano/vulcano.b3
Ext # 4, Type FIXED , Size 1500032 KByte, Name: /disco2/vulcano/vulcano.b4
Ext # 5, Type FIXED , Size 1500032 KByte, Name: /disco2/vulcano/vulcano.b5
Ext # 6, Type FIXED , Size 1500032 KByte, Name: /disco2/vulcano/vulcano.b6
Ext # 7, Type FIXED , Size 1500032 KByte, Name: /disco2/vulcano/vulcano.b7
Ext # 8, Type FIXED , Size 1500032 KByte, Name: /disco2/vulcano/vulcano.b8
Area Name: Schema Area, Type 6, Block Size 1024, Extents 1, Records/Block 32, Cluster Size 1
Ext # 1, Type VARIABLE, Size 13136 KByte, Name: /disco2/vulcano/vulcano.d1
Area Name: tablas, Type 6, Block Size 1024, Extents 16, Records/Block 64, Cluster Size 1
Ext # 1, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d1
Ext # 2, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d2
Ext # 3, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d3
Ext # 4, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d4
Ext # 5, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d5
Ext # 6, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d6
Ext # 7, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d7
Ext # 8, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d8
Ext # 9, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d9
Ext # 10, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d10
Ext # 11, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d11
Ext # 12, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d12
Ext # 13, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d13
Ext # 14, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d14
Ext # 15, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_7.d15
Ext # 16, Type VARIABLE, Size 1183440 KByte, Name: /disco2/vulcano/vulcano_7.d16

Area Name: indices, Type 6, Block Size 1024, Extents 6, Records/Block 1, Cluster Size 1
Ext # 1, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_8.d1
Ext # 2, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_8.d2
Ext # 3, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_8.d3
Ext # 4, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_8.d4
Ext # 5, Type FIXED , Size 102400 KByte, Name: /disco2/vulcano/vulcano_8.d5
Ext # 6, Type VARIABLE, Size 934592 KByte, Name: /disco2/vulcano/vulcano_8.d6

Sorry for my Spanglish, and thanks a lot for the responses.
 
If you want to archive the records, then you want to move them, not delete them.

But think carefully, since moving records to an archive table almost inevitably leads to queries across both tables, which can be messy compared to a simple query on one table.
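
If you do decide to go the archive route, the usual ABL approach is to copy each record into the archive table and delete the original inside the same small transaction. A minimal sketch, assuming a hypothetical archive table named Historico-Pre-Pauta-Arch with the same fields, and wkFecha defined as in your existing program:

Code:
for each Historico-Pre-Pauta exclusive-lock where Fecha-Pauta < wkFecha:
  create Historico-Pre-Pauta-Arch.                            /* assumed archive table */
  buffer-copy Historico-Pre-Pauta to Historico-Pre-Pauta-Arch.
  delete Historico-Pre-Pauta.
end.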

BUT! This
The problem is that sometimes the users complain about application speed,
makes me think that you think deleting the records will speed up queries and, as Tom said, it won't.

So, set this issue aside for a moment and do the right things to address the problem first.
 
Sorry for my Spanglish, and thanks a lot for the responses.
Your English is fine. :) (or my Spanglish is improving?)

There is a lot to unpack here.
The code used by the developer was:
FOR EACH Historico-Pre-Pauta WHERE Fecha-Pauta < wkFecha .
ASSIGN wkCantidad[4] = wkCantidad[4] + 1.
DELETE FROM Historico-Pre-Pauta.
END.
If we're seeing all of the purge code here, and not just the bit that does the record deletion, then you are not archiving the data and you are not causing a long-running transaction that would cause BI growth. However, as I said earlier, another user could have caused a long-running transaction that would result in rapid BI growth when this code ran.

The database license (Workgroup or Enterprise). You can get this from the output of a showcfg command.
Workgroup
You should seriously consider upgrading your database license to Enterprise RDBMS. Your database is 4 GB in size and "sometimes the users complain about application speed". The Workgroup license is appropriate for tiny (think MB, not GB) databases that don't have performance requirements. It cannot be tuned much at all for performance.

Progress OpenEdge Release 10.1B
This release is ancient. It was released in December 2006 and retired in May 2008. You really should upgrade to a modern release. Note that the 10.1B release had three service packs and you don't have any of them installed, so you are missing dozens if not hundreds of bug fixes.

If you have your application source code, you should be able to upgrade to OpenEdge 12.8 and be on a supported release. Even if you don't have source, you should still be able to upgrade to OpenEdge 10.2B SP08 and run your existing r-code.

As for your configuration, there are several problems/red flags:
Database Block Size : 1024
This is the worst possible setting. It is bad for index compression, record fragmentation, caching efficiency, and more. It is smaller than your file system cluster size, so every time your database requests a physical I/O (database block read or write) from the operating system, Linux performs a 4 KB disk I/O. If you had a database block size of 8 KB, your database could perform the same amount of logical I/O it does now, for a given business task, with much less physical I/O, and thus greater performance.
This setting cannot be changed. In order to use a larger database block size you must create a new database and then dump and load your existing schema and data into it.

All of the application data is in Type 1 data storage areas (cluster size = 1). This has a number of effects including a reduction in caching efficiency. Be sure to change this when you dump and load your database. There are many threads on this site that detail the differences between Type 1 and Type 2 data storage areas.

License: Workgroup RDBMS
Almost all of the performance-tuning capabilities of the database, which are already quite limited in 10.1B, are unavailable with the Workgroup license.

BI Block Size : 8192
BI Cluster Size (16K Units) : 32 (512 KB)
Number of BI buffers: 20
Large file support (feature 5): Not enabled

Because you are using Workgroup, increasing these settings likely wouldn't help much and could even hurt write performance.

Table area (7): 15 fixed extents of 100 MB each, and one variable extent of 1.15 GB.
Index area (8): 5 fixed extents of 100 MB each, and one variable extent of 913 MB.
It appears that for a time, someone was managing the storage and adding extents as the data grew, keeping extent sizes to 100 MB each. But apparently quite a while ago, that strategy was discontinued and all new data was added in the variable extents. Note also that in your database, the maximum extent size is 2 GB, and the variable extent in area 7 is approaching that size. You will need to add one or more extents to the area before that happens.

Overall, what I see is a database and an environment much in need of maintenance and improvement. That will cost time and money, but it is essential. And if it is done correctly, it will eliminate the problem of application performance. "Done correctly" will likely require some outside consultation.

Potentially most urgent, in my view, is this:
Whether after imaging is enabled: No

After imaging should be enabled in every production database. The only exceptions would be a database that is read-only, or a database where the data changes only once per day in a batch process. Are daily backups alone really sufficient for you to have a robust business-continuity plan in 2024 and beyond?
 
Aside from what Rob said (which is excellent):

This is not the complete code. For one thing, we are missing the definition of wkCantidad, so there is clearly more going on. You really need to show the COMPLETE code in order to get useful help.

Code:
The code used by the developer was:
FOR EACH Historico-Pre-Pauta WHERE Fecha-Pauta < wkFecha .
  ASSIGN wkCantidad[4] = wkCantidad[4] + 1.
  DELETE FROM Historico-Pre-Pauta.
END.

Other things that I see:
  • FOR EACH statements should end with a ":" not a period. Yes, "." compiles. But that is an abomination, not a feature.
  • No locking is being specified. Which means that the default lock type of SHARE-LOCK is in use and that SHARE-LOCK is being upgraded to an EXCLUSIVE-LOCK when the DELETE executes. This costs performance and the impact can be severe if this is a client/server connection type (if you use -S in your startup parameters).
  • DELETE FROM is SQL syntax. The 4GL DELETE statement does not use FROM. It is possible that the ancient SQL-89 compatibility syntax is letting you use FROM without a compile error.
You should compile your actual code with the listing option and see where your transaction scope is. I am 99% certain that it is NOT the line with FOR EACH on it. If I am correct then that is why your BI file is growing. But, as Rob mentioned, it could be that some other user has a long transaction. Maybe. Possibly. But pretty unlikely IMHO.
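
Putting those points together, a corrected version of the purge loop might look like this (a sketch only, since we have not seen the complete program; wkFecha and wkCantidad are assumed to be defined as in the original):

Code:
for each Historico-Pre-Pauta exclusive-lock where Fecha-Pauta < wkFecha:
  assign wkCantidad[4] = wkCantidad[4] + 1.
  delete Historico-Pre-Pauta.   /* 4GL delete -- no FROM */
end.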
 
All of which only emphasizes that the performance issues which motivated this delete process almost certainly can be addressed by proper updating and tuning so that actual deletion will be unnecessary unless there is a business case for it ... which has not been stated.
 
Hi, thanks again for all the comments.

We are internally reviewing several things related to the cause.
We compiled the code and found in the listing that the transaction is indeed not scoped where we expected, because of the "." instead of ":" syntax and the use of "delete from"; we now understand why it deleted all the rows the way it did, how much the BI grew, and how long it took.
Of course the current structure of the DB is deficient, so we have to make a plan to restructure it.
Agreed, the version is very old too; I think Progress is so robust that it keeps running even under these circumstances. We will plan for this need next year, and we will also evaluate other technologies. We are a TV station in Venezuela, "Televen".
While reviewing the ProTop utility I found White Star Software and saw that you are true professionals in Progress.

I appreciate it. Best regards,
Ricardo Sardinha
 