Answered: Before Image Cluster Size vs. Checkpoint Interval

RealHeavyDude

Well-Known Member
OE 11.3.1 64 Bit on Solaris SPARC 10.

I've recently checked the checkpoint interval during the night on our production database, where the batch processing in our applications creates ca. 24 GB worth of After Image transaction notes in roughly 6 hours. During that time the checkpoint interval is between 4 and 10 seconds, which is, as far as I know, way too low. Currently the before image cluster size is 65536 KB ( 64 MB ), which already seems high to me. As the cluster size directly correlates with the checkpoint interval, I think I should increase it, for example double it.

During the rest of the day, when only about 6 GB worth of After Image transaction notes are generated, the checkpoint interval is sometimes already hours.

Does anybody have experience with cluster sizes that large and possible negative impacts?

Thanks in Advance, RealHeavyDude.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
4 - 10 seconds is quite low. Ideally you'd like checkpoints to be several minutes apart under load. Your application might be spending quite a lot of time with forward processing stalled.

One impact, which may just be a small annoyance if your disks are sufficiently fast, is the time taken for formatting clusters. Larger clusters mean more time to extend the BI file (whether logically or physically) and to format the four clusters after truncating the BI and then opening the database.

Having fixed-length extents can reduce this cost. I like to create two extents for the BI: one fixed extent that is four times the largest cluster size I might use (e.g. for a D&L or mass data load) and one variable extent. So if I might use a 16 MB cluster size day to day and 64 MB for a mass load or purge, I would create a 256 MB fixed extent and a variable extent. This speeds up the formatting of clusters after truncating the BI.
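For example, a structure-file fragment along these lines would do it (just a sketch; the database name and directories are placeholders, and sizes in a .st file are in KB, so 262144 = 256 MB):

Code:
# add-bi.st: one 256 MB fixed BI extent plus one variable BI extent
b /db/bi f 262144
b /db/bi

# then, with the database offline:
prostrct add proddb add-bi.st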

I started doing this in v10 after dealing with clients who had slow disks, where it could take several minutes to format the clusters after a truncate. v11.3 changed the way BI I/O is done, so that may no longer be a necessity; I haven't tested this recently. It buffers BI writes initially during allocation/extend and then reverts back to unbuffered I/O (v10 behaviour) to write BI notes.

With larger clusters there is potentially more work to do at each checkpoint, so you may need to add one or more APWs to keep pace. Also, your BI disk storage needs to perform well enough not to be a bottleneck. If it can't keep up then adding APWs won't help you avoid flushing buffers at checkpoints.

So, set your BI cluster size to 128 MB (I assume your AI/BI block size is already 16 KB) and monitor your checkpoint and AI/BI stats (including buffers flushed, sync duration, checkpoint lengths, empty BI buffer waits, BI buffers in use, etc.).
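The offline change itself would look something like this (proddb stands in for your database; the -bi value is the cluster size in KB, so 131072 = 128 MB):

Code:
# truncate the BI, set a 128 MB cluster size and a 16 KB BI block size
proutil proddb -C truncate bi -bi 131072 -biblocksize 16

The -biblocksize is only needed if the block size isn't already 16 KB.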

Do you use -directio? What are your -aibufs/-bibufs? Are you using OE Replication?
 

RealHeavyDude

Well-Known Member
Thanks for the quick response and the valuable information.

Some more background information I forgot to mention:
  • We are running in Solaris Zones with 64 GB memory. In the production environment it is the only zone on the zone host, but that might be subject to change as soon as capacity management thinks the zone host could handle another zone from a different application.
  • We must use ZFS file systems, and all of these file systems ( we have separate ones for the database, BI, AI and backup ) reside on an EMC SAN which, as far as I know, internally uses some RAID 5 variant. What I know for sure is that their write performance is ridiculously bad - but that is what we have to live with.
  • We don't use OE Replication
  • We don't use direct I/O
  • BI and AI blocksize is 16K
  • Both -aibufs and -bibufs are set to 500, and we have ca. 1'200'000 BI buffer waits and ca. 100'000 AI buffer waits in 4 days.
  • The BI has two extents, one fixed at 8 GB and one variable, and I have set the BI threshold to 8 GB because BI processing after a crash might take up to 40 minutes with 8 GB worth of BI transaction notes to process. My database monitor also watches the BI utilization and would automatically bump the BI threshold at 80 % utilization and disconnect processes responsible for long-running transactions ( more than 30 minutes ); a sketch of the parameters involved follows after this list.
  • We run 2 APWs
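A rough sketch of the BI threshold parameters mentioned above, with placeholder values and database name ( -bithold is specified in MB; the exact proquiet syntax should be checked against the docs for your release ):

Code:
# broker startup: stall at an 8 GB BI threshold instead of crashing the database
proserve proddb -bithold 8192 -bistall

# bump the threshold online once utilization approaches 80 %
proquiet proddb bithreshold 9216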
I have no access to the production environment. All information I get comes from a database monitor that I have developed; it stores the information ( for example the checkpoint interval ) in the database and allows the operator to generate reports, which he is allowed to show me as long as they don't include any CID ( client identifying data ).

To put things into perspective, my employer operates 30'000+ Solaris zones; almost all of them, unlike us, are x86 rather than SPARC and run Oracle and Java application servers. So we are just one of many.

Nevertheless I am trying to get a better picture of what is happening on our production system and what I can do to increase performance, and the low checkpoint interval obviously caught my attention.

I think I will set the BI cluster size to 128 MB and probably add 2 APWs with the next change window where it is allowed to stop the database.
I am also thinking about increasing the number of BI and AI buffers, but I don't think it will have much effect.
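For the record, the change-window pieces would look roughly like this ( database name is a placeholder; the cluster size change itself is the proutil truncate bi command shown above ):

Code:
# broker startup with larger BI/AI note buffer pools ( values only illustrative )
proserve proddb -bibufs 1000 -aibufs 1000

# each invocation starts one additional page writer, so two calls take us from 2 to 4 APWs
proapw proddb
proapw proddb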

Thanks, RealHeavyDude.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
1'200'000 BI buffer waits and ca. 100'000 AI buffer waits in 4 days.
Are those busy BI buffer waits or empty buffer waits? If the latter, increasing -bibufs may help. If the former, it's just an indication of a busy system and can't really be tuned away. 500 is a pretty good-sized BI buffer pool, so I am inclined to agree that increasing it may not help much. I forgot to confirm: are you running a BIW?

With that many systems your employer should have a storage team that is very adept at ZFS performance tuning. If write performance is so bad, doesn't that also affect the Oracle-based applications? Are you able to compare notes with people on those teams?
 

RealHeavyDude

Well-Known Member
Thanks for your response.

We do run the BIW and AIW. By the way, -B is 20 GB and the database is 500 GB in size.

Oracle provides its own file system for databases, which is used by the other applications that run on Oracle/Java. So this setup is optimized for the Oracle/Java application server technology stack - not Progress. The only other file system supported by the storage team is ZFS. Since we can't run on Oracle's database file system, we are the sole exotic ( non-standard ) application operating databases on ZFS file systems. Therefore I don't get much help from our storage team when it comes to ZFS vs. database specifics. But I think we are doing reasonably well with ZFS and what needs to be set performance-wise.
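Just to illustrate the kind of ZFS properties we check ( dataset names are placeholders and the values are only examples, not something our storage team prescribes ):

Code:
# match the record size of the BI and AI datasets to the 16 KB block size
zfs set recordsize=16K dbpool/bi
zfs set recordsize=16K dbpool/ai
# cache mostly metadata for the data files, since -B already holds 20 GB of database blocks
zfs set primarycache=metadata dbpool/db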

Bad write performance of the file systems residing on SANs does affect other applications too. But these applications mostly aggregate services which are hosted on System Z ( an IBM mainframe with DB2 databases and COBOL ), where most of the transactions are processed. Mostly they suffer from bad performance when they need to perform maintenance tasks on their Oracle databases. We are special because we do a lot of transaction processing on a Solaris system. I really believe that if the Progress DB were not such a mean OLTP machine, we wouldn't be able to operate anything with reasonable response times for our users and would constantly breach our SLAs for data delivery to other systems. Actually our users are really happy with the response times, and our user satisfaction index regarding performance tops all other Solaris systems ...

Nevertheless, I see room for improvement in the day-end batch processing that starts at midnight and runs until 6:00 in the morning, as we sometimes breach our SLAs with some specific jobs.

Therefore increasing the checkpoint interval to give the database a better chance of processing the transaction load seems important to me.

Need to check whether these are busy or empty buffer waits ...

Thanks, RealHeavyDude.
 

cj_brandt

Active Member
We run a couple of databases with a cluster size of 256 MB and a few more at 128 MB. The only downside I am aware of is the startup time when the db formats the 4 BI clusters.

I would be curious to know the total time of the checkpoint and how much of that was the sync time. The promon R&D 3 4 screen (I think) shows that; I am not sitting in front of a promon screen so I can't double-check. promon R&D 2 5 shows the BI stats, where you will see empty vs. busy buffer waits.
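From memory, the navigation is roughly this (verify the menu numbers on your version):

Code:
promon proddb
# R&D -> 3. Other Displays    -> 4. Checkpoints  (length, sync time, buffers flushed)
# R&D -> 2. Activity Displays -> 5. BI Log       (busy vs. empty buffer waits)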

If you are creating a large amount of disk IO and you start to see signs of the db being impacted by IO, you might consider using fixed AI extents. Most of our databases use variable length, but on our large systems where we are pushing the disk limits, having all db writes go to fixed extents makes a noticeable difference in batch processing.
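If you go that route, the structure additions are the same idea as for the BI, for example (placeholder name, directories and sizes; .st sizes are in KB, so 2097152 = 2 GB per extent):

Code:
# add-ai.st: four fixed 2 GB AI extents
a /db/ai f 2097152
a /db/ai f 2097152
a /db/ai f 2097152
a /db/ai f 2097152

# then, with the database offline:
prostrct add proddb add-ai.st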
 

RealHeavyDude

Well-Known Member
Thanks for your response.

All of them - database, AI and BI - use fixed extents and reside on their own file systems. This, like so many other rules, is imposed on us by capacity management.


Thanks, RealHeavyDude.
 