Question Checkpoint Duration Sync Time and Direct I/0 (-directio)

BigSlick · Jun 30, 2020

Hi,

The aim of this post is to determine whether -directio is the answer. Are there any downsides to using it?

My research indicates an almost 90% improvement. All from Sync Time, which correlates to Progress KB - What value should -B be set to?, although this suggests it will only improve slightly

Here's the background.

We recently moved provider and whilst moving we also tried to upgrade from Windows 2008 to 2016. The way in which Windows manages its memory between the two versions is intensely different and this caused huge issues and we saw a strange pattern in which 2016 performed much better from a cold start, but as the buffers filled, both within the database and windows, the longer a checkpoint took to complete compared to 2008. After months of testing and getting the Progress guys in, we used RAMMap and emptied the buffer pools within Windows and this seemed to improve checkpoints again vastly. But this seemed to only be a temporary fix, as this would need to be scheduled to run every so often. We had Microsoft involved and they saw an issue from their end and created an alpha patch for Windows 2019 for us to test, as they didn’t see the benefit of doing this for 2016. Anyway, it took a while for the patch to become available and we decided to stick with 2008, upon re-testing 2016 and 2019, I couldn’t replicate the original issue. Which I thought was very strange but then again any delay to the project was political and would have to be funded by the party where the blame lied, so maybe not so strange after all.

I digress slightly; but if you’d like to know more I’m sure I can dig out some stuff.

Whilst doing all the above testing one of the options I explored was –directio and this seemed to solve all of our issues, but the project didn’t want to ‘change the fundamental way in which the system worked’. Which I can understand. It worked before with the settings. It also seems to work now.

Now we have moved to the new provider, I started testing the settings within our main databases and naturally I started with –directio, considering the benefits. I’ve been through all parameters that I can think of and ive attached the difference between direction being off, then on, then on with pinshm (slight improvement).

These tests were performed on Windows 2016, although Prod is still on 2008.

OpenEdge 11.7.3.005.
DB Size 5056GB
Blocksize 4096

-B 22,500,000
-B2 No
WDOG Yes
APW's 3
-bibufs 150
-bithold 3
-Mf 3
-G 60
-groupdelay 0
BI Cluster Size 262128
BI Block Size 16
bigrow 10

Thanks

TomBascom · Jun 30, 2020

Aside from changing some numbers in a spreadsheet did the changes make any difference to anything that was user visible?

Rob Fitzpatrick · Jun 30, 2020

BigSlick said:
My research indicates an almost 90% improvement.

Is this theory or benchmark measurement?

BigSlick said:
All from Sync Time

I don't understand this. I would expect sync time to always be zero when using -directio, because with this configuration there is no end-of-checkpoint buffer synchronization necessary. However that doesn't imply that you are getting better write throughput overall. You are still doing file system buffer synchronization with -directio, just in a different way. It is being done by the APWs when you do block writes, so it makes their work incrementally more expensive.

TomBascom · Jun 30, 2020

That kbase that you are referencing also thinks that a cache hit ratio of 90% is "good".

Rob Fitzpatrick · Jun 30, 2020

BigSlick said:
I’ve been through all parameters that I can think of and ive attached the difference between direction being off, then on, then on with pinshm

I was under the impression that -pinshm has no effect on Windows and AIX.

TomBascom · Jun 30, 2020

Regarding -directio... I am dubious that it makes a positive impact on anything that a user would notice. Historically it has been a waste of time and it has never been a significant difference maker. There are occasional stories that it has helped someone out in very specific circumstances but those never seem to translate into anything that is generally useful. Initially it was implemented only on some very specific architectures and for a while there were no architectures were it actually did anything. (The same can be said for -pinshm)

So - worth trying? Sure.

Adopt it because it changed some numbers in monitoring output? Not unless those numbers are directly related to something that matters to users.

You mention that moving providers has caused "huge issues". But no issues are specified and I'm not seeing any evidence that -directio is at all related to the unspecified issues or their solution. I see that you ran a couple of tests but it doesn't look like enough tests to say "before and after", "new gear and old gear". It looks more like "you can prove anything with one data point".

Rob Fitzpatrick · Jun 30, 2020

BigSlick said:
-B 22,500,000
-B2 No
WDOG Yes
APW's 3
-bibufs 150
-bithold 3
-Mf 3
-G 60
-groupdelay 0
BI Cluster Size 262128
BI Block Size 16
bigrow 10

From the 11.7 docs:
"Pin Shared Memory (-pinshm) is only supported on UNIX platforms, excluding IBM AIX."

You mentioned some background processes (WDOG and APWs) but not BIW and AIW, which are relevant to write throughput. Are they running?

-G defaults to 0 in recent releases, including 11.7.x. It used to default to 60. Did PSC tell you to change it to 60?

You have also set a non-default value for -groupdelay. Was this a PSC recommendation?

Rob Fitzpatrick · Jun 30, 2020

BigSlick said:
Are there any downsides to using it?

Yes.

You can get more details from Rich Banville's Database I/O talk at PUG Challenge Americas 2013:
Slides: https://pugchallenge.org/downloads2013/321_Database_IO.pptx
Audio: https://pugchallenge.org/downloads2013/audio/321_Database_IO.mp3

BigSlick · Jul 1, 2020

TomBascom said:
Aside from changing some numbers in a spreadsheet did the changes make any difference to anything that was user visible?

Hi Tom,

When doing the initial testing the users were experiencing screen pauses in line with write activity as expected. This was averaging 11.74 seconds without directio. So was quite noticeable.

This test replicates one of our overnight jobs and only if it runs late will it normally intefere with users, but the idea that this simple job causes an issue highlights how fragile the system can be.

Thanks

BigSlick · Jul 1, 2020

Rob Fitzpatrick said:
Is this theory or benchmark measurement?

I don't understand this. I would expect sync time to always be zero when using -directio, because with this configuration there is no end-of-checkpoint buffer synchronization necessary. However that doesn't imply that you are getting better write throughput overall. You are still doing file system buffer synchronization with -directio, just in a different way. It is being done by the APWs when you do block writes, so it makes their work incrementally more expensive.

Hi Rob,

Well, based on my tests its benchmark.

In regards to Sync Time, yes with directio sync time is 0. But i'd have thought we would see a knock-on effect somewhere else within the process. Unless of course syncing is still happening and freezing occurs, but isnt recorded within OpenEdge?

So theres a trade off with APW's? - We only use 3, could this be helped by adding more or are we just shifting a problem around?

Thanks

BigSlick · Jul 1, 2020

TomBascom said:
That kbase that you are referencing also thinks that a cache hit ratio of 90% is "good".

It does and i think its trying to emphasise large systems and the trade off with having 10%+ in memory.

This trade off of scanning a massive -B will increase checkpoint durations, so i believe they are suggesting 90% in order to allow a lower -B therefore reducing scanning times.

BigSlick · Jul 1, 2020

Rob Fitzpatrick said:
I was under the impression that -pinshm has no effect on Windows and AIX.

Using RAMMap you can clearly see that it does, in fact do as expected.

BigSlick · Jul 1, 2020

TomBascom said:
Regarding -directio... I am dubious that it makes a positive impact on anything that a user would notice. Historically it has been a waste of time and it has never been a significant difference maker. There are occasional stories that it has helped someone out in very specific circumstances but those never seem to translate into anything that is generally useful. Initially it was implemented only on some very specific architectures and for a while there were no architectures were it actually did anything. (The same can be said for -pinshm)

So - worth trying? Sure.

Adopt it because it changed some numbers in monitoring output? Not unless those numbers are directly related to something that matters to users.

You mention that moving providers has caused "huge issues". But no issues are specified and I'm not seeing any evidence that -directio is at all related to the unspecified issues or their solution. I see that you ran a couple of tests but it doesn't look like enough tests to say "before and after", "new gear and old gear". It looks more like "you can prove anything with one data point".

We are talking Windows here, Tom :-D

The numbers correlate to the system pausing though in which write activity cannot take place, so surely the lower the numbers the less pausing and the more writing and therefore better performance the system is experiencing. Unless, of course, the sync time that was being reported within Promon is still happening elsewhere but not taking into account by OpenEdge.

The provider move happened a while back and i have a plethora of data regarding that move. But ultimatley it was down to the change in disk and setup from the provider, their use of Qos and many other parts - in fact we were forced to get Progress to come in and Libor ended up telling them what we all knew but the provider wouldnt hear. Like i said, i cant replicate the original issues any longer - its almost like they changed something and didnt tell us to avoid a fine. Thats pure speculation though.

BigSlick · Jul 1, 2020

Rob Fitzpatrick said:
From the 11.7 docs:
"Pin Shared Memory (-pinshm) is only supported on UNIX platforms, excluding IBM AIX."

You mentioned some background processes (WDOG and APWs) but not BIW and AIW, which are relevant to write throughput. Are they running?

-G defaults to 0 in recent releases, including 11.7.x. It used to default to 60. Did PSC tell you to change it to 60?

You have also set a non-default value for -groupdelay. Was this a PSC recommendation?

Pinshm does wotk in Windows, i had to check using RAMMap and dont understand why the arcticle says that.

Both BIW and AIW are running, apologies i should have mentioned this; theyre default to me

I believe Libor advised the -G and -groupdelay settings. Should i try with defaults?

BigSlick · Jul 1, 2020

Rob Fitzpatrick said:
Yes.

You can get more details from Rich Banville's Database I/O talk at PUG Challenge Americas 2013:
Slides: https://pugchallenge.org/downloads2013/321_Database_IO.pptx
Audio: https://pugchallenge.org/downloads2013/audio/321_Database_IO.mp3

Now thats what i needed. Cheers Rob. Looks great.

TomBascom · Jul 1, 2020

> We are talking Windows here, Tom :-D

That's easy to fix. Let me know if you need any help with that.

Regarding the kbase and 90%... no, 90% is not "good". Nobody who has any performance tuning experience at all would say that. And it has nothing to do with trade-offs to reduce synctime. If it did they would have said so much more clearly instead of an off-hand remark buried in some bullet points.

A hit ratio of 90% is worse than awful. Your typical record read requires a minimum of 2 block accesses. One for the index and one for the data (on average it takes slightly more than one access for the index, but we will round down for convenience). A hit ratio of only 90% means that 1 in 5 record reads has to come from disk. A hit ratio of 99% means that only 1 in 50 is coming from disk. 99.9% means 1 in 500. Disk reads, even SSD disk reads, are *thousands* of times slower than memory access. Every time you have to read a block from disk you could have been reading thousands of records from memory. That statement in that kbase (and several others) is normalizing something that should be a huge red flag. Not something that should be considered "good", "good enough" or in any way acceptable.

Re: -directio... ok, if it has a user perceptible impact then sure, use it. It has never been my experience that it has such an impact but if you've tested it in your environment and found it to be helpful that's a good thing to know. If the reason that it is helping is the duration of the synctime that you are showing then there should also be an application focused test and a suitable metric that you would run to show that correlation in your spreadsheet. That would be something interesting to see.

Another bit of missing information is what sort of storage systems are being compared and the specifics of how they are configured. There is a world of difference between the performance of a Netapp "filer" and an internal SSD. Understanding that part of your two configurations would go a longs ways towards helping to know when -directio might be useful.

If I am understanding what you say has been happening then -directio has just shifted the responsibility for the writes. It hasn't changed the amount of writing (the numbers of writes in your spreadsheet don't change much) it just changes when they get done and by whom. It would appear that what you have really done is spread the delay out. As long as the APWs are handling the writes users shouldn't notice them except to the extent that it puts more work onto the disks that might get noticed during non-transaction activity.

Re: -pinshm and your testing with RAMMap: Did you check RAMMap both with and without -pinshm? I have not looked into it in a long time but my understanding is that on operating systems that support it shared memory is always pinned. Just about the last thing you would ever want is for any shared memory to be paged out.

jdpjamesp · Jul 1, 2020

My experience of directio (and it's small I must admit) is that all it does is mask other issues that should/could be resolved to much greater effect. As Tom suggests, it's shifting responsibility.
Am I reading it correctly that your database is around 5TB? A decent size then! I'd be interested to know what % of that is actually active data and how much of it is essentially just there for old time sake.
3 APWs is verging on the upper limit of what I'd consider implementing without serious thought. But then I've not worked with such a large DB in the past.
Tom's question re the storage system is really important. I might have missed it, but do you have info on how fast a bigrow is on the disk? The so-called Furgal Test: Furgal Test?

Rob Fitzpatrick · Jul 1, 2020

BigSlick said:
I believe Libor advised the -G and -groupdelay settings. Should i try with defaults?

In general I wouldn't advocate changing those parameters away from their defaults without a very good reason. But if Libor suggested those values for your particular installation, that's a very good reason.

Rob Fitzpatrick · Jul 1, 2020

BigSlick said:
Unless of course syncing is still happening and freezing occurs, but isnt recorded within OpenEdge?

When you use -directio, there are different algorithms in use so the numbers in the promon checkpoint screen aren't directly comparable between the two scenarios. In either case, when a modified buffer in the buffer pool is written, it isn't really written to disk directly, it is written to the file system cache in RAM. What happens next is determined by the presence or absence of -directio.

Without -directio, that flushed buffer remains in the file system cache, and the OS may flush it, perhaps coalescing writes of various buffers for greater efficiency. But the storage engine has no more to do immediately with that buffer. However since the block in that modified buffer isn't yet permanent on disk, a periodic synchronization is required and that happens in end-of-checkpoint processing when the data in the file system cache is written to disk via an fdatasync (Unix) or FlushFileBuffers (Windows) OS call.

With -directio, the modified buffer in the file system cache is flushed, at the time of that write, to the disk subsystem (and that might have its own cache layers). This makes such writes more expensive due to the extra overhead of the flush. But because the data is made permanent on disk at the time of the write, that periodic synchronization isn't required during end-of-checkpoint processing. That is why, on the checkpoint screen, Sync Time (the duration of that synchronization) is always zero and Duration, which includes the duration of several operations* including Sync Time, may be slightly less. But lower numbers with -directio than without don't necessarily mean better performance as those numbers measure different things.
*operations: flush bi buffers, scan buffer pool, flush buffers from previous checkpoint, put dirty buffers on checkpoint queue, flush ai buffers, sync file system cache

You could also compare the page writer stats in promon R&D 2, 4 between the two scenarios. That, or the I/O stats per process (R&D 3,4) might help you understand whether additional APWs are doing meaningful work. So in answer to your question, it isn't necessarily that the information you need isn't recorded. But it may be recorded in different places that you have to correlate together. And remember this is to some extent an apples-to-oranges comparison.

BigSlick said:
So theres a trade off with APW's? - We only use 3, could this be helped by adding more or are we just shifting a problem around?

The advice from Progress is that you may need to use more APWs with -directio than without. This is because the per-write synchronization means APWs have more work to do. I can't tell you whether adding more will help with overall performance. If the storage subsystem is a bottleneck, adding more APWs might just add CPU load without increasing write throughput.

The bottom line is this: the VST numbers are somewhat esoteric and ultimately what should matter is how the application performs. For example, if large write-heavy update.p consistently runs for some duration X without -directio, and it consistently runs with -directio in a duration that is less than X by a statistically-significant amount, then -directio is helping you.

JayVee · Jul 22, 2020

Hi

BigSlick said:
The way in which Windows manages its memory between the two versions is intensely different and this caused huge issues and we saw a strange pattern in which 2016 performed much better from a cold start, but as the buffers filled, both within the database and windows, the longer a checkpoint took to complete compared to 2008. After months of testing and getting the Progress guys in, we used RAMMap and emptied the buffer pools within Windows and this seemed to improve checkpoints again vastly. But this seemed to only be a temporary fix, as this would need to be scheduled to run every so often. We had Microsoft involved and they saw an issue from their end and created an alpha patch for Windows 2019 for us to test, as they didn’t see the benefit of doing this for 2016.

This story sounds very familiar as we're seeing the same behavior when running tests on 2012 vs 2016.

Checkpoint durations are very concerning and we're told to open a case with Microsoft.
Would you happen to have a Microsoft case number, so maybe we can refer to that one when we open a case ourselves?
I hope this could give us some more leverage for a fix on 2016.

Migrating to 2019 is not an option at this point due to OpenEdge 11.7 being tied in to .NET 4.6.

Question Checkpoint Duration Sync Time and Direct I/0 (-directio)

Member

Attachments

Curmudgeon

ProgressTalk.com Sponsor

Curmudgeon

ProgressTalk.com Sponsor

Curmudgeon

ProgressTalk.com Sponsor

ProgressTalk.com Sponsor

Member

Member

Member

Member

Member

Member

Member

Curmudgeon

ProgressTalk.com Moderator

ProgressTalk.com Sponsor

ProgressTalk.com Sponsor

New Member