Question: 32-bit Progress above the 4 GB line

TheMadDBA

Active Member
Sounds like your vendor has an incredibly poorly written application.... my condolences.

On the 32 bit/64 bit side... Progress has "fixed" this by making 64 bit servers and clients available quite some time ago. They expect vendors not to buy the wrong product for their clients, but sometimes they do. On the plus side, all you would need is the new Progress server binaries. No DB conversion required. 32 bit clients connecting through TCP/IP work just fine with 64 bit servers. No reason for your vendor to know/care about the server being 64 bit.
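Roughly, the split looks like this (db path, host, and port below are placeholders, not your actual values):
Code:
rem Server side - 64-bit OpenEdge binaries serve the database:
proserve C:\db\mydb -S 2500

rem Client side - the existing 32-bit client connects over TCP/IP, unchanged:
prowin32.exe -db mydb -H dbhost -S 2500 -N TCP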

Spin: Every system and app is different; I had -spin set at 2000 for a 4000+ user system.... but our code wasn't complete crap.

Checkpoints: When you say frequent checkpoints, how often are we talking about? Unless it is every 30 seconds or less, don't worry about that. What you need to worry about are buffers flushed at checkpoint... that indicates a stall in DB activity while the checkpoint is happening. That would mean the cluster size needs to be increased and/or the number of APWs adjusted.

It appears you are at least running the Enterprise version, but I didn't see any APWs started in the log file. Start 2 of them and the PROWDOG as well. APWs are background agents that write DB and BI activity to disk behind the scenes, much like DB Writers do in Oracle. PROWDOG is basically a sanity checker that should be started by default, but for some reason Progress just doesn't start it.
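Something like this, with the db path as a placeholder:
Code:
rem Start two asynchronous page writers against the running database:
proapw C:\db\mydb
proapw C:\db\mydb

rem And the watchdog:
prowdog C:\db\mydb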

SQL: Are users actually connecting through SQL or just through the 4GL/ABL applications? If they are using SQL make sure the statistics are current.

What does a promon -> R&D -> 2 -> 1 look like during peak times? 5 minute samples are good to start with.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Spin applies to all latches. A value that works well for one may not work well for others. It's not the best place to start for basic performance tuning.

With latch timeouts, look at promon | R&D | debghb | 6 (hidden menu) | 11 (latch counts) to determine which latches are the most active. If you have transaction throughput issues you should consider increasing the AI and BI block sizes to 16 KB, the maximum. Your AI and BI buffers are also set quite low at 50 each. Promon | R&D | 2 | 5 will tell you if you have empty BI buffer waits. If so you can address that by increasing -bibufs. Try 200 for a start. Set -aibufs the same as -bibufs.
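For example (a sketch only - the db path and port are placeholders, and both block-size changes are offline operations that require after-imaging to be disabled first):
Code:
rem Offline: set the BI and AI block sizes to 16 KB, the maximum
proutil C:\db\mydb -C truncate bi -biblocksize 16
rfutil C:\db\mydb -C aimage truncate -aiblocksize 16

rem Then add the buffer counts to the broker startup:
proserve C:\db\mydb -S 2500 -bibufs 200 -aibufs 200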

I assume you are using after imaging. You should have it enabled and you should be running an AIW (after image writer). This is not a performance optimization. It is protection against data loss, media failure, and other disaster scenarios.
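If that isn't in place yet, enabling it is roughly (a sketch, assuming AI extents are already defined in the structure; aimage begin expects a fresh backup):
Code:
rem Offline, after a backup, with AI extents present:
rfutil C:\db\mydb -C aimage begin

rem With the db running, start the after-image writer:
proaiw C:\db\mydb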

What kind of hardware are you running on; is it decent? What can you tell us about the storage tier?
 

kdefilip

Member
Sounds like your vendor has an incredibly poorly written application.... my condolences.

On the 32 bit/64 bit side... Progress has "fixed" this by making 64 bit servers and clients available quite some time ago. They expect vendors not to buy the wrong product for their clients, but sometimes they do. On the plus side, all you would need is the new Progress server binaries. No DB conversion required. 32 bit clients connecting through TCP/IP work just fine with 64 bit servers. No reason for your vendor to know/care about the server being 64 bit.

Spin: Every system and app is different; I had -spin set at 2000 for a 4000+ user system.... but our code wasn't complete crap.

Checkpoints: When you say frequent checkpoints, how often are we talking about? Unless it is every 30 seconds or less, don't worry about that. What you need to worry about are buffers flushed at checkpoint... that indicates a stall in DB activity while the checkpoint is happening. That would mean the cluster size needs to be increased and/or the number of APWs adjusted.

It appears you are at least running the Enterprise version, but I didn't see any APWs started in the log file. Start 2 of them and the PROWDOG as well. APWs are background agents that write DB and BI activity to disk behind the scenes, much like DB Writers do in Oracle. PROWDOG is basically a sanity checker that should be started by default, but for some reason Progress just doesn't start it.

SQL: Are users actually connecting through SQL or just through the 4GL/ABL applications? If they are using SQL make sure the statistics are current.

What does a promon -> R&D -> 2 -> 1 look like during peak times? 5 minute samples are good to start with.

Vendor claims that their client calls 32-bit DLLs and until that is corrected they cannot move to 64-bit - I'm sensing BS in that explanation.

Considering that drastically lowering -spin from 800000 to 100000 had no effect, I agree that is not the issue.

During peak times I will see checkpoints sometimes at a rate of 4 or 5 during a 30-second period, occasionally with buffer flushes.

Yes, I believe there are two APWs running, and the watchdog is running as well. I have also noted that shared locks are not releasing at the end of transactions, leaving the watchdog to clean up.

Not sure on the client connects. I have not fully attacked the client side yet, again, I have only been on this for about 10 days.

On the R&D output, I have a document I can share with you if there is a way to do it privately here

EDIT: Sorry, I missed your SQL questions. I guess I don't really know how to tell that. I can tell you that most of our connections are "SELF", a handful are TCP, and a larger handful are "BATCH" - how do I check for stale stats?
 

kdefilip

Member
Spin applies to all latches. A value that works well for one may not work well for others. It's not the best place to start for basic performance tuning.

With latch timeouts, look at promon | R&D | debghb | 6 (hidden menu) | 11 (latch counts) to determine which latches are the most active. If you have transaction throughput issues you should consider increasing the AI and BI block sizes to 16 KB, the maximum. Your AI and BI buffers are also set quite low at 50 each. Promon | R&D | 2 | 5 will tell you if you have empty BI buffer waits. If so you can address that by increasing -bibufs. Try 200 for a start. Set -aibufs the same as -bibufs.

I assume you are using after imaging. You should have it enabled and you should be running an AIW (after image writer). This is not a performance optimization. It is protection against data loss, media failure, and other disaster scenarios.

What kind of hardware are you running on; is it decent? What can you tell us about the storage tier?

I have tried debghb before in the R&D menu, but it doesn't work for me; it just returns me to the menu. I wish it would, because I don't know exactly where the latch waits/misses are coming from. Is it possible latch misses from the BI file are counted in this counter?

For a 400-hour period it looks like the following. I have other samples in a doc I have been collecting:
Busy buffer waits 315849 13
Empty buffer waits 2897 0

While we are on the subject of BI, when we refer to BI cluster size, are we talking about block size?

Yes, AI is set, two writers, I believe

Ah, the hardware - make sure you're sitting down with your seat belt fastened:
C: direct-attached RAID 5, contains the BI and AI files; allocation unit size and BI block size DO NOT match
D: RAID 10 SAN disks, contains ALL data files and, wait for it..... a very large page file (50-75 GB)
16 cores
96 GB RAM - average usage 12-16 GB

I have an assessment doc I have been working on I could share with you privately, if there is a way to do that here.
Thanks.
 

kdefilip

Member
BI Cluster Size and BI Block Size are different things.
Okay, I thought so. The cluster size determines the frequency of checkpoints, I believe. So when I am seeing heavy checkpointing, the solution is to increase the cluster size; correct?

I assume the block size on the BI works the same as it does on the buffer cache, etc.?
 

kdefilip

Member
"The db" is neither 32 nor 64 bits. The executables are one or the other. The db server can use 64 bit executables while the clients (aka prowin32.exe) can remain 32 bits if they connect via tcp/ip (as most windows clients do).

The "showcfg" command will show the "port". 33 is 64 bit windows executables.

Tom, sorry, again, I'm new to Progress. Where do I issue the showcfg command?
Thanks.
 

kdefilip

Member
Sounds like your vendor has an incredibly poorly written application.... my condolences.

On the 32 bit/64 bit side... Progress has "fixed" this by making 64 bit servers and clients available quite some time ago. They expect vendors not to buy the wrong product for their clients, but sometimes they do. On the plus side, all you would need is the new Progress server binaries. No DB conversion required. 32 bit clients connecting through TCP/IP work just fine with 64 bit servers. No reason for your vendor to know/care about the server being 64 bit.

Spin: Every system and app is different; I had -spin set at 2000 for a 4000+ user system.... but our code wasn't complete crap.

Checkpoints: When you say frequent checkpoints, how often are we talking about? Unless it is every 30 seconds or less, don't worry about that. What you need to worry about are buffers flushed at checkpoint... that indicates a stall in DB activity while the checkpoint is happening. That would mean the cluster size needs to be increased and/or the number of APWs adjusted.

It appears you are at least running the Enterprise version, but I didn't see any APWs started in the log file. Start 2 of them and the PROWDOG as well. APWs are background agents that write DB and BI activity to disk behind the scenes, much like DB Writers do in Oracle. PROWDOG is basically a sanity checker that should be started by default, but for some reason Progress just doesn't start it.

SQL: Are users actually connecting through SQL or just through the 4GL/ABL applications? If they are using SQL make sure the statistics are current.

What does a promon -> R&D -> 2 -> 1 look like during peak times? 5 minute samples are good to start with.
See attached
 

Attachments

  • Act_summary10SecSlice.jpg

Rob Fitzpatrick

ProgressTalk.com Sponsor
Vendor claims that their client calls 32-bit DLLs and until that is corrected they cannot move to 64-bit - I'm sensing BS in that explanation.
Me too.
During peak times I will see checkpoints sometimes at a rate of 4 or 5 during a 30-second period, occasionally with buffer flushes.
One every six seconds is a lot. Increasing the BI cluster size will make checkpoints less frequent but it will also mean there is more work to do during a checkpoint. As TheMadDBA said, you may need to add APWs to compensate.
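A sketch of that change (the size here is illustrative, not a recommendation; truncate bi is an offline operation and after-imaging must be disabled first):
Code:
rem Offline: raise the BI cluster size to 16384 KB (16 MB)
proutil C:\db\mydb -C truncate bi -bi 16384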
how do I check for stale stats?
There is no way to check the quality of the stats. But you can run some ABL code to generate a SQL script that will refresh the stats.
Code:
/* Generate a SQL script that refreshes statistics for all user tables */
output to "updatestatistics.sql".
for each _file no-lock where _tbl-type = "t":
  put unformatted "update table statistics and index statistics and all column statistics for PUB." _file-name ";" skip.
  put unformatted "commit work;" skip.
end.
output close.
Then run it:
Code:
sqlexp -db <dbname> -S <SQL broker port no.> -infile updatestatistics.sql -user <username> -password <password>
I have tried debghb before in the R&D menu, but it doesn't work for me.
Unfortunately, it doesn't give you any feedback at all. It is just a toggle that enables a hidden menu item number 6. After you type debghb you can type 6 and then get to the hidden menu that shows latch counts (item 11).
I have also noted that shared locks are not releasing at the end of transactions, leaving the watchdog to clean up.
The watchdog doesn't release share locks for running clients. Clients release locks based on how the code is written. The watchdog cleans up after dead self-service clients.
Yes, I believe there are two APWs running
Based on your screenshot there are 10 APWs running.
 

TheMadDBA

Active Member
That isn't a horrible amount of activity for that many users - small sample obviously, and I don't know the app well enough. You have a mix of client/server connections and appserver connections. Not sure why your vendor is doing both kinds of connections.

10 APWs seems excessive but probably isn't causing any meaningful performance issues compared to the rest of the processes.

SQL clients should show up as REMC/SQLC in promon R&D,1,4,1. If there aren't any, don't worry about the statistics; the 4GL/ABL engine is rule-based, not cost-based.

One of the benefits of dropping spin is potentially getting back CPU cycles for more productive work. Use the hidden debghb menu like Rob suggested to find out what kind of latches you are waiting on. Some will benefit greatly from spin and some are just going to waste CPU cycles. What does your CPU usage look like during peak times?

Also you mentioned crashes... what kind of crashes are we talking about? Ones where the database or appservers actually go down? If so... what kinds of errors are in the logs?
 

kdefilip

Member
That isn't a horrible amount of activity for that many users - small sample obviously, and I don't know the app well enough. You have a mix of client/server connections and appserver connections. Not sure why your vendor is doing both kinds of connections.

10 APWs seems excessive but probably isn't causing any meaningful performance issues compared to the rest of the processes.

SQL clients should show up as REMC/SQLC in promon R&D,1,4,1. If there aren't any, don't worry about the statistics; the 4GL/ABL engine is rule-based, not cost-based.

One of the benefits of dropping spin is potentially getting back CPU cycles for more productive work. Use the hidden debghb menu like Rob suggested to find out what kind of latches you are waiting on. Some will benefit greatly from spin and some are just going to waste CPU cycles. What does your CPU usage look like during peak times?

Also you mentioned crashes... what kind of crashes are we talking about? Ones where the database or appservers actually go down? If so... what kinds of errors are in the logs?

All the conn types are a handful of "SELF/APSV" and a majority of "REMC/ABL".

CPU: 16 cores, generally we are at about 25-30%.

We have the process _ProMonsrv crashing frequently, which is one issue.
Users also often complain of frozen screens/pauses at the client.
 

kdefilip

Member
Spin applies to all latches. A value that works well for one may not work well for others. It's not the best place to start for basic performance tuning.

With latch timeouts, look at promon | R&D | debghb | 6 (hidden menu) | 11 (latch counts) to determine which latches are the most active. If you have transaction throughput issues you should consider increasing the AI and BI block sizes to 16 KB, the maximum. Your AI and BI buffers are also set quite low at 50 each. Promon | R&D | 2 | 5 will tell you if you have empty BI buffer waits. If so you can address that by increasing -bibufs. Try 200 for a start. Set -aibufs the same as -bibufs.

I assume you are using after imaging. You should have it enabled and you should be running an AIW (after image writer). This is not a performance optimization. It is protection against data loss, media failure, and other disaster scenarios.

What kind of hardware are you running on; is it decent? What can you tell us about the storage tier?

Not sure which latches are most important. By far OM are the most numerous, followed by LKT. Right now, MTX seems to be the only one that is napping: 4-8 naps per any 10-second period.

Also note we have zero BFP, yet we have very, very many BUF (listed 4 times??).

LRU: very many. LRU is listed twice; one is zero, and the other for a 10-second period is 13,500.
 

TheMadDBA

Active Member
Also based on how badly some of the parameters were set by the vendor/others before you... it might be helpful to get all of your DB startup parameters and appserver startup parameters. Sometimes (just like in Oracle) people set flags/options without knowing what they do and cause negative impacts. We may not know what is "right" for your application without seeing it in action but we will all know what is wrong/suspect just by looking at it.

A few quick things to check....(client and appserver only for these)

Make sure -T (temporary files) isn't pointing to a network share.

-Bt (number of blocks) along with -tmpbsize (block size) control how much memory is used to buffer temp-table activity (heavily used data structures in Progress). If you are writing to disk too often for temporary files, tuning these appropriately can make a huge difference (just make sure not to consume too much memory).
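For example, a hypothetical client shortcut target (host, port, and values are placeholders to illustrate the parameters, not recommendations):
Code:
rem -T: local temp-file directory (never a network share)
rem -Bt: temp-table buffer blocks; -tmpbsize: temp-table block size in KB
prowin32.exe -db mydb -H dbhost -S 2500 -T C:\temp -Bt 2000 -tmpbsize 4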
 

kdefilip

Member
Also based on how badly some of the parameters were set by the vendor/others before you... it might be helpful to get all of your DB startup parameters and appserver startup parameters. Sometimes (just like in Oracle) people set flags/options without knowing what they do and cause negative impacts. We may not know what is "right" for your application without seeing it in action but we will all know what is wrong/suspect just by looking at it.

A few quick things to check....(client and appserver only for these)

Make sure -T (temporary files) isn't pointing to a network share.

-Bt (number of blocks) along with -tmpbsize (block size) control how much memory is used to buffer temp-table activity (heavily used data structures in Progress). If you are writing to disk too often for temporary files, tuning these appropriately can make a huge difference (just make sure not to consume too much memory).

Do you mean settings at the client, or are there other startup parameters for the appserver?
 

kdefilip

Member
Also based on how badly some of the parameters were set by the vendor/others before you... it might be helpful to get all of your DB startup parameters and appserver startup parameters. Sometimes (just like in Oracle) people set flags/options without knowing what they do and cause negative impacts. We may not know what is "right" for your application without seeing it in action but we will all know what is wrong/suspect just by looking at it.

A few quick things to check....(client and appserver only for these)

Make sure -T (temporary files) isn't pointing to a network share.

-Bt (number of blocks) along with -tmpbsize (block size) control how much memory is used to buffer temp-table activity (heavily used data structures in Progress). If you are writing to disk too often for temporary files, tuning these appropriately can make a huge difference (just make sure not to consume too much memory).

Also, how do I issue statements against this database, such as the one suggested in the article link you sent?
 

TheMadDBA

Active Member
For the _proapsv crashes.... check the log files and find the error numbers. That will help us - and you - search the KB to determine the root cause.

There are settings for the clients (on the user desktop shortcut) and settings for the appservers. For the three I mentioned, they apply to both types of connections. Getting the database startup options will also be helpful to make sure there aren't any other crazy settings. You can look at <Progress Install Directory>/properties/ubroker.properties to find the appserver startup options; this will also give you the locations of the log files.

If you only have the runtime you will not be able to run those kinds of statements. You would be looking through the start menu for Client or Character Client. While you are there you can choose the Config option to get the installed products.

If you don't have a query/compile client installed I would suggest starting with -omsize 2048 and see how that goes. Keep increasing from there based on how many OM waits you see after the change.
Then ask your boss to buy a single-user development license so you can poke around in the database a little more; without it you will only be able to run compiled code and your options will be limited.
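As for -omsize, it is just another broker startup parameter; a sketch with a placeholder path and port:
Code:
proserve C:\db\mydb -S 2500 -omsize 2048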
 

TheMadDBA

Active Member
You should contact your vendor and see about upgrading your OpenEdge/Progress version... Just noticed you are running 11.1, which was retired in Feb of this year. 11.4 is the current version; 11.3 is one release back and has several service packs available.

Lots of bugs resolved between 11.1 and 11.4.
 