Memory Allocation And Usage

jennid

Member
Hello - Over the past few months we've been experiencing some performance issues, so we've been reviewing all of our Progress, network, OS, etc. settings, trying to determine where the problem might be. We're on Progress 10.2B07, AIX 64-bit. We noticed that the -B (blocksindatabasebuffers) param was set to 68000. Our db is 423 GB. Blocksize is 8192. We have 66 GB of memory on the AIX server. Our understanding of the -B parameter is that it should ideally be set to 10% of the database size. We set it to 3 million, thinking this would allocate 22 GB (or so) to the Progress DB and leave 44 GB for the OS, client processes, etc.
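A quick sanity check of that arithmetic (raw buffer space only; whatever shared-memory overhead Progress adds for headers, latches, etc. is not counted here):

```python
# Rough -B sizing math from the numbers above (raw buffer space only;
# shared-memory overhead beyond the buffers themselves is extra).
block_size = 8192          # database block size in bytes
b_old = 68_000             # original -B setting
b_new = 3_000_000          # new -B setting

gib = 2 ** 30
old_gib = b_old * block_size / gib
new_gib = b_new * block_size / gib
print(f"old -B: {old_gib:.2f} GiB")   # ~0.52 GiB
print(f"new -B: {new_gib:.2f} GiB")   # ~22.89 GiB
```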

And now we're swapping. We have about 50GB of paging space and 32% is being used right now.

What we don't understand is that when we run the svmon command (svmon -P -O unit=GB), it shows 25 GB of memory (23.8 virtual) allocated to each of the servers in our 4GL server group.

Is this expected behavior? We were expecting 22 GB to be used for the entire Progress database buffer, not per server. (I understand that each server isn't actually using that much memory, as we would have crashed by now if that were the case. But we are using all 66 GB of memory plus a third of our paging space.)

Additionally, svmon shows several sqlsrv2 processes with 25 GB allocated to each of those as well. The amount of memory allocated to each sqlsrv2 process seems to directly correlate with the -B setting....if I double -B, the amount of memory for each sqlsrv2 process also doubles. It seems that each sqlsrv2 process is being allocated the -B amount.

I feel like there is something wrong or something we aren't understanding about -B. Help/advice is appreciated.

Thanks.
 

TomBascom

Curmudgeon
10.2B07 is ancient, obsolete and unsupported. Service Pack 8 was the last release of 10.2B and it was issued 3 years ago (November 12, 2013). You should upgrade to something modern such as 11.6.

Our understanding of the -B parameter is that it should ideally be set to 10% of the database size...

This is wrong. That suggestion is not in any way an "ideal". It is nothing more than a starting point if you have no other idea what value to use. 68,000 was pretty ridiculously undersized though.

One could make an argument that the "ideal" size for -B is the size of your database.

In reality it is a trade off between best performance and cost of RAM.

Since you have a running database you could collect data about your IO rate, establish a target IO rate and then adjust -B to meet that target. The ProTop "Big B Guesstimator" will do this for you. You can download ProTop, for free, from ProTop ;)

And now we're swapping. We have about 50GB of paging space and 32% is being used right now.

Use of paging space <> "swapping". There are a number of AIX tuning parameters that impact how paging is managed. The details vary by release of AIX, which is unknown at the moment. Your LPAR configuration may also be less than perfectly ideal.

Your understanding of the svmon output is incorrect. The -B memory is being *shared* among those processes. Each server does have some private memory (generally 10MB to 20MB in my experience) but the GB values that you are seeing are the shared bits.

Your biggest HW configuration problem is likely that someone stuck you with SAN storage.

Given that you started with -B 68000 and that you mentioned "blocksindatabasebuffers" I can only guess that someone is using the cursed OpenEdge Explorer Tool or, worse, OE Management to administer things. So it is likely that very little meaningful tuning has taken place. Aside from looking at -B you should also probably review your storage area configuration. I hope everything isn't in the "schema area".

It might be helpful to share the symptoms of the poor performance you mention -- that may lead to useful suggestions for improvement. It is also helpful to crack open the db .lg file and extract the 75 to 100 lines of messages right after the most recent db startup (search for message "(333)"). That will tell us what the startup parameters actually are -- as opposed to the incomplete and not exactly believable list of values that OEM might be showing. Also -- run "proutil dbname -C describe" at the command line. That provides some useful nuggets too.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
I think the confusion is about shared (virtual) memory vs. RAM.

The -B parameter specifies the size of the database primary buffer pool. It is typically the largest single component of the database's shared memory, which is allocated by the broker when the database is opened.

Database processes (brokers, 4GL servers, SQL servers, background processes, utilities, etc.) and 4GL self-service clients do not "connect" to the database via TCP sockets. They map the database shared memory segment(s) into their own virtual memory address space. So if you look at the output of an OS utility like top, nmon, etc., you will see each of these processes using a similar amount of memory. This is (mostly) virtual memory, not physical memory.

Each of these processes will also allocate private memory for their own use: code, thread stacks, heap, caches, etc. But that amount will likely be quite small relative to their total virtual memory usage when your database shared memory is measured in dozens of GB. For this reason it is not straightforward to determine the actual amount of physical memory in use by each server-side process. But that amount, whatever it is, will be quite small for most processes, in comparison to the primary broker, which actually allocated the shared memory segments.
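To illustrate why summing per-process numbers misleads, here is a toy model (all of these numbers are invented; only the shape of the effect matters):

```python
# Toy model: N processes all map the same shared segment, so naively
# summing per-process totals counts the shared memory once per process.
shared_gib = 23.0        # e.g. the -B buffer pool mapped by every process
private_gib = 0.02       # ~20 MB of truly private memory per process (a guess)
n_procs = 20             # servers, clients, background processes, ...

naive_total = n_procs * (shared_gib + private_gib)   # what summing svmon rows suggests
actual_total = shared_gib + n_procs * private_gib    # what is really committed

print(f"naive sum : {naive_total:.1f} GiB")   # 460.4 GiB -- nonsense
print(f"actual    : {actual_total:.1f} GiB")  # 23.4 GiB
```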

We noticed that the -B (blocksindatabasebuffers) param was set to 68000. Our db is 423 GB. Blocksize is 8192. We have 66 GB of memory on the AIX server. Our understanding of the -B parameter is that it should ideally be set to 10% of the database size.
A -B of 68,000 (531.25 MB) is very very small. Although I don't know the "working set" of your application, i.e. the amount of memory it wants to read repeatedly and therefore the amount it would benefit from caching, I will venture a guess that it is too small. And given that you have 66 GB of RAM and 64-bit OpenEdge, your buffer pool is much smaller than it needs to be.

The notion that your buffer pool should be x% of your database size is antiquated and is now considered a worst practice. You want as much buffer pool as you can get, increasing it to the point that the increases continue to improve the ratio of logical to physical I/O and they don't hurt server performance by causing swapping.
 

cj_brandt

Active Member
How many physical reads and physical writes is the database trying to do when performance is slow? Does the database checkpoint during that slow time?

Was there new code introduced during the time when the performance issue was noticed ?

Disk IO and new code are good places to start looking when performance drops.
 

jennid

Member
Thanks everyone. Will try to answer all the questions that were asked of me. Also downloading ProTop.

I understand that 10.2B07 is old and unsupported. Within the next 6 months we will be upgrading to 11.x...there are a number of moving targets and our Progress upgrade is dependent on some other pieces of software getting upgraded beforehand.

AIX is version 6.1.

Yes, we have SAN storage.

Our ERP is Infor's SX.enterprise. Its install program modifies the properties files, which is where I got the "blocksindatabasebuffers" term from, versus just -B.

Type II storage areas. No, everything is not in the schema area.

Details on the performance issues... a few months ago users started reporting "halting" / slow response times in a number of applications. The most frequent complaint was seeing it in SX.enterprise (the Progress-based ERP), but it has been reported in other applications as well. Most of these applications use the SAN. Some network hardware/config changes were made around the time the problems started happening. The network techs have backed out a number of those changes and made hardware swaps, and seem to feel that they've ruled out anything on their end. There is basically a lot of finger pointing going on as to what the cause is, and I'm just trying to shore up anything that isn't right in the SX.e/Progress portion of the business. Maybe about 80% of the time when users report the "halting", they're doing something that would be writing to the db... like saving an order after entering it. But that isn't always the case. While users are reporting slow response in the UI, batch processes like reporting and larger back-end updates (i.e. month-end data updates) have not gotten any slower.

On the morning of June 13th, users started occasionally seeing Progress error 5451, indicating that the appserver was not connected (but it was ... appserver was running fine, but a few (not all) users see this error message). Our network group was again doing maintenance the weekend before. I'm not part of that group, but from what I understand, most of what they did that weekend was cleaning up old files on the SAN. I think they think I'm paranoid, but one of the things they also did that weekend was remove some files from the SAN that were encrypted by a virus a user accidentally downloaded the previous week. I've always wondered if that played into this...the fact that error # 5451 started abruptly after that weekend seems...interesting. Also, error 5451 has been seen on our development AIX system as well, which is a different Progress db, but same SAN, same network, etc...

I get what you're saying about the shared memory and how the OS utility is reporting that. What about _sqlsrv2 processes? I got conflicting information on that from our ERP support tech. Are the sqlsrv2 processes for the odbc connections using shared memory, or is the memory the OS is reporting accurate for those processes?

I can't say for sure if the database checkpoints when the slow performance is happening. Our Help Desk gets the call and usually by the time they contact me, everything looks fine. I should also be clear that not all users see the problem. User A and B might be doing the exact same thing and user A sees a problem, user B doesn't. Then the next day B sees it and A does not. I can train the Help Desk on how to look for checkpoints and see what that shows. They are aware of how to check IO and have said that nothing looks unusual when the problems are reported. From what I am told, wait for IO is usually under 5% (right now it is 3).

Right now buffer hits are at 99%. I believe they were in the mid-nineties before any -B changes were made.

Database was dumped/loaded a month ago with no noticeable changes.

No code changes at that time.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Error 5451 ("SERVER <servername> is not connected") could be the result of a firewall issue or maybe server TCP/IP timeout configuration.
KB: Firewall timeout causes 5451 on appserver connections

I get what you're saying about the shared memory and how the OS utility is reporting that. What about _sqlsrv2 processes? I got conflicting information on that from our ERP support tech. Are the sqlsrv2 processes for the odbc connections using shared memory, or is the memory the OS is reporting accurate for those processes?
The _sqlsrv2 processes are OpenEdge SQL servers. They also map shared memory into their virtual memory address spaces. SQL clients, local or remote, connect to them via TCP to access the database.

Right now buffer hits are at 99%. I believe they were in the mid-nineties before any -B changes were made.
Buffer hit percentage is not a great metric for database health/performance. For one thing, a low number is bad (mid nineties is low), but it is not necessarily the case that a high number is good.

Consider this:
Code:
define variable i as integer no-undo.

do i = 1 to 1000000:
  for each company no-lock:
  end.
end.
You'll have a buffer hit percentage at or close to 100%. That doesn't mean application performance will be good. The second point to consider is that a percentage is just not a very granular measure. Imagine your buffer hit percentage is 99%. It sounds good; 99 out of every 100 read requests is a database cache hit. But if you're doing 500,000 DB reads/second, that means you're doing 5,000 disk reads per second. That doesn't sound so good. And if you do something to make it twice as good, the number hardly changes (from 99% to 99.5%).
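Spelled out with the numbers above:

```python
# Why buffer hit percentage hides large absolute miss counts.
reads_per_sec = 500_000

for hit_pct in (99.0, 99.5):
    disk_reads = reads_per_sec * (100 - hit_pct) / 100
    print(f"{hit_pct}% hit rate -> {disk_reads:,.0f} disk reads/sec")
# 99.0% -> 5,000 disk reads/sec
# 99.5% -> 2,500 disk reads/sec (twice as good, but the % barely moves)
```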

A better way to measure is to take the ratio of logical to physical I/O. You'll see that metric when you run ProTop.

If clients are reporting performance issues (and especially if run time of server-side batch jobs is not changing much) then it is relevant to consider the wider architecture of the application.
  • Are ABL clients self-service, remote, or a mix?
  • If remote, what platform are they on? Is it thick client, single user per PC, or some client virtualization solution (RDS, Citrix, 2X, etc.)? If the latter, do those servers have some other workload that might affect performance (file server, print server, etc.)?
  • Are the clients' temp files local or on a network share?
  • What are the client/server parameters for each connection broker? e.g. -Mpb, -Ma, -Mi, -Mm
  • Are you using the new 10.2B06+ -prefetch* server parameters for improving ABL clients' prefetch query performance?
  • Is the DLC directory local or remote for the clients?
  • How much temp file I/O are the clients doing? How much of it is to disk? Are -tmpbsize/-Bt tuned properly given the application's use of temp-tables?
  • Do you have jumbo ethernet frames enabled end-to-end on the network path between clients and servers?
  • What is the connection speed between clients and servers?
  • What is the typical CPU, disk I/O, and network I/O load on clients that are not reporting performance issues versus clients that are?
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Also, here is the start of the db log file.
Direct I/O (-directio): Enabled.
That's a red flag for me. Do you (or whoever set it) know why you're using -directio? Can they prove you're better off with it than without it? Typically this parameter was used in the past as a workaround for poor server file system performance. It can reduce application performance.

Number of LRU force skips (-lruskips): 0
Set it to 100. It's a quick and easy way to reduce LRU latch contention, if it exists. If it doesn't, no harm done (as long as you don't set it so high that it increases in-paging).

Number of Alternate Database Buffers (-B2): 0.
Number of LRU2 force skips (-lru2skips): 0

You aren't using Alternate Buffer Pool. Some applications do heavy reads on small, static reference tables. Assigning them to an appropriately-sized Alternate Buffer Pool can improve application performance by reducing LRU latch contention and ensuring that frequently-accessed data remains memory-resident. Some thoughtful analysis and preparation beforehand, and monitoring after the fact, is called for if you're going to implement this feature. If you do implement it, set -lru2skips to 100 as well. If I were you I'd wait until I got onto a later 11.x release, to avoid ABP bugs in the 10.2B implementation.

Before-Image Cluster Size: 1048576.
Your BI cluster size is 1 MB. That is quite low, barely above the default of 512 KB. The appropriate range of values depends on your transaction volume but I imagine it would be fairly high for an ERP system configured for over 2000 users. I typically start with 16 MB and adjust if needed. How frequently does your database checkpoint during busy periods? The smaller your BI cluster size, the more frequently it will checkpoint. Mind you, with -directio you are changing from the default checkpoint logic. I'd be inclined to look into the -directio situation first, then look at BI cluster size.
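As a rough sketch of how cluster size drives checkpoint frequency during a busy period (the 2 MB/s BI write rate below is invented for illustration; measure your own):

```python
# A database checkpoints roughly once per filled BI cluster, so checkpoint
# frequency scales inversely with cluster size at a given BI write rate.
bi_write_mb_per_sec = 2.0   # hypothetical busy-period BI note rate

for cluster_mb in (0.5, 1, 16):
    seconds_per_checkpoint = cluster_mb / bi_write_mb_per_sec
    print(f"{cluster_mb} MB cluster -> checkpoint every {seconds_per_checkpoint:.2f} s")
```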

Number of Before-Image Buffers (-bibufs): 60.
This might be on the low side, depending on your transaction note volume. Check your empty buffer waits in promon. If they are high then double -bibufs and check again after your next DB restart.

Number of After-Image Buffers (-aibufs): 60.
If you do increase -bibufs, set -aibufs to the same value.

BI File Threshold Stall (-bistall): Disabled.
After-Image Stall (-aistall): Enabled.

BI stall is disabled and AI stall is enabled. I'm curious about the rationale for this.

Before-Image Block Size: 16384.
After-Image Block Size: 8192.

Your AI blocks are smaller than your BI blocks. I would change AI block size to 16384.

Storage object cache size (-omsize): 1024
This is the default value; is this enough for your schema? It should equal at least the number of records in _StorageObject. If it is too low you will have unnecessary OM latch contention that can reduce application performance.

Maximum Number of Users (-n): 2601.
Maximum Number of Servers (-Mn): 232.
Maximum Servers Per Broker (-Mpb): 140.
Maximum Number of Clients Per Server (-Ma): 15.
Minimum Clients Per Server (-Mi): 1.

I assume you have two connection brokers: the primary login broker which is also a connection broker for ABL clients, and a secondary broker for SQL clients.
Primary allows a maximum of 140 * 15 = 2,100 remote users.
-Mn is 232, so I assume -Mpb for the secondary broker is 91? (232 - 140 - 1) What is -Ma for the SQL broker?
-Ma 15 is fairly high; a fully-utilized 4GL server will have to round-robin among 15 clients to serve their query results. Performance differences between one client and another could come down to load differences between servers. -Mi is set to 1, which is good.
Your primary broker minport/maxport range is 976 ports, which is a lot bigger than it needs to be. That's not a problem per se; just a bit wasteful.
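The capacity arithmetic above, spelled out (the 91 figure is an inference from the parameters, as noted, not a fact):

```python
# Remote-client capacity implied by the broker parameters in the log.
Mn, Mpb_primary, Ma = 232, 140, 15

primary_remote_users = Mpb_primary * Ma    # 140 servers x 15 clients each

# -Mn covers all servers plus the broker itself, so the inferred server
# count for the secondary (SQL) broker is:
secondary_servers = Mn - Mpb_primary - 1   # an inference, not a fact

print(primary_remote_users)   # 2100
print(secondary_servers)      # 91
```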

Message Buffer Size (-Mm): 1024
This is the default (1 KB), and it is very likely too low to provide good performance for remote ABL clients. 8 KB is probably a better starting point. In 10.2B it has to be set to the same value on both the client and the connection broker. Ideally, you would want to enable jumbo frames as well when setting this.

BIW 233: (-----) Login by root.
AIW 234: (-----) Login by root.

Good.
You're not running a WDOG though.

APW 235: (-----) Login by root.
APW 236: (-----) Login by root.
APW 237: (-----) Login by root.
APW 238: (-----) Login by root.

4 APWs is probably overkill.

The after-image manager is beginning.
Good.
After-image Management Archival Method : On Demand.
Just curious about why you prefer demand-based switching to time-based switching. Is it because you have fixed-length AI extents?

SRV 4: (7129) Usr 2829 set name to SYSPROGRESS.
I recommend that you create a user for SQL access with the appropriate permissions and not use the default SYSPROGRESS user.
Do you keep the SQL query optimizer statistics up to date?
 

TheMadDBA

Active Member
Since everybody else helped with the DB settings.. :cool:

Run vmo -a to check the values of:
  • minperm
  • maxperm
  • maxclient
  • strict_maxclient
  • strict_maxperm
If you were previously on AIX 5, or if someone felt they knew better, those parameters might be wrong for a database server. If you have nmon installed (please make sure you do) you can also run nmon -m and get a clearer view of the memory settings and usage.

These parameters control how much memory is used for the AIX file buffer cache. If the minperm value is set higher than a few percent, and/or the strict* values are set to on, you could actually have paging issues.
 

jennid

Member
Thank you all for your replies and advice. I'm out of the office for a few days, but when I return, will be reviewing all of your suggestions in detail and will let you know how things go. Thanks.
 