Database Shutdown very slow

asucher · Feb 19, 2008

Hi all,

I have a Progress DB (from promon: Database version number: 8283; Size: 130GB) on an AIX system with the following parameters:
-L 16384
-Mn 860
-Ma 1
-spin 60000
-B 327680
-bibufs 50
-aibufs 50

We have lots of server processes, because it is a business-critical DB which needs good response time. In the past we used a lower -Mn and a higher -Ma and had lots of performance issues. Since we changed these parameters, the performance is OK. But now we have the following problem: every shutdown of this database puts the AIX server (a new p570 machine with 8 CPUs) into a state where it is extremely busy for 10 minutes or more (we do not even get a prompt on the root console during that time). And sometimes the machine crashes, because it is so busy that it cannot answer the HACMP heartbeat/alive requests from the other cluster member.

Do you have any suggestions as to what we can do? We already changed some HACMP/AIX parameters (heartbeat timeout, etc.) after talking to IBM Support, but we still encounter the problem. Should I use a kill script to kill some processes at the OS level before running proshut? Or is there a way to 'manually' kill the user/server processes with a Progress script before issuing proshut? Is there a way to use a "shutdown abort" (like in Oracle) without the risk of data loss? Would the emergency shutdown be an option for us?

BTW: currently we use 'proshut -by' for the shutdown.

Thank you in advance!

Andy

Casper · Feb 19, 2008

I suppose this is a 9.1x Progress database with 8K block size.
How many users are on the system? Are there really 800+ users?
How much memory is there?
What AIX version and TL do you have?

Looking at the -B I suppose this is 64 bit Progress, so I'm thinking like 9.1E.

Are you sure you aren't swapping shared memory to disk? (Is paging space nearly empty?) If shared memory is being swapped to disk, then during shutdown a lot of paging space has to be processed before the database can shut down....
What are the values of maxperm and minperm?

Do you know what is causing the slowdown of the system? Is it IO? Memory usage? (Do you have data from topas or nmon?)

AIX handles many servers very efficiently, so I don't suspect a problem there if you have enough memory on the system.

Regards,

Casper.

TomBascom · Feb 19, 2008

What version of Progress is this really? The PROMON db version number is just about useless...

What is the db block size?

How about the bi cluster size?

What does the application do? One potential source of very long shutdown times is a need to flush a great many dirty buffers. If you have a very update intensive application that might be related.

The thing that you need to know is what is the db doing that is making the machine so busy during shutdown? Once you know that then you might be able to address the problem.

Offhand, while killing 800+ servers strikes me as being a fairly intense thing to do, it doesn't seem like it should be as bad as you indicate.

I'd also go back to your original performance problems -- those sound awfully suspicious too.

I think that there is an underlying problem that hasn't been found or solved.

Casper · Feb 19, 2008

What version of Progress is this really? The PROMON db version number is just about useless...

You are right (OK, I never doubted that, but anyway), I never paid attention to it. I just seemed to remember something about the version number being block size plus Progress version.
So 8283 - 8192 = 91, which I took to mean version 9.1.

But looking at the first test database I could find, I have version number 2198 with block size 2048, and that would leave 150 for the version....
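
The arithmetic behind that half-remembered rule can be checked with a small sketch. The function name is hypothetical, and the rule itself (version number = block size + version code) is only the guess described above, not a documented format; the two inputs are the numbers quoted in this thread.

```python
# Sketch of the guessed rule: promon "Database version number"
# = database block size + some version code. This is an assumption
# from the post above, not a documented Progress format.

def version_code(db_version_number, block_size):
    """Return what is left after subtracting the block size."""
    return db_version_number - block_size

# Numbers quoted in the thread.
print(version_code(8283, 8192))  # original poster's database
print(version_code(2198, 2048))  # Casper's test database
```

The first result (91) looks like "9.1", but the second (150) does not map onto any obvious release number, which is why the rule is not a reliable way to identify the Progress version.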

Casper

asucher · Feb 20, 2008

Hi Casper and Tom,

thank you for your posts.

Progress Version is: 9.1E04
DB-Block-Size: 8192 bytes
BI cluster Size: 8192 kilobytes
AI block size: 8192 bytes

During peak times we have 800+ users/sessions on this database. I don't know in detail what exactly the application does, because it was developed by an external company (we do not have the source). All I know is that this database is used to store rich text files (rtf).

BTW: during peak-times the machine has a CPU-load of approximately 30-40% (mean value of all CPUs).

The machine has 16GB of physical memory and 12GB of allocated paging space. The paging space is not used:
Page Space  Physical Volume  Volume Group    Size  %Used  Active  Auto  Type
paging00    hdisk0           rootvg        6080MB      1     yes   yes    lv
hd6         hdisk0           rootvg        6080MB      1     yes   yes    lv

So I think we have no trouble with paging or with CPU (in the "normal" case). Values for MINPERM and MAXPERM are (I think these are the defaults):
5.0 minperm percentage
10.0 maxperm percentage

Why do we use -Mn 860 and -Ma 1?
In the past we used -Mn 43 and -Ma 20 for this DB, and some users very often encountered performance problems, while other users saw no performance impact at the same time. We guessed that the problem came from "batch jobs" which ran on the same server process as the users with the performance problems, while the other users did not have to share their server process with batch jobs. Since we changed the design to use one server process per user, we have not encountered this performance problem again. So it seems that this decision was OK.

But, as I mentioned, now the shutdown is very slow and sometimes causes an AIX crash. While the database is shutting down it isn't even possible to run topas or any other command. The machine gives no response for these minutes. So I thought that the only problem could be that the shutdown sends a kill to all Progress servers at the same time, which causes our problem. What I saw at the last shutdown: topas reported a value of > 1700 for runqueue (Runqueue: the average number of threads that were ready to run but were waiting for a processor to become available).

BTW: I have another database on this machine, with -Mn 430 and -Ma 2, and it can be shut down normally.
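
To compare the three configurations mentioned in this post, here is a minimal sketch. It assumes that -Mn caps the number of server processes and that -Mn times -Ma gives a rough upper bound on remote clients; other startup parameters that can affect these numbers are ignored, and the labels and function name are made up for illustration.

```python
# Sketch only: the (-Mn, -Ma) pairs are from the thread, the labels are invented.
CONFIGS = {
    "old settings": (43, 20),
    "current settings": (860, 1),
    "other database": (430, 2),
}

def max_remote_clients(mn, ma):
    """Rough upper bound on remote clients if every server is full (assumption)."""
    return mn * ma

for label, (mn, ma) in CONFIGS.items():
    clients = max_remote_clients(mn, ma)
    print(f"{label}: up to {mn} server processes, up to {clients} remote clients")
```

All three settings allow about the same number of clients (860); what differs is the number of server processes that may exist, and therefore have to go away, at shutdown: up to 43, 430 or 860.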

I hope you can understand my explanations. Thank you!

Regards
Andy

TomBascom · Feb 20, 2008

Version 9 + 8k blocks + -B of 327680 means that you must be doing something to get around the AIX 32 bit shared memory limit. My guess is that you have used EXTSHM (extended shared memory).

This will result in very poor performance. Stop using it. EXTSHM is poison. Either accept that you cannot set -B to much more than 100000 with this combination of Progress & AIX or upgrade Progress to 10.0B or higher and use a 64 bit executable.

I think that if you remove EXTSHM and set -B to about 100,000 you will find that your problems go away.
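
For a sense of scale, the buffer pool sizes implied by the two -B values can be estimated as follows. This is a rough sketch that simply multiplies -B by the 8192-byte database block size given earlier in the thread; it ignores per-buffer overhead and the other shared memory structures, and the function name is hypothetical.

```python
DB_BLOCK_SIZE = 8192  # bytes, as reported earlier in the thread

def buffer_pool_mib(buffers, block_size=DB_BLOCK_SIZE):
    """Approximate buffer pool size in MiB: -B times block size (overhead ignored)."""
    return buffers * block_size / (1024 * 1024)

print(buffer_pool_mib(327680))  # current -B: 2560.0 MiB, i.e. 2.5 GiB
print(buffer_pool_mib(100000))  # suggested -B: 781.25 MiB
```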

BTW, your bi cluster size is quite low. I'd probably set it to at least 32000.

There may well be other issues but until EXTSHM is out of the way it will be very difficult to see them.

Casper · Feb 20, 2008

I'm sure Tom will jump in here (he's the expert in this); in the meantime I'll reply.

All I know is that this database is used to store rich text files (rtf).

Are they actually stored in the database? (as raw fields?) 'Normally' the documents are stored outside the database and the reference to the document is stored in the database...

So I think we have no troubles with paging nor with CPU (in "normal" case). Values for MINPERM and MAXPERM are (I think per default):
5.0 minperm percentage
10.0 maxperm percentage

No paging is good.

The defaults for minperm and maxperm are 20 and 80 respectively, but for a dedicated database server your values are IMO good ones.

In the past we used -Mn 43 and -Ma 20 for this DB, and some users very often encountered performance problems, while other users had no performance impacts at the same time.

Well, that is going from one extreme to another. 20 clients per server is IMO a bit much. An -Ma of 5 would probably have had the same effect...

We guessed that the problem comes from "batch jobs", which ran on the same server process as the users with the performance problems - while the other users with no performance impact did not have to share the CPU with batch jobs. Since we changed the design to use one server process per user, we did not encounter this performance problem again. So it seems that this decission was OK.

Do you mean you run the batch jobs client/server? Why not run them as self-service clients? It is much more efficient to connect directly to shared memory than through a server (provided the batch jobs run on the same machine as the database).

Well, I noticed that in the meantime Tom already answered (I was getting there, lol).

Regards,

casper.

asucher · Feb 22, 2008

Thank you so far. I will discuss your suggestions with the software vendor and try them out and post the results. This will take some weeks because I have to wait for our next maintenance window to change the parameters...

TomBascom · Feb 22, 2008

Good luck. Hopefully you're working with a vendor who has experience with large scale systems.
