Recurrent Shared memory failure

illmatic · Oct 9, 2008

Dear all,

Hopefully someone can give me some advice.

I have a recurrent database error:

"(14394) Out of free shared memory"

This entry repeats for a long time.

For the time being this should not be a problem, as I expect Progress to increase the needed resources automatically. Doesn't it?

BUT!

After some time (in fact 10 days or more) the database crashes and stops responding. The log file does not even mention an emergency shutdown. Furthermore, the database still appears to be up, but no connections are possible anymore. (Maybe because the broker killed itself? Please find the log file below.)

Excerpt from the log file:

[2008/10/01@20:04:59.959+0200] P-12419 T-1 F SRV 3: (6495) Out of free shared memory. Use -Mxs to increase.
[2008/10/01@20:04:59.959+0200] P-906 T-1 I SRV 2: (453) Logout by eaiopreu on agnux02 batch.
[2008/10/01@20:04:59.964+0200] P-22037 T-1 I ABL 104: (453) Logout by aitopr on batch.
[2008/10/01@20:04:59.966+0200] P-12419 T-1 I SRV 3: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 5
[2008/10/01@20:04:59.970+0200] P-12419 T-1 I SRV 3: (2522) User 597 died holding 1 shared memory locks.
[2008/10/01@20:04:59.970+0200] P-906 T-1 I SRV 2: (2520) Stopped.
[2008/10/01@20:04:59.970+0200] P-12419 T-1 I SRV 3: (2520) Stopped.
[2008/10/01@20:04:59.972+0200] P-28227 T-1 I AIW 58: (2520) Stopped.
[2008/10/01@20:04:59.983+0200] P-28179 T-1 I BIW 56: (2520) Stopped.
[2008/10/01@20:05:00.182+0200] P-28059 T-1 I BROKER 0: (2249) Begin ABNORMAL shutdown code 2
[2008/10/01@20:05:00.182+0200] P-28059 T-1 I BROKER 0: (-----) Sending signal 5 to user 1
[2008/10/01@20:05:00.182+0200] P-28059 T-1 I BROKER 0: (-----) Sending signal 5 to user 4
[2008/10/01@20:05:00.182+0200] P-28059 T-1 I BROKER 0: (-----) Sending signal 5 to user 5

and so on...

After almost all users of that broker have been disconnected, the last entries in the log file are:

[2008/10/01@20:05:01.212+0200] P-28059 T-1 I BROKER 0: (2525) Disconnecting dead server -1.
[2008/10/01@20:05:02.221+0200] P-28059 T-1 I BROKER 0: (2525) Disconnecting dead server -1.
[2008/10/01@20:05:07.752+0200] P-24612 T-1 I ABL 66: (453) Logout by aitopr on batch.
[2008/10/01@20:05:08.495+0200] P-28246 T-1 I ABL 74: (453) Logout by root on batch.
[2008/10/01@20:05:10.299+0200] P-28059 T-1 I BROKER : (-----) Removed shared memory with segment_id: 1605651
[2008/10/01@20:05:10.611+0200] P-28059 T-1 I BROKER : (334) Multi-user session end.

In my opinion this broker shuts itself down, and therefore no more 4GL connections are possible. (The *.lk file is still there, but proshut no longer works.)
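
To see how often a given message appears over those 10+ days, the log can be tallied by message number. The following is a minimal sketch, not a Progress tool: the line layout is inferred only from the excerpts above, and the names `LOG_LINE`, `parse_line`, `count_by_message` and `SAMPLE` are made up for illustration.

```python
import re
from collections import Counter

# Layout inferred from the excerpt above (an assumption, not a documented format):
# [timestamp] P-<pid> T-<tid> <severity> <process> [<id>]: (<message number>) <text>
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-(?P<tid>-?\d+)\s+"
    r"(?P<sev>[A-Z])\s+(?P<proc>[A-Za-z]+)\s*(?P<id>-?\d*)\s*:\s+"
    r"\((?P<msgnum>[^)]+)\)\s+(?P<text>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_by_message(lines):
    """Count entries per message number, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry["msgnum"]] += 1
    return counts


# A few lines copied from the excerpt above.
SAMPLE = [
    "[2008/10/01@20:04:59.959+0200] P-12419 T-1 F SRV 3: (6495) Out of free shared memory. Use -Mxs to increase.",
    "[2008/10/01@20:04:59.970+0200] P-906 T-1 I SRV 2: (2520) Stopped.",
    "[2008/10/01@20:04:59.970+0200] P-12419 T-1 I SRV 3: (2520) Stopped.",
    "[2008/10/01@20:05:10.299+0200] P-28059 T-1 I BROKER : (-----) Removed shared memory with segment_id: 1605651",
]

if __name__ == "__main__":
    for msgnum, n in count_by_message(SAMPLE).most_common():
        print(msgnum, n)
```

For a real log, pass an open file object instead of `SAMPLE`; lines that do not fit the assumed layout are simply ignored.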

Anybody got an idea?

Below is some information about the system:

HP Itanium (B.11.23 U ia64)
Progress (OE 10.1C)
Startup Parameter (-L 74000 -B 150000 -Mn 50 -Ma 10 -n 550 -s 2400 -bibufs 50 -Mi 2 -Mf 6 -spin 80000 -bithold 10000 -bistall
-pf -/XXX/aidaemon.pf -t -ServerType 4GL)

I would really appreciate some help!

Thanks in advance

TomBascom · Oct 9, 2008

> For the time being this should not be a problem, as I expect Progress to increase the needed resources automatically. Doesn't it?

No.

Have you tried doing as the error message says and increasing -Mxs?

BTW, (this has nothing to do with your shared memory issue but...) -spin 80000 is almost certainly way too high. Something like 5000 would probably be more appropriate.

Why -Mi 2? That's a fairly strange setting.

You should not be specifying -Mf unless you have a very good reason to (and I seriously doubt that you do, hardly anyone does).

illmatic · Oct 9, 2008

Hello Tom,

thank you for the quick answer!

> Have you tried doing as the error message says and increasing -Mxs?

Well, in fact I did not increase or set -Mxs, as I never used that parameter in the past and the error only appears from time to time. In general the database is doing fine. This error has been occurring since the update from 10.1A to 10.1C (but I didn't change any startup parameters, except to spawn a separate SQL broker).

Therefore I don't want to experiment with that parameter. I cannot afford to restart the database several times, as our largest customer is working on it 24/7.

> -spin 80000 is almost certainly way too high. Something like 5000 would probably be more appropriate.

Well, this parameter was set in the past, and a Progress consultant stated that the "Latch Timeouts" should be "< 10 /sec".
As we have 8 CPUs running, we were told to set that parameter to 80000.

But indeed, maybe this too-high value could be causing the shared memory problems!

> Why -Mi 2? That's a fairly strange setting.
>
> You should not be specifying -Mf unless you have a very good reason

Well, I have to admit that these parameters may be superfluous. They were set a long time ago and are more or less just a relic.

To get back to the -Mxs parameter:

So do you think that this parameter won't harm the database or the server?
The last entry I found in the log file was "Excess Shared Memory Size (-Mxs): 251".

To be honest, I don't know how to check how much shared memory is allocated on a UNIX system, and as we have 19 different databases running I don't want to commit a specific amount to just one database.

TomBascom · Oct 9, 2008

You have an error.

It crashes your database.

A clear diagnostic message with a suggested action is presented.

I strongly suggest that you take that action. To my mind that would be better than tolerating system crashes for your largest customer. A customer who wants to be 24x7.

While it isn't an everyday occurrence, it also isn't all that unusual for -Mxs to need to be tweaked. Set -Mxs to 512 or 1024 and be done with it. It's not much memory and it isn't going to cause a problem.

> To be honest, I don't know how to check how much shared memory is allocated on a UNIX system, and as we have 19 different databases running I don't want to commit a specific amount to just one database.

ipcs -m

The largest consumer of shared memory in a Progress database is -B. Your -B setting of 150,000 dwarfs my suggested -Mxs setting. If you have 19 databases and they all use -B 150,000 and you are using an 8k db block then you are consuming 21GB of shared memory. If you have 550 self-service clients (you show no -S in your startup parameters) or if you are using app-servers then you are also using a whole lot of non-shared memory. It's an Itanium server so, presumably, there is lots of RAM installed. Is that your situation?
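
The arithmetic behind that estimate can be sketched as follows. This is a rough illustration only: it counts just the -B buffer pool (real shared memory segments carry additional overhead), it assumes the 8k block size mentioned above, and it assumes -Mxs is expressed in kilobytes. The function names are made up for the example.

```python
KIB = 1024
GIB = 1024 ** 3


def buffer_pool_bytes(b_buffers, block_size_bytes):
    """Approximate size of the -B buffer pool: one database block per buffer."""
    return b_buffers * block_size_bytes


def mxs_bytes(mxs_kb):
    """Size of the -Mxs overflow area, assuming the value is given in kilobytes."""
    return mxs_kb * KIB


one_db = buffer_pool_bytes(150_000, 8 * KIB)  # -B 150000 with 8k blocks
all_dbs = 19 * one_db                         # 19 databases, same -B on each
overflow = mxs_bytes(1024)                    # suggested -Mxs 1024

if __name__ == "__main__":
    print(f"one database:  {one_db / GIB:.2f} GiB")
    print(f"19 databases:  {all_dbs / GIB:.1f} GiB")
    print(f"-Mxs 1024:     {overflow / KIB:.0f} KiB")
    print(f"-B is about {one_db / overflow:.0f}x larger than -Mxs 1024")
```

Under these assumptions one buffer pool is a bit over 1 GiB, 19 of them come to a little under 22 GiB, and an -Mxs of 1024 is about a thousandth of a single buffer pool.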

As for -spin, like I said -spin has nothing to do with shared memory. However, 80,000 is quite high, is likely causing you to pointlessly waste CPU cycles and is potentially limiting the scalability of your system. Furthermore the idea of setting -spin based on some multiple of the number of CPUs has long been denounced by the database engine crew. It seemed like a good idea once upon a time but it quickly turned out to be ineffective -- yes, I know that you can still find kbase entries that say to do it and I know that some consultants and even people in tech support or Progress professional services still roll it out as "the standard". They're wrong. Don't do it. Especially not with 10.1C where there have been major changes to the latching code that significantly change the behavior of -spin.

Latch timeouts of < 10/sec is certainly nice and is a wonderful goal but 11 isn't exactly a problem. Occasional measurements in the 100s aren't necessarily a problem. The thing to pay attention to with latch timeouts at low levels is the trend. If the trend is increasing, especially non-linearly with load (as measured by logical IO ops) then it is serving as a "canary in the coal mine" telling you that there is a problem coming.
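
One way to watch that trend is to compare latch timeouts against logical reads over a series of sampling intervals. The sketch below is only an illustration of the idea: the sample figures are invented, the function names are hypothetical, and the "busier half versus quieter half" comparison is a crude stand-in for proper trend analysis.

```python
def timeouts_per_kilo_read(samples):
    """samples: (logical_reads, latch_timeouts) pairs, one per interval."""
    return [1000.0 * timeouts / reads for reads, timeouts in samples if reads > 0]


def grows_faster_than_load(samples):
    """True if timeouts per read are higher in the busier half of the samples."""
    ordered = sorted((s for s in samples if s[0] > 0), key=lambda s: s[0])
    if len(ordered) < 2:
        return False
    half = len(ordered) // 2
    quiet = timeouts_per_kilo_read(ordered[:half])
    busy = timeouts_per_kilo_read(ordered[-half:])
    return sum(busy) / len(busy) > sum(quiet) / len(quiet)


# Invented figures: timeouts scale linearly with load, so no warning.
linear = [(1000, 10), (2000, 20), (4000, 40)]
# Invented figures: timeouts grow faster than load, the "canary" case.
nonlinear = [(1000, 10), (2000, 40), (4000, 160)]

if __name__ == "__main__":
    print(grows_faster_than_load(linear))
    print(grows_faster_than_load(nonlinear))
```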

You've got some other oddball parameter settings -- the bi stuff, for instance, is kind of old-school. The bi file hasn't been limited to 2Gb since v9 so there's no actual need to be setting that stuff. OTOH it can be useful in a twisted sort of manner if your application often misbehaves -- but there are less intrusive ways to handle those situations.

If you really don't know much about UNIX and Progress I suggest that you get some help. This might interest you: DBAppraise

illmatic · Oct 9, 2008

> You have an error.
>
> It crashes your database.
>
> A clear diagnostic message with a suggested action is presented.

You are completely right!

I will try to set that as soon as possible!

> It's an Itanium server so, presumably, there is lots of RAM installed. Is that your situation?

Indeed, there are about 48 GB of RAM.

> If you really don't know much about UNIX and Progress I suggest that you get some help

Since you mentioned these "old school" parameters, I am thinking about reviewing them all and bringing them "up to date".
Some kind of consultant will be necessary for that. Does your company offer any?

In the meantime I will try to convince management that a lot of "rethinking" needs to be done concerning those startup parameters, even after the migration to 10.1C.

Thank you for your detailed answers.

I will get back to you as soon as I get a "go" from upper management!
Maybe DBAppraise is an option as well.

TomBascom · Oct 9, 2008

> Since you mentioned these "old school" parameters, I am thinking about reviewing them all and bringing them "up to date".
> Some kind of consultant will be necessary for that. Does your company offer any?

Sure. That's what I do in my day job.

illmatic · Oct 15, 2008

Always those political discussions... Our management does not intend to request consulting services for a problem that only occurs from time to time and for which a "solution" has already been given.

Let's see how many more times the database has to crash before they change their minds!

In that case I will get back to you immediately.

Many thanks for your replies!

comatt1 · Oct 29, 2008

I think I would also look at the number of AIBUFS/BIBUFS you have allocated to all the databases (and their block/cluster size.)

Are you using AI as well? You may have said you did, but I missed it while skimming.

To look at Progress shared memory you can also run:

proutil -C dbipcs

illmatic · Feb 24, 2009

Hello community,

As I finally got a solution for the problem above, I just want to share it with you.

After a long time spent investigating and reconfiguring several parameters, the Progress support team finally delivered a patch which fixes the issue of releasing attached shared memory.

The patch number is:

10.1C0208

Note:

This error occurred on an HP-UX B.11.23 U ia64 (Itanium) server.

Regards,

Alex
