Question Database shuts down - AIX ulimit

MarkoLachance

New Member
Hi !

Yesterday our database shut down with the following errors:

- [2015/01/08@16:39:49.079-0500] P-27787570 T-1 I ABL 75: (1450) -> ** stget: out of storage.
- [2015/01/08@16:39:49.094-0500] P-27787570 T-1 I ABL 75: (6079) -> SYSTEM ERROR: bkioWrite:Bad address ...
- [2015/01/08@16:39:49.094-0500] P-27787570 T-1 I ABL 75: (6071) -> SYSTEM ERROR: error reading file... In this case, it was the BI fixed extent file.

The before-image files were only 360 MB, so we had enough space.
According to the Progress Knowledgebase, it seems that a process reached one of its soft resource limits.

We have both CHUI and GUI applications, and some of them use classes.
Here are the current ulimit values:

time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 131072
stack(kbytes) 32768
memory(kbytes) 32768
coredump(blocks) 2097151
nofiles(descriptors) 3000
threads(per process) unlimited
processes(per user) unlimited

I don't know whether those limits are correct. How can I determine the right values? Are there any best practices?

Progress version 10.2B07, database size about 35 GB, and about 275 users on average.

Thanks,
Marko
 

TomBascom

Curmudgeon
Why do you think that ulimit is involved? It /might/ be related to the stget error but there is no special reason to think that it has anything to do with "bkio".

If you found a kbase that you think is relevant it is always helpful to post the link or article number.

"Error reading" means "error reading". Possibly due to hardware failure. Or maybe because the redacted address is bogus. Possibly because something went wrong that caused the stget -- but there is scant information to go by here.

Hiding the full text of the error is self-defeating. Knowing the rest of the message is useful -- there are usually some hints about things like the file name, the os error number and the offset in the file.

There should also be a protrace.27787570 file lying around somewhere. It should have a 4GL stack trace in it that might shed some light on what your code was doing at the time of the error.
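The protrace file name follows from the process ID in the log line (the "P-" field). As a rough sketch, assuming the line layout seen in the excerpts in this thread, the fields can be pulled apart like this. The function names and the regular expression are illustrative only, not part of any Progress tool:

```python
import re

# Layout assumed from the log excerpts in this thread:
# [timestamp] P-<pid> T-<thread> <level> <component> <user>: (<msg number>) <text>
LOG_LINE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+"
    r"P-(?P<pid>\d+)\s+"
    r"T-(?P<thread>\d+)\s+"
    r"(?P<level>\w)\s+"
    r"(?P<component>\w+)\s+"
    r"(?P<user>\d+):\s+"
    r"\((?P<msgnum>\d+)\)\s+"
    r"(?P<text>.*)"
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.search(line)
    return match.groupdict() if match else None


def protrace_name(line):
    """Build the expected protrace file name (protrace.<pid>) from a log line."""
    fields = parse_log_line(line)
    return "protrace." + fields["pid"] if fields else None


if __name__ == "__main__":
    sample = ("[2015/01/08@16:39:49.079-0500] P-27787570 T-1 I ABL 75: "
              "(1450) ** stget: Hors limite.")
    print(protrace_name(sample))
```

The message number in parentheses (1450, 6079, 6071) is also the quickest thing to search the knowledgebase for.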
 

MarkoLachance

New Member
This is the kbase I was talking about: http://knowledgebase.progress.com/articles/Article/P168379.
I was not trying to hide details. I posted only the error text because the messages are in French, but here are the full error messages.

[2015/01/08@16:39:49.079-0500] P-27787570 T-1 I ABL 75: (1450) ** stget: Hors limite.
[2015/01/08@16:39:49.079-0500] P-27787570 T-1 I ABL 75: (2252) Début de la transaction de 'backout'.
[2015/01/08@16:39:49.094-0500] P-27787570 T-1 I ABL 75: (6079) SYSTEM ERROR: bkioWrite : L'adresse utilisée est incorrecte : adresse = 0x0
[2015/01/08@16:39:49.094-0500] P-27787570 T-1 F ABL 75: (6071) SYSTEM ERROR: Erreur en cours de lecture du fichier /dc03db12/db/acct/can/acct.b1, ret = -1"

Of course I was hoping to find some details about the file name, but I don't see any hints.
While answering you I found this in the database log; maybe it's more helpful:

[2015/01/08@16:40:23.853-0500] P-4981152 T-1 I WDOG 75: (5028) SYSTEM ERROR: Releasing regular latch. latchId: 1
[2015/01/08@16:40:23.856-0500] P-4981152 T-1 I WDOG 49: (2522) User 75 died holding 1 shared memory locks.
[2015/01/08@16:40:23.856-0500] P-7340776 T-1 I SRV 24: (2520) Stopped.
 

TomBascom

Curmudgeon
Google translate is a wonderful tool :)

The later errors are a result of user 75 dying while holding a latch. That is why the db stopped.

Usually those errors occur because someone killed user 75 with a "kill -9". They might also occur due to a bug.

I cannot recall ever seeing one associated with an "stget" type of error so that is unusual. I suppose the process /might/ have been killed in the roughly 35 seconds between messages but I can't say that for sure.
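The interval between the last client message (16:39:49.094) and the first watchdog message (16:40:23.853) can be checked straight from the log timestamps. A small sketch, assuming the timestamp layout shown in the excerpts above; the function name is just for illustration:

```python
from datetime import datetime

# Timestamp layout as it appears in the log excerpts, e.g. 2015/01/08@16:39:49.094-0500
STAMP_FORMAT = "%Y/%m/%d@%H:%M:%S.%f%z"


def seconds_between(first, second):
    """Return the number of seconds from the first timestamp to the second."""
    start = datetime.strptime(first, STAMP_FORMAT)
    end = datetime.strptime(second, STAMP_FORMAT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    gap = seconds_between("2015/01/08@16:39:49.094-0500",
                          "2015/01/08@16:40:23.853-0500")
    print(round(gap, 3))  # 34.759
```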

Did you find a protrace file?

Your ulimit values seem "modest", it probably wouldn't hurt to increase them. But if you find yourself frequently doing so I'd want to take a good hard look at your code. It may be that you have some memory leaks.
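Before raising anything it helps to see both the soft and the hard values, since a process can raise its own soft limit only as far as the hard one. Here is a minimal sketch using Python's standard resource module, assuming Python is available on the box. It reports the limits inherited by the Python process itself, so run it as the same user and from the same kind of shell that starts the Progress clients. Note that getrlimit reports data and stack in bytes, while ulimit prints kbytes. The names LIMITS, describe and current_limits are made up for this example:

```python
import resource

# Subset of the limits listed by "ulimit -a"; labels chosen to match that output.
LIMITS = {
    "data": resource.RLIMIT_DATA,       # bytes here, kbytes in ulimit output
    "stack": resource.RLIMIT_STACK,     # bytes here, kbytes in ulimit output
    "nofiles": resource.RLIMIT_NOFILE,  # number of open file descriptors
    "coredump": resource.RLIMIT_CORE,   # bytes here, blocks in ulimit output
}


def describe(value):
    """Render a raw limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits of the current process."""
    report = {}
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        report[name] = (describe(soft), describe(hard))
    return report


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name:10s} soft={soft:>12s} hard={hard:>12s}")
```

If the soft value is well below the hard one, the soft limit is the one a leaking client will hit first.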

Memory leaks are usually a result of flawed coding with handles. DELETE WIDGET and WIDGET-POOL are your friends. Dynamic queries and XML handling are frequent sources of handle-based memory leaks on UNIX systems.
 

TheMadDBA

Active Member
I have seen stget-style errors cause DB shutdowns like this before. Usually the process is leaking memory like mad and has reached either a ulimit or a 32-bit limit of some kind, which causes the client to die in an untrapped manner.

Most of the time it will just crash the session and not the DB, but if that session has a latch locked it can cause the watchdog to shut down the database for integrity reasons.
 