Db abnormal shutdown

VageeshA

New Member
Hello Everyone,

Yesterday one of our DBs went down abnormally. The following information was present in the LOG file:

SYSTEM ERROR: Memory Violation
User 650 Died with holding Latch id 2
Begin Abnormal ShutDown

Scanning through the LOG file shows that User 650 logged in at 11:00:00 and logged out at 11:02:18. The DB went down at 11:30:00.

We believe User 650 must be a native (local) process on the server hosting the DB.

Can anyone please clarify the concerns below:

1. Can PROQUIET be the cause of the issue, as it is scheduled to run at 11:00:00?
2. We have set the Lock Wait Timeout to 30 minutes, which exactly matches the interval from 11:00:00 to 11:30:00.
3. No DB table updates run on the server hosting the DB that would cause a lock wait scenario.
4. Does Latch Wait also follow the Lock Wait Timeout?
5. AI EMPTY also runs at the mentioned time.

Operating Environment:
Progress 9.1E
Solaris 10

Thanks in Advance...

Regards,
Vageesh
 
Can you please check the log file to see whether user number 650 was allocated to any other process between 11:02 and 11:30? That will give us some idea of whether the user 650 that logged in at 11:00 logged out gracefully. Progress reassigns a user number only after everything acquired by the previous user has been released, which keeps the number unique.
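
One rough way to run that check, as a minimal sketch. It assumes each log line looks like `HH:MM:SS Usr   650: message`, which is a guess at the log layout and not something confirmed in this thread; the function name `entries_for_user` and the sample lines are made up for illustration.

```python
import re
from datetime import time

# ASSUMPTION: log lines start with a timestamp and a "Usr <n>:" tag.
# Adjust the pattern to whatever your log file actually contains.
LINE_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+Usr\s+(\d+):\s*(.*)$")


def entries_for_user(lines, user, start, end):
    """Return (timestamp, message) pairs for one user number within [start, end]."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed layout
        hh, mm, ss, usr, msg = match.groups()
        stamp = time(int(hh), int(mm), int(ss))
        if int(usr) == user and start <= stamp <= end:
            found.append((stamp.isoformat(), msg))
    return found


# Invented sample lines, only to show the shape of the result.
sample = [
    "11:00:00 Usr   650: Login (sample)",
    "11:02:18 Usr   650: Logout (sample)",
    "11:05:10 Usr   650: Login (sample)",
    "11:06:00 Usr   651: Login (sample)",
]
reused = entries_for_user(sample, 650, time(11, 2, 19), time(11, 30, 0))
print(reused)
```

If the list comes back empty for the real log, number 650 was not handed to anyone else in that window.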

As you mentioned, ai_empty and proquiet were both scheduled to run at 11:00, so both jobs would try to switch extents at the same time, but I don't think this led to the issue. Does the log file show a database quiet point being enabled at 11:00? To avoid any issues from this, either change the cron timings or enable proquiet with no lock, which will quiet the DB without holding a latch lock.

As I understand it, the latch wait should not follow the lock wait timeout, because latches are held for a very short duration (milliseconds); you can see this in promon. The latch wait time doubles with each successive wait (each time the spin fails to acquire the latch), starting at 5 ms, up to a maximum of 5 sec.
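
To show how quickly that doubling reaches its ceiling, here is a small sketch of the arithmetic only. The 5 ms start and 5 sec cap are the figures quoted above, not values checked against Progress documentation, and `latch_wait_schedule` is a made-up helper name.

```python
def latch_wait_schedule(first_ms=5, cap_ms=5000, waits=12):
    """List the wait (in ms) before each successive retry: doubling, then capped."""
    delay = first_ms
    schedule = []
    for _ in range(waits):
        schedule.append(delay)
        delay = min(delay * 2, cap_ms)  # double, but never exceed the cap
    return schedule


schedule = latch_wait_schedule()
print(schedule)
print(sum(schedule) / 1000.0, "seconds waited in total")
```

With these figures the cap is reached on the 11th wait, so even a long run of retries adds up to seconds, nowhere near a 30 minute lock wait timeout.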

Even a lock wait timeout should not bring the DB down; it only affects the process that requested the lock. By any chance did someone issue a kill -9 against any process? That is one of the main reasons a user dies holding a latch lock.

Arshad
 
Thanks Tom, Arshad.
We scanned through the command history and could not find any KILL -9. We also observed that a lot of user sessions were terminated abnormally. This might be because of network issues.

Could this be the cause of the shutdown? Also, the LOG file indicates that User # 650 died. This user has to be local to the server hosting the DB.

Please provide your valuable inputs.

Thank You,
Vageesh
 
Look for "SIGKILL" in the log file. Network issues should not bring the DB down, as the server runs locally on the box, unless there is some real issue with the box itself.
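
A minimal sketch of that search, in case plain grep is not handy. How the log words a kill signal is not confirmed in this thread, so the match is a case-insensitive substring test; `find_matches` and the second sample line are invented for illustration.

```python
def find_matches(lines, needles=("SIGKILL",)):
    """Return (line_number, text) for every line containing any needle, ignoring case."""
    lowered = [n.lower() for n in needles]
    hits = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(n in text.lower() for n in lowered):
            hits.append((lineno, text))
    return hits


# The first line is quoted from this thread; the second is an invented sample.
sample = [
    "User 650 Died with holding Latch id 2\n",
    "sample line mentioning sigkill\n",
    "Begin Abnormal ShutDown\n",
]
hits = find_matches(sample, needles=("SIGKILL", "Died"))
for lineno, text in hits:
    print(lineno, text)
```

For a real run, pass an open file object for the database log instead of the sample list.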

Arshad
 
Kill -9 is often scripted. "Because it always works" (which, BTW, is not true. It does not always work. It is, however, untrappable which is why it crashes databases.)
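
The "untrappable" part can be seen from any language with signal handling. A small Python sketch, assuming a Unix-like box and a run from the main thread: the OS lets a process install a handler for SIGTERM but refuses one for SIGKILL, so a process hit by kill -9 gets no chance to clean up. `can_install_handler` is a made-up helper name.

```python
import signal


def can_install_handler(signum):
    """Try to install a do-nothing handler; return False if the OS refuses."""
    try:
        previous = signal.signal(signum, lambda signo, frame: None)
    except (OSError, ValueError):
        return False  # the OS rejects handlers for this signal
    if previous is not None:
        signal.signal(signum, previous)  # put the original handler back
    return True


term_ok = can_install_handler(signal.SIGTERM)
kill_ok = can_install_handler(signal.SIGKILL)
print("SIGTERM trappable:", term_ok)
print("SIGKILL trappable:", kill_ok)
```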

999 out of 1,000 "User X died while holding Latch id Y" crashes are due to kill -9 (or an equivalent). The 1 out of 1,000 is due to a bug of some sort.

BTW -- this might help explain why kill -9 is bad: http://dbappraise.com/traps.html
 