Diagnosing DB crash

Rob Fitzpatrick · Feb 21, 2012

Back end: AIX 7.1
RDBMS: 10.2B02 Enterprise
Brokers: 4GL (primary), SQL

I had a database crash recently. I'm confused by what I see in the DB log. (Not the first time, not the last.)

Here is the relevant bit from the log:

Code:

 SRV     3: (8873)  Login usernum 99, remote SQL client.       
 SRV     3: (7129)  Usr 99 set name to <username>.               
 BROKER  1: (1153)  BROKER detects death of server 13107340.                
 BROKER  1: (8839)  No SQL servers are available.  Try again later.         
 BROKER  0: (2526)  Disconnecting client 92 of dead server 3.               
 APW    34: (453)   Logout by root on /dev/pts/0.                           
 BROKER  0: (5028)  SYSTEM ERROR: Releasing regular latch. latchId: 22      
 BROKER  0: (2522)  User 92 died holding 1 shared memory locks.             
 BIW    32: (2520)  Stopped.                                                
 AIW    30: (2520)  Stopped.                                                
...
other servers and processes stop, and local clients are signalled by the broker...
then the broker begins an abnormal shutdown

The dead server in question was a SQL server (PID 13107340). According to the log, the last thing it did before it died was process a login from a SQL client (Cyberquery). Presumably, it also ran a select. This was six seconds before the 1153 error. From what I understand, the 1153 error by itself should not be fatal unless there was some unrecoverable situation, such as the server holding a latch when it died.
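To make the sequence of events easier to pick out of a longer log, here is a minimal Python sketch that parses lines in the trimmed form shown above and keeps only the messages tied to the dead server. The regular expression assumes the timestamp and process/thread prefix have already been stripped, as in the excerpt. The function names (`parse_line`, `dead_server_events`, `dead_server_pid`) and the list of "interesting" message numbers are my own choices for illustration, not anything supplied by Progress.

```python
import re

# Matches lines shaped like the excerpt: "BROKER  0: (2522)  User 92 died ..."
# Assumption: any timestamp/PID prefix has already been trimmed off.
LINE_RE = re.compile(
    r"^\s*(?P<proc>[A-Za-z]+)\s+(?P<num>\d+):\s+\((?P<msg>\d+)\)\s+(?P<text>.*?)\s*$"
)

# Message numbers from the excerpt that relate to the dead server (illustrative list).
WATCH = (1153, 2526, 5028, 2522)


def parse_line(line):
    """Split one log line into process type, process number, message number, text."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "proc": m["proc"],
        "num": int(m["num"]),
        "msg": int(m["msg"]),
        "text": m["text"],
    }


def dead_server_events(lines, watch=WATCH):
    """Return the parsed entries whose message number is in `watch`, in log order."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e["msg"] in watch]


def dead_server_pid(events):
    """Pull the server PID out of the first 1153 message, if there is one."""
    for e in events:
        if e["msg"] == 1153:
            m = re.search(r"death of server (\d+)", e["text"])
            if m:
                return int(m.group(1))
    return None


SAMPLE = """\
 SRV     3: (8873)  Login usernum 99, remote SQL client.
 SRV     3: (7129)  Usr 99 set name to <username>.
 BROKER  1: (1153)  BROKER detects death of server 13107340.
 BROKER  1: (8839)  No SQL servers are available.  Try again later.
 BROKER  0: (2526)  Disconnecting client 92 of dead server 3.
 APW    34: (453)   Logout by root on /dev/pts/0.
 BROKER  0: (5028)  SYSTEM ERROR: Releasing regular latch. latchId: 22
 BROKER  0: (2522)  User 92 died holding 1 shared memory locks.
 BIW    32: (2520)  Stopped.
 AIW    30: (2520)  Stopped.
"""

if __name__ == "__main__":
    events = dead_server_events(SAMPLE.splitlines())
    print("dead server PID:", dead_server_pid(events))
    for e in events:
        print(e["proc"], e["num"], e["msg"], e["text"])
```

Run against the excerpt, this keeps the 1153, 2526, 5028 and 2522 entries and reports PID 13107340, which is the PID to look for when hunting for a protrace or core file.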

So, questions:

Are latch IDs always the same, from one startup to the next or one DB to the next (within a given version)? In other words, does latchId 22 mean "the _Latch record with _Latch-Id = 22", i.e. MTL_CPQ (checkpoint queue latch)?
If so, does this help me diagnose the crash in any way? My guess is "no".
How is it that user 92, a remote SQL client, held a shared memory lock? Does this actually mean that server 3 held the lock while serving a request from user 92?
Has anyone run into this kind of issue in the past?

bvanmeer · Feb 22, 2012

Hi,

We had a similar situation where a task scheduler started a Progress routine that opened a connection to the SMTP server. During that communication we got an SMTP error that wasn't caught, so Progress kept restarting the session, and after a while we got practically the same error you are experiencing.

We opened a Progress support call about this.

Greetings,

Benny

TomBascom · Feb 22, 2012

Yes, latch IDs stay consistent.

Yes, "held by 92" means server 3 on behalf of user 92.

You either killed the server with "kill -9" or a bug caused it to crash.

Yes, I occasionally see this kind of thing with SQL connections. If you are lucky you might find a core dump or protrace and be able to get something out of that (or send it to TS). There is also some additional logging that can be turned on, but it might produce too much data if this isn't something that happens very often.
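If it helps with the hunt, here is a rough Python sketch for checking a directory for leftovers from the dead server. It assumes protrace files are named `protrace.<pid>` and that core files have names starting with `core`. Which directory to search (typically wherever the server process was running) is something you will have to work out for your site, and the function name `find_crash_files` is just made up for this example.

```python
from pathlib import Path


def find_crash_files(directory, pid):
    """List file names in `directory` that look like a protrace for `pid` or a core file.

    Assumptions: protrace files are named "protrace.<pid>", and core files
    have names beginning with "core". Only the given directory is checked,
    not its subdirectories.
    """
    wanted = f"protrace.{pid}"
    hits = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        if path.name == wanted or path.name.startswith("core"):
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    # 13107340 is the PID of the dead SQL server from the log above;
    # "." stands in for whatever directory the server ran in.
    for name in find_crash_files(".", 13107340):
        print(name)
```

Matching on the PID from the 1153 message keeps protrace files from unrelated processes out of the result.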

Rob Fitzpatrick · Feb 22, 2012

Thanks guys. I'll see if the client has a procore or protrace from that time that we can send to TS. Cheers.
