Question: Urgent, need to find the root cause (RCA)

Mike

Moderator
Server: HP-UX
OpenEdge version: 11.2

We had a situation where all the AI files became full and the AI daemon was in a hung state. We tried to make the full AI files empty, but it did not work, and the database went down. So we rebooted the server and then got a date/time mismatch. The database became corrupt either before or after the reboot; we have no idea which. We are looking for a root cause analysis. Please help. Apologies, the database logs are in Spanish, so I have included only a few log excerpts.

[2023/08/18@09:54:05.052-0400] P-17252 T-1 I AIMGT 16: (3776) Backup the ai extension and mark it as empty.
[2023/08/18@09:54:10.063-0400] P-17252 T-1 I AIMGT 16: (3775) Cannot switch to After-Image extension /chi/rt/ai/ladb2/ladb2 .a4, is full.
[2023/08/18@09:54:10.063-0400] P-17252 T-1 I AIMGT 16: (3776) Backup the ai extension and mark it as empty.
[2023/08/18@09:54:15.072-0400] P-17252 T-1 I AIMGT 16: (3775) Cannot switch to After-Image extension /chi/rt/ai/ladb2/ladb2 .a4, is full.
[2023/08/18@09:54:15.073-0400] P-17252 T-1 I AIMGT 16: (3776) Backup the ai extension and mark it as empty.
[2023/08/18@09:54:15.964-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 17.
[2023/08/18@09:54:15.967-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 18.
[2023/08/18@09:54:15.968-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 20.
[2023/08/18@09:54:15.968-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 21.
[2023/08/18@09:54:15.969-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 22.
[2023/08/18@09:54:15.969-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 23.
[2023/08/18@09:54:15.970-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 25.
[2023/08/18@09:54:15.971-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 27.
[2023/08/18@09:54:15.971-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 29.
[2023/08/18@09:54:15.972-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 30.
[2023/08/18@09:54:15.972-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 31.
[2023/08/18@09:54:15.973-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 32.
[2023/08/18@09:54:15.974-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 34.
[2023/08/18@09:54:15.974-0400] P-17237 T-1 I WDOG 34: (2252) Beginning of transaction rollback.
[2023/08/18@09:54:15.974-0400] P-17237 T-1 I WDOG 34: (2253) Transaction rollback has finished.
[2023/08/18@09:54:15.974-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 35.
[2023/08/18@09:54:15.975-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 36.
[2023/08/18@09:54:15.976-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 37.
[2023/08/18@09:54:15.976-0400] P-17237 T-1 I WDOG 15: (2527) Disconnecting missing user 39.
[2023/08/18@09:54:20.084-0400] P-17252 T-1 I AIMGT 16: (3775) Cannot switch to After-Image extension /chi/rt/ai/ladb2/ladb2 .a4, is full.
[2023/08/18@09:54:20.084-0400] P-17252 T-1 I AIMGT 16: (3776) Backup the ai extension and mark it as empty.
[2023/08/18@09:54:25.094-0400] P-17252 T-1 I AIMGT 16: (3775) Cannot switch to After-Image extension /chi/rt/chl/ai/ladb2/ladb2 .a4, is full.
[2023/08/18@09:54:25.094-0400] P-17252 T-1 I AIMGT 16: (3776) Backup the ai extension and mark it as empty.
[2023/08/18@09:59:43.428-0400] P-11391 T-1 I ABL : (451) Single-user login for root at /dev/pts/1.
[2023/08/18@09:59:43.439-0400] P-11391 T-1 I ABL : (886) ** The database was last used on Fri Aug 18 05:20:03 2023.
[2023/08/18@09:59:43.440-0400] P-11391 T-1 I ABL : (887) ** Previous-image file expected Fri Aug 18 08:48:39 2023.
[2023/08/18@09:59:43.444-0400] P-11391 T-1 I ABL : (888) ** The dates do not match, this indicates that you have an incorrect copy of one of them.
[2023/08/18@09:59:43.445-0400] P-11391 T-1 I ABL : (334) Single-user session end.
[2023/08/18@09:59:45.734-0400] P-17252 T-1 I AIMGT 16: (3775) Cannot switch to After-Image extension /chi/rt/ai/ladb2/ladb2 .a4, is full.
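
To put a number on the mismatch reported in messages (886) and (887), the two timestamps can simply be subtracted. This is only a small sketch: `stamp_gap` is a hypothetical helper name, and the format string assumes the ctime-style dates exactly as they appear in the log.

```python
from datetime import datetime

# Values copied from messages (886) and (887) in the log above.
DB_LAST_USED = "Fri Aug 18 05:20:03 2023"
BI_EXPECTED = "Fri Aug 18 08:48:39 2023"

# ctime-style layout used by those two messages (assumes English day/month names).
CTIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def stamp_gap(db_stamp, bi_stamp):
    """Return how far apart the two timestamps are (BI expectation minus DB stamp)."""
    db_time = datetime.strptime(db_stamp, CTIME_FORMAT)
    bi_time = datetime.strptime(bi_stamp, CTIME_FORMAT)
    return bi_time - db_time


gap = stamp_gap(DB_LAST_USED, BI_EXPECTED)
print(gap)  # 3:28:36
```

The gap is roughly three and a half hours. The sketch says nothing about why the two files disagree; it only quantifies what the messages report.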

6:02.757-0400] P-20592 T-1 I ABL : (451) Single-user login for root at /dev/pts/1.
[2023/08/18@10:36:02.766-0400] P-20592 T-1 I ABL : (5326) Begin physical redo phase at 512 .
[2023/08/18@10:36:03.032-0400] P-20592 T-1 I ABL : (16793) BEGIN RL Control Structure Dump
[2023/08/18@10:36:03.032-0400] P-20592 T-1 I ABL : (16794) Cluster Data | Bi Blocksize: 16384
[2023/08/18@10:36:03.032-0400] P-20592 T-1 I ABL : (16795) rlcurr: 512 rlclused: 0 curBlk: 0 curOfst: 0 nxtBlk: 0 nxtOfst: 0
[2023/08/18@10:36:03.032-0400] P-20592 T-1 I ABL : (16796) rlctr: 1934 rlsize: 2560 left: 1536 right: 1024 rlcap: 8376320 bytes: 8388608 lastblk: 0 lastOfst: 0
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16797) Recovery Data
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16798) TranTableSize: 172 XIDTableSize: 100
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16799) RedoBlk: 939 RedoOfst: 3862 ReadBlk: 0 ReadOfst: 0 PrevBlk: 0 blkLog: 14
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16800) Cluster Timing
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16801) Sys: 1692369362 base: 1659785124 clst: 32584238
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16802) Dependency Control
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16803) depend: 0 written: 0
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16803) depend: 3 written: 0
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16790) START RL Bi Buffers 3
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (16791) Bi Buffer: 0 state 4 dpend: 0 sdpend: 0 dbkey: -1
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0000: 0004 0000 0000 0000 000a 0000 0062 09c8
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0010: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0020: 000a 0000 0062 09c8 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0030: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0040: 000a 0000 006c 10a8 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0050: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0060: 0000 0000 0000 0000 ffff ffff ffff ffff
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0070: 0000 0003 ffff ffff 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0080: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0090: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 00a0: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 00b0: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 00c0: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 00d0: 0002 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 00e0: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 00f0: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0100: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0110: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0000: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0010: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0020: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0030: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0040: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0050: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0060: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.033-0400] P-20592 T-1 I ABL : (-----) 0070: 0000 0000 0000 0000 0000 0000 0000 0000
[2023/08/18@10:36:03.
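
The Cluster Timing line (16801) can be sanity-checked by treating Sys and base as Unix epoch seconds. That reading is an assumption, not something the log states. The sketch below uses a hypothetical helper, `as_log_time`, and the -0400 offset taken from the log lines.

```python
from datetime import datetime, timedelta, timezone

# Values copied from message (16801) above.
SYS, BASE, CLST = 1692369362, 1659785124, 32584238

# The log lines carry a -0400 offset, so render times in that offset.
LOG_OFFSET = timezone(timedelta(hours=-4))


def as_log_time(epoch_seconds):
    """Format epoch seconds in the same date@time style the .lg file uses."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=LOG_OFFSET)
    return moment.strftime("%Y/%m/%d@%H:%M:%S")


print(as_log_time(SYS))    # 2023/08/18@10:36:02
print(as_log_time(BASE))
print(SYS - BASE == CLST)  # True
```

Under that assumption, Sys falls on the same second as the single-user login at 10:36:02, and clst is simply Sys minus base.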
 

TomBascom

Curmudgeon
There is nowhere near enough information here to say why you experienced what you experienced. You will need to provide substantially more of the log file. Go back at least to the FIRST error and provide everything from that point forward.

You also need to provide the details behind what commands you actually ran. Not vagueness like "tried to make empty" but the actual rfutil commands that were executed along with the actual output thereof. Along with everything that you really did to get from a crashed database to the point where you are seeing a corrupt block. All of the commands that were executed are potentially relevant.

Lastly... 11.2 is ancient, obsolete and pretty much unsupported. On the bright side upgrading to 11.7.latest should be trivial.
 

TomBascom

Curmudgeon
You have incompletely answered a small part of what I asked for. I cannot help you if you are not going to cooperate. I am not asking you these things out of malice - we need to know more in order to figure out what is going on.

What was the FIRST error that led down this path? Specifically what was the error that led to WDOG disconnecting users? That didn't just magically start happening because you had FULL AI extents. Something happened earlier.

What commands were you using to attempt to mark after-image extents as empty?

Were any commands run prior to rebooting?

What commands did you run to recover from the reboot?

What was the command line that you used for the single user sessions that were started?

In all cases - what was the actual output of the commands?

How is it that the allegedly "hung" AIMGT daemon is running while you are making single user connections? Is the DB on a shared filesystem and somehow being simultaneously accessed from two different servers? (That would certainly be a recipe for corruption.)

And while we are on the topic - why do you say that AIMGT is "hung"? What does that mean and how can you tell?

Below it looks like you got things back on track at around 10:20. What did you do?

Code:
[2023/08/18@10:20:36.884-0400] P-5761       T-1     I AIMGT  16: (3777)  Switched to ai extent /sar/chl/ai/ladb2/ladb2.a1.
[2023/08/18@10:20:36.884-0400] P-5761       T-1     I AIMGT  16: (3778)  This is after-image file number 28877 since the last AIMAGE BEGIN
[2023/08/18@10:20:38.235-0400] P-17252      T-1     I AIMGT  16: (3775)  Cannot switch to After-Image extent /sar/chl/ai/ladb2/ladb2.a4, it is full.
[2023/08/18@10:20:38.235-0400] P-17252      T-1     I AIMGT  16: (3776)  Back up the ai extent and mark it as empty.
[2023/08/18@10:20:41.892-0400] P-5761       T-1     I AIMGT  16: (13199) After-image extent /sar/chl/ai/ladb2/ladb2.a4 has been copied to /backup/aiarch/chl/sar~chl~db~ladb2~ladb2.20200307.133041.00028876.ladb2.a4.
[2023/08/18@10:20:41.893-0400] P-5761       T-1     I AIMGT  16: (3789)  Marked after-image extent /sar/chl/ai/ladb2/ladb2.a4 as EMPTY.
[2023/08/18@10:20:43.247-0400] P-17252      T-1     I AIMGT  16: (3777)  Switched to ai extent /sar/chl/ai/ladb2/ladb2.a4.
[2023/08/18@10:20:43.247-0400] P-17252      T-1     I AIMGT  16: (3778)  This is after-image file number 28876 since the last AIMAGE BEGIN
[2023/08/18@10:20:48.254-0400] P-17252      T-1     I AIMGT  16: (13231) From this point forward all after-image extents will be archived to /backup/aiarch/chl.
[2023/08/18@10:20:48.255-0400] P-17252      T-1     I AIMGT  16: (13199) After-image extent /sar/chl/ai/ladb2/ladb2.a3 has been copied to /backup/aiarch/chl/sar~chl~db~ladb2~ladb2.19691231.200000.00000000.ladb2.a3.
[2023/08/18@10:20:48.257-0400] P-17252      T-1     I AIMGT  16: (3789)  Marked after-image extent /sar/chl/ai/ladb2/ladb2.a3 as EMPTY.

                Fri Aug 18 10:36:02 2023
[2023/08/18@10:36:02.757-0400] P-20592      T-1     I ABL      : (451)   Single-user login for root at /dev/pts/1.
[2023/08/18@10:36:02.766-0400] P-20592      T-1     I ABL      : (5326)  Begin physical redo phase at 512 .
[2023/08/18@10:36:03.032-0400] P-20592      T-1     I ABL      : (16793) BEGIN RL Control Structure Dump 
[2023/08/18@10:36:03.032-0400] P-20592      T-1     I ABL      : (16794) Cluster Data | Bi Blocksize: 16384 
[2023/08/18@10:36:03.032-0400] P-20592      T-1     I ABL      : (16795) rlcurr: 512 rlclused: 0 curBlk: 0 curOfst: 0 nxtBlk: 0 nxtOfst: 0 
[2023/08/18@10:36:03.032-0400] P-20592      T-1     I ABL      : (16796) rlctr: 1934 rlsize: 2560 left: 1536 right: 1024 rlcap: 8376320 bytes: 8388608 lastblk: 0 lastOfst: 0 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16797) Recovery Data 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16798) TranTableSize: 172 XIDTableSize: 100 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16799) RedoBlk: 939 RedoOfst: 3862 ReadBlk: 0 ReadOfst: 0 PrevBlk: 0 blkLog: 14 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16800) Cluster Timing 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16801) Sys: 1692369362 base: 1659785124 clst: 32584238 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16802) Dependency Control 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16803) depend: 0 written: 0 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16803) depend: 3 written: 0 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16790) START RL Bi Buffers 3 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (16791) Bi Buffer: 0 state 4 dpend: 0 sdpend: 0 dbkey: -1 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (-----) 0000:  0004 0000 0000 0000 000a 0000 0062 09c8 
[2023/08/18@10:36:03.033-0400] P-20592      T-1     I ABL      : (-----) 0010:  0000 0000 0000 0000 0000 0000 0000 0000 
. . .
[2023/08/18@14:36:03.000+0000] P-20592      T-1     I ABL      : (49)    SYSTEM ERROR: Memory violation.
[2023/08/18@14:36:28.000+0000] P-20592      T-1     I ABL      : (439)   ** Save file named core for analysis by Progress Software Corporation.

Have you reached out to Progress Tech Support as suggested at 14:36?
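
A quick way to cross-check which processes were writing to the log is to group PIDs by process type. The following is only a rough sketch: the regular expression is an assumption based on the line layout seen in this thread, and `pids_by_process` is a made-up helper name. Run against line prefixes from the excerpt above, it shows AIMGT lines coming from two PIDs (5761 and 17252).

```python
import re
from collections import defaultdict

# Prefix of a database .lg line as it appears in this thread:
# [timestamp] P-<pid> T-<thread> <level> <process type> ...
LG_PREFIX = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\s+P-(?P<pid>\d+)\s+T-(?P<tid>-?\d+)"
    r"\s+(?P<level>\S)\s+(?P<proc>\S+)"
)


def pids_by_process(lines):
    """Map each process type (AIMGT, WDOG, ABL, ...) to the PIDs that logged under it."""
    seen = defaultdict(set)
    for line in lines:
        match = LG_PREFIX.match(line.strip())
        if match:  # lines without the prefix, such as date headers, are skipped
            seen[match.group("proc")].add(int(match.group("pid")))
    return dict(seen)


# Line prefixes copied from the excerpt above; the message text is omitted.
sample = [
    "[2023/08/18@10:20:36.884-0400] P-5761       T-1     I AIMGT  16: (3777)",
    "[2023/08/18@10:20:38.235-0400] P-17252      T-1     I AIMGT  16: (3775)",
    "[2023/08/18@10:20:41.892-0400] P-5761       T-1     I AIMGT  16: (13199)",
    "[2023/08/18@10:20:43.247-0400] P-17252      T-1     I AIMGT  16: (3777)",
    "                Fri Aug 18 10:36:02 2023",
    "[2023/08/18@10:36:02.757-0400] P-20592      T-1     I ABL      : (451)",
]

for proc, pids in sorted(pids_by_process(sample).items()):
    print(proc, sorted(pids))
# ABL [20592]
# AIMGT [5761, 17252]
```

Feeding the complete .lg file through the same function would show whether any other process type appears under more than one PID.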
 

Mike

Moderator
This is the first error we see in the log: WDOG. Please find it attached.
On your second question:

What commands were you using to attempt to mark after-image extents as empty?
Answer: We tried to copy the full extent and mark it empty, but nothing worked.

Were any commands run prior to rebooting?
Answer: We tried to connect in mpro mode as well as in pro single-user mode. Nothing worked.

What commands did you run to recover from the reboot?
Answer: All the processes looked hung. No command.

What was the command line that you used for the single user sessions that were started?
Answer: We used pro db name and it showed a date/time mismatch, wrong version.

In all cases - what was the actual output of the commands?
Mismatch and wrong version.

Below it looks like you got things back on track at around 10:20. What did you do?
We rebooted the server because nothing was working. Got this from KB
Knowledge Article S000064261 000064261 64261 Database-fails-to-complete-crash-reco... Page 1 o

How is it that the allegedly "hung" AIMGT daemon is running while you are making single user connections? Is the DB on a shared filesystem and somehow being simultaneously accessed from two different servers? (That would certainly be a recipe for corruption.)
We received shared memory error 1260:
Shared memory in use by another process. (1260)

PROGRESS could not access shared memory because it was allocated by another process. Investigate shared memory usage by all processes and determine whether more resource is necessary!

And while we are on the topic - why do you say that AIMGT is "hung"? What does that mean and how can you tell?
Nothing was working. There was no .lk file.
 

TomBascom

Curmudgeon
This is the first error we see in the log: WDOG. Please find it attached.

That is an incomplete log file. Are you rotating logs? Do you have the previous log that ended just before this one starts?

If you do not have any older logs then we will not be able to determine a "root cause".

On your second question:

What commands were you using to attempt to mark after-image extents as empty?
Answer: We tried to copy the full extent and mark it empty, but nothing worked.

That is not a command. That is a vague description of some steps that you took. It is useless for determining what went wrong.

Were any commands run prior to rebooting?
Answer: We tried to connect in mpro mode as well as in pro single-user mode. Nothing worked.

Again - show your work. What did you actually run? What was the actual output of the actual commands?

What commands did you run to recover from the reboot?
Answer: All the processes looked hung. No command.

Still no command being shared.

What does "looked hung" mean? How do you know?

What was the command line that you used for the single user sessions that were started?
Answer: We used pro db name and it showed a date/time mismatch, wrong version.

What did you type? "pro db name" is not a legal command. "pro mfgprod" might be if your database is named "mfgprod". But, at this point, I really doubt that I can believe anything you say.

In all cases - what was the actual output of the commands?
Mismatch and wrong version.

The actual error messages are necessary. Not a paraphrased version of your memory of something that happened last week.

Below it looks like you got things back on track at around 10:20. What did you do?
We rebooted the server because nothing was working.

What were all of these "nothings" that weren't working?

Got this from KB
Knowledge Article S000064261 000064261 64261 Database-fails-to-complete-crash-reco... Page 1 o

If that is supposed to be a link to a kbase article it isn't working. Searching on "000064261" or "Database-fails-to-complete-crash-reco" doesn't find anything either. So I continue to have no idea what you might have done.

How is it that the allegedly "hung" AIMGT daemon is running while you are making single user connections? Is the DB on a shared filesystem and somehow being simultaneously accessed from two different servers? (That would certainly be a recipe for corruption.)

You didn't actually address the questions above.

Is the following bit a confession that the database is on a shared filesystem and that people on two different servers are simultaneously running Progress commands?

We received shared memory error 1260
Shared memory in use by another process. (1260)

PROGRESS could not access shared memory because it was allocated by another process. Investigate shared memory usage by all processes and determine whether more resource is necessary!

Where is this message in the .lg file? I do not see it. You seem to have copied it from somewhere so I can only conclude that you have access to logs that you have not shared.

I will go so far as to speculate that what you did in reaction to this message may well have caused a lot of the other symptoms that you reported. That is not a message to be taken lightly.

And while we are on the topic - why do you say that AIMGT is "hung"? What does that mean and how can you tell?
Nothing was working.

You are not showing that you tried anything so, in a sense, "nothing" is what you were doing?

It seems like this should be obvious but we cannot tell you what went wrong if we do not know what you did. "Nothing" is clearly not correct.

There was no .lk file.

"There was no .lk file" is not a symptom of anything being "hung". It _is_ a symptom of the database being down. Possibly because it was shut down normally with proshut. Nothing in the .lg that you have shared tells us if it was a normal shutdown or a crash or something else. And, so far, there is no evidence to support the assertion that anything was "hung".
 

TomBascom

Curmudgeon
<Pure speculation follows>

HPUX is really old and I somehow doubt that your server is brand new. Back in the good old days I dealt with several situations on HPUX where a filesystem would "hang". Symptoms were very clear - any command, even something simple like "ls /db/mfgprod.db" would simply hang after the <enter> and never respond. (Fun fact: attempting to kill the hung command with kill -9 would also fail. Sometimes the kill command itself would hang while trying to kill the hung process.) This _could_ potentially lead to something like the AIMGT daemon being unable to archive FULL AI extents if, for instance, the aiarchive directory would be the one to suffer from this. Many times the underlying cause was the disk adapter and you could find errors to support that analysis in "syslog". If I recall, the path is /var/adm/syslog or maybe /usr/adm/syslog.

Another possible situation similar to a hanging filesystem would be if the aiarchive directory is an NFS mount. Even if it is only briefly unavailable the AIMGT daemon will not try again. So you get a situation where nothing is being archived. It isn't "hung", it just isn't doing anything useful (in that case).

I have no concrete reason to think that any of that actually applies to your situation. But maybe if you look in the syslog you might find something that indicates how it all started.
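
One way to test for the kind of filesystem hang described above without tying up a terminal is to probe the directory with a timeout. What follows is a minimal sketch, assuming Python is available on the server; `probe_directory` and the 5 second timeout are hypothetical choices, and /backup/aiarch/chl is simply the archive path that appears in the log excerpts earlier in the thread.

```python
import subprocess


def probe_directory(path, timeout_s=5.0):
    """Run `ls` on a directory and report 'ok', 'error' or 'timeout'.

    The child is deliberately not waited for after a timeout: a process
    stuck on a hung filesystem may ignore the kill, and waiting for it
    would hang this script as well.
    """
    proc = subprocess.Popen(
        ["ls", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return_code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # best effort only
        return "timeout"
    return "ok" if return_code == 0 else "error"


# Example: probe the current directory. On the affected server the
# interesting targets would be the database, AI and archive directories,
# for instance probe_directory("/backup/aiarch/chl").
print(probe_directory("."))
```

A "timeout" result would be consistent with a hung local or NFS filesystem, but it is not proof; the syslog remains the place to look for the underlying cause.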
 