Database crash

philipped

New Member
Hi guys!

Let me introduce myself. My name is Philippe and I work for a bath manufacturer near Quebec, Canada. Last summer we lost the guy who was in charge of the Progress database administration. We are using SyteLine 5 ERP.

So I took the flag he left and did my best to understand how is the RDBMS working. I started by configuring after-imaging archiving. Actually, i'm working on a replication system using AI.

Regarding the database administration itself, I never had to face a critical incident... before yesterday :mad:

I'm running Progress 9.1D0980, wich seem unsupported and obsolete on a Windows 2003 server. We are using AI since late november and it is working quite well.

Each night I'm doing a full online backup. I think that's were the fun began sunday. This morning, I received a call from someone on the production floor telling me that no Progress screen were working. After loging into the network, I found that a database was down. The database started without any problem. Wouhou!

But that does'nt meen I do not want to undestand what happened and to try to prevent the problem to come again. Here is the log file of the poor database.

Code:
00:44:07 BACKUP 27: Full backup started. (1362)
00:44:07 BACKUP 27: 
2427475 active blocks out of 2427550 blocks in d:\db\ultrax will be dumped. (6686)
00:44:07 BACKUP 27: 384 bi blocks will be dumped. (6688)
00:44:07 BACKUP 27: Backup requires an estimated 1.3   GBytes of media. (9285)
00:44:07 BACKUP 27: Restore would require an estimated 335887 db blocks using 9.3   GBytes of media. (9286)
00:44:07 BACKUP 27: Switched to ai extent F:\AI\ultrax.a3. (3777)
00:44:07 BACKUP 27: This is after-image file number 3023 since the last AIMAGE BEGIN (3778)
00:44:09 BACKUP 27: Begin backup of Before Image file(s). (5459)
00:44:09 BACKUP 27: End backup of Before Image file(s). (5460)
00:44:09 BACKUP 27: Begin backup of Data file(s). (5461)
01:00:08 BACKUP 27: End backup of Data file(s). (5462)
01:00:08 BACKUP 27: Wrote a total of 71565 backup blocks using 9.3   GBytes of media. (9284)
01:00:10 BACKUP 27: Full backup successfully completed. (1364)
01:00:11 RFUTIL 28: Login by administrator on CON:. (452)
01:00:12 RFUTIL 28: Switched to ai extent F:\AI\ultrax.a4. (3777)
01:00:12 RFUTIL 28: This is after-image file number 3024 since the last AIMAGE BEGIN (3778)
01:00:12 RFUTIL 28: Logout by administrator on CON:. (453)
01:00:13 RFUTIL 27: Login by administrator on CON:. (452)
01:00:13 RFUTIL 27: Logout by administrator on CON:. (453)
01:00:17 RFUTIL 27: Login by administrator on CON:. (452)
01:00:17 RFUTIL 27: Logout by administrator on CON:. (453)
01:00:17 RFUTIL 27: Login by administrator on CON:. (452)
01:00:17 RFUTIL 27: Unable to truncate extent header, file = F:\AI\ultrax.a2, ret = -30006. (9377)
01:00:17 RFUTIL 27: SYSTEM ERROR: Releasing regular latch. latchId: 11 (5028)
01:00:17 RFUTIL 27: User 27 died holding 1 shared memory locks. (5026)
01:00:18 APW    24: Stopped. (2520)
01:00:18 APW    23: Stopped. (2520)
01:00:18 APW    26: Stopped. (2520)
01:00:18 APW    25: Stopped. (2520)
01:00:19 BIW    21: Stopped. (2520)
01:00:20 SRV     2: Stopped. (2520)
01:00:20 SRV    17: Stopped. (2520)
01:00:20 SRV     1: Stopped. (2520)
01:00:20 SRV    13: Stopped. (2520)
01:00:20 SRV     5: Stopped. (2520)
01:00:20 SRV    16: Stopped. (2520)
01:00:21 SRV     9: Stopped. (2520)
01:00:21 SRV    19: Stopped. (2520)
01:00:21 SRV     3: Stopped. (2520)
01:00:21 SRV    20: Stopped. (2520)
01:00:21 BROKER  0: Begin ABNORMAL shutdown code 2 (2249)
01:00:21 SRV    15: Stopped. (2520)
01:00:21 SRV     6: Stopped. (2520)
01:00:20 SRV     7: Stopped. (2520)
01:00:21 SRV    11: Stopped. (2520)
01:00:21 SRV    12: Stopped. (2520)
01:00:21 SRV     8: Stopped. (2520)
01:00:21 SRV    18: Stopped. (2520)
01:00:22 SRV     4: Stopped. (2520)
01:00:22 SRV    14: Stopped. (2520)
01:00:23 SRV    10: Stopped. (2520)
01:00:24 WDOG   22: Stopped. (2520)
01:00:36 BROKER  0: Disconnecting dead user 27. (2527)
01:00:38 BROKER   : Multi-user session end. (334)

ProKB wasn't of any help... Can someone help?

Thanks!
 

TomBascom

Curmudgeon
It went wrong here:

Code:
01:00:17 RFUTIL 27: Unable to truncate extent header, file = F:\AI\ultrax.a2, ret = -30006. (9377)

But, as you say, there's nothing helpful about that in the kbase.

Since it is Windows my first guess is that an OS level thing, like a virus scan or a backup or the file indexer or some other bit of windows malware, locked the file at a bad moment.

My second observation would be that 9.1d (which is ancient, obsolete and unsupported) is known to have some unspecified but oddball bugs WRT to ai.

In any event "a bad thing" happened while attempting to truncate that .a2 extent. rfutil crashed without cleaning up its latches (that's rude). That may have occurred because the error was unrecoverable and the OS didn't allow rfutil to handle it -- but that seems unlikely since rfutil was able to write a message to the .lg file. That makes it more likely a bug -- rfutil should have done better than to simply give up and die. IMHO. Ultimately the db engine noticed that a latch was being held by a process that no longer exists and shut itself down. It has to shutdown in that circumstance (unreleased latches) because it has no way to reconstruct what the crashed process was doing under the aegis of the latch -- all of the information that could have potentially been used to recover was local to the crashed process (in this case the "rfutil"). The shutdown is "abnormal" but controlled. On restart you go through crash recovery and any active transactions are rolled back.
 

philipped

New Member
Tom,

Many thanks for your reply.

I think I will handle this as an isolated incident since it appears that it will be very hard to find exactly what happened.

The incident showed us that the event notification system we have is not perfect and need improvement : the database crashed sunday and we found monday morning that the database was down. We'll start by imrpoving the monitoring of our database.

Thanks again Tom.
 
Top