Database Corruption

mattk

Member
Hi there,

We are operating a Progress 9.1D database on a Windows Server 2003 platform, and we currently have AI replicating into a warm standby box on a separate site every 15 minutes.
We recently had some corruption in our master database, and unfortunately when we came to fail over, our target was unavailable.
If we experience corruption in our live database which could potentially lie dormant for a while, would after-imaging write that across to the target db, or would it identify and flag an issue?
If it doesn't identify it, do any of the later OpenEdge versions have better integrity checks on data when replicating?

Any help would be much appreciated.

Thanks
 
9.1D is, of course, ancient, obsolete and unsupported. It also had some nasty AI-related bugs.

A lot depends on what, exactly, you mean by "corruption".

There are some conditions which AI will propagate but they are generally more in the nature of "logical corruption".

Newer versions are of course more robust. In part in order to support products like OpenEdge Replication where any bugs become much more apparent. (OE Replication is built on the AI logging mechanism, so making it work well means improving the foundational AI logging subsystem.)
 
Hi Tom,

The errors we received are pretty much summed up below. Although we have nothing to confirm it, we believe this could have been caused by the SAN on which the databases sit.

SYSTEM ERROR: <function>: Bad file descriptor was used during <system call>, fd <file descriptor>, len <bytes>, offset <bytes>, file <file-name>. (9446)
· SYSTEM ERROR: bkioRead: Bad file descriptor was used during Read, fd <num> , len <num>, offset 308032, file <full pathname> <database extent> . (9446)
· SYSTEM ERROR: read wrong dbkey at offset <offset> in file <file> found <dbkey>, expected <dbkey>, retrying. area <number> (9445)
· SYSTEM ERROR: read wrong dbkey at offset 1346539520 in file <full pathname> <database extent> found 0, expected 21039680, retrying. area 7 (9445)
· Corrupt block detected when reading from database. (4229)
· <func-name>: Error occurred in area <num>, block number: <num>, extent<name>: . (10560)
· bkRead: Error occurred in area 7, block number: 328744, extent: . (10560)
· Writing block <num> to log file. Please save and send the log file to Progress Software Corp. for investigation. (10561)
· Writing block 328744 to log file. Please save and send the log file to Progress Software Corp. for investigation. (10561)
· SYSTEM ERROR: Wrong dbkey in block. Found <dbkey>, should be <dbkey2> in area <num>. (1124)
· SRV 5: SYSTEM ERROR: wrong dbkey in block. Found 0, should be 21039680 in area 7 (1124)
· Begin ABNORMAL shutdown code (2249)
· BROKER 0: Begin ABNORMAL shutdown code 2 (2249)

This all occurred when one of our spawned processes reported that it couldn't access the drive on which the databases sit. From there, all of the above errors occurred in one form or another. Eventually the databases shut down, and although they did restart, they would fall over very quickly once users were accessing them.
 
1124 errors are very serious and do indeed signal "physical" corruption.
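If it helps, a common first step in scoping this kind of damage is a database analysis report. A sketch, assuming a database named `mydb` (a placeholder) and an offline or quiet window to run it in:

```
# Hedged sketch: scope the damage with a database analysis report.
# "mydb" is a placeholder name; run against a copy if possible.
proutil mydb -C dbanalys > mydb.dbanalys.txt
```

Reviewing the report for the affected area (area 7 here) helps decide between attempting repair and restoring a known-good backup.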

As to the root cause... it certainly could have been something about the SAN. But it's hard to say at this point. What sort of SAN is it and how are the disks being presented to Windows?

Area 7 is the "Schema Area". Is all of your data in the schema area? That has nothing to do with corruption per se but it is a bad configuration for other reasons.
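For reference, the usual fix for user data living in the Schema Area is to define dedicated storage areas in a structure file and dump/load the data into them. A minimal, illustrative sketch (area names, numbers, paths, and sizes are assumptions, not your actual layout):

```
# mydb.st -- illustrative structure file fragment (9.1D-era syntax).
# Keep user data out of the Schema Area by giving it its own areas.
#
# A dedicated data area with a fixed extent plus one variable extent:
d "Data":8,64 /db/mydb_8.d1 f 1048576
d "Data":8,64 /db/mydb_8.d2
#
# A dedicated index area:
d "Index":9,8 /db/mydb_9.d1 f 1048576
d "Index":9,8 /db/mydb_9.d2
```

Fixed extent sizes are in KB; the records-per-block values shown are just examples and should be tuned to your record sizes.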

How many extents does the database have? I ask because that "bad file descriptor" error could also be related to having too many file handles open, which could possibly cause corruption too.

Are you running a virus scanner on your db server? That's another potential source of badness. Non-Progress backup programs also do bad things (like locking files) to Progress databases.
 
Thanks Tom. There are currently 17 fixed (2 GB) extents and 1 variable extent in the data area, and 5 fixed (2 GB) extents and 1 variable extent in the index area.
The virus scanner is switched off on the server, so that shouldn't have been a problem, and the systems team have confirmed that no backup was running at the time.
To me it just looks like the processes (we have a lot of reports that run, and updates) lost connection to the database drive for a while, and when it reappeared the corruption occurred.
If this is physical corruption, am I right in thinking that it would not be written by AI? I.e. when a new data item is indexed on the master, would the actual data item and index be passed in the AI file, or simply the data item and the instruction to create an index?
 
The number of extents is way below any worrisome value so that's good.

1124 errors should not be propagated by AI.
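One practical consequence: since physical block damage like a 1124 shouldn't replay through AI, the usual recovery path is restoring a known-good backup and rolling the archived AI extents forward against it. A sketch with placeholder names:

```
# Hedged sketch: rebuild from a verified backup, then apply archived
# AI files in order. "target" and the file paths are placeholders.
prorest target /backups/master.pbk

# Apply each after-image file in sequence:
rfutil target -C roll forward -a /ai/master.a1
rfutil target -C roll forward -a /ai/master.a2
```

The AI files must be applied in the order they were generated, and the backup must predate the first AI file being applied.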

The SAN situation worries me. The described behavior shouldn't happen.

I would also (of course) be upgrading to a supported and current release of Progress.
 
That's great Tom, thanks for your help. We are looking at upgrading and at improving our AI solution to safeguard against these issues as much as possible.

Thanks again,

Matt
 