Archiving AI's to NFS mount storage causing system hang?

AS86 · Apr 22, 2022

Hi.

We have a Solaris system, running OE 11.7.10.

We have had a few occurrences in the past few months where the system has "hung up", database allowing no writes/reads.
Extents all ok, BI ok, AI's all set to variable length and plenty of AI extents available and empty.

At the point where the system beings unresponsive, AI management halts on the effected database.
It does not perform a switch after the normal 5minutes.
Current AI setup is archiving every 5minutes directly to a NFS mount storage, the exact same AI setup is in place on the other 6 databases that the system uses.

Checked log files etc, there's nothing in there that suggests anything strange, no mention of an AI stall etc etc which is how the behavior seems.

Only "fix" for the issue is to force close the effected database and then restart it, as soon as its restarted AI continues and switches every 5mins as usual.

Could archiving AI's directly to the NFS mount be causing this at all? Maybe at the exact second it goes to switch/archive it cannot write to the mount for some reason?

Thanks for any response.

TomBascom · Apr 22, 2022

Archiving ai logs to NFS is known to be troublesome. If NFS glitches and the archiver notices then the target filesystem is marked as unavailable. There is no check to see if it ever becomes available again. So the only solution is to restart the AI archiver. One way to do that is to restart the database. Another way is to run an online backup with appropriate arguments. (You could send the probkup output to /dev/null if you want to speed things up a bit.)

Having said that - I do not know of a case where this would also freeze the db. (Presuming that only the archived logs are on NFS.)

Rob Fitzpatrick · Apr 22, 2022

AS86 said:
Could archiving AI's directly to the NFS mount be causing this at all?

Absolutely. I ran into the same situation on AIX on OE 10.2B.

Archiving directly to a DR box, in theory, sounds like a great idea. When it works, it ensures your archived AI extents are away from production as soon as possible, giving you a good recovery position. However as you have seen, it doesn't always work. If the file system is unavailable for any reason (remote system goes down, network issue, NFS mount goes stale) the archiving stops and it doesn't recover gracefully.

Best practice is to write to a local AI archive directory and then copy or replicate the extents to your DR box from there.

Rob Fitzpatrick · Apr 22, 2022

TomBascom said:
One way to do that is to restart the database.

This might depend on OE version. In my case, the AIMD was unresponsive and I couldn't terminate it. Bouncing the DB was my only recourse.

AS86 · Apr 22, 2022

Thanks for your replies.

Yeah when this happens we cannot stop the aimgt process, it just sits there with a PID 1 even if we attempt to kill -9 it (i know we shouldn't)
It does seem like the aigmt process is holding a latch of some kind on the db, and every users process then halts waiting for it to release the latch, which it obviously doesn't.
The only solution we've found is a database restart.

It looks like a permanent solution is going to be going back to locally stored AI's then, and scripting something to move them every 5 mins to the NFS mount.

Archiving AI's to NFS mount storage causing system hang?

AS86

New Member

TomBascom

Curmudgeon

Rob Fitzpatrick

ProgressTalk.com Sponsor

Rob Fitzpatrick

ProgressTalk.com Sponsor

AS86

New Member