TheMadDBA
Guest
Well, in our case (and I suspect yours), the disks were not really the problem. In our example, once one of the adapters failed, all of the work was supposed to go to the other adapters (we had 4). But it failed just enough to still look "valid" to AIX, and every time an IO went to that adapter, large parts of AIX itself hung until that IO timed out and an error was generated. Not just the disk IO, but memory and CPU scheduling as well.

IBM at first blamed EMC and then admitted there was a bug in the failover. The large-scale pauses were supposed to be a one-time thing while the adapter was being marked dead and all of the processes moved over to the other adapters.

There is not much you can do from a Progress or Oracle perspective to know that the OS just went crazy. A process trying to sleep for 2 seconds and ending up taking 4 seconds isn't that weird to me in that situation.
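If you want to see that kind of stall from the outside, one rough approach is to time a sleep yourself and compare it to what you asked for. This is only a sketch in Python, not anything Progress or Oracle provides; the function names and the 0.5 second tolerance are made up for illustration, and a long overrun only tells you the process was not scheduled on time, not why.

```python
import time


def measure_sleep_overrun(requested_seconds, sleep=time.sleep, clock=time.monotonic):
    """Sleep for requested_seconds and return how many extra seconds it took.

    The sleep and clock arguments are hypothetical hooks so the function
    can be exercised without real waiting.
    """
    start = clock()
    sleep(requested_seconds)
    elapsed = clock() - start
    return elapsed - requested_seconds


def is_stalled(overrun_seconds, tolerance_seconds=0.5):
    """Return True when the overrun exceeds an arbitrary, assumed tolerance."""
    return overrun_seconds > tolerance_seconds


if __name__ == "__main__":
    overrun = measure_sleep_overrun(2.0)
    print(f"asked for 2.0s, overran by {overrun:.3f}s, stalled={is_stalled(overrun)}")
```

Run in a loop and logged with timestamps, something like this can at least line the pauses up against the adapter errors in the AIX error log.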