Zombies are eating my brain and the watchdog won't help

dp74

New Member
Had to reboot our ancient, rusty 8.3A QAD database (running on HP-UX 11i) today due to a nasty issue relating to users who were either kicked by an automated proshut script (it kicks users after about 10 minutes of holding a transaction open, to protect the BI file from hitting 1G and stalling the DB) or who cancelled and restarted their Telnet sessions: for some reason, the watchdog process was not cleaning these up, and their locks and transactions persisted, preventing other users from doing stuff.

We eventually cleared the tx by killing the users' unix sessions with kill -9 (after regular kill failed), which at first seemed to allow other users to process tx. But soon it was reported that others were still blocked, and we noticed about 100 extra users -- another 100 "zombie" sessions that were not being cleaned up by the watchdog (bad watchdog! BAD!).
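
For context, the kick script does something along these lines -- the db path and usr# are illustrative, not our real ones, and I'm assuming proshut's -C options behave in 8.3A the same as documented:

    proshut /db/qaddb -C list              # list connected users and their usr#
    proshut /db/qaddb -C disconnect 42     # ask the broker to disconnect usr# 42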

So, my understanding is that the watchdog is supposed to clean up sessions where the user disconnects, whether the disconnect happened through proshut or otherwise, but that didn't happen. What can prevent the watchdog from doing so? Is there an OS component to it (or maybe we should move up the TCP_KEEPALIVE timer to help push things along)? Or is this just an 8.3A bug? Would it help to kill and restart the WDOG when this happens?

(AFAIK, this is the first time this issue has happened, in about 10 years of running the DB. We've added a lot more users lately, though.)
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
First, I can't tell you anything about 8.3x specifically as I've never touched it.

That said... I would say that your automated proshuts could potentially be more of a "nasty issue" than anything they were intended to prevent. Long-running transactions can be bad, and can blow up your BI as you said. However, unconditionally killing users in long transactions is not something I would advocate. What if they're running a recently-deployed batch update job whose developer messed up the transaction scope? Stuff happens, even to good developers. Now you kill the user's 12-minute update job at the 10-minute mark, they curse the "unreliable" application, and guess what? They run it again.

A user in a long-running transaction can definitely be a precursor of trouble ahead, and could certainly indicate an application issue, but there are valid cases where it may be by design. So you certainly want to detect these situations early and look at them carefully and without delay, but I would say the remediation decisions should be made by a human who understands what is going on and why, rather than by a shell script with (probably) minimal business logic.

Also, killing a client session with -9 is a Bad Thing. Just ask this guy :). If the (self-service) user is holding a latch, killing him could crash the database. I assume they are self-service, as you mentioned the watchdog.

Regarding your watchdog, he may be misunderstood rather than bad. Put away the rolled-up newspaper. He cleans up after missing processes, i.e. the ones the DB knew about that no longer exist as OS processes. Zombie processes are still processes, so in the WDOG's eyes they aren't missing and there is nothing to do about them.
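
If you want to see that distinction on the box, something like this should do it (the PID is hypothetical, and ps column positions vary by platform):

    # A missing process is gone from the process table entirely - WDOG territory:
    ps -p 12345 || echo "gone from the process table"
    # A zombie is still *in* the table (state Z), so the WDOG considers it alive:
    ps -el | awk '$2 == "Z"'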

You noted that "regular kill failed" when killing sessions. Did the kill really fail, or did it just not complete yet when you decided to kill it with -9? If it was in a transaction when you killed it the first time, it might have been in the midst of rolling it back, which could explain its continued life. Or it could have been paged out by the OS and you were waiting on I/O for it to be paged back in. Or its normal processing could have been stalled by CPU starvation (though presumably you would have seen this). What did you see in the DB log after the first kill?
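
If it happens again, I'd watch both the log and the process itself before escalating to -9. A sketch, with a made-up db path and PID (run as root, or kill -0 will fail on permissions rather than absence):

    tail -f /db/qaddb.lg        # watch for the client's logout/rollback messages
    kill -0 12345 && echo "still in the process table" || echo "actually gone"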

Regarding telnet clients, check your OS TCP keepalive settings. If users are killing their telnet sessions on a regular basis or having them die, I think that adjusting your various keepalive settings to reduce the time it takes for the server to clean up those clients could help. I'm not an expert on that so I would encourage you to seek other opinions.
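
On HP-UX 11i I believe the relevant knob is tcp_keepalive_interval (in milliseconds), inspectable with ndd; verify against your OS docs:

    ndd -get /dev/tcp tcp_keepalive_interval    # default is 7200000 ms (2 hours)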

Last opinion: you mentioned adding a lot more users recently. You may be in need of startup parameter tuning to address this additional workload. I would guess you've already done that, but if you can state the changes you've made, others here can weigh in. I hope this helps.
 

dp74

New Member
Thanks Rob, here's what I can quickly respond to before I fall asleep and mash the keyboard with my forehead:

-- users are only kicked for sitting in QAD maintenance screens; we are very careful about transaction scoping, the kick log is monitored, and any genuinely long-running transactions would be fixed in the code

-- yes, very true; we try to avoid -9 if at all possible, as we are painfully aware it can crash the database. But if the only alternative is to restart the db...

-- that makes sense about the WDOG only barking at dead processes, so then I guess the question would be: why did some disconnected sessions become zombified and go shambling and drooling after my brains? Generally they quietly RIP in their graves. The only other time I ever saw something like this was with users on OpenLink ODBC: if the user's PC doesn't have the -RO option set, it starts a tx even on reads, and the unix process and tx never die even after the user is kicked. Maybe this is an HP-UX problem.

-- yep, the first thing I asked the UNIX guys was whether they gave the tx time to recover; they've made that exact mistake you refer to before. In this case the tx was very small, and they swear they gave it a few minutes; it should have been done by then

-- thanks; I broached keepalive with the UNIX team, and I think it is something we will research

-- we might have increased -B a little bit; we didn't change much else on this db (we did add -spin to the side DB)


I think the bolded question is probably the important one, and I'm afraid the answer is something along the lines of "this happens occasionally in 8.3A, and if you're going to use a DB version from the mists of antiquity you're just going to have to accept this sort of thing." Oh well.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
-- yes, very true; we try to avoid -9 if at all possible, as we are painfully aware it can crash the database. But if the only alternative is to restart the db...

...then you have a penchant for Russian roulette? I would suggest that a scheduled outage is easier to schedule than a crash :rolleyes:, easier to explain to management and disgruntled users, and less likely to corrupt your DB and generally disrupt your business.

-- that makes sense about the WDOG only barking at dead processes, so then I guess the question would be: why did some disconnected sessions become zombified and go shambling and drooling after my brains?

I don't know. Do you have VSTs enabled? What does promon show you about them? And when you say zombies, does the OS actually show them as zombies (e.g. in top)? Or are they running processes whose PPID is now 1? Are they chewing up CPU? How do you define zombies?
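
To be concrete about what I'm asking, something like this separates the cases (exact ps columns vary by platform; _progres is the self-service client executable):

    ps -el | awk '$2 == "Z"'                   # true zombies: state Z
    ps -ef | awk '$3 == 1' | grep _progres     # running clients reparented to init (PPID 1)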

-- yep, the first thing I asked the UNIX guys was whether they gave the tx time to recover; they've made that exact mistake you refer to before. In this case the tx was very small, and they swear they gave it a few minutes; it should have been done by then

Given the choice, I prefer not to give "the UNIX guys" the 009 license to kill, when it comes to DB processes. They can be a little trigger-happy, and aren't on the hook when the DB is in the weeds.

-- we might have increased -B a little bit; we didn't change much else on this db (we did add -spin to the side DB)

If you didn't adjust client-related parameters, how were you able to add "a lot more users lately"? Were they over-provisioned in the first place? Also, are they all self-service or do you also have remote clients?

I think the bolded question is probably the important one, and I'm afraid the answer is something along the lines of "this happens occasionally in 8.3A, and if you're going to use a DB version from the mists of antiquity you're just going to have to accept this sort of thing." Oh well.

I don't know. I'll let an 8.x veteran chime in on that.
 

TomBascom

Curmudgeon
You have HPUX 11i (probably on a newish server) but you're running an ancient, obsolete and beyond unsupported release of Progress? Where does the insanity stop?

Version 8 did indeed have some infamous problems with killing and disconnecting users.

In fact, contrary to common sense, I generally recommended that you use "kill -1" on a v8 database rather than disconnecting via proshut because proshut disconnects were buggier in that era.

And set the tcp keepalive interval to something sensible like 5 minutes rather than the default of 2 hours.
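
On HP-UX 11i that's an ndd setting; the value is in milliseconds, and it doesn't survive a reboot, so put it in an rc script:

    ndd -set /dev/tcp tcp_keepalive_interval 300000    # 5 minutes instead of 2 hours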

Make very, very sure that you are NOT trapping signals. To do so is poison in that environment. It is always misguided. Traps are the most likely root cause of your zombies and the apparent lack of effectiveness of your "normal" kills.
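
The classic poison looks like this in a login script or client wrapper (db and procedure names are made up):

    trap '' 1 2 15          # ignore SIGHUP/SIGINT/SIGTERM - DON'T do this
    mpro /db/qaddb -p menu.p

Ignored signals are inherited by the child, so with that trap in place the client never hears that the telnet session died, and a "normal" kill -15 is silently discarded -- which looks exactly like what you described.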

BTW, "small" changes to -B are pointless. Its effectiveness follows an inverse square law. The point of -B is to reduce IO. To cut db IO in half you need to multiply -B by 4.

In v8 -spin is a silver bullet. Always use it if you have it. 10,000 is almost always "good enough".
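
On the broker startup line that's simply something like this (db name and values are illustrative, not a recommendation for your box):

    proserve /db/qaddb -B 40000 -spin 10000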
 

dp74

New Member
Rob -- thanks for the response. I agree on -9, but we can't wait to schedule an outage when large numbers of users are already hung and unable to work. It was "kill or restart db now".

Promon shows the tx, but says the user is disconnected. VSTs also show the transaction. "Zombie" is what I'm calling users who have been disconnected from the db, but continue to hold tx/locks.

Yep, we were over-provisioned. The company merged with another company several years back, and the QAD application lost a large number of users as production was moved to other plants. Then the migration of all plants to a new ERP system was put on hold and ultimately abandoned. Now we're moving those other plants onto QAD. No remote clients.

Tom -- yep, it's the "forklift and pray" model, not advised. Thanks for the recommendation and info.

Heh, I said the same thing to our DBA about the -B, but he has his own way of doing things. I still haven't convinced him that our side database doesn't really need 360,000 lock table entries for 200 users, many of whom don't even use that db (I think actual usage peaks around 25).

The funniest part of our situation is that we know there are 9.1D CDs here... somewhere. They were checked into the software library but disappeared; there's even a suspicion they were deliberately hidden a few years ago to prevent us from upgrading for internal political reasons having to do with the long-term ERP vision...
 

rzr

Member
The funniest part of our situation is that we know there are 9.1D CDs here... somewhere. They were checked into the software library but disappeared; there's even a suspicion they were deliberately hidden a few years ago to prevent us from upgrading for internal political reasons having to do with the long-term ERP vision...

you need a 007 license !! go get them :)
 