System Hanging

Keith Owens

New Member
HP UX 11 (32 bit)
Progress V8.3E (32 bit)

We recently upgraded from 10.20 and we are now experiencing daily occurrences when the system freezes. None of our Progress parameters have been changed.

We swap AI files every 10 minutes.

Bi cluster is 512K

UNIX sync is default of 30 seconds

Checkpoints in promon show this (can anybody tell me what the columns mean?).

05/13/03        Checkpoints
16:06:57

Ckpt                  ------ Database Writes ------
No.  Time      Len  Dirty  CPT Q  Scan  APW Q  Flushes
76   16:04:19    0    953    860    11     91        0
75   15:58:54  325    858    856    71     49        0
74   15:56:11  163    994    908    16     84        0
73   15:51:28  283   1452    365    94   1128        0
72   15:46:22  306   1266   1264    87      0        0
71   15:41:02  320    786    746    67     38        0
70   15:34:41  381   1117   1094   101    343        0
69   15:29:33  308   1408   1155   109    251        0

Any help would be greatly appreciated.

Thanx in advance
Keith
 
Ckpt                  ------ Database Writes ------
No.  Time      Len  Dirty  CPT Q  Scan  APW Q  Flushes
76   16:04:19    0    953    860    11     91        0
75   15:58:54  325    858    856    71     49        0
74   15:56:11  163    994    908    16     84        0
73   15:51:28  283   1452    365    94   1128        0
72   15:46:22  306   1266   1264    87      0        0
71   15:41:02  320    786    746    67     38        0
70   15:34:41  381   1117   1094   101    343        0
69   15:29:33  308   1408   1155   109    251        0

Len is the number of seconds from the start of that checkpoint to the start of the next one (checkpoint 74 began at 15:56:11 and checkpoint 75 at 15:58:54, 163 seconds later). Even your shortest was 163 seconds, which is plenty. Based on this you could even reduce the cluster size to 256K, which would make freezes due to checkpointing less noticeable since there would be less to write each time, but checkpoints would happen more often.
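
To sanity-check how Len relates to the Time column, here is a minimal Python sketch that recomputes the gap between successive checkpoint start times from the rows quoted above. The halving at the end is only a rough estimate and assumes the BI write rate stays the same after moving to a 256K cluster; the function and variable names are just for illustration.

```python
from datetime import datetime

# Rows copied from the promon output above: (checkpoint no., start time, Len)
ROWS = [
    (76, "16:04:19", 0),
    (75, "15:58:54", 325),
    (74, "15:56:11", 163),
    (73, "15:51:28", 283),
    (72, "15:46:22", 306),
    (71, "15:41:02", 320),
    (70, "15:34:41", 381),
    (69, "15:29:33", 308),
]

def gaps_between_starts(rows):
    """Return {checkpoint no.: seconds until the next checkpoint started}."""
    by_no = {no: datetime.strptime(t, "%H:%M:%S") for no, t, _ in rows}
    return {
        no: int((by_no[no + 1] - start).total_seconds())
        for no, start in by_no.items()
        if no + 1 in by_no
    }

gaps = gaps_between_starts(ROWS)
for no, _, length in ROWS:
    if no in gaps:
        print(no, "Len:", length, "gap to next start:", gaps[no])

# Rough estimate only: assumes the BI write rate stays the same, so a
# 256K cluster fills in about half the time a 512K cluster does.
shortest = min(gaps.values())
print("shortest interval now:", shortest, "s; at 256K roughly", shortest // 2, "s")
```

For every completed checkpoint in this sample the recomputed gap equals the Len column, and checkpoint 76 shows 0 simply because it had not finished when the screen was taken.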

APW Q is the number of blocks in the APW queue at the end of the checkpoint.

Flushes (critical) is 0 so there's no problem there.

Has ANYTHING changed? New application code? Another process? Other things happening?

Running out of Data space in the database and extending the .dn extent?

Something has changed - you are getting freezes. Can you predict the times of these freezes (even to "about 11 ish") and leave some detailed system monitoring running to a file which you can examine at your leisure? (Remember Schroedinger's cat - the act of observing can change the situation - be careful where this log file goes!)
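
One way to leave monitoring running to a file, sketched in Python on the assumption that a Python 3 interpreter is available on (or can reach) the box. The helper name `log_command`, the log path and the choice of command are all hypothetical; substitute whichever OS monitoring tool you normally use.

```python
import subprocess
import time
from datetime import datetime

def log_command(command, log_path, samples, interval_seconds):
    """Run `command` `samples` times, appending timestamped output to log_path."""
    for i in range(samples):
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        result = subprocess.run(command, capture_output=True, text=True)
        # Open, write and close on every sample so each entry leaves this
        # process's buffer as soon as it has been taken.
        with open(log_path, "a") as log:
            log.write(f"--- {stamp} {' '.join(command)}\n")
            log.write(result.stdout)
        if i < samples - 1:
            time.sleep(interval_seconds)

# Example call (hypothetical path; one sample a minute for an hour):
# log_command(["vmstat"], "/var/tmp/monitor.log", samples=60, interval_seconds=60)
```

Keep the log on a disk that does not hold the database, BI or AI files, so that the monitoring adds as little as possible to the I/O you are trying to observe.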
 

Keith Owens

New Member
Thanks Toby, some fresh ideas for us to throw into the pot...

Another factor is that we are currently processing a higher volume of business. More users are logged in - max 150ish.

Is there a way I can see more of the checkpoints? Promon only seems to offer the last 8 or so, can't I see them all for today & then I can identify the hotspots?

Keith
 
Welcome :)

Ah - so the system is responding slower to higher volumes of business - tut tut.... ;)

Promon will only show you the last 8 checkpoints, as will the _CheckPoint (or whatever it is called) VST.
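
Since only the last 8 are kept, one workaround is to save the checkpoint screen to a text file every so often and stitch the captures together afterwards. A minimal Python sketch of the stitching step follows. It assumes each saved screen has the same eight columns as the output earlier in this thread and that checkpoint numbers are not reused within the period covered; how the screens get saved is left to you, and `merge_checkpoints` is just an illustrative name.

```python
import re

# Matches one data row: No., Time, then six integer columns
ROW = re.compile(r"^\s*(\d+)\s+(\d{2}:\d{2}:\d{2})((?:\s+\d+){6})\s*$")

def merge_checkpoints(history, screen_text):
    """Add the checkpoint rows found in one saved screen to `history`."""
    for line in screen_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # headers, date/time lines and blank lines are skipped
        number = int(match.group(1))
        values = [int(v) for v in match.group(3).split()]
        # A later capture overwrites an earlier one, so the Len of the
        # most recent checkpoint is filled in once it has completed.
        history[number] = (match.group(2), *values)
    return history
```

Feed the saved screens in the order they were taken, then sort the resulting dictionary by checkpoint number to get the whole day in one list. As long as a capture is taken at least once per 8 checkpoints, none should be missed.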

More users generally = more database accesses. Check your -B and %hit rate - if below 95% you have a major shortage of -B.
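
For reference, the hit rate is usually worked out as the share of block requests that were satisfied from the buffer pool without going to disk. A minimal Python sketch of that arithmetic, assuming you have taken two counters over the same interval (total database block requests and the number that needed an OS read); the counter values below are made up purely for illustration.

```python
def buffer_hit_rate(logical_reads, os_reads):
    """Percentage of database block requests satisfied from the -B buffer pool."""
    if logical_reads <= 0:
        raise ValueError("need at least one logical read")
    return 100.0 * (logical_reads - os_reads) / logical_reads

# Made-up counters for illustration only
rate = buffer_hit_rate(logical_reads=200_000, os_reads=14_000)
print(f"{rate:.1f}%", "short of -B" if rate < 95.0 else "ok")
```

Sample it during the busy part of the day, not straight after a restart, since the buffer pool starts out empty and the figure is low until it fills.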

Check out the overall o/s statistics for CPU/Disk utilisation/Memory/Paging etc. Have you grown out of the platform you are currently on?

Is the application client/server or host based? If client/server, did you start more servers for the clients or increase the number of clients per server?

Are you running with -bibufs and -aibufs? If not - try adding sensible amounts of each (try 50 and see if it helps).

Database sizes would help.

I expect that you have already been told to upgrade the database. The benefits of v9 will be most visible if you have high transaction throughput (which it doesn't seem to be), and this is generally created by batch processes...

Good luck with trying to identify the times (if there are any!)
 

Keith Owens

New Member
Hi,

We've been running promon checkpoints for 4 days and apart from a few odd occasions we're not flushing any buffers other than during the night when all the batch jobs are fired off.

-B 50000
-bibufs 50
-aibufs 100

db is just over 5 gig and the nth extent is not in use.

Turns out our previous DBA configured 6 APWs - could this be part of the problem? However, we've always had 6.

In promon all our buffers are shown as in use, is this an issue for us?

We've changed the unix disksort_seconds parameter from 0 to 8 and the number of freezes seems to have dropped, perhaps coincidence.

On with the quest.......

Again, thanx in advance for any bright ideas on this point.

Keith
DBA in distress
 
6 APW's? Wow - I ran a system that was processing 3 million records a day for 2500 users and we had 4-6 APWs. (3 million may not sound a lot but that is 34.7 records per second - 24x7).

Kill at least 3 APWs, leaving 3. BIW? AIW?

Not sure what disksort_seconds does but it sounds fun!

Directio or not?

Disk configuration (and none of this EMC virtual volume stuff - the real thing!)? Separate spindles for BI files/AI files etc?

I think that, given what you have said so far, I would be tempted to turn on directio (if it isn't already - though with 6 APWs I suspect it is).

Have you considered dropping the BI Cluster size? So the batch processes may take an extra 10 minutes. Big Deal! If it smooths out the users this may help, but with 6 APW's I cannot believe this is checkpoint related.

# of CPUs in the box - Could this be multiple queries hogging all the CPU?

Memory Paging at all?

Current guess is Disk config but I'll reserve judgement on that!
 