mfurgal
Guest
George: The promon R&D - 3 - 4 Checkpoints screen shows the sync time. When the disks are bad, the sync time increases. That was not the case in your example: all of the sync times are 0.00. What is with the 3,115 flushes on checkpoint 72, which was 111 seconds long? Maybe that is an indicator in itself. The APWs certainly had plenty of time to do all the checkpoint queue writes.

The fact that you caught a process blocked on open() for a database file looks to me more like a bad block on disk than a disk performance issue. open() should be instantaneous. In this case it looks like the device driver is blocked while opening a file. Opening a file does not write(): the file is either opened for buffered I/O, which would only write any attribute information (time opened, accessed, etc.) to the OS cache, or it is opened unbuffered, but in such a way that the file attributes are not updated, so an open() should not incur a write().

I hope this helps you debug this down a little bit more.

MikeF
--
Mike Furgal
PROGRESS Bravepoint
678-225-6331 (office)
617-803-2870 (cell)
mfurgal@progress.com

From: George Potemkin <bounce-GeorgeP12@community.progress.com>
Reply-To: TU.OE.RDBMS@community.progress.com
Date: Saturday, July 18, 2015 at 12:54 PM
To: TU.OE.RDBMS@community.progress.com
Subject: [Technical Users - OE RDBMS] The latches and the disk IO

The latches and the disk IO

Thread created by George Potemkin

A customer of ours recently had a severe performance degradation of their Progress application. The root cause turned out to be the disks: copying a 1 GB file took 5-6 minutes instead of the usual 7 seconds. There were no I/O waits, the CPU was 95% idle, and iostat showed very low disk activity. From the Progress point of view there were the following problems:

Connecting to the database through shared memory took 20-25 seconds. We had enough time to generate a protrace file:

(5) 0xe000000130043440 ---- Signal 16 (SIGUSR1) delivered ----
(6) 0xc00000000054cc10 _open_sys + 0x30 [/usr/lib/hpux64/libc.so.1]
(7) 0xc000000000561070 _open + 0x170 at ../../../../../core/libs/libc/shared_em_64_perf/../core/syscalls/t_open.c:28 [/usr/lib/hpux64/libc.so.1]
(8) 0x40000000008069e0 dbut_utOsOpen + 0x90 at /vobs_rkt/src/dbut/ut/utosopen.c:310 [/usr/dlc/bin/_progres]
(9) 0x4000000000716ec0 bkioOpenUNIX + 0xb0 at /vobs_rkt/src/dbmgr/bkio/bkio.c:1005 [/usr/dlc/bin/_progres]
(10) 0x400000000071bc70 bkioOpen + 0x210 at /vobs_rkt/src/dbmgr/bkio/bkiotop.c:784 [/usr/dlc/bin/_progres]
(11) 0x4000000000727c50 bkOpenOneFile + 0x290 at /vobs_rkt/src/dbmgr/bk/bkopen.c:2052 [/usr/dlc/bin/_progres]
(12) 0x4000000000727720 bkOpenAllFiles + 0x1d0 at /vobs_rkt/src/dbmgr/bk/bkopen.c:1942 [/usr/dlc/bin/_progres]

Remote connections were instant.

Promon showed more-or-less normal activity for 5-10 seconds; then the database hung for approximately 20 seconds. During this time one of the processes (a different one each time) with an active transaction held and did not release the MTX latch. The same process, during the same period, held and did not release a buffer lock (EXCL). There was no sign that the process holding the MTX latch was waiting for any other database resource. The processes had created small transactions. It looked as if the database had been "proquiet'ed", except that during a proquiet it is the database broker that holds the MTX latch, and the broker also holds the BIB latch; in our case BIB was not locked. During the pauses there were neither db writes nor db reads.
There were only record reads - obviously the clients' sessions were able to read records that were already in the buffer pool.

An example that combines a few promon screens:

Activity: Latch Counts
Status: Active Transactions
Status: Buffer Lock Queue
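To make Mike's point above concrete - that a read-only open() of a database extent performs no write() and should return almost instantly - here is a minimal C sketch. It is not from the thread and not part of Progress; it simply times a single open() of whatever file you pass it. On a healthy disk the result should be well under a millisecond; a result measured in seconds points at the device driver or a bad block rather than at write pressure.

/* openprobe.c -- hypothetical helper, not from the thread.
 * Times one read-only open() of the given file (e.g. a database extent).
 * A read-only open forces no attribute update and no write(). */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    double t0 = now_ms();
    int fd = open(argv[1], O_RDONLY);   /* should return immediately on a healthy device */
    double t1 = now_ms();

    if (fd < 0) {
        perror("open");
        return 1;
    }
    close(fd);

    printf("open(\"%s\") took %.3f ms\n", argv[1], t1 - t0);
    return 0;
}

Usage (the extent path is a placeholder): cc -o openprobe openprobe.c && ./openprobe /path/to/dbname_7.d1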
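George's "1 GB in 7 seconds versus 5-6 minutes" copy test can be cross-checked with a plain sequential read timing. A minimal sketch under the same caveat - not from the thread, standard UNIX calls only, and the 1 MB buffer size is an arbitrary choice:

/* readprobe.c -- hypothetical helper: times a sequential read of a file
 * and reports a rough MB/s figure comparable to the copy test above. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BUFSZ (1024 * 1024)   /* 1 MB read buffer */

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char *buf = malloc(BUFSZ);
    struct timeval t0, t1;
    long long total = 0;
    ssize_t n;

    gettimeofday(&t0, NULL);
    while ((n = read(fd, buf, BUFSZ)) > 0)   /* sequential read to end of file */
        total += n;
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    if (sec <= 0.0)
        sec = 1e-6;   /* guard against division by zero on tiny files */

    printf("read %lld bytes in %.2f s (%.1f MB/s)\n",
           total, sec, (double)total / (1024.0 * 1024.0) / sec);

    free(buf);
    close(fd);
    return 0;
}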