Question Bi Checkpoints

BigSlick

Member
Hi All,

OE 10.2BSp07 32-Bit on Wndows Server 2008 64-bit.

We seem to be having some issues with Checkpointing. The issue is much much worse during online backups but during the day we are experiencing the issues too. I've been looking into this and the issues seem to tie in with checkpoints.

Although the singular APW seems to be doing a decent job, i'm in the process of adding another just to help out. Our disks are maintained by a 3rd party and i'm in the process of retrieving information regarding the make-up of the disks to understand if we can separate the bi and db files. Other than that i'm a little stuck. I feel the cluster size could be too large.

I have a 24hour recording of Checkpoints and APW usage and the only time we flush is at the beginning of the backup and APW writes % is rarely below 80%.

We have 2 extents for the bi; 1 at 4gb and a variable.

Any information would be greatly appreciated.

Here is an example set of checkpoints during normal activity;

Code:
06/10/16        Checkpoints
14:17:39

Ckpt                                  ------ Database Writes ------
 No. Time        Len   Freq   Dirty   CPT Q    Scan   APW Q Flushes   Duration  Sync Time

1463 14:11:01    398      0  261213  229170     130    2466       0      11.48      10.49

1462 14:04:14    401    407  125691  156792     286     519       0       0.79       0.13
1461 13:36:50   1610   1644  127401   72532   29224   69148       0       0.52       0.06
1460 13:03:31   1902   1999    4421   25673   43030    7071       0       0.62       0.16
1459 12:24:19   2256   2352   39281    5152   53118  158885       0       0.65       0.06
1458 11:45:18   2195   2341  159114   98902   35486  163071       0       1.58       0.11
1457 11:24:46   1203   1232  143526   50671   19526  179344       0       0.86       0.07
1456 11:09:50    868    896  227508   73238   10868  323464       0       1.48       0.17

And here are some things you may be interested in;

Code:
06/10/16        Status: BI Log
09:09:04

Before-image cluster age time:     0 seconds
Before-image block size:           4096 bytes
Before-image cluster size:         262128 kb (268419072 bytes)
Number of before-image extents:    2
Before-image log size (kb):        4095996
Bytes free in current cluster:     120801211 (46 %)
Last checkpoint was at:            06/10/16 08:58
Number of BI buffers:              100
Full buffers:                      2



06/10/16        Activity: BI Log
10:31:51        05/22/16 14:21 to 06/10/16 09:07 (450 hrs 46 min)

                                    Total         Per Min          Per Sec          Per Tx

Total BI writes                    104067K           3940            65.67            0.16
BIW BI writes                    70026156            2589            43.15            0.10
Records written                   2852607K         108006          1800.10            4.33
Bytes written                      363470M       14092041        234867.35          564.81
Total BI Reads                   17636890             652            10.87            0.03
Records read                      6178374             228             3.81            0.01
Bytes read                         723679K          27400           456.67            1.10
Clusters closed                      1428               0             0.00            0.00
Busy buffer waits                19415395             718            11.96            0.03
Empty buffer waits                 101970               4             0.06            0.00
Log force waits                         0               0             0.00            0.00
Log force writes                        0               0             0.00            0.00
Partial writes                   12966547             479             7.99            0.02
Input buffer hits                  390025              14             0.24            0.00
Output buffer hits                 925810              34             0.57            0.00
Mod buffer hits                    193711               7             0.12            0.00
BO buffer hits                    1933250              71             1.19            0.00

Shared Resources:
                     Busy After Image Extent: F:\db\ai\live.a4
                       Number of database buffers (-B): 16000000
            Number of database alternate buffers (-B2): 0
              Number of before image buffers (-bibufs): 100
               Number of after image buffers (-aibufs): 100
                      Excess shared memory size (-Mxs): 386
                   Before-image truncate interval (-G): 0
                              No crash protection (-i): Not enabled
             Maximum private buffers per user (-Bpmax): 64
                    Current size of locking table (-L): 35008
                          Locking table entries in use: 818
                         Locking table high water mark: 7321
            Maximum number of clients per server (-Ma): 3
             Max number of JTA transactions (-maxxids): 0
                     Delay of before-image flush (-Mf): 3
                       Maximum number of servers (-Mn): 331
                          Maximum number of users (-n): 932
                         Before-image file I/O (-r -R): Raw
                          Shared memory version number: 6412385
                             Number of semaphores used: 1267
                                         Broker status: Executing
                                      BI Writer status: Executing
                                      AI Writer status: Executing
                                       Watchdog status: Executing
                                Number of page writers: 1
                          Number of self-service users: 5
                                Number of remote users: 612
                                     Number of servers: 331
                                  Number of shut-downs: 0
                                    Number of monitors: 2
                                 LRU Disabled (-nolru): No
                           LRU force skips (-lruskips): 5
                         LRU2 force skips (-lru2skips): 0
                  Server message wait time (-Nmsgwait): 2
         Delay first prefetch message (-prefetchDelay): Disabled
    Prefetch message fill percentage (-prefetchFactor): 0
    Minimum records in prefetch msg (-prefetchNumRecs): 16

06/10/16        Activity: Page Writers
14:21:48        05/22/16 14:21 to 06/10/16 09:07 (450 hrs 46 min)

                                    Total         Per Min          Per Sec          Per Tx

Total DB writes                    297388K          11260           187.66            0.45
APW DB writes                      290817K          11011           183.52            0.44
    scan writes                  17315346             640            10.67            0.03
    apw queue writes             32614177            1206            20.10            0.05
    ckp queue writes               242058K           9165           152.75            0.37
    scan cycles                       307               0             0.00            0.00
    buffers scanned               4809449K         182096          3034.93            7.30
    bufs checkpointed              266230K          10080           168.00            0.40
Checkpoints                          1428               0             0.00            0.00
Marked at checkpoint               266810K          10102           168.37            0.40
Flushed at checkpoint             1046104              39             0.64            0.00

Number of APWs:                   1
 

TomBascom

Curmudgeon
You "seem to be having some issues with Checkpointing".

Perhaps you could describe the issues that you seem to be having? I see nothing in this data that suggests a problem. I certainly don't see any reason to think that your bi "cluster size could be too large". Nor any reason to add an APW.

The only things that I see to complain about are:

1) You are running Windows without the Linux patch.
2) 32 bits. Really? It is currently 2016. Why are you running a db server with 32 bit executables?
3) 10.2b07 -- 10.2b08 is the last service pack for oe10. It has been available for a very long time. 2.5 years now... You really ought to upgrade to oe11 but if you're going to stick with 10 at least get to service pack 8.
4) A 3rd party is responsible for the disks. Nothing good will come of that.
 

BigSlick

Member
upload_2016-6-10_15-56-59.png upload_2016-6-10_15-59-20.png Thanks for the reply Tom.

The slow responses we receive seem to tie in with the checkpoints. Particularly at night and during the morning.

The APW might help during the online backup was my thought when write percentages seem to be rather low.

The idea behind setting the bi cluster lower was to shorten the checkpointing times.

32-bit i know! We are in the process of upgrading to 11.6.2. 64-bit.

Third party - i know :( but not much i can do there.

Here is a list of times that it takes to process a web request. (this is usually 2-10 ms). Columns are date, time, time to process in ms.
Code:
[ATTACH=full]1468[/ATTACH]

And this is the checkpoints at that time
Code:
[ATTACH=full]1469[/ATTACH]

Maybe they arent related but something is causing the slowdown :(

Cheers
 

cj_brandt

Active Member
In performance tests, I found that BiBlocksize of 16kb offered the best performance. If you increase your blocksize, you might want to lower your clustersize.
Curious - how long does the db log show that it takes to backup the BI data ?

What was going on at 14:11:01 ? The disk sync time looks pretty bad during that time period. Try to post a few 60 second samples during times of high activity.
You want samples from promon for a couple minutes - showing 24 hours worth of data isn't much help because half the time your system might have been idle.

If you find your disk is on a large SAN, you probably won't see a performance boost in moving AI or BI extents around.

On systems I manage that struggle with write performance on disk, I have been able to get a decent improvement by having the database write to fixed length extents instead of variable length extents - for AI, BI and Storage Areas.
 

BigSlick

Member
Hi CJ,

Here is the database backup log:

Code:
[2016/06/07@20:31:37.242+0100] P-15900  T-33712 I BACKUP336: (-----) Login by User1.
[2016/06/07@20:31:37.267+0100] P-15900  T-33712 I BACKUP336: (12850) Backup blocks will be written to F:\db\live.bak.
[2016/06/07@20:31:37.268+0100] P-15900  T-33712 I BACKUP336: (1362)  Full backup started.
[2016/06/07@20:31:37.268+0100] P-15900  T-33712 I BACKUP336: (6686)  423704331 active blocks out of 459128452 blocks in G:\db\live will be dumped.
[2016/06/07@20:31:37.268+0100] P-15900  T-33712 I BACKUP336: (6688)  917448 BI blocks will be dumped.
[2016/06/07@20:31:37.268+0100] P-15900  T-33712 I BACKUP336: (9285)  Backup requires an estimated 1.6 TBytes of media.
[2016/06/07@20:31:37.269+0100] P-15900  T-33712 I BACKUP336: (9286)  Restore would require an estimated 425487783 db blocks using 1.6 TBytes of media.
[2016/06/07@20:31:37.282+0100] P-15900  T-33712 I BACKUP336: (3777)  Switched to ai extent F:\ai\live.a1.
[2016/06/07@20:31:37.282+0100] P-15900  T-33712 I BACKUP336: (3778)  This is after-image file number 20249 since the last AIMAGE BEGIN
[2016/06/07@20:31:42.867+0100] P-15900  T-33712 I BACKUP336: (5459)  Begin backup of Before Image file(s).
[2016/06/07@20:32:32.187+0100] P-15900  T-33712 I BACKUP336: (5460)  End backup of Before Image file(s).
[2016/06/07@20:32:32.187+0100] P-15900  T-33712 I BACKUP336: (5461)  Begin backup of Data file(s).
[2016/06/08@01:12:48.532+0100] P-15900  T-33712 I BACKUP336: (5462)  End backup of Data file(s).
[2016/06/08@01:12:48.553+0100] P-15900  T-33712 I BACKUP336: (13625) Wrote a total of 12508041 backup blocks using 1.6 TBytes of media.
[2016/06/08@01:12:48.744+0100] P-15900  T-33712 I BACKUP336: (1364)  Full backup successfully completed.

Sorry, by 24 hour i meant every 30/120 seconds for 24 hours 30 for APW, 120 for checkpoints.

If you find your disk is on a large SAN, you probably won't see a performance boost in moving AI or BI extents around.
- I feel that might be the case.

At 14:11 - will have been usually activity. It's probably at it's peak there with number of concurrent users. What data from Promon - just the summary?[/CODE]
 
Last edited:

TheMadDBA

Active Member
1) You don't want to shorten the total length of the checkpoint.. longer times are almost always better.

2) In the second part of your post there are buffers being flushed at checkpoints.(1046104).. those are bad but the sample size is too large. Try and track down when those are happening per checkpoint. APWs and larger cluster sizes can help with that (and/or code fixes) IF they are happening during peak times and not just tiny bits spread across tons of checkpoints.

3) If performance is bad during the backup you probably have disk IO issues. SAN guys (like the network guys) will talk about bandwidth... you should focus on latency. As in how long does it take to service an IO.

Edit: Based on your backup time( almost 5 hours ) I am going to jump out on a limb and say you have an IO issue. Those numbers aren't very impressive for a system that is supporting a 1.6TB database.
 

BigSlick

Member
Thanks TheMadDBA.

i've attached a full list of checkpoints for 24 hours at 120 second intervals and APW's at 30 second intervals for 24 hours. Note the backup is at 20:30 - 01:12.


I have a feeling our F: drive isnt the best storage available. Once i hear back from the 3rd party i can look at upgrading this.
 

Attachments

  • APW.xlsx
    419.7 KB · Views: 1
  • Checkpoint.xlsx
    25.4 KB · Views: 2

TomBascom

Curmudgeon
Your list of times that it takes to process a web request seems to be missing.

FWIW I wouldn't jump to the conclusion that web request times have much of anything to do with checkpoints. 999 out of 100 performance problems have more to do with read performance. And those problems are generally driven by crappy code making poor index selections. Or just generally bad algorithm choices.... or running on 32 bit db servers with inadequate tuning.

It's /possible/ that you have an issue with write performance and checkpoints. But it would be unusual.

And so far the evidence isn't there to support that idea.

With a 1.6TB db and a 32 bit db server I think it is a heck of a lot more likely that you have issues with read performance. Just a wild guess but you are probably very reliant on the 3rd and 4th layers of cache -- the OS and the SAN -- and when those get put under strain by your checkpoints your read performance goes in the tank.
 

TheMadDBA

Active Member
So you have one checkpoint with flushes (when the backup starts)... not a checkpoint issue. IO in general and the cache/code like Tom says are your likely issues.

You can try adding -Bp 64 to your probkup command to try and avoid clearing the OE buffer cache. OS and SAN cache is another matter issue. The fix for that is 64 bit OE and a large -B :)

You do have ProTop installed right?
 

cj_brandt

Active Member
A 1.6tb db on a 32 bit database. You are probably chewing up your disk resources on reads... Would be interested to know what promon shows or the VST show for disk reads and OS Disk Reads for each minute for 30 minutes.

Might change the lruskips from 5 to something larger - 50 or 100 I think are more common.

If you have any highly read tables that are small, you might want to pin them in memory with the B2 parameter.

I like the display promon shows for checkpoints which has the duration and sync time - the OEM report doesn't have that.
 

BigSlick

Member
The web log were there but they've formatted a little weird. I've attached.

Ok, some good news then i guess. i'll see what we can do code wise and see if that improves anything.

proTop inst installed yet. I've requested it though. Hopefully it will be available soon!
 

Attachments

  • WebLog.xlsx
    11.3 KB · Views: 1

TheMadDBA

Active Member
In preparation for ProTop being installed... make sure your -tablerangesize and -indexrangesize are set properly. Take a look and _UserTableStat and _UserIndexStat records for those web requests and see what floats to the top (compared to what they should be doing).

Put in a request for an OE upgrade to a recent 64 bit version while you are at it :)
 
Top