Question Database Performance Issue With Report

BigSlick

Member
Hi all,

We have a report that runs on a Live system every 15 minutes and usually takes between 1-2 minutes to run. From Thursday morning it started to take 40+ minutes to complete. Nothing I'm aware of has changed. Apparently this has happened before.

The user that has run this report has killed the process multiple times and restarted, but the last time he restarted was around 4pm (UK time) yesterday (Friday) and since then the report has started to take 1-2 minutes again.

Ultimately, i'd like to know why this happened and avoid it happening in the future.

My first thought is that the data wasnt in memory, but after it had run the first time it should be back in memory, unless this was being replaced every 15 minutes. Could there be other reasons why?

Code:
11:58:18        08/28/15 11:58 to 08/28/15 11:58 (10 sec)

Event                  Total  Per Sec |Event                  Total  Per Sec

Commits               16954    1695.4 |DB Reads               4634     463.4
Undos                   346      34.6 |DB Writes               267      26.7
Record Reads        1552035  155203.5 |BI Reads                  8       0.8
Record Updates        13683    1368.3 |BI Writes              2647     264.7
Record Creates        15434    1543.4 |AI Writes              2560     256.0
Record Deletes           21       2.1 |Checkpoints               0       0.0
Record Locks          76428    7642.8 |Flushed at chkpt          0       0.0
Record Waits              0       0.0 |Active trans             21

Rec Lock Waits     0 %    BI Buf Waits      0 %    AI Buf Waits      2 %
Writes by APW    100 %    Writes by BIW   100 %    Writes by AIW     0 %
DB Size:        1353 GB   BI Size:       3999 MB   AI Size:         58 MB
Empty blocks:20087957      Free blocks:   1137      RM chain:     94356
Buffer Hits       99 %    Primary Hits     99 %    Alternate Hits    0 %

301 Servers, 495 Users (36 Local, 459 Remote, 37 Batch), 1 Apws

Server 2008 64 Bit
384GB memory
4K blocksize

Code:
Shared Resources:
                     Busy After Image Extent: F:\ai\db.a1
                       Number of database buffers (-B): 15000000
            Number of database alternate buffers (-B2): 0
              Number of before image buffers (-bibufs): 100
               Number of after image buffers (-aibufs): 100
                      Excess shared memory size (-Mxs): 386
                   Before-image truncate interval (-G): 0
                              No crash protection (-i): Not enabled
             Maximum private buffers per user (-Bpmax): 64
                    Current size of locking table (-L): 35008
                          Locking table entries in use: 534
                         Locking table high water mark: 6054
            Maximum number of clients per server (-Ma): 3
             Max number of JTA transactions (-maxxids): 0
                     Delay of before-image flush (-Mf): 3
                       Maximum number of servers (-Mn): 331
                          Maximum number of users (-n): 932
                         Before-image file I/O (-r -R): Raw

Code:
12:17:31        08/28/15 12:17 to 08/28/15 12:17 (10 sec)

                                    Total         Per Min          Per Sec          Per Tx

Database Buffer Pool
Logical reads                     4181994        25091964        418199.40         3113.92
Logical writes                      16962          101772          1696.20           12.63
O/S reads                           11239           67434          1123.90            8.37
O/S writes                            331            1986            33.10            0.25
Checkpoints                             0               0             0.00            0.00
Marked to checkpoint                    0               0             0.00            0.00
Flushed at checkpoint                   0               0             0.00            0.00
Writes deferred                     12707           76242          1270.70            9.46
LRU skips                              71             426             7.10            0.05
LRU writes                              0               0             0.00            0.00
APW enqueues                           71             426             7.10            0.05
Database buffer pool hit ratio:  99 %

Primary Buffer Pool
Logical reads                     4181994        25091964        418199.40         3113.92
Logical writes                      16962          101772          1696.20           12.63
O/S reads                           11239           67434          1123.90            8.37
O/S writes                            331            1986            33.10            0.25
Marked to checkpoint                    0               0             0.00            0.00
Flushed at checkpoint                   0               0             0.00            0.00
Writes deferred                     12707           76242          1270.70            9.46
LRU skips                              71             426             7.10            0.05
LRU writes                              0               0             0.00            0.00
APW enqueues                           71             426             7.10            0.05
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Certainly there's a report issue, with that big a jump in run time. Have you established that it's caused by a database issue? I don't know your system obviously but nothing in those numbers jumps out at me as awful.

The user that has run this report has killed the process multiple times and restarted, but the last time he restarted was around 4pm (UK time) yesterday (Friday) and since then the report has started to take 1-2 minutes again.
This really makes me think it's not a database issue.

Is the user in question a remote user or batch?
Has anything changed with the code base or the propath?
If he is remote, was there a network issue that would have affected throughput?
If he is remote, was there something running on his PC/terminal server that would have affected run time? (e.g. virus scanner or disk defrag kicks in, etc.)
If he is batch, is there some other workload on the DB server that could affect application performance?
Do you have periodic OS or database data captures from the machine in question? Is it paging/swapping excessively?

I'd set up a scheduled task to run a promon gather script every few minutes during the job (when it runs beyond a couple of minutes) to see whether the DB metrics change noticeably during the bad runs.

How about the application - is it in-house or third-party? Do you have source?

Do you have a test system where you can reproduce this issue and profile it?
 

BigSlick

Member
Hi Rob,

Thanks for the response.

In haven't established if it is the database, but the report hasn't changed and neither have the surroundings. The user is connecting remotely through an SQL connection. Or so is my understanding.

No code or propath has changed, it's in house and we have source.

As for workload, the issue started around 8am and improved after 4pm, but wasnt great, until it was restarted after 4pm on the Thursday. Between 4pm and 8am the report was quicker and there was a noticeable slow down at 8am. This ties in with database activity. I.e a lot of people finish at 4 and the majority at 5. Again, users are back in for 8 and the majority by 9.

I'd set up a scheduled task to run a promon gather script every few minutes during the job (when it runs beyond a couple of minutes) to see whether the DB metrics change noticeably during the bad runs.

I'll set that going the next time we receive the issue.

Do you have a test system where you can reproduce this issue and profile it?

We have a test system but it's not used in anywhere near the capacity of live, so it's may not be feasible.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
but the last time he restarted was around 4pm
until it was restarted after 4pm
When you said "he" restarted I assumed you meant the client's application or system. In your second quote does "it" also mean the client or was the database or database server restarted?

Do any other users run the same code/application or submit the same SQL queries?
Do any other users report performance issues in the same time frame as the affected user?
Do any batch jobs show increases in program run times?
 

BigSlick

Member
Sorry I wasn't clear at all.

The process is a batch process, but the guy that's looks after the reports killed the process and restarted it.

In your second quote does "it" also mean the client or was the database or database server restarted?

I was referring to the report. The database has been up for a good few months now.

Do any other users run the same code/application or submit the same SQL queries?

No, just the one process.

Do any other users report performance issues in the same time frame as the affected user?

I'm not aware of any. Mainly the reason why i can't get my head around this issue.

Do any batch jobs show increases in program run times?

Good question and I have no answer for that. Quite a lot of reports get run and i have next to no detail about these.

I was only asked to look at this as it was so slow. Users don't usually congratulate me when their reports are running fast. :(
 

TheMadDBA

Active Member
If it is a SQL query... how often are the statistics updated for this database? What does the explain plan look like when it is running well and how does that compare to when it is running slow?

Not knowing the exact nature of your application the stats you posted could vary from decent to terrifying. Quite a few record reads and more disk reads than I would like. If you are reading from disk that means the buffer pool is being flushed. The alternate buffer pool isn't being used, that can help a ton.

Is there an AIW running?

How about posting a larger sample size like 10-15 minutes so we can see more information about the buffer pools and BI activity.

Download ProTop and use it to get a better idea of which tables/indexes are being used and monitor what is happening with that one specific report.

ProTop Version 3
 

BigSlick

Member
Thanks TheMadDBA.

In regards to the SQL query; I couldn't tell you. That's something I'll have to find out.

The alternate buffer pool isn't active, but i have been given sign off to start using it! For this query in particular I was thinking of trying to use private buffers as it's just the one report or are they better off in the alternate buffer pool?

there is a BIW, APW and AIW running.

I was looking at trying to get a larger sample, I used option S.

The syntax of which says enter number of seconds but it ignored that parameter. Maybe I was doing it wrong...
 

TheMadDBA

Active Member
There is an option under the adjust monitor menu to change the number of seconds.

In my opinion it is far easier to press Z to zero the stats and then press U to update the stats. It gives you way more control over the amount of time you want to sample.

You can use private buffers to keep the query from killing the buffer pool for everyone else, but it will not make the query itself run faster.

The alternate buffer pool should hold your mostly static tables and indexes, focusing on the ones that are heavily used.
 

BigSlick

Member
Haha! Found it. Thank you!

Code:
08/28/15        Activity: Summary
16:49:11        08/28/15 16:34 to 08/28/15 16:49 (14 min 50 sec)

Event                  Total  Per Sec |Event                  Total  Per Sec

Commits              293370     329.6 |DB Reads             459812     516.6
Undos                  1466       1.6 |DB Writes            393008     441.6
Record Reads         132772K 152762.4 |BI Reads               1579       1.8
Record Updates       536943     603.3 |BI Writes            223980     251.7
Record Creates       339687     381.7 |AI Writes            223972     251.7
Record Deletes        91760     103.1 |Checkpoints               3       0.0
Record Locks        4671324    5248.7 |Flushed at chkpt          0       0.0
Record Waits              2       0.0 |Active trans             12

Rec Lock Waits     0 %    BI Buf Waits      0 %    AI Buf Waits      3 %
Writes by APW     96 %    Writes by BIW    99 %    Writes by AIW     0 %
DB Size:        1353 GB   BI Size:       3999 MB   AI Size:        237 MB
Empty blocks:20020253      Free blocks:   1137      RM chain:     94356
Buffer Hits       99 %    Primary Hits     99 %    Alternate Hits    0 %

301 Servers, 355 Users (37 Local, 318 Remote, 38 Batch), 1 Apws

Code:
08/28/15        Activity: Buffer Cache
16:50:14        08/28/15 16:35 to 08/28/15 16:50 (14 min 15 sec)

                                    Total         Per Min          Per Sec          Per Tx

Database Buffer Pool
Logical reads                      395260K       28403280        473387.99         1406.37
Logical writes                    4051623          284324          4738.74           14.08
O/S reads                          420531           29511           491.85            1.46
O/S writes                         382178           26820           446.99            1.33
Checkpoints                             3               0             0.00            0.00
Marked to checkpoint               334616           23482           391.36            1.16
Flushed at checkpoint                   0               0             0.00            0.00
Writes deferred                   3714516          260668          4344.46           12.91
LRU skips                              11               1             0.01            0.00
LRU writes                              0               0             0.00            0.00
APW enqueues                           11               1             0.01            0.00
Database buffer pool hit ratio:  99 %

Primary Buffer Pool
Logical reads                      395260K       28403280        473387.99         1406.37
Logical writes                    4051623          284324          4738.74           14.08
O/S reads                          420531           29511           491.85            1.46
O/S writes                         382178           26820           446.99            1.33
Marked to checkpoint               334616           23482           391.36            1.16
Flushed at checkpoint                   0               0             0.00            0.00
Writes deferred                   3714516          260668          4344.46           12.91
LRU skips                              11               1             0.01            0.00
LRU writes                              0               0             0.00            0.00
APW enqueues                           11               1             0.01            0.00
Primary buffer pool hit ratio:  99 %
 

TheMadDBA

Active Member
I am going to go out on a limb and say you have some query/index issues. That is an average of 430 reads per second per connection over almost 15 minutes. Every connection isn't going to be active all of the time and most should read much less than that for most common business functions. There are probably a few processes that are reading an obscene amount per second.

ProTop will help you find the active tables and indexes (by db totals and per user) to start looking into the code to find the issues.

Short term increasing the -B value will help out some, but if the code is bad enough you won't have enough memory to cover everything.

Double check to make sure the AIW is running because it isn't doing any work apparently. Check the values for -aibufs to make sure it is actually set since you are getting AI buffer waits.
 
Top