issue during probkup

jcvolkl

New Member
I have an odd problem that I hope someone else has possibly encountered, though I have not found anything like this on this or any other forum.
*******
Progress v10.2a, SP01,HF02 (Enterprise )
3 servers involved are all Windows 2003, updated, and physical
Server 1 is DB server, runs the backup. Clustered with server 2, data and backups are on an attached SAN.
Server 2 runs the robot process program
Server 3 holds the hot folders that XML files are dropped to.
*******
We have multiple production plants, and are running 11 Progress DB’s. 3 of these locations receive a large number of XML files daily during the 4th quarter from a customer. I have robot processes (Progress program) that look for these XML files and create the orders in the ERP systems.
All day long this process just hums along, processing files as quickly as they come in for the most part, sometimes up to 15 files a minute into each of the 3 ERP’s.
I have a backup script that starts at 6pm and progresses through all 11 DB’s one at a time (excerpt here):
********************
call probkup online e:\hsi\data\pc\gams1.db e:\Backups\oa\pc.bak >> e:\Backups\oa\Pclog.txt
call probkup online e:\hsi\data\nave\gams1.db e:\Backups\oa\nave.bak >> e:\Backups\oa\navelog.txt
call probkup online e:\hsi\data\he\gams1.db e:\Backups\oa\he.bak >> e:\Backups\oa\Helog.txt
etc…
*********************
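To line up the slowdown against a specific backup step, it may help to record when each probkup starts and ends. Below is a minimal Python sketch of that idea, not the script actually in use. It assumes Python is available on the DB server and that probkup is reachable on the PATH; the names JOBS, build_command and run_timed are made up for illustration, and only the three databases shown in the excerpt are listed.

```python
import subprocess
import time
from datetime import datetime

# Database paths and backup targets copied from the excerpt above;
# only the three shown there are listed (the rest are elided in the post).
JOBS = [
    (r"e:\hsi\data\pc\gams1.db", r"e:\Backups\oa\pc.bak"),
    (r"e:\hsi\data\nave\gams1.db", r"e:\Backups\oa\nave.bak"),
    (r"e:\hsi\data\he\gams1.db", r"e:\Backups\oa\he.bak"),
]


def build_command(db_path, backup_path):
    """Return the same probkup invocation the batch script uses."""
    return ["probkup", "online", db_path, backup_path]


def _run_in_shell(cmd):
    # Assumption: probkup is a batch wrapper on Windows (the script uses
    # `call`), so it is started through the shell rather than directly.
    return subprocess.call(cmd, shell=True)


def run_timed(jobs, runner=_run_in_shell, log=print):
    """Run each backup in order and log wall-clock start, end and duration."""
    results = []
    for db_path, backup_path in jobs:
        started = datetime.now()
        t0 = time.time()
        exit_code = runner(build_command(db_path, backup_path))
        elapsed = time.time() - t0
        log("%s start=%s end=%s seconds=%.1f exit=%s" % (
            db_path, started.isoformat(), datetime.now().isoformat(),
            elapsed, exit_code))
        results.append((db_path, started, elapsed, exit_code))
    return results

# Usage on the DB server: run_timed(JOBS)
```

The logged start and end times can then be compared with the minute the XML processing slows down and the minute it recovers.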
Here’s the problem: a couple of minutes after the BU script starts, 1 of my DBs comes to a grinding halt. The DB stays up and those XML files will process, but instead of 15/minute it will sometimes take 20-30 minutes to process the same files that were being processed all day, right up until that time.
The DB in question is the 2nd DB in the backup script, yet it is affected as soon as the 1st DB BU starts, and continues to be affected until the last DB BU has completed. At the end of the BU script all Progress BU files are copied off to another server (using RoboCopy) where they are then backed up to tape, but like I say, the DB in question returns to normal processing once that last DB probkup completes; the RoboCopy process does not affect it.
The ERP application itself is affected during this time as well, but not as badly; it is mostly operable, though noticeably slow.
As I mentioned, there are 3 locations (3 separate DBs) that run this exact same process, on the same servers, of receiving and processing these same XML files, from this same customer. It is only this DB that is affected; the other 2 continue, during the whole BU process, to process those XML files at the standard 15/minute clip.
***********
A couple things I’ve tried or monitored:
*To prove my theory that it was the BU starting that was causing the problem, I moved the start from 6pm, to 630pm, and then again to 8pm. Each time this slowdown started 1-4 minutes after the BU script started.
*When this issue occurs I have checked that there is no unusual/excessive server activity happening - CPU and memory usage are low.
*ProMon – overall buffer hits are good, at least in the 95-98% range, but I can see that when this issue is happening, there seems to be little going on overall (when there should be a flurry of reads and writes to the DB).
*Tried commenting out the affected DB from the BU script, no effect (didn’t really expect this to help, but tried anyway)
*The DBs and the backup files created sit on the same SAN. Thinking there could be some conflict there, I changed the BU to create the files in a different location (not on the SAN) last night; still the same results.
*Neither the backup log files nor the DB log files reveal any issues.
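The "1-4 minutes after the BU script started" observation could be pinned down by counting processed files per minute. A rough Python sketch follows, assuming a list of timestamps at which the robot process finished each XML file is available from somewhere (a log, or order creation times); that source and the function names are assumptions, not something described in this post.

```python
from collections import Counter
from datetime import timedelta


def files_per_minute(processed_at):
    """Count processed files per calendar minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in processed_at)


def first_slow_minute(processed_at, backup_start, window_minutes, threshold):
    """Return the first minute at or after backup_start whose file count
    is below threshold, or None if every minute in the window is fine."""
    counts = files_per_minute(processed_at)
    minute = backup_start.replace(second=0, microsecond=0)
    for _ in range(window_minutes):
        if counts.get(minute, 0) < threshold:
            return minute
        minute += timedelta(minutes=1)
    return None
```

Running this for all 3 plant DBs over the same evening would show whether the other 2 really hold the 15/minute rate throughout the backup window. Note that a minute with no incoming files also counts as slow, so the threshold only means something while files are actually arriving.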
***********
Appreciate any thoughts. The big problem I will have at this point is testing anything. This customer is winding down for the season and will not be active until 4th quarter next year…will probably have to come up with some creative test plan.
Thanks. Happy Holidays!
 
My best guess would be that there is an IO bottleneck. Have you used perfmon to see the queue size on the disks that are holding the databases?
Does processing an XML file require writes to multiple tables and to more than 1 of the 3 databases? Do you collect table stats to see the read/write activity?
If the ERP performance isn't as bad, I would guess it does fewer writes.

Some DB stats from during the day, when 15 XML files are processed in a minute, and then stats from when it takes 15 minutes to process 1, would be good to review. Also the Windows perfmon stats from the same time periods.
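For the perfmon side, one option is to log a counter to CSV with typeperf, for example typeperf "\PhysicalDisk(_Total)\Avg. Disk Queue Length" -si 15 -o queue.csv, and then compare the daytime window with the backup window. The Python sketch below is only an illustration of that comparison. It assumes the CSV has one header row, a timestamp in the first column in MM/DD/YYYY HH:MM:SS.mmm form, and the counter value in the second column; the function names are made up.

```python
import csv
import io
from datetime import datetime


def read_samples(csv_text):
    """Parse typeperf-style CSV text into (timestamp, value) pairs
    for the first counter column."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # skip the header row holding the counter names
    samples = []
    for row in rows:
        if len(row) < 2 or not row[1].strip():
            continue  # blank line or a sample with no value
        # Assumed timestamp layout; adjust if the server's locale differs.
        stamp = datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S.%f")
        samples.append((stamp, float(row[1])))
    return samples


def summarize(samples, start, end):
    """Return (count, mean, max) for samples with start <= timestamp < end."""
    values = [v for ts, v in samples if start <= ts < end]
    if not values:
        return (0, None, None)
    return (len(values), sum(values) / len(values), max(values))
```

Calling summarize once for a midday hour and once for the hour after the backup starts gives two sets of numbers that can be posted side by side.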
 
Is the bi file exceptionally large for the problem db? It might need to be truncated.

10.2a -- old... they have been working on backup performance but 10.2a isn't going to get any of that.
 
CJ -
I am not a server guy but have resources available to help with interpreting the perfmon data, and will check during the day vs evening. Tonight (with the problem occurring as usual) the disk queue seems to be in the 350-450 range, with an occasional spike to 1900. The pages/sec are in the 2500-4000 range...unsure if either of these is good or bad?

Process writes to 1 DB, and maybe 4-6 tables for each XML - basically an order header, line and ship to, plus some other necessary records. I am not familiar with collecting table stats, so not sure how to answer that question.
Tom – bi is not unusually large, 274 MB. I could try truncating...I believe I did try this at one point, however.

I just keep coming back to it being so odd that this DB runs just fine all day until a backup on another DB starts, and yet the identical processes on another DB, on the same hardware, run fine even when the BU is running. Maddening!
At any rate, thank you both for your suggestions. I will look at the perfmon especially over the next few days and get some help with that info.
John
 
I think that disk queue is an issue. I would look to somehow lower the disk queue, whether by moving files or adding disks. Do the disks have write cache? That is probably another long discussion.

Pages/sec is paging activity: pages read from (or written to) disk to resolve hard page faults. Some of that is expected during backups or other heavy activity. During normal database processing, those numbers should drop because in theory the data needed is already loaded in the DB buffers.
 