SAN changed - performance dropped

RealHeavyDude

OpenEdge 10.1C on Sun Solaris 64-bit, database block size 8K

After capacity management switched one of our production servers to a different SAN, performance dropped dramatically, to roughly half: the backup and almost all batch jobs that load data into the database now take twice as long to complete.

This is the response I got from the so-called specialists:
EMC analysis:

- The analysis resulted in the following.
- EMC has recognised that the V-Max has a problem handling I/O size with Meta device larger than 128KB.
- Large read I/O block size is causing prefetch not to work properly.
- Therefore they recommend an upgrade to 5874.242.191.
- Also they recognized that the devices have +50ms response time, which looks like they are doing I/O with a block size of 1MB.

Summary:

- The I/O problem is related to all servers attached to V-Max with I/O rates higher than 128KB.
- Problem is known by EMC and is fixed by M-Code 5874.242.191 or higher.


Recommendation:

- 1. We should change the I/O block size on the application to recommended 256 KB (should be the easiest fix)
- 2. Upgrade V-Max to 5874.242.191, which should solve the problem (will be released end of Q1)

Now I wonder where would I be able to change such an I/O setting to the recommended 128 KB in my OpenEdge environment?

Thanks in advance, RealHeavyDude.
 
There is no such setting in OpenEdge. The 1MB IO ops are probably coming from the OS rather than Progress. Are the filesystems striped at the OS level?
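One way to see which devices are issuing the large requests is to divide throughput by request rate per device, for example using the kr/s and r/s columns that `iostat -x` reports on Solaris. A minimal sketch of that arithmetic follows; the device names and figures are made-up placeholders, not measurements from this system, and the 128 KB threshold is simply the size named in the EMC analysis.

```python
def avg_io_size_kb(kb_per_sec: float, ops_per_sec: float) -> float:
    """Average size of one I/O request in KB, from throughput and request rate."""
    if ops_per_sec <= 0:
        return 0.0  # idle device: no requests to average over
    return kb_per_sec / ops_per_sec


# Hypothetical (KB read per second, reads per second) pairs per device.
samples = {
    "dev_a": (51200.0, 50.0),
    "dev_b": (6400.0, 800.0),
}

THRESHOLD_KB = 128  # size the EMC analysis names as problematic

for name, (kbps, ops) in samples.items():
    size = avg_io_size_kb(kbps, ops)
    flag = "above" if size > THRESHOLD_KB else "at or below"
    print(f"{name}: {size:.0f} KB per I/O ({flag} {THRESHOLD_KB} KB)")
```

With these placeholder numbers the first device averages 1024 KB per read and the second 8 KB. An average close to the database block size would point at the database doing its own I/O, while something near 1 MB would suggest the OS or volume manager is combining requests.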

Your reported performance problems appear to be related to updates.

The EMC response is focused on read performance. They're right -- 50ms sucks and should be fixed.

I'm shocked, shocked that they have even looked at the problem -- so even though it completely misses the point it's a step in the right direction. (Yes, I'm just a wee bit cynical about SAN "engineering" teams.)

If I had to guess, which I do, I would guess that the "capacity management" change was to switch you to RAID5 storage. Someone probably looked at your average IO rates and decided that they didn't warrant the throughput that you had (actually the cynic in me says that they probably didn't think about throughput at all -- they probably decided that you were "wasting space"). They apparently forgot to take your peak update activity into account.

Recommendation #1: roll back to the old configuration.
Recommendation #2: require realistic performance testing before implementing future SAN changes.
 
Hello Tom,

thanks for your swift reply - really appreciate it.

Regarding RAID 5: It is RAID 5 and always was, because that's the standard layout for all SANs at my company. I need to check back with the operating team whether the SANs are striped at the OS level.

But I am very lucky: mostly we run Oracle databases in our company. We use the OpenEdge database and we were the first to complain about performance drops, therefore nobody really cared much and blamed you guess who. But now the situation is completely different as the Oracle DBAs are complaining too. It's somewhat frustrating arguing with capacity management on behalf of a niche technology, but now the problem affects the mainstream ...

Thanks again for your thoughts and Best Regards,
RealHeavyDude.
 
Finally ...

Having capacity management switch back to the old SAN restored performance. At least that proves that the issue is either the new SAN itself or its configuration.

Thanks and best regards, RealHeavyDude.
 
It only took them 6 weeks? That's not bad. Most times these sorts of outfits will never roll back or it will take 6 months to a year of user hell before they will do anything.
 
Hello Tom,

believe it or not - the times they are changing ...

For our company that means that more and more IT departments have to stick to the Golden Rule - those who have the Gold make the rules. In other words, all we do in the scattered IT departments of this company is paid for by the business and must support it. Somehow managers of capacity management became aware of that (don't ask me how), and for sure they don't want to be held responsible for violating a service level agreement affecting B2B relationships. Having the power of around 700 business people using your applications behind you, in this case, was a good thing.

At least I feel I can recognize kind of a change in attitude. :cool:

Regards, RealHeavyDude.
 