Locking errors / "physical redo" slowness

bjag

New Member
Hi, folks

I recently converted the database from Type 1 to Type 2 storage areas using a binary dump & load. The database had already been upgraded to 10.1B SP1 last fall and was running fine until the D&L. Now we receive daily locking errors on one heavily used table. The errors occur sporadically and we cannot replicate the problem on any test database, as the issue appears to be related to database load. Nor are any errors generated in the .lg file or anywhere else on the server; they show up only in the user's Windows session. The user gets a pop-up saying "xxx table is locked. Record create failed. (328) (6527)" OR, when updating: "FIND FIRST/LAST failed for table xxx (565)".

Using the -lkrela startup parameter (to use the old locking algorithm) did not fix the problem.
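
(For anyone following along: we pass it at broker startup, roughly like this - the database path, port, and buffer value below are placeholders, not our real settings:)

    proserve /db/prod/mydb -S 20001 -B 20000 -lkrela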

Also, we are now seeing extreme slowness during the "sanity check" at database startup, hanging at the "Physical redo" stage. This can sometimes take up to an hour. Truncating the bi helps (we do it weekly anyway), but the bi builds up very quickly in between truncates. Before the D&L this step took only about a second, regardless of whether we had truncated the bi. I suspect this issue is related to the locking issue, since both surfaced after the D&L to Type 2 storage areas, but I can't seem to find the cause - or a solution.
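
(The truncate itself is just the standard proutil step run against the offline database; the path below is a placeholder:)

    proutil /db/prod/mydb -C truncate bi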

Any light that any of you gurus could shed on these issues would be greatly appreciated!
 
There were a number of issues related to locking algorithms addressed in the 10.1B service packs. 10.1B03 got most of them, but the "PURGE LOCK" issue remained. 10.1C fixes all of the issues that I (currently) know of. I would upgrade to 10.1C and go from there.
 
Tom,

Thanks for your response. Yes, we're aware that there are some bugs with 10.1B SP1, and had hoped that using the old locking algorithm (-lkrela) would do the trick.

Unfortunately, the vendor has not tested their application on 10.1C, and as such, cannot recommend moving to it yet. And, their other customers have not seen this issue. Moreover, the problem with the slow "Physical redo" has not been identified as a bug in our version (at least, not that I am aware of). Hence, we're still thinking that there could be an issue with our database, and not just buggy software.
 
Slow physical redo is a well known problem in all releases and on all platforms.

Only some people experience it. There doesn't seem to be any rhyme or reason to who does and who doesn't -- which makes it difficult for Progress development to get a handle on a cure.

I, personally, don't see it very often. My own opinion is that it is probably at least partly related to the quality of the IO subsystem -- brand names are nice but bad configuration will easily turn a million dollar SAN into a boat anchor.

As for vendor endorsement of specific releases... my opinion is well known :awink:
 
I wish it were that simple. Our disks may be of the "boat anchor" variety, but they're the same "boat anchors" that we used before the D&L. And, the slowness follows the database. As soon as I restore that data to a new database on another server, it starts having the same slow "Physical redo" phase as the original database.

The database is still in the structure that I had to implement after the D&L, with the datafiles on one disk and the .st and .db file on the original disk with the old database. Do you think that could be an issue?
 
I doubt it. But if the problem follows the db as you say then it is easy to test. Just restore into a more "proper" configuration.

Have you truncated the bi file since the d&l? That sometimes helps. (Make sure that the bi cluster size is reasonable too...)
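
If you want to double-check what the new db ended up with, the describe qualifier (assuming your proutil has it; the db path below is a placeholder) reports the bi block and cluster sizes:

    proutil /db/prod/mydb -C describe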

In any event I'd also be making serious plans to clean up your whole storage area design now that you're in a somewhat better place to do so.
 
Tom,

Thanks so much for your response! One detail that I didn't mention: truncating the bi helps immensely - for about 1 day. We do it after our weekly cold backups, and the database comes back up very quickly afterwards. But if I have to restart the database thereafter, it takes much longer - unless I truncate the bi first. Sometimes just truncating the bi takes 20 minutes. Could it be that something is preventing the data from being written from the bi to the storage areas in a timely fashion, and that the bi is filling up as a result? Whether this same issue could also be behind our locking problems is another question.

At this point, I'm considering creating a separate Type 1 storage area for the table that has the locking problems, to rule out any glitches with Type 2. Do you think this could help?

Thanks again,
Brenda
 
Is the problematic table in a mixed storage area? Or by itself? What about its indexes?

I doubt that moving it to a type 1 area will change anything. If it does then I would expect that moving it to an equally isolated type 2 area would have the same impact.
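
If you do run that experiment, the isolated type 2 version is just a dedicated data area (and ideally a separate index area) in the structure file, followed by a prostrct add and a tablemove. A rough sketch - the area names, numbers, records-per-block, cluster sizes, extent paths, and sizes below are all made up, and "xxx" stands in for your table:

    # hypothetical additions to the .st -- adjust names, numbers, and sizes
    d "xxxData":20,64;8 /db/areas/xxxdata.d1 f 1024000
    d "xxxData":20,64;8 /db/areas/xxxdata.d2
    d "xxxIdx":21,32;8 /db/areas/xxxidx.d1 f 512000
    d "xxxIdx":21,32;8 /db/areas/xxxidx.d2

    prostrct add /db/prod/mydb add_xxx.st
    proutil /db/prod/mydb -C tablemove xxx "xxxData" "xxxIdx"

The ";8" (blocks per cluster) is what makes those areas type 2.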

Are there active transactions when you shutdown? That might have more to do with the long redoes than anything else.

Also -- I would probably treat this as two distinct problems.
 
The long redo is partly tied to PSC's not doing an fdatasync on shutdown, so on startup it redoes the last 2 bi clusters, unless a TX spans clusters, in which case it goes backwards through the clusters until it comes up clean.

10.1C does an fdatasync on shutdown, and only does the last 2 op bi clusters on startup.

One possible solution in the OP's case is to make the bi cluster size smaller, so the redo phase will finish sooner. However, this could have an operational performance impact, in that the system may checkpoint more often when under load.
 
Thanks for your answers, Tom & Tim!

I believe the problem with the "slow physical redo" has been resolved. I discovered that the old Progress 10.1B database (using Type 1 storage) had used a bi cluster size of 12,288 KB. But when I re-created the new database in preparation for the load to Type 2 storage areas, I had taken the default bi cluster size of 512 KB. When I resized the bi cluster to 12,288, I saw much faster "physical redos" (from 1.5-8 minutes down to 1-5 seconds!). One would think that smaller bi clusters would make for faster physical redos, but just the opposite was true in my case.
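
(For anyone who hits this later: the resize is just the truncate with the cluster size given in KB, run against the offline database - roughly what I ran, with a placeholder path:)

    proutil /db/prod/mydb -C truncate bi -bi 12288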

Also, the "table locking" issue appears to have been resolved by using the -nosavepoint option in the client config files. Not sure why this works, and why the problem only manifested itself right after converting to Type2 storage areas. FYI - we had upgraded from 9.1C to 10.1B01 almost a year ago, w/o problem, but were still using Type1 storage areas. When I did a D&L to go to Type2 storage areas, we starting seeing the locking problem immediately afterwards. Seems like quite a coincidence, but Progress support felt that this issue was unrelated to database change.

Why we suddenly started seeing the problem is still a mystery, but we haven't had a "table locked" error message for a couple of days now - a marathon for us, as we had been having a flurry of "locked table" complaints several times a day ever since going to Type 2. Life is good. :biggrin:
 