Rob Fitzpatrick
ProgressTalk.com Sponsor
I have an idea for a possible DB enhancement that's been rattling around in my head for a bit. I thought I'd write it here, as it was inspired by an idea from Tom Bascom, to solicit feedback from Tom and others. Also, writing it down forces me to organize my thoughts, so if it does have merit I'll have a better chance of writing a coherent enhancement request on the Community site.
I'll start with Tom's idea: a "tertiary buffer pool". I'll admit I didn't understand the concept (or rather, the utility of it) when I first heard it. But it eventually sank in. I believe the idea can be stated as follows: when you read a huge table, especially historic data that isn't otherwise useful, say for a monthly or annual report, you're dragging all that RM and IX data through the primary buffer pool and evicting the data that is useful to the other DB clients, temporarily harming their future performance until the cache "warms up" again with their data. But if you had a way of specifying that reads from Huge Table X always (or maybe on demand, programmatically) go to a tertiary buffer pool, then the useful contents of the primary buffer pool wouldn't be flushed when you read big swaths of Huge Table X. However, on the downside, you would still have to manage this pool manually, you would have to know the behaviour of your application code/BI queries in advance, and you would have to pre-allocate some amount of memory for this buffer pool on startup even if you weren't going to use it often. So there are some challenges with this approach (if I understand it correctly). And one could argue that private read-only buffers could also address this use case.
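To make the eviction problem concrete, here's a rough toy simulation (plain Python, nothing OpenEdge-specific; the pool sizes, block counts, and table names are all made up) showing how a single shared LRU pool loses its hot working set to a big scan, while routing the scan to a separate pool leaves it intact:

```python
# Toy LRU cache, not the OpenEdge buffer pool -- illustration only.
from collections import OrderedDict

class ToyLRUPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # most recently used block at the end

    def access(self, block):
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)
        else:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # evict the LRU block
            self.blocks[block] = None
        return hit

hot_set = [f"hot-{i}" for i in range(500)]          # the other clients' working set
scan    = [f"huge-table-{i}" for i in range(10_000)]  # the monthly report's scan

# Case 1: everything goes through one shared pool.
shared = ToyLRUPool(1000)
for b in hot_set:
    shared.access(b)                 # warm the cache with the useful data
for b in scan:
    shared.access(b)                 # report drags Huge Table X through the pool
hits = sum(shared.access(b) for b in hot_set)
print(f"shared pool: {hits}/{len(hot_set)} hot blocks still cached")    # ~0

# Case 2: the scan is routed to its own (tertiary) pool.
primary, tertiary = ToyLRUPool(1000), ToyLRUPool(200)
for b in hot_set:
    primary.access(b)
for b in scan:
    tertiary.access(b)
hits = sum(primary.access(b) for b in hot_set)
print(f"separate pool: {hits}/{len(hot_set)} hot blocks still cached")  # 500
```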
That got me thinking about caching in general, and the contrast between caching techniques used in OSes and DBs. There are some similarities between the two, e.g. demand-based paging, the use of shared memory, the Windows kernel's modified page writer thread, which is sort of analogous to an APW, etc. But there are obvious differences too. Modern OS memory managers use a lot of optimizations like pre-fetching. When an OS or an application starts, there are lots of files (drivers, libraries, executables, etc.) that are always needed from disk, and often in the same order, so their layout on disk can be optimized and they can be pre-fetched into memory so that by the time they are first accessed by the code it's a cache hit. DB read I/O is in some ways a lot less predictable, e.g. because the application follows different code paths depending on code changes, data changes, user behaviour, or other factors the DB can't reliably predict or model.
All the server knows with certainty is what the client wants right now; pre-fetching (i.e. speculative reading) is hard to do well. If the client is in the fifth iteration of a FOR EACH customer block, he might want to read the sixth customer record next (and very soon), but he also might hit a RETURN, QUIT, PUBLISH, RUN, LEAVE, etc., and not want to read that sixth customer until an hour from now, or maybe never. And if the server did speculatively page in the next IX block in the bracket chosen for the query, or the RM block containing the next rowid after the current one, and the client never requested it, then (a) that would have been an expensive wasted physical I/O to get it into the buffer pool, (b) it would result in a big increase in memory pressure, paging out valuable cache contents (as the other servers would be behaving the same way), and (c) it would take on the order of -B other unique buffer accesses until this probably-useless block and others like it were flushed out of the buffer pool.
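As a back-of-envelope way to weigh those three costs: a speculative read only pays off if the chance of the block actually being requested soon outweighs the wasted I/O and cache pollution when it isn't. Something like the following, where every number is invented purely to illustrate the trade-off:

```python
# Invented numbers, purely to illustrate the trade-off in (a)-(c) above.
def expected_gain_ms(p_use, latency_saved=5.0, wasted_io=5.0, pollution=2.0):
    """Expected benefit of one speculative block read.

    p_use         -- probability the block is actually requested in time
    latency_saved -- synchronous read latency hidden when the guess is right (ms)
    wasted_io     -- physical I/O spent for nothing when the guess is wrong (ms)
    pollution     -- rough cost of evicting a useful block on a wrong guess (ms)
    """
    return p_use * latency_saved - (1 - p_use) * (wasted_io + pollution)

print(expected_gain_ms(0.9))   # batch client deep into a query:   +3.8 -> worth it
print(expected_gain_ms(0.3))   # interactive client, 2 records in: -3.4 -> not worth it
```

With these made-up numbers, prefetching wins for the client that is very likely to keep reading and loses for the one that probably won't, which is exactly the distinction in the next paragraph.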
Now imagine that a server processing a particular query for a given client can selectively tune its read behaviour, not unlike the self-tuning an APW does. If a client reads two records quickly in a FOR EACH or QUERY, it may or may not request the third and subsequent records equally quickly; the probability is relatively low, especially if it's an interactive client. By contrast, if it is a batch client that has barrelled through 999 records in a query, there is a much higher probability that it will equally quickly request the 1000th, 1001st, etc. So if the server can see this behaviour and start to pre-fetch potentially useful blocks, say reading the next IX block and the next RM block or two that are referenced by the rowids in that IX block, then maybe by the time the client requests those records they will already be in the buffer pool and it won't have to wait synchronously on those physical I/Os. If the server kept guessing correctly that a block would be requested, and it was (within some time frame), it could increment a "confidence score" about the usefulness of pre-fetching for that query, and maybe become more aggressive with it over time. And if it guessed wrong (as it must, as all queries eventually end), it could decrement or zero out the score and revert to demand-based paging for that client. I suppose the database would need a way of distinguishing speculatively-read blocks from demand-paged blocks in the buffer pool, and a way of determining the interval between a speculative block read and a data read from that block to score the block read as having been valuable. Obviously these comments about servers also apply to self-service clients acting as their own server.
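Here is one way the confidence-score idea could look, as a minimal Python sketch. The class and method names (SequentialPrefetcher, on_demand_read, etc.), the thresholds, and the shortcut of treating "next block" as block + 1 are all my own stand-ins, not anything the engine actually exposes:

```python
import time

class SequentialPrefetcher:
    """Per-query read-ahead that gets more aggressive as its guesses pay off."""

    def __init__(self, start_after=3, max_depth=8, use_window=0.5):
        self.start_after = start_after   # sequential reads seen before we guess at all
        self.max_depth = max_depth       # hard cap on how far ahead we read
        self.use_window = use_window     # seconds a prefetch has to prove useful
        self.run_length = 0              # consecutive "next block" demand reads seen
        self.confidence = 0              # net score of guesses that paid off in time
        self.last_block = None
        self.outstanding = {}            # block -> time it was speculatively read

    def _depth(self):
        if self.run_length < self.start_after:
            return 0                     # pure demand paging until a pattern appears
        return min(1 + self.confidence, self.max_depth)

    def on_demand_read(self, block):
        """Called when the client actually asks for a block; returns blocks to prefetch."""
        issued = self.outstanding.pop(block, None)
        if issued is not None:
            if time.monotonic() - issued <= self.use_window:
                self.confidence += 1     # guessed right, and soon enough to matter
            else:
                self.confidence = max(0, self.confidence - 1)
        # Is the client still walking blocks sequentially?
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = self.confidence = 0
        self.last_block = block
        # "Next blocks" here are just block+1, block+2, ... -- stand-ins for the
        # next IX block and the RM blocks its rowids point at.
        to_prefetch = [block + i for i in range(1, self._depth() + 1)]
        for b in to_prefetch:
            self.outstanding[b] = time.monotonic()
        return to_prefetch

    def on_query_end(self):
        """The query ended, so any still-outstanding guesses were wasted."""
        self.outstanding.clear()
        self.run_length = self.confidence = 0
```

The key behaviours are the ones described above: no read-ahead until a sequential pattern appears, deeper read-ahead as guesses keep paying off within the time window, and a reset back to demand paging when a guess misses or the query ends.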
Part of the problem, as I understand it, is that there are only two ways to access the LRU chain: you can start at the head (or MRU end) and traverse backwards, or start at the tail (or LRU end) and traverse forward. So a block that is added to the chain can only be added at the head, as happens today, or at the tail, which would be pointless as it would fall off almost immediately. But what if there were a third pointer into the chain, say 80% of the way (or whatever) toward the LRU end, where a speculatively-read (rather than demand-paged) block could be inserted into the chain? This would somewhat alleviate problem (c) above, as the cost of putting a useless block into the primary buffer pool would be waiting only about 0.2 * -B unique block accesses for that block to be paged out. And the most recently accessed 80% of blocks (probably, depending on your use of -lruskips) would be unaffected.
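A toy model of that third insertion point, again in Python with invented names; the chain is just a list ordered MRU-first, so "insert 80% of the way toward the LRU end" becomes an index calculation rather than a real pointer into a linked chain:

```python
class ToyBufferChain:
    """LRU chain ordered MRU-first: demand reads go on the front as today,
    speculative reads are spliced in near the LRU end."""

    def __init__(self, capacity, spec_point=0.8):
        self.capacity = capacity
        self.spec_point = spec_point   # fraction of the chain ahead of a speculative insert
        self.chain = []                # index 0 = MRU head, last index = LRU tail

    def access(self, block, speculative=False):
        if block in self.chain:
            self.chain.remove(block)   # hit: promote to the MRU end, like any block
            self.chain.insert(0, block)
            return True
        if len(self.chain) >= self.capacity:
            self.chain.pop()           # miss with a full pool: evict from the LRU end
        if speculative:
            # New speculatively-read block: insert ~80% of the way toward the LRU end,
            # so a wrong guess only has to survive ~0.2 * capacity more unique reads.
            self.chain.insert(int(len(self.chain) * self.spec_point), block)
        else:
            # New demand-paged block: goes on the MRU end, exactly as today.
            self.chain.insert(0, block)
        return False
```

With a capacity of 1000 (standing in for -B) and the insertion point at 0.8, a speculative block that never gets touched is evicted after roughly 200 further unique block reads instead of 1000, while a speculative block that does get demand-read is promoted to the MRU end like any other hit.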
I don't know how I/O is done in SQL Server or Oracle, and I don't imagine this is a novel idea (whether good or bad). But I have to think that there is, at least in theory, a more efficient way of caching data than to wait until the instant that the client requests it, as OpenEdge does today.
Thoughts?