Buffer hash Table and how it works

progress user

New Member
Hey guys,
Can any one please explain how buffer hash tables are used in openedge? Also how the BHT latch mechanism happens with in the bufferpool.
Any links shared about these parameters is also highly appreciated.
Thanks in Advance.
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
To start, some OpenEdge RDBMS storage hierarchy:
  • an OpenEdge database comprises a collection of logical containers called storage areas;
  • each storage area comprises a collection of one or more physical containers (files in the file system) called extents
    • exceptions:
      • the control area, aka dbname.db, which has only one extent
      • after-image areas each have one extent
  • each "storage object" is assigned to one and only one storage area
    • a storage object can be:
      • a table (non-partitioned)
      • an index
      • a BLOB or CLOB column
      • a table partition (in a multi-tenant or partitioned table)
      • each partition-specific b-tree of a partition-aligned index of a partitioned table, where it has been defined as a local index
  • in a Type II storage area (i.e. a data-storage area where the area type (_area._area-type) is 6 and the cluster size (_area._cluster-size) is 8 or 64 or 512), a storage object is a chain of one or more clusters of database blocks, only containing data for that one storage object
  • a cluster is a collection of logically-contiguous database blocks
  • all data-storage blocks (i.e. blocks in areas excluding the primary recovery area (aka before-image "file") and after-image areas) are the same size: the database block size (_area._area-blocksize)
All physical I/O in the database is done at the block level. If a client wants to read a particular index key, or sequence value, or record, the entire block containing that logical entity must be read into the buffer pool, if it is not already there. (A record may be fragmented, i.e. it may be in two or more pieces in separate blocks; when it is accessed, all blocks containing its fragments must be read into the buffer pool so the entire record may be assembled, written to a record buffer, and passed to the client.) Similarly, when a block in the buffer pool has been modified and must be written to disk, the entire block is written back to disk.

All logical I/O is done via the buffer pool. Clients and servers do not read or modify data directly on disk. Read requests go to the storage engine which fetches the necessary block(s) from disk and loads them into the buffer pool, at which point their contents may be accessed and returned to the requesting client. Similarly, writes are done by modifying the blocks in the buffer pool; those modified buffers are later written to disk asynchronously by an asynchronous page writer (APW) or server or self-service client.

As alluded to above, when a client wants to access a block, it is read into the buffer pool if necessary. Therefore the storage engine must have an efficient mechanism to determine whether a particular block is or is not currently in the buffer pool. (In this context, "the buffer pool" means either the primary or alternate buffer pool.) A block is uniquely identified by two pieces of information: the number of the storage area in which it resides, and its dbkey, which is a unique physical block address within that storage area.

One could imagine a naive mechanism by which a block could be found, by scanning the header of each block in the buffer pool, until either the desired block was found (a "cache hit") or the end of the chain of blocks in the buffer pool was reached (a "cache miss"). But clearly this would not be efficient, as the cost of such a scan would grow linearly with the size of the buffer pool (the sum of -B and -B2). Adding to that computational cost would be the concurrency penalty that no other process could be allowed to modify that chain of blocks while the requesting process was scanning the chain, to ensure a consistent state during the read operation.

Instead a data structure called a hash table is used. In this scheme, a collection of entities (in this case, database blocks) is categorized using metadata that is unique to each entity (in this case, the combination of storage area number and dbkey) and it is assigned a place (an "index") in the table. This unique value is passed through a hash function to determine the index value for that block. The entry in the hash table at that index value consists of a linked list of block headers. Each block header contains some metadata about that block in the buffer pool and its state (e.g. which queue it is on, if any, whether it has been modified, etc.).

When a server (or self-service client) wants to access a block, it already knows the block's area number (from its _storageobject record) and its dbkey. It can then compute the hash value for that block, allowing it to go directly to that offset in the buffer hash table. That hash table entry consists of a relatively short (more on that below) linked list of zero or more block headers that can then be scanned sequentially to find the desired block. If it is found, it is a cache hit and the data can be accessed from the buffer. If it is not, the block is not in the buffer pool and it must be read in by the storage engine (making room for it by evicting a buffer from the end of the LRU or LRU2 chain as appropriate, if necessary). Thus each buffer lookup in the buffer hash table (BHT) consists of a quick indexed lookup followed by a short, relatively fixed-length sequential scan, regardless of the size of the buffer pool.

The reason why the sequential scan of a given hash chain is relatively short is that the size of the BHT is a function of the size of the buffer pool. The size of the BHT is determined by the value of the -hash primary broker startup parameter. (While this means you can set it manually, you shouldn't; let the broker do it.) I don't know the exact formula for determining the default value of -hash, but it is something like the next prime number larger than (-B + -B2) / 4. In other words, the BHT has roughly a quarter as many entries as the buffer pool, regardless of the size of the buffer pool. Assuming a good hash function that gives a uniform distribution of hash values based on arbitrary inputs, this means that even when the buffer pool is full, each hash chain would have about four entries on average, making that sequential scan quick.

While this algorithm is vastly more efficient than the naive mechanism described above, it still requires consistent reads. But concurrency is much better because only the single hash chain being scanned must be locked by the process that is reading it, not the entire table. It does this by obtaining a lock on the BHT latch that is associated with that chain. In recent OpenEdge releases up to and including 11.7.2, there is a fixed number of 1024 BHT latches for the buffer hash table, so each latch protects about (-hash / 1024) hash chains.

But there can still be increased contention as the buffer pool grows, because the number of BHT latches does not, so the ratio of hash chains to latches increases. So if two users are trying to find two different buffers at the same time, but those buffers hash to different but adjacent hash values protected by the same latch, they will both try to lock it and (at least) one will have to wait.

An enhancement was added to OE 11.7.3 and later to address this potential problem by making the number of BHT latches configurable. It is now a percentage of -hash and that percentage is determined by the value of a new primary broker startup parameter -hashLatchFactor. It defaults to 10%. So imagine a database started with -B 6,000,000. In 11.7.2 or earlier, -hash will be roughly 1,500,000, or over 1400 hash chains per BHT latch. By contrast, the same database started with 11.7.3 or later would have about 150,000 BHT latches, a dramatic increase. This could help with concurrency in random data access, assuming a busy system that previously suffered from BHT latch contention.

Further reading:
OpenEdge 12.0 Database Performance and Server-Side Joins
http://pugchallenge.org/downloads2018/Banville_Performance.pdf

KB:
Is there a parameter to reduce the BHT latch contention?
https://knowledgebase.progress.com/articles/Article/000040267/p
11.7.3 startup parameter -hashLatchFactor
https://knowledgebase.progress.com/articles/Article/1173-startup-parameter-hashLatchFactor/p
 

progress user

New Member
Thanks a lot, Rob. That was really some good stuff about Hash tables.
 

TomBascom

Curmudgeon
Excellent post Rob.

With regards to when a DBA is likely to have to worry about the buffer hash table:

0) Latch contention requires very high concurrent activity levels. It is not a problem for small systems with modest workloads.

1) The BHT latch does not typically become a problem if LRU latching is occurring. Thus if you do NOT have -lruskips set to a non-zero value you probably will not see BHT contention.

1a) BHT latch contention is when you have "latch waits" on the BHT latch. Use the _Latch VST (or ProTop) to see these. Problems are "brewing" if you consistently see more than a handful of waits per second but the problems probably aren't really critical until you are seeing waits in the hundreds per second.

2) Once LRU contention has been dealt with you might see BHT latch waits. This is often associated with very high read activity levels on very small tables. Knowing which tables are driving that activity and which bits of code are involved goes a long ways. You can use the _tablestat and _indexstat VSTs along with client statement caching to find out. ProTop provides this information for you in a nice easy to use format ;)

2a) It is much better to address the underlying code design than to accept that there are valid business reasons to read a 20 record table billions of times per day.

3) If you actually have issues with BHT latch contention then it can make sense to adjust -hashLatchFactor and -hash. One approach is to try setting -hash to a prime number larger than -B. But don't expect miracles on the order of -lruskips. (IMHO lruskips is the best "go fast" option introduced since -spin...) . There is no benefit to doing any of this if you do not actually have a latch contention problem.

4) There is another related enhancement coming called "optimistic cursors". This will greatly reduce the need to lock buffers. I don't work for Progress and I cannot promise anything but I am expecting to see it in 12.x and hoping that x = 1.

5) Latch contention is a significant contributor to the problems with "too many cores". Latches are "spin locks" -- at a high level they are a tight loop that performs an atomic "test and set" operation on a memory location (the actual assembly language instruction varies but "test and set" is easy to talk about and understand). If you start poking into chip architectures and how caches work you will quickly see that such a loop suffers greatly when multiple physically distinct processors must coordinate coherent access to that memory location. In other words -- you want to minimize the number of cores that are involved and you definitely do not want them to have to communicate over a bus. The cost of that communication is *huge*. If you are suffering from these sorts of latch contention issues they can occur quite suddenly and the impact on users is like hitting a brick wall.

5a) This is why it can make sense to remove cores from a server (or VM) that has many more cores than it actually needs. IOW -- if you have a db server with 48 cores and 3% CPU utilization you should cut that down to 2 cores... you're not doing yourself any favors with all of those extra cores.

5b) Ideally you configure db servers and VMs with no more cores than the number that are on the physical chip. Once you have to communicate between sockets or, worse, between boards your users are going to be exposed to the potential for very poor performance. If your workload is light the OS will likely save you from yourself by keeping things running on a small numbers of cores (this is called "processor affinity") but if the workload increases the OS will not be able to do that for you. (And if the workload is light you are wasting money allocating too many cores.)

5c) Not related to latching, but related to "too many cores" and virtualization: if you configure a vm with more cores than it really needs you are making it difficult for the hypervisor to schedule it. The hypervisor has to find enough cores to run your VM. It is much easier to find 2 free cores than it is to find 48... plus you want those cores to be close together ("affinity")... If you have a VM that has lots of cores, hardly any workload but which seems to be running very, very slowly you may very well be suffering from this sort "co scheduling" problem. ProTop's bogoMIPs metric is a very handy way to detect this sort of issue.
 
Top