Answered: Why Group Tables By RPB Instead Of Usage?

nix1016

New Member
I was wondering: what is the reason that, when designing Type II storage areas, we have to group tables with similar RPB into the same storage area rather than considering their relationships? For example, if I have invoice headers and invoice charges, which will almost always be read together, should I not put them in the same storage area despite them being quite different in record size?

Another question I have is about a particular table whose mean record size is inflated by some records having a very large 'notes' field (basically at the 32KB limit). If I don't count those records, the mean record size is around 300-400 bytes, but with the notes inflation it pushes past 1,100. Those are not anomalies and will happen every now and then. Should I size the RPB based on the inflated mean, or should I ignore those records and size based on the average record size of the majority, since there will be fragmentation anyway?

Thanks in advance!
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
I was wondering: what is the reason that, when designing Type II storage areas, we have to group tables with similar RPB into the same storage area rather than considering their relationships?
Actual RPB for a table will be determined by record size, database block size, storage area cluster size, and the create and toss limits. So trying to assign objects by RPB would lead to a circular process, and would be a waste of time.

Perhaps what you meant was "Do I need to assign tables to storage areas based on their mean record size?" The simple answer is no. You could choose a very simple structure like this, with all tables in one area and all indexes in the other:
Code:
d "data":7,128;8 .
d "index":8,128;8 .
That might be fine for you. Or it might not. There's no one structure that will work well for everyone.

The motivation for relating record size to RPB is that if RPB is too high it can cause record fragmentation, especially if your application creates records at a small initial size (e.g. with optional character fields like comments initially blank) and then grows them with subsequent updates. And if RPB is too low then each block may still have lots of empty space when the row directory is full and no more creates are allowed. That means the data is stored inefficiently and this wastes space in the buffer pools and unnecessarily reduces the ratio of logical to physical I/O.
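To make that trade-off concrete, here is a rough sketch of the usual rule of thumb (illustrative Python, not official Progress guidance; the power-of-two rounding and the 256-row cap reflect OpenEdge's valid RPB settings, but the formula itself is just the common "block size over mean record size" heuristic):

```python
# Rule-of-thumb RPB estimate: divide the DB block size by the mean record
# size, then round up to the next power of two, capped at OpenEdge's
# maximum of 256 rows per block. Purely a starting point for analysis.
def candidate_rpb(block_size, mean_record_size, max_rpb=256):
    ideal = block_size / mean_record_size
    rpb = 1
    while rpb < ideal and rpb < max_rpb:
        rpb *= 2
    return rpb
```

With the numbers from this thread and an 8KB block: `candidate_rpb(8192, 350)` gives 32, while sizing for the inflated 1,100-byte mean gives 8.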

For example, if I have invoice headers and invoice charges, which will almost always be read together, should I not put them in the same storage area despite them being quite different in record size?
I can't think of any reason to group tables in a storage area by relationship. You may know your tables' relationships, but the OE RDBMS doesn't. And given that each table in Type II storage has its own discrete cluster chain, doing reads in Table A doesn't make it more efficient to do reads in Table B, regardless of which area they reside in.

Should I size the RPB based on the inflated mean, or should I ignore those records and size based on the average record size of the majority, since there will be fragmentation anyway?
If possible, you may want to consider changing your data model, creating a child table for the notes field with a foreign key like a GUID relating the notes record back to the parent record. That might change the distribution of record sizes in the parent table to be much more uniform. It could also make the application more efficient if the user doesn't always need to see the notes entry, especially if the application is client/server. But that approach does entail schema changes, code changes, data migration, and some detailed knowledge of the application, so it may not be viable for you.

But if you're going to have a table with very large records but a large variance in record sizes, I would say go with a higher RPB. With multi-thousand-byte records, you're going to have record fragmentation anyway no matter what you do. I assume this table has many records (millions), otherwise it may not matter much one way or the other.
 

TomBascom

Curmudgeon
For small databases it isn't worth bothering with anything fancy. Rob's simple "data" and "index" suggestion is fine.

When you say that you are grouping by RPB I think that you probably mean that you are first computing a value for the "ideal" RPB probably based on the mean record size and the target db block size.

If that is correct then the reasoning is fairly simple -- the knobs that you have available to twist are all "technical". None of them are "functional". Grouping objects into storage areas by function doesn't have a consistent relationship to rows per block, blocks per cluster, create limit, toss limit or anything else.

My personal rationale for grouping by RPB is that if you have correctly calculated that value and your data is reasonably well behaved then you get a storage area that does not "waste" a lot of potential recids. I personally find this useful when randomly selecting recids to perform statistical analysis on them.

I suppose that it also makes it so that the create and toss limits can potentially be set for the area as a whole rather than table by table (if in fact you need to adjust them). If your tables are wildly different then the area wide setting is very unlikely to be correct.

Other than that I don't think there is much of an advantage these days.

Those are pretty small advantages. But they are actual advantages whereas grouping by function has no usefulness at all that I've ever come across.

Also -- Rob is right. That notes field ought to be split off into its own table.
 

nix1016

New Member
Thanks Rob and Tom for these detailed explanations, and yes, I meant mean record size when I said RPB. It makes a lot more sense to me now.

The reason I asked was that I thought grouping by functionality would help: when the database retrieves clusters of record blocks for the first table, they might also contain blocks of the related table, so that when it's time to retrieve data for the second table there would be a better chance of those blocks already being in the buffer?

In regards to the notes field, it actually sometimes grows bigger than 32KB, but we are currently trimming it down in order to fit it into the character field. So to resolve that, I was thinking of introducing a CLOB field in the same table and storing it in a separate LOB storage area with an RPB of 1 and blocks per cluster of 1. There will still be a large variance in size for that particular field, but it fixes the most important issue of the values being capped. Is this the best way to do this?

And yes, the table does contain millions of records and will be in its own area with its indexes separated out.

Thanks!
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Using LOBs introduces its own challenges and sometimes, judging from the experiences of others, lasting regret. But it does give a programmer the option of reading the rest of the record without reading the large contents of the LOB field. I've never used LOBs personally, so I can't tell you whether they are the best approach for you. Be aware that unless you are on OpenEdge 11.6 or later, you will have no information about your LOB fields in your dbanalys report. Don't put any other objects in the LOB's storage area.

Don't use blocks per cluster of 1; that is the defining characteristic of a Type I area. You never want to make any area Type I with the exception of the Schema Area, where you have no choice. BPC for data areas should be chosen based on the anticipated growth rate. If you expect an area to grow quickly, use 512 blocks/cluster. If it will be relatively small and static, use 8. When a variable-size extend grows, it does so one cluster at a time (though if I'm not mistaken, the minimum allocation is 64 blocks). During the extend operation the client(s) waiting on the I/O to complete are waiting synchronously, so they are temporarily "frozen". Choosing a larger cluster size causes these allocations to happen in larger contiguous chunks (which may later optimize sequential reads) and less often. The trade-off with large clusters is the potential waste of disk space. Each object takes at least one cluster, even if it is logically empty. With 8 KB DB blocks and 512 blocks per cluster, each object uses a minimum of 4 MB of disk space. The average waste will be half a cluster per object.
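The arithmetic at the end of that paragraph can be sketched like this (illustrative Python; it just restates the sizes quoted above, nothing OpenEdge-specific beyond them):

```python
# Minimum on-disk footprint per storage object for a given cluster size.
# Each object occupies at least one full cluster, and the average waste
# per object is roughly half a cluster.
def cluster_footprint(block_size_kb, blocks_per_cluster):
    cluster_kb = block_size_kb * blocks_per_cluster
    return {
        "cluster_size_kb": cluster_kb,
        "min_per_object_mb": cluster_kb / 1024,
        "avg_waste_per_object_mb": cluster_kb / 1024 / 2,
    }
```

For 8KB blocks at 512 blocks/cluster this gives a 4 MB minimum per object, versus 64 KB per object at 8 blocks/cluster, which is the growth-rate trade-off described above.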

The reason I asked was that I thought that grouping by functionality would help in that when it retrieves clusters of record blocks of the first table that it might also contain blocks of the other related table so that when it's time to retrieve data for the second table, there would be more chance of those already being in the buffer?
In a Type I storage area, a single RM (record) block may contain records from different tables. So in a multi-table area you could accidentally cache records in Table A by reading records from Table B. But if you think about it, that also means the data in Table A is logically scattered, so it occupies more blocks on disk than should be required as a minimum. In other words, reads from such an area are likely to be less efficient as opposed to a Type II area, all other things being equal. Never use Type I areas!

In a Type II area, every storage object (I'm ignoring multitenancy and table partitioning for the sake of simplicity), i.e. every table, index, or LOB column, comprises a chain of one or more clusters, and each cluster comprises a physically contiguous collection of blocks. The number of blocks in a cluster is configured at the area level. So each block, and the cluster it is a part of, contains data from only one storage object. And while allocations happen at the cluster level, reads still happen at the block level; the block is the atomic unit of database I/O. So your "FOR EACH invoice_header..." code will cause invoice_header record and index blocks to be cached, on demand, but it will never cause invoice_charge records or index keys to be cached because they are stored in separate blocks.
 

nix1016

New Member
Thanks Rob, I was missing the part that the data retrieved would not contain records of other tables, so it makes perfect sense to me now why grouping by functionality would not work.

With the cluster size, does the size of the records affect it? I was told to use a cluster size of 1 for LOB areas since they tend to hold very large records. But then again, in this case these 'notes' will not always be large, so I guess it makes sense to use a different cluster size (which should be 512 for this table, as it grows heavily). This would mean I would have to create a separate LOB area for this field, since the rest of the fields in the existing LOB area will most likely be large.

If not using LOB areas, how do you deal with character fields that are larger than 32k?
 

Rob Fitzpatrick

ProgressTalk.com Sponsor
Again, I'll wait for someone with experience to weigh in on LOB area best practices.

If not using LOB areas, how do you deal with character fields that are larger than 32k?
A record can't exceed 32KB in size, nor can a character field.
 

RealHeavyDude

Well-Known Member
There is always debate about whether it makes sense to store LOBs ( like large chunks of character data or office documents ) in the database. I don't want to get into the pros and cons here. Personally I am not a big fan of it, because you might end up with a database of which 2/3 or more is occupied by LOB data ... an RDBMS is not a perfect storage for documents. But then again, it can make sense to have all data related to a business process covered by one disaster recovery strategy ( the one for the RDBMS ).

Nevertheless, if you do store LOBs in the database, put them in their own distinct storage area. In the database schema ( data definition ) there is a property for LOB-type fields: Area - use it!

Heavy Regards, RealHeavyDude.
 

Cringer

ProgressTalk.com Moderator
Staff member
RHD is right - if you go down the LOB route then they MUST be in a distinct area that has only LOBs and nothing else. We fell foul of missing that in a previous job and it's a nightmare.
My thoughts on LOBs:
If they aren't going to take up much space in relation to the size of the DB itself then they have big benefits so long as they are defined correctly and in a distinct area.
BUT, if they are going to be large then you are creating a world of hurt for yourself. At my previous employer we used LOBs to store files that were attached to records, so PDF invoices and the like. All in all, for a 400GB database we had over 200GB of LOBs. As a result any database maintenance such as dumping and loading became a lot harder than it needed to be.
I did leverage the .Net capabilities of Progress to compress the files on the way in to limit the bloat, but it was only short term.
Ultimately we were using the database to work as a file server which it's not designed for. We spent a lot of time and effort redesigning things so that we actually reverted to a file server for the files that was locked down - only the Progress application could access it - and then we used AppServer calls to retrieve and store the files which were named using a naming convention to make them unique, but to bind them to the record they were associated with.
The actual effort of getting 200GB+ of LOBs back out of the database makes your eyes water.

So - don't do it unless it actually makes sense to store the info in the database.
 

TomBascom

Curmudgeon
"Notes" fields are usually reasonable things to implement as a LOB. They are fairly obviously related to particular records in the application and the referential integrity advantages are obvious.

PDFs etc maybe not so much.

If you do create LOBs -- yes, they need to go in a dedicated type 2 storage area. You probably want rows per block to be 16 or so (think small fragments at the tail end of the LOB). I would use 512 blocks per cluster.

If you stick with plain old character fields and need to exceed 30k then you need some sort of continuation flag and a counter field.
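The continuation pattern can be sketched like this (Python for illustration only; a real implementation would be ABL, and the `note_id`/`seq`/`more` field names are hypothetical stand-ins for whatever your schema uses):

```python
# Split oversized text into chunks that each fit in a character field,
# storing each chunk as a separate record with a sequence number and a
# continuation flag; reassemble by sorting on the sequence number.
CHUNK = 30000  # stay safely under the ~32KB record limit

def split_note(note_id, text):
    chunks = [text[i:i + CHUNK] for i in range(0, len(text), CHUNK)] or [""]
    return [
        {"note_id": note_id, "seq": n, "more": n < len(chunks) - 1, "data": c}
        for n, c in enumerate(chunks)
    ]

def join_note(records):
    return "".join(r["data"] for r in sorted(records, key=lambda r: r["seq"]))
```

A 70,000-character note would become three records (two full chunks plus a 10,000-character tail), with `more` set on all but the last.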
 

nix1016

New Member
Yes, we plan to use it for the notes field only, no document storage as of now and, by the sound of it, probably ever!

Thanks for your suggestions Tom, it sounds like LOB might be the easier way to go for us. Cheers!
 