Not the best index selection in Progress 4GL
The case
Performance tuning and identifying bottlenecks in Progress applications is quite different than in other environments. Sometimes I think that it doesn’t work with any strict rules, logic or mathematics. In many cases it needs developer instinct, experience or hunch. In most cases developers need to know how database is designed.One of the most significant differences is that Openedge database does not offer statistics. Recently I’ve got a case where this functionality would be very useful. In most cases Profiler or xRef listing helps to solve almost all issues. Almost…
In short, what happens in my case? Got a very simple table that stores imported records.
Significant in this case will be three fields: BatchId, ImportOrder and State. Records were imported in batches and every record from same batch has the same BatchId. Each record from Batch got its own import order number as ImportOrder. State represents status of each record: IMPORTED, OK, ERROR… ect. Table, among others had combined index on idx_BatchId_ImportOrder and a single field index idx_State.
I wanted to report currently imported records that failed somehow and got ERROR status. Used for that simplest query.
I could expect that Progress will use idx_BatchId_ImportOrder index, or will combine indexes that have indexing those fields
Nothing more far from the truth. It uses idx_State and query runs awfully slow. For that, even order of conditions in query does not matter.
Tests
To confirm that this has an impact on performance I’ve done very simple test on a sample environment with 1000000 records iterating on them 100 times and each time asking for a different BatchId, getting around 900 records in each response. I’ve tested above presented code against this one:Average single query times were:
- With default index – around 950 milliseconds
- With USE-INDEX forced index – around 2 milliseconds!
Reason
How Progress selects indexes?First idea would be to use two indexes for that query, but it can be done only when all the components of each index are involved in equality matches. ImportOrder field does not take part in query, and that makes it impossible.
Similar reason prevents idx_BatchId_ImportOrder to be used as a single index. General rules from newest documentation are those:
Use the index specified in a USE-INDEX option.
If there is a CONTAINS clause (which is legal only for word indexed fields), use the word index.
If an index is unique, and all of its components are used in active equality matches, use the unique index.
Use the index with the most active equality matches. Equality matches are active if both of the following conditions are met:
- They apply to successive, leading index components.
- They are joined by ANDs (not ORs or NOTs).
- Use the index with the most active range matches. For a range match to be active it must stand alone or be connected to other selection criteria by ANDs. In addition, it must apply to an index component having any one of two properties:
- The component is the first or only one in the index.
- All preceding components in the index key have active equality matches.
If there is a tie-in other words, if multiple indexes have the same number of active equality, range, and/or sort matches-use the index that comes first alphabetically. If the PRIMARY index is one of the indexes in the tie, then use the PRIMARY index.
Use the primary index.
Idx_BatchId_ImportOrder was eliminated for that query by rule number 4. It was not considered for this case because it has 2 index fields and only one of them is used. Idx_State, by this rule has 100% useful fields, and that’s why it was chosen. For Progress it’s the best choice because it utilizes most index fields for query.
Now we know why, but could we somehow monitor our code to prevent that kind of situation?
Solution
I’m using xRefAnalys free tool by Matt Verrinder, which helped me on a great number of occasions, and which I’ve improved a little bit, but still; it didn’t raise a red flag that there was not the best index selected in that case. It marks queries red, but only if WHOLE-INDEX is used. In my case, it didn’t raise any significant warning because index was selected.xRefAnal.w
I’ve had thousands of “ERROR” status records from thousands of imports session, and only couple hundreds with the same BatchId. I had to look at it three times to find out that index selected by progress was the reason why it was so slow.
This brought me the idea to build a tool that could analyses data against query and suggest index that will use advantage of narrowest record distinctions. Of course, in my case, choice was quite obvious. I’ve expected three different values in status field (from which most of them would have PROCESSED value followed by ERROR) and very selective BatchId field.
But in other situations, solutions could not be so obvious, especially when developers do not know environment and database content in a way that it could help them. Solution for that could be counting distinct values in indexed fields and dividing that by total number of records. This will give us a sense of selectivity of index. More distinct values, quicker index operation, less data scanned.
Made a simple tool that scans whole table and counts distinctive values in fields used in indexes. Result of that is a simple file.
Value represents proportion of different values in the field to a total number of records in whole table. If number reaches 1, it means that every record in table has different values, so selectivity over that field will be the best. The lower the number is, the distinction falls, and follow by that, selectivity on that field also decreases. Most effective query should use indexes on fields with higher selectivity.
IndexSelectivity.p
When I came there, thought, why not go further and combine that information with xRefed code in Matt tool.
Now xRefAnalyzer is alarming about WHOLE-INDEX usage and also, pointing out queries where potentially could be better index to use. And You see a score for each potential index that could be used for that query.
For the combined indexes I’ve summarize fields selectivity, if fields in those indexes consecutively, in order appears in query.
There are of course steps to improve that. Now my data analyzer tries to calculate result for different table on each run. But it could be prepared for batch mode, to be run in less occupied hours. Timestamp could mark when calculated result will outdate. Plan to implement time limiter that will prevent it from working more than two hours. There’s always room to improve.
Results, outcomes
- Always check what index was selected for query.
- Think, how distinctive data You’re operating.
- If You design data structures remember that multiple field indexes are very useful but at the same time, they are less preferred that single field indexes. Golden rule 100% index had to be used.
- Code good and test your product.