Fastest Way to Get Progress Database Table Into Memory

bulklodd

Member
You told us you have 4 GB on the Progress server but Progress uses only 160 MB, so you definitely need to increase the -B parameter. Try raising it to at least 400000 buffers (~1.6 GB).
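
For example, a minimal sketch of the broker startup (the database name is a placeholder, and I assume a 4 KB database block size, so 400000 buffers is roughly 1.6 GB):

# everything except -B stays as it is now
proserve mydb -B 400000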
 

BCM

Member
bulklodd -
The example I gave above, where I incremented variable N by 1 for each row in the IR table, takes 40 minutes. Fetching the data using a pass-through SELECT query from SQL Server to Progress takes 1 hour. I don't see that I would gain anything overall by writing out those SQL INSERT statements.
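
For what it's worth, the loop I mean is essentially just this (a sketch only; "ir" is a placeholder for our IR table):

DEFINE VARIABLE n AS INTEGER NO-UNDO.

FOR EACH ir NO-LOCK:   /* read every row, count it */
    n = n + 1.
END.

DISPLAY n.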

My real question is: why does it take Progress so long to gather and send the data? I ask because this type of data transfer executes in 25-30% of that time on relational database servers like SQL Server and Oracle.

Anytime a set of data must be handled, it will always be faster to handle the data as a set rather than on a row-by-row basis. Do Progress programmers avoid handling data in sets? We recently created our own SQL-based routine to create a table that combines useful data values and/or sums of values from all the tables into one table. The vendor's 4GL routine takes 3 hours. Handling the data in sets cuts it down to 3 minutes.
 

BCM

Member
bulklodd -

I am trying your recommendation now.

Changed -B 40000 to -B 400000

Database restarted.

Copy jobs are executing. I will let you know how it works.

Thank you.
 

bulklodd

Member
BCM said:
Do Progress programmers avoid handling data in sets? We recently created our own SQL-based routine to create a table that combines useful data values and/or sums of values from all the tables into one table. The vendor's 4GL routine takes 3 hours. Handling the data in sets cuts it down to 3 minutes.

There are a lot of reasons.
Progress had no decent SQL support until 9.1D.
Using SQL means giving up the 4GL, and vice versa.
Don't ask me about ESQL; it's deprecated now and has nothing to do with the rest of the 4GL, although I do use its SELECT COUNT from time to time to count the records in a table.
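
For example, something like this inside a 4GL procedure (just a sketch; the table name is only an example):

/* SQL-89 statement embedded in a 4GL procedure - counts the rows in one pass */
SELECT COUNT(*) FROM customer.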
 

bulklodd

Member
BCM said:
bulklodd - The parameter change has no effect on the performance.

That's possible if you've only read each table once; the second run will be faster. How often do you restart the database?
 

BCM

Member
We restart the database once each day at 3:00 a.m. Then the database tables are copied to SQL Server.
 

bulklodd

Member
BCM said:
We restart the database once each day at 3:00 a.m. Then the database tables are copied to SQL Server.

That means you reset the db cache each time right before copying; that's not good, it kills performance. Why can't you do the copying before the database restart? Another possibility is to warm up the db cache before copying. That can be done, as Tom suggested, with 'proutil dbname -C tabanalys'. I think it should be started either before or together with the reading.
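
For example (the database path and output file are placeholders):

# scans every record in every table, which pulls the blocks into the -B cache
proutil /db/mydb -C tabanalys > /tmp/tabanalys.out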

However, 4GL reading can be improved by increasing the network packet size, by the -RO client parameter, and by more careful use of indexes. As for SQL, I have no idea what else can be done.
 

TomBascom

Curmudgeon
This database looks to be essentially untuned.

As Bulklodd suggested a larger -B ought to help. Although the amount of help is limited when the query is a large sequential read. (Big caches are most effective with random access.) You might also look at any options you might have for setting read-ahead buffers on the disk subsystem (especially if it's a disk array) or in the OS. But even that is probably of limited utility given your apparently "out of the box" database installation (see below for advice on how storage areas can help with that).

Setting -spin to something more reasonable (like 20,000 or perhaps 50,000) could help too.
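
For example, a startup along these lines (the db name is a placeholder; tune the numbers to your hardware):

proserve mydb -B 400000 -spin 20000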

On the SQL side -- I only see one client connection in the .lg file (dummyID01) so I'm not sure why you think you have more than one attempting to read. But you might try setting 1 connection per server to force multi-threaded access if there really are multiple processes trying to connect.

In general 40 minutes is a long time to read 4.8 million records. You should be able to read roughly 20,000 - 100,000 records/sec on reasonably modern PC hardware.

http://www.greenfieldtech.com/articles/how_fast_will_it_go.shtml

So it really shouldn't take more than about 10 or 15 minutes if you're properly tuned. With 9.1d one of the biggest things you can do to improve large sequential reads like this is to put that table into a dedicated storage area.

http://www.greenfieldtech.com/articles/storage_areas.shtml
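
Roughly, the mechanics look like this (a sketch only: the area name, area number, paths and db name are made up, and if tablemove isn't available on your release it's a dump & load instead). First an add.st file with the new area:

d "IR_Area":8,64 /db/ir_area.d1

then:

prostrct add mydb add.st
proutil mydb -C tablemove IR "IR_Area"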

You'll need a bit of downtime to do it though. I would, of course, be happy to help on a professional basis :)
 

BCM

Member
Tom,
I am sharing your response with our administrator. We have two SQL brokers, Broker 2 and Broker 3. On SQL Server we have two ODBC connections defined: one connecting on port 2509 and one on port 5018.

03:00:31 BROKER 2: Started for 2509 using TCP, pid 2780. (5644)
03:00:31 BROKER 2: This is an additional broker for this protocol. (5645)
03:00:31 BROKER 2: This broker supports SQL server groups only. (8864)
03:00:33 BROKER 3: Started for 5018 using TCP, pid 2700. (5644)
03:00:33 BROKER 3: This is an additional broker for this protocol. (5645)
03:00:33 BROKER 3: This broker supports SQL server groups only. (8864)


One issue on which I have been seeking clarification is:

Two processes run concurrently on SQL Server.
Process #1 - DummyID01 accesses Progress via Broker 2 (port 2509)
Process #2 - DummyID02 accesses Progress via Broker 3 (port 5018)

Will Progress execute both processes at the same time? I ask because the performance does not clearly indicate that SQL requests are processed concurrently by Progress. When I import 68 tables through one broker and userid, the elapsed time is 3 hours or so. When I split the tasks, with the 6 largest tables through one broker and the other 62 tables through the other broker using a different userid, the elapsed time is still 3 hours or so. My hypothesis is that the granularity at which Progress sees my requests is at the client (box) level rather than the userid/session level. If my hypothesis is false, then Progress is just plain slow in handling our requests.

Thank you for all your thoughts and recommendations.
 

joey.jeremiah

ProgressTalk Moderator
Staff member
we could give you suggestions, but we both know you won't try any

just goofing off, hope it won't be taken the wrong way :)
 

TomBascom

Curmudgeon
I would expect that, in a perfect world, if everything is as you describe, you do in fact have two distinct threads making requests, and that should have a positive impact on performance.

You should be able to see this by looking in PROMON or ProTop (ProTop is, IMHO, a much better tool for this purpose, but either would work...) at the IO by user screens. You should see both dummy01 and dummy02, and they should both be getting logical and physical IOs.

Of course either way essentially the same number of IO requests have to be made -- so if the disk subsystem is swamped adding a second thread isn't going to help. To fix that you need to either add spindles or arrange the data in such a way that each IO is more effective (that's what dumping and loading and building dedicated storage areas will do for you...)

The world is rarely perfect ;)
 

TomBascom

Curmudgeon
One more thing...

The Progress 4GL engine always uses an index to access data. So 4GL code that loops through large data sets is going to be slower than SQL code that does the same if you're using the newer Progress SQL releases that can scan a table without an index. (I think that started with 9.1d and it has got a lot of attention in the 10.x series as well.)

Some day I hope that the 4GL engine will support NO-INDEX table scans.
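
To make that concrete, a rough sketch (the table and index names are just examples): even a bare FOR EACH walks the primary index, and the most you can do in the 4GL is choose which index it walks, while the SQL-92 engine is free to do a straight table scan when no index is selective.

/* always index-driven, even with no WHERE clause */
FOR EACH ir NO-LOCK:                      /* walks the primary index */
END.

/* you can only pick WHICH index, not "no index" */
FOR EACH ir NO-LOCK USE-INDEX ir-date:    /* ir-date is a made-up index name */
END.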
 

joey.jeremiah

ProgressTalk Moderator
Staff member
star schema is in the works and the 4gl still doesn't have fast table scans (no-index)?

one big thing the 4gl has over sql, performance-wise, is direct access without the need to pass huge volumes of data around, and it always will have that

i'm getting used to the idea of using sql for reporting and intensive read-centric operations
 

Lee Curzon

Member
 
TomBascom said:
One more thing...

The Progress 4GL engine always uses an index to access data. So 4GL code that loops through large data sets is going to be slower than SQL code that does the same if you're using the newer Progress SQL releases that can scan a table without an index. (I think that started with 9.1d and it has got a lot of attention in the 10.x series as well.)

Some day I hope that the 4GL engine will support NO-INDEX table scans.

What happens if you deactivate all indexes on the table, and compile with
-noinactiveidx parameter?

I have no idea if this is a sensible question, just curious. :)
 

methyl

Member
Off topic BI stall

One day this will save your life. You need to set -bithold (to, say, 800 MB) and -bistall.

BTW, the best single throughput improvement (at the expense of BI file size) is to up the BI cluster size. Try looking at PROMON database checkpoints - if you are in the thousands not the tens an increase in BI cluster size should improve matters.
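
For example (db name and sizes are placeholders; truncating the BI requires the database to be shut down):

# BI stall threshold on the broker startup
proserve mydb -bithold 800 -bistall

# raise the BI cluster size to 16 MB (run against the offline database)
proutil mydb -C truncate bi -bi 16384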
 
Lee Curzon said:
What happens if you deactivate all indexes on the table, and compile with
-noinactiveidx parameter?

Ok, a quick (not comprehensive, I'm falling asleep) test seems to confirm that if all indexes are deactivated, the primary is used despite -noinactiveidx parameter.
 

bulklodd

Member
Ok, a quick (not comprehensive, I'm falling asleep) test seems to confirm that if all indexes are deactivated, the primary is used despite -noinactiveidx parameter

It's curious, but I think the so-called default index will be used in that case, because Progress can't query without an index at all, except by recid/rowid of course. I got wind of a NO-INDEX option but it's still unavailable anyway.
 

TomBascom

Curmudgeon
methyl said:
One day this will save your life. You need to set -bithold (to say 800 Mb) and -bistall.

It is not necessary in the v9 and later world where there is no longer a 2GB bi size limit.

IMHO the best life saver, across all versions of Progress, is to enable after-imaging.

methyl said:
BTW, the best single throughput improvement (at the expense of BI file size) is to up the BI cluster size. Try looking at PROMON database checkpoints - if you are in the thousands not the tens an increase in BI cluster size should improve matters.

That's very useful for improving transaction throughput but irrelevant to read throughput.
 
Hello all,

I just joined this conversation, and I also need something similar:

I have a table and I need to count the number of records in the table, divided into several types. What is the fastest way to do my count? I have an index on all the appropriate fields.

method 1 : dynamic query & query-prepare with "preselect each table where..."
method 2 : select count(*) into iCount from table where...
method 3 : do preselect each table where...

Other suggestions are of course also welcome.:)
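
For reference, here's roughly what I coded for each method (sketches only; the table and field names are placeholders):

DEFINE VARIABLE iCount AS INTEGER NO-UNDO.
DEFINE VARIABLE hBuf   AS HANDLE  NO-UNDO.
DEFINE VARIABLE hQry   AS HANDLE  NO-UNDO.
DEFINE QUERY qCount FOR order-line.

/* method 1 : dynamic query + QUERY-PREPARE with PRESELECT */
CREATE BUFFER hBuf FOR TABLE "order-line".
CREATE QUERY hQry.
hQry:SET-BUFFERS(hBuf).
hQry:QUERY-PREPARE("PRESELECT EACH order-line FIELDS(line-type) NO-LOCK WHERE order-line.line-type = 1").
hQry:QUERY-OPEN().
iCount = hQry:NUM-RESULTS.
hQry:QUERY-CLOSE().
DELETE OBJECT hQry.
DELETE OBJECT hBuf.

/* method 2 : embedded SQL count */
SELECT COUNT(*) INTO iCount FROM order-line WHERE order-line.line-type = 1.

/* method 3 : static (named) query with PRESELECT */
OPEN QUERY qCount PRESELECT EACH order-line NO-LOCK WHERE order-line.line-type = 1.
iCount = NUM-RESULTS("qCount").
CLOSE QUERY qCount.

/* method 4 : static FOR EACH */
iCount = 0.
FOR EACH order-line FIELDS(line-type) NO-LOCK WHERE order-line.line-type = 1:
    iCount = iCount + 1.
END.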

Edit : I did some small tests on a small table (2500 records)

The fastest & easiest way seems to be method 1, provided that you use the FIELDS(fieldname) option. When I omit the fieldname and use empty brackets, I see a small performance drop.

Test 1 : empty where statement


Some test results with data in cache :

method 1 : dynamic query : results vary from 40 to 60 ms to count.
method 2 : sql select count : results vary from 100 to 110 ms to count
method 3 : static query : results vary from 50 to 90 ms to count.
method 4 : static for each : results vary from 50 to 100 ms to count.

Some test results without data in cache (no run in the last minutes):

method 1 : dynamic query : result : 80 ms on the first run.
method 2 : sql select count : result : 151 ms on the first run.
method 3 : static query : result : 120 ms on the first run.
method 4 : static for each : result : 100 ms on the first run.

Test 2 : Indexed where statement -- result = 400 rows

Some test results with data in cache :

method 1 : dynamic query : results vary from 10 to 20 ms to count.
method 2 : sql select count : results vary from 10 to 10 ms to count
method 3 : static query : results vary from 10 to 10 ms to count.
method 4 : static for each : results vary from 0 (!) to 20 ms to count.

Some test results without data in cache (no run in the last minutes):

method 1 : dynamic query : result : 10 ms on the first run.
method 2 : sql select count : result : 10 ms on the first run.
method 3 : static query : result : 10 ms on the first run.
method 4 : static for each : result : 20 ms on the first run.

Test 3 : where statement field <> "" ==> gives all results = 2500 rows

Because of the high speeds I was seeing, I wanted to try something strange. In fact, I wanted the same result set as in the first test, but using a where statement. The results:

method 1 : dynamic query : results vary from 50 to 80 ms to count.
method 2 : sql select count : results vary from 50 to 80 ms to count
method 3 : static query : results vary from 70 to 80 ms to count.
method 4 : static for each : results vary from 50 to 100 ms to count.

Some test results without data in cache (no run in the last minutes):

method 1 : dynamic query : result : 70 ms on the first run.
method 2 : sql select count : result : 80 ms on the first run.
method 3 : static query : result : 80 ms on the first run.
method 4 : static for each : result : 90 ms on the first run.

Conclusion

For my tests, the dynamic query gives the most reliable and fastest results. All of the methods gain performance when you specify a where clause.
 