ODBC SELECT works remotely, hangs locally

jsveiga

New Member
Hello,

We are running OpenEdge 10.1A-02 (I know, old version; it came with our ERP), and I have an application which periodically (every 10 min) pulls data from the Progress DB (read-only ODBC connection) into a MySQL DB.

The ODBC driver is the one that came with OpenEdge, version 05.10.0039.

The DBs and the app run on Windows 2003 64-bit, and this was working fine until yesterday (18/Aug) at 14:50 (last successful sync). Since then the app hangs waiting for SELECT results from the ODBC connection. SELECTs returning just a few rows seem to work, but breaking the original SELECT into smaller ones still hangs on the second or third one.

Interestingly, if I run the app from my notebook over the network (Vista, same OpenEdge ODBC version, accessing the same DBs on the Win2003 server), it works without a hiccup (though it takes 5 min instead of the 2 min I used to get running on the server).

I tried opening/closing the ODBC connection on every SELECT (splitting the original big one into smaller ones), both enabling and disabling the "connection pool" option in the Windows ODBC manager, but then things get *extremely* slow running on the server (with only a small difference when running from my notebook).

The only changes I know of between then and now are:
- Our data keeps increasing; maybe a limit was hit
- Patch Tuesday; Microsoft updates happened yesterday

The ERP supplier also has a similar "sync" app which uses the OpenEdge ODBC driver (but read/write). Their app has been hanging like this for many months. I had been saying the problem was their app, since mine worked, but now I'm in the same boat.

On rare occasions, after a "hangup" (no error message, just no answer), the next access gets the "server crash likely" error. Normally, though, there is simply no reply to the SELECT.

Could someone shed some light? I'm really lost, and so is the ERP supplier.

Thanks!

Joao S Veiga
 

jsveiga

New Member
I got the app to run on the server once (it completed the "big" SELECT).

To put things in numbers:

The "big" SELECT returns about 25900 rows, then I have to fetch more info (no "LIMIT" or "TOP" on 10.1A...) from other tables, with other SELECTs, one for each of these 25900 rows.

On my notebook (Vista 32-bit, Core 2 Duo 2 GHz, 4 GB RAM) over the network, that runs at about 77 rows per second (note that I read from OpenEdge on the server, then write back to MySQL on the same server, over 100 Mbps Ethernet).

I USED to get about 216 rows per second running on the server.

Since yesterday I'm getting about 1.5 rows per second. Server load is otherwise the same. While this runs, average CPU usage is 30% (varying from 8 to 70%).

The server has only 14 Progress clients (4GL); it is a dual-core Xeon 3065, 2.33 GHz, 4 GB RAM (1.1 GB available), Win2003 R2 64-bit.

So: the 25900-row query hangs with no response most of the time. When it does get an answer, the subsequent smaller queries are 140 times slower than before - 50 times slower than running them over the network.

(I've read about the 10.1A cursor limit when looping, but I don't get those cursor errors)
 

Casper

ProgressTalk.com Moderator
Staff member
How is the database activity when performing that query on the server, in terms of reads (logical reads and DB reads)?

I remember having similar problems with 10.1A, but we decided not to use it; we went to 10.1B. In our case it had to do with apparent changes between V9 and V10 (where we came from): the SQL engine turned out to be pickier than before.

Anyhow, what does your query plan have to say about your query? Do you run UPDATE STATISTICS on a regular basis?
We found it necessary to run UPDATE STATISTICS for index, table and column statistics to keep our queries running smoothly.
But reading your posts, I guess you have already done that, and it wouldn't explain the difference in behavior between connecting from the notebook and connecting on the server (both are essentially remote connections, but the connection at the server is more local :)).
The only other thing that crosses my mind is the default isolation level setting, but I suppose you haven't changed it. We had problems with settings other than read uncommitted. Maybe worthwhile to take a look at that too.
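
For reference, the UPDATE STATISTICS statements look something like this from memory (check the SQL reference for your exact version; the combined form may differ):

UPDATE TABLE STATISTICS AND INDEX STATISTICS FOR PUB.item;
UPDATE ALL COLUMN STATISTICS FOR PUB.item;

You can run them from any SQL-92 client over the same ODBC connection.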

Casper.
 

jsveiga

New Member
Thank you Casper,

For the one time I managed to make the app run on the server yesterday but got horrible performance, I believe I have the explanation: I had left ODBC tracing enabled on the server.

For the original problem (hanging on the big query), the tracing may have given me a clue: I found that the ERP supplier's app runs a "SELECT * FROM PUB.item". If I run that query from a GUI ODBC client it returns empty, and all subsequent queries on the same connection fail (too big a result set for the ODBC driver?).

I am wondering if that query in the ERP app "breaks" a connection sitting in the pool, which my app then may (or may not) try to reuse. That could explain why running from my notebook (where the sole ODBC client is my app) works every time, while running on the server (where it shares the connection pool with the ERP app) hangs most of the time.
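
To keep my app's connections out of the shared pool, I'm forcing pooling off (and read-only mode) before opening. My app is C++ on Qt's ODBC plugin, and the relevant bit looks roughly like this (the connection and DSN names here are placeholders):

#include <QSqlDatabase>
#include <QSqlError>

QSqlDatabase db = QSqlDatabase::addDatabase("QODBC", "sync"); // Qt's ODBC plugin
db.setConnectOptions("SQL_ATTR_ACCESS_MODE=SQL_MODE_READ_ONLY;"
                     "SQL_ATTR_CONNECTION_POOLING=SQL_CP_OFF"); // must be set before open()
db.setDatabaseName("ProgressDSN"); // placeholder DSN name
if (!db.open())
    qWarning("connect failed: %s", qPrintable(db.lastError().text()));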

I suspended the ERP app's schedule, and then mine started working fine - for 8 "runs"; then it started hanging again on the "big" query.

The default isolation was left at the default (empty in the MS ODBC manager), and I'll try running UPDATE STATISTICS to see if I get better performance once the mystery is solved.

Thanks again!,

Joao S Veiga
 

netrista

New Member
Hello Joao,

I just wanted to add that the appearance of the problem right after Windows patches makes me wonder if the patches caused it. Do you know if any of the patches were network or firewall related? Also, does your Progress native DSN have anything like a Fetch Buffer or Row Buffer size that you can modify, so you can check whether returning the 25900 rows in increments of 1 row at a time, 100 at a time, and so forth prevents the hanging? This would be somewhat akin to, but slightly different from, breaking the single large SELECT into smaller ones.
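
If the DSN doesn't expose such a setting, the same knob exists at the ODBC API level as SQL_ATTR_ROW_ARRAY_SIZE. Purely to illustrate the mechanism (generic ODBC C code, not specific to your app or to the Progress driver):

#include <windows.h>
#include <sql.h>
#include <sqlext.h>

/* Fetch a block of rows per driver round trip using column-wise binding. */
void fetchInBlocks(SQLHDBC hdbc)
{
    enum { ROWS = 100 };          /* rows per fetch round trip */
    SQLHSTMT hstmt;
    SQLCHAR  codes[ROWS][21];     /* buffer for one bound column */
    SQLLEN   lens[ROWS];
    SQLULEN  fetched = 0;

    SQLAllocHandle(SQL_HANDLE_STMT, hdbc, &hstmt);
    SQLSetStmtAttr(hstmt, SQL_ATTR_ROW_ARRAY_SIZE, (SQLPOINTER)(SQLULEN)ROWS, 0);
    SQLSetStmtAttr(hstmt, SQL_ATTR_ROWS_FETCHED_PTR, &fetched, 0);
    SQLBindCol(hstmt, 1, SQL_C_CHAR, codes, sizeof(codes[0]), lens);
    SQLExecDirect(hstmt, (SQLCHAR *)"SELECT \"it-codigo\" FROM PUB.item", SQL_NTS);
    while (SQL_SUCCEEDED(SQLFetch(hstmt))) {
        for (SQLULEN i = 0; i < fetched; ++i) {
            /* process codes[i] here */
        }
    }
    SQLFreeHandle(SQL_HANDLE_STMT, hstmt);
}

Varying ROWS between 1, 100, 500 and so on would tell you whether the size of each fetch round trip correlates with the hangs.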

Best regards,
Netrista Khatam
Technical Services Manager
OpenLink Product Support
 

jsveiga

New Member
Hi Netrista,

(BTW, one of the things I tried was installing the OpenLink Progress ODBC driver 6.1 with a trial license, but when defining the DSN it said "OS not licensed" for the OpenEdge 9.1 driver - maybe because I tried on Vista and on Win2003 64-bit? - and it says there is a DLL missing for SQL-92.)

I have reviewed the installed MS patches, and there were actually no Windows updates installed on the last Patch Tuesday; the latest I see in Add/Remove Programs are Win2003 security updates installed on 8/13/2009. The ERP vendor has been having the same problem with their "import" program (similar to mine) for months now - I had been blaming their app; mine worked, so it couldn't be an OpenEdge ODBC problem, right? Now we're in the same boat.

Yes, the OpenEdge-supplied ODBC driver has a "Fetch Array Size" in the Advanced DSN settings, and the default is 50. I tried going up to 1000 (2000 would give "server crash likely" even from my notebook), but got no conclusive results.

I seem to have better chances of it running (3 out of 5 still hang or get "crash likely") with the following (sorry, I could not isolate which factor does it; every run takes from 2 to 4 minutes on the server, and nothing solved it 100%, so it is hard to tell):

- Fetch Array Size: 500
- SQL_ATTR_ACCESS_MODE=SQL_MODE_READ_ONLY;SQL_ATTR_CONNECTION_POOLING=SQL_CP_OFF
- READ UNCOMMITTED
- I broke the 25900-row (21-column) result query into chunks of at most 100 rows (I fetch by product code ranges, but not all codes exist) - see the sketch after this list
- close the connection and open it again on every loop (in 4 cases this doubled the run time to 4 min, but in 2 cases it took even less than the usual 2 min: 1min48s, with nothing skipped)
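
The chunked loop looks roughly like this (a sketch: the range boundaries are illustrative and processRow() stands in for the code that writes to MySQL; in this configuration the connection is also closed/reopened around each chunk):

#include <QSqlDatabase>
#include <QSqlQuery>
#include <QString>

void processRow(const QSqlQuery &q); // placeholder: writes one row to MySQL

void importByRanges(QSqlDatabase &db)
{
    // walk "it-codigo" in fixed-width ranges instead of one big SELECT
    for (int base = 10000; base <= 35000; base += 1000) { // illustrative bounds
        QString from = QString("%1").arg(base, 6, 10, QChar('0'));
        QString to = QString("%1").arg(base + 999, 6, 10, QChar('0'));
        QSqlQuery q(db);
        q.exec(QString("SELECT \"it-codigo\",\"desc-item\" FROM PUB.item "
                       "WHERE \"cod-obsoleto\" = 1 AND \"it-codigo\" > '%1' "
                       "AND \"it-codigo\" < '%2'").arg(from, to));
        while (q.next())
            processRow(q);
    }
}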

Yet, in most runs the program still hangs waiting for the response to one of the queries, or gets "server crash likely" (this after looping through 10 or 100 of the small chunks).

I'm already considering ugly workarounds, such as aborting and restarting from the last chunk that worked, dumping the whole Progress tables to text and importing them into MySQL and working from there, or hiding my notebook in the server rack just to do the data import.
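
The abort-and-restart one would look something like this (a sketch; the checkpoint file name and runChunk() are placeholders for my real code):

#include <QFile>
#include <QTextStream>
#include <QStringList>

bool runChunk(const QString &from, const QString &to); // placeholder: one chunked SELECT + MySQL write

void resumableSync(const QStringList &bounds)
{
    QString last;
    QFile ck("sync.checkpoint");
    if (ck.open(QIODevice::ReadOnly)) {
        last = QTextStream(&ck).readLine();
        ck.close();
    }
    for (int i = 0; i + 1 < bounds.size(); ++i) {
        if (!last.isEmpty() && bounds[i] <= last)
            continue; // finished in an earlier (aborted) run
        if (!runChunk(bounds[i], bounds[i + 1]))
            return; // error: stop here; if the app hangs instead and gets killed,
                    // the checkpoint file still holds the last completed range
        if (ck.open(QIODevice::WriteOnly | QIODevice::Truncate)) {
            QTextStream(&ck) << bounds[i];
            ck.close();
        }
    }
    QFile::remove("sync.checkpoint"); // clean run: start from scratch next time
}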

I ran ODBC traces, but they didn't help much; when it says "server crash likely", that shows up in the trace, but it also shows in my app's log, coming from the ODBC driver. When it hangs, I can see the query "going", and the trace just stops there.

Is it possible to trace on the Progress DB side, to see if the problem is on that side and not on the ODBC middleman?

What puzzles me is that my app does the same thing every time. It just pulls data from the Progress db to MySQL. There is no user interaction. It is really frustrating to get random results from the same actions. I miss the time when computing was an exact science.

Here's an excerpt from my (not the debug one) log:
2009 Aug 20 16:20:42 : Ok - Time: 4min 19s
2009 Aug 20 16:25:52 : Ok - Time: 4min 8s
2009 Aug 20 16:30:22 : Ok - Time: 4min 13s
2009 Aug 20 16:52:35 : SELECT "it-codigo","desc-item","desc-inter",un,"compr-fabric","ge-codigo","fm-codigo","codigo-refer","inform-compl","codigo-orig","class-fiscal","aliquota-iss","peso-liquido","peso-bruto","preco-ul-ent","data-ult-ent","deposito-pad","cod-localiz","usuario-alt","fm-cod-com",narrativa FROM PUB.item WHERE "cod-obsoleto" = 1 AND "it-codigo" > '011000' AND "it-codigo" < '011999'
[DataDirect][ODBC Progress OpenEdge Wire Protocol driver][OPENEDGE]Server crash likely. QODBC3: Unable to execute statement
2009 Aug 20 16:55:24 : Ok - Time: 1min 48s
2009 Aug 20 17:23:10 : SELECT "it-codigo","desc-item","desc-inter",un,"compr-fabric","ge-codigo","fm-codigo","codigo-refer","inform-compl","codigo-orig","class-fiscal","aliquota-iss","peso-liquido","peso-bruto","preco-ul-ent","data-ult-ent","deposito-pad","cod-localiz","usuario-alt","fm-cod-com",narrativa FROM PUB.item WHERE "cod-obsoleto" = 1 AND "it-codigo" > '010000' AND "it-codigo" < '010999'
[DataDirect][ODBC Progress OpenEdge Wire Protocol driver][OPENEDGE]Server crash likely. QODBC3: Unable to execute statement
2009 Aug 20 17:33:40 : Ok - Time: 1min 48s
2009 Aug 20 17:57:19 : SELECT "it-codigo","desc-item","desc-inter",un,"compr-fabric","ge-codigo","fm-codigo","codigo-refer","inform-compl","codigo-orig","class-fiscal","aliquota-iss","peso-liquido","peso-bruto","preco-ul-ent","data-ult-ent","deposito-pad","cod-localiz","usuario-alt","fm-cod-com",narrativa FROM PUB.item WHERE "cod-obsoleto" = 1 AND "it-codigo" > '010000' AND "it-codigo" < '010999'
[DataDirect][ODBC Progress OpenEdge Wire Protocol driver][OPENEDGE]Server crash likely. QODBC3: Unable to execute statement

Note that in between those are the runs where the app hangs (I kill the process, so nothing reaches the log). Note too that by the time it sends the query for {"it-codigo" > '010000' AND "it-codigo" < '010999'} or {"it-codigo" > '011000' AND "it-codigo" < '011999'}, it has already looped 100 or 110 times successfully!

Thanks!

Joao S Veiga
 

netrista

New Member
Hello Joao,

Regarding our OpenLink errors: the license error suggests that you received a license for a Win32 platform when you need a license for Win64. This is easily rectified; you can always email me at technical.support@openlinksw.com to get a new license. The .dll error (probably esql92.dll) would result from that .dll being absent from the machine, or from the directory that contains it not being on the system PATH.

Regarding your continued import problems: it may actually be your ERP at fault here. Is there any more definitive information in the Progress native error log concerning the alleged server crash, so that you can rule Progress in or out as the source of the problem?

I'll try and get back with more tips later, after I analyze your latest post in more depth.

Regards,
Netrista
 

jsveiga

New Member
Thank you very much Netrista,

I will probably try the driver again soon, and will contact support for the correct trial license.

The problem with finding the crash in the server error log is that there is no crash :-/

Note that in most cases I just get no answer to the query, and my program keeps waiting until I kill it. On _rare_ occasions I get "server crash likely" back as an error from the ODBC driver, but there is nothing in the Progress log.

The only errors/warnings I can see in the log (now growing very fast, since I connect/disconnect on every loop and do no pooling) are likely unrelated:
- errors trying to create a key with the Event Viewer
- warnings informing the softlimit for -l was automatically increased
(these happen not necessarily while my app is importing; they also show up while users are just accessing the Progress ERP)

Could it be that the ODBC driver can't "wait" for the -l (local buffers) to be automatically increased when it needs more, and then hangs??

I'm getting consistent runs now by doing everything as "un-optimized" as possible:
- Fetch array size is still 500
- still SQL_ATTR_ACCESS_MODE=SQL_MODE_READ_ONLY;SQL_ATTR_CONNECTION_POOLING=SQL_CP_OFF
- still READ UNCOMMITTED
- broke the big query into ranges of 50 part numbers (so it returns at most 50 rows)
- close the connection, **destroy the database instance in the C++ program, create it again**, and open the connection again (see the sketch after this list)
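
In code, that per-loop teardown is roughly this (a sketch; the connection name, DSN, and runOneRange() are placeholders):

#include <QSqlDatabase>

bool runOneRange(QSqlDatabase &db, int chunk); // placeholder: one chunked SELECT + MySQL write

void syncWithFullTeardown(int chunkCount)
{
    for (int i = 0; i < chunkCount; ++i) {
        { // scope the handle so it is gone before removeDatabase()
            QSqlDatabase db = QSqlDatabase::addDatabase("QODBC", "sync");
            db.setConnectOptions("SQL_ATTR_ACCESS_MODE=SQL_MODE_READ_ONLY;"
                                 "SQL_ATTR_CONNECTION_POOLING=SQL_CP_OFF");
            db.setDatabaseName("ProgressDSN"); // placeholder DSN
            if (db.open()) {
                runOneRange(db, i);
                db.close();
            }
        }
        QSqlDatabase::removeDatabase("sync"); // destroy the instance entirely
    }
}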

Using ranges of 100 part numbers in this configuration I got about 2min5s per run, but it was still skipping 1 out of 5 imports.

Using ranges of 10 part numbers, I skipped nothing, but each run took 6 min.

With 50, I get about 2min40s per run and no skipping so far (15 runs in this setup).

I'm thinking about trying to:
- keep the exact setup, but enable connection pooling
- do that, but eliminate the destroying/creating of the connection and add a delay between loops instead
- do that, and also eliminate the close/open of the connection

I'm trying to figure out whether what helps here is the actual close/destroy/create/connect + no pooling, or just the added delay all this creates (which would explain why I can run from my notebook without any of the un-optimizations, since there the network adds the delay).

BR!

Joao
 

TomBascom

Curmudgeon
If "server crash likely" is potentially the same as "it is taking a long time to get a result, so I give up" then running update statistics could very well solve your problem. It sounds like performance is poor and getting worse which also seems to point to a need to run update statistics.
 

jsveiga

New Member
Thank you, Tom (sorry it takes me so long to answer; for every change I have to let the import run a few times to see whether the result is consistent).

I just did this test:
1 - Removed the "un-optimization" of closing/destroying/creating/opening the ODBC connection every time (the setup that works fine). This lowered my run time from 2min35s to 1min27s! But it hung in 2 of my 3 attempts, waiting for the query answer (even with the broken-into-50-code-ranges loops). No "crash likely"; just no answer.
2 - Changed the ODBC driver setting from READ_UNCOMMITTED to <empty> (otherwise UPDATE STATISTICS wouldn't run) and ran UPDATE TABLE STATISTICS AND INDEX STATISTICS FOR PUB.item a couple of times. This took less than a second. For COLUMN statistics it returned -10002 (cannot execute sql statement).
3 - Back to READ_UNCOMMITTED, tried the same build as in (1), and it still hangs.
4 - Went back to the close/destroy/create/open connection routine, and it works again, same run times as before (up to 2min40s).

When I get "server crash likely", it is returned immediately by the ODBC driver after I submit the query. When I get no response, it can sit there for 5 minutes until I kill the process. The "server crash likely" case is *very* rare; I'd say I've seen it in my app's debug log once per 50 hangs.

I also changed SQL_ATTR_CONNECTION_POOLING from SQL_CP_OFF to SQL_CP_ONE_PER_HENV and to ONE_PER_DRIVER. This made no noticeable difference in times, so I wonder if it is working at all.

Now, would the performance drop from a lack of UPDATE STATISTICS act this randomly? I mean, when it works OK, the 50-part-number-limited query returns within the same second. When it does not, the same query, for the same range of part numbers, hangs forever.

I'm certain now that it is not the close/destroy/create/open connection routine that makes it work, but the added delay. I went back to instantiating/opening the connection only once (still with the broken-into-chunks query), but added a 2 ms "sleep" while processing each item (with its queries to the other Progress tables), and ran 9 times without hanging, at ~2min40s.

Changed the sleep to 1 ms, and (up to now) have run 6 times, ~2min10s, no hanging.

Unfortunately my "sleep" resolution is 1 ms, so I cannot tune it down below that.
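
For reference, the throttle is nothing fancy (a sketch; processItem() is a placeholder, and on Windows I just call Sleep(), whose real granularity depends on the OS timer tick, which is why I can't reliably go lower):

#include <windows.h>

void processItem(int i); // placeholder: one item's Progress SELECTs + MySQL write

void throttledLoop(int itemCount)
{
    for (int i = 0; i < itemCount; ++i) {
        processItem(i);
        ::Sleep(2); // ~2 ms pause before the next item's queries
    }
}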

Why I have to throttle queries to the OpenEdge ODBC driver is a mystery to me. Once the driver/DB has returned the answer to a query, shouldn't it be ready for another? My import program is not threaded; it only sends the next query after it has received and processed the answer to the previous one.

Also, if some "cumulative" problem were degrading performance, I'd expect the run times to get gradually slower, not to randomly either run at normal speed or hang forever without an error message or timeout.

Best regards!

Joao S Veiga
 

jsveiga

New Member
Got a better solution (although still no explanation), WITHOUT the sleeps.

I reduced my "chunk" size to 20 (i.e., the big query which returned 25900 items now runs divided into part number ranges of at most 20 codes - it was 50 up to the last tests).

With this I was able (up to now) to run 9 times without errors or hanging.

My query now returns 24 columns, so I'm wondering if this is related to the "Fetch Array Size" actually being columns x rows, and not just rows as the documentation says. My Fetch Array Size is 500.
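
(If that guess is right, the numbers fit: 24 columns x 20 rows = 480 buffer slots, just under my FAS of 500, while the old chunk size of 50 would need 24 x 50 = 1200.)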

I then tried setting FAS to 1250 and going back to a chunk size of 50, but that got me "server crash likely" on another query in the program (one which had not been failing so far), and hangups on my "usual" trouble query.

I also tried FAS 1000 and chunks of 40, but got hangups.

So I'm running with FAS 500 and chunks of 20. Run time is about 1min30s.

Any ideas why this works?

Best regards,

Joao S Veiga
 

jsveiga

New Member
... and remember that up to Aug/19 14:30 it worked fine with my query returning 25900 rows and the fetch array size at the default of 50!
 