After 12-15 hours No more new connections to the database

ddegroot

Member
Hi Guys,

We ran into a problem after moving to a new machine. We are running Tru64 5.1B with an old version of Progress 7.3E15. We need this old version to support running MFGPro 8.5E.

Machine:
--------
Digital Alpha ES40 (4 CPU, 8 Gb)
Tru64 5.1B
Progress 7.3E15
MFGPro 8.5E
Attachments: prostats files of the last 4 hours:
View attachment prostats_04-24-08_1910.prod.txt / View attachment prostats_04-24-08_2010.prod.txt / View attachment prostats_04-24-08_2110.prod.txt / View attachment prostats_04-24-08_2210.prod.txt / View attachment prostats_04-24-08_2310.prod.txt

The problem is that we run fine for about 12-14 hours, and then the prod (prod.pf) database stops allowing logins or logouts. Even new APW, PROMON, etc. connections will not work. The session just does not start, and the connection attempt is not even written to the prod.lg file. There is no filesystem or CPU activity at that moment. The OS stays fully responsive, and there are no errors in any OS log files.

We have been trying to solve this problem over the last couple of days and have not been able to find any real problem that might cause it. /var/log/messages, /var/log/kern.log and /var/log/deamon.log don't mention anything in particular that might be related. All other databases running on the server (prod_gl.pf and lnprod.pf, normally connected in the same session for every application logon) work just fine and can still connect new users.

We use self-service clients for 90% of the connections; only a few users use the GUI. We allow a maximum of 88 client connections and normally have about 60 users online.

7.3E15 only allows a block size of 1024.

prod db:
--------
filesystem: 20 GB
size: 9 GB in 100 MB extents
high-water mark: 48%

bi file prod.db
-------------
filesystem: 20 GB
size: 700 MB in 100 MB extents
biclustersize = 4096
biblocksize = 8

# Progress parameter file for Production DB
#
# Client <-> Server parameters:
-H mfgpro2
-N tcp
-S prod

-n 88 # Maximum number of users
-Mn 4
-Ma 12
-Mf 90

-L 100000
-B 400000

-spin 25000
-bibufs 50
-napmax 5000
-q

(We would like to increase the -B parameter later to about 1200000 but this needs to be solved first.)

Checkpoint Information
----------------------
Code:
Ckpt                         ------ Database Writes ------
 No. Time      Len   Dirty   CPT Q    Scan   APW Q Flushes
  77 23:14:58  127    1888    1887     255       0       0  100   123    1888  <-- End Batch. Server again allows now new logins.
  76 23:12:49  129    1862    1861       0       0       0  100   125    1862
  75 23:10:43  126    1849    1848       0       0       0  100   122    1849
  74 23:08:57  106    1834    1833       0       0       0  100   102    1834
  73 23:07:46   71    1286    1285       0       0       0  100    71    1286
  72 23:06:57   49    2311    2310       0       0       0  100    48    2311
  71 23:05:36   81    2352    2351      11       0       0  100    81    2352
  70 23:03:28  128    2273    2272      20       0       0  100   123    2273
  69 23:01:12  136    2178    2176       5       0       0  100   131    2178
  68 22:58:58  134    2224    2222      67       0       0  100   130    2224
  67 22:56:39  139    2147    2146      65       0       0  100   134    2147
  66 22:54:27  132    2139    2138      60       0       0  100   127    2139
  65 22:52:13  134    2803    2802      34       0       0  100   129    2803
  64 22:49:58  135    1796    1795      42       0       0  100   131    1796
  63 22:48:39   79    1321    1245      18       0       0  100    76    1321
  62 22:47:56   43    1559    1483       0       0      75  100    43    1484
  61 22:46:50   66    1379    1378       0       0       0  100    65    1379
  60 22:46:00   50    1985    1984       0       0       0  100    48    1985
  59 22:44:49   71    2122    2121      29       0       0  100    69    2122
  57 22:42:19   62    1908    1907       0       0       0  100    60    1908
  56 22:41:08   71    2094    2093       0       0       0  100    69    2094
  55 22:39:28  100    2154    2153       0       0       0  100    98    2154
  54 22:38:04   84    2834    2833      33       0       0  100    82    2834
  53 22:36:16  108    2435    2433      49       0       0  100   105    2434
  52 22:34:47   89    1483    1473      32       0       0  100    86    1481
  51 22:34:03   44    3455    3437      17       0       0  100    44    3440
  50 22:32:02  121    2645    2643       7       0       0
  49 22:29:56  126    2060    2059      38       0       0
  48 22:28:30   86    2397    2396     105       0       0  100    81    2397
  47 22:26:49  101    2356    2345     145       0       0  100    99    2355
  46 22:24:49  120    2095    2060     196       0       0  100   119    2091
  45 22:22:37  132    1818    1748     296       0       0  100   129    1782
  44 22:20:13  144    2926    2923     130       0       0  100   140    2924
  43 22:19:09   64    2341    2334      35       0       0  100    57    2341
  42 22:18:02   67    2885    2876      22       0       0  100    64    2883
  41 22:17:01   61    3479    3421     128       0       0  100    59    3423
  40 22:15:53   68    1774    1758      66       0       0  100    66    1764
  39 22:14:45   68    1796    1783      64       0       0  100    66    1790
  38 22:13:50   55    3258    3256       8       0       0  100    53    3257
  37 22:12:42   68    2727    2726      12       0       0  100    66    2727
  36 22:11:41   61    6061    6060       4       0       0  100    60    6061
  35 22:10:44   57    7236    7235      77       0       0  100    55    7236
  34 22:09:59   45    6764    6763      51       0       0  100    44    6764
  33 22:09:18   41    7483    7482      24       0       0  100    41    7483
  32 22:08:35   43    8795    8794       0       0       0  100    43    8795
  31 22:07:53   42    9684    9683      43       0       0  100    42    9684
  30 22:07:12   41     625     526      42       0       0  100    40     625
  29 22:06:05   67    1138     999      55       0      98  100    67    1040  <-- Start Batch MRP
  28 22:05:03   62    2971    2925       0       0       0  100    62    2971
  27 22:04:07   56    2598    2552       0       0      45  100    56    2553
  26 22:03:07   60    3504    3503      21       0       0  100    60    3504
  25 22:02:01   66    1786    1785       0       0       0  100    66    1786
  24 22:01:01   60    1758    1757       0       0       0  100    57    1758
  23 21:20:03 2458     164       0     240       9       0  100     1       0  <-- Batch started 22:00
  22 20:21:27 3516     673     671     542       0       0  100   568     672
  21 17:10:06 11481    247     235    2179       0       0  100   238     242
  20 15:57:18 4368    2825    2823    2999       0       0  100  2377    2824
  19 15:35:04 1334    3617    3615    1103       0       0  100  1186    3616
  18 15:08:44 1580    3699    3684     328       0       0  100  1520    3698
  17 14:57:24  680    1972    1957       0       0      13  100   680    1958 
  16 14:44:58  746    1902    1897       5       0       0  100   744    1901
  15 13:58:12 2806     834     825    3050       0       0  100   276     826
  14 13:01:35 3397    3455    3453    3672       0       0  100  1147    3454 
  13 12:14:02 2853    2163    2161    1045       0       0  100   698    2162
  12 11:10:07 3835     898     896    2605       0       0  100   150     897
  11 10:12:35 3452    4076    4073    4128       0       0  100   681    4075
  10 09:56:49  946    4583    4580      45       0       0  100   749    4582
   9 09:40:55  954    4850    4848      74       0       0  100   741    4849
   8 09:25:37  918    3064    3061     465       0       0  100   437    3063
   7 08:51:58 2019    1918    1916    1586       0       0
   6 08:35:33  985    3927    2473     379    3874       0
   5 08:17:55 1058    2879    2878     409       0       0
   4 07:59:34 1101    2540    2539     539       0       0 <== Rebooted machine at 7:00
View attachment CheckpointTable.txt

If you guys need any more information to give us a hand in solving this problem just ask. We should be able to provide it.

Thanks in advance

Diederik de Groot :confused:
 

ddegroot

Member
Side note:

Could NIS updates be causing these problems? I have seen NIS sometimes take 30 seconds to 1 minute to complete the username/password updates from our server.

Can anyone tell me whether user login/logout is handled sequentially by Progress 7.3E15, and whether this might cause the problem: users being unable to log in or out because one user's UID/GID can no longer be found and Progress won't recheck?

Does anyone have experience with running Progress and NIS on the same machine?

Diederik
 

TomBascom

Curmudgeon
Tru64 + Progress 7.3 + networking problems (long login times) sure sounds like asking for trouble to me...

You're in serious need of some upgrades.

Aside from that, one "quick fix" would be to change -n 88 to -n 200.

-n has nothing to do with licensed users. It is simply the number of database connections. If you started 88 APWs you'd run out of -n without having any users (actually you'd run out around 85...). Once you run out you can't get new connections. The symptoms that you describe are entirely in line with such a problem and increasing -n (dramatically) should solve that problem.
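
As a rough illustration of the point above, here is a minimal sketch of how every connection type eats into -n. The snapshot list is made up for the example and is not taken from the attached prostats files; the sketch also ignores the small number of slots that, as noted above, are lost before the nominal limit is reached (running out "around 85" rather than 88).

```python
from collections import Counter


def connection_headroom(max_connections, connections):
    """Return (used, free): every connection type counts against -n."""
    used = len(connections)
    return used, max_connections - used


# Hypothetical snapshot of connection types, not real data from this thread.
snapshot = ["SELF"] * 60 + ["REMC"] * 6 + ["APW"] * 4 + ["WDOG", "MON"]

used, free = connection_headroom(88, snapshot)
print(Counter(snapshot))
print(f"used={used} free={free}")
```

The thing to watch is the free count over time: if it keeps shrinking while the number of real users stays flat, connections are lingering somewhere.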
 

ddegroot

Member
Hi Tom,

Thanks for your answer. I know it's an unsupported version, but it will have to do for the coming two years.

We only need something like 60 concurrent connections, and I've set 88 as an upper limit. The problem occurred even when only four users were connected. To be honest, I'm quite sure the user count of 88 is not the problem; we used 77 on our previous machine until a week ago. I could set it to any number I'd like and it wouldn't matter.

I think I'm on the track of a possible NIS issue. NIS users aren't cached on Tru64, and once an update is started from the NIS server to this slave, 'ls -l /home' shows numeric UIDs and GIDs instead of names, meaning it can't find those accounts. This only happens when either machine is under heavy load and the NIS updates take a while.
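
The 'ls -l' symptom can also be checked from a script. The following is only a sketch, assuming a Python interpreter is available on the box; resolve_uid and unresolved are names invented for this example. It asks the system's account database (whatever is configured: files, NIS, ...) for the name behind a UID and times the lookup.

```python
import os
import pwd
import time


def resolve_uid(uid):
    """Return (login name or None, seconds the lookup took)."""
    start = time.monotonic()
    try:
        name = pwd.getpwuid(uid).pw_name
    except KeyError:
        name = None  # no account entry found: 'ls -l' would print the bare number
    return name, time.monotonic() - start


def unresolved(uids):
    """Return the UIDs that currently cannot be mapped to a login name."""
    return [uid for uid in uids if resolve_uid(uid)[0] is None]


# Demo on the current user only; a real check would loop over the owners of /home/*.
name, elapsed = resolve_uid(os.getuid())
print(f"uid {os.getuid()} -> {name!r} in {elapsed:.3f}s")
```

Run in a loop during a NIS update, a lookup that returns no name or takes unusually long would point in the same direction as the 'ls -l' output.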

I have now copied all NIS passwd/group content to the standard /etc/passwd and /etc/group files and switched off NIS temporarily to see what the result is. It looks very promising.

This machine ran in a test environment for about 6 months, but we only migrated the actual users to it on Monday. The problems started a couple of hours later.

Maybe you have some observations besides the user count.

Thanks again for your response,

Regards,

Diederik
 

ddegroot

Member
We have not observed the problem for the last 24 hours, after switching off NIS and moving back to /etc/passwd authentication. I might be on to something here.

Has anyone experienced similar problems with NIS and Progress?

I would still like some kind of single sign-on, though. ASU is not an option; LDAPCD doesn't seem to work because the password sync won't compile; and Kerberos would need KTelnet clients, which are not available in the version of netterm we use. Is there anything against running NIS in the background without authenticating against it (no +: at the end of the /etc/passwd file) and updating /etc/passwd from a script every hour (running at nice -n -20)? Any other options?
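
A minimal sketch of what such an hourly sync could look like, under stated assumptions: the NIS entries have already been dumped to a list of lines (for example from the output of ypcat passwd), local entries always win, and the file is replaced in one step so a reader never sees a half-written /etc/passwd. merge_passwd and write_atomically are names invented for this example, and nothing here handles /etc/group or shadow passwords.

```python
import os
import tempfile


def merge_passwd(local_lines, nis_lines):
    """Keep local entries; append NIS entries only for login names not seen yet."""
    merged = []
    seen = set()
    for line in list(local_lines) + list(nis_lines):
        line = line.rstrip("\n")
        if not line or line.startswith(("+", "-")):
            continue  # skip blank lines and NIS compat markers such as "+:"
        login = line.split(":", 1)[0]
        if login not in seen:
            seen.add(login)
            merged.append(line)
    return merged


def write_atomically(path, lines, mode=0o644):
    """Write to a temp file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp:
        tmp.write("\n".join(lines) + "\n")
    os.chmod(tmp.name, mode)
    os.replace(tmp.name, path)  # atomic on POSIX when both are on one filesystem


# Made-up example entries, not real accounts.
local = ["root:x:0:0::/root:/bin/sh", "+:"]
nis = ["root:x:0:0:nis root:/:/bin/sh", "alice:x:1001:100::/home/alice:/bin/sh"]
print(merge_passwd(local, nis))
```

Because the merged file is swapped in with a single rename, a slow NIS transfer only delays the update; it never leaves a window in which accounts are missing.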

Diederik
 

TomBascom

Curmudgeon
I'm not sure that your problem description is clear.

Sure, if NIS et al. aren't working properly then logins aren't going to function. But that has nothing to do with Progress. Other network issues (like reverse name lookup problems) might, however, impact Progress connections, especially if you are using client/server connections. Are -H & -S part of your client startup parameters? If you're using telnet connections, and you seem to be, then you shouldn't need -H or -S.

If you're getting beyond the OS login and trying to start a Progress session then issues like -n come into play.

If the problem is that you're running out of -n entries then, as I said, a quick fix is to increase -n.

If this is the case then a longer term fix, of course, needs to focus on why you're running out of -n (if you are in fact running out).

You say that the problem occurs even with 4 users. I wonder how many connections Progress thinks that you have when you think you have 4 users. You might want to check PROMON to make sure that users that you think have been disconnected have actually been disconnected. There have been some bugs that have allowed users to linger after disconnect causing the connection table to fill up. That would lead to eventually running out of -n. If you look at the "user control" screen in PROMON and there are lots of users that shouldn't be there that might be a clue about what's going on. Then we would need to dig deeper into why they aren't going away.

OTOH you may have fixed the problem with the changes from your last post.
 

ddegroot

Member
Hi Tom,

Thanks again for your reply. I checked my prostats.log files against /var/adm/wtmp and they are pretty synchronous: a user logs out of the database and then out of telnet. 95% of our users are telnet users, BTW. We are running WatchDog and it catches all users who kill the telnet client without logging out nicely. That happens approx. 4 times a day. We took the [X] out of the Windows telnet client, so users have to go into the Task Manager to actually kill the telnet client. Logging out nicely is mostly their only option.

So no stale logged-in clients as far as I can tell. Taking NIS out of the loop, however, seems to have fixed our problems with Progress. We have pretty tight user rights on the machine to prevent tampering, and that might cause part of the problem when NIS can't update fast enough. I have seen NIS updates take about 1-2 minutes, during which not all users exist. I can imagine that logged-in users who suddenly do not exist anymore, and therefore are not allowed to read/write the database, BI and log files, might have a bit of a problem. Funny thing is, I would expect Progress to handle these events, but it doesn't.

I'm going to use our test database to see what happens when I'm logged in and I delete my own user account in the meantime. If this causes the same problem, I know I'm on the right track. If not, I will have to search further. It's a shame that Tru64 5.1B doesn't cache user accounts properly during updates. It should, as far as the documentation is concerned; then again, 5.1B seems to have loads of other users with problems that were not there in 4.0?.

Thanks, Tom, for your comments and insight,

Diederik
 

TomBascom

Curmudgeon
"Thanks again for your reply. I checked my prostats.log files against /var/adm/wtmp and they are pretty synchronous: a user logs out of the database and then out of telnet."

Where does "prostats.log" come from? It looks like a PROMON screen scraper of some sort.

It also looks like it might be the source of your problem -- I see lots of MON processes running. I cannot think of a good reason for that. Something is fishy about that.

"95% of our users are telnet users, BTW."

Ok, but telnet by itself doesn't rule out client/server. (OTOH your prostats data does seem to since they do appear to be connecting as SELF service clients rather than REMC).

"We are running WatchDog and it catches all users who kill the telnet client without logging out nicely."

That isn't what WDOG does. WDOG removes, from the database, self-service database connections that have no corresponding OS process.
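
To make that distinction concrete, here is a small sketch of the kind of check described: for each self-service connection, ask the OS whether the recorded process still exists. This illustrates the idea only, not how WDOG is actually implemented; the (user number, PID) pairs are a hypothetical shape for data that would have to come from the database's user list.

```python
import os
import subprocess
import sys


def process_exists(pid):
    """True if the OS still knows a process with this PID."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only performs the checks
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def orphaned_connections(connections):
    """connections: iterable of (user_number, pid) pairs (hypothetical shape)."""
    return [(usr, pid) for usr, pid in connections if not process_exists(pid)]


# Demo: our own PID is alive; a child that has exited and been reaped is not.
dead = subprocess.Popen([sys.executable, "-c", "pass"])
dead.wait()
print(orphaned_connections([(1, os.getpid()), (2, dead.pid)]))
```

Note what this does not catch: a user who closed the telnet window while the _progres process is still alive on the server still has a "corresponding OS process", so a check like this leaves that connection alone.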

"That happens approx. 4 times a day. We took the [X] out of the Windows telnet client, so users have to go into the Task Manager to actually kill the telnet client. Logging out nicely is mostly their only option."

Never under-estimate the cleverness of a motivated user :( Alt-F4 or The Big Red Button also work to close windows rudely. "X" is actually relatively benign -- it should result in SIGHUP being delivered to _progres and a nice clean exit. Unless you're doing something counter-productive like trapping signals on the UNIX side.

"So no stale logged-in clients as far as I can tell."

The kinds of problems that I was referring to would not show up in that manner. You would see them as connections that won't go away and which WDOG doesn't disconnect either. They get sort of "hung" in the disconnect process.

"Taking NIS out of the loop, however, seems to have fixed our problems with Progress. We have pretty tight user rights on the machine to prevent tampering, and that might cause part of the problem when NIS can't update fast enough. I have seen NIS updates take about 1-2 minutes, during which not all users exist. I can imagine that logged-in users who suddenly do not exist anymore, and therefore are not allowed to read/write the database, BI and log files, might have a bit of a problem. Funny thing is, I would expect Progress to handle these events, but it doesn't."

File r/w permissions are established when a file is opened. When a Progress session starts it opens all of the database file handles using its effective UID (usually setuid to "root" -- check the perms on the $DLC/bin/_progres executable...) It then falls back to the real userid relinquishing the setuid state on the way. These files are never closed and reopened so deleting a user would have no effect. The only files that are opened after startup are input and output streams used by 4GL programs and none of them would be subject to being opened once the decision to terminate a process has been made -- whether that is as a result of a "nice" logout or some kind of abnormal ending.
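
The "checked at open time" behaviour is easy to see in isolation. The sketch below is an analogous experiment, not the Progress code path: it changes the file's mode rather than deleting a user, and shows that a handle opened earlier keeps working after all access bits are removed, while only a fresh open is checked against the new mode.

```python
import os
import tempfile


def write_after_chmod():
    """Open a file, revoke all permissions, then keep writing via the open handle."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w") as handle:
            os.chmod(path, 0o000)             # revoke everything after the open
            handle.write("still writable\n")  # works: access was checked at open()
        try:
            with open(path, "a"):
                reopened = True   # expected when running as root
        except PermissionError:
            reopened = False      # a new open() is checked against the new mode
        os.chmod(path, 0o600)     # restore access so the result can be read back
        with open(path) as check:
            return check.read(), reopened
    finally:
        os.remove(path)


content, reopened = write_after_chmod()
print(repr(content), "reopen allowed:", reopened)
```

So a process that already holds its file handles is unaffected by later permission changes; only files that get opened again afterwards can fail.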

"I'm going to use our test database to see what happens when I'm logged in and I delete my own user account in the meantime. If this causes the same problem, I know I'm on the right track. If not, I will have to search further. It's a shame that Tru64 5.1B doesn't cache user accounts properly during updates. It should, as far as the documentation is concerned; then again, 5.1B seems to have loads of other users with problems that were not there in 4.0?."

"Thanks, Tom, for your comments and insight,

Diederik"

Hopefully it's useful :awink:
 

ddegroot

Member
"Where does "prostats.log" come from? It looks like a PROMON screen scraper of some sort.

It also looks like it might be the source of your problem -- I see lots of MON processes running. I cannot think of a good reason for that. Something is fishy about that."

Hi Tom,
Yes, it is a screen scraper which runs continuously using one input FIFO and one output FIFO.

The other monitor processes were my own. I was just looking to see if I could spot anything. Normally we have only one monitor on the production database.

"Ok, but telnet by itself doesn't rule out client/server. (OTOH your prostats data does seem to since they do appear to be connecting as SELF service clients rather than REMC)."
I know, but I'm sure. We do not use any -S or -H in our client connections from the terminal. There are only 5-6 GUI clients that connect via client/server.

"We are running WatchDog and it catches all user who kill the telnet client without logging out nicely."

"That isn't what WDOG does. WDOG removes, from the database, self-service database connections that have no corresponding OS process."
Does that not include self-service sessions whose telnet connection / _progres process is no longer running?

"Never under-estimate the cleverness of a motivated user :( Alt-F4 or The Big Red Button also work to close windows rudely. "X" is actually relatively benign -- it should result in SIGHUP being delivered to _progres and a nice clean exit. Unless you're doing something counter-productive like trapping signals on the UNIX side."
Nope, those do not work anymore, not in my version of the telnet application :) The big grey button on the front of the computer and the Task Manager are about their only options (besides pskill).
We are not trapping SIGHUP anymore. I had to take that out of the start script delivered by the QAD implementation partner (?? years ago). They made some more "intelligent" choices that had to be corrected, like installing on RAID 5.

"The kinds of problems that I was referring to would not show up in that manner. You would see them as connections that won't go away and which WDOG doesn't disconnect either. They get sort of "hung" in the disconnect process."
I now see where you are going. I thought of that one too, and was looking for zombie processes, disconnected telnet sessions, etc., but to no avail. Clients were still connected and could not get out of the session using the standard exit method from mfmenu.p.

"File r/w permissions are established when a file is opened. When a Progress session starts it opens all of the database file handles using its effective UID (usually setuid to "root" -- check the perms on the $DLC/bin/_progres executable...) It then falls back to the real userid relinquishing the setuid state on the way. These files are never closed and reopened so deleting a user would have no effect. The only files that are opened after startup are input and output streams used by 4GL programs and none of them would be subject to being opened once the decision to terminate a process has been made -- whether that is as a result of a "nice" logout or some kind of abnormal ending."
How about this one: on a self-service client, which process writes to the database .lg file? If you change the rights on the database .lg file to 0655, users can neither log in nor log out. Changing the rights back fixes logging in, but not logging out for users who are already logged in. At least not in 7.3E. Strange behaviour from an OS standpoint indeed, but I must say I haven't figured this one out.

The problem in the meantime:
It doesn't exist anymore since we moved back to /etc/passwd and /etc/group. All other parameters, such as the BI file and .pf files, are back to the situation before the problems occurred.

I have copied the current production machine to our standby machine using a bootable tape to do some testing, and have managed to reproduce the problem by using NIS again and running several bzip2 processes to put a heavy load on the machine. Again the NIS update took up to a minute, and my Progress processes were stranded inside MFGPro without the option to log out, even after the NIS update had finally completed and all the bzip2 processes were done. No logging in the .lg files and no errors in /var/log/* or /var/adm/*.

"Hopefully it's useful :awink:"

Thanks again Tom,

Most helpful; nice to have a sounding board when fixing problems.

Funny that nobody else has had these problems. I have noticed, though, how many people are experiencing problems with the NIS implementation in Tru64 5.1-4 compared to previous implementations on 4.0F. Update problems seem to be prevalent in this version. HP has not acknowledged or fixed any of these up till now...

Regards,

Diederik
 

TomBascom

Curmudgeon
I suspect that you aren't seeing a lot of similar problems in the community because:

1) Tru-64 is a pretty small community in the world at large.

2) Ditto for Progress.

3) Progress on Tru-64 is an even tinier slice of #1.

4) You're dealing with outdated and obsolete versions of both Tru-64 and Progress.

As for the problem itself... I don't know for sure but the .lg file may be an exception to my earlier comment about files being opened at startup and never again. I have a vague memory of needing to make sure that the .lg file is always writable under some circumstances. You might want to try a simple chmod 666 dbname.lg.
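
For what it's worth, here is a quick way to see why a mode like 0655 locks people out while 666 does not, assuming (as suspected above, not confirmed) that each self-service client writes to the .lg file under its own user ID. The helper names are invented for this sketch.

```python
import os
import stat


def describe(mode):
    """Render permission bits the way 'ls -l' would for a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


def writable_by_everyone(mode):
    """True only if owner, group and others all have the write bit."""
    needed = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    return (mode & needed) == needed


def file_mode(path):
    """Permission bits of an existing file, e.g. file_mode("dbname.lg")."""
    return stat.S_IMODE(os.stat(path).st_mode)


for mode in (0o655, 0o666):
    print(oct(mode), describe(mode), writable_by_everyone(mode))
```

With 0655 only the owner has write access, so any client running as a different user would fail to append to the log; 666 gives every user the write bit.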
 

ddegroot

Member
The .lg file rights problem was my own dumb fault. I wrapped a script in a script in a script and therefore did not see the cause. The backup script runs the truncate_log script, and that doesn't do its job correctly. I was blaming probkup, but it had nothing to do with it. The truncate_log script had changed between the old machine and the new one, and of course I never gave it a second look :) During testing it never got to truncate the .lg file because the file hadn't grown big enough.

I guess testing should be done with users :)

Thanks for all the input. I'm almost starting to think you have too much time on your hands, helping me out like this... :awink:

I promise I'll help out some more users in the future, like you do (where I can, of course). Life has been eventful in the last week :eek:.

Regards,

Diederik
 