From apwww@hyperreal.org Thu Oct 2 08:49:22 1997
Received: (from apwww@localhost)
    by hyperreal.org (8.8.5/8.8.5) id IAA06661;
    Thu, 2 Oct 1997 08:49:22 -0700 (PDT)
Message-Id: <199710021549.IAA06661@hyperreal.org>
Date: Thu, 2 Oct 1997 08:49:22 -0700 (PDT)
From: Eugene Crosser
Reply-To: crosser@average.org
To: apbugs@hyperreal.org
Subject: server processes in keepalive state do not die after keepalive-timeout
X-Send-Pr-Version: 3.2

>Number:         1190
>Category:       os-solaris
>Synopsis:       server processes in keepalive state do not die after keepalive-timeout
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    apache
>State:          closed
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   apache
>Arrival-Date:   Thu Oct 2 08:50:01 1997
>Closed-Date:    Tue Mar 26 06:25:16 PST 2002
>Last-Modified:  Tue Mar 26 06:25:16 PST 2002
>Originator:     crosser@average.org
>Release:        1.3.0
>Organization:
>Environment:
SunOS phobos 5.5.1 Generic_103640-03 sun4m sparc SUNW,SPARCstation-20
SunOS mars 5.5.1 Generic_103640-05 sun4u sparc SUNW,Ultra-Enterprise
gcc 2.7.2
>Description:
Under unknown circumstances, server processes stay in `K' (keepalive)
status infinitely. On a server with 1300 hits/day, there are 5 to 15
such processes a day. TCP connection is in `ESTABLISHED' state, and
later disappears from `netstat' display. When you try to `kill -ALRM'
the process (as if timeout expired), nothing happens. If you attach
the process in `gdb', you see that it is peacefully reading from the
socket. As days pass, more and more `K' processes are hanging around
and eventually reach MaxClient limit.

This does not happen with 1.1.3 running with *exactly* same config.
This apparently does not happen on other operating systems.
I could *not* reproduce it by telnetting, requesting a file with
keepalive and waiting: in this situation server gracefully closes
connection after keepalive-timeout.
>How-To-Repeat:
Just let the server run for a few hours...
>Fix:
No
>Release-Note:
>Audit-Trail:
State-Changed-From-To: open-feedback
State-Changed-By: Lars.Eilebrecht@unix-ag.org
State-Changed-When: Sat Oct 18 05:35:51 PDT 1997
State-Changed-Why:
I cannot verify your problem (I'm using Solaris 2.4/2.5 myself).
Please mail your httpd.conf (at least the important directives,
e.g. Timeout, KeepAlive* etc.).
And please check if you have the latest Sun (tcp-)patches installed
on your system.

State-Changed-From-To: feedback-closed
State-Changed-By: Lars.Eilebrecht@unix-ag.org
State-Changed-When: Tue Dec 16 08:34:09 PST 1997
State-Changed-Why:
No response from submitter, assuming problem resolved.

State-Changed-From-To: closed-analyzed
State-Changed-By: Lars.Eilebrecht@unix-ag.org
State-Changed-When: Tue Dec 16 18:10:29 PST 1997
State-Changed-Why:
Sorry, PR was closed by mistake (I overlooked your reply).

But I still don't have an idea why you see that many processes
in keep-alive state...

Do you see any messages in your error log?
What are your settings in httpd.conf for KeepAliveTimeout,
MaxKeepAliveRequests etc.?
Do you have any special settings for your tcp driver
(e.g. do you use 'ndd' to tune /dev/tcp values)?
Your logfile directory (and thus the lockfile) isn't
located on an NFS mounted filesystem, is it?

(Maybe you want to try our latest 1.3beta of Apache.)

Release-Changed-From-To: 1.2.0 and 1.2.4-1.2.4
Release-Changed-By: Lars.Eilebrecht@unix-ag.org
Release-Changed-When: Tue Dec 16 18:10:29 PST 1997

From: Dean Gaudet
To: Eugene Crosser
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Dec 1997 10:34:54 -0800 (PST)

Do you have any third party modules compiled in?

Are you using IdentityCheck?

Are you using the proxy module?

1.3 includes a rewritten alarm system... so it may actually work. But
this is an odd problem I've never seen before. Including on ultra 2,
2 processor 2.5.1 systems under high load.
(In your original description you wrote "1300 hits/day" ... I think
you mean a few orders of magnitude more, right? :)

It's possible we still have a race condition which manifests itself
only on solaris.

Did you ever try doing kill -ALRM to one of the stuck keepalive
children? Did it recover?

Dean

From: Eugene Crosser
To: Dean Gaudet
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Fri, 19 Dec 1997 01:54:13 +0300 (MSK)

> Do you have any third party modules compiled in?

Normally I do but I specially compiled a virgin version to check, and
it behaves exactly the same way.

> Are you using IdentityCheck?

Probably no as I don't know what's that ;)

> Are you using the proxy module?

No.

> 1.3 includes a rewritten alarm system... so it may actually work. But

I will compile and try 1.3, maybe next week. I will report the results.

> you wrote "1300 hits/day" ... I think you mean a few orders of magnitude
> more, right? :)

No. The server is not *really* busy, I just mean that this cannot be
reproduced on a `test only' server, you need people coming from various
places and leaving connections up for a long time.

> It's possible we still have a race condition which
> manifests itself only on solaris.

It very well might be a bug in Solaris, but as 1.1 works fine, a
workaround must exist...

> Did you ever try doing kill -ALRM to one of the stuck keepalive children?
> Did it recover?

I wrote it twice in my reports: yes I tried to kill the processes with
-ALRM, and *no*, the process does not notice it! But if I kill it with
-TERM, it gracefully terminates.

Eugene

From: Dean Gaudet
To: Eugene Crosser
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Dec 1997 16:32:49 -0800 (PST)

On Fri, 19 Dec 1997, Eugene Crosser wrote:

> > you wrote "1300 hits/day" ... I think you mean a few orders of magnitude
> > more, right? :)
>
> No.
> The server is not *really* busy, I just mean that this cannot be
> reproduced on a `test only' server, you need people coming from various
> places and leaving connections up for a long time.

But 1300 hits/*day* is barely any load at all. That's why I'm asking what
the real number is.

> > Did you ever try doing kill -ALRM to one of the stuck keepalive children?
> > Did it recover?
>
> I wrote it twice in my reports: yes I tried to kill the processes with
> -ALRM, and *no*, the process does not notice it! But if I kill it with
> -TERM, it gracefully terminates.

I just had to ask to be certain. This pretty much confirms that the
signal handler hasn't even been installed.

One more thing you can try if you can get to this state again is to send a
kill -PIPE to the pid -- it should close up the connection and continue.
That'll narrow it down a wee bit more.

Dean

From: Eugene Crosser
To: Dean Gaudet
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Sat, 20 Dec 1997 11:37:49 +0300 (MSK)

> But 1300 hits/*day* is barely any load at all. That's why I'm asking what
> the real number is.

Oh, yes, you are right! Probably it really was 13,000/day. And
approximately one request in 2000 was causing a lethargic process.

Anyway, from the first try, 1.3b3 does *not* have the reported problem:

Current Time: Sat Dec 20 11:10:01 1997
Restart Time: Fri Dec 19 21:52:17 1997
Server uptime: 13 hours 17 minutes 44 seconds
Total accesses: 14386 - Total Traffic: 13.8 MB
CPU Usage: u0.75 s1.18 cu0 cs0 - 0.00403% CPU load
0.301 requests/sec - 302 B/second - 1005 B/request
1 requests currently being processed, 9 idle servers

and there are no sleeping keepalive processes.

Of course I need to run it more and watch. If things go wrong I will
report it but for now it seems that the new version has cured my problem.
(Bad news is that there is no SSL patch for 1.3 yet, but you cannot have
everything ;)

Thank you all for your assistance, and for an excellent www server.

Eugene

State-Changed-From-To: analyzed-closed
State-Changed-By: dgaudet
State-Changed-When: Sat Dec 20 12:21:04 PST 1997
State-Changed-Why:
User reports problem is solved in 1.3. While it'd be nice to know
what race condition we're avoiding we just don't have the bandwidth
to track it down.

Dean

From: Eugene Crosser
To: apbugs@apache.org
Cc:
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Wed, 17 Jun 1998 15:48:50 +0400 (MSD)

Dear apache team,

the problem in PR#1190 that disappeared in 1.3beta3 is present again
in the 1.3.0 release. For full description, please see the PR transcript.
In short, approximately every 1000th request leaves the server process in
"Keepalive" status *forever*, it does not disconnect after
Keepalive-Timeout, and after a few hours of operation, all MaxClient
processes are in "K" status and no new requests are processed.

The system is

SunOS mars 5.5.1 Generic_103640-18 sun4u sparc SUNW,Ultra-Enterprise

running on a dual processor Ultra 4000; previously, it was observed on
a dual processor SS20.

I never tried 1.3 beta releases other than 1.3b3, and I have no access to
the sources of beta releases. If you tell me where I can get betas from 4
to the last (7?), I will look at exactly which release the problem
reappeared in.

Thank you.

Eugene

State-Changed-From-To: closed-open
State-Changed-By: coar
State-Changed-When: Wed Jun 17 15:18:53 PDT 1998
State-Changed-Why:
[Problem is back..]
Release-Changed-From-To: 1.2.4-1.3.0
Release-Changed-By: coar
Release-Changed-When: Wed Jun 17 15:18:53 PDT 1998

From: Dean Gaudet
To: Eugene Crosser
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Wed, 17 Jun 1998 22:56:12 -0700 (PDT)

Hmm, we don't appear to have all the betas online. Bleh. Here's a few
suggestions.

- If your ServerRoot is on NFS then you need to use the LockFile directive
  to move the lock file.

- Try editing src/include/httpd.h, search for OPTIMIZE_TIMEOUTS, and
  comment out that line, so that it is not defined. Then recompile and see
  if that helps any.

- You're not using any other options for compiling are you? Are you using
  gcc or solaris cc?

Dean

From: Dean Gaudet
To: Eugene Crosser
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Wed, 17 Jun 1998 23:36:49 -0700 (PDT)

Maybe you could try this patch. It looks like there's a small race
condition with keepalive timeouts... but I don't understand why we don't
see it more frequently.

BTW, please set "LogLevel debug" in your httpd.conf and tell me if you get
any of those "possible nested timer" warnings.

Thanks
Dean

Index: main/http_main.c
===================================================================
RCS file: /export/home/cvs/apache-1.3/src/main/http_main.c,v
retrieving revision 1.365
diff -u -r1.365 http_main.c
--- http_main.c 1998/06/16 03:37:27 1.365
+++ http_main.c 1998/06/18 06:18:53
@@ -975,6 +975,7 @@
     }
     else { /* abort the connection */
         ap_bsetflag(current_conn->client, B_EOUT, 1);
+        ap_bclose(current_conn->client);
         current_conn->aborted = 1;
     }
 }
@@ -1045,9 +1046,11 @@
         alarm_expiry_time = time(NULL) + x;
     }
 #else
-    if (x) {
-        alarm_fn = fn;
+    if (alarm_fn && x && fn != alarm_fn) {
+        ap_log_error(APLOG_MARK, APLOG_NOERRNO|APLOG_DEBUG, NULL,
+            "ap_set_callback_and_alarm: possible nested timer!");
     }
+    alarm_fn = fn;
 #ifndef OPTIMIZE_TIMEOUTS
     old = alarm(x);
 #else
Index: main/rfc1413.c
===================================================================
RCS file: /export/home/cvs/apache-1.3/src/main/rfc1413.c,v
retrieving revision 1.24
diff -u -r1.24 rfc1413.c
--- rfc1413.c 1998/05/18 21:56:11 1.24
+++ rfc1413.c 1998/06/18 06:18:53
@@ -229,9 +229,8 @@
         if (get_rfc1413(sock, &conn->local_addr, &conn->remote_addr, user,
                         srv) >= 0)
             result = user;
-
-        ap_set_callback_and_alarm(NULL, 0);
     }
+    ap_set_callback_and_alarm(NULL, 0);
     ap_pclosesocket(conn->pool, sock);
     conn->remote_logname = result;

From: Eugene Crosser
To: Dean Gaudet
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Jun 1998 16:00:49 +0400 (MSD)

Dean,

> Hmm, we don't appear to have all the betas online. Bleh. Here's a few
> suggestions.

I've already found all needed betas at ftp://ftp.apache.org/httpd/dist/
and downloaded them a minute ago. When I have results, I'll post.

> - If your ServerRoot is on NFS then you need to use the LockFile directive
>   to move the lock file.

No it is not on the NFS.

> - Try editing src/include/httpd.h, search for OPTIMIZE_TIMEOUTS, and
>   comment out that line, so that it is not defined.
>   Then recompile and see
>   if that helps any.

I'll try it and tell you.

> - You're not using any other options for compiling are you? Are you using
>   gcc or solaris cc?

I currently cannot compile a "clean" server: I am using PHP3 and a few
modules of my own, *but* 1.3b3 works well in *exactly the same*
configuration. I am using gcc (2.7.2.2 I think).

(You know, the problem only shows up if there is considerable traffic...
And I cannot have a production server without extra modules)

I'll be back when I have more information.

Eugene

From: Eugene Crosser
To: Dean Gaudet
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Jun 1998 17:59:28 +0400 (MSD)

> Maybe you could try this patch. It looks like there's a small race
> condition with keepalive timeouts... but I don't understand why we don't
> see it more frequently.

Did not change anything: after processing ~12,000 requests, some ~70
server processes fall into "permanent keepalive".

> BTW, please set "LogLevel debug" in your httpd.conf and tell me if you get
> any of those "possible nested timer" warnings.

No such messages.

Started 1.3b6 right now, will have results in a few hours.

Eugene

From: Eugene Crosser
To: Dean Gaudet
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Jun 1998 20:40:58 +0400 (MSD)

OK, I am now pretty certain that the fatal change happened between
beta 5 and beta 6.

Eugene

From: Eugene Crosser
To: Dean Gaudet
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Jun 1998 21:18:14 +0400 (MSD)

> - Try editing src/include/httpd.h, search for OPTIMIZE_TIMEOUTS, and
>   comment out that line, so that it is not defined. Then recompile and see
>   if that helps any.

I did this: no change.

Now, I did a bit more investigation.
First, sometimes the processes that stayed in "Keepalive" status for
several minutes (i.e. much longer than the KeepaliveTimeout) still
finish. I assume this may happen when the other end explicitly closes
the TCP connection. Next, if I choose a process that is staying
"keepalive" for a long time and send it "kill -ALRM" it does not notice
it and stays in the same status. If I send it "kill -PIPE" it gracefully
resets and is ready to serve next requests.

That's all for now. Please tell me what else I can do to help chasing
the problem.

Eugene

From: Dean Gaudet
To: Eugene Crosser
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Jun 1998 10:36:10 -0700 (PDT)

1.3b5 used USE_PTHREAD_SERIALIZED_ACCEPT on solaris, and 1.3b6 uses
USE_FCNTL_SERIALIZED_ACCEPT (as do all the 1.2.x and earlier versions).
We switched back to fcntl because the pthread stuff was proving unreliable
for a lot of folks... You could try adding
EXTRA_CFLAGS=-DUSE_PTHREAD_SERIALIZED_ACCEPT and reconfiguring/compiling.
But I really don't see how this will help.

That's the only thing I can find...

Dean

On Thu, 18 Jun 1998, Eugene Crosser wrote:

> OK, I am now pretty certain that the fatal change happened between
> beta 5 and beta 6.
>
> Eugene
>

From: Dean Gaudet
To: Eugene Crosser
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Jun 1998 10:37:24 -0700 (PDT)

Yeah this sounds like the signal handler has been removed... which is why
I sent the patch yesterday.

You said you're using php, right? Is it the latest php? I believe php
also plays with timeouts... do you connect to any sql databases with it?

Dean

On Thu, 18 Jun 1998, Eugene Crosser wrote:

> > - Try editing src/include/httpd.h, search for OPTIMIZE_TIMEOUTS, and
> >   comment out that line, so that it is not defined. Then recompile and see
> >   if that helps any.
>
> I did this: no change.
>
> Now, I did a bit more investigation. First, sometimes the processes that
> stayed in "Keepalive" status for several minutes (i.e. much longer than
> the KeepaliveTimeout) still finish. I assume this may happen when the
> other end explicitly closes the TCP connection. Next, if I choose a
> process that is staying "keepalive" for a long time and send it
> "kill -ALRM" it does not notice it and stays in the same status. If I
> send it "kill -PIPE" it gracefully resets and is ready to serve next
> requests.
>
> That's all for now. Please tell me what else I can do to help chasing
> the problem.
>
> Eugene
>

From: Eugene Crosser
To: Dean Gaudet
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Thu, 18 Jun 1998 22:04:48 +0400 (MSD)

> Yeah this sounds like the signal handler has been removed... which is why
> I sent the patch yesterday.
>
> You said you're using php, right? Is it the latest php? I believe php
> also plays with timeouts... do you connect to any sql databases with it?

Yes I do, it's Oracle. Back in October 97, it was my first idea to check.
Back then, I did build a "clean" server and the problem persisted. Also
note that betas 3 and 5 are working flawlessly with exactly the same PHP
(it's the 3.0 release). Beta 3 (with an older PHP3) was running here in
production without even a minor problem for half a year!

Now, the 1.3.0 release compiled with -DUSE_PTHREAD_SERIALIZED_ACCEPT has
been running here for 25 minutes. It seems that it still suffers from the
problem.

Switching back to beta5 and going home for some sleep...

Eugene

State-Changed-From-To: open-feedback
State-Changed-By: lars
State-Changed-When: Sat Jul 18 13:18:50 PDT 1998
State-Changed-Why:
[This is a standard response.]
This Apache problem report has not been updated recently.
Please reply to this message if you have any additional
information about this issue, or if you have answers to
any questions that have been posed to you. If there are
no outstanding questions, please consider this a request
to try to reproduce the problem with the latest software
release, if one has been made since last contact. If we
don't hear from you, this report will be closed.

From: Marc Slemko
To: apbugs@apache.org
Cc:
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout (fwd)
Date: Sun, 19 Jul 1998 10:35:50 -0700 (PDT)

---------- Forwarded message ----------
Date: Sun, 19 Jul 1998 14:41:31 +0400 (MSD)
From: Eugene Crosser
To: lars@apache.org
Cc: apache-bugdb@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout

> This Apache problem report has not been updated recently.
> Please reply to this message if you have any additional
> information about this issue, or if you have answers to
> any questions that have been posed to you. If there are
> no outstanding questions, please consider this a request
> to try to reproduce the problem with the latest software
> release, if one has been made since last contact. If we
> don't hear from you, this report will be closed.

The problem did *not* disappear automagically. I sent all the information
that I thought was relevant, and followed all the advice that I got from
the apache team. So far, it did not help. If I can do further
investigation, please advise.

So far, the situation is as follows: apache 1.1.x works when keepalive
is used, apache 1.2.x and 1.3.0 *releases* do not work for me with
keepalive. Although 1.3 betas 3 to 5 do work. Currently, I have to run
1.3.0 with keepalive disabled.

Eugene

State-Changed-From-To: feedback-open
State-Changed-By: coar
State-Changed-When: Mon Sep 7 05:52:21 PDT 1998
State-Changed-Why:
[Issue is still open..]

State-Changed-From-To: open-feedback
State-Changed-By: lars
State-Changed-When: Sat Nov 14 08:17:02 PST 1998
State-Changed-Why:
Can you please recompile Apache with "-DNO_WRITEV"
and test if this 'fixes' your problem?

From: Eugene Crosser
To: apbugs@Apache.Org
Cc:
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Wed, 9 Dec 1998 11:56:38 +0300 (MSK)

On the request:

> Can you please recompile Apache with "-DNO_WRITEV"
> and test if this 'fixes' your problem?

I can report, that I have a server (1.3.3) built with mod_ssl, and the
latter adds "-DNO_WRITEV" option. No, on this server I still observe
hanging keepalive processes. Apparently, "-DNO_WRITEV" does not fix
the problem.

I cannot build a "clean" server (without any third party modules) at the
moment, because the problem can only be observed if the hit rate is
sufficiently high, i.e. on a production server only...

Eugene

Comment-Added-By: coar
Comment-Added-When: Wed May 3 13:16:48 PDT 2000
Comment-Added:
Is this still a problem with 1.3.12?

Comment-Added-By: coar
Comment-Added-When: Wed May 24 10:37:29 PDT 2000
Comment-Added:
[This is a standard response.]
This Apache problem report has not been updated recently.
Please reply to this message if you have any additional
information about this issue, or if you have answers to
any questions that have been posed to you. If there are
no outstanding questions, please consider this a request
to try to reproduce the problem with the latest software
release, if one has been made since last contact. If we
don't hear from you, this report will be closed.
If you have information to add, BE SURE to reply to this
message and include the apbugs@Apache.Org address so it
will be attached to the problem report!

From: "David J. MacKenzie"
To: apbugs@apache.org
Cc: djm@uu.net, rse@engelschall.com
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Mon, 20 Nov 2000 23:16:39 -0500 (EST)

We have just started experiencing what seems to be the same problem as
http://bugs.apache.org/index.cgi/full/1190
which was reported by a Solaris 2.5.1 user in 1998 and never resolved.
That person was also using mod_ssl and PHP, which seems to be relevant.
Also
http://bugs.apache.org/index.cgi/full/6211
may be related, though today I applied the patch in that PR to no
apparent effect.

We are using the newest versions of (almost) everything, on BSDI
BSD/OS 4.0.1. I have some additional data which should be helpful.
In short, the finger *seems* to point at mod_ssl as the culprit,
though I haven't looked at the code to see how that might be plausible.

A week ago UUNET upgraded our server farm of about 800 servers, of
which a few dozen have SSL, from apache 1.3.12 (for most servers) or
Stronghold 2.4.2 (for those that have SSL). They are now running:

apache 1.3.14, with two patches from bugs.apache.org to fix
  corrupting PDF files and mod_rewrite maps (the Bugtraq patch)
mod_ssl 2.7.1
OpenSSL 0.9.5a
PHP 4.0.3pl1
mod_auth_kerb configured for Kerberos v5

All modules except http_core and mod_so are loaded as DSO's.
All of the servers are using the same apache binary and DSO's,
compiled with EAPI, but we only LoadModule mod_ssl for those servers
that have SSL keys and certs.

We're not using Java or Perl modules, or anything that multithreads.
The BSD/OS pthreads are user-space anyway.

root@enniskillen 39 $ ldd /usr/local/libexec/apache
        libkrb5.so => /usr/local/krb5/lib/libkrb5.so (0xc054000)
        libk5crypto.so => /usr/local/krb5/lib/libk5crypto.so (0xc0b4000)
        libmm.so.11 => /usr/local/lib/libmm.so.11 (0xc0ce000)
        libdl.so => /shlib/libdl.so (0xc0d2000)
        libgcc.so => /shlib/libgcc.so (0xc0d5000)
        libc.so => /shlib/libc.so (0xc0d8000)
        libcom_err.so => /usr/local/krb5/lib/libcom_err.so (0xc15b000)

Our new apache+mod_ssl installation is not always handling HTTP
Keepalive correctly. It's configured to keep connections alive for 5
seconds, but it's not letting some of them go. We see the same
behavior described in PR 1190, in which over the course of a few hours
gradually most of the process slots become filled with Keepalive
connections that are much older than is supposed to be allowed.
Eventually our monitoring systems start alerting that they can't
connect to the servers. Some of the old connections eventually go
away on their own, perhaps those from dialup lines; I'm not sure.

I sampled the mod-status pages of several of our customers, loading
the page, waiting 30 seconds or more, and loading it again in a second
window, and comparing the lists. I looked for which child processes
had connections in the Keepalive state, and checked whether the amount
of data transferred had changed.

The random sample of about a dozen non-SSL customers I checked all
looked normal. Some of the customers I checked who have SSL showed
the problem.

For example, one server got a few http (not https) requests at 7:29
this morning from IP address 212.250.100.120, and none since. 12
hours later, the TCP connection is still open, and taking up 3 apache
process slots in the Keepalive state. The browser is
"Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)".

Another server shows the same sort of problem, with a connection at
1:13 this afternoon from 192.44.136.113 which lasted 3 seconds but is
still open:

root@platform-33: netstat -an | grep 192.44.136.113
tcp        0      0  208.240.90.209.80      192.44.136.113.39653   ESTABLISHED
tcp        0      0  208.240.90.209.80      192.44.136.113.39650   ESTABLISHED
tcp        0      0  208.240.90.209.80      192.44.136.113.39598   ESTABLISHED

Their mod-status page confirms that 3 child processes are still in
the Keepalive state for this IP address.
The browser is "Mozilla/4.5 [en] (Win98; I)".
That address is pingable:

root@platform-31: ping 192.44.136.113
PING 192.44.136.113 (192.44.136.113): 56 data bytes
64 bytes from 192.44.136.113: icmp_seq=0 ttl=246 time=23.961 ms

So the problem doesn't seem to depend on the browser (Netscape or MSIE).
I've seen it with clients on Windows 95/98 (mainly) and MacOS, and
I think on NT.

Most or all of the requests involved have been for static content.
The affected servers aren't using PHP.

Some of our SSL servers aren't showing the problem, but they are doing
little volume.

Late this afternoon I temporarily turned Keepalive off for the two
servers affected the worst, which kept failing our monitoring because
all child processes were in use. They went from 40-60 child processes
being used simultaneously to 2-13, though this wasn't in the busiest
part of the day.

I also found this comment on Slashdot from a year ago, at
http://slashdot.org/apache/99/12/22/1711203.shtml:

  I've tried both, and while admittedly mod_ssl looks cleaner, is
  easier to set up, and is updated more frequently, we had several
  problems with Microsoft and AOL clients connecting via SSL. All of
  these problems went away once we moved over to Apache-SSL. We tried
  fiddling with the keepalive and "unclean shutdown" settings to no
  avail with mod_ssl but it didn't seem to do any good.

I haven't tried Apache-SSL yet.

From: Tony Finch
To: "David J. MacKenzie"
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Tue, 21 Nov 2000 05:36:59 +0000

"David J. MacKenzie" wrote:
>
> In short, the finger *seems* to point at mod_ssl as the culprit,
> though I haven't looked at the code to see how that might be plausible.

If that is the case then you'll have to speak to the mod_ssl authors
because this bug database is only for the core Apache code.

Do I correctly gather that the stuck connections can be either SSL or
not? Do your non-SSL customers who do not have the problem run Apache
with EAPI?

Tony.
--
f.a.n.finch    dot@dotat.at    fanf@covalent.net
Chad for President!

From: djm@web.us.uu.net (David J. MacKenzie)
To: dot@dotat.at
Cc: apbugs@apache.org
Subject: Re: os-solaris/1190: server processes in keepalive state do not die after keepalive-timeout
Date: Tue, 21 Nov 2000 09:13:00 -0500 (EST)

> If that is the case then you'll have to speak to the mod_ssl authors
> because this bug database is only for the core Apache code.

Yes--I cc'd rse. At this point the cause is only a guess, though.

> Do I correctly gather that the stuck connections can be either SSL or
> not? Do your non-SSL customers who do not have the problem run Apache
> with EAPI?

None of the stuck connections that I've investigated so far have
turned out to be SSL connections, which proportionately aren't very
common.

The non-SSL servers are indeed the same apache binary, with everything
compiled with EAPI (which I said before if you look closely enough :-)
If any of them are having this problem, it hasn't been severe enough
to show up in our monitoring or common enough to show up in my spot
checking.

BTW, I don't see anything abnormal in the servers' error logs.

State-Changed-From-To: feedback-closed
State-Changed-By: jim
State-Changed-When: Tue Mar 26 06:25:16 PST 2002
State-Changed-Why:
[This is a standard response.]
No response from submitter, assuming issue has been resolved.
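
The spot check David MacKenzie describes above (load the mod-status page twice, more than one keepalive timeout apart, and look for children that are in the Keepalive state both times with no change in data transferred) can be sketched as a comparison of two snapshots. This is only an illustration under stated assumptions: `struct slot`, its fields and `find_stuck` are hypothetical, and the sketch does not parse real mod_status output.

```c
#include <stddef.h>

/* One scoreboard entry as it might be copied from a status page
 * (hypothetical layout, not an Apache structure). */
struct slot {
    int pid;              /* child process id */
    char state;           /* 'K' for keepalive, other letters otherwise */
    unsigned long bytes;  /* data transferred by this child so far */
};

/* Compare two snapshots taken more than one keepalive timeout apart.
 * A child counts as stuck when it is in 'K' state in both snapshots
 * and its byte count has not changed. Writes up to max pids into
 * stuck_pids and returns how many were found. */
size_t find_stuck(const struct slot *before, size_t nb,
                  const struct slot *after, size_t na,
                  int *stuck_pids, size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < na && found < max; i++) {
        if (after[i].state != 'K')
            continue;
        for (size_t j = 0; j < nb; j++) {
            if (before[j].pid == after[i].pid &&
                before[j].state == 'K' &&
                before[j].bytes == after[i].bytes) {
                stuck_pids[found++] = after[i].pid;
                break;
            }
        }
    }
    return found;
}
```

The interval between snapshots matters: with the 5 second keepalive timeout mentioned in the report, a wait of 30 seconds or more means a healthy child should have left the Keepalive state or served more data by the second sample.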
>Unformatted: