Received: (qmail 28877 invoked by uid 2012); 20 Nov 1997 19:31:57 -0000 Message-Id: <19971120193157.28876.qmail@hyperreal.org> Date: 20 Nov 1997 19:31:57 -0000 From: Randy Wiemer Reply-To: wiemer@law.missouri.edu To: apbugs@hyperreal.org Subject: Apache stops responding - kill -TERM PID leads to zombie must reboot X-Send-Pr-Version: 3.2 >Number: 1441 >Category: os-aix >Synopsis: Apache stops responding - kill -TERM PID leads to zombie must reboot >Confidential: no >Severity: critical >Priority: medium >Responsible: apache >State: closed >Class: sw-bug >Submitter-Id: apache >Arrival-Date: Thu Nov 20 11:40:00 PST 1997 >Last-Modified: Mon Dec 15 09:58:31 PST 1997 >Originator: wiemer@law.missouri.edu >Organization: >Release: 1.2.4 >Environment: RS6000 SP2 4-way 604 with AIX 4.2.1 (was fine with 4.2.0) Apache 1.2.1 and 1.2.4 have same problem. uname -a AIX sp2n13 2 4 00040362A400 >Description: AIX 4.2.1 upgraded last Sunday from AIX 4.2.0 with Apache 1.2.1 running reliably for several months. We needed 4.2.1 to support a processor upgrade from 4-way 604 to 8-way 604e in two weeks. Since upgrade Apache stops responding to service page requests but processes look fine from ps. Attempting to stop and restart Apache results in zombie process owned by the account running httpd holding a lock on port 80 preventing a restart of the httpd daemon. We switched to Apache 1.2.4 but continue to experience the same problem. >How-To-Repeat: I can't cause it to happen other than it seems load related because the outages occur during peak usage times. >Fix: Even if the outage can't be addressed is there a way to kill the httpd daemon that doesn't result in the zombie%3 >Audit-Trail: From: Dean Gaudet To: Randy Wiemer Cc: apbugs@hyperreal.org Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot Date: Thu, 20 Nov 1997 13:17:34 -0800 (PST) This feels like an OS bug ... have you tried any aix specific newsgroups? Dean From: Dean Gaudet To: apbugs@apache.org Cc: Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot -Reply (fwd) Date: Thu, 20 Nov 1997 15:53:16 -0800 (PST) ---------- Forwarded message ---------- Date: Thu, 20 Nov 1997 17:50:06 -0600 From: Randy Wiemer To: dgaudet@arctic.org Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot -Reply I also believe (know) it to be an OS bug but IBM is slow to act so I am pursuing all available options in trying to resolve the problem. IBM actually conducted some tests on my machine today and we are sending them a backup tape of the root volume so they can put it in their lab and test it. At this point they believe it is caused by an efix they had us apply. They claim to be running Apache 1.2.0 on AIX 4.2.1 in their lab with no problems. We took delivery of this machine in June and have encountered one problem after another simply running Oracle and Apache. Until now Apache has been rock solid and even now I don't really believe the fault is in the Apache code. IBM is less certain simply because it is only Apache that seems to be dying. Randy >>> Dean Gaudet 11/20/97 03:17pm >>> This feels like an OS bug ... have you tried any aix specific newsgroups? Dean From: Randy Wiemer To: wiemer@law.missouri.edu Cc: apbugs@Apache.Org Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot -Reply Date: Thu, 20 Nov 1997 18:28:19 -0600 I also believe (know) it to be an OS bug but IBM is slow to act so I am pursuing all available options in trying to resolve the problem. IBM actually conducted some tests on my machine today and we are sending them a backup tape of the root volume so they can put it in their lab and test it. At this point they believe it is caused by an efix they had us apply. They claim to be running Apache 1.2.0 on AIX 4.2.1 in their lab with no problems. We took delivery of this machine in June and have encountered one problem after another simply running Oracle and Apache. Until now Apache has been rock solid and even now I don't really believe the fault is in the Apache code. IBM is less certain simply because it is only Apache that seems to be dying. Randy >>> Dean Gaudet 11/20/97 03:17pm >>> This feels like an OS bug ... have you tried any aix specific newsgroups? Dean From: Randy Wiemer To: dgaudet@arctic.org Cc: apbugs@Apache.Org Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot -Reply Date: Thu, 20 Nov 1997 18:38:15 -0600 I have looked deeper into the Apache bug database. Problem 869 looks to raise three separate issues. I have resolved two of the three but the third problem seems to persist and might be my root cause. Problem 869 includes a make report with 4 items of unsigned long assigned to int, one error about an infinite loop program may not terminate and the ERROR: Undefined symbol: .__set_errno128 I can fix the unsigned long to int type by retyping the variables. I can fix the ERROR: Undefined symbol: .__set_errno128 by using flags -lm variable. I cannot fix the compiler's warning about the infinite loop. Might this be my zombie process? Might it also lead to the condition of the server ceasing to serve pages? I am going to find an AIX 4.2.0 machine and do the compile there and see if I get any of these errors. Randy From: Marc Slemko To: Randy Wiemer Cc: Apache bugs database Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot -Reply Date: Thu, 20 Nov 1997 18:18:08 -0700 (MST) On 21 Nov 1997, Randy Wiemer wrote: > The following reply was made to PR os-aix/1441; it has been noted by GNATS. > > From: Randy Wiemer > To: dgaudet@arctic.org > Cc: apbugs@Apache.Org > Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID > leads to zombie must reboot -Reply > Date: Thu, 20 Nov 1997 18:38:15 -0600 > > I have looked deeper into the Apache bug database. Problem 869 looks to raise > three separate issues. I have resolved two of the three but the third problem > seems to persist and might be my root cause. > > Problem 869 includes a make report with 4 items of unsigned long assigned to > int, one error about an infinite loop program may not terminate and the ERROR: > Undefined symbol: .__set_errno128 > > I can fix the unsigned long to int type by retyping the variables. I can fix > the ERROR: Undefined symbol: .__set_errno128 by using flags -lm variable. I > cannot fix the compiler's warning about the infinite loop. Might this be my > zombie process? Might it also lead to the condition of the server ceasing to > serve pages? No. It is unrelated. From: Randy Wiemer To: marcs@znep.com Cc: apbugs@apache.org Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot -Reply -Reply Date: Thu, 20 Nov 1997 20:05:46 -0600 IBM took my httpd binary put it on one of their 4.2.1 machines and were able to kill the process without any problems. We have fedexed them a tape backup of our root that they will install on one of their SMP machines to try to uncover the problem. I put both 1.2.0 and 1.2.4 on an AIX 4.2.0 RS6000 and got the same results during the make. The infinite loop warning still troubles me. I am installed gcc and will try with it. Randy State-Changed-From-To: open-feedback State-Changed-By: dgaudet State-Changed-When: Sat Dec 13 18:01:45 PST 1997 State-Changed-Why: When IBM gets back to you with any info please send it to us. Thanks Dean From: Randy Wiemer To: apache-bugdb@apache.org,dgaudet@apache.org, dgaudet@hyperreal.org Cc: apbugs@Apache.Org Subject: Re: os-aix/1441: Apache stops responding - kill -TERM PID leads to zombie must reboot -Reply Date: Mon, 15 Dec 1997 11:27:03 -0600 IBM provided us with an EFIX to replace an earlier EFIX that was applied because our SP2 node was locking-up running Oracle. The EFIX replaced two AIX 4.2.1 files: -r-xr-xr-x 1 root system 2691098 Nov 16 11:43 /usr/lib/boot/unix_mp -r-xr-xr-x 1 root system 152748 Nov 22 15:27 /usr/lib/drivers/netinet The author of the EFIX theorized that the cause was timing related and that we didn't experience the problem on AIX 4.2.0 because it manifested itself only when asynchronous events occured within a couple of instruction cycles of one-another. IBM revealed that we were not the only customer experiencing the problem. Randy Wiemer University of Missouri >>> 12/13/97 08:01pm >>> [In order for any reply to be added to the PR database, ] [you need to include in the Cc line ] [and leave the subject line UNCHANGED. This is not done] [automatically because of the potential for mail loops. ] Synopsis: Apache stops responding - kill -TERM PID leads to zombie must reboot State-Changed-From-To: open-feedback State-Changed-By: dgaudet State-Changed-When: Sat Dec 13 18:01:45 PST 1997 State-Changed-Why: When IBM gets back to you with any info please send it to us. Thanks Dean State-Changed-From-To: feedback-closed State-Changed-By: dgaudet State-Changed-When: Mon Dec 15 09:58:31 PST 1997 State-Changed-Why: Thanks for the info. I'm closing this out then. Thanks for using Apache! Dean >Unformatted: [In order for any reply to be added to the PR database, ] [you need to include in the Cc line ] [and leave the subject line UNCHANGED. This is not done] [automatically because of the potential for mail loops. ]