Received: (qmail 9656 invoked by uid 2012); 17 Feb 1999 21:30:58 -0000
Message-Id: <19990217213058.9655.qmail@hyperreal.org>
Date: 17 Feb 1999 21:30:58 -0000
From: Phillip Ezolt <ezolt@perf.zko.dec.com>
Reply-To: ezolt@perf.zko.dec.com
To: apbugs@hyperreal.org
Subject: Under high load, server hangs in "flock or fnctl".
X-Send-Pr-Version: 3.2

>Number:         3911
>Category:       os-linux
>Synopsis:       Under high load, server hangs in "flock or fnctl".
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    apache
>State:          closed
>Class:          sw-bug
>Submitter-Id:   apache
>Arrival-Date:   Wed Feb 17 13:40:01 PST 1999
>Last-Modified:  Tue Apr 20 16:34:39 PDT 1999
>Originator:     ezolt@perf.zko.dec.com
>Organization:
>Release:        1.3.4
>Environment:
Linux crappy.zko.dec.com 2.2.1 #5 Fri Feb 12 09:07:00 EST 1999 i686 unknown
gcc version egcs-2.91.60 19981201 (egcs-1.1.1 release)
OS version: Redhat 5.2
Intel PII/266 with FDDI interface card.

Kernel compile with the following things changed: 
I have changed the following kernel values in /usr/src/linux/include/net/tcp.h 
to:

#define TCP_HTABLE_SIZE         2048 (was 512)
#define TCP_LHTABLE_SIZE        128 (was 32)
#define TCP_BHTABLE_SIZE        2048   (was 512)                     

Everything is on a local filesystem.  The lockfiles are NOT on NFS.
>Description:
I can repeatably get the server to stop responding after signifcantly stressing
the system.   Initially, I had apache compiled with flock serialization.  After
a while, a large number of the httpd processes were stuck in the following state:

#0  0x400d49c1 in flock ()
#1  0x805aaa9 in accept_mutex_on ()
#2  0x805d6a5 in child_main ()
#3  0x805dc68 in make_child ()
#4  0x805dfe1 in perform_idle_server_maintenance ()
#5  0x805e4e9 in standalone_main ()
#6  0x805ea7b in main () 

There were a few with the following (What they SHOULD be.. )
#0  0x400de5c2 in __libc_accept ()
#1  0x805d7bc in child_main ()
#2  0x805dc68 in make_child ()
#3  0x805dd17 in startup_children ()
#4  0x805e328 in standalone_main ()
#5  0x805ea7b in main () 

When I would try to connect to the server (lynx http://127.0.0.1), it would 
just hang.  Normally, the response would be instaneous. 

I tried to recompile apache with FCNTL support, and the same thing occurs. 
This time the stack trace is:

0  0x400d4974 in __libc_fcntl ()
#1  0x1 in ?? ()
#2  0x805d66d in child_main ()
#3  0x805dc30 in make_child ()
#4  0x805dcdf in startup_children ()
#5  0x805e2f0 in standalone_main ()
#6  0x805ea43 in main ()  

There is some kind of race condition that occurs under a very heavy load. 

I am not sure if it is a linux, apache, or even glibc bug, but I really want to
get a good result here. 



>How-To-Repeat:
The load is SPECWeb96.  When I try to push my system above 60 Ops/Sec, this
occurs. I don't have an easy way for an external site to repeat it, but for the 
next week and a half, it is all I will be working on.  So, I can easily try out
any patches that anyone may have.  
>Fix:
None. 
>Audit-Trail:
State-Changed-From-To: open-analyzed
State-Changed-By: dgaudet
State-Changed-When: Tue Mar 16 08:22:22 PST 1999
State-Changed-Why:
Uh I certainly hope there's no more than one process in accept(),
otherwise the locking is completely broken.

This is almost certainly a kernel bug.  Perhaps try a 2.0.36
kernel instead.

Dean


State-Changed-From-To: analyzed-feedback
State-Changed-By: dgaudet
State-Changed-When: Tue Mar 16 08:23:18 PST 1999
State-Changed-Why:
er, stick this in feedback... I'm hoping you can test 2.0.36
and report back, thanks
State-Changed-From-To: feedback-closed
State-Changed-By: dgaudet
State-Changed-When: Tue Apr 20 16:34:39 PDT 1999
State-Changed-Why:
At any rate we switched back to fcntl locking in apache 1.3.6.
>Unformatted:
[In order for any reply to be added to the PR database, ]
[you need to include <apbugs@Apache.Org> in the Cc line ]
[and leave the subject line UNCHANGED.  This is not done]
[automatically because of the potential for mail loops. ]
[If you do not include this Cc, your reply may be ig-   ]
[nored unless you are responding to an explicit request ]
[from a developer.                                      ]
[Reply only with text; DO NOT SEND ATTACHMENTS!         ]