Received: (qmail 9656 invoked by uid 2012); 17 Feb 1999 21:30:58 -0000
Message-Id: <19990217213058.9655.qmail@hyperreal.org>
Date: 17 Feb 1999 21:30:58 -0000
From: Phillip Ezolt
Reply-To: ezolt@perf.zko.dec.com
To: apbugs@hyperreal.org
Subject: Under high load, server hangs in "flock or fnctl".
X-Send-Pr-Version: 3.2

>Number:         3911
>Category:       os-linux
>Synopsis:       Under high load, server hangs in "flock or fnctl".
>Confidential:   no
>Severity:       critical
>Priority:       medium
>Responsible:    apache
>State:          closed
>Class:          sw-bug
>Submitter-Id:   apache
>Arrival-Date:   Wed Feb 17 13:40:01 PST 1999
>Last-Modified:  Tue Apr 20 16:34:39 PDT 1999
>Originator:     ezolt@perf.zko.dec.com
>Organization:
>Release:        1.3.4
>Environment:
Linux crappy.zko.dec.com 2.2.1 #5 Fri Feb 12 09:07:00 EST 1999 i686 unknown
gcc version egcs-2.91.60 19981201 (egcs-1.1.1 release)
OS version: Redhat 5.2
Intel PII/266 with FDDI interface card.

Kernel compiled with the following things changed:
I have changed the following kernel values in
/usr/src/linux/include/net/tcp.h to:

#define TCP_HTABLE_SIZE 2048 (was 512)
#define TCP_LHTABLE_SIZE 128 (was 32)
#define TCP_BHTABLE_SIZE 2048 (was 512)

Everything is on a local filesystem. The lockfiles are NOT on NFS.
>Description:
I can repeatably get the server to stop responding after significantly
stressing the system.

Initially, I had Apache compiled with flock serialization. After a while,
a large number of the httpd processes were stuck in the following state:

#0 0x400d49c1 in flock ()
#1 0x805aaa9 in accept_mutex_on ()
#2 0x805d6a5 in child_main ()
#3 0x805dc68 in make_child ()
#4 0x805dfe1 in perform_idle_server_maintenance ()
#5 0x805e4e9 in standalone_main ()
#6 0x805ea7b in main ()

There were a few in the following state (what they SHOULD be):

#0 0x400de5c2 in __libc_accept ()
#1 0x805d7bc in child_main ()
#2 0x805dc68 in make_child ()
#3 0x805dd17 in startup_children ()
#4 0x805e328 in standalone_main ()
#5 0x805ea7b in main ()

When I tried to connect to the server (lynx http://127.0.0.1), it would
just hang. Normally, the response would be instantaneous.

I recompiled Apache with FCNTL support, and the same thing occurs.
This time the stack trace is:

#0 0x400d4974 in __libc_fcntl ()
#1 0x1 in ?? ()
#2 0x805d66d in child_main ()
#3 0x805dc30 in make_child ()
#4 0x805dcdf in startup_children ()
#5 0x805e2f0 in standalone_main ()
#6 0x805ea43 in main ()

There is some kind of race condition that occurs under a very heavy load.

I am not sure if it is a Linux, Apache, or even glibc bug, but I really
want to get a good result here.
>How-To-Repeat:
The load is SPECWeb96. When I try to push my system above 60 Ops/Sec,
this occurs.

I don't have an easy way for an external site to repeat it, but for the
next week and a half, it is all I will be working on. So, I can easily
try out any patches that anyone may have.
>Fix:
None.
>Audit-Trail:
State-Changed-From-To: open-analyzed
State-Changed-By: dgaudet
State-Changed-When: Tue Mar 16 08:22:22 PST 1999
State-Changed-Why:
Uh I certainly hope there's no more than one process in accept(),
otherwise the locking is completely broken.

This is almost certainly a kernel bug. Perhaps try a 2.0.36
kernel instead.

Dean

State-Changed-From-To: analyzed-feedback
State-Changed-By: dgaudet
State-Changed-When: Tue Mar 16 08:23:18 PST 1999
State-Changed-Why:
er, stick this in feedback... I'm hoping you can test 2.0.36
and report back, thanks

State-Changed-From-To: feedback-closed
State-Changed-By: dgaudet
State-Changed-When: Tue Apr 20 16:34:39 PDT 1999
State-Changed-Why:
At any rate we switched back to fcntl locking in apache 1.3.6.
>Unformatted: