Received: (qmail 18125 invoked by uid 2012); 12 Aug 1998 07:50:03 -0000 Message-Id: <19980812075003.18124.qmail@hyperreal.org> Date: 12 Aug 1998 07:50:03 -0000 From: Jay Soffian Reply-To: jay@cimedia.com To: apbugs@hyperreal.org Subject: Seeing lots of[Wed Aug 12 02:41:56 1998] access to /index_layout.html failed for 172.16.20.2, reason: stat: Stale NFS file handle (errno = 151) in error log X-Send-Pr-Version: 3.2 >Number: 2834 >Category: os-solaris >Synopsis: Seeing lots of[Wed Aug 12 02:41:56 1998] access to /index_layout.html failed for 172.16.20.2, reason: stat: Stale NFS file handle (errno = 151) in error log >Confidential: no >Severity: serious >Priority: medium >Responsible: apache >State: closed >Class: change-request >Submitter-Id: apache >Arrival-Date: Wed Aug 12 02:00:01 PDT 1998 >Last-Modified: Sun Aug 23 21:19:18 PDT 1998 >Originator: jay@cimedia.com >Organization: >Release: 1.3.1 >Environment: SunOS web22 5.5.1 Generic_103640-19 sun4u sparc SUNW,Ultra-2 >Description: We have a large number of web-servers that all share a common NFS file system which contains all of our html documents. Recently, we began using staging servers to develop our content, and then using a package such as rsync to synchronize the content from our staging servers to our web servers. When we started doing this, we all of a sudden started getting lots of the Stale NFS file handle error messages in our logs. We have tracked the problem down to Solaris's rnode cache. The rnode cache is used to maintain a cache between filenames and NFS file handles on the client side. Although I cannot confirm this, I believe that the expiration for the cache is LRU. Emperical testing seems to indicate that items in the cache do not have an explicit expiration time. Rather, items are expired from the rnode cache after receiving a Stale NFS file handle error from the server. >How-To-Repeat: Create an index.html document on a NFS file system that includes another document (say test.html). The NFS client in this case should be Solaris 2.5.1, although other OS's may experience this same issue. The from either another NFS client or from the NFS server, repeatedly do the following while also repeatedly loading the web page in a browser: mv test.html test.html~ && cp test.html~ test.html && rm test.html~ Eventually, you should get a '[an error occured while processing this directive]' where the test.html file should have been included, and a 'Stale NFS file handle' error message in the error log. Note that it is probably not necesssary to involve a SSI document in this test, but that is where we see the problem occur most often. >Fix: The following patch has fixed the problem for us: *** http_request.c.orig Wed Aug 12 03:28:38 1998 --- http_request.c Wed Aug 12 03:28:42 1998 *************** *** 211,216 **** --- 211,218 ---- errno = 0; rv = stat(path, &r->finfo); + if (rv < 0 && errno == ESTALE) /* workaround for Stale NFS Filehandle problem */ + rv = stat(path, &r->finfo); /* with Solaris's rnode cache */ if (cp != end) *cp = '/'; It seems that the first stat call which fails also expires the file from the rnode cache. The second stat then succeeds. I am not familiar enough with the NFS implementations on other OS's to know if this patch is relevant to more that Solaris 2.5.1. However, it certainly doesn't seem like it could hurt anything. Also, while there are numerous other calls to stat() in the apache code, this seems to be the only one that is generating any errors to our error log. Admitedly, this instance of the stat() gets called much more frequently than elsewhere in the code. However, perhaps it would be prudent to define stat() as either a macro, or as ap_os_stat() so that the handling of ESTALE may be applied throughout the code. >Audit-Trail: From: Dean Gaudet To: Jay Soffian Cc: apbugs@hyperreal.org Subject: Re: os-solaris/2834: Seeing lots of[Wed Aug 12 02:41:56 1998] access to /index_layout.html failed for 172.16.20.2, reason: stat: Stale NFS file handle (errno = 151) in error log Date: Wed, 19 Aug 1998 11:16:50 -0700 (PDT) On 12 Aug 1998, Jay Soffian wrote: > errno = 0; > rv = stat(path, &r->finfo); > + if (rv < 0 && errno == ESTALE) /* workaround for Stale NFS Filehandle problem */ > + rv = stat(path, &r->finfo); /* with Solaris's rnode cache */ But what do you do about this problem when it occurs in a 3rd party library that you don't have source for? Such as libc? Sorry this is a kernel bug, the hack above fixes it for only one single case. I don't see the point in fixing it for the one case when there are potentially dozens of others that could be broken. You should complain to your solaris rep. I'd be more sympathetic if EINTR were one of the posix errors for stat() -- at least then there would be precedence for having to retry a stat. But it isn't... (and I don't see ESTALE documented on the solaris man page either). bleh. I'll leave the PR open though, maybe someone else feels differently. Dean From: Jay Soffian To: apbugs@hyperreal.org, Dean Gaudet Cc: eng-disc@cimedia.com, ron@cimedia.com Subject: Re: os-solaris/2834: Seeing lots of[Wed Aug 12 02:41:56 1998] access to /index_layout.html failed for 172.16.20.2, reason: stat: Stale NFS file handle (errno = 151) in error log Date: Wed, 19 Aug 1998 14:24:43 -0400 +--Dean Gaudet once said: | |On 12 Aug 1998, Jay Soffian wrote: | |> errno = 0; |> rv = stat(path, &r->finfo); |> + if (rv < 0 && errno == ESTALE) /* workaround for Stale NFS Filehand |le problem */ |> + rv = stat(path, &r->finfo); /* with Solaris's rnode cache */ | |But what do you do about this problem when it occurs in a 3rd party |library that you don't have source for? Such as libc? Sorry this is a |kernel bug, the hack above fixes it for only one single case. I don't see |the point in fixing it for the one case when there are potentially dozens |of others that could be broken. You should complain to your solaris rep. | |I'd be more sympathetic if EINTR were one of the posix errors for stat() |-- at least then there would be precedence for having to retry a stat. But |it isn't... (and I don't see ESTALE documented on the solaris man page |either). Actually, I have to agree with you. After I submitted the PR, I thought about it some more and realized I need to submit a PR to Sun. It really is a problem they created. That being said, until Sun fixes the problem, there isn't anything I can do but work around it in Apache. I can't imagine there aren't other people having this same problem; apache/solaris/NFS is a pretty common combination. But I could find no documentation of anyone else seeing this problem. In any case, the problem is worthy of being documenting in the Apache bugs; having achieved that, I'm happy. |bleh. Yeah, you and me both. You can't imagine how long this took to track down. Sun should be shot for passing this problem up to the application level. |I'll leave the PR open though, maybe someone else feels differently. Thanks. j -- Jay Soffian UNIX Systems Administrator 404.572.1941 Cox Interactive Media From: Dean Gaudet To: Jay Soffian Cc: apbugs@apache.org Subject: Re: os-solaris/2834: Seeing lots of[Wed Aug 12 02:41:56 1998] access to /index_layout.html failed for 172.16.20.2, reason: stat: Stale NFS file handle (errno = 151) in error log Date: Wed, 19 Aug 1998 11:36:18 -0700 (PDT) On 19 Aug 1998, Jay Soffian wrote: > Yeah, you and me both. You can't imagine how long this took to track > down. Sun should be shot for passing this problem up to the > application level. Actually sun should just be shot for NFS. ;) Dean State-Changed-From-To: open-closed State-Changed-By: marc State-Changed-When: Sun Aug 23 21:19:18 PDT 1998 State-Changed-Why: I agree with Dean that this is a lame workaround to have to add, so will close the report. Since we haven't had previous reports that I can recall, it may be specific to particular patchsets or other details of the setup. If others have problems, they can find the PR. If lots of others start having this problem we can reconsider. Thanks. >Unformatted: [In order for any reply to be added to the PR database, ] [you need to include in the Cc line ] [and leave the subject line UNCHANGED. This is not done] [automatically because of the potential for mail loops. ] [If you do not include this Cc, your reply may be ig- ] [nored unless you are responding to an explicit request ] [from a developer. ] [Reply only with text; DO NOT SEND ATTACHMENTS! ]