I do see these messages frequently, but their occurrences still won't add up
to 4055.

[work-77] fsal_rdwr :FSAL :EVENT :fsal_rdwr_plus: close = File/directory not opened
[work-167] fsal_rdwr :FSAL :EVENT :fsal_rdwr_plus: close = File/directory not opened
On Mon, Jun 18, 2018 at 10:02 AM Daniel Gryniewicz <dang@redhat.com> wrote:
No, that one's okay. That check means the entry is in use (something
has an active refcount on it), so it's just skipped, and the next entry
is considered.
Eventually, when that entry is no longer in use, it will be closed.
Daniel
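[Editor's note: a minimal model of the skip logic Daniel describes. The names and the "refcount above 2 means in use" convention are simplified assumptions for illustration, not the exact mdcache_lru code.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of the LRU reap decision: an entry
 * whose refcount is above the LRU's own references is in use, so the
 * scan drops its reference and moves on without closing the file;
 * only idle entries have their global fd closed. */
struct lru_entry {
	int32_t refcnt;  /* active references, including the LRU's own */
	int fd_open;     /* 1 if a global fd is held */
};

/* Returns 1 if the entry's fd was closed, 0 if the entry was skipped. */
static int try_reap(struct lru_entry *e, size_t *totalclosed)
{
	if (e->refcnt > 2) {
		/* Entry is in use elsewhere; skip it. It will be
		 * reconsidered (and closed) once it goes idle. */
		e->refcnt--;  /* stands in for mdcache_lru_unref */
		return 0;
	}
	if (e->fd_open) {  /* stands in for fsal_close */
		e->fd_open = 0;
		++(*totalclosed);
	}
	e->refcnt--;
	return 1;
}
```

Note that the skipped entry keeps its fd open by design; the leak question is whether such entries ever come back down to an idle refcount.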
On 06/18/2018 12:13 PM, bharat singh wrote:
> # cat /proc/14459/limits
> Limit Soft Limit Hard Limit Units
> Max open files 4096 4096 files
>
> I suspect a leak in lru_run_lane, but I might be wrong here
>
> static inline size_t lru_run_lane(size_t lane, uint64_t *const totalclosed)
> {
> ...
> /* check refcnt in range */
> if (unlikely(refcnt > 2)) {
> /* This unref is ok to be done without a valid op_ctx
> * because we always map a new entry to an export before
> * we could possibly release references in
> * mdcache_new_entry.
> */
> QUNLOCK(qlane);
> mdcache_lru_unref(entry); >>>>>> won don'e have
a
> fsal_close for this mdcache_lru_unref
> goto next_lru;
> }
> ...
> /* Make sure any FSAL global file descriptor is closed. */
> status = fsal_close(&entry->obj_handle);
>
> if (not_support_ex) {
> /* Release the content lock. */
> PTHREAD_RWLOCK_unlock(&entry->content_lock);
> }
>
> if (FSAL_IS_ERROR(status)) {
> LogCrit(COMPONENT_CACHE_INODE_LRU,
> "Error closing file in LRU thread.");
> } else {
> ++(*totalclosed);
> ++closed;
> }
>
> mdcache_lru_unref(entry);
> }
>
> On Mon, Jun 18, 2018 at 8:56 AM Daniel Gryniewicz <dang@redhat.com> wrote:
>
> We do that. If open_fd_count > fds_hard_limit, we return EDELAY in
> mdcache_open2() and fsal_reopen_obj().
>
> Daniel
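[Editor's note: the gate Daniel describes can be sketched roughly as below. The error values and the helper name are illustrative assumptions, not the actual mdcache_open2()/fsal_reopen_obj() code.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative error codes; the real fsal_errors_t values differ. */
enum fsal_err { ERR_FSAL_NO_ERROR = 0, ERR_FSAL_DELAY = 1 };

static uint64_t open_fd_count;          /* tracked accounting counter */
static uint64_t fds_hard_limit = 4055;  /* value from the dump below */

/* Hypothetical budget check: once the tracked count reaches the hard
 * limit, refuse the open with DELAY so the client retries later,
 * rather than consuming another fd. If the counter has drifted high,
 * every open is refused forever, which matches the stuck state. */
static enum fsal_err check_open_budget(void)
{
	if (open_fd_count >= fds_hard_limit)
		return ERR_FSAL_DELAY;
	return ERR_FSAL_NO_ERROR;
}
```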
>
> On 06/18/2018 11:20 AM, Malahal Naineni wrote:
> > The actual number of fds open is 554, at least that is what the kernel
> > thinks. If you have open_fd_count as 4055, something is wrong in the
> > accounting of open files. What is the max files your Ganesha daemon can
> > open? (cat /proc/<PID>/limits should tell you.) As far as I remember,
> > the accounting value "open_fd_count" is only used to close files
> > aggressively. Can you track the code path where ganesha is sending the
> > DELAY error?
> >
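[Editor's note: Malahal's point is that 554 real fds vs. open_fd_count == 4055 indicates an accounting leak, not an fd leak. A toy model of that failure mode, with invented helper names:]

```c
#include <assert.h>
#include <stdint.h>

static uint64_t open_fd_count;

/* Every successful open must be paired with exactly one decrement
 * at close for the counter to track the kernel's real fd count. */
static void track_open(void)  { open_fd_count++; }
static void track_close(void) { open_fd_count--; }

/* Hypothetical buggy close path: the fd is returned to the kernel
 * (close(fd) would happen here), but track_close() is never called,
 * so the counter drifts upward until it pins at the hard limit. */
static void buggy_close(void)
{
	/* ... close(fd) ... but no track_close() */
}
```

After enough trips through the buggy path, the counter saturates at fds_hard_limit while the kernel holds far fewer fds, and nothing the LRU thread closes can bring it back down.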
> > On Mon, Jun 18, 2018 at 8:03 PM bharat singh <bharat064015@gmail.com> wrote:
> >
> > This is a V3 mount only.
> > There are a bunch of socket and anonymous fds opened, but that's only
> > 554. In the current state my setup has 4055 fds opened and it won't make
> > any progress for days, even without any new I/O coming in. I have a
> > coredump; please let me know what info you need out of it to debug this.
> >
> > # ls -l /proc/2576/fd | wc -l
> > 554
> >
> > On Mon, Jun 18, 2018 at 7:12 AM Malahal Naineni <malahal@gmail.com> wrote:
> >
> > Try to find the open files by doing "ls -l /proc/<PID>/fd".
> > Are you using NFSv4 or V3? If this is all V3, then it is clearly a
> > bug. NFSv4 may imply some clients opened the files but never
> > closed them for some reason, or we ignored a client's CLOSE request.
> >
> > On Mon, Jun 18, 2018 at 7:30 PM bharat singh <bharat064015@gmail.com> wrote:
> >
> > I already have this patch:
> > c2b448b1a079ed66446060a695e4dd06d1c3d1c2 Fix closing global file descriptors
> >
> >
> >
> > On Mon, Jun 18, 2018 at 5:41 AM Daniel Gryniewicz <dang@redhat.com> wrote:
> >
> > Try this one:
> >
> > 5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf Fix closing global file descriptors
> >
> > Daniel
> >
> >
> > On 06/15/2018 03:08 PM, bharat singh wrote:
> > > I have been testing Ganesha 2.5.4 code with default mdcache settings.
> > > It starts showing issues after prolonged I/O runs.
> > > Once it exhausts all the allowed fds, it kind of gets stuck,
> > > returning ERR_FSAL_DELAY for every client op.
> > >
> > > A snapshot of the mdcache
> > >
> > > open_fd_count = 4055
> > > lru_state = {
> > > entries_hiwat = 100000,
> > > entries_used = 323,
> > > chunks_hiwat = 100000,
> > > chunks_used = 9,
> > > fds_system_imposed = 4096,
> > > fds_hard_limit = 4055,
> > > fds_hiwat = 3686,
> > > fds_lowat = 2048,
> > > futility = 109,
> > > per_lane_work = 50,
> > > biggest_window = 1638,
> > > prev_fd_count = 4055,
> > > prev_time = 1529013538,
> > > fd_state = 3
> > > }
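[Editor's note: the watermarks in the dump above are consistent with fixed percentages of fds_system_imposed (50%, 90%, 99%, and a 40% window). The percentages are inferred from the numbers shown, not quoted from the Ganesha source.]

```c
#include <assert.h>
#include <stdint.h>

/* Derive an fd watermark as a percentage of the system-imposed limit
 * (integer arithmetic, truncating like the values in the dump). */
static uint64_t fd_watermark(uint64_t system_limit, uint64_t percent)
{
	return system_limit * percent / 100;
}
```

With system_limit = 4096 this reproduces fds_lowat = 2048 (50%), fds_hiwat = 3686 (90%), fds_hard_limit = 4055 (99%), and biggest_window = 1638 (40%).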
> > >
> > > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055
> > > entries used count:327 fdrate:0 threadwait=9
> > > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055
> > > entries used count:327 fdrate:0 threadwait=90
> > >
> > > I have killed the NFS clients, so no new I/O is being received. But even
> > > after a couple of hours I don't see lru_run making any progress, so
> > > open_fd_count remains at 4055 and even a single file open won't be
> > > served. So basically the server is in a stuck state.
> > >
> > > I have these changes patched over the 2.5.4 code:
> > > e2156ad3feac841487ba89969769bf765457ea6e Replace cache_fds parameter and
> > > handling with better logic
> > > 667083fe395ddbb4aa14b7bbe7e15ffca87e3b0b MDCACHE - Change and lower
> > > futility message
> > > 37732e61985d919e6ca84dfa7b4a84163080abae Move open_fd_count from MDCACHE
> > > to FSALs (https://review.gerrithub.io/#/c/391267/)
> > >
> > > Any suggestions how to resolve this ?
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > Devel mailing list -- devel@lists.nfs-ganesha.org
> > > To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org
> > >
> >
> >
> >
> > --
> > -Bharat
> >
> >
> >
> >
> >
> > --
> > -Bharat
> >
> >
> >
> >
> >
>
>
>
> --
> -Bharat
>
>