# cat /proc/14459/limits
Limit                     Soft Limit           Hard Limit           Units
Max open files            4096                 4096                 files
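Assuming the default fd watermark percentages (99% hard limit, 90% high water, 50% low water), which is a guess about this config rather than something confirmed in the thread, the 4096 system limit lines up with the lru_state values quoted further down:

    4096 * 0.99 = 4055   (fds_hard_limit)
    4096 * 0.90 = 3686   (fds_hiwat, rounded down)
    4096 * 0.50 = 2048   (fds_lowat)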
I suspect a leak in lru_run_lane(), but I might be wrong here:
static inline size_t lru_run_lane(size_t lane, uint64_t *const totalclosed)
{
    ...
    /* check refcnt in range */
    if (unlikely(refcnt > 2)) {
        /* This unref is ok to be done without a valid op_ctx
         * because we always map a new entry to an export before
         * we could possibly release references in
         * mdcache_new_entry.
         */
        QUNLOCK(qlane);
        mdcache_lru_unref(entry);  >>>>>> we don't have a fsal_close
                                          for this mdcache_lru_unref
        goto next_lru;
    }
    ...
    /* Make sure any FSAL global file descriptor is closed. */
    status = fsal_close(&entry->obj_handle);

    if (not_support_ex) {
        /* Release the content lock. */
        PTHREAD_RWLOCK_unlock(&entry->content_lock);
    }

    if (FSAL_IS_ERROR(status)) {
        LogCrit(COMPONENT_CACHE_INODE_LRU,
                "Error closing file in LRU thread.");
    } else {
        ++(*totalclosed);
        ++closed;
    }

    mdcache_lru_unref(entry);
}
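To make the concern concrete, here is a rough sketch of what closing the fd on that early-exit path might look like, reusing only the functions already visible in the snippet above. This is just an illustration of the idea, not a tested or proposed patch; in particular I have not checked whether fsal_close is safe to call at this point with the qlane lock still held.

    /* check refcnt in range */
    if (unlikely(refcnt > 2)) {
        /* HYPOTHETICAL sketch: attempt to close any FSAL global fd
         * before dropping the reference, so this path cannot leave
         * the fd accounting inflated. */
        status = fsal_close(&entry->obj_handle);
        if (FSAL_IS_ERROR(status))
            LogCrit(COMPONENT_CACHE_INODE_LRU,
                    "Error closing file in LRU thread.");
        else
            ++(*totalclosed);

        QUNLOCK(qlane);
        mdcache_lru_unref(entry);
        goto next_lru;
    }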
On Mon, Jun 18, 2018 at 8:56 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
We do that. If open_fd_count > fds_hard_limit, we return EDELAY in
mdcache_open2() and fsal_reopen_obj().

Daniel
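For context, a minimal sketch of the kind of guard described above; the exact placement and helper names in 2.5.4 may differ, so treat this as a paraphrase rather than a verbatim excerpt:

    /* If the tracked count of open global fds has passed the hard
     * limit, back off and ask the client to retry later. */
    if (atomic_fetch_size_t(&open_fd_count) > lru_state.fds_hard_limit)
        return fsalstat(ERR_FSAL_DELAY, 0);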
On 06/18/2018 11:20 AM, Malahal Naineni wrote:
> The actual number of fds open is 554, at least that is what the kernel
> thinks. If you have open_fd_count as 4055, something is wrong in the
> accounting of open files. What is the max files your Ganesha daemon can
> open ("cat /proc/<PID>/limits" should tell you)? As far as I remember,
> the accounting value "open_fd_count" is only used to close files
> aggressively. Can you track the code path where Ganesha is sending the
> DELAY error?
>
> On Mon, Jun 18, 2018 at 8:03 PM bharat singh <bharat064015(a)gmail.com> wrote:
>
> This is a V3 mount only.
> There are a bunch of socket and anonymous fds open, but that is only 554
> in total. In its current state my setup shows 4055 fds open and it won't
> make any progress for days, even without any new I/O coming in. I have a
> coredump; please let me know what info you need out of it to debug this.
>
> # ls -l /proc/2576/fd | wc -l
> 554
>
> On Mon, Jun 18, 2018 at 7:12 AM Malahal Naineni <malahal(a)gmail.com> wrote:
>
> Try to find the open files by doing "ls -l /proc/<PID>/fd".
> Are you using NFSv4 or V3? If this is all V3, then it is clearly a
> bug. NFSv4 may imply some clients opened files but never closed them
> for some reason, or we ignored a client's CLOSE request.
>
> On Mon, Jun 18, 2018 at 7:30 PM bharat singh <bharat064015(a)gmail.com> wrote:
>
> I already have this patch:
> c2b448b1a079ed66446060a695e4dd06d1c3d1c2 Fix closing global file descriptors
>
>
>
> On Mon, Jun 18, 2018 at 5:41 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
>
> Try this one:
>
> 5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf Fix closing global file descriptors
>
> Daniel
>
>
> On 06/15/2018 03:08 PM, bharat singh wrote:
> > I have been testing Ganesha 2.5.4 code with default mdcache settings.
> > It starts showing issues after prolonged I/O runs. Once it exhausts
> > all the allowed fds, it kind of gets stuck, returning ERR_FSAL_DELAY
> > for every client op.
> >
> > A snapshot of the mdcache
> >
> > open_fd_count = 4055
> > lru_state = {
> > entries_hiwat = 100000,
> > entries_used = 323,
> > chunks_hiwat = 100000,
> > chunks_used = 9,
> > fds_system_imposed = 4096,
> > fds_hard_limit = 4055,
> > fds_hiwat = 3686,
> > fds_lowat = 2048,
> > futility = 109,
> > per_lane_work = 50,
> > biggest_window = 1638,
> > prev_fd_count = 4055,
> > prev_time = 1529013538,
> > fd_state = 3
> > }
> >
> > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=9
> > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=90
> >
> > I have killed the NFS clients, so no new I/O is being received. But
> > even after a couple of hours I don't see lru_run making any progress,
> > so open_fd_count remains at 4055 and even a single file open won't be
> > served. Basically the server is stuck.
> >
> > I have these changes patched over 2.5.4 code
> > e2156ad3feac841487ba89969769bf765457ea6e Replace cache_fds parameter
> >   and handling with better logic
> > 667083fe395ddbb4aa14b7bbe7e15ffca87e3b0b MDCACHE - Change and lower
> >   futility message
> > 37732e61985d919e6ca84dfa7b4a84163080abae Move open_fd_count from
> >   MDCACHE to FSALs (https://review.gerrithub.io/#/c/391267/)
> >
> > Any suggestions on how to resolve this?
> >
> --
> -Bharat
>
_______________________________________________
Devel mailing list -- devel(a)lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org