# cat /proc/14459/limits
Limit                     Soft Limit           Hard Limit           Units
Max open files            4096                 4096                 files
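Assuming the default fd watermark percentages (99% hard limit, 90% high water, 50% low water), which is a guess about this config rather than something confirmed in the thread, the 4096 system limit lines up with the lru_state values quoted further down:

    4096 * 0.99 = 4055   (fds_hard_limit)
    4096 * 0.90 = 3686   (fds_hiwat, rounded down)
    4096 * 0.50 = 2048   (fds_lowat)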
I suspect a leak in lru_run_lane(), but I might be wrong here:
static inline size_t lru_run_lane(size_t lane, uint64_t *const totalclosed)
{
    ...
    /* check refcnt in range */
    if (unlikely(refcnt > 2)) {
        /* This unref is ok to be done without a valid op_ctx
         * because we always map a new entry to an export before
         * we could possibly release references in
         * mdcache_new_entry.
         */
        QUNLOCK(qlane);
        mdcache_lru_unref(entry);  >>>>>> we don't have a fsal_close
                                          for this mdcache_lru_unref
        goto next_lru;
    }
    ...
    /* Make sure any FSAL global file descriptor is closed. */
    status = fsal_close(&entry->obj_handle);

    if (not_support_ex) {
        /* Release the content lock. */
        PTHREAD_RWLOCK_unlock(&entry->content_lock);
    }

    if (FSAL_IS_ERROR(status)) {
        LogCrit(COMPONENT_CACHE_INODE_LRU,
                "Error closing file in LRU thread.");
    } else {
        ++(*totalclosed);
        ++closed;
    }

    mdcache_lru_unref(entry);
}
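To make the concern concrete, here is a rough sketch of what closing the fd on that early-exit path might look like, reusing only the functions already visible in the snippet above. This is just an illustration of the idea, not a tested or proposed patch; in particular I have not checked whether fsal_close is safe to call at this point with the qlane lock still held.

    /* check refcnt in range */
    if (unlikely(refcnt > 2)) {
        /* HYPOTHETICAL sketch: attempt to close any FSAL global fd
         * before dropping the reference, so this path cannot leave
         * the fd accounting inflated. */
        status = fsal_close(&entry->obj_handle);
        if (FSAL_IS_ERROR(status))
            LogCrit(COMPONENT_CACHE_INODE_LRU,
                    "Error closing file in LRU thread.");
        else
            ++(*totalclosed);

        QUNLOCK(qlane);
        mdcache_lru_unref(entry);
        goto next_lru;
    }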
On Mon, Jun 18, 2018 at 8:56 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
We do that. If open_fd_count > fds_hard_limit, we return EDELAY in
mdcache_open2() and fsal_reopen_obj().

Daniel
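For context, a minimal sketch of the kind of guard described above; the exact placement and helper names in 2.5.4 may differ, so treat this as a paraphrase rather than a verbatim excerpt:

    /* If the tracked count of open global fds has passed the hard
     * limit, back off and ask the client to retry later. */
    if (atomic_fetch_size_t(&open_fd_count) > lru_state.fds_hard_limit)
        return fsalstat(ERR_FSAL_DELAY, 0);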
On 06/18/2018 11:20 AM, Malahal Naineni wrote:
> The actual number of fds open is 554, at least that is what the kernel
> thinks. If you have open_fd_count as 4055, something is wrong in the
> accounting of open files. What is the max files your Ganesha daemon can
> open ("cat /proc/<PID>/limits" should tell you)? As far as I remember,
> the accounting value "open_fd_count" is only used to close files
> aggressively. Can you track the code path where Ganesha is sending the
> DELAY error?
>
> On Mon, Jun 18, 2018 at 8:03 PM bharat singh <bharat064015(a)gmail.com> wrote:
>
> This is a V3 mount only.
> There are a bunch of socket and anonymous fds open, but that is only 554
> in total. In its current state my setup shows 4055 fds open and it won't
> make any progress for days, even without any new I/O coming in. I have a
> coredump; please let me know what info you need out of it to debug this.
>
> # ls -l /proc/2576/fd | wc -l
> 554
>
> On Mon, Jun 18, 2018 at 7:12 AM Malahal Naineni <malahal(a)gmail.com> wrote:
>
> Try to find the open files by doing "ls -l /proc/<PID>/fd".
> Are you using NFSv4 or V3? If this is all V3, then it is clearly a
> bug. NFSv4 may imply some clients opened files but never closed them
> for some reason, or we ignored a client's CLOSE request.
>
> On Mon, Jun 18, 2018 at 7:30 PM bharat singh <bharat064015(a)gmail.com> wrote:
>
> I already have this patch:
> c2b448b1a079ed66446060a695e4dd06d1c3d1c2 Fix closing global file descriptors
>
>
>
> On Mon, Jun 18, 2018 at 5:41 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
>
> Try this one:
>
> 5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf Fix closing global file descriptors
>
> Daniel
>
>
> On 06/15/2018 03:08 PM, bharat singh wrote:
> > I have been testing Ganesha 2.5.4 code with default mdcache settings.
> > It starts showing issues after prolonged I/O runs. Once it exhausts
> > all the allowed fds, it kind of gets stuck, returning ERR_FSAL_DELAY
> > for every client op.
> >
> > A snapshot of the mdcache
> >
> > open_fd_count = 4055
> > lru_state = {
> > entries_hiwat = 100000,
> > entries_used = 323,
> > chunks_hiwat = 100000,
> > chunks_used = 9,
> > fds_system_imposed = 4096,
> > fds_hard_limit = 4055,
> > fds_hiwat = 3686,
> > fds_lowat = 2048,
> > futility = 109,
> > per_lane_work = 50,
> > biggest_window = 1638,
> > prev_fd_count = 4055,
> > prev_time = 1529013538,
> > fd_state = 3
> > }
> >
> > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=9
> > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=90
> >
> > I have killed the NFS clients, so no new I/O is being received. But
> > even after a couple of hours I don't see lru_run making any progress,
> > so open_fd_count remains at 4055 and even a single file open won't be
> > served. Basically the server is stuck.
> >
> > I have these changes patched over 2.5.4 code
> > e2156ad3feac841487ba89969769bf765457ea6e Replace cache_fds parameter
> >   and handling with better logic
> > 667083fe395ddbb4aa14b7bbe7e15ffca87e3b0b MDCACHE - Change and lower
> >   futility message
> > 37732e61985d919e6ca84dfa7b4a84163080abae Move open_fd_count from
> >   MDCACHE to FSALs (https://review.gerrithub.io/#/c/391267/)
> >
> > Any suggestions on how to resolve this?
> >
> --
> -Bharat
>
_______________________________________________
Devel mailing list -- devel(a)lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org