Re: V2.5.4: lru_run won't progress, open_fd_count exhausted
by Daniel Gryniewicz
The problem is that there is a leak in the accounting, so open_fd_count will
never go down, because there aren't actually FDs to close. The "missing" FDs
are just missing decrements of the counter, not real FDs.
You can change the code to not return ERR_FSAL_DELAY; that will disable the
FD accounting entirely, and the result of the open attempt will be propagated
back to the client. That likely means EMFILE or ENFILE, which are converted
to ERR_FSAL_IO, which is likely a fatal error to the client. In addition, the
LRU thread will run at full speed all the time, taking some extra CPU.
If your workload will never have more files open than the server allows, then
this is likely not a problem, but if it may try to open too many files, then
we'll need to fix the accounting so that the case is handled gracefully with
ERR_FSAL_DELAY.
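For reference, the gate being discussed is conceptually something like the
following simplified, self-contained sketch. This is not the actual
nfs-ganesha code; fds_available() is a stand-in for mdcache_lru_fds_available()
and the constant mirrors the lru_state dump later in this thread:

    #include <stdbool.h>
    #include <stddef.h>

    static size_t open_fd_count;                /* ++ on open, -- on close */
    static const size_t fds_hard_limit = 4055;  /* from lru_state in this thread */

    /* Callers on the open path turn a "false" here into ERR_FSAL_DELAY
     * back to the client. */
    static bool fds_available(void)
    {
            /* If an increment of open_fd_count is ever missed on the close
             * side, the counter never drops back below the hard limit and
             * every new open is refused from then on, which is exactly the
             * stuck state described below. */
            return open_fd_count < fds_hard_limit;
    }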
Daniel
On 06/19/2018 10:41 AM, bharat singh wrote:
> Yeah, we have a non-support_ex() FSAL, and were trying to upgrade to
> newer versions of Ganesha.
>
> We use NFS-Ganesha to serve ESXi datastores using NFS v3 currently on V2.2.
>
> These code paths are buggy and soon to be deprecated, and there is
> definitely an accounting leak for open_fd_count here.
>
> In my testing I have seen that open_fd_count keeps growing slowly and
> eventually crosses the 4k hard limit, and then the clients are stuck.
> lru_run() won't lower the open_fd_count even after stopping all I/O,
> which makes the server useless to clients.
>
> Earlier you pointed out that "open_fd_count" is only used to close files
> aggressively. So I could stop returning ERETRY from
> mdcache_lru_fds_available() even when this count crosses the hard limit.
> That would push the LRU threads to reap aggressively, which is what we
> want, and my clients wouldn't hang any more.
>
> What do you recommend?
>
> On Tue, Jun 19, 2018 at 5:26 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
>
> Oh, so this is a non-support_ex() FSAL? That may be an issue; I'm not
> sure that the non-support_ex() code paths are correct, since there are no
> in-tree FSALs that use them, and they've been completely removed from
> later versions of Ganesha.
>
> One thing you can do is add an LTTng tracepoint at every increment and
> decrement of open_fd_count, and then run with just that tracepoint enabled.
> It should be low enough overhead to avoid interfering with normal operation.
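>
> A minimal sketch of what such a tracepoint could look like with plain
> LTTng-UST (illustrative only; Ganesha has its own tracepoint wrappers, and
> the provider/event names here are made up):
>
>     /* fd_count_tp.h -- hypothetical tracepoint provider header */
>     #undef TRACEPOINT_PROVIDER
>     #define TRACEPOINT_PROVIDER fsal_fd
>     #undef TRACEPOINT_INCLUDE
>     #define TRACEPOINT_INCLUDE "./fd_count_tp.h"
>
>     #if !defined(FD_COUNT_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
>     #define FD_COUNT_TP_H
>
>     #include <lttng/tracepoint.h>
>
>     TRACEPOINT_EVENT(
>         fsal_fd, fd_count,
>         TP_ARGS(int, delta, long, count),
>         TP_FIELDS(
>             ctf_integer(int, delta, delta)    /* +1 on open, -1 on close */
>             ctf_integer(long, count, count)   /* open_fd_count after the change */
>         )
>     )
>
>     #endif /* FD_COUNT_TP_H */
>
>     #include <lttng/tracepoint-event.h>
>
>     /* ...and at every increment/decrement site: */
>     tracepoint(fsal_fd, fd_count, +1, new_count);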
>
> Daniel
>
> On 06/18/2018 08:16 PM, bharat singh wrote:
> > I do see these messages frequently, but still their occurrences won't
> > add up to 4055.
> > [work-77] fsal_rdwr :FSAL :EVENT :fsal_rdwr_plus: close = File/directory not opened
> > [work-167] fsal_rdwr :FSAL :EVENT :fsal_rdwr_plus: close = File/directory not opened
> >
> > On Mon, Jun 18, 2018 at 10:02 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
> >
> > No, that one's okay. That check means the entry is in use (something
> > has an active refcount on it), so it's just skipped, and the next entry
> > is considered.
> >
> > Eventually, when that entry is no longer in use, it will be closed.
> >
> > Daniel
> >
> >
> > On 06/18/2018 12:13 PM, bharat singh wrote:
> > > # cat /proc/14459/limits
> > > Limit             Soft Limit   Hard Limit   Units
> > > Max open files    4096         4096         files
> > >
> > > I suspect a leak in lru_run_lane, but I might be wrong here
> > >
> > > static inline size_t lru_run_lane(size_t lane, uint64_t *const totalclosed)
> > > {
> > > ...
> > >         /* check refcnt in range */
> > >         if (unlikely(refcnt > 2)) {
> > >                 /* This unref is ok to be done without a valid op_ctx
> > >                  * because we always map a new entry to an export before
> > >                  * we could possibly release references in
> > >                  * mdcache_new_entry.
> > >                  */
> > >                 QUNLOCK(qlane);
> > >                 mdcache_lru_unref(entry);  >>>>>> we don't have a fsal_close for this mdcache_lru_unref
> > >                 goto next_lru;
> > >         }
> > > ...
> > >         /* Make sure any FSAL global file descriptor is closed. */
> > >         status = fsal_close(&entry->obj_handle);
> > >
> > >         if (not_support_ex) {
> > >                 /* Release the content lock. */
> > >                 PTHREAD_RWLOCK_unlock(&entry->content_lock);
> > >         }
> > >
> > >         if (FSAL_IS_ERROR(status)) {
> > >                 LogCrit(COMPONENT_CACHE_INODE_LRU,
> > >                         "Error closing file in LRU thread.");
> > >         } else {
> > >                 ++(*totalclosed);
> > >                 ++closed;
> > >         }
> > >
> > >         mdcache_lru_unref(entry);
> > > }
> > >
> > > On Mon, Jun 18, 2018 at 8:56 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
> > >
> > > We do that. If open_fd_count > fds_hard_limit, we return EDELAY in
> > > mdcache_open2() and fsal_reopen_obj().
> > >
> > > Daniel
> > >
> > > On 06/18/2018 11:20 AM, Malahal Naineni wrote:
> > > > The actual number of fds open is 554, at least that is what the
> > > > kernel thinks. If you have open_fd_count as 4055, something is wrong
> > > > in the accounting of open files. What is the max number of files your
> > > > Ganesha daemon can open ("cat /proc/<PID>/limits" should tell you)?
> > > > As far as I remember, the accounting value "open_fd_count" is only
> > > > used to close files aggressively. Can you track the code path where
> > > > ganesha is sending the DELAY error?
> > > >
> > > > On Mon, Jun 18, 2018 at 8:03 PM bharat singh <bharat064015(a)gmail.com> wrote:
> > > >
> > > > This is a V3 mount only.
> > > > There are a bunch of socket and anonymous fds opened, but that is
> > > > only 554. In the current state my setup has 4055 fds open (per
> > > > open_fd_count) and it won't make any progress for days, even without
> > > > any new I/O coming in. I have a coredump; please let me know what
> > > > info you need out of it to debug this.
> > > >
> > > > # ls -l /proc/2576/fd | wc -l
> > > > 554
> > > >
> > > > On Mon, Jun 18, 2018 at 7:12 AM Malahal Naineni <malahal(a)gmail.com> wrote:
> > > >
> > > > Try to find the open files by doing "ls -l /proc/<PID>/fd".
> > > > Are you using NFSv4 or V3? If this is all V3, then it is clearly a
> > > > bug. NFSv4 may imply some clients opened the files but never closed
> > > > them for some reason, or we ignored a client's CLOSE request.
> > > >
> > > > On Mon, Jun 18, 2018 at 7:30 PM bharat singh <bharat064015(a)gmail.com> wrote:
> > > >
> > > > I already have this patch
> > > >
> > > > c2b448b1a079ed66446060a695e4dd06d1c3d1c2 Fix closing global file descriptors
> > > >
> > > >
> > > >
> > > > On Mon, Jun 18, 2018 at 5:41 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
> > > >
> > > > Try this one:
> > > >
> > > > 5c2efa8f077fafa82023f5aec5e2c474c5ed2fdf Fix closing global file descriptors
> > > >
> > > > Daniel
> > > >
> > > >
> > > > On 06/15/2018 03:08 PM, bharat singh wrote:
> > > > > I have been testing Ganesha 2.5.4 code with default mdcache
> > > > > settings. It starts showing issues after prolonged I/O runs.
> > > > > Once it exhausts all the allowed fds, it kind of gets stuck
> > > > > returning ERR_FSAL_DELAY for every client op.
> > > > >
> > > > > A snapshot of the mdcache:
> > > > >
> > > > > open_fd_count = 4055
> > > > > lru_state = {
> > > > >   entries_hiwat = 100000,
> > > > >   entries_used = 323,
> > > > >   chunks_hiwat = 100000,
> > > > >   chunks_used = 9,
> > > > >   fds_system_imposed = 4096,
> > > > >   fds_hard_limit = 4055,
> > > > >   fds_hiwat = 3686,
> > > > >   fds_lowat = 2048,
> > > > >   futility = 109,
> > > > >   per_lane_work = 50,
> > > > >   biggest_window = 1638,
> > > > >   prev_fd_count = 4055,
> > > > >   prev_time = 1529013538,
> > > > >   fd_state = 3
> > > > > }
> > > > >
> > > > > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=9
> > > > > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > > > > [cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
> > > > > [cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=90
> > > > >
> > > > > I have killed the NFS clients, so no new I/O is being received. But
> > > > > even after a couple of hours I don't see lru_run making any
> > > > > progress, so open_fd_count remains at 4055 and even a single file
> > > > > open won't be served. So basically the server is in a stuck state.
> > > > >
> > > > > I have these changes patched over the 2.5.4 code:
> > > > >
> > > > > e2156ad3feac841487ba89969769bf765457ea6e Replace cache_fds parameter and handling with better logic
> > > > > 667083fe395ddbb4aa14b7bbe7e15ffca87e3b0b MDCACHE - Change and lower futility message
> > > > > 37732e61985d919e6ca84dfa7b4a84163080abae Move open_fd_count from MDCACHE to FSALs (https://review.gerrithub.io/#/c/391267/)
> > > > >
> > > > > Any suggestions on how to resolve this?
V2.5.4: lru_run won't progress, open_fd_count exhausted
by bharat singh
I have been testing Ganesha 2.5.4 code with default mdcache settings. It
starts showing issues after prolonged I/O runs.
Once it exhausts all the allowed fds, it kind of gets stuck
returning ERR_FSAL_DELAY for every client op.
A snapshot of the mdcache:
open_fd_count = 4055
lru_state = {
  entries_hiwat = 100000,
  entries_used = 323,
  chunks_hiwat = 100000,
  chunks_used = 9,
  fds_system_imposed = 4096,
  fds_hard_limit = 4055,
  fds_hiwat = 3686,
  fds_lowat = 2048,
  futility = 109,
  per_lane_work = 50,
  biggest_window = 1638,
  prev_fd_count = 4055,
  prev_time = 1529013538,
  fd_state = 3
}
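(For what it's worth, those fd limits are consistent with the stock percentage
defaults applied to fds_system_imposed = 4096: roughly 99% -> fds_hard_limit
4055, 90% -> fds_hiwat 3686, 50% -> fds_lowat 2048, and 40% -> biggest_window
1638; this assumes FD_Limit_Percent, FD_HWMark_Percent, FD_LWMark_Percent and
Biggest_Window were left at their defaults.)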
[cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=9
[cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
[cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
[cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055 entries used count:327 fdrate:0 threadwait=90
I have killed the NFS clients, so no new I/O is being received. But even
after a couple of hours I don't see lru_run making any progress, so
open_fd_count remains at 4055 and even a single file open won't be served.
So basically the server is in a stuck state.
I have these changes patched over the 2.5.4 code:
e2156ad3feac841487ba89969769bf765457ea6e Replace cache_fds parameter and handling with better logic
667083fe395ddbb4aa14b7bbe7e15ffca87e3b0b MDCACHE - Change and lower futility message
37732e61985d919e6ca84dfa7b4a84163080abae Move open_fd_count from MDCACHE to FSALs (https://review.gerrithub.io/#/c/391267/)
Any suggestions on how to resolve this?
Change in ffilz/nfs-ganesha[next]: GTest - fix name too short error
by GerritHub
From Daniel Gryniewicz <dang(a)redhat.com>:
Daniel Gryniewicz has uploaded this change for review. ( https://review.gerrithub.io/415834 )
Change subject: GTest - fix name too short error
......................................................................
GTest - fix name too short error
New tests need a name buffer larger than 16, so raise it to 24. This
exposes that NAMELEN was defined both in a header and in every test.
Change-Id: I7ee9f1018b62992d37e56347144aeb124e446a14
Signed-off-by: Daniel Gryniewicz <dang(a)redhat.com>
---
M src/gtest/fsal_api/test_close2_latency.cc
M src/gtest/fsal_api/test_close_latency.cc
M src/gtest/fsal_api/test_getattrs_latency.cc
M src/gtest/fsal_api/test_handle_to_key_latency.cc
M src/gtest/fsal_api/test_handle_to_wire_latency.cc
M src/gtest/fsal_api/test_link_latency.cc
M src/gtest/fsal_api/test_lock_op2_latency.cc
M src/gtest/fsal_api/test_lookup_latency.cc
M src/gtest/fsal_api/test_mkdir_latency.cc
M src/gtest/fsal_api/test_mknode_latency.cc
M src/gtest/fsal_api/test_open2_latency.cc
M src/gtest/fsal_api/test_readdir_latency.cc
M src/gtest/fsal_api/test_readlink_latency.cc
M src/gtest/fsal_api/test_release_latency.cc
M src/gtest/fsal_api/test_rename_latency.cc
M src/gtest/fsal_api/test_setattr2_latency.cc
M src/gtest/fsal_api/test_symlink_latency.cc
M src/gtest/fsal_api/test_unlink_latency.cc
M src/gtest/gtest.hh
M src/gtest/nfs4/test_nfs4_link_latency.cc
M src/gtest/nfs4/test_nfs4_lookup_latency.cc
M src/gtest/nfs4/test_nfs4_putfh_latency.cc
M src/gtest/nfs4/test_nfs4_rename_latency.cc
23 files changed, 1 insertion(+), 23 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/34/415834/1
--
To view, visit https://review.gerrithub.io/415834
To unsubscribe, or for help writing mail filters, visit https://review.gerrithub.io/settings
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: I7ee9f1018b62992d37e56347144aeb124e446a14
Gerrit-Change-Number: 415834
Gerrit-PatchSet: 1
Gerrit-Owner: Daniel Gryniewicz <dang(a)redhat.com>
Change in ffilz/nfs-ganesha[next]: SAL - Fix coverity error: limit on cleanup was broken
by GerritHub
From Daniel Gryniewicz <dang(a)redhat.com>:
Daniel Gryniewicz has uploaded this change for review. ( https://review.gerrithub.io/415833 )
Change subject: SAL - Fix coverity error: limit on cleanup was broken
......................................................................
SAL - Fix coverity error: limit on cleanup was broken
Change-Id: I07efe85d6a0cef1ae0c525444f42c06e458dea93
Signed-off-by: Daniel Gryniewicz <dang(a)redhat.com>
---
M src/SAL/nfs4_state.c
1 file changed, 1 insertion(+), 0 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/33/415833/1
--
To view, visit https://review.gerrithub.io/415833
To unsubscribe, or for help writing mail filters, visit https://review.gerrithub.io/settings
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: I07efe85d6a0cef1ae0c525444f42c06e458dea93
Gerrit-Change-Number: 415833
Gerrit-PatchSet: 1
Gerrit-Owner: Daniel Gryniewicz <dang(a)redhat.com>
Announce Push of V2.7-dev.17
by Frank Filz
Branch next
Tag: V2.7-dev.17
Release Highlights
* new gtests
* gtest improvements
* doc: cleanups and clarifications in rados_cluster design manpage
* rados_cluster: start grace period after creating new recovery DB
* src/FSAL/default_methods.c: return of handle_get_ref and handle_put_ref
* packaging: pkg install needs to mkdir /var/log/ganesha/
Signed-off-by: Frank S. Filz <ffilzlnx(a)mindspring.com>
Contents:
6dcf23b Frank S. Filz V2.7-dev.17
0cc95c6 Kaleb S. KEITHLEY packaging: pkg install needs to mkdir /var/log/ganesha/
311e46f grajoria src/FSAL/default_methods.c: return of handle_get_ref and handle_put_ref
883acd7 grajoria gtest/fsal_api/ : update in open2, reopen2 and close2
c710de6 Jeff Layton rados_cluster: start grace period after creating new recovery DB
fadbdba Jeff Layton doc: cleanups and clarifications in rados_cluster design manpage
00e0cc3 Frank S. Filz gtest: nfs4_op_link
2d52e39 Frank S. Filz gtest: nfs4_op_rename also needs to set saved_export
c553d73 Frank S. Filz gtest: gtest_nfs4.hh add set_saved_export method
Question about operations between SAL <-> FSAL for locking
by Tuan Viet Nguyen
Hello,
I'm trying to integrate an FS that supports asynchronous locks (it can run a
callback when a lock is granted) as a Ganesha FSAL. For the moment my FSAL
does not make any upcalls, and locking still works. I wonder how that can
work. Maybe SAL handles it itself: does it keep an internal queue of waiting
locks and retry them by calling into the FSAL again periodically to learn
when a lock can be granted?
Can somebody shed some light on that? If that is the case, I imagine that
making the grant_lock upcall would improve performance? Or are there even
more benefits?
Thank you
Viet
FW: [PATCH] exports: document change to "insecure" export option
by Frank Filz
Hmm, should we make a similar change in Ganesha?
On the one hand it seems reasonable, but it may also not be a factor in our
environments.
Frank
-----Original Message-----
From: linux-nfs-owner(a)vger.kernel.org
[mailto:linux-nfs-owner@vger.kernel.org] On Behalf Of J. Bruce Fields
Sent: Thursday, June 14, 2018 6:33 AM
To: Steve Dickson <steved(a)redhat.com>
Cc: linux-nfs(a)vger.kernel.org
Subject: [PATCH] exports: document change to "insecure" export option
From: "J. Bruce Fields" <bfields(a)redhat.com>
We're changing the kernel to allow gss requests from high ports even when
"secure" is set.
If the change gets backported to distro kernels, the kernel version may be
an imperfect predictor of the behavior, but I think it's the best we can do.
Signed-off-by: J. Bruce Fields <bfields(a)redhat.com>
---
utils/exportfs/exports.man | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/utils/exportfs/exports.man b/utils/exportfs/exports.man
index 4f95f3a2197e..e3a16f6b276a 100644
--- a/utils/exportfs/exports.man
+++ b/utils/exportfs/exports.man
@@ -131,10 +131,12 @@ this way are ro, rw, no_root_squash, root_squash, and all_squash.
 understands the following export options:
 .TP
 .IR secure
-This option requires that requests originate on an Internet port less
-than IPPORT_RESERVED (1024). This option is on by default. To turn it
-off, specify
+This option requires that requests not using gss originate on an
+Internet port less than IPPORT_RESERVED (1024). This option is on by default.
+To turn it off, specify
 .IR insecure .
+(NOTE: older kernels (before upstream kernel version 4.17) enforced
+this requirement on gss requests as well.)
 .TP
 .IR rw
 Allow both read and write requests on this NFS volume. The
--
2.17.1
Change in ffilz/nfs-ganesha[next]: FSAL_VFS : only_one_user mode
by GerritHub
From Patrice LUCAS <patrice.lucas(a)cea.fr>:
Patrice LUCAS has uploaded this change for review. ( https://review.gerrithub.io/415396 )
Change subject: FSAL_VFS : only_one_user mode
......................................................................
FSAL_VFS : only_one_user mode
Add a "only_one_user" module option in all VFS subfsals
(VFS, LUSTRE, XFS and PANFS).
This option allows to prevent ganesha VFS FSALs to call setuid and
setgid when there are running in user mode and dedicated to only
one user. This allows ganesha VFS FSAL to be run in user mode
dedicated to only one user without setting the cap_setuid and
cap_setgid on the ganesha server binary.
This option is dedicated to run the ganesha FSAL only with the
starting UID and GID (user or root). The default of this option is
false. If "only_one_user" is set to true, we disable use of setuid
and setgid to deal with credential. Instead, all incoming requests
with uid or gid different from the one the fsal is running on are
rejected with EPERM error code.
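For illustration, the option would presumably be enabled in ganesha.conf
roughly like this (a guess based on the description above; the exact block
placement and spelling should be checked against the merged change):

    VFS {
        # Reject any request whose uid/gid differs from the one the
        # server is running as, instead of calling setuid()/setgid().
        only_one_user = true;
    }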
Change-Id: Ib4b2bbd7c622b58726b20d54a4933d1fe7238ca8
Signed-off-by: Patrice LUCAS <patrice.lucas(a)cea.fr>
---
M src/FSAL/FSAL_VFS/CMakeLists.txt
M src/FSAL/FSAL_VFS/export.c
M src/FSAL/FSAL_VFS/file.c
M src/FSAL/FSAL_VFS/handle.c
M src/FSAL/FSAL_VFS/panfs/main.c
M src/FSAL/FSAL_VFS/vfs/main-c.in.cmake
M src/FSAL/FSAL_VFS/vfs_methods.h
M src/FSAL/FSAL_VFS/xfs/main.c
M src/FSAL/access_check.c
M src/include/FSAL/access_check.h
10 files changed, 272 insertions(+), 147 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/96/415396/1
--
To view, visit https://review.gerrithub.io/415396
To unsubscribe, or for help writing mail filters, visit https://review.gerrithub.io/settings
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib4b2bbd7c622b58726b20d54a4933d1fe7238ca8
Gerrit-Change-Number: 415396
Gerrit-PatchSet: 1
Gerrit-Owner: Patrice LUCAS <patrice.lucas(a)cea.fr>
[RFC PATCH] rados_cluster: add a "design" manpage
by Jeff Layton
From: Jeff Layton <jlayton(a)redhat.com>
Bruce asked for better design documentation, so this is my attempt at
it. Let me know what you think. I'll probably end up squashing this into
one of the code patches but for now I'm sending this separately to see
if it helps clarify things.
Suggestions and feedback are welcome.
Change-Id: I53cc77f66b2407c2083638e5760666639ba1fd57
Signed-off-by: Jeff Layton <jlayton(a)redhat.com>
---
src/doc/man/ganesha-rados-cluster.rst | 227 ++++++++++++++++++++++++++
1 file changed, 227 insertions(+)
create mode 100644 src/doc/man/ganesha-rados-cluster.rst
diff --git a/src/doc/man/ganesha-rados-cluster.rst b/src/doc/man/ganesha-rados-cluster.rst
new file mode 100644
index 000000000000..1ba2d3c29093
--- /dev/null
+++ b/src/doc/man/ganesha-rados-cluster.rst
@@ -0,0 +1,227 @@
+==============================================================================
+ganesha-rados-cluster-design -- Clustered RADOS Recovery Backend Design
+==============================================================================
+
+.. program:: ganesha-rados-cluster-design
+
+This document aims to explain the theory and design behind the
+rados_cluster recovery backend, which coordinates grace period
+enforcement among multiple, independent NFS servers.
+
+In order to understand the clustered recovery backend, it's first necessary
+to understand how recovery works with a single server:
+
+Singleton Server Recovery
+-------------------------
+NFSv4 is a lease-based protocol. Clients set up a relationship to the
+server and must periodically renew their lease in order to maintain
+their ephemeral state (open files, locks, delegations or layouts).
+
+When a singleton NFS server is restarted, any ephemeral state is lost. When
+the server comes back online, NFS clients detect that the server has
+been restarted and will reclaim the ephemeral state that they held at the
+time of their last contact with the server.
+
+Singleton Grace Period
+----------------------
+
+In order to ensure that we don't end up with conflicts, clients are
+barred from acquiring any new state while in the Recovery phase. Only
+reclaim operations are allowed.
+
+This period of time is called the **grace period**. Most NFS servers
+have a grace period that lasts around two lease periods, however
+nfs-ganesha can and will lift the grace period early if it determines
+that no more clients will be allowed to recover.
+
+Once the grace period ends, the server will move into its Normal
+operation state. During this period, no more recovery is allowed and new
+state can be acquired by NFS clients.
+
+Reboot Epochs
+-------------
+The lifecycle of a singleton NFS server can be considered to be a series
+of transitions from the Recovery period to Normal operation and back. In the
+remainder of this document we'll consider such a period to be an
+**epoch**, and assign each a number beginning with 1.
+
+Visually, we can represent it like this, such that each
+Normal -> Recovery transition is marked by a change in the epoch value:
+
+ +---------------------------------------+
+ | R | N | R | N | R | R | R | N | R | N | <=== Operational State
+ +---------------------------------------+
+ |   1   |   2   |       3       |   4   | <=== Epoch
+ +---------------------------------------+
+
+Note that it is possible to restart during the grace period (as shown
+above during epoch 3). That just serves to extend the recovery period
+and the epoch for longer. A new epoch is only declared during a
+Recovery -> Normal transition.
+
+Client Recovery Database
+------------------------
+There are some potential edge cases that can occur involving network
+partitions and multiple reboots. In order to prevent those, the server
+must maintain a list of clients that hold state on the server at any
+given time. This list must be maintained on stable storage. If a client
+sends a request to reclaim some state, then the server must check to
+make sure it's on that list before allowing the request.
+
+Thus when the server allows reclaim requests it must always gate it
+against the recovery database from the previous epoch. As clients come
+in to reclaim, we establish records for them in a new database
+associated with the current epoch.
+
+The transition from recovery to normal operation should perform an
+atomic switch of recovery databases. A recovery database only becomes
+legitimate on a recovery to normal transition. Until that point, the
+recovery database from the previous epoch is the canonical one.
+
+Exporting a Clustered Filesystem
+--------------------------------
+Let's consider a set of independent NFS servers, all serving out the same
+content from a clustered backend filesystem of any flavor. Each NFS
+server in this case can itself be considered a clustered FS client. This
+means that the NFS server is really just a proxy for state on the
+clustered filesystem (XXX: diagram here?)
+
+The filesystem must make some guarantees to the NFS server. First filesystem
+guarantee:
+
+1. The filesystem ensures that the NFS servers (aka the FS clients)
+ cannot obtain state that conflicts with that of another NFS server.
+
+This is somewhat obvious and is what we expect from any clustered filesystem
+outside of any requirements of NFS. If the clustered filesystem can
+provide this, then we know that conflicting state during normal
+operations cannot be granted.
+
+The recovery period has a different set of rules. If an NFS server
+crashes and is restarted, then we have a window of time when that NFS
+server does not know what state was held by its clients.
+
+If the state held by the crashed NFS server is immediately released
+after the crash, another NFS server could hand out conflicting state
+before the original NFS client has a chance to recover it.
+
+This *must* be prevented. Second filesystem guarantee:
+
+2. The filesystem must not release state held by a server during the
+ previous epoch until all servers in the cluster are enforcing the
+ grace period.
+
+In practical terms, we want the filesystem to provide a way for an NFS
+server to tell it when it's safe to release state held by a previous
+instance of itself. The server should do this once it knows that all of
+its siblings are enforcing the grace period.
+
+Note that we do not require that all servers restart and allow reclaim
+at that point. It's sufficient for them to simply begin grace period
+enforcement as soon as possible once one server needs it.
+
+Clustered Grace Period Database
+-------------------------------
+At this point the cluster siblings are no longer completely independent,
+and the grace period has become a cluster-wide property. This means that
+we must track the current epoch on some sort of shared storage that the
+servers can all access.
+
+Additionally we must also keep track of whether a cluster-wide grace period
+is in effect. Any running nodes should all be informed when either of these
+changes, so they can take appropriate steps when it occurs.
+
+In the rados_cluster backend, we track these using two epoch values:
+
+- **C**: is the current epoch. This represents the current epoch value
+ of the cluster
+
+- **R**: is the recovery epoch. This represents the epoch from which
+ clients are allowed to recover. A non-zero value here means
+ that a cluster-wide grace period is in effect. Setting this to
+ 0 ends that grace period.
+
+In order to decide when to make grace period transitions, we must also
+have each server to advertise its state to the other nodes. Specifically,
+we need to allow servers to determine these two things about each of
+its siblings:
+
+1. Does this server have clients from the previous epoch that will require
+ recovery? (NEED)
+
+2. Is this server allowing clients to acquire new state? (ENFORCING)
+
+We do this with a pair of flags per sibling (NEED and ENFORCING). Each
+server typically manages its own flags.
+
+The rados_cluster backend stores all of this information in a single
+RADOS object that is modified using read/modify/write cycles. Typically
+we'll read the whole object, modify it, and then attempt to write it
+back. If something changes between the read and write, we redo the read
+and try it again.
+
+Clustered Client Recovery Databases
+-----------------------------------
+In rados_cluster the client recovery databases are stored as RADOS
+objects. Each NFS server has its own set of them and they are given
+names that have the current epoch (C) embedded in it. This ensures
+that recovery databases are specific to a particular epoch.
+
+In general, it's safe to delete any recovery database that precedes R
+when R is non-zero, and safe to remove any recovery database except for
+the current one (the one with C in the name) when the grace period is
+not in effect (R==0).
+
+Establishing a New Grace Period
+-------------------------------
+When a server restarts and wants to allow clients to reclaim their
+state, it must establish a new epoch by incrementing the current epoch
+to declare a new grace period (R=C; C=C+1).
+
+The exception to this rule is when the cluster is already in a grace
+period. Servers can just join an in-progress grace period instead of
+establishing a new one if one is already active.
+
+In either case, the server should also set its NEED and ENFORCING flags
+at the same time.
+
+The other surviving cluster siblings should take steps to begin grace
+period enforcement as soon as possible. This entails "draining off" any
+in-progress state morphing operations and then blocking the acquisition
+of any new state (usually with a return of NFS4ERR_GRACE to clients that
+attempt it). Again, there is no need for the survivors from the previous
+epoch to allow recovery here.
+
+The surviving servers must however establish a new client recovery
+database at this point to ensure that their clients can do recovery in
+the event of a crash afterward.
+
+Once all of the siblings are enforcing the grace period, the recovering
+server can then request that the filesystem release the old state, and
+allow clients to begin reclaiming their state. In the rados_cluster
+backend driver, we do this by stalling server startup until all hosts
+in the cluster are enforcing the grace period.
+
+Lifting the Grace Period
+------------------------
+Transitioning from recovery to normal operation really consists of two
+different steps:
+
+1. the server decides that it no longer requires a grace period, either
+ due to it timing out or there not being any clients that would be
+ allowed to reclaim.
+
+2. the server stops enforcing the grace period and transitions to normal
+ operation
+
+These concepts are often conflated in a singleton server, but in
+a cluster we must consider them independently.
+
+When a server is finished with its own local recovery period, it should
+clear its NEED flag. That server should continue enforcing the grace
+period however until the grace period is fully lifted.
+
+If the servers' own NEED flag is the last one set, then it can lift the
+grace period (by setting R=0). At that point, all servers in the cluster
+can end grace period enforcement, and communicate that fact to the
+others by clearing their ENFORCING flags.
--
2.17.0
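To make the epoch bookkeeping above concrete, here is a minimal sketch in
plain C of the transitions described in "Establishing a New Grace Period" and
"Lifting the Grace Period". It is illustrative only: the real backend keeps
this state in a shared RADOS object updated with read/modify/write cycles and
also tracks the per-server NEED and ENFORCING flags.

    #include <stdint.h>
    #include <stdbool.h>

    struct grace_db {
            uint64_t cur;   /* C: current epoch */
            uint64_t rec;   /* R: recovery epoch; non-zero means a
                             * cluster-wide grace period is in effect */
    };

    /* A restarting server that wants its clients to reclaim either starts
     * a new grace period or joins one that is already in progress. */
    static void start_or_join_grace(struct grace_db *db)
    {
            if (db->rec == 0) {
                    db->rec = db->cur;      /* allow reclaim from the previous epoch */
                    db->cur = db->cur + 1;  /* declare a new epoch */
            }
            /* else: a grace period is already active; just join it */
    }

    /* Called by the server that clears the last NEED flag: close the
     * reclaim window so every node may stop enforcing the grace period. */
    static void lift_grace(struct grace_db *db)
    {
            db->rec = 0;
    }

    static bool grace_in_effect(const struct grace_db *db)
    {
            return db->rec != 0;
    }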