I put this in our IBM repo for now. Happy to take whatever form goes to
upstream later. Appreciate for any review comments:
TOP commit here:
https://github.com/malahal/nfs-ganesha/commits/ibm2.5
On Thu, Oct 25, 2018 at 6:53 PM Daniel Gryniewicz <dang(a)redhat.com> wrote:
You're right, I'd missed that. So the only issues left are
keeping
nsm_count correct, and making sure that statd behaves correctly when we
disconnect and reconnect.
I think I'd prefer the version with a force argument for the error
cases. That way, under normal circumstances, things stay the same.
Daniel
On 10/25/2018 12:00 AM, Malahal Naineni wrote:
> nsm_monitor and nsm_unmonitor use a single lock to serialize everything.
> I am not sure what problem would it cause.
>
> On Wed, Oct 24, 2018 at 7:25 PM Daniel Gryniewicz <dang(a)redhat.com
> <mailto:dang@redhat.com>> wrote:
>
> Ah, okay, I see now. The problem is, I'm not sure how
> force-disconnect interacts with nsm_count. I don't think we can just
> disregard it. In fact, looking at it, we can't force delete
nsm_clnt,
> because another nsm_monitor() call might be using it. Maybe we need
a
> separate client for each nsm_monitor() call? I don't know if that
> works with the lock manager, though.
>
> Frank, any ideas?
>
> Daniel
> On Wed, Oct 24, 2018 at 4:22 AM Malahal Naineni <malahal(a)gmail.com
> <mailto:malahal@gmail.com>> wrote:
> >
> > Dan, nsm_disconnect() is disconnecting every time as opposed to
> only when nsm_count==0. I think this works because it is connecting
> and disconnecting everything, not just on error. We could pass a
> flag to nsm_disconnect(force=true) on error and
> nsm_disconnect(force=false) with normal case. We need this as a way
> to recovery mechanism.
> >
> > Regards, Malahal.
> > PS: I just looked at this nsm code for a customer issue I posted
> on IRC yesterday! Thank you Suhrud.
> >
> > On Tue, Oct 23, 2018 at 7:55 PM Daniel Gryniewicz
> <dang(a)redhat.com <mailto:dang@redhat.com>> wrote:
> >>
> >> Discussion inline.
> >>
> >> On 10/23/2018 07:58 AM, Suhrud Patankar wrote:
> >> > Hello All,
> >> >
> >> > I have two questions related to NLM and rpc.statd.
> >> >
> >> > Setup:
> >> > Ganesha version: V2.5.0.4
> >> > CentOS Linux release 7.5.1804
> >> > FSAL – Any
> >> >
> >> > 1. By default libntirpc has -DPORTMAP build flag enabled. As I
> >> > understand this will use UDP to communicate with rpc.statd.
> With this
> >> > Ganesha fails to monitor the NLM clients and hence lock
> requests fail.
> >> > The call to clnt_tp_ncreate_timed() fails in
> >> > clnt_dg_control(CLSET_SVC_ADDR) as addr->len is 16 where as
> >> > sizeof(cu->cu_raddr) is 128.
> >> > If I disable -DPORTMAP build flag in tirpc, then Ganesha uses
> “tcp” to
> >> > connect to rpc.statd and everything works as expected.
> >> > Are we supposed to use “tcp” and disable the -DPORTMAP flag?
> Why is
> >> > this flag enabled by default?
> >>
> >> This isn't related to UDP vs. TCP. The issue is that one
> address buffer
> >> passed in is IPv4, and the other is IPv6, and an IPv6 address
> cannot fit
> >> into an IPv4 buffer. It's definitely a bug, but I don't see
> exactly how
> >> this is happening. Can you give a more detailed explanation,
with
> >> callpaths to get there?
> >>
> >> >
> >> > 2. Case: rpc.statd service restarts. Ganesha is not restarted.
> >> > nsm_connect() is done only once. On failure to connect to
> rpc.statd,
> >> > we don’t retry. This causes the NLM lock request to fail with
“No
> >> > lock” error.
> >> > Is it expected that Ganesha will restart when rpc.statd
> restarts? I
> >> > added a patch to re-call nsm_connect() on failure and it works
> for me
> >> > now.
> >> >
> >> > Is this correct way to handle this?
> >>
> >> We should probably re-connect when statd fails, but I'm not sure
how
> >> this patch accomplishes that. All it does is retry if the
initial
> >> connect fails, it doesn't seem to retry later on. Can you
> explain more
> >> how this is working?
> >>
> >> Daniel
> >>
> >> > The patch is as below.
> >> > ------------------------------
> >> > From: Suhrud Patankar <suhrudpatankar(a)gmail.com
> <mailto:suhrudpatankar@gmail.com>>
> >> > Date: Thu, 11 Oct 2018 00:35:45 -0700
> >> > Subject: [PATCH] NFSv3: NLM - Reconnect with statd on failure
> >> >
> >> > Ganesha creates the statd connection context only once.
> >> > In case statd service restarts, we fail to monitor new clients.
> >> > On failure, destroy the STATD connection and recreate new one.
> >> > ---
> >> > src/Protocols/NLM/nsm.c | 21 +++++++++++++++++++--
> >> >
> >> > 1 files changed, 19 insertions(+), 2 deletions(-)
> >> >
> >> >
> >> > diff --git a/src/Protocols/NLM/nsm.c b/src/Protocols/NLM/nsm.c
> >> >
> >> > index 09909d9..8bb41d7 100644
> >> > --- a/src/Protocols/NLM/nsm.c
> >> > +++ b/src/Protocols/NLM/nsm.c
> >> > @@ -47,6 +47,8 @@ bool nsm_connect(void)
> >> > if (nsm_clnt != NULL)
> >> > return true;
> >> >
> >> > + LogFullDebug(COMPONENT_NLM, "Creating new nsm
> connection ");
> >> > +
> >> > nsm_clnt = clnt_ncreate("localhost", SM_PROG,
> SM_VERS, "tcp");
> >> >
> >> > if (CLNT_FAILURE(nsm_clnt)) {
> >> > @@ -66,7 +68,7 @@ bool nsm_connect(void)
> >> >
> >> > void nsm_disconnect(void)
> >> > {
> >> > - if (nsm_count == 0 && nsm_clnt != NULL) {
> >> > + if (nsm_clnt != NULL) {
> >> > CLNT_DESTROY(nsm_clnt);
> >> > nsm_clnt = NULL;
> >> > AUTH_DESTROY(nsm_auth);
> >> > @@ -83,6 +85,7 @@ bool nsm_monitor(state_nsm_client_t *host)
> >> > enum clnt_stat ret;
> >> > char client_ip[SOCK_NAME_MAX];
> >> > char server_ip[SOCK_NAME_MAX * 2]; /* space for
> proxy_ip */
> >> > + bool retry = false;
> >> >
> >> > if (host == NULL)
> >> > return true;
> >> > @@ -120,7 +123,7 @@ bool nsm_monitor(state_nsm_client_t *host)
> >> > LogDebug(COMPONENT_NLM, "Monitor %s",
> host->ssc_nlm_caller_name);
> >> >
> >> > PTHREAD_MUTEX_lock(&nsm_mutex);
> >> > -
> >> > +again:
> >> > /* create a connection to nsm on the localhost */
> >> > if (!nsm_connect()) {
> >> > LogCrit(COMPONENT_NLM,
> >> > @@ -144,6 +147,20 @@ bool nsm_monitor(state_nsm_client_t *host)
> >> > }
> >> >
> >> > if (ret != RPC_SUCCESS) {
> >> > + if (! retry) {
> >> > + /* Handle case when statd restarts.
> >> > + * Need to reconnect with statd
> >> > + */
> >> > + LogFullDebug(COMPONENT_NLM,
> >> > + "Try to reconnect with
STATD");
> >> > +
> >> > + retry = true;
> >> > + nsm_mon.mon_id.my_id.my_name = NULL;
> >> > + clnt_req_release(cc);
> >> > + nsm_disconnect();
> >> > + goto again;
> >> > + }
> >> > +
> >> > t = rpc_sperror(&cc->cc_error,
"failed");
> >> > LogCrit(COMPONENT_NLM,
> >> > "Monitor %s SM_MON %s",
> >> > --
> >> > 1.7.1
> >> >
> >> > Thanks & Regards,
> >> > Suhrud
> >> > _______________________________________________
> >> > Devel mailing list -- devel(a)lists.nfs-ganesha.org
> <mailto:devel@lists.nfs-ganesha.org>
> >> > To unsubscribe send an email to
> devel-leave(a)lists.nfs-ganesha.org
> <mailto:devel-leave@lists.nfs-ganesha.org>
> >> >
> >> _______________________________________________
> >> Devel mailing list -- devel(a)lists.nfs-ganesha.org
> <mailto:devel@lists.nfs-ganesha.org>
> >> To unsubscribe send an email to
> devel-leave(a)lists.nfs-ganesha.org
> <mailto:devel-leave@lists.nfs-ganesha.org>
>