I put this in our IBM repo for now. Happy to take whatever form goes to upstream later. Appreciate for any review comments:
TOP commit here: https://github.com/malahal/nfs-ganesha/commits/ibm2.5

On Thu, Oct 25, 2018 at 6:53 PM Daniel Gryniewicz <dang@redhat.com> wrote:
You're right, I'd missed that.  So the only issues left are keeping
nsm_count correct, and making sure that statd behaves correctly when we
disconnect and reconnect.

I think I'd prefer the version with a force argument for the error
cases.  That way, under normal circumstances, things stay the same.

Daniel

On 10/25/2018 12:00 AM, Malahal Naineni wrote:
> nsm_monitor and nsm_unmonitor use a single lock to serialize everything.
> I am not sure what problem would it cause.
>
> On Wed, Oct 24, 2018 at 7:25 PM Daniel Gryniewicz <dang@redhat.com
> <mailto:dang@redhat.com>> wrote:
>
>     Ah, okay, I see now.  The problem is, I'm not sure how
>     force-disconnect interacts with nsm_count.  I don't think we can just
>     disregard it.  In fact, looking at it, we can't force delete nsm_clnt,
>     because another nsm_monitor() call might be using it.  Maybe we need a
>     separate client for each nsm_monitor() call?  I don't know if that
>     works with the lock manager, though.
>
>     Frank, any ideas?
>
>     Daniel
>     On Wed, Oct 24, 2018 at 4:22 AM Malahal Naineni <malahal@gmail.com
>     <mailto:malahal@gmail.com>> wrote:
>      >
>      > Dan, nsm_disconnect() is disconnecting every time as opposed to
>     only when nsm_count==0. I think this works because it is connecting
>     and disconnecting everything, not just on error. We could pass a
>     flag to nsm_disconnect(force=true) on error and
>     nsm_disconnect(force=false) with normal case. We need this as a way
>     to recovery mechanism.
>      >
>      > Regards, Malahal.
>      > PS: I just looked at this nsm code for a customer issue I posted
>     on IRC yesterday! Thank you Suhrud.
>      >
>      > On Tue, Oct 23, 2018 at 7:55 PM Daniel Gryniewicz
>     <dang@redhat.com <mailto:dang@redhat.com>> wrote:
>      >>
>      >> Discussion inline.
>      >>
>      >> On 10/23/2018 07:58 AM, Suhrud Patankar wrote:
>      >> > Hello All,
>      >> >
>      >> > I have two questions related to NLM and rpc.statd.
>      >> >
>      >> > Setup:
>      >> > Ganesha version: V2.5.0.4
>      >> > CentOS Linux release 7.5.1804
>      >> > FSAL – Any
>      >> >
>      >> > 1. By default libntirpc has -DPORTMAP build flag enabled. As I
>      >> > understand this will use UDP to communicate with rpc.statd.
>     With this
>      >> > Ganesha fails to monitor the NLM clients and hence lock
>     requests fail.
>      >> > The call to clnt_tp_ncreate_timed() fails in
>      >> > clnt_dg_control(CLSET_SVC_ADDR) as addr->len is 16 where as
>      >> > sizeof(cu->cu_raddr) is 128.
>      >> > If I disable -DPORTMAP build flag in tirpc, then Ganesha uses
>     “tcp” to
>      >> > connect to rpc.statd and everything works as expected.
>      >> > Are we supposed to use “tcp” and disable the -DPORTMAP flag?
>     Why is
>      >> > this flag enabled by default?
>      >>
>      >> This isn't related to UDP vs. TCP.  The issue is that one
>     address buffer
>      >> passed in is IPv4, and the other is IPv6, and an IPv6 address
>     cannot fit
>      >> into an IPv4 buffer.  It's definitely a bug, but I don't see
>     exactly how
>      >> this is happening.  Can you give a more detailed explanation, with
>      >> callpaths to get there?
>      >>
>      >> >
>      >> > 2. Case: rpc.statd service restarts. Ganesha is not restarted.
>      >> > nsm_connect() is done only once. On failure to connect to
>     rpc.statd,
>      >> > we don’t retry. This causes the NLM lock request to fail with “No
>      >> > lock” error.
>      >> > Is it expected that Ganesha will restart when rpc.statd
>     restarts? I
>      >> > added a patch to re-call nsm_connect() on failure and it works
>     for me
>      >> > now.
>      >> >
>      >> > Is this correct way to handle this?
>      >>
>      >> We should probably re-connect when statd fails, but I'm not sure how
>      >> this patch accomplishes that.  All it does is retry if the initial
>      >> connect fails, it doesn't seem to retry later on.  Can you
>     explain more
>      >> how this is working?
>      >>
>      >> Daniel
>      >>
>      >> > The patch is as below.
>      >> > ------------------------------
>      >> > From: Suhrud Patankar <suhrudpatankar@gmail.com
>     <mailto:suhrudpatankar@gmail.com>>
>      >> > Date: Thu, 11 Oct 2018 00:35:45 -0700
>      >> > Subject: [PATCH] NFSv3: NLM - Reconnect with statd on failure
>      >> >
>      >> > Ganesha creates the statd connection context only once.
>      >> > In case statd service restarts, we fail to monitor new clients.
>      >> > On failure, destroy the STATD connection and recreate new one.
>      >> > ---
>      >> >   src/Protocols/NLM/nsm.c |   21 +++++++++++++++++++--
>      >> >
>      >> >   1 files changed, 19 insertions(+), 2 deletions(-)
>      >> >
>      >> >
>      >> > diff --git a/src/Protocols/NLM/nsm.c b/src/Protocols/NLM/nsm.c
>      >> >
>      >> > index 09909d9..8bb41d7 100644
>      >> > --- a/src/Protocols/NLM/nsm.c
>      >> > +++ b/src/Protocols/NLM/nsm.c
>      >> > @@ -47,6 +47,8 @@ bool nsm_connect(void)
>      >> >          if (nsm_clnt != NULL)
>      >> >                  return true;
>      >> >
>      >> > +       LogFullDebug(COMPONENT_NLM, "Creating new nsm
>     connection ");
>      >> > +
>      >> >          nsm_clnt = clnt_ncreate("localhost", SM_PROG,
>     SM_VERS, "tcp");
>      >> >
>      >> >          if (CLNT_FAILURE(nsm_clnt)) {
>      >> > @@ -66,7 +68,7 @@ bool nsm_connect(void)
>      >> >
>      >> >   void nsm_disconnect(void)
>      >> >   {
>      >> > -       if (nsm_count == 0 && nsm_clnt != NULL) {
>      >> > +       if (nsm_clnt != NULL) {
>      >> >                  CLNT_DESTROY(nsm_clnt);
>      >> >                  nsm_clnt = NULL;
>      >> >                  AUTH_DESTROY(nsm_auth);
>      >> > @@ -83,6 +85,7 @@ bool nsm_monitor(state_nsm_client_t *host)
>      >> >          enum clnt_stat ret;
>      >> >          char client_ip[SOCK_NAME_MAX];
>      >> >          char server_ip[SOCK_NAME_MAX * 2]; /* space for
>     proxy_ip */
>      >> > +       bool retry = false;
>      >> >
>      >> >          if (host == NULL)
>      >> >                  return true;
>      >> > @@ -120,7 +123,7 @@ bool nsm_monitor(state_nsm_client_t *host)
>      >> >          LogDebug(COMPONENT_NLM, "Monitor %s",
>     host->ssc_nlm_caller_name);
>      >> >
>      >> >          PTHREAD_MUTEX_lock(&nsm_mutex);
>      >> > -
>      >> > +again:
>      >> >          /* create a connection to nsm on the localhost */
>      >> >          if (!nsm_connect()) {
>      >> >                  LogCrit(COMPONENT_NLM,
>      >> > @@ -144,6 +147,20 @@ bool nsm_monitor(state_nsm_client_t *host)
>      >> >          }
>      >> >
>      >> >          if (ret != RPC_SUCCESS) {
>      >> > +               if (! retry) {
>      >> > +                       /* Handle case when statd restarts.
>      >> > +                        * Need to reconnect with statd
>      >> > +                        */
>      >> > +                       LogFullDebug(COMPONENT_NLM,
>      >> > +                               "Try to reconnect with STATD");
>      >> > +
>      >> > +                       retry = true;
>      >> > +                       nsm_mon.mon_id.my_id.my_name = NULL;
>      >> > +                       clnt_req_release(cc);
>      >> > +                       nsm_disconnect();
>      >> > +                       goto again;
>      >> > +               }
>      >> > +
>      >> >                  t = rpc_sperror(&cc->cc_error, "failed");
>      >> >                  LogCrit(COMPONENT_NLM,
>      >> >                          "Monitor %s SM_MON %s",
>      >> > --
>      >> > 1.7.1
>      >> >
>      >> > Thanks & Regards,
>      >> > Suhrud
>      >> > _______________________________________________
>      >> > Devel mailing list -- devel@lists.nfs-ganesha.org
>     <mailto:devel@lists.nfs-ganesha.org>
>      >> > To unsubscribe send an email to
>     devel-leave@lists.nfs-ganesha.org
>     <mailto:devel-leave@lists.nfs-ganesha.org>
>      >> >
>      >> _______________________________________________
>      >> Devel mailing list -- devel@lists.nfs-ganesha.org
>     <mailto:devel@lists.nfs-ganesha.org>
>      >> To unsubscribe send an email to
>     devel-leave@lists.nfs-ganesha.org
>     <mailto:devel-leave@lists.nfs-ganesha.org>
>