nsm_monitor and nsm_unmonitor use a single lock to serialize everything. I am not sure what problem would it cause.

On Wed, Oct 24, 2018 at 7:25 PM Daniel Gryniewicz <dang@redhat.com> wrote:
Ah, okay, I see now.  The problem is, I'm not sure how
force-disconnect interacts with nsm_count.  I don't think we can just
disregard it.  In fact, looking at it, we can't force delete nsm_clnt,
because another nsm_monitor() call might be using it.  Maybe we need a
separate client for each nsm_monitor() call?  I don't know if that
works with the lock manager, though.

Frank, any ideas?

Daniel
On Wed, Oct 24, 2018 at 4:22 AM Malahal Naineni <malahal@gmail.com> wrote:
>
> Dan, nsm_disconnect() is disconnecting every time as opposed to only when nsm_count==0. I think this works because it is connecting and disconnecting everything, not just on error. We could pass a flag to nsm_disconnect(force=true) on error and nsm_disconnect(force=false) with normal case. We need this as a way to recovery mechanism.
>
> Regards, Malahal.
> PS: I just looked at this nsm code for a customer issue I posted on IRC yesterday! Thank you Suhrud.
>
> On Tue, Oct 23, 2018 at 7:55 PM Daniel Gryniewicz <dang@redhat.com> wrote:
>>
>> Discussion inline.
>>
>> On 10/23/2018 07:58 AM, Suhrud Patankar wrote:
>> > Hello All,
>> >
>> > I have two questions related to NLM and rpc.statd.
>> >
>> > Setup:
>> > Ganesha version: V2.5.0.4
>> > CentOS Linux release 7.5.1804
>> > FSAL – Any
>> >
>> > 1. By default libntirpc has -DPORTMAP build flag enabled. As I
>> > understand this will use UDP to communicate with rpc.statd. With this
>> > Ganesha fails to monitor the NLM clients and hence lock requests fail.
>> > The call to clnt_tp_ncreate_timed() fails in
>> > clnt_dg_control(CLSET_SVC_ADDR) as addr->len is 16 where as
>> > sizeof(cu->cu_raddr) is 128.
>> > If I disable -DPORTMAP build flag in tirpc, then Ganesha uses “tcp” to
>> > connect to rpc.statd and everything works as expected.
>> > Are we supposed to use “tcp” and disable the -DPORTMAP flag? Why is
>> > this flag enabled by default?
>>
>> This isn't related to UDP vs. TCP.  The issue is that one address buffer
>> passed in is IPv4, and the other is IPv6, and an IPv6 address cannot fit
>> into an IPv4 buffer.  It's definitely a bug, but I don't see exactly how
>> this is happening.  Can you give a more detailed explanation, with
>> callpaths to get there?
>>
>> >
>> > 2. Case: rpc.statd service restarts. Ganesha is not restarted.
>> > nsm_connect() is done only once. On failure to connect to rpc.statd,
>> > we don’t retry. This causes the NLM lock request to fail with “No
>> > lock” error.
>> > Is it expected that Ganesha will restart when rpc.statd restarts? I
>> > added a patch to re-call nsm_connect() on failure and it works for me
>> > now.
>> >
>> > Is this correct way to handle this?
>>
>> We should probably re-connect when statd fails, but I'm not sure how
>> this patch accomplishes that.  All it does is retry if the initial
>> connect fails, it doesn't seem to retry later on.  Can you explain more
>> how this is working?
>>
>> Daniel
>>
>> > The patch is as below.
>> > ------------------------------
>> > From: Suhrud Patankar <suhrudpatankar@gmail.com>
>> > Date: Thu, 11 Oct 2018 00:35:45 -0700
>> > Subject: [PATCH] NFSv3: NLM - Reconnect with statd on failure
>> >
>> > Ganesha creates the statd connection context only once.
>> > In case statd service restarts, we fail to monitor new clients.
>> > On failure, destroy the STATD connection and recreate new one.
>> > ---
>> >   src/Protocols/NLM/nsm.c |   21 +++++++++++++++++++--
>> >
>> >   1 files changed, 19 insertions(+), 2 deletions(-)
>> >
>> >
>> > diff --git a/src/Protocols/NLM/nsm.c b/src/Protocols/NLM/nsm.c
>> >
>> > index 09909d9..8bb41d7 100644
>> > --- a/src/Protocols/NLM/nsm.c
>> > +++ b/src/Protocols/NLM/nsm.c
>> > @@ -47,6 +47,8 @@ bool nsm_connect(void)
>> >          if (nsm_clnt != NULL)
>> >                  return true;
>> >
>> > +       LogFullDebug(COMPONENT_NLM, "Creating new nsm connection ");
>> > +
>> >          nsm_clnt = clnt_ncreate("localhost", SM_PROG, SM_VERS, "tcp");
>> >
>> >          if (CLNT_FAILURE(nsm_clnt)) {
>> > @@ -66,7 +68,7 @@ bool nsm_connect(void)
>> >
>> >   void nsm_disconnect(void)
>> >   {
>> > -       if (nsm_count == 0 && nsm_clnt != NULL) {
>> > +       if (nsm_clnt != NULL) {
>> >                  CLNT_DESTROY(nsm_clnt);
>> >                  nsm_clnt = NULL;
>> >                  AUTH_DESTROY(nsm_auth);
>> > @@ -83,6 +85,7 @@ bool nsm_monitor(state_nsm_client_t *host)
>> >          enum clnt_stat ret;
>> >          char client_ip[SOCK_NAME_MAX];
>> >          char server_ip[SOCK_NAME_MAX * 2]; /* space for proxy_ip */
>> > +       bool retry = false;
>> >
>> >          if (host == NULL)
>> >                  return true;
>> > @@ -120,7 +123,7 @@ bool nsm_monitor(state_nsm_client_t *host)
>> >          LogDebug(COMPONENT_NLM, "Monitor %s", host->ssc_nlm_caller_name);
>> >
>> >          PTHREAD_MUTEX_lock(&nsm_mutex);
>> > -
>> > +again:
>> >          /* create a connection to nsm on the localhost */
>> >          if (!nsm_connect()) {
>> >                  LogCrit(COMPONENT_NLM,
>> > @@ -144,6 +147,20 @@ bool nsm_monitor(state_nsm_client_t *host)
>> >          }
>> >
>> >          if (ret != RPC_SUCCESS) {
>> > +               if (! retry) {
>> > +                       /* Handle case when statd restarts.
>> > +                        * Need to reconnect with statd
>> > +                        */
>> > +                       LogFullDebug(COMPONENT_NLM,
>> > +                               "Try to reconnect with STATD");
>> > +
>> > +                       retry = true;
>> > +                       nsm_mon.mon_id.my_id.my_name = NULL;
>> > +                       clnt_req_release(cc);
>> > +                       nsm_disconnect();
>> > +                       goto again;
>> > +               }
>> > +
>> >                  t = rpc_sperror(&cc->cc_error, "failed");
>> >                  LogCrit(COMPONENT_NLM,
>> >                          "Monitor %s SM_MON %s",
>> > --
>> > 1.7.1
>> >
>> > Thanks & Regards,
>> > Suhrud
>> > _______________________________________________
>> > Devel mailing list -- devel@lists.nfs-ganesha.org
>> > To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org
>> >
>> _______________________________________________
>> Devel mailing list -- devel@lists.nfs-ganesha.org
>> To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org