I came across this while we were trying to debug
https://github.com/nfs-ganesha/nfs-ganesha/issues/680 (not quite there
yet).
The following commit was merged fairly recently, partly at my request:
commit 275539c02ad3984e17d6d78b494bd9328cf466c7
Author:     Malahal Naineni <malahal(a)us.ibm.com>
AuthorDate: Wed Oct 24 14:04:34 2018 +0530
Commit:     Frank S. Filz <ffilzlnx(a)mindspring.com>
CommitDate: Fri Nov 18 10:33:13 2022 -0800

    Retry communication with NSM service

It retries SM_MON and SM_UNMON on failure, forcing a disconnect on
error.
The above commit was written before the following one, but was merged
after:
commit 5febadaa98fb53bc4c3f2dab7793d6afb3847073
Author:     Gaurav B. Gangalwar <gaurav.gangalwar(a)gmail.com>
AuthorDate: Tue Feb 19 10:42:57 2019 -0500
Commit:     Frank S. Filz <ffilzlnx(a)mindspring.com>
CommitDate: Fri Feb 22 10:24:07 2019 -0800

    Reduce nsm_count before nsm_disconnect as we will not be able to disconnect nsm
    client in case of RPC failures.

    Change-Id: I0cd5f0614e89f447dc7c5ac278cf03224add05e2
    Signed-off-by: Gaurav B. Gangalwar <gaurav.gangalwar(a)gmail.com>

It moves nsm_count-- before the last error check.
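The combination of both patches means that a failed SM_UNMON request
now decrements nsm_count and is then retried, and the retry decrements
it again. A single SM_UNMON can therefore reduce nsm_count by 2, making
it negative. Oops!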
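To make the sequence concrete, here is a toy model of the interaction.
It is only a sketch and not the real nsm.c code: fake_rpc_call(),
unmon_attempt_current(), unmon_with_retry_current() and
simulate_current() are hypothetical names, and the retry loop is my
assumption about the general shape of the retry logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of this is the real nsm.c code. */
static int nsm_count;

/* Hypothetical stand-in for the RPC call: it fails for the first
 * `failures_left` attempts and succeeds afterwards. */
static int failures_left;

static bool fake_rpc_call(void)
{
    if (failures_left > 0) {
        failures_left--;
        return false;
    }
    return true;
}

/* One SM_UNMON attempt with the decrement placed before the error
 * check, which is the ordering described for commit 5febadaa. */
static bool unmon_attempt_current(void)
{
    bool ok = fake_rpc_call();

    nsm_count--;    /* runs whether or not the call succeeded */
    return ok;
}

/* Retry wrapper in the spirit of commit 275539c0: try again on failure. */
static bool unmon_with_retry_current(int max_attempts)
{
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        if (unmon_attempt_current())
            return true;
        /* the real code would force a disconnect here */
    }
    return false;
}

/* Run one scenario and return the resulting nsm_count. */
int simulate_current(int start_count, int failures, int max_attempts)
{
    nsm_count = start_count;
    failures_left = failures;
    unmon_with_retry_current(max_attempts);
    return nsm_count;
}
```

In this model, simulate_current(1, 1, 2) starts with one monitored
client, fails once, succeeds on the retry, and leaves nsm_count at -1.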
The 2nd commit claims to "Reduce nsm_count before nsm_disconnect".
However, when you look at the patch and the code, the call to
nsm_disconnect() comes after the final LogDebug() in the patch context,
so nsm_count-- was already before the call to nsm_disconnect(). So,
unless there's something subtle going on (e.g. it really wants to
decrement the reference count before the call to clnt_req_release()),
the patch doesn't do what the commit message says.
In my mind, the simplest solution to the double-decrement-on-retry
issue is to move nsm_count-- back to where it was before, so it is only
done on success.
If that seems sane then I'm happy to submit a patch doing that.
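For illustration, the proposed ordering can be modelled like this. Again
this is only a sketch under my own assumptions, not the actual patch:
fake_rpc_call(), unmon_attempt_proposed(), unmon_with_retry_proposed()
and simulate_proposed() are hypothetical names standing in for the real
code paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of this is the real nsm.c code. */
static int nsm_count;

/* Hypothetical stand-in for the RPC call: it fails for the first
 * `failures_left` attempts and succeeds afterwards. */
static int failures_left;

static bool fake_rpc_call(void)
{
    if (failures_left > 0) {
        failures_left--;
        return false;
    }
    return true;
}

/* One SM_UNMON attempt with the decrement done only on success. */
static bool unmon_attempt_proposed(void)
{
    if (!fake_rpc_call())
        return false;   /* failure: leave nsm_count untouched */

    nsm_count--;
    return true;
}

/* Same retry shape as before: try again on failure. */
static bool unmon_with_retry_proposed(int max_attempts)
{
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        if (unmon_attempt_proposed())
            return true;
        /* the real code would force a disconnect here */
    }
    return false;
}

/* Run one scenario and return the resulting nsm_count. */
int simulate_proposed(int start_count, int failures, int max_attempts)
{
    nsm_count = start_count;
    failures_left = failures;
    unmon_with_retry_proposed(max_attempts);
    return nsm_count;
}
```

Here a failure followed by a successful retry decrements nsm_count
exactly once, and a request that fails on every attempt leaves it
unchanged.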
Thanks!
peace & happiness,
martin