It looks like my previous message was not delivered because it was too long, so I am summarising it again.
I tried the suggested fix: added SVC_DESTROY in the UMNT call path, along with some debug logging.
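Roughly, the change I tested looks like this (only a sketch, not the exact diff; req_is_mnt_umnt() is a placeholder for however the UMNT request is identified inside nfs_rpc_process_request(), and the log arguments/field access are approximate):

    /* Sketch only: after svc_sendreply() for a MNT_UMNT, tear down the
     * transport so the client's connection gets cleaned up server-side.
     * req_is_mnt_umnt() is a placeholder predicate. */
    if (req_is_mnt_umnt(reqdata)) {
            LogDebug(COMPONENT_DISPATCH,
                     "After svc_sendreply for umount before SVC_DESTROY %p %d %d",
                     xprt, xprt->xp_fd, xprt->xp_refcnt);
            SVC_DESTROY(xprt);      /* first attempt: mark the xprt destroyed */
            LogDebug(COMPONENT_DISPATCH,
                     "After svc_sendreply for umount after SVC_DESTROY %p %d %d",
                     xprt, xprt->xp_fd, xprt->xp_refcnt);
    }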

The issue remains the same.

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] mnt_Umnt :NFS3 :DEBUG :REQUEST PROCESSING: Calling MNT_UMNT path /export

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount before SVC_DESTROY 0x7fff38024450 1045 4

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount after SVC_DESTROY 0x7fff38024450 1045 3

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] nfs_rpc_decode_request :DISP :DEBUG :SVC_DECODE on 0x7fff38024450 fd 1045 (::ffff:10.5.192.59:911) xid=1123634942 returned XPRT_DESTROYED

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] free_nfs_request :DISP :DEBUG :free_nfs_request: 0x7fff38024450 fd 1045 xp_refcnt 3 rq_refcnt 0

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fff38024450 1045 2

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fff38024450 1045 1

10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_4] rpc :TIRPC :DEBUG :Event on already destroyed xprt 0x7fff38024450 1045 1


As the last log line above shows, we are still left with one reference even after the SVC_DESTROY in UMNT and all the SVC_RELEASE calls.


I then tried SVC_RELEASE instead of SVC_DESTROY in the same patch. It works at least for the connection on which the UMNT came: we do release the xprt for it.
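The only difference from the first attempt is the call itself (again just a sketch; I am assuming ntirpc's two-argument SVC_RELEASE form and flag name here):

    if (req_is_mnt_umnt(reqdata))
            /* Drop our reference instead of flagging the xprt destroyed;
             * flag name assumed from ntirpc. */
            SVC_RELEASE(xprt, SVC_RELEASE_FLAG_NONE);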


10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] mnt_Umnt :NFS3 :DEBUG :REQUEST PROCESSING: Calling MNT_UMNT path /export

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount before SVC_RELEASE 0x7fffd000b600 487 4

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount after SVC_RELEASE 0x7fffd000b600 487 3

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] nfs_rpc_decode_request :DISP :DEBUG :SVC_DECODE on 0x7fffd000b600 fd 487 (::ffff:10.5.192.59:774) xid=457315812 returned XPRT_IDLE

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] free_nfs_request :DISP :DEBUG :free_nfs_request: 0x7fffd000b600 fd 487 xp_refcnt 3 rq_refcnt 0

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fffd000b600 487 2

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fffd000b600 487 1

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd000b600 fd 487 recv closed (will set dead), before SVC_DESTROY refcnt 2

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd000b600 fd 487 recv closed (will set dead), after SVC_DESTROY refcnt 1

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fffd000b600 487 1

10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fffd000b600 487 0


As the logs show, the refcnt goes to zero for the connection on which the UMNT came.
But there are also log entries for other connections, which we never clean up because no UMNT arrives on them; we keep polling them and they get closed by the peer.
I have only a single client running the same script, so it looks like the client opens multiple connections and closes them on UMNT.

10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd401b2c0 fd 1045 recv closed (will set dead), before SVC_DESTROY refcnt 3

10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd401b2c0 fd 1045 recv closed (will set dead), after SVC_DESTROY refcnt 2

10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fffd401b2c0 1045 2

10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fffd401b2c0 1045 1 

So this fix does not cover connections that are closed without a UMNT; there is still an fd leak.

One suggestion: should we have an svc_rqst_clean_dead, just like svc_rqst_clean_idle?
svc_rqst_clean_dead would iterate through the xprts and do an SVC_RELEASE on any xprt that has SVC_XPRT_FLAG_DESTROYING set and xp_refcnt == 1.
Check this patch.
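To spell the idea out, a rough sketch of what svc_rqst_clean_dead would do (not the patch itself; the flag value, the list type/walk, and the release helper below are stand-ins for the real ntirpc structures and SVC_RELEASE):

    #include <stdint.h>
    #include <stdlib.h>

    #define SVC_XPRT_FLAG_DESTROYING 0x0002 /* placeholder value, not ntirpc's */

    /* Minimal stand-in for SVCXPRT with only the fields the sketch needs. */
    struct sketch_xprt {
            uint32_t xp_flags;              /* transport state flags */
            int32_t xp_refcnt;              /* reference count */
            struct sketch_xprt *next;       /* placeholder list linkage */
    };

    /* Stand-in for SVC_RELEASE(): drop one reference, free at zero. */
    static void sketch_release(struct sketch_xprt *xprt)
    {
            if (--xprt->xp_refcnt == 0)
                    free(xprt);
    }

    /*
     * Periodic pass, run alongside the idle cleanup: release the last
     * reference on transports that are already marked DESTROYING (peer
     * closed the connection) but never reached refcnt 0 because no
     * further request, e.g. an UMNT, arrived on them.
     */
    void svc_rqst_clean_dead(struct sketch_xprt **head)
    {
            struct sketch_xprt **pp = head;

            while (*pp) {
                    struct sketch_xprt *xprt = *pp;

                    if ((xprt->xp_flags & SVC_XPRT_FLAG_DESTROYING) &&
                        xprt->xp_refcnt == 1) {
                            *pp = xprt->next;     /* unlink before the final release */
                            sketch_release(xprt); /* refcnt 1 -> 0, xprt freed */
                    } else {
                            pp = &xprt->next;
                    }
            }
    }
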
Regards,
Gaurav


On Mon, Dec 10, 2018 at 4:22 PM gaurav gangalwar <gaurav.gangalwar@gmail.com> wrote:


On Thu, Dec 6, 2018 at 7:37 PM Matt Benjamin <mbenjami@redhat.com> wrote:
As the function naming (which partly predates the ntirpc fork) and
comments clearly imply, the semantics of svc_clean_idle is to do
periodic processing of idle, not closed, connections.  At some point
in the not too distant past, regular processing of closed connections
reliably triggered the conn_set_dead condition, hence caused
svc_destroy() to be called on the xprt handles.

Matt
On Wed, Dec 5, 2018 at 3:11 PM William Allen Simpson
<william.allen.simpson@gmail.com> wrote:
>
> Wow, I really hate top posts.  Very hard to comprehend your responses.
>
>
> On 12/5/18 12:28 AM, gaurav gangalwar wrote:
> > I waited for more than 14 hours. This is the script I ran.
> > for i in {1..300}; do echo $i; mount -t nfs -o vers=3,tcp 10.15.196.92:/export /mnt1; umount /mnt1; done
> > It ran out of fds at the 169th iteration.
> > 169
> > mount.nfs: Connection timed out
> > umount: /mnt1: not mounted
> >
> I really don't understand.  There are no constants in the current V2.8-dev code base
> that add up to more than 14 hours, and no constant that would yield 169.
>
> In any case, this particular denial of service attack should be easy to fix by
> adding SVC_DESTROY to the Ganesha umount code path(s).  However, because it seems
> rather unlikely this could happen in production, it's not currently a high priority.
>
> Do you have a production environment or customer report where this occurred?
>
> Would you like help writing the Ganesha patch?
>
> Are you with Nutanix? Or IBM?
>
>
> > After this I was not able to mount or do any operation. There was no active mount on my client, so there was no activity from the client side; does that mean it will never clean up?
> >
> No, as I wrote,
> #     If you stop doing anything and simply wait, that will be 8.24 hours
> #     (1023 * 29 seconds).
>
> I've discussed this with DanG, and that ntirpc design logic from before my time
> seems faulty.  It does cleanup more frequently on a busy system, but is very
> slow on an idle system (the most likely time to have an fd become idle).
>
> DanG suggested that a better approach would be to trigger cleanup based upon a
> high-water-mark.  The best approach would be to order the list of fds by latest
> recv, and clean out the idle ones as soon as they expire -- instead of running
> through the list of (tens of) thousands of fds.  That shouldn't be too hard,
> but we must be careful that we don't add new locking in the fast path.
>
> But this is an own-time project for me, and my primary interest is RDMA and
> performance.  So it may be some time before I've a chance to look at it.  Maybe
> over the holiday....



--

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309