It looks like my previous message was not delivered because it was too long, so here is a summary.
I tried this fix: as suggested, I added SVC_DESTROY to the UMNT call path and added some logs for debugging:
https://paste.fedoraproject.org/paste/MmpBT2vTwF8DMBvaLTu5rQ
The issue remains the same.
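For reference, the shape of that change is roughly the sketch below. This is only a sketch from memory, not the actual diff (that is in the paste above); the helper req_is_mnt_umnt() is hypothetical, and the include names and struct fields are my assumptions about the ganesha/ntirpc headers.

/*
 * Sketch only -- the real change is in the paste linked above.  Placement is
 * in nfs-ganesha's nfs_rpc_process_request(), right after svc_sendreply(),
 * which is where the log lines below come from.  req_is_mnt_umnt() is a
 * hypothetical stand-in for however the patch actually detects
 * MOUNTPROC3_UMNT.
 */
#include "log.h"        /* LogDebug, COMPONENT_DISPATCH (ganesha) */
#include <rpc/svc.h>    /* SVCXPRT, SVC_DESTROY (ntirpc) */

static void umnt_destroy_xprt(SVCXPRT *xprt)
{
	LogDebug(COMPONENT_DISPATCH,
		 "After svc_sendreply for umount before SVC_DESTROY %p %d %d",
		 xprt, xprt->xp_fd, (int)xprt->xp_refcnt);

	/* Mark the transport destroyed once the UMNT reply has been sent. */
	SVC_DESTROY(xprt);

	LogDebug(COMPONENT_DISPATCH,
		 "After svc_sendreply for umount after SVC_DESTROY %p %d %d",
		 xprt, xprt->xp_fd, (int)xprt->xp_refcnt);
}

/* Called from nfs_rpc_process_request(), after svc_sendreply():
 *
 *	if (req_is_mnt_umnt(req))
 *		umnt_destroy_xprt(req->rq_xprt);
 */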
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] mnt_Umnt :NFS3 :DEBUG :REQUEST PROCESSING: Calling MNT_UMNT path /export
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount before SVC_DESTROY 0x7fff38024450 1045 4
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount after SVC_DESTROY 0x7fff38024450 1045 3
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] nfs_rpc_decode_request :DISP :DEBUG :SVC_DECODE on 0x7fff38024450 fd 1045 (::ffff:10.5.192.59:911) xid=1123634942 returned XPRT_DESTROYED
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] free_nfs_request :DISP :DEBUG :free_nfs_request: 0x7fff38024450 fd 1045 xp_refcnt 3 rq_refcnt 0
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fff38024450 1045 2
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fff38024450 1045 1
10/12/2018 03:30:57 : epoch 5c0e234e : Gaurav-centos7-1 : ganesha.nfsd-78524[svc_4] rpc :TIRPC :DEBUG :Event on already destroyed xprt 0x7fff38024450 1045 1
As the last log line shows ("Event on already destroyed xprt"), we are still left with one ref even after the SVC_DESTROY in UMNT and all the SVC_RELEASE calls.
I then tried SVC_RELEASE instead of SVC_DESTROY in the same patch. It worked at least for the connection the UMNT came on: we do release the xprt for it.
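The only difference from the first experiment is the macro used at that same spot after svc_sendreply(); something like the line below (the SVC_RELEASE_FLAG_NONE argument is my assumption of the usual ntirpc calling convention):

	/* Second experiment: drop our reference instead of forcing the
	 * transport dead. */
	SVC_RELEASE(xprt, SVC_RELEASE_FLAG_NONE);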
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] mnt_Umnt :NFS3 :DEBUG :REQUEST PROCESSING: Calling MNT_UMNT path /export
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount before SVC_RELEASE 0x7fffd000b600 487 4
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] nfs_rpc_process_request :DISP :DEBUG :After svc_sendreply for umount after SVC_RELEASE 0x7fffd000b600 487 3
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] nfs_rpc_decode_request :DISP :DEBUG :SVC_DECODE on 0x7fffd000b600 fd 487 (::ffff:10.5.192.59:774) xid=457315812 returned XPRT_IDLE
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] free_nfs_request :DISP :DEBUG :free_nfs_request: 0x7fffd000b600 fd 487 xp_refcnt 3 rq_refcnt 0
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fffd000b600 487 2
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_2] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fffd000b600 487 1
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd000b600 fd 487 recv closed (will set dead), before SVC_DESTROY refcnt 2
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd000b600 fd 487 recv closed (will set dead), after SVC_DESTROY refcnt 1
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fffd000b600 487 1
10/12/2018 05:23:23 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_79] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fffd000b600 487 0
As these logs show, the refcount drops to zero for the connection the UMNT came on.
But there are also logs for other connections that we are not cleaning up because no UMNT arrived on them; we keep polling on them even though they are getting closed.
I have only a single client running the same script, so it looks like the client opens multiple connections and closes them on UMNT.
10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd401b2c0 fd 1045 recv closed (will set dead), before SVC_DESTROY refcnt 3
10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_vc_wait: 0x7fffd401b2c0 fd 1045 recv closed (will set dead), after SVC_DESTROY refcnt 2
10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_rqst_xprt_task before SVC_RELEASE 0x7fffd401b2c0 1045 2
10/12/2018 05:25:13 : epoch 5c0e3e5a : Gaurav-centos7-1 : ganesha.nfsd-94024[svc_42] rpc :TIRPC :DEBUG :svc_rqst_xprt_task after SVC_RELEASE 0x7fffd401b2c0 1045 1
So this fix does not work for connections that are closed without an UMNT; there is still an fd leak.
One suggestion: should we have a svc_rqst_clean_dead, just like svc_rqst_clean_idle?
svc_rqst_clean_dead would iterate through the xprts and do SVC_RELEASE on any xprt that has SVC_XPRT_FLAG_DESTROYING set and xp_refcnt == 1.
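Roughly, I am imagining something like the sketch below; this is just an illustration of the idea, not the actual patch (see the link that follows). The iterator svc_xprt_foreach(), its callback signature, the __warnx() debug call, and the lockless reads of xp_flags/xp_refcnt are my assumptions about what is available on the ntirpc side.

#include <stdbool.h>
#include <stdint.h>
#include <rpc/svc.h>    /* SVCXPRT, SVC_RELEASE, SVC_XPRT_FLAG_DESTROYING */

/* Sketch of svc_rqst_clean_dead(), modelled on svc_rqst_clean_idle. */
static bool
clean_dead_one(SVCXPRT *xprt, void *arg)
{
	uint32_t *released = arg;

	/* Only transports already marked dead and down to their last ref. */
	if ((xprt->xp_flags & SVC_XPRT_FLAG_DESTROYING) && xprt->xp_refcnt == 1) {
		SVC_RELEASE(xprt, SVC_RELEASE_FLAG_NONE);
		(*released)++;
	}
	return false;   /* assumed to mean "keep iterating" */
}

void
svc_rqst_clean_dead(void)
{
	uint32_t released = 0;

	svc_xprt_foreach(clean_dead_one, &released);
	__warnx(TIRPC_DEBUG_FLAG_SVC_RQST,
		"%s: released %u dead xprts", __func__, released);
}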
Check this patch:
https://paste.fedoraproject.org/paste/1nIK~GC5Ril98nlc-DcpgQ
Regards,
Gaurav
On Thu, Dec 6, 2018 at 7:37 PM Matt Benjamin <mbenjami(a)redhat.com> wrote:
> As the function naming (which partly predates the ntirpc fork) and
> comments clearly imply, the semantics of svc_clean_idle is to do
> periodic processing of idle, not closed, connections. At some point
> in the not too distant past, regular processing of closed connections
> reliably triggered the conn_set_dead condition, hence caused
> svc_destroy() to be called on the xprt handles.
>
> Matt
> On Wed, Dec 5, 2018 at 3:11 PM William Allen Simpson
> <william.allen.simpson(a)gmail.com> wrote:
> >
> > Wow, I really hate top posts. Very hard to comprehend your responses.
> >
> >
> > On 12/5/18 12:28 AM, gaurav gangalwar wrote:
> > > I waited for more than 14 hours. This is the script I ran.
> > > for i in {1..300}; do echo $i; mount -t nfs -o vers=3,tcp 10.15.196.92:/export /mnt1; umount /mnt1; done
> > > It went out of fds at the 169th iteration.
> > > 169
> > > mount.nfs: Connection timed out
> > > umount: /mnt1: not mounted
> > >
> > I really don't understand. There are no constants in the current V2.8-dev
> > code base that add up to more than 14 hours, and no constant that would yield 169.
> >
> > In any case, this particular denial of service attack should be easy to fix by
> > adding SVC_DESTROY to the Ganesha umount code path(s). However, because it seems
> > rather unlikely this could happen in production, it's not currently a high priority.
> >
> > Do you have a production environment or customer report where this occurred?
> >
> > Would you like help writing the Ganesha patch?
> >
> > Are you with Nutanix? Or IBM?
> >
> >
> > > After this I was not able to mount or do any operation. There was no active
> > > mount on my client. So there was no activity from the client side, so does it
> > > mean it will never clean up?
> > >
> > No, as I wrote,
> >      # If you stop doing anything and simply wait, that will be 8.24 hours
> >      # (1023 * 29 seconds).
> >
> > I've discussed this with DanG, and that ntirpc design logic from before my time
> > seems faulty. It does cleanup more frequently on a busy system, but is very
> > slow on an idle system (the most likely time to have an fd become idle).
> >
> > DanG suggested that a better approach would be to trigger cleanup based upon a
> > high-water-mark. The best approach would be to order the list of fds by latest
> > recv, and clean out the idle ones as soon as they expire -- instead of running
> > through the list of (tens of) thousands of fds. That shouldn't be too hard,
> > but we must be careful that we don't add new locking in the fast path.
> >
> > But this is an own-time project for me, and my primary interest is RDMA and
> > performance. So it may be some time before I've a chance to look at it. Maybe
> > over the holiday....
>
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-821-5101
> fax. 734-769-8938
> cel. 734-216-5309
>