Thanks a lot for the pointer. Yes, this makes complete sense.

In our system we fail over an active NFS server by sending RST to the clients connected to it, and in the tests I mentioned a failover had happened before we started seeing this issue.
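For anyone who wants to see what a client socket reports after the peer resets the connection, here is a minimal userspace sketch. It is only an illustration over loopback: it does not reproduce the kernel RPC client's behaviour, and the function name send_after_peer_reset is made up for this example. It assumes that closing a socket with SO_LINGER set to a zero timeout aborts the connection with RST, which is the usual behaviour on Linux.

```python
import errno
import socket
import struct
import time


def send_after_peer_reset(attempts=5):
    """Return the errno names seen when writing to a TCP socket after the peer sent RST."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    # l_onoff=1, l_linger=0: close() aborts the connection with RST instead of FIN.
    server_side.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    server_side.close()
    listener.close()
    time.sleep(0.2)  # give the RST time to reach the client socket

    seen = []
    for _ in range(attempts):
        try:
            client.send(b"ping")
        except OSError as exc:
            # Record the symbolic errno name, e.g. ECONNRESET or EPIPE.
            seen.append(errno.errorcode.get(exc.errno, str(exc.errno)))
    client.close()
    return seen


if __name__ == "__main__":
    print(send_after_peer_reset())
```

On Linux the later writes typically fail with EPIPE, because the socket stays closed until the application opens a new connection.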

So, if my understanding of the kernel versions is right, this issue is fixed in RHEL7.7 and later?

From: QR <zhbingyin@sina.com>
Reply to: "zhbingyin@sina.com" <zhbingyin@sina.com>
Date: Thursday, 23 April 2020 at 3:20 AM
To: Deepthi Shivaramu <des@vmware.com>, ganesha-devel <devel@lists.nfs-ganesha.org>
Subject: Re: [NFS-Ganesha-Devel] NFSv3 mounts hang from a client forever and new mounts to the same share hangs

refer to https://lore.kernel.org/linux-nfs/20181212135157.4489-1-dwysocha@redhat.com/T/

--------------------------------

----- Original Message -----
From: des@vmware.com
To: devel@lists.nfs-ganesha.org
Subject: [NFS-Ganesha-Devel] NFSv3 mounts hang from a client forever and new mounts to the same share hangs
Date: 22 April 2020, 14:36

We are using NFS Ganesha and mounting NFSv3 with auth_sys on RHEL7.6 Linux clients.
I am seeing a weird issue: after running some system tests, some Linux clients enter a state where the existing mount point for one share (let's say testShare1) becomes inaccessible, and trying to mount the same share again hangs forever. The client does not recover from this state at all. The strange thing is that it can mount other shares successfully. On top of that, testShare1 is accessible from other clients without problems.
The packet captures on the client show no packets on the wire, and the Ganesha logs don't contain any hints either.
This looks like a client issue; the clients in this setup run RHEL7.6. In the client's /var/log/messages, I see these messages continuously:
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: xs_tcp_send_request(524460) = -32
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 marshaling UNIX cred ffff889b0829c900
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 call_status (status -32)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 using AUTH_UNIX cred ffff889b0829c900 to wrap rpc data
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 xprt_transmit(524460)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: xs_tcp_send_request(524460) = -32
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 call_bind (status 0)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 call_status (status -32)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 call_connect xprt ffff889b3c5a2800 is connected
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 call_transmit (status 0)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 xprt_prepare_transmit
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 rpc_xdr_encode (status 0)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 marshaling UNIX cred ffff889b0829c900
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 using AUTH_UNIX cred ffff889b0829c900 to wrap rpc data
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 call_bind (status 0)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 call_connect xprt ffff889b3c5a6000 is connected
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 call_transmit (status 0)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 xprt_prepare_transmit
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 rpc_xdr_encode (status 0)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 marshaling UNIX cred ffff889b0829c900
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 xprt_transmit(524460)
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: xs_tcp_send_request(524460) = -32
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 using AUTH_UNIX cred ffff889b0829c900 to wrap rpc data
Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 call_status (status -32)
#define EPIPE 32 /* Broken pipe */
xs_tcp_send_request - write an RPC request to a TCP socket
https://github.com/torvalds/linux/blob/master/net/sunrpc/xprtsock.c#L1027
The Linux source code linked above shows that the client is not able to send the RPC request on the socket: the socket send fails with EPIPE.
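As a quick cross-check of the status value in the log, the sketch below maps a negative kernel status to its errno name using Python's standard errno table. The helper name describe_rpc_status is hypothetical, and the sketch assumes it runs on Linux, where the table matches the kernel's numbering for this value.

```python
import errno
import os


def describe_rpc_status(status):
    """Map a kernel-style negative status (e.g. -32) to (errno name, message)."""
    code = abs(status)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # The log above reports "xs_tcp_send_request(524460) = -32".
    print(describe_rpc_status(-32))
```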
I believe NFS requests cannot be sent out over this RPC transport. Since all access to a particular share goes through the same transport, every request for that share keeps failing with EPIPE.
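One way to check this against the log is to group the RPC debug lines by task ID and record which xprt pointer each task reports and how often it hits a negative status. The following is a rough sketch only: the function name summarize and the regular expressions are written for this example and assume the exact message format shown in the excerpt above.

```python
import re
from collections import defaultdict

# Patterns assume the RPC debug message format seen in the excerpt above.
CONNECT_RE = re.compile(r"RPC:\s+(\d+) call_connect xprt ([0-9a-f]+) is connected")
STATUS_RE = re.compile(r"RPC:\s+(\d+) call_status \(status (-?\d+)\)")


def summarize(lines):
    """Return {task_id: (xprt_pointer_or_None, count_of_negative_statuses)}."""
    xprt_by_task = {}
    errors_by_task = defaultdict(int)
    for line in lines:
        match = CONNECT_RE.search(line)
        if match:
            xprt_by_task[match.group(1)] = match.group(2)
            continue
        match = STATUS_RE.search(line)
        if match and int(match.group(2)) < 0:
            errors_by_task[match.group(1)] += 1
    tasks = sorted(set(xprt_by_task) | set(errors_by_task))
    return {task: (xprt_by_task.get(task), errors_by_task[task]) for task in tasks}


if __name__ == "__main__":
    # Four lines copied from the log excerpt above.
    sample = [
        "Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 call_status (status -32)",
        "Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 call_status (status -32)",
        "Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 57696 call_connect xprt ffff889b3c5a2800 is connected",
        "Apr 19 03:47:00 w1h34v25-c0006 kernel: RPC: 56546 call_connect xprt ffff889b3c5a6000 is connected",
    ]
    for task, (xprt, failures) in summarize(sample).items():
        print(task, xprt, failures)
```

Run over the full /var/log/messages, a summary like this makes it easier to see how many distinct transports are involved; in the short excerpt, the two task IDs report different xprt pointers.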
I see a thread at https://bugzilla.redhat.com/show_bug.cgi?id=692315#c15 where Jeff Layton discusses the same issue, but that seems to have been fixed as early as RHEL6.2.
Jeff, can you please help us understand whether the fix for the above bug is in RHEL7.6 and, if so, why we still see this issue?
_______________________________________________
Devel mailing list -- devel@lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org