Hi,
We are facing an IO hang with Oracle Linux 7.9 clients.
On debugging further, we found that we are sending two zero-window
advertisements: the fd is not closed even though we destroyed the xprt,
so we no longer poll on it and the recv queue fills up.
We tried with a mix of CentOS 8 and Oracle Linux 7.9 clients. On a
zero window, the CentOS clients reset the connection and start a new
connection to recover, so IO hangs for some time but then recovers.
The Oracle Linux clients do not reset the connection and remain in a
hung state forever.
To reproduce it easily, I ran IO with the change below, which simulates
a connection destroy while IO is in progress. I just removed these
lines from svc_rqst_clean_func:
- if ((acc->ts.tv_sec - REC_XPRT(xprt)->recv.ts.tv_sec) < acc->timeout)
- return (false);
-
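To make the effect of that removal concrete, here is a minimal sketch of
the decision the guard makes. It is not the ntirpc function: the name
toy_should_clean and its parameters are hypothetical, and it assumes that
falling through the guard leads to the transport being cleaned up.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical model of an idle-cleanup decision; not the ntirpc code. */
static bool toy_should_clean(time_t now, time_t last_recv, time_t timeout,
                             bool idle_guard)
{
    /* With the guard in place, a transport that received data within
     * the timeout is skipped and left alone. */
    if (idle_guard && (now - last_recv) < timeout)
        return false;

    /* Assumption: falling through means the transport is cleaned up,
     * whether or not IO is still in flight on it. */
    return true;
}
```

With the guard disabled, a transport that received data 5 seconds ago is
still selected for cleanup, which is how the change forces a destroy in
the middle of IO.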
On checking the code further, we found the issue: we can rearm with
refs taken, but no task will ever be executed from epoll because the
xprt is in a destroyed state.
This is the code path that can cause the issue:
In svc_ioq_write, svc_rqst_evchan_write rearms with refs taken on
EWOULDBLOCK.
In svc_rqst_epoll_event, svc_xprt_lookup returns the xprt in a
destroyed state (it was destroyed in some other path, possibly due to
an error or an idle cleanup happening at the same time).
So svc_rqst_xprt_task_send never gets a chance to execute and clean up
the refs taken for the responses.
I have a patch for it, and with the patch it is working fine:
https://github.com/nfs-ganesha/ntirpc/pull/227
Has anyone faced this issue with Oracle Linux clients, and is there any
workaround we can apply on the client side?
Regards,
Gaurav