Unless the NFS server sends an error, the client is supposed to retry the request forever on a hard mount. Try a different client (CentOS 7.5, for example).
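For reference, both behaviours can be requested explicitly at mount time. A sketch with placeholder server and paths (note that timeo is in tenths of a second on TCP mounts):

```shell
# Hard mount (the default): a timed-out request is retried forever;
# the application never sees a timeout error, only "server not
# responding" messages in dmesg.
mount -t nfs4 -o hard,timeo=600,retrans=2 server:/export /mnt/export

# Soft mount: after retrans retransmissions the request fails and the
# error (e.g. EIO) is returned to the application -- risky for writes.
mount -t nfs4 -o soft,timeo=600,retrans=2 server:/export /mnt/export
```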

Regards, Malahal.

On Fri, Sep 28, 2018 at 11:59 PM Rolf Anders <rolf.anders@rz.uni-augsburg.de> wrote:
On Wed, Sep 26, 2018 at 03:27:45PM +0200, Boris Faure wrote:
> I've figured out what was going on.
> It is due to "timeo=2,retrans=3". In dmesg I saw "nfs: server 127.0.0.1
> not responding, timed out", and in tcpdump I saw that some WRITEs were not
> acked.
> Because of that, the kernel decided that some writes had failed (and indeed,
> they had), and thus fdatasync() returned EIO.
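Worth noting: timeo is in tenths of a second, so timeo=2 means a 0.2 s timeout before the first retransmit, which fires easily on a loaded server. The fdatasync() path described above can be exercised with dd itself, since conv=fdatasync makes dd call fdatasync() on the output file before exiting. A sketch; TESTFILE is a placeholder, point it at a file on the affected mount:

```shell
# conv=fdatasync: dd calls fdatasync() on the output file before it
# exits, so a failed NFS writeback shows up as an error message and a
# nonzero exit status instead of being silently absorbed by the cache.
TESTFILE="${TESTFILE:-/tmp/testfile}"
dd if=/dev/zero of="$TESTFILE" bs=64K count=16 conv=fdatasync 2>/dev/null
echo "exit status: $?"
```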

I have a similar problem with ganesha 2.6.3, FSAL_GPFS and NFSv4.2
(haven't tried 4.0 or 4.1). When writing a large file, the server logs
the following from time to time, and the data transfer stalls for a few
seconds (or sometimes stops completely: "dd: error writing 'testfile':
Remote I/O error"):

epoch 5bad1b6e : srv : ganesha.nfsd-10257[svc_890] rpc :TIRPC :EVENT :svc_vc_recv: 0x7f5d78225aa0 fd 40 recv errno 104 (will set dead)

However, all WRITEs are answered with NFS4_OK.  The network trace shows
that the client suddenly closes the connection (frame 8443) and opens
it again 3 seconds later:

8439   1.659424 srv -> cli TCP 66 2049 -> 712 [ACK] Seq=45441 Ack=96896309 Win=12399 Len=0 TSval=1320132075 TSecr=4244429882
8440   1.664256 srv -> cli NFS 94 V4 Reply (Call In 8436)
8441   1.664289 srv -> cli NFS 94 V4 Reply (Call In 8436)
8442   1.664373 srv -> cli NFS 314 V4 Reply (Call In 8436) WRITE
8443   1.664490 cli -> srv TCP 66 712 -> 2049 [FIN, ACK] Seq=96896309 Ack=45497 Win=1444 Len=0 TSval=4244429888 TSecr=1320132076
8444   1.664549 cli -> srv TCP 60 712 -> 2049 [RST] Seq=96896309 Win=0 Len=0
8445   1.664559 srv -> cli TCP 66 2049 -> 712 [ACK] Seq=45745 Ack=96896310 Win=12399 Len=0 TSval=1320132076 TSecr=4244429888
8446   1.664676 cli -> srv TCP 60 712 -> 2049 [RST] Seq=96896310 Win=0 Len=0
8447   4.682295 cli -> srv TCP 74 [TCP Port numbers reused] 712 -> 2049 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=4244432906 TSecr=0 WS=128
8448   4.682348 srv -> cli TCP 74 2049 -> 712 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1320132831 TSecr=4244432906 WS=128
8449   4.682592 cli -> srv TCP 66 712 -> 2049 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=4244432906 TSecr=1320132831

In the kernel log on the client (with /proc/sys/sunrpc/nfs_debug set to
0xffff), I don't see any reason for closing the connection:

Sep 28 18:24:47.320222 kernel: NFS: write_begin(7.4/testfile(17096813), 4096@115703808)
Sep 28 18:24:47.320284 kernel: NFS: write_end(7.4/testfile(17096813), 4096@115703808)
Sep 28 18:24:47.320362 kernel: NFS:       nfs_updatepage(7.4/testfile 4096@115703808)
Sep 28 18:24:47.320425 kernel: NFS:       nfs_updatepage returns 0 (isize 115707904)
Sep 28 18:24:47.320501 kernel: NFS: initiated pgio call (req 0:57/17096813, 1048576 bytes @ offset 112979968)
Sep 28 18:24:47.320586 kernel: NFS: initiated pgio call (req 0:57/17096813, 1048576 bytes @ offset 114028544)
Sep 28 18:24:47.320662 kernel: NFS: initiated pgio call (req 0:57/17096813, 630784 bytes @ offset 115077120)
Sep 28 18:24:47.320737 kernel: --> nfs4_alloc_slot used_slots=fffffffffffffff highest_used=59 max_slots=64
Sep 28 18:24:47.320813 kernel: <-- nfs4_alloc_slot used_slots=1fffffffffffffff highest_used=60 slotid=60
Sep 28 18:24:47.320876 kernel: --> nfs4_alloc_slot used_slots=1fffffffffffffff highest_used=60 max_slots=64
Sep 28 18:24:47.320950 kernel: <-- nfs4_alloc_slot used_slots=3fffffffffffffff highest_used=61 slotid=61
Sep 28 18:24:47.321013 kernel: --> nfs4_alloc_slot used_slots=3fffffffffffffff highest_used=61 max_slots=64
Sep 28 18:24:47.321076 kernel: <-- nfs4_alloc_slot used_slots=7fffffffffffffff highest_used=62 slotid=62
Sep 28 18:24:49.103567 kernel: NFS:       nfs_updatepage returns 0 (isize 115744768)
Sep 28 18:24:49.103758 kernel: NFS: write_begin(7.4/testfile(17096813), 4096@115744768)
Sep 28 18:24:49.103835 kernel: NFS: write_end(7.4/testfile(17096813), 4096@115744768)
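(For anyone who wants to reproduce the logging above: the mask can be set either by writing to the procfs file directly, as mentioned, or with rpcdebug from nfs-utils. Both need root.)

```shell
# Enable verbose NFS client debugging (two equivalent ways):
echo 0xffff > /proc/sys/sunrpc/nfs_debug   # raw mask, as used above
rpcdebug -m nfs -s all                     # same, via rpcdebug
rpcdebug -m rpc -s all                     # also log the RPC/transport layer
# Turn it off again:
rpcdebug -m nfs -c all
rpcdebug -m rpc -c all
```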

The file was written using 'dd if=/dev/zero of=testfile
bs=64K count=32768'; smaller block sizes seem to make the
problem appear more often. The file system was mounted with
'timeo=600,retrans=2,sec=krb5i,vers=4.2'.

The client runs Ubuntu 18.04 (kernel 4.15.0-32-generic), the server Ubuntu
16.04 (4.4.0-109-generic) and GPFS 5.0.1.1.

What might make the client close the connection? Any thoughts are welcome.

Rolf

--
Rolf Anders ............................ http://www.rz.uni-augsburg.de
Universität Augsburg, Rechenzentrum ............. Tel. (0821) 598-2030
86135 Augsburg .................................. Fax. (0821) 598-2028
_______________________________________________
Devel mailing list -- devel@lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org