Unless the NFS server returns an error, the client is supposed to retry the
request forever on a hard mount. Try a different client (CentOS 7.5, for
example).
Regards, Malahal.
On Fri, Sep 28, 2018 at 11:59 PM Rolf Anders <rolf.anders(a)rz.uni-augsburg.de>
wrote:
On Wed, Sep 26, 2018 at 03:27:45PM +0200, Boris Faure wrote:
> I understood what was going on.
> This is due to "timeo=2,retrans=3". In dmesg, I saw "nfs: server 127.0.0.1
> not responding, timed out" and in tcpdump I saw that some WRITEs were not
> acked.
> Because of that, the kernel decided that some writes failed (and indeed,
> they have...) and thus fdatasync() returned EIO.
I have a similar problem with ganesha 2.6.3, FSAL_GPFS and NFSv4.2
(haven't tried 4.0 or 4.1). When writing a large file the server logs
the following from time to time, and the data transfer stops for a few
seconds (or sometimes completely: "dd: error writing 'testfile': Remote
I/O error"):
epoch 5bad1b6e : srv : ganesha.nfsd-10257[svc_890] rpc :TIRPC :EVENT :svc_vc_recv: 0x7f5d78225aa0 fd 40 recv errno 104 (will set dead)
However, all WRITEs are answered with NFS4_OK. The network trace shows
that the client suddenly closes the connection (frame 8443) and opens
it again 3 seconds later:
8439 1.659424 srv -> cli TCP 66 2049 -> 712 [ACK] Seq=45441 Ack=96896309 Win=12399 Len=0 TSval=1320132075 TSecr=4244429882
8440 1.664256 srv -> cli NFS 94 V4 Reply (Call In 8436)
8441 1.664289 srv -> cli NFS 94 V4 Reply (Call In 8436)
8442 1.664373 srv -> cli NFS 314 V4 Reply (Call In 8436) WRITE
8443 1.664490 cli -> srv TCP 66 712 -> 2049 [FIN, ACK] Seq=96896309 Ack=45497 Win=1444 Len=0 TSval=4244429888 TSecr=1320132076
8444 1.664549 cli -> srv TCP 60 712 -> 2049 [RST] Seq=96896309 Win=0 Len=0
8445 1.664559 srv -> cli TCP 66 2049 -> 712 [ACK] Seq=45745 Ack=96896310 Win=12399 Len=0 TSval=1320132076 TSecr=4244429888
8446 1.664676 cli -> srv TCP 60 712 -> 2049 [RST] Seq=96896310 Win=0 Len=0
8447 4.682295 cli -> srv TCP 74 [TCP Port numbers reused] 712 -> 2049 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=4244432906 TSecr=0 WS=128
8448 4.682348 srv -> cli TCP 74 2049 -> 712 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1320132831 TSecr=4244432906 WS=128
8449 4.682592 cli -> srv TCP 66 712 -> 2049 [ACK] Seq=1 Ack=1 Win=29312 Len=0 TSval=4244432906 TSecr=1320132831
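To put a number on the pause between the client's FIN (frame 8443) and its new SYN (frame 8447), here is a minimal Python sketch that reads the relative timestamps out of trace lines like the ones above. The regular expression and the function names are my own and assume this exact column layout (frame number, relative time, source, destination, protocol, then TCP flags in square brackets); the sample lines are shortened copies of frames from the trace.

```python
import re

# Shortened copies of the client-side TCP frames from the trace above.
TRACE = """\
8443 1.664490 cli -> srv TCP 66 712 -> 2049 [FIN, ACK]
8444 1.664549 cli -> srv TCP 60 712 -> 2049 [RST]
8446 1.664676 cli -> srv TCP 60 712 -> 2049 [RST]
8447 4.682295 cli -> srv TCP 74 712 -> 2049 [SYN]
"""

# frame number, relative time, source, destination, TCP flags in [...]
FRAME_RE = re.compile(
    r"^(\d+)\s+(\d+\.\d+)\s+(\S+) -> (\S+)\s+TCP\b.*\[([A-Z, ]+)\]"
)

def parse_frames(text):
    """Return a list of (frame_no, time, source, flags) for TCP frames."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            flags = {f.strip() for f in m.group(5).split(",")}
            frames.append((int(m.group(1)), float(m.group(2)), m.group(3), flags))
    return frames

def reconnect_gap(frames, client="cli"):
    """Seconds between the client's first FIN and its next bare SYN."""
    fin_time = next(t for _, t, src, fl in frames
                    if src == client and "FIN" in fl)
    syn_time = next(t for _, t, src, fl in frames
                    if src == client and fl == {"SYN"} and t > fin_time)
    return syn_time - fin_time

gap = reconnect_gap(parse_frames(TRACE))
print(f"client reconnected after {gap:.3f} s")
```

This only measures the gap; it says nothing about why the client closed the connection.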
In the kernel log on the client (/proc/sys/sunrpc/nfs_debug set to 0xffff)
I don't see any reason for closing the connection:
Sep 28 18:24:47.320222 kernel: NFS: write_begin(7.4/testfile(17096813), 4096@115703808)
Sep 28 18:24:47.320284 kernel: NFS: write_end(7.4/testfile(17096813), 4096@115703808)
Sep 28 18:24:47.320362 kernel: NFS: nfs_updatepage(7.4/testfile 4096@115703808)
Sep 28 18:24:47.320425 kernel: NFS: nfs_updatepage returns 0 (isize 115707904)
Sep 28 18:24:47.320501 kernel: NFS: initiated pgio call (req 0:57/17096813, 1048576 bytes @ offset 112979968)
Sep 28 18:24:47.320586 kernel: NFS: initiated pgio call (req 0:57/17096813, 1048576 bytes @ offset 114028544)
Sep 28 18:24:47.320662 kernel: NFS: initiated pgio call (req 0:57/17096813, 630784 bytes @ offset 115077120)
Sep 28 18:24:47.320737 kernel: --> nfs4_alloc_slot used_slots=fffffffffffffff highest_used=59 max_slots=64
Sep 28 18:24:47.320813 kernel: <-- nfs4_alloc_slot used_slots=1fffffffffffffff highest_used=60 slotid=60
Sep 28 18:24:47.320876 kernel: --> nfs4_alloc_slot used_slots=1fffffffffffffff highest_used=60 max_slots=64
Sep 28 18:24:47.320950 kernel: <-- nfs4_alloc_slot used_slots=3fffffffffffffff highest_used=61 slotid=61
Sep 28 18:24:47.321013 kernel: --> nfs4_alloc_slot used_slots=3fffffffffffffff highest_used=61 max_slots=64
Sep 28 18:24:47.321076 kernel: <-- nfs4_alloc_slot used_slots=7fffffffffffffff highest_used=62 slotid=62
Sep 28 18:24:49.103567 kernel: NFS: nfs_updatepage returns 0 (isize 115744768)
Sep 28 18:24:49.103758 kernel: NFS: write_begin(7.4/testfile(17096813), 4096@115744768)
Sep 28 18:24:49.103835 kernel: NFS: write_end(7.4/testfile(17096813), 4096@115744768)
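The used_slots values in the nfs4_alloc_slot lines are easier to read once decoded. The sketch below assumes that used_slots is a hexadecimal bitmap in which bit i stands for slot i; that matches the highest_used values printed next to it, but it is my reading of the log, not something I checked in the kernel source. The helper name is made up for illustration.

```python
MAX_SLOTS = 64  # max_slots value reported in the log

def decode_used_slots(hex_mask, max_slots=MAX_SLOTS):
    """Return (slots in use, highest used slot id, free slots).

    Assumes hex_mask is a bitmap where bit i represents slot i.
    """
    mask = int(hex_mask, 16)
    in_use = bin(mask).count("1")
    highest = mask.bit_length() - 1  # -1 when no slot is in use
    return in_use, highest, max_slots - in_use

# used_slots values copied from the log above, in order of appearance.
LOGGED_MASKS = [
    "fffffffffffffff",
    "1fffffffffffffff",
    "3fffffffffffffff",
    "7fffffffffffffff",
]

for hex_mask in LOGGED_MASKS:
    in_use, highest, free = decode_used_slots(hex_mask)
    print(f"{hex_mask:>16}: {in_use} in use, highest slot {highest}, {free} free")
```

If that interpretation holds, the last nfs4_alloc_slot line before the pause in the log shows 63 of the 64 slots in use.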
The file was written using 'dd if=/dev/zero of=testfile
bs=64K count=32768'; smaller block sizes seem to make the
problem appear more often. The file system was mounted with
'timeo=600,retrans=2,sec=krb5i,vers=4.2'.
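For scale, here is a small Python sketch of the arithmetic behind those parameters. It assumes dd's convention that the K suffix means 1024 bytes and the nfs(5) convention that timeo is given in tenths of a second; the helper function is only illustrative.

```python
def dd_size_to_bytes(size):
    """Convert a dd size such as '64K' to bytes (binary suffixes only)."""
    suffixes = {"K": 1024, "M": 1024**2, "G": 1024**3}
    if size[-1] in suffixes:
        return int(size[:-1]) * suffixes[size[-1]]
    return int(size)

block_size = dd_size_to_bytes("64K")  # bs=64K
count = 32768                         # count=32768
total_bytes = block_size * count

timeo_seconds = 600 / 10  # timeo=600, in tenths of a second per nfs(5)

print(f"total: {total_bytes} bytes ({total_bytes / 1024**3:g} GiB)")
print(f"timeo=600 -> {timeo_seconds:g} s")
```

So the test writes 2 GiB in total, and the configured timeout is 60 seconds, far longer than the pauses seen in the trace and the kernel log.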
The client runs Ubuntu 18.04 (kernel 4.15.0-32-generic), the server Ubuntu
16.04 (4.4.0-109-generic) and GPFS 5.0.1.1.
What might make the client close the connection? Any thoughts are welcome.
Rolf
--
Rolf Anders ............................
http://www.rz.uni-augsburg.de
Universität Augsburg, Rechenzentrum ............. Tel. (0821) 598-2030
86135 Augsburg .................................. Fax. (0821) 598-2028