Hello Ivano,
please open a case with IBM support.

We will need to look at more data here - a gpfs/ces/nfs snap and the exact versions involved.

I would be curious to see your authentication config. We have seen a similar issue with a large customer - unrelated to Ganesha!
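
Something along these lines should capture most of it on the CES nodes (the exact gpfs.snap behaviour and package names vary by Scale release, so treat this as a sketch):

# full snap from a protocol node - recent releases include CES/Ganesha data
gpfs.snap

# exact package levels
rpm -qa | grep -Ei 'gpfs|ganesha'

# current CES authentication configuration
mmuserauth service list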

Michael
Mit freundlichen Grüßen / with best regards
Michael Diederich
IBM Systems Group
Spectrum Scale Software Development

IBM Deutschland Research & Development GmbH
Chairwoman of the Supervisory Board: Martina Koederitz
Managing Director: Dirk Wittkopp
Registered office: Böblingen
Registration court: Amtsgericht Stuttgart, HRB 243294
mail: michael.diederich@de.ibm.com
fon: +49-7034-274-4062
address: Am Weiher 24, D-65451 Kelsterbach





From:        "Talamo Ivano Giuseppe (PSI)" <Ivano.Talamo@psi.ch>
To:        Michael Diederich <diederich@de.ibm.com>
Cc:        "support@lists.nfs-ganesha.org" <support@lists.nfs-ganesha.org>
Date:        10/17/2018 10:37 AM
Subject:        Re: [NFS-Ganesha-Support] unavailability of NFSv3




Dear Michael,

We hit the issue again and I was able to collect some data.

First of all, I observed an extremely high number of connections in CLOSE_WAIT between Ganesha and one client, always with 29 bytes in the Recv-Q buffer.
There are about 11k lines like the following in the netstat output:

tcp6      29      0 server:38534 client:38610 CLOSE_WAIT
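
Something along these lines groups them per remote peer (assuming the usual netstat -tan layout, with the foreign address in column 5 and the state in column 6):

netstat -tan | awk '$6 == "CLOSE_WAIT" {count[$5]++} END {for (peer in count) print count[peer], peer}'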

The client is always the same one,
and 38610 is the port on the client where NLM is running:

[root@server ~]# rpcinfo -p client
  program vers proto   port  service
   100000    4   tcp    111  portmapper
   100000    3   tcp    111  portmapper
   100000    2   tcp    111  portmapper
   100000    4   udp    111  portmapper
   100000    3   udp    111  portmapper
   100000    2   udp    111  portmapper
   100024    1   udp  44714  status
   100024    1   tcp  56631  status
   100021    1   udp  35435  nlockmgr
   100021    3   udp  35435  nlockmgr
   100021    4   udp  35435  nlockmgr
   100021    1   tcp  38610  nlockmgr
   100021    3   tcp  38610  nlockmgr
   100021    4   tcp  38610  nlockmgr


NLM itself seems to reply fine:

[root@server ~]# rpcinfo -T tcp client 100021 3
program 100021 version 3 ready and waiting


I also collected the kernel stacks from /proc/<PID-of-Ganesha-process>/task/* (a loop for doing that is sketched after the traces below).
There is no trace of writev and it all looks fine to me; most of the tasks (274 out of 283) show a stack like the following:

[<ffffffff810f7016>] futex_wait_queue_me+0xc6/0x130
[<ffffffff810f7cdb>] futex_wait+0x17b/0x280
[<ffffffff810f9a16>] do_futex+0x106/0x5a0
[<ffffffff810f9f30>] SyS_futex+0x80/0x180
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

There are a couple like:
[<ffffffffc0a18cd1>] cxiWaitEventWait+0x1d1/0x2f0 [mmfslinux]
[<ffffffffc0b92a10>] _ZN6ThCond12internalWaitEP16KernelSynchStatejPv+0xd0/0x260 [mmfs26]
[<ffffffffc0b93d0e>] _ZN6ThCond18kInterruptibleWaitEiPKc+0x1de/0x3d0 [mmfs26]
[<ffffffffc0b42e94>] _Z17gpfsGaneshaUpdateP13gpfsVfsData_tPiS1_P10cxiVattr_tP5glockS1_PjS6_S6_iS6_+0x264/0x7d0 [mmfs26]
[<ffffffffc0a20ba2>] gpfs_wait_inode_update+0x1a2/0x740 [mmfslinux]
[<ffffffffc0a211aa>] get_inode_update+0x6a/0x90 [mmfslinux]
[<ffffffffc0a28eb2>] kxGanesha+0x5a2/0x3990 [mmfslinux]
[<ffffffffc0c01f17>] _Z8ss_ioctljm+0x677/0x1c00 [mmfs26]
[<ffffffffc0a097e1>] ss_fs_unlocked_ioctl+0xf1/0x530 [mmfslinux]
[<ffffffff8121730d>] do_vfs_ioctl+0x33d/0x540
[<ffffffff812175b1>] SyS_ioctl+0xa1/0xc0
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

a couple like:
[<ffffffff81217ff5>] poll_schedule_timeout+0x55/0xb0
[<ffffffff8121957d>] do_sys_poll+0x4cd/0x580
[<ffffffff81219734>] SyS_poll+0x74/0x110
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

and 5 like:
[<ffffffff8124dc9e>] ep_poll+0x23e/0x360
[<ffffffff8124f12d>] SyS_epoll_wait+0xed/0x120
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
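
For reference, the per-task stacks can be gathered with a loop roughly like the following, assuming a single gpfs.ganesha.nfsd process and root access to /proc/<pid>/task/*/stack (the output file name is just an example):

for t in /proc/$(pidof gpfs.ganesha.nfsd)/task/*; do echo "=== $t"; cat $t/stack; done > ganesha-stacks.txt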


When the issue started I also ran ganesha_mgr set_log COMPONENT_ALL FULL_DEBUG for a few minutes
and collected about half a gigabyte of logs, in case they can be used for further investigation.
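
For reference, raising and then restoring the log level is just:

ganesha_mgr set_log COMPONENT_ALL FULL_DEBUG
ganesha_mgr set_log COMPONENT_ALL EVENT

(EVENT being our normal level, as mentioned in the original mail below.)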

Thanks,
Ivano

On 09/10/18 15:09, "Michael Diederich" <diederich@de.ibm.com> wrote:

   Ivano,
   I am sure we are working on your ticket :-)
   
   Have a look at the sum of your TCP Send-Q bytes (netstat output) and compare that to the tcp_wmem setting (sysctl).
   
   It is possible you have a client that is not acking the data it requested...
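   
   Something along these lines gives the numbers (assuming the usual netstat -tan layout, where Send-Q is the third column):
   
   netstat -tan | awk 'NR>2 {sum += $3} END {print "total Send-Q bytes:", sum}'
   sysctl net.ipv4.tcp_wmem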
   
   Michael
   
   
   
   
   From:        "Talamo Ivano Giuseppe (PSI)" <Ivano.Talamo@psi.ch>
   To:        "support@lists.nfs-ganesha.org" <support@lists.nfs-ganesha.org>
   Date:        10/08/2018 03:04 PM
   Subject:        [NFS-Ganesha-Support] unavailability of NFSv3
   ________________________________________
   
   
   
   Hi all,
   
   We are using nfs-ganesha via the IBM Spectrum Scale protocol setup, currently consisting of 2 servers and around 50 clients.
   After a couple of months of smooth operation, we started to experience (already twice in three days) a critical issue in which all clients
   become unable to mount the filesystem.
   
   When the issue happens this is what we see via rpcinfo on the server:
   
   [root@server ~]# rpcinfo -s
     program version(s) netid(s)                         service     owner
      100000  2,3,4     local,udp,tcp,udp6,tcp6          portmapper  superuser
      100024  1         tcp6,udp6,tcp,udp                status      29
      100003  4,3       tcp6,tcp,udp6,udp                nfs         superuser
      100005  3,1       tcp6,tcp,udp6,udp                mountd      superuser
      100021  4         tcp6,tcp,udp6,udp                nlockmgr    superuser
      100011  2,1       tcp6,tcp,udp6,udp                rquotad     superuser
   [root@host ~]# rpcinfo -T tcp localhost 100003 3
   rpcinfo: RPC: Timed out
   
   
   The log level is set to EVENT, and when the issue starts ganesha.log fills up with lines like the following:
   
   2018-10-04 16:40:45 : epoch 0002003d : server : gpfs.ganesha.nfsd-40452[State_Async] nlm_send_async :NLM :MAJ :Cannot create NLM async tcp connection to client ::ffff:129.129.117.65
   2018-10-04 16:40:45 : epoch 0002003d : server: gpfs.ganesha.nfsd-40452[State_Async] nlm4_send_grant_msg :NLM :MAJ :GRANTED_MSG RPC call failed with return code -1. Removing the blocking lock
   
   The nfs-ganesha version is 2.5.3, although it is the IBM build, so I am not sure what changes it carries compared to upstream.
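   
   For completeness, the installed packages can be listed with something like the following (the exact package naming in the IBM build may differ):
   
   rpm -qa | grep -i ganesha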
   
   I was wondering if someone on the mailing list had an idea of what direction to take to investigate this further.
   
   Thanks,
   Ivano
   
   
   _______________________________________________
   Support mailing list -- support@lists.nfs-ganesha.org
   To unsubscribe send an email to support-leave@lists.nfs-ganesha.org