Hello Ivano,
please open a case with IBM support.

We will need to look at more data here - a gpfs/ces/nfs snap and the exact versions involved.

I would be curious to see your authentication config. We have seen a similar issue with a large customer - unrelated to Ganesha!
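
Something along these lines should capture most of it on the CES nodes (the exact gpfs.snap behaviour and package names vary by Scale release, so treat this as a sketch):

# full snap from a protocol node - recent releases include CES/Ganesha data
gpfs.snap

# exact package levels
rpm -qa | grep -Ei 'gpfs|ganesha'

# current CES authentication configuration
mmuserauth service list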

Michael
Mit freundlichen Grüßen / with best regards
Michael Diederich
IBM Systems Group
Spectrum Scale Software Development

IBM Deutschland Research & Development GmbH
Chairwoman of the Supervisory Board: Martina Koederitz
Managing Director: Dirk Wittkopp
Registered office: Böblingen
Registration court: Amtsgericht Stuttgart, HRB 243294
mail: michael.diederich@de.ibm.com
fon: +49-7034-274-4062
address: Am Weiher 24, D-65451 Kelsterbach





From:        "Talamo Ivano Giuseppe (PSI)" <Ivano.Talamo@psi.ch>
To:        Michael Diederich <diederich@de.ibm.com>
Cc:        "support@lists.nfs-ganesha.org" <support@lists.nfs-ganesha.org>
Date:        10/17/2018 10:37 AM
Subject:        Re: [NFS-Ganesha-Support] unavailability of NFSv3




Dear Michael,

We hit the issue again and I was able to collect some data.

First of all, I observed an extremely high number of connections in CLOSE_WAIT between Ganesha and one client, always with 29 bytes in the Recv-Q buffer.
There are about 11k lines like the following in the netstat output:

tcp6      29      0 server:38534 client:38610 CLOSE_WAIT
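
Something along these lines groups them per remote peer (assuming the usual netstat -tan layout, with the foreign address in column 5 and the state in column 6):

netstat -tan | awk '$6 == "CLOSE_WAIT" {count[$5]++} END {for (peer in count) print count[peer], peer}'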

The client is always the same one,
and 38610 is the port on the client where NLM is running:

[root@server ~]# rpcinfo -p client
  program vers proto   port  service
   100000    4   tcp    111  portmapper
   100000    3   tcp    111  portmapper
   100000    2   tcp    111  portmapper
   100000    4   udp    111  portmapper
   100000    3   udp    111  portmapper
   100000    2   udp    111  portmapper
   100024    1   udp  44714  status
   100024    1   tcp  56631  status
   100021    1   udp  35435  nlockmgr
   100021    3   udp  35435  nlockmgr
   100021    4   udp  35435  nlockmgr
   100021    1   tcp  38610  nlockmgr
   100021    3   tcp  38610  nlockmgr
   100021    4   tcp  38610  nlockmgr


NLM itself seems to reply fine:

[root@server ~]# rpcinfo -T tcp client 100021 3
program 100021 version 3 ready and waiting


I also collected the kernel stacks from /proc/<PID-of-Ganesha-process>/task/* (a loop for doing that is sketched after the traces below).
There is no trace of writev and it all looks fine to me; most of the tasks (274 out of 283) show a stack like the following:

[<ffffffff810f7016>] futex_wait_queue_me+0xc6/0x130
[<ffffffff810f7cdb>] futex_wait+0x17b/0x280
[<ffffffff810f9a16>] do_futex+0x106/0x5a0
[<ffffffff810f9f30>] SyS_futex+0x80/0x180
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

There are a couple like:
[<ffffffffc0a18cd1>] cxiWaitEventWait+0x1d1/0x2f0 [mmfslinux]
[<ffffffffc0b92a10>] _ZN6ThCond12internalWaitEP16KernelSynchStatejPv+0xd0/0x260 [mmfs26]
[<ffffffffc0b93d0e>] _ZN6ThCond18kInterruptibleWaitEiPKc+0x1de/0x3d0 [mmfs26]
[<ffffffffc0b42e94>] _Z17gpfsGaneshaUpdateP13gpfsVfsData_tPiS1_P10cxiVattr_tP5glockS1_PjS6_S6_iS6_+0x264/0x7d0 [mmfs26]
[<ffffffffc0a20ba2>] gpfs_wait_inode_update+0x1a2/0x740 [mmfslinux]
[<ffffffffc0a211aa>] get_inode_update+0x6a/0x90 [mmfslinux]
[<ffffffffc0a28eb2>] kxGanesha+0x5a2/0x3990 [mmfslinux]
[<ffffffffc0c01f17>] _Z8ss_ioctljm+0x677/0x1c00 [mmfs26]
[<ffffffffc0a097e1>] ss_fs_unlocked_ioctl+0xf1/0x530 [mmfslinux]
[<ffffffff8121730d>] do_vfs_ioctl+0x33d/0x540
[<ffffffff812175b1>] SyS_ioctl+0xa1/0xc0
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

a couple like:
[<ffffffff81217ff5>] poll_schedule_timeout+0x55/0xb0
[<ffffffff8121957d>] do_sys_poll+0x4cd/0x580
[<ffffffff81219734>] SyS_poll+0x74/0x110
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

and 5 like:
[<ffffffff8124dc9e>] ep_poll+0x23e/0x360
[<ffffffff8124f12d>] SyS_epoll_wait+0xed/0x120
[<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
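
For reference, the per-task stacks can be gathered with a loop roughly like the following, assuming a single gpfs.ganesha.nfsd process and root access to /proc/<pid>/task/*/stack (the output file name is just an example):

for t in /proc/$(pidof gpfs.ganesha.nfsd)/task/*; do echo "=== $t"; cat $t/stack; done > ganesha-stacks.txt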


When the issue started I also ran ganesha_mgr set_log COMPONENT_ALL FULL_DEBUG for a few minutes
and collected about half a gigabyte of logs, in case they can be used for further investigation.
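
For reference, raising and then restoring the log level is just:

ganesha_mgr set_log COMPONENT_ALL FULL_DEBUG
ganesha_mgr set_log COMPONENT_ALL EVENT

(EVENT being our normal level, as mentioned in the original mail below.)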

Thanks,
Ivano

On 09/10/18 15:09, "Michael Diederich" <diederich@de.ibm.com> wrote:

   Ivano,
   I am sure we are working on your ticket :-)
   
   Have a look at the sum of your TCP Send-Q bytes (netstat output) and compare that to the tcp_wmem setting (sysctl).
   
   It is possible you have a client that is not acking the data it requested...
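   
   Something along these lines gives the numbers (assuming the usual netstat -tan layout, where Send-Q is the third column):
   
   netstat -tan | awk 'NR>2 {sum += $3} END {print "total Send-Q bytes:", sum}'
   sysctl net.ipv4.tcp_wmem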
   
   Michael
   
   
   
   
   From:        "Talamo Ivano Giuseppe (PSI)" <Ivano.Talamo@psi.ch>
   To:        "support@lists.nfs-ganesha.org" <support@lists.nfs-ganesha.org>
   Date:        10/08/2018 03:04 PM
   Subject:        [NFS-Ganesha-Support] unavailability of NFSv3
   ________________________________________
   
   
   
   Hi all,
   
   We are using nfs-ganesha via the IBM Spectrum Scale protocol setup, currently consisting of 2 servers and around 50 clients.
   After a couple of months of smooth operation, we started to experience (already twice in three days) a critical issue in which all clients
   become unable to mount the filesystem.
   
   When the issue happens this is what we see via rpcinfo on the server:
   
   [root@server ~]# rpcinfo -s
     program version(s) netid(s)                         service     owner
      100000  2,3,4     local,udp,tcp,udp6,tcp6          portmapper  superuser
      100024  1         tcp6,udp6,tcp,udp                status      29
      100003  4,3       tcp6,tcp,udp6,udp                nfs         superuser
      100005  3,1       tcp6,tcp,udp6,udp                mountd      superuser
      100021  4         tcp6,tcp,udp6,udp                nlockmgr    superuser
      100011  2,1       tcp6,tcp,udp6,udp                rquotad     superuser
   [root@host ~]# rpcinfo -T tcp localhost 100003 3
   rpcinfo: RPC: Timed out
   
   
   The log level is set to EVENT, and when the issue starts ganesha.log fills up with lines like the following:
   
   2018-10-04 16:40:45 : epoch 0002003d : server : gpfs.ganesha.nfsd-40452[State_Async] nlm_send_async :NLM :MAJ :Cannot create NLM async tcp connection to client ::ffff:129.129.117.65
   2018-10-04 16:40:45 : epoch 0002003d : server: gpfs.ganesha.nfsd-40452[State_Async] nlm4_send_grant_msg :NLM :MAJ :GRANTED_MSG RPC call failed with return code -1. Removing the blocking lock
   
   The nfs-ganesha version is 2.5.3, although it is the IBM build, so I am not sure what changes it carries compared to upstream.
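   
   For completeness, the installed packages can be listed with something like the following (the exact package naming in the IBM build may differ):
   
   rpm -qa | grep -i ganesha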
   
   I was wondering if someone on the mailing list had an idea of what direction to take to investigate this further.
   
   Thanks,
   Ivano
   
   
   _______________________________________________
   Support mailing list -- support@lists.nfs-ganesha.org
   To unsubscribe send an email to support-leave@lists.nfs-ganesha.org