Hi all,
We are running nfs-ganesha as part of an IBM Spectrum Scale protocol setup, currently
consisting of 2 servers and around 50 clients.
After a couple of months of smooth running, we have started to hit a critical issue
(already twice in three days): none of the clients is able to mount the filesystem.
When the issue happens, this is what we see via rpcinfo on the server:
[root@server ~]# rpcinfo -s
   program  version(s)  netid(s)                 service     owner
   100000   2,3,4       local,udp,tcp,udp6,tcp6  portmapper  superuser
   100024   1           tcp6,udp6,tcp,udp        status      29
   100003   4,3         tcp6,tcp,udp6,udp        nfs         superuser
   100005   3,1         tcp6,tcp,udp6,udp        mountd      superuser
   100021   4           tcp6,tcp,udp6,udp        nlockmgr    superuser
   100011   2,1         tcp6,tcp,udp6,udp        rquotad     superuser
[root@host ~]# rpcinfo -T tcp localhost 100003 3
rpcinfo: RPC: Timed out
The log level is set to EVENT. When the issue starts, ganesha.log fills up with lines like
the following:
2018-10-04 16:40:45 : epoch 0002003d : server : gpfs.ganesha.nfsd-40452[State_Async]
nlm_send_async :NLM :MAJ :Cannot create NLM async tcp connection to client
::ffff:129.129.117.65
2018-10-04 16:40:45 : epoch 0002003d : server: gpfs.ganesha.nfsd-40452[State_Async]
nlm4_send_grant_msg :NLM :MAJ :GRANTED_MSG RPC call failed with return code -1. Removing
the blocking lock
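
To see whether these failures are concentrated on a few clients, the messages can be tallied per client address. The following is only a minimal sketch: it assumes that each entry occupies a single line in the real log file (the excerpts above are wrapped by the mail client) and that the client address is the last field of the line. The function name count_nlm_failures and the log path in the usage comment are illustrative assumptions, not part of the Spectrum Scale tooling.

```shell
# Count "Cannot create NLM async tcp connection" messages per client.
# Reads log lines on stdin; prints "<count> <client address>", busiest first.
count_nlm_failures() {
    awk '/Cannot create NLM async tcp connection to client/ { count[$NF]++ }
         END { for (c in count) print count[c], c }' \
        | sort -rn
}

# Usage (path is an assumption; adjust to where ganesha.log lives):
# count_nlm_failures < /var/log/ganesha.log
```

If a single client dominates the counts, that client would be the first place to check for a blocked or unreachable NLM callback port.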
The nfs-ganesha version is 2.5.3, but it is the IBM build, so I am not sure how it differs
from the upstream release.
Does anyone on the mailing list have an idea of which direction to take to investigate
this further?
Thanks,
Ivano