I changed the NFSv4 recovery backend from rados_ng to rados_cluster, as all five
nfs-ganesha processes will be up all the time.
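For reference, the relevant pieces of my ganesha.conf now look roughly like this (the
pool/namespace/nodeid values below are just examples, not my real ones):

NFSv4 {
    RecoveryBackend = rados_cluster;
}

RADOS_KV {
    # pool and namespace where the recovery/grace data is stored
    pool = "nfs-ganesha";
    namespace = "ganesha-namespace";
    # must be unique per nfs-ganesha instance
    nodeid = "nfs1";
}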
And I found my mistake: I thought rados_kv would also use the default namespace
"ganesha-namespace", but it uses NULL instead. So I was maintaining and dumping the
grace DB in the wrong location.
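In case someone else runs into this: the grace DB can be inspected with something like

ganesha-rados-grace --pool nfs-ganesha --ns ganesha-namespace dump

where --pool and --ns have to match whatever the recovery backend actually uses (the
values above are just examples).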
I played around with the nodeid parameter for some time, since I expected my configuration
error to be there. It would be extremely helpful if rados_ng and rados_cluster logged
something at startup like

nfs4_recovery_init :CLIENT ID :EVENT :rados_cluster init using rados://nfs-ganesha/grace with nodeid <nodeid>

That would have saved me a lot of time!
So NFSv4 recovery works fine over the 5 nodes.
NFSv3 "recovery" also works - EXCEPT (!!!):
I am mounting a cephfs .snap directory on NFS-root booted systems like this:
10.20.56.2:/vol/diskless/.snap/00225/debian10-amd64-srv on /run/initramfs/rofs type nfs
(ro,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,nocto,nolock,noacl,proto=tcp,port=2049,timeo=100,retrans=360,sec=sys,local_lock=all,addr=10.20.56.2)
As soon as I fail over the 10.20.56.2 IP to another nfs-ganesha node, I get a stale file
handle.
If I mount the RW image directly, like this:
10.20.56.2:/vol/diskless/debian10-amd64-srv on /run/initramfs/rofs type nfs
(ro,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=600,acregmax=600,acdirmin=600,acdirmax=600,hard,nocto,nolock,noacl,proto=tcp,port=2049,timeo=100,retrans=360,sec=sys,local_lock=all,addr=10.20.56.2)
the IPv4 takeover just works.
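For context, the export behind both mounts is roughly of this shape (heavily simplified;
Export_Id, Pseudo and the exact options are placeholders, not my real config):

EXPORT {
    Export_Id = 1;
    Path = /vol/diskless;
    Pseudo = /vol/diskless;
    Protocols = 3, 4;
    FSAL {
        Name = CEPH;
    }
}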
Is it possible to mount a Ceph .snap snapshot directory and survive an NFSv3 IP failover?
Thanks
Rainer