I've been experimenting with an HA NFS configuration using pacemaker and nfs-ganesha. I've noticed that after a failover event it takes about five minutes for clients to recover, and that seems to be independent of the Lease_Lifetime and Grace_Period settings. Client recovery also doesn't seem to correspond to the ":NFS Server Now NOT IN GRACE" message in ganesha.log. Is this normal behavior?
It's not normal.
IIRC, servers notify the clients that they are in NFS_GRACE. On top of that, any NFSv4 client that attempts I/O while the server is in grace will receive NFS4ERR_GRACE, so a client that somehow missed the initial notification would still discover it when attempting I/O.
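With a Grace_Period of 20 seconds, clients should be retrying and resuming I/O well under a minute after the new node comes up, so five minutes points at something other than grace handling. One way to pin down the actual client-side recovery time, independent of what ganesha.log says, is to poll the mount while you trigger a failover. A rough sketch, assuming the export is mounted at /mnt/data on a test client (adjust the path to yours):

# Print a timestamp every second until the mount answers again.
# timeout -s KILL guards against the listing hanging on a hard mount.
while ! timeout -s KILL 5 ls /mnt/data >/dev/null 2>&1; do
    date
    sleep 1
done
echo "mount responsive again at: $(date)"

Comparing that timestamp with the time of the failover and with the grace messages in ganesha.log should tell you whether the delay is on the server side or on the client/network side.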
I see you're using pacemaker with CephFS (FSAL_CEPH). It's my understanding that Ceph's HA solution for ganesha is built on top of kubernetes rather than pacemaker.
I don't have any experience with ganesha in this situation (or with k8s). Asking Jeff Layton is probably your best option.
My pacemaker configuration looks like:
Full List of Resources:
  * Resource Group: nfs:
    * nfsd    (systemd:nfs-ganesha):    Started nfs2.storage
    * nfs_vip (ocf::heartbeat:IPaddr2): Started nfs2.storage
I guess this is meant to be an active/passive setup?
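For reference, a two-resource active/passive group like that is usually put together with pcs along these lines; the address, netmask, and monitor interval below are placeholders, not values taken from this cluster:

# Create the ganesha service and the floating IP, then group them.
# A group implies colocation and ordering, so the VIP follows ganesha
# and only comes up on the node where ganesha is running.
pcs resource create nfsd systemd:nfs-ganesha op monitor interval=10s
pcs resource create nfs_vip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24 op monitor interval=10s
pcs resource group add nfs nfsd nfs_vip

Because the group starts its members in order, the VIP doesn't appear on the standby node until ganesha is already up there, so clients can't reach the address before the server can answer.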
And the ganesha configuration looks like:
NFS_CORE_PARAM
{
    Enable_NLM = false;
    Enable_RQUOTA = false;
    Protocols = 4;
}

NFSv4
{
    RecoveryBackend = rados_ng;
    Minor_Versions = 1,2;
    # From https://www.suse.com/support/kb/doc/?id=000019374
    Lease_Lifetime = 10;
    Grace_Period = 20;
}

MDCACHE {
    # Size the dirent cache down as small as possible.
    Dir_Chunk = 0;
}

EXPORT
{
    Export_ID = 100;
    Protocols = 4;
    Transports = TCP;
    Path = /;
    Pseudo = /data;
    Access_Type = RW;
    Attr_Expiration_Time = 0;
    Squash = none;

    FSAL {
        Name = CEPH;
        Filesystem = "tank";
        User_Id = "nfs";
    }
}

RADOS_KV
{
    UserId = "nfsmeta";
    pool = "cephfs.tank.meta";
    namespace = "ganesha";
}
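Since rados_ng keeps its recovery database in the pool and namespace configured in RADOS_KV, one quick sanity check is to confirm that recovery objects are actually being written there. Something along these lines, reusing the UserId from the config (exact object names vary by ganesha version, so just look for anything at all):

# List the objects the recovery backend has stored in the configured
# pool/namespace; requires the nfsmeta keyring to be readable locally.
rados --id nfsmeta -p cephfs.tank.meta -N ganesha ls

If nothing ever shows up there, the recovery backend isn't persisting client state, and reclaim after a failover can't work as intended.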