We have set up NFS Ganesha in Active/Active mode and have mounted CephFS from the VM using
the VIP. On top of this, we are testing IP failover and Ganesha switch-over.
To keep the failover time (the time for I/O to resume) to a minimum, we have found three
variables in Ceph:
1. session_timeout (minimum allowed value: 30, default: 60)
2. session_autoclose (minimum allowed value: 30, default: 300)
3. mds_cap_revoke_eviction_timeout (disabled by default; takes priority when enabled,
   i.e. set to a value above 0)
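
For reference, here is a minimal sketch of how these three settings could be applied. It only builds the command lines and checks the minimum of 30 mentioned above; it does not run anything. The filesystem name "cephfs" is a placeholder, and the command forms (`ceph fs set ...` for the two session settings, `ceph config set mds ...` for the eviction timeout) are what we believe applies to recent releases, so please check them against your Ceph version.

```python
# Sketch only: builds ceph CLI command lines, does not execute them.
# The minimum of 30 comes from the list above; "cephfs" is a placeholder name.
MIN_SESSION_VALUE = 30


def build_commands(fs_name, session_timeout, session_autoclose,
                   cap_revoke_eviction_timeout):
    """Return the ceph CLI invocations (as argv lists) for the three settings."""
    for name, value in (("session_timeout", session_timeout),
                        ("session_autoclose", session_autoclose)):
        if value < MIN_SESSION_VALUE:
            raise ValueError(f"{name} must be >= {MIN_SESSION_VALUE}, got {value}")
    if cap_revoke_eviction_timeout < 0:
        raise ValueError("mds_cap_revoke_eviction_timeout must be >= 0 (0 disables it)")
    return [
        # Assumed form: per-filesystem settings changed through "ceph fs set".
        ["ceph", "fs", "set", fs_name, "session_timeout", str(session_timeout)],
        ["ceph", "fs", "set", fs_name, "session_autoclose", str(session_autoclose)],
        # Assumed form: MDS option changed through the central config store.
        ["ceph", "config", "set", "mds", "mds_cap_revoke_eviction_timeout",
         str(cap_revoke_eviction_timeout)],
    ]


if __name__ == "__main__":
    for argv in build_commands("cephfs", 30, 30, 1):
        print(" ".join(argv))
```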
We tested with mds_cap_revoke_eviction_timeout set to 1 and achieved an I/O resume time of
7-9 seconds, but we are unsure about a value of 1 second and its impact in production: a
delay of 1 second, or a few seconds, could just as well be caused by network latency or by
something other than an NFS server failure.
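
To make the I/O resume figure reproducible, the stall can be approximated with a small write probe run on the client during a failover. This is a rough sketch, not the exact method we used: the function names are made up for illustration, and it assumes a hard NFS mount where fsync() blocks while the server is unreachable. The reported gap includes the sleep interval between writes.

```python
import os
import time


def longest_gap(timestamps):
    """Return the largest interval between consecutive timestamps (0.0 if fewer than two)."""
    return max((b - a for a, b in zip(timestamps, timestamps[1:])), default=0.0)


def probe_writes(path, duration_s, interval_s=0.1):
    """Append and fsync a small record repeatedly; return the completion time of each write."""
    done = []
    deadline = time.monotonic() + duration_s
    with open(path, "a") as f:
        while time.monotonic() < deadline:
            f.write("x\n")
            f.flush()
            os.fsync(f.fileno())  # assumed to block while the NFS server is unreachable
            done.append(time.monotonic())
            time.sleep(interval_s)
    return done
```

For example, calling probe_writes() on a file under the NFS mount for 60 seconds while triggering the switch-over, and then passing the result to longest_gap(), gives an approximate I/O stall in seconds.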
We would like to know the recommended values for the three variables listed above.
Note: we are not using any container-based setup to bring the NFS server down immediately;
we simply switch from one node to another when we detect a failure on one of the nodes.
-Best Regards,