On Mon, 2021-07-19 at 08:58 -0400, Kaleb Keithley wrote:
On Fri, Jul 16, 2021 at 9:25 PM <lars(a)redhat.com> wrote:
> I've been experimenting with an HA NFS configuration using pacemaker and
> nfs-ganesha. I've noticed that after a failover event, it takes about five
> minutes for clients to recover, and that seems to be independent of the
> settings of Lease_Lifetime and Grace_Period. Client recovery also doesn't
> seem to correspond to the ":NFS Server Now NOT IN GRACE" message in the
> ganesha.log. Is this normal behavior?
>
It's not normal.
IIRC, servers notify the clients when they are in grace. On top of that,
any NFSv4 client that attempts I/O while the server is in grace will
receive NFS4ERR_GRACE, so even a client that somehow missed the initial
notification would discover it when attempting I/O.
I see you're using pacemaker with CephFS (FSAL_CEPH). It's my
understanding that Ceph's HA solution for ganesha is built on top of
kubernetes, not pacemaker.
I don't have any experience with ganesha in this situation (or with k8s).
Asking Jeff Layton is probably your best option.
It was developed using k8s, but in principle you should be able to use
the rados_cluster recovery backend with pacemaker too.
What versions of ganesha and ceph are you using? What OS is all of this
running on?
> My pacemaker configuration looks like:
>
> Full List of Resources:
> * Resource Group: nfs:
> * nfsd (systemd:nfs-ganesha): Started nfs2.storage
> * nfs_vip (ocf::heartbeat:IPaddr2): Started nfs2.storage
I guess this is meant to be an active/passive setup?
> And the ganesha configuration looks like:
>
> NFS_CORE_PARAM
> {
> Enable_NLM = false;
> Enable_RQUOTA = false;
> Protocols = 4;
> }
>
> NFSv4
> {
> RecoveryBackend = rados_ng;
> Minor_Versions = 1,2;
>
> # From https://www.suse.com/support/kb/doc/?id=000019374
> Lease_Lifetime = 10;
???
> Grace_Period = 20;
???
This looks like a terrible idea. That gives you a lot less breathing
room when things truly do go wrong. This is almost certainly just
papering over a real problem.
> }
>
> MDCACHE {
> # Size the dirent cache down as small as possible.
> Dir_Chunk = 0;
> }
>
> EXPORT
> {
> Export_ID=100;
> Protocols = 4;
> Transports = TCP;
> Path = /;
> Pseudo = /data;
> Access_Type = RW;
> Attr_Expiration_Time = 0;
> Squash = none;
>
> FSAL {
> Name = CEPH;
> Filesystem = "tank";
> User_Id = "nfs";
> }
> }
>
> RADOS_KV
> {
> UserId = "nfsmeta";
> pool = "cephfs.tank.meta";
> namespace = "ganesha";
> }
>
>
A 5 min timeout suggests that it's the Ceph MDS that's timing out state
from the old client, and that sounds like something is not working
right.
Looking further... I think there may be a problem with FSAL_CEPH, the
rados_ng recovery backend, and active/passive deployments:
When ganesha starts up, FSAL_CEPH tells the Ceph MDS to drop any state
previously held by its "nodeid". This lets the ceph client (ganesha) get
on with the business of reclaim as soon as it starts up. The MDS will
give up on waiting for this eventually, but it takes...5 mins.
The nodeid defaults to the system hostname (i.e., whatever gethostname() returns). In most
pacemaker-style clusters, the hostnames of the different physical nodes
are different, so that request to drop old state is probably not working
right here.
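If you want to confirm that theory, you can watch the MDS's client sessions
across a failover (commands from memory, so double-check the exact mds
target syntax on your release); the stale session from the old ganesha
instance should linger until the MDS times it out:

    # list the client sessions the MDS currently holds -- the ganesha
    # instances show up here with their hostname/nodeid metadata
    ceph tell mds.0 session ls

    # how long the MDS waits before dropping a stale session;
    # the default is 300 seconds, i.e. your 5 minutes
    ceph config get mds mds_session_autoclose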
With the rados_cluster backend, you can also configure a nodeid to use.
If you set up a trivial rados_cluster with one (active/passive) host, I
expect it'll work more like you want. So, I'd set this in the NFSv4
block:
RecoveryBackend = rados_cluster;
...and also configure both servers in your active/passive cluster with
this in the RADOS_KV block:
nodeid = some_nodename;
You'll also need to "ganesha-rados-grace add" the nodeid that you put in
that block.
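Putting that together, something like this on both nodes should do it
(untested sketch -- "nfs-ha" is just a placeholder nodeid, and the pool/user
settings are copied from your existing RADOS_KV block):

    NFSv4
    {
            RecoveryBackend = rados_cluster;
            Minor_Versions = 1,2;
    }

    RADOS_KV
    {
            UserId = "nfsmeta";
            pool = "cephfs.tank.meta";
            namespace = "ganesha";
            # same value on both nodes, so the standby reclaims the
            # active node's state after a failover
            nodeid = "nfs-ha";
    }

...and then register that nodeid in the grace db once, e.g.:

    ganesha-rados-grace --userid nfsmeta --pool cephfs.tank.meta --ns ganesha add nfs-ha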
Longer term, maybe we should allow rados_ng and rados_kv to set a static
nodeid too? That would also probably fix this for people who just want a
simple active/passive cluster. Basically we'd just need to add a
get_nodeid operation for them.
--
Jeff Layton <jlayton(a)redhat.com>