HA failover in less than five minutes?

Possible to run NFSv4 server as...

IRC channel moving to...

lars＠redhat.com

Friday, 16 July 2021 Fri, 16 Jul '21

8:25 p.m.

I've been experimenting with an HA NFS configuration using pacemaker and nfs-ganesha. I've noticed that after a failover event, it takes about five minutes for clients to recover, and that seems to be independent of the settings of Lease_lifetime and Grace_period. Client recovery also doesn't seem to correspond to the ":NFS Server Now NOT IN GRACE " message in the ganesha.log. Is this normal behavior? My pacemaker configuration looks like: Full List of Resources: * Resource Group: nfs: * nfsd (systemd:nfs-ganesha): Started nfs2.storage * nfs_vip (ocf::heartbeat:IPaddr2): Started nfs2.storage And the ganesha configuration looks like: NFS_CORE_PARAM { Enable_NLM = false; Enable_RQUOTA = false; Protocols = 4; } NFSv4 { RecoveryBackend = rados_ng; Minor_Versions = 1,2; # From https://www.suse.com/support/kb/doc/?id=000019374 Lease_Lifetime = 10; Grace_Period = 20; } MDCACHE { # Size the dirent cache down as small as possible. Dir_Chunk = 0; } EXPORT { Export_ID=100; Protocols = 4; Transports = TCP; Path = /; Pseudo = /data; Access_Type = RW; Attr_Expiration_Time = 0; Squash = none; FSAL { Name = CEPH; Filesystem = "tank"; User_Id = "nfs"; } } RADOS_KV { UserId = "nfsmeta"; pool = "cephfs.tank.meta"; namespace = "ganesha"; }

Show replies by date

Kaleb Keithley

Monday, 19 July Mon, 19 Jul

7:58 a.m.

On Fri, Jul 16, 2021 at 9:25 PM <lars(a)redhat.com> wrote:

...

It's not normal. IIRC, servers notify the clients that they are in NFS_GRACE. On top of that any NFSv4 client that attempts I/O while the servers are in grace will receive NFS4ERR_GRACE, so if a client somehow missed the initial notification they would discover it when attempting I/O. I see you're using pacemaker with CephFS (FSAL_CEPH). It's my understanding that Ceph's HA solution for ganesha is built on top of kubernetes, not with pacemaker. I don't have any experience with ganesha in this situation (or with k8s.) Asking Jeff Layton is probably your best option.

...

My pacemaker configuration looks like: Full List of Resources: * Resource Group: nfs: * nfsd (systemd:nfs-ganesha): Started nfs2.storage * nfs_vip (ocf::heartbeat:IPaddr2): Started nfs2.storage

I guess this is meant to be an active/passive setup?

...

And the ganesha configuration looks like: NFS_CORE_PARAM { Enable_NLM = false; Enable_RQUOTA = false; Protocols = 4; } NFSv4 { RecoveryBackend = rados_ng; Minor_Versions = 1,2; # From https://www.suse.com/support/kb/doc/?id=000019374 Lease_Lifetime = 10; Grace_Period = 20; } MDCACHE { # Size the dirent cache down as small as possible. Dir_Chunk = 0; } EXPORT { Export_ID=100; Protocols = 4; Transports = TCP; Path = /; Pseudo = /data; Access_Type = RW; Attr_Expiration_Time = 0; Squash = none; FSAL { Name = CEPH; Filesystem = "tank"; User_Id = "nfs"; } } RADOS_KV { UserId = "nfsmeta"; pool = "cephfs.tank.meta"; namespace = "ganesha"; }

Jeff Layton

8:37 a.m.

On Mon, 2021-07-19 at 08:58 -0400, Kaleb Keithley wrote:

...

On Fri, Jul 16, 2021 at 9:25 PM <lars(a)redhat.com> wrote: > I've been experimenting with an HA NFS configuration using pacemaker and > nfs-ganesha. I've noticed that after a failover event, it takes about five > minutes for clients to recover, and that seems to be independent of the > settings of Lease_lifetime and Grace_period. Client recovery also doesn't > seem to correspond to the ":NFS Server Now NOT IN GRACE " message in the > ganesha.log. Is this normal behavior? > It's not normal. IIRC, servers notify the clients that they are in NFS_GRACE. On top of that any NFSv4 client that attempts I/O while the servers are in grace will receive NFS4ERR_GRACE, so if a client somehow missed the initial notification they would discover it when attempting I/O. I see you're using pacemaker with CephFS (FSAL_CEPH). It's my understanding that Ceph's HA solution for ganesha is built on top of kubernetes, not with pacemaker. I don't have any experience with ganesha in this situation (or with k8s.) Asking Jeff Layton is probably your best option.

It was developed using k8s, but in principle you should be able to use the rados_cluster recovery backend with pacemaker too. What version of ganesha and ceph cluster are you using? What OS is all of this running on?

...

> My pacemaker configuration looks like: > > Full List of Resources: > * Resource Group: nfs: > * nfsd (systemd:nfs-ganesha): Started nfs2.storage > * nfs_vip (ocf::heartbeat:IPaddr2): Started nfs2.storage I guess this is meant to be an active/passive setup? > And the ganesha configuration looks like: > > NFS_CORE_PARAM > { > Enable_NLM = false; > Enable_RQUOTA = false; > Protocols = 4; > } > > NFSv4 > { > RecoveryBackend = rados_ng; > Minor_Versions = 1,2; > > # From https://www.suse.com/support/kb/doc/?id=000019374 > Lease_Lifetime = 10;

???

...

> Grace_Period = 20;

??? This looks like a terrible idea. That gives you a lot less breathing room when things truly do go wrong. This is almost certainly just papering over a real problem.

...

> } > > MDCACHE { > # Size the dirent cache down as small as possible. > Dir_Chunk = 0; > } > > EXPORT > { > Export_ID=100; > Protocols = 4; > Transports = TCP; > Path = /; > Pseudo = /data; > Access_Type = RW; > Attr_Expiration_Time = 0; > Squash = none; > > FSAL { > Name = CEPH; > Filesystem = "tank"; > User_Id = "nfs"; > } > } > > RADOS_KV > { > UserId = "nfsmeta"; > pool = "cephfs.tank.meta"; > namespace = "ganesha"; > } > >

A 5 min timeout suggests that it's the Ceph MDS that's timing out state from the old client, and that sounds like something is not working right. Looking further...I think there may be a problem with FSAL_CEPH, the rados_ng recovery backend and active/passive deployments: When ganesha starts up, FSAL_CEPH tells the Ceph MDS to drop any state previously held by its "nodeid". This lets the ceph client (ganesha) get on with the business of reclaim as soon as it starts up. The MDS will give up on waiting for this eventually, but it takes...5 mins. The nodeid defaults to the system hostname (a'la gethostname()). In most pacemaker-style clusters, the hostnames of the different physical nodes are different, so that request to drop old state is probably not working right here. With the rados_cluster backend, you can also configure a nodeid to use. If you set up a trivial rados_cluster with one (active/passive) host, I expect it'll work more like you want. So, I'd set this in the NFSv4 block: RecoveryBackend = rados_cluster; ...and also configure both servers in your active/passive cluster with this in the RADOS_KV block: nodeid = some_nodename; You'll also need to "ganesha-rados-grace add" the nodeid that you put in that block. Longer term, maybe we should allow rados_ng and rados_kv to set a static nodeid too? That would also probably fix this for people who just want a simple active/passive cluster. Basically we'd just need to add a get_nodeid operation for them. -- Jeff Layton <jlayton(a)redhat.com>

Lars Kellogg-Stedman

9:13 a.m.

On Mon, Jul 19, 2021 at 09:37:43AM -0400, Jeff Layton wrote:

...

What version of ganesha and ceph cluster are you using? What OS is all of this running on?

The NFS server is Fedora 34, so that's nfs-ganesha-3.5-7.fc34.x86_64. The Ceph "cluster" (aka single node test environment) was deployed using cephadm from F24 and appears to be 16.2.5.

...

This looks like a terrible idea...

And more than that, as I mentioned, it didn't have any impact on the failover time, so there are two reasons to remove it :).

...

With the rados_cluster backend, you can also configure a nodeid to use. If you set up a trivial rados_cluster with one (active/passive) host, I expect it'll work more like you want.

I'll give that a shot and see what happens. -- Lars Kellogg-Stedman <lars(a)redhat.com> | larsks @ {irc,twitter,github} http://blog.oddbit.com/ | N1LKS

Lars Kellogg-Stedman

9:37 a.m.

On Mon, Jul 19, 2021 at 09:37:43AM -0400, Jeff Layton wrote:

...

With the rados_cluster backend, you can also configure a nodeid to use. If you set up a trivial rados_cluster with one (active/passive) host, I expect it'll work more like you want. So, I'd set this in the NFSv4 block:

Using `rados_cluster` and sharing a `nodeid`... NFSv4 { RecoveryBackend = rados_cluster; Minor_Versions = 1,2; Grace_Period = 60; } [...] RADOS_KV { UserId = "nfsmeta"; pool = "cephfs.tank.meta"; namespace = "ganesha"; nodeid = "nfs.storage"; } ...seems to have resolved the problem; failover times now seem to track the `Grace_Period` setting.

...

Longer term, maybe we should allow rados_ng and rados_kv to set a static nodeid too? That would also probably fix this for people who just want a simple active/passive cluster. Basically we'd just need to add a get_nodeid operation for them.

Was I misusing `rados_ng`? All the docs I found said "use rados_ng for active/passive configurations". -- Lars Kellogg-Stedman <lars(a)redhat.com> | larsks @ {irc,twitter,github} http://blog.oddbit.com/ | N1LKS

Jeff Layton

Tuesday, 20 July Tue, 20 Jul

8:55 a.m.

On Mon, 2021-07-19 at 10:37 -0400, Lars Kellogg-Stedman wrote:

...

On Mon, Jul 19, 2021 at 09:37:43AM -0400, Jeff Layton wrote: > With the rados_cluster backend, you can also configure a nodeid to use. > If you set up a trivial rados_cluster with one (active/passive) host, I > expect it'll work more like you want. So, I'd set this in the NFSv4 > block: Using `rados_cluster` and sharing a `nodeid`... NFSv4 { RecoveryBackend = rados_cluster; Minor_Versions = 1,2; Grace_Period = 60; } [...] RADOS_KV { UserId = "nfsmeta"; pool = "cephfs.tank.meta"; namespace = "ganesha"; nodeid = "nfs.storage"; } ...seems to have resolved the problem; failover times now seem to track the `Grace_Period` setting.

Excellent! Thanks for confirming it.

...

> Longer term, maybe we should allow rados_ng and rados_kv to set a static > nodeid too? That would also probably fix this for people who just want a > simple active/passive cluster. Basically we'd just need to add a > get_nodeid operation for them. Was I misusing `rados_ng`? All the docs I found said "use rados_ng for active/passive configurations".

Nope, that was an oversight on my part in implementing the RADOS_KV "nodeid" setting. At the time I added it, I was focused on rados_cluster and just didn't consider that rados_ng in an active/passive cluster would also need a similar setting. I think what we need to do is make rados_ng (and maybe rados_kv) also respect the nodeid setting. The bottom line is that when you take over serving an nfs over cephfs "service" from another entity, you need to use the same nodeid that it used before so that the MDS releases things properly. I'll see about rolling some patches and updating the docs. Thanks! -- Jeff Layton <jlayton(a)redhat.com>

Jeff Layton

10:02 a.m.

On Tue, 2021-07-20 at 09:55 -0400, Jeff Layton wrote:

...

On Mon, 2021-07-19 at 10:37 -0400, Lars Kellogg-Stedman wrote: > On Mon, Jul 19, 2021 at 09:37:43AM -0400, Jeff Layton wrote: > > With the rados_cluster backend, you can also configure a nodeid to use. > > If you set up a trivial rados_cluster with one (active/passive) host, I > > expect it'll work more like you want. So, I'd set this in the NFSv4 > > block: > > Using `rados_cluster` and sharing a `nodeid`... > > NFSv4 > { > RecoveryBackend = rados_cluster; > Minor_Versions = 1,2; > Grace_Period = 60; > } > > [...] > > RADOS_KV > { > UserId = "nfsmeta"; > pool = "cephfs.tank.meta"; > namespace = "ganesha"; > nodeid = "nfs.storage"; > } > > ...seems to have resolved the problem; failover times now seem to > track the `Grace_Period` setting. > Excellent! Thanks for confirming it. > > Longer term, maybe we should allow rados_ng and rados_kv to set a static > > nodeid too? That would also probably fix this for people who just want a > > simple active/passive cluster. Basically we'd just need to add a > > get_nodeid operation for them. > > Was I misusing `rados_ng`? All the docs I found said "use rados_ng for > active/passive configurations". > Nope, that was an oversight on my part in implementing the RADOS_KV "nodeid" setting. At the time I added it, I was focused on rados_cluster and just didn't consider that rados_ng in an active/passive cluster would also need a similar setting. I think what we need to do is make rados_ng (and maybe rados_kv) also respect the nodeid setting. The bottom line is that when you take over serving an nfs over cephfs "service" from another entity, you need to use the same nodeid that it used before so that the MDS releases things properly. I'll see about rolling some patches and updating the docs. Thanks!

It was pretty simple to do, though I haven't done any real testing with it so far. If you feel inclined to try it out, here's the patch: https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/521021 With that you should be able to use rados_ng, as long as you ensure that you set the same nodeid on both hosts. That said, there is nothing wrong with using rados_cluster in this configuration either, and that should allow you to eventually scale-out to multiple active nodes seamlessly if you choose. I sort of wonder if we ought to phase-out rados_ng altogether and just clearly document how to use rados_cluster in an active/passive configuration. -- Jeff Layton <jlayton(a)redhat.com>

1598

days inactive

1601

days old

support@lists.nfs-ganesha.org

Manage subscription

6 comments

4 participants

tags (0)

participants (4)

Jeff Layton
Kaleb Keithley
Lars Kellogg-Stedman
lars＠redhat.com

2025

2024

2023

2022

2021

2020

2019

2018

HA failover in less than five minutes?