On Fri, 2018-06-08 at 10:22 -0400, J. Bruce Fields wrote:
> On Wed, May 23, 2018 at 08:21:40AM -0400, Jeff Layton wrote:
> > +Lifting the Grace Period
> > +------------------------
> > +Transitioning from recovery to normal operation really consists of two
> > +different steps:
> > +
> > +1. the server decides that it no longer requires a grace period, either
> > +   because the grace period has timed out or because no clients remain
> > +   that would be allowed to reclaim.
> > +
> > +2. the server stops enforcing the grace period and transitions to
> > +   normal operation.
> > +
> > +These concepts are often conflated for singleton servers, but in
> > +a cluster we must consider them independently.
> > +
> > +When a server is finished with its own local recovery period, it should
> > +clear its NEED flag. That server should, however, continue to enforce
> > +the grace period until it is fully lifted.
> > +
> > +If the server's own NEED flag is the last one set, then it can lift the
> > +grace period (by setting R=0). At that point, all servers in the cluster
> > +can end grace period enforcement and communicate that fact to the
> > +others by clearing their ENFORCING flags.
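
To make the flag choreography above concrete, here's a minimal sketch
in C. The grace_db structure, MAX_NODES, and modeling the shared state
as plain booleans are simplifications of mine, not actual ganesha or
knfsd code; in practice each of these updates has to be an atomic
read-modify-write cycle against the shared grace database:

#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 8

/* Toy model of the shared cluster-wide grace state. R == 0 means no
 * grace period is in force; a nonzero R is the epoch from which
 * clients may reclaim. */
struct grace_db {
	uint64_t C;                /* current epoch */
	uint64_t R;                /* recovery epoch; 0 == grace lifted */
	bool need[MAX_NODES];      /* node may still require grace */
	bool enforcing[MAX_NODES]; /* node is blocking non-reclaim activity */
};

/* Step 1: this node no longer requires the grace period. */
static void clear_need(struct grace_db *db, int me)
{
	bool last = true;
	int i;

	db->need[me] = false;
	for (i = 0; i < MAX_NODES; i++)
		if (db->need[i])
			last = false;

	if (last)
		db->R = 0;	/* last one out lifts the grace period */
}

/* Step 2: each node stops enforcing once it sees grace lifted. */
static void maybe_stop_enforcing(struct grace_db *db, int me)
{
	if (db->R == 0)
		db->enforcing[me] = false;
}
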
> I think this also needs to describe the ordering of the recovery
> database switch and the epoch increment in the clustered case.
>
> For "surviving" servers it doesn't matter, since their recovery
> database isn't changing.
>
> For restarting servers, there's a window between clearing NEED and
> clearing ENFORCING when their recovery database can't change.
>
> The epoch mustn't change till everybody's created a new recovery
> database. It must change before anyone grants a new non-reclaim lock,
> because at that point it's no longer safe to use the older recovery
> databases. (That could result in allowing a reclaim from a client
> which conflicts with the new lock.)
>
> So I think servers should 1) stop allowing reclaims, 2) create the new
> recovery database, 3) atomically: clear NEED, check whether they're
> the last to clear NEED, and bump the epoch, and 4) clear ENFORCING. ??
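
If I'm reading that ordering right, it would look something like this
against the toy model above. stop_allowing_reclaims(),
write_new_recovery_db(), and the begin/commit pair are hypothetical
stand-ins for whatever storage and compare-and-swap primitives the
backend provides, and step 3 really does have to be a single atomic
update of the shared state:

/* Hypothetical stand-ins; a real backend would use its own storage
 * and atomic update primitives here. */
static void stop_allowing_reclaims(int me) { (void)me; }
static void write_new_recovery_db(int me, uint64_t epoch)
{ (void)me; (void)epoch; }
static void begin_atomic_update(struct grace_db *db) { (void)db; }
static void commit_atomic_update(struct grace_db *db) { (void)db; }

static void finish_local_recovery(struct grace_db *db, int me)
{
	bool last = true;
	int i;

	stop_allowing_reclaims(me);         /* 1: no more reclaims */
	write_new_recovery_db(me, db->C);   /* 2: DB for the current epoch */

	begin_atomic_update(db);            /* 3: one atomic RMW cycle */
	db->need[me] = false;
	for (i = 0; i < MAX_NODES; i++)
		if (db->need[i])
			last = false;
	if (last)
		db->R = 0;  /* lift grace: from here on, only the new
			     * recovery DBs are safe to recover from */
	commit_atomic_update(db);

	if (db->R == 0)
		db->enforcing[me] = false;  /* 4: resume normal operation */
}
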
> Do we have a race like this in a 2-node cluster?
>
> - server 1 clears NEED
> - server 2 restarts, sets NEED and ENFORCING
> - server 2 sees that 1 still has ENFORCING set, starts accepting
>   reclaims
> - server 1 clears ENFORCING, starts accepting non-reclaims.
After looking over this a bit more, I think there is a potential
problem. We're currently starting a new (local) grace period (and
setting our own enforcing flag), and only generating the recovery DBs
afterward.

If we crash between those two events, then another node could lift the
grace period before this node comes back up and marks its NEED flag.
The easy fix is to just create a new recovery DB for the new epoch prior
to starting the grace period locally. Lightly tested patch here:
https://review.gerrithub.io/#/c/ffilz/nfs-ganesha/+/415232
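
In terms of the toy model earlier in the thread, the reordered startup
path looks roughly like this. The helpers are the same hypothetical
stand-ins as before, the epoch arithmetic when starting a grace period
(old C becomes R, C is bumped) is just my shorthand for the design, and
the case of joining an already-active grace period is elided:

static void node_startup(struct grace_db *db, int me)
{
	uint64_t new_epoch = db->C + 1;

	/* Persist the recovery DB for the new epoch *before* touching
	 * any flags, so the DB exists no matter where we crash below. */
	write_new_recovery_db(me, new_epoch);

	begin_atomic_update(db);
	db->R = db->C;           /* reclaims come from the old epoch */
	db->C = new_epoch;
	db->need[me] = true;
	db->enforcing[me] = true;
	commit_atomic_update(db);
}
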
--
Jeff Layton <jlayton@kernel.org>