On Fri, 2018-06-08 at 10:22 -0400, J. Bruce Fields wrote:
On Wed, May 23, 2018 at 08:21:40AM -0400, Jeff Layton wrote:
> +Lifting the Grace Period
> +------------------------
> +Transitioning from recovery to normal operation really consists of two
> +different steps:
> +
> +1. the server decides that it no longer requires a grace period, either
> + due to it timing out or there not being any clients that would be
> + allowed to reclaim.
> +
> +2. the server stops enforcing the grace period and transitions to normal
> + operation
> +
> +These concepts are often conflated on a singleton server, but in
> +a cluster we must consider them independently.
> +
> +When a server is finished with its own local recovery period, it should
> +clear its NEED flag. That server should continue enforcing the grace
> +period, however, until the grace period is fully lifted.
> +
> +If the server's own NEED flag is the last one set, then it can lift the
> +grace period (by setting R=0). At that point, all servers in the cluster
> +can end grace period enforcement, and communicate that fact to the
> +others by clearing their ENFORCING flags.
I think this also needs to describe the ordering of the recovery
database switch and the epoch increment in the clustered case.
For "surviving" servers it doesn't matter since their recovery database
isn't changing.
For restarting servers, there's a window between clearing NEED and
clearing ENFORCING when their recovery database can't change.
The epoch mustn't change till everybody's created a new recovery
database.
It must change before anyone grants a new non-reclaim lock, because at
that point it's no longer safe to use the older recovery databases.
(That could result in allowing a reclaim from a client which conflicts
with the new lock.)
So I think servers should 1) stop allowing reclaims, 2) create the new
recovery database, 3) atomically: clear NEED, check whether they're the
last to clear NEED, and bump the epoch, and 4) clear ENFORCING. ??
First, we create the new database as reclaim requests come in, so we
don't need to do that at any particular time. Basically, we allow
reclaims from the server's recovery DB for R (when R != 0), and we
create new records in the DB for C when those requests come in.
If the server already has a DB for C when it restarts, then that DB is
truncated and recreated from the reclaim requests coming in.
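As a rough sketch, that reclaim-time DB handling might look like the
following (a toy Python model; the class and method names are invented
for illustration, not taken from the actual implementation):

```python
# Toy sketch of per-epoch reclaim handling; all names are invented.
class RecoveryDB:
    def __init__(self):
        self.epochs = {}  # epoch number -> set of client IDs with records

    def handle_reclaim(self, client, R, C):
        """Allow a reclaim only if a record exists in the DB for the
        recovery epoch R (and R != 0), and create a new record for the
        client in the DB for the current epoch C as the request comes in."""
        if R == 0 or client not in self.epochs.get(R, set()):
            return False  # no grace in effect, or client has no old record
        self.epochs.setdefault(C, set()).add(client)
        return True
```

So the DB for the new epoch C is built up incrementally from the
reclaims themselves, rather than being written out at a particular
point in the state transition.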
A surviving server will need to create a new recovery DB for the new
epoch. That's done after we start enforcing the grace period but before
marking ourselves as enforcing.
Do we have a race like this, in a 2-node cluster?:
- server 1 clears NEED
- server 2 restarts, sets NEED and ENFORCING
- server 2 sees that 1 still has ENFORCING set, starts accepting
reclaims
- server 1 clears ENFORCING, starts accepting non-reclaims.
No. The transactions are done atomically in RADOS. Basically we do:
1) read operation that fetches the object version
2) modify in memory
3) write operation that asserts on the object version
If the assertion fails, then we restart the whole cycle again with a new
read op. The update of the epochs is also atomic with respect to the flag updates
because they're part of the same operation, and RADOS is atomic at the
operation level.
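In sketch form, that retry cycle looks something like this (a toy
single-process model; the object and helper names here are invented,
and a real implementation would issue librados read/write ops with a
version assertion rather than comparing integers in memory):

```python
import copy

class GraceObject:
    """Toy stand-in for the shared RADOS object holding the epochs and
    per-server flags; 'version' mimics the RADOS object version."""
    def __init__(self):
        self.version = 0
        self.state = {"R": 0, "C": 1, "flags": {}}

def update_grace(obj, modify):
    """Retrying read/modify/write: fetch the version, modify a copy of
    the state in memory, then write back only if the version is
    unchanged; otherwise restart the whole cycle with a fresh read."""
    while True:
        ver = obj.version                             # 1) read op fetches version
        new_state = modify(copy.deepcopy(obj.state))  # 2) modify in memory
        if obj.version == ver:                        # 3) write asserts on version
            obj.state = new_state
            obj.version += 1
            return new_state
        # assertion failed: another server raced us; retry from the read
```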
The first NEED flag being set and the last NEED flag being cleared are done
atomically with changes to the epochs.
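One plausible shape for those two transitions is sketched below. This
is a hypothetical model: the flag encoding and the exact epoch
transition (recovery epoch R becoming the old current epoch C when a
grace period starts) are assumptions for illustration, not quoted from
the implementation.

```python
NEED, ENFORCING = 1, 2  # per-server flag bits (encoding assumed)

def set_need(state, server):
    """Restarting server: set NEED and ENFORCING. If this is the first
    NEED flag in the cluster, begin a new grace period; here we assume
    the old current epoch becomes the recovery epoch and C is bumped."""
    first = not any(f & NEED for f in state["flags"].values())
    state["flags"][server] = state["flags"].get(server, 0) | NEED | ENFORCING
    if first:
        state["R"], state["C"] = state["C"], state["C"] + 1
    return state

def clear_need(state, server):
    """Server done with local recovery: clear NEED. If it was the last
    NEED flag set, lift the grace period by setting R=0 in the same
    atomic update."""
    state["flags"][server] &= ~NEED
    if not any(f & NEED for f in state["flags"].values()):
        state["R"] = 0  # grace fully lifted
    return state
```

Either function would run inside the retrying read/modify/write cycle,
so the flag change and the epoch change land in one atomic operation.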
In practice, in the above example, server 1 would not clear its E flag,
because when it went to do so, it would see that server 2 had set N in
the meantime, and would need to keep enforcing its grace
period.
We always treat the E flag conservatively too. We only set it after
we're actually enforcing the grace period locally, and only stop
enforcing grace after we've cleared the flag (and thus ensured that no
one needs it).
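That conservative check before clearing E could be sketched like this
(hypothetical names and flag encoding; the function would run inside
the atomic read/modify/write cycle described earlier):

```python
NEED, ENFORCING = 1, 2  # per-server flag bits (encoding assumed)

def try_clear_enforcing(state, server):
    """Clear our ENFORCING flag only if no server still has NEED set.
    In the two-node race above, server 1 would see server 2's NEED flag
    in the same atomic update and keep enforcing instead."""
    if any(f & NEED for f in state["flags"].values()):
        return state, False   # someone still needs grace: keep enforcing
    state["flags"][server] &= ~ENFORCING
    return state, True        # safe: stop local enforcement afterwards
```

Local enforcement only actually stops after the cleared flag has been
committed, matching the set-after/clear-before ordering described
above.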
--
Jeff Layton <jlayton(a)kernel.org>