I can't realistically review most of this code, so I went looking for
some documentation and found this. Maybe it's not the best starting
point. Forgive me if I seem dense; I'd just really like to see
everything spelled out very precisely, and neither this nor your
original presentation quite does that for me yet:
On Thu, May 03, 2018 at 02:58:00PM -0400, Jeff Layton wrote:
+ * The rados_grace database is a rados object with a well-known name that
+ * with which all cluster nodes can interact to coordinate grace-period
+ * enforcement.
+ *
+ * It consists of two parts:
+ *
+ * 1) 2 uint64_t epoch values (stored LE) that indicate the serial number of
+ * the current grace period (C) and the serial number of the grace period that
Delete "that".
+ * from which recovery is currently allowed (R). These are stored as object
+ * data.
+ *
+ * 2) An omap containing a key value pair for each cluster node. The key is
+ * the hostname of the node running ganesha, and the value is a byte with a
+ * set of flags.
+ *
+ * Consider a single server epoch (E) of an individual NFS server to be the
+ * period between reboots. That consists of an initial grace period and
+ * a regular operation period. An epoch value of 0 is never valid.
Does "epoch value" mean the same thing as "serial number" above? I
assume it's something that uniquely identifies an "epoch".
Also, you've defined an "epoch" for a single server; it needs a
definition for a cluster too, right?
+ *
+ * The first value (C) indicates the current server epoch. The client recovery
+ * db should be tagged with this value on creation, or when updating the db
+ * after the grace period has been fully lifted.
What's the "client recovery db"? I guess it's the per-node database of
long-form client identifiers identifying clients that are allowed to
reclaim state?
+ *
+ * The second uint64_t value (R) in the data tells the NFS server from what
+ * recovery db it is allowed to reclaim. A value of 0 in this field means that
+ * we are out of the cluster-wide grace period and that no recovery is allowed.
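To check my reading of the object data, here is a minimal sketch of how I
would expect a node to decode it. The struct and function names are mine,
not from the patch, and I'm assuming the data is exactly C followed by R,
each a little-endian uint64_t, with C == 0 treated as invalid and R == 0
meaning no grace period is in force:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the layout (two LE uint64_t values) is
 * taken from the comment under review. */
struct grace_epochs {
	uint64_t cur;	/* C: serial number of the current grace period */
	uint64_t rec;	/* R: epoch recovery is allowed from, 0 = none */
};

/* Decode one little-endian uint64_t regardless of host byte order. */
static uint64_t le64_decode(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Parse the object data. Returns false if the buffer is too short or
 * if C is 0 ("An epoch value of 0 is never valid"). */
static bool grace_epochs_parse(const unsigned char *buf, size_t len,
			       struct grace_epochs *out)
{
	assert(buf != NULL && out != NULL);
	if (len < 2 * sizeof(uint64_t))
		return false;
	out->cur = le64_decode(buf);
	out->rec = le64_decode(buf + sizeof(uint64_t));
	return out->cur != 0;
}

/* Assumption: R != 0 means a cluster-wide grace period is in force. */
static bool grace_in_force(const struct grace_epochs *e)
{
	return e->rec != 0;
}
```

If that's not the intended interpretation, then the comment probably needs
to say more about what a reader of the object should conclude from C and R.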
+ *
+ * The omap contains a key for each host in the cluster. Typically, nodes join
+ * the cluster by setting their omap key. The value of the omap is a single
+ * byte that contains a set of flags that indicates their current need for a
+ * grace period and whether they are locally enforcing one.
Is it really just those two flags? A list of flags here would be
helpful.
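For example, if it really is just those two, I'd expect something along
these lines. The names and bit values below are purely my guesses for
illustration, not taken from the patch, and the helper is just how I
imagine the "can be queried" part working over the omap values:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and bit positions; the comment under review only
 * implies that these two flags exist. */
#define NODE_NEED_GRACE	0x1	/* node currently needs a grace period */
#define NODE_ENFORCING	0x2	/* node is locally enforcing one */

/* Given one flag byte per node (the omap values), report whether any
 * node has the given flag set. */
static bool any_node_has(const uint8_t *flags, size_t nodes, uint8_t flag)
{
	assert(flags != NULL || nodes == 0);
	for (size_t i = 0; i < nodes; i++)
		if (flags[i] & flag)
			return true;
	return false;
}
```

With that, "does anyone still need a grace period?" is
any_node_has(flags, n, NODE_NEED_GRACE), and similarly for enforcement.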
--b.
+ *
+ * The grace period handling engine will update and store the flags, and it
+ * can be queried to determine whether other nodes may need a grace period or
+ * are enforcing.
+ */