From: Jeff Layton <jlayton@redhat.com>
This is an update of the patchset I had originally posted back in late
January. The basic idea is to add a new rados_cluster recovery backend
that allows running ganesha servers to self-aggregate and work out the
grace period amongst themselves by following a set of simple rules and
indicating their current and desired state in a rados object.
The patchset starts by extending the recovery_backend operations
interface to cover handling of the grace period. All of the new
operations collapse down to no-ops in the singleton
recovery_backends.
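
To make the shape of that concrete, here's a rough sketch of the
extended interface. The new hook names are taken from the patch
subjects below, but the exact prototypes (and the surrounding struct)
are illustrative rather than verbatim:

    #include <stdbool.h>

    /* Illustrative only -- new grace-period hooks in the recovery
     * backend operations table. Singleton backends can leave these
     * NULL or implement them as no-ops. */
    struct nfs4_recovery_backend {
            /* ... existing client-record operations elided ... */

            /* start a grace period if the cluster needs one */
            void (*maybe_start_grace)(void);

            /* attempt to lift the grace period cluster-wide */
            void (*try_lift_grace)(void);

            /* record whether this node is enforcing the grace period */
            void (*set_enforcing)(bool enforcing);

            /* is the grace period being enforced cluster-wide? */
            bool (*grace_enforced)(void);
    };
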
It then adds a new support library that abstracts out management of the
shared rados object. This object tracks whether there is a cluster-wide
grace period in effect, and from what reboot epoch recovery is allowed.
It also allows the cluster nodes to indicate whether they need a grace
period (in order to allow recovery) and whether they are currently
enforcing the grace period.
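
In case it helps to picture it, the state in that object boils down to
a pair of epoch counters plus two flags per node. This is purely a
conceptual sketch -- the names are mine, not the library's, and the
actual encoding inside the rados object may differ:

    #include <stdint.h>

    /* Cluster-wide epoch values. One way to encode "is a grace period
     * in effect, and from which reboot epoch may clients reclaim": */
    struct rados_grace_epochs {
            uint64_t cur;   /* current epoch: where new client
                             * recovery records are written */
            uint64_t rec;   /* recovery epoch: the reboot epoch clients
                             * may reclaim from; 0 means no cluster-wide
                             * grace period is in effect */
    };

    /* Per-node flags: */
    #define NODE_NEED_GRACE 0x1     /* node needs a grace period in
                                     * order to allow its own recovery */
    #define NODE_ENFORCING  0x2     /* node is currently enforcing
                                     * the grace period */
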
Next, it adds a new command-line tool for directly manipulating the
shared object. This gives an admin a way to do things like request a
grace period manually, remove a dead host from the cluster, and "fake
up" other nodes in the cluster for testing purposes.
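
Purely for illustration (I'm inventing the subcommand names here --
see the tool itself for the real interface), that means things along
these lines:

    # manually request that a grace period be started
    rados_grace_tool start

    # drop a dead host's record from the shared object
    rados_grace_tool remove ganesha-2

    # "fake up" an extra node for testing
    rados_grace_tool add fake-node-1
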
Finally, it adds a new recovery backend that plugs into the same
library to allow ganesha to participate as a clustered node.
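
The wiring there is just another ops table. Schematically (the
function names here are illustrative, not the actual code):

    /* Schematic only -- each hook forwards to the rados_grace
     * support library. */
    struct nfs4_recovery_backend rados_cluster_backend = {
            /* ... the usual client-record operations ... */
            .maybe_start_grace = rados_cluster_maybe_start_grace,
            .try_lift_grace    = rados_cluster_try_lift_grace,
            .set_enforcing     = rados_cluster_set_enforcing,
            .grace_enforced    = rados_cluster_grace_enforced,
    };
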
The immediate aim here is to allow us to do an active/active export of
FSAL_CEPH from multiple heads, probably under some sort of container
orchestration (e.g., Kubernetes). The underlying design, however,
should be extensible to other clustered backends.
While this does work, it's still very much proof-of-concept code at
this point. There is quite a bit of room for improvement here, so I
don't think it's quite ready for merge, but I'd appreciate any early
feedback on the approach. Does anyone see any major red flags in this
design that I haven't yet spotted?
There is one prerequisite for this set -- it currently relies on a patch
to ceph that is not yet in tree (to allow ganesha to immediately kill
off the Ceph MDS session of its previous incarnation). That's still
under development, but it's fairly straightforward.
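
To give a flavor of what FSAL_CEPH does with that, the logic amounts
to something like the following before the mount. The libcephfs call
is a stand-in for the still-in-flight Ceph API, so treat its name and
signature as hypothetical:

    /* Hypothetical sketch -- the real libcephfs interface is still
     * under review on the Ceph side. The idea: before calling
     * ceph_mount(), ask the MDS to tear down any session left over
     * from this node's previous incarnation, identified by a stable
     * per-node id. */
    static int kill_old_session(struct ceph_mount_info *cmount,
                                const char *nodeid)
    {
            return ceph_kill_client_session(cmount, nodeid); /* hypothetical */
    }
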
Jeff Layton (13):
HASHTABLE: add a hashtable_for_each function
reaper: add a way to wake up the reaper immediately
main: initialize recovery backend earlier
SAL: make some rados_kv symbols public
SAL: add new try_lift_grace recovery operation
SAL: add recovery operation to maybe start a grace period
SAL: add new set_enforcing operation
SAL: add a way to check for grace period being enforced cluster-wide
main: add way to stall server until grace is being enforced
support: add a rados_grace support library
tools: add new rados_grace manipulation tool
SAL: add new clustered RADOS recovery backend
FSAL_CEPH: kill off old session before the mount
src/CMakeLists.txt | 1 +
src/FSAL/FSAL_CEPH/main.c | 39 ++
src/MainNFSD/nfs_init.c | 8 -
src/MainNFSD/nfs_lib.c | 13 +
src/MainNFSD/nfs_main.c | 12 +
src/MainNFSD/nfs_reaper_thread.c | 11 +
src/SAL/CMakeLists.txt | 3 +-
src/SAL/nfs4_recovery.c | 90 ++-
src/SAL/recovery/recovery_rados.h | 6 +
src/SAL/recovery/recovery_rados_cluster.c | 406 +++++++++++++
src/SAL/recovery/recovery_rados_kv.c | 7 +-
src/cmake/modules/FindCEPHFS.cmake | 8 +
src/doc/man/ganesha-core-config.rst | 1 +
src/hashtable/hashtable.c | 17 +
src/include/config-h.in.cmake | 1 +
src/include/hashtable.h | 3 +
src/include/nfs_core.h | 1 +
src/include/rados_grace.h | 82 +++
src/include/sal_functions.h | 11 +-
src/nfs-ganesha.spec-in.cmake | 2 +
src/support/CMakeLists.txt | 4 +
src/support/rados_grace.c | 678 ++++++++++++++++++++++
src/tools/CMakeLists.txt | 4 +
src/tools/rados_grace_tool.c | 178 ++++++
24 files changed, 1567 insertions(+), 19 deletions(-)
create mode 100644 src/SAL/recovery/recovery_rados_cluster.c
create mode 100644 src/include/rados_grace.h
create mode 100644 src/support/rados_grace.c
create mode 100644 src/tools/rados_grace_tool.c
--
2.17.0