[RFC PATCH] rados_cluster: add a "design" manpage
by Jeff Layton
From: Jeff Layton <jlayton(a)redhat.com>
Bruce asked for better design documentation, so this is my attempt at
it. Let me know what you think. I'll probably end up squashing this into
one of the code patches but for now I'm sending this separately to see
if it helps clarify things.
Suggestions and feedback are welcome.
Change-Id: I53cc77f66b2407c2083638e5760666639ba1fd57
Signed-off-by: Jeff Layton <jlayton(a)redhat.com>
---
src/doc/man/ganesha-rados-cluster.rst | 227 ++++++++++++++++++++++++++
1 file changed, 227 insertions(+)
create mode 100644 src/doc/man/ganesha-rados-cluster.rst
diff --git a/src/doc/man/ganesha-rados-cluster.rst b/src/doc/man/ganesha-rados-cluster.rst
new file mode 100644
index 000000000000..1ba2d3c29093
--- /dev/null
+++ b/src/doc/man/ganesha-rados-cluster.rst
@@ -0,0 +1,227 @@
+==============================================================================
+ganesha-rados-cluster-design -- Clustered RADOS Recovery Backend Design
+==============================================================================
+
+.. program:: ganesha-rados-cluster-design
+
+This document aims to explain the theory and design behind the
+rados_cluster recovery backend, which coordinates grace period
+enforcement among multiple, independent NFS servers.
+
+In order to understand the clustered recovery backend, it's first necessary
+to understand how recovery works with a single server:
+
+Singleton Server Recovery
+-------------------------
+NFSv4 is a lease-based protocol. Clients set up a relationship to the
+server and must periodically renew their lease in order to maintain
+their ephemeral state (open files, locks, delegations or layouts).
+
+When a singleton NFS server is restarted, any ephemeral state is lost. When
+the server comes back online, NFS clients detect that the server has
+been restarted and will reclaim the ephemeral state that they held at the
+time of their last contact with the server.
+
+Singleton Grace Period
+----------------------
+
+In order to ensure that we don't end up with conflicts, clients are
+barred from acquiring any new state while in the Recovery phase. Only
+reclaim operations are allowed.
+
+This period of time is called the **grace period**. Most NFS servers
+have a grace period that lasts around two lease periods. However,
+nfs-ganesha can and will lift the grace period early if it determines
+that no more clients will be allowed to recover.
+
+Once the grace period ends, the server will move into its Normal
+operation state. During this period, no more recovery is allowed and new
+state can be acquired by NFS clients.
+
+Reboot Epochs
+-------------
+The lifecycle of a singleton NFS server can be considered to be a series
+of transitions from the Recovery period to Normal operation and back. In the
+remainder of this document we'll consider such a period to be an
+**epoch**, and assign each a number beginning with 1.
+
+Visually, we can represent it like this, such that each
+Normal -> Recovery transition is marked by a change in the epoch value:
+
+ +---------------------------------------+
+ | R | N | R | N | R | R | R | N | R | N |     <=== Operational State
+ +---------------------------------------+
+ |   1   |   2   |       3       |   4   |     <=== Epoch
+ +---------------------------------------+
+
+Note that it is possible to restart during the grace period (as shown
+above during epoch 3). That just serves to extend the recovery period
+and the epoch. A new epoch is only declared during a
+Normal -> Recovery transition.
+
+Client Recovery Database
+------------------------
+There are some potential edge cases that can occur involving network
+partitions and multiple reboots. In order to prevent those, the server
+must maintain a list of clients that hold state on the server at any
+given time. This list must be maintained on stable storage. If a client
+sends a request to reclaim some state, then the server must check to
+make sure it's on that list before allowing the request.
+
+Thus when the server allows reclaim requests it must always gate them
+against the recovery database from the previous epoch. As clients come
+in to reclaim, we establish records for them in a new database
+associated with the current epoch.
+
+The transition from recovery to normal operation should perform an
+atomic switch of recovery databases. A recovery database only becomes
+legitimate on a recovery to normal transition. Until that point, the
+recovery database from the previous epoch is the canonical one.
+
+Exporting a Clustered Filesystem
+--------------------------------
+Let's consider a set of independent NFS servers, all serving out the same
+content from a clustered backend filesystem of any flavor. Each NFS
+server in this case can itself be considered a clustered FS client. This
+means that the NFS server is really just a proxy for state on the
+clustered filesystem (XXX: diagram here?)
+
+The filesystem must make some guarantees to the NFS server. First filesystem
+guarantee:
+
+1. The filesystem ensures that the NFS servers (aka the FS clients)
+ cannot obtain state that conflicts with that of another NFS server.
+
+This is somewhat obvious and is what we expect from any clustered filesystem
+outside of any requirements of NFS. If the clustered filesystem can
+provide this, then we know that conflicting state during normal
+operations cannot be granted.
+
+The recovery period has a different set of rules. If an NFS server
+crashes and is restarted, then we have a window of time when that NFS
+server does not know what state was held by its clients.
+
+If the state held by the crashed NFS server is immediately released
+after the crash, another NFS server could hand out conflicting state
+before the original NFS client has a chance to recover it.
+
+This *must* be prevented. Second filesystem guarantee:
+
+2. The filesystem must not release state held by a server during the
+ previous epoch until all servers in the cluster are enforcing the
+ grace period.
+
+In practical terms, we want the filesystem to provide a way for an NFS
+server to tell it when it's safe to release state held by a previous
+instance of itself. The server should do this once it knows that all of
+its siblings are enforcing the grace period.
+
+Note that we do not require that all servers restart and allow reclaim
+at that point. It's sufficient for them to simply begin grace period
+enforcement as soon as possible once one server needs it.
+
+Clustered Grace Period Database
+-------------------------------
+At this point the cluster siblings are no longer completely independent,
+and the grace period has become a cluster-wide property. This means that
+we must track the current epoch on some sort of shared storage that the
+servers can all access.
+
+Additionally, we must keep track of whether a cluster-wide grace period
+is in effect. All running nodes should be informed when either piece of
+information changes, so they can take appropriate steps when it occurs.
+
+In the rados_cluster backend, we track these using two epoch values:
+
+- **C**: is the current epoch. This represents the current epoch value
+ of the cluster
+
+- **R**: is the recovery epoch. This represents the epoch from which
+ clients are allowed to recover. A non-zero value here means
+ that a cluster-wide grace period is in effect. Setting this to
+ 0 ends that grace period.
+
+In order to decide when to make grace period transitions, we must also
+have each server advertise its state to the other nodes. Specifically,
+we need to allow servers to determine these two things about each of
+its siblings:
+
+1. Does this server have clients from the previous epoch that will require
+ recovery? (NEED)
+
+2. Is this server preventing clients from acquiring new state? (ENFORCING)
+
+We do this with a pair of flags per sibling (NEED and ENFORCING). Each
+server typically manages its own flags.
+
+The rados_cluster backend stores all of this information in a single
+RADOS object that is modified using read/modify/write cycles. Typically
+we'll read the whole object, modify it, and then attempt to write it
+back. If something changes between the read and write, we redo the read
+and try it again.
+
+Clustered Client Recovery Databases
+-----------------------------------
+In rados_cluster the client recovery databases are stored as RADOS
+objects. Each NFS server has its own set of them and they are given
+names that have the current epoch (C) embedded in them. This ensures
+that recovery databases are specific to a particular epoch.
+
+In general, it's safe to delete any recovery database that precedes R
+when R is non-zero, and safe to remove any recovery database except for
+the current one (the one with C in the name) when the grace period is
+not in effect (R==0).
+
+Establishing a New Grace Period
+-------------------------------
+When a server restarts and wants to allow clients to reclaim their
+state, it must establish a new epoch by incrementing the current epoch
+to declare a new grace period (R=C; C=C+1).
+
+The exception to this rule is when the cluster is already in a grace
+period. Servers can just join an in-progress grace period instead of
+establishing a new one if one is already active.
+
+In either case, the server should also set its NEED and ENFORCING flags
+at the same time.
+
+The other surviving cluster siblings should take steps to begin grace
+period enforcement as soon as possible. This entails "draining off" any
+in-progress state morphing operations and then blocking the acquisition
+of any new state (usually with a return of NFS4ERR_GRACE to clients that
+attempt it). Again, there is no need for the survivors from the previous
+epoch to allow recovery here.
+
+The surviving servers must however establish a new client recovery
+database at this point to ensure that their clients can do recovery in
+the event of a crash afterward.
+
+Once all of the siblings are enforcing the grace period, the recovering
+server can then request that the filesystem release the old state, and
+allow clients to begin reclaiming their state. In the rados_cluster
+backend driver, we do this by stalling server startup until all hosts
+in the cluster are enforcing the grace period.
+
+Lifting the Grace Period
+------------------------
+Transitioning from recovery to normal operation really consists of two
+different steps:
+
+1. the server decides that it no longer requires a grace period, either
+ due to it timing out or there not being any clients that would be
+ allowed to reclaim.
+
+2. the server stops enforcing the grace period and transitions to normal
+ operation
+
+These concepts are often conflated in singleton servers, but in
+a cluster we must consider them independently.
+
+When a server is finished with its own local recovery period, it should
+clear its NEED flag. That server should, however, continue enforcing the
+grace period until it is fully lifted.
+
+If the server's own NEED flag is the last one set, then it can lift the
+grace period (by setting R=0). At that point, all servers in the cluster
+can end grace period enforcement, and communicate that fact to the
+others by clearing their ENFORCING flags.
--
2.17.0
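
The C/R epoch values and the per-node NEED and ENFORCING flags described in the design document above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the `GraceDB` class, its method names, and the flag bit values are hypothetical and are not taken from the rados_cluster code, and the read/modify/write retry against the shared RADOS object is not modeled.

```python
from dataclasses import dataclass, field

# Hypothetical bit values; the real backend may encode the flags differently.
NEED = 0x1       # node has clients from the previous epoch that may reclaim
ENFORCING = 0x2  # node is blocking acquisition of new state


@dataclass
class GraceDB:
    """In-memory stand-in for the shared grace object (illustrative only)."""
    cur: int = 1                               # C: current epoch
    rec: int = 0                               # R: recovery epoch, 0 = no grace
    flags: dict = field(default_factory=dict)  # node name -> flag bits

    def in_grace(self) -> bool:
        return self.rec != 0

    def start(self, node: str) -> None:
        """A restarted node declares a new grace period or joins one."""
        if not self.in_grace():
            self.rec = self.cur   # R = C
            self.cur += 1         # C = C + 1
        self.flags[node] = NEED | ENFORCING

    def enforce(self, node: str) -> None:
        """A surviving node begins enforcing; it needs no recovery itself."""
        if self.in_grace():
            self.flags[node] = self.flags.get(node, 0) | ENFORCING

    def all_enforcing(self) -> bool:
        """True once every known node has its ENFORCING flag set."""
        return all(f & ENFORCING for f in self.flags.values())

    def done(self, node: str) -> None:
        """Clear NEED; if no node still needs recovery, lift the grace period."""
        self.flags[node] = self.flags.get(node, 0) & ~NEED
        if not any(f & NEED for f in self.flags.values()):
            self.rec = 0          # R = 0 ends the cluster-wide grace period

    def stop_enforcing(self, node: str) -> None:
        """After the grace period is lifted, a node may clear ENFORCING."""
        if not self.in_grace():
            self.flags[node] = self.flags.get(node, 0) & ~ENFORCING


# Two-node walk-through: "a" restarts while "b" survives.
db = GraceDB(flags={"a": 0, "b": 0})
db.start("a")             # new grace period: R=1, C=2
db.enforce("b")           # survivor starts enforcing; "a" may now allow reclaim
db.done("a")              # last NEED flag cleared, so R goes back to 0
db.stop_enforcing("a")
db.stop_enforcing("b")
```

In this model the recovering node would wait for `all_enforcing()` to return true before asking the filesystem to release old state, which mirrors the startup stall described in the document.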
6 years, 6 months
Change in ffilz/nfs-ganesha[next]: gtest/test_handle_to_wire_latency: handle_to_wire latency microbenchmark
by GerritHub
From Girjesh Rajoria <grajoria(a)redhat.com>:
Girjesh Rajoria has uploaded this change for review. ( https://review.gerrithub.io/412955
Change subject: gtest/test_handle_to_wire_latency: handle_to_wire latency microbenchmark
......................................................................
gtest/test_handle_to_wire_latency: handle_to_wire latency microbenchmark
gtest unit test that runs handle_to_wire on single-case, large-loop
and bypass cases to find the average latency of the calls.
Change-Id: I5a826246f48d0f87410e119c72c408f5aa553252
Signed-off-by: grajoria <grajoria(a)redhat.com>
---
M src/gtest/CMakeLists.txt
A src/gtest/test_handle_to_wire_latency.cc
2 files changed, 333 insertions(+), 0 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/55/412955/1
--
To view, visit https://review.gerrithub.io/412955
To unsubscribe, or for help writing mail filters, visit https://review.gerrithub.io/settings
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: I5a826246f48d0f87410e119c72c408f5aa553252
Gerrit-Change-Number: 412955
Gerrit-PatchSet: 1
Gerrit-Owner: Girjesh Rajoria <grajoria(a)redhat.com>
6 years, 6 months
Difference among all the recovery backend
by Supriti Singh
Hello,
Recently Jeff introduced a few more recovery backends: rados_kv, rados_ng and rados_cluster. I would like to understand which
recovery backend should be used for which use case. From a design point of view, what are the differences among these three
recovery backends?
Thanks,
Supriti
------
Supriti Singh SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
6 years, 6 months
Change in ffilz/nfs-ganesha[next]: Updated gtest for w.r.t. base class.
by GerritHub
From Girjesh Rajoria <grajoria(a)redhat.com>:
Girjesh Rajoria has uploaded this change for review. ( https://review.gerrithub.io/412920
Change subject: Updated gtest for w.r.t. base class.
......................................................................
Updated gtest for w.r.t. base class.
Change-Id: I194507f421e74ce8acd921358829a4fd1b0e89f5
Signed-off-by: grajoria <grajoria(a)redhat.com>
---
M src/gtest/test_close2_latency.cc
M src/gtest/test_close_latency.cc
M src/gtest/test_handle_to_key_latency.cc
M src/gtest/test_lock_op2_latency.cc
M src/gtest/test_open2_latency.cc
M src/gtest/test_read2_latency.cc
M src/gtest/test_reopen2_latency.cc
M src/gtest/test_setattr2_latency.cc
M src/gtest/test_write2_latency.cc
9 files changed, 113 insertions(+), 136 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/20/412920/1
--
To view, visit https://review.gerrithub.io/412920
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: I194507f421e74ce8acd921358829a4fd1b0e89f5
Gerrit-Change-Number: 412920
Gerrit-PatchSet: 1
Gerrit-Owner: Girjesh Rajoria <grajoria(a)redhat.com>
6 years, 6 months
Change in ffilz/nfs-ganesha[next]: EXPORT: Free cidr strings in EXPORT CLIENTs
by GerritHub
From Frank Filz <ffilzlnx(a)mindspring.com>:
Frank Filz has uploaded this change for review. ( https://review.gerrithub.io/412871
Change subject: EXPORT: Free cidr strings in EXPORT CLIENTs
......................................................................
EXPORT: Free cidr strings in EXPORT CLIENTs
Change-Id: Id42f00fdaf30a9d7678f4dfce15ece9df6675e2c
Signed-off-by: Frank S. Filz <ffilzlnx(a)mindspring.com>
---
M src/support/exports.c
1 file changed, 17 insertions(+), 6 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/71/412871/1
--
To view, visit https://review.gerrithub.io/412871
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id42f00fdaf30a9d7678f4dfce15ece9df6675e2c
Gerrit-Change-Number: 412871
Gerrit-PatchSet: 1
Gerrit-Owner: Frank Filz <ffilzlnx(a)mindspring.com>
6 years, 6 months
Change in ffilz/nfs-ganesha[next]: gtest: reorganize gtest/CMakeLists.txt
by GerritHub
From Frank Filz <ffilzlnx(a)mindspring.com>:
Frank Filz has uploaded this change for review. ( https://review.gerrithub.io/412867
Change subject: gtest: reorganize gtest/CMakeLists.txt
......................................................................
gtest: reorganize gtest/CMakeLists.txt
Change-Id: Id8a5d97da65e9fcd98b7e2de27b035031a7cc136
---
M src/gtest/CMakeLists.txt
1 file changed, 38 insertions(+), 38 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/67/412867/1
--
To view, visit https://review.gerrithub.io/412867
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id8a5d97da65e9fcd98b7e2de27b035031a7cc136
Gerrit-Change-Number: 412867
Gerrit-PatchSet: 1
Gerrit-Owner: Frank Filz <ffilzlnx(a)mindspring.com>
6 years, 6 months
Change in ffilz/nfs-ganesha[next]: gtest: nfs4_op_rename
by GerritHub
From Frank Filz <ffilzlnx(a)mindspring.com>:
Frank Filz has uploaded this change for review. ( https://review.gerrithub.io/412866
Change subject: gtest: nfs4_op_rename
......................................................................
gtest: nfs4_op_rename
Change-Id: I7a1a523231c80edc4abbedefb4dbd94cc1e46340
Signed-off-by: Frank S. Filz <ffilzlnx(a)mindspring.com>
---
M src/gtest/CMakeLists.txt
M src/gtest/gtest_nfs4.hh
A src/gtest/test_nfs4_rename_latency.cc
3 files changed, 366 insertions(+), 0 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/66/412866/1
--
To view, visit https://review.gerrithub.io/412866
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: I7a1a523231c80edc4abbedefb4dbd94cc1e46340
Gerrit-Change-Number: 412866
Gerrit-PatchSet: 1
Gerrit-Owner: Frank Filz <ffilzlnx(a)mindspring.com>
6 years, 6 months
Recovery backend always gives timeout even though cluster is healthy
by Supriti Singh
Hi,
I was trying to test the new recovery backend, rados_kv. I am using nfs-ganesha v2.7-dev13 with the latest ceph master.
I am testing with a vstart cluster, but every time I get the timeout "Failed to connect to cluster: -110". The cluster is
healthy, and I can also issue commands using the rados CLI. Any pointers on what could be failing here?
Also, while working on this, I realized that the default timeout (client_mount_timeout) is 300 seconds, and there are 10
tries. So, in the worst case, the timeout occurs after 50 minutes. The option "client_mount_timeout" can be set in ceph.conf, but
it seems it is also possible to set it in the ganesha config and pass it via rados_conf_set(). [1] Would it be
useful to have such an option in ganesha.conf?
[1] https://tracker.ceph.com/issues/6507
------
Supriti Singh SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)
6 years, 6 months
Change in ffilz/nfs-ganesha[next]: gtest/test_release_latency: release latency microbenchmark
by GerritHub
From Girjesh Rajoria <grajoria(a)redhat.com>:
Girjesh Rajoria has uploaded this change for review. ( https://review.gerrithub.io/412638
Change subject: gtest/test_release_latency: release latency microbenchmark
......................................................................
gtest/test_release_latency: release latency microbenchmark
gtest unit test that runs release on single-case, large-loop
and bypass cases to find the average latency of the calls.
Change-Id: Idb29a891a67486e71234a3fdd1778036ed1f1ee0
Signed-off-by: grajoria <grajoria(a)redhat.com>
---
M src/gtest/CMakeLists.txt
A src/gtest/test_release_latency.cc
2 files changed, 312 insertions(+), 0 deletions(-)
git pull ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/38/412638/1
--
To view, visit https://review.gerrithub.io/412638
Gerrit-Project: ffilz/nfs-ganesha
Gerrit-Branch: next
Gerrit-MessageType: newchange
Gerrit-Change-Id: Idb29a891a67486e71234a3fdd1778036ed1f1ee0
Gerrit-Change-Number: 412638
Gerrit-PatchSet: 1
Gerrit-Owner: Girjesh Rajoria <grajoria(a)redhat.com>
6 years, 6 months