One more thing to keep in mind: Getting this wrong is a data corruption
bug, which is the worst possible bug you can have in a storage system,
so you should be pretty sure you've gotten it right.
Daniel
On 4/5/19 9:20 AM, Jeff Layton wrote:
Easier said than done. Bear in mind that all of the recovery backend
machinery exists entirely to deal with server restarts, so you really
do have to be careful not to leave gaps in particular failure
scenarios.
Let's say you do decide to synchronously store open and lock records
in a central RADOS-based database and both Server A and Server B are
using it.
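(Just to make "synchronously store" concrete, here's a minimal sketch
using the python-rados bindings. The pool name, object naming, and
record layout are all made up for illustration -- this is not an
existing ganesha schema:)

    import json
    import rados

    # Hypothetical recovery record for one client; the fields are
    # assumptions for this sketch, not a real ganesha format.
    record = {"clientid": "0x1a2b3c", "server": "ganesha-a", "opens": []}

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("nfs-recovery")  # assumed pool name
        # A synchronous write_full() doesn't return until RADOS has
        # committed the write, so the record is durable before the
        # server replies to the client -- that's the "synchronous" part.
        ioctx.write_full("client_" + record["clientid"],
                         json.dumps(record).encode())
        ioctx.close()
    finally:
        cluster.shutdown()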
Server A crashes, and one of its clients decides to reconnect to
Server B using its recorded clientid/session. Server B says "Oh, this
session was previously held by Server A." Now what happens?
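(A hypothetical ownership check shows the dead end. The record layout
carries over from the sketch above and is equally invented:)

    import json

    def handle_reclaim(ioctx, clientid, my_server_id):
        # Fetch the record a synchronous RADOS store would have
        # written for this client (hypothetical naming, as above).
        rec = json.loads(ioctx.read("client_" + clientid).decode())
        if rec["server"] == my_server_id:
            # Ordinary restart recovery: we held the session, so the
            # client may reclaim the state it previously held.
            return rec
        # The session was previously held by another server. Without
        # a way to transfer the CephFS state (opens, locks, caps) to
        # this one, there is no safe way to honor the reclaim.
        raise NotImplementedError("cross-server takeover doesn't exist")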
You need a mechanism to transfer the CephFS state (opens, locks, caps,
etc.) from Server A to Server B. Nothing like that exists today, but
we do have some tentative plans to allow cephfs clients to reclaim
state they previously held. In principle, that could be extended to
allow "takeover" in some fashion.
But wait...it gets worse!
Suppose we have a 3-node ganesha cluster, and some of Server A's
clients decide to go to Server C instead. Now a simple takeover is not
enough -- you need a way to split that state up granularly.
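(For illustration only, the bookkeeping half of that split is easy to
sketch -- everything here is invented, and it deliberately ignores the
hard part, which is migrating the CephFS caps and locks to each target
without racing other servers or further failures:)

    def partition_state(records, reconnect_map):
        # records: the dead server's recovery records, keyed by clientid.
        # reconnect_map: clientid -> server the client reconnected to.
        per_server = {}
        for clientid, rec in records.items():
            target = reconnect_map.get(clientid)
            if target is None:
                continue  # this client hasn't reappeared anywhere yet
            per_server.setdefault(target, {})[clientid] = rec
        return per_server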
Couple all of this with the basic truism that failures in these sorts
of architectures often cascade. You need to plan for the possibility
that any node can die at any time, and decide how you're going to
handle it. A lot of the original ganesha recovery backend work had
gaping holes in its "takeover" mechanisms, where a failure at an
inopportune time could leave no clients able to recover anything.
This is very much a non-trivial problem in my experience, but don't
let me dissuade you if you've considered these scenarios and have
thoughts on how to address them.