Thank you for the info, but much of it is unrelated to the patch in question here.

I do have a patch that keeps the lists in an array and uses xp_fd for hashing. For now I made the array size configurable via RPC_Ioq_ThrdMax (currently against Ganesha 2.5 code); that config param probably needs a rename.
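
Roughly, the shape is something like the sketch below. This is only an illustration; names such as IOQ_NSLOTS, struct ioq_entry, struct ioq_slot, and ioq_select() are placeholders, not the actual patch:

    /* illustrative sketch only -- not the actual patch */
    #include <pthread.h>
    #include <sys/queue.h>

    #define IOQ_NSLOTS 16               /* e.g. driven by RPC_Ioq_ThrdMax */

    struct ioq_entry {
        TAILQ_ENTRY(ioq_entry) q;       /* linkage for one pending response */
        /* response buffer, owning xprt, etc. */
    };

    struct ioq_slot {
        pthread_mutex_t lock;
        TAILQ_HEAD(, ioq_entry) queue;  /* responses waiting to be written */
    };

    static struct ioq_slot ioq_slots[IOQ_NSLOTS];

    /* hash the transport fd to a bucket; every response for a given
     * xprt lands in the same list, so per-xprt ordering is preserved */
    static inline struct ioq_slot *
    ioq_select(int xp_fd)
    {
        return &ioq_slots[(unsigned)xp_fd % IOQ_NSLOTS];
    }

The modulo hash keeps all of one transport's responses in one bucket, so per-xprt ordering is preserved while the buckets can be drained by separate threads.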

Regards, Malahal.

On Tue, Sep 18, 2018 at 6:51 PM, William Allen Simpson <william.allen.simpson@gmail.com> wrote:
On 9/17/18 4:25 PM, Malahal Naineni wrote:
Today, we have one IOQ in use. It is just a list of responses to be sent over all xprts. How would making a list per xprt be worse than what we have today? Is it because of CPU cache lines, or something else?

Ah, how I love an intelligent question! :)  Long answer.

Yes, atomic cache lines help a lot.  That's "cache coherence".  I hope
by saying this over and over, it will eventually sink in enough that
some of the list members will read up on it.

Also, "something else".  The rpc_dplx_internal per transport lock is
used by many things in many paths.  That was one of the hot locks you
identified a few years back.  It has a lot fewer references now.

Note that I've also kept everything out of xp_lock.  It's not used
anywhere anymore.  But it's API-visible to Ganesha, and should only be
used by Ganesha.  That avoids lock inversion.
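
For anyone who hasn't run into the term: lock inversion is the classic
two-lock deadlock, where two threads take the same pair of locks in
opposite orders.  A contrived sketch (not real ntirpc code; the names
are made up):

    #include <pthread.h>

    static pthread_mutex_t xp_lock       = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t internal_lock = PTHREAD_MUTEX_INITIALIZER;

    /* library path: internal lock first, then xp_lock */
    static void *library_path(void *arg)
    {
        pthread_mutex_lock(&internal_lock);
        pthread_mutex_lock(&xp_lock);        /* blocks if the app holds it */
        /* ... do work ... */
        pthread_mutex_unlock(&xp_lock);
        pthread_mutex_unlock(&internal_lock);
        return arg;
    }

    /* application path: opposite order -- run both and they can deadlock */
    static void *application_path(void *arg)
    {
        pthread_mutex_lock(&xp_lock);
        pthread_mutex_lock(&internal_lock);  /* blocks if the library holds it */
        /* ... do work ... */
        pthread_mutex_unlock(&internal_lock);
        pthread_mutex_unlock(&xp_lock);
        return arg;
    }

Keeping xp_lock strictly on the Ganesha side means those two orderings
can never collide inside the library.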

Circa 2014-2015, when I started trying to figure out how to fit in RDMA,
a major issue was locks.  Lots of lock inversion.  DanG had a big chart
on his whiteboard.  I'd try to fix one thing, and more issues became
apparent.

Finally, I just re-wrote svc_xprt.  And threw away rpc_dplx.  And rpc_ctx.
So many locking conflicts.  Plus two fd rbtrees.  Why do things three times
that can be done once?... :(

Furthermore, the whole thing didn't scale, just as one thread per fd
won't scale.  For 1,024 fds you'd need 1,024 sync output threads, plus
about 3 times more memory (because each thread consumes memory), even
though the "top" half of the dispatcher only needs a few hundred threads
to process the incoming operations.
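
As a rough back-of-envelope (assuming the common 8 MiB default pthread
stack; the real figure depends on the configured stack size, and much of
it is reserved address space rather than resident memory):

    1,024 output threads x 8 MiB stack  ~= 8 GiB reserved
       16 output threads x 8 MiB stack   = 128 MiB reserved

and that is before counting any other per-thread state.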

Remember, ultimately with async FSAL calls, the FSAL provides the thread
that carries the output.

Are we willing to guarantee that every FSAL backend will provide a
thread pool with enough threads to handle one per fd?  And when supporting
multiple FSALs, will each one provide 1,024 threads?

The current design bounds the number of output threads at 16.  If that
hurts too much, we could probably make it a configurable parameter.  But
I've not run into many configurations with even 16 interfaces.  A decade
or so ago, somebody I know was working on a 1,024-link switch on a chip.
But you don't see that stuff much.

Well, I've written all this before.  And given talks.  But new folks
keep joining the project, and maybe older folks don't remember.