-----Original Message-----
From: William Allen Simpson [mailto:william.allen.simpson@gmail.com]
Sent: Monday, September 17, 2018 2:51 PM
To: Kropelin, Adam <kropelin@amazon.com>
Cc: devel@lists.nfs-ganesha.org
Subject: Re: [NFS-Ganesha-Devel] Re: Scalability issue with VFS FSAL and large amounts of read i/o in flight
> 1) Did you test with the one line patch that I posted?
Not your precise patch, but an equivalent. It offers a solid improvement until the number of
clients approaches the number of queues. Then aggregate throughput is again limited to the sum
of the slowest client's throughput in each queue. When the "slowest client" is someone who
dropped off the network, the entire queue stalls and any client unfortunate enough to land
on the same queue is starved until TCP eventually gives up on the missing client.
Malahal's timeout patch can help with that, but it's a failsafe, not a scalability solution.
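For the sake of discussion, here's a stripped-down sketch of the failure mode (illustrative
only, not the real ntirpc code; ioq_pop_wait and the struct names are made up): one writer
thread draining one queue with blocking sends, so a single stalled client holds up every
xprt whose entries hashed onto that queue.

    /* Illustrative sketch of per-queue blocking output, NOT the real code. */
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    struct ioq_entry {
        int fd;                  /* the xprt's socket */
        const char *buf;
        size_t len;
    };

    struct ioq;                                       /* hypothetical queue type */
    struct ioq_entry *ioq_pop_wait(struct ioq *q);    /* hypothetical helper */

    static void *ioq_worker(void *arg)
    {
        struct ioq *q = arg;

        for (;;) {
            struct ioq_entry *e = ioq_pop_wait(q);
            size_t off = 0;

            /* Blocking send: if this client stopped reading or fell off the
             * network, we sit here until TCP gives up, starving every other
             * xprt whose entries are queued behind this one. */
            while (off < e->len) {
                ssize_t n = send(e->fd, e->buf + off, e->len - off,
                                 MSG_NOSIGNAL);
                if (n < 0)
                    break;       /* error handling elided */
                off += (size_t)n;
            }
            free(e);
        }
        return NULL;
    }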
> 2) What client distro are you using? What kernel version?
Various, but one example would be Amazon Linux (a RHEL-derived distro available in AWS EC2)
with kernel 4.14.67. This is my testbed since it allows me to conjure any number of pristine
clients on demand.
> 3) What server distro are you using? What kernel version?
Same.
> 4) What does your mount look like?
NFSv4.0, nolock:
10.2.145.170:/test on /home/ec2-user/mnt type nfs4
(rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.3.98.13,local_lock=none,addr=10.2.145.170)
> On 9/17/18 12:40 PM, Kropelin, Adam wrote:
>> See attached for the changes I'm testing with.
> Your patch turns an async thread system into sync per fd.
To be fair, the system was never fully async. Up to 16 threads would drive network output
before, and the same is true with the hash-by-fd patch. You can argue that's a small enough
fraction of however many threads are active, but it's not truly async. With my change the
limit goes up to the number of xprts, but with the suggestions elsewhere in this thread to
make the number of queues configurable, you're just headed for the same thing anyway. The
question is just "how many queues is enough?" The answer is: how many clients do you have,
and how much scalability do you need?
As an alternative, non-blocking I/O would allow true async behavior with a small number of
queues and a small number of dedicated threads, likely one per interface again. epoll() on
the fds with non-empty output queues and feed them as required. Any given worker attempts
to write in-line, and if it gets EWOULDBLOCK it queues the i/o for the epoll worker to handle
when the socket can take data again. Nobody ever blocks on output. But that's a much bigger
architectural change.
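For concreteness, something along these lines is what I'm picturing (a sketch only; the
outq_* names and struct are hypothetical, assuming each fd keeps its own pending-output
queue). Workers try writev() on the non-blocking socket themselves, and only when they get
EWOULDBLOCK do they park the remainder on the fd's queue and arm EPOLLOUT for a dedicated
thread like this one:

    /* Sketch of a dedicated epoll output thread (one per interface, say).
     * All outq_* names are hypothetical. */
    #include <errno.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/epoll.h>
    #include <sys/uio.h>

    struct outq {                /* hypothetical per-fd pending output */
        int fd;
        /* iovecs, bytes consumed, lock, ... */
    };

    int  outq_nonempty(struct outq *q);
    int  outq_peek(struct outq *q, struct iovec **iov);   /* returns iovcnt */
    void outq_consume(struct outq *q, size_t n);
    void outq_fail(struct outq *q);   /* drop pending output, mark xprt dead */

    static void epoll_writer(int epfd)
    {
        struct epoll_event evs[64];

        for (;;) {
            int n = epoll_wait(epfd, evs, 64, -1);

            for (int i = 0; i < n; i++) {
                struct outq *q = evs[i].data.ptr;

                while (outq_nonempty(q)) {
                    struct iovec *iov;
                    int iovcnt = outq_peek(q, &iov);
                    ssize_t w = writev(q->fd, iov, iovcnt);

                    if (w < 0) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                            break;            /* still full; wait for EPOLLOUT */
                        outq_fail(q);         /* dead client; nobody blocked on it */
                        break;
                    }
                    outq_consume(q, (size_t)w);
                }

                if (!outq_nonempty(q)) {
                    /* Drained: stop watching EPOLLOUT so we don't spin. */
                    struct epoll_event ev = { .events = 0, .data.ptr = q };
                    epoll_ctl(epfd, EPOLL_CTL_MOD, q->fd, &ev);
                }
            }
        }
    }

The key property is that a client that stops reading only pins its own queue; the epoll
thread just moves on to the next ready fd.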
> This really shouldn't be visible to the library caller, so doesn't
> belong in the SVCXPRT.
> You can do whatever you want, but I already added an IOQ in the
> rpc_dplx_rec for receiving. That would be a better place.
Thanks for the feedback, I'll take a look at that.
> And none of this fixes the underlying problem. 1GB in 1MB chunks all
> processing in parallel is going to make 100,000 TCP segments, and Linux
> default caps the outstanding segments at 1,000 or 10,000 per interface
> (depending on the queue type). Multiply by the number of callers.
> Basically you have piggy callers.
Piggy indeed. But I'm not instructing them to be piggy...they just are. I've
purposely designed my test case to eliminate every other potential bottleneck and
complicating factor. There's essentially zero disk I/O anywhere. I should see perfect
scalability (and I do with thread-per-xprt). A server needs to cope with clients doing
things that aren't always optimal.
> All your patch does is randomize the fd that is serviced (because we
> cannot control which thread will be selected by the scheduler).
It allows us to feed more than one TCP stream concurrently. When the TCP stream itself is
bandwidth-limited on the client side and you're using blocking sockets, you have to
employ multiple threads. How many threads is enough is the question; 16 isn't enough
when you have a fleet of more than that many bandwidth-limited clients.
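To put some purely hypothetical numbers on it: if each client can only sink about 50 MB/s,
16 blocking writer threads top out around 16 x 50 = 800 MB/s no matter how much the server's
NIC or page cache could deliver; with 64 such clients in flight you'd need on the order of
64 threads (or non-blocking I/O) just to keep the link busy.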
> And leaves an awful lot of threads waiting.
Yes, that's a downside with few alternatives other than non-blocking I/O.
--Adam