I did some more digging and bisected my way to V2.5-dev-11 as the first tag showing the
undesired behavior.
Looking deeper, the cause is this:
https://github.com/nfs-ganesha/ntirpc/pull/35/commits/3643f4fe45867f4b1ad...
If I'm understanding this right, it appears to be using a single thread per interface.
That would explain the observation that a single thread is blocked in writev() whenever I
look. It also explains why, when clients are bandwidth-limited, the aggregate throughput
does not scale much beyond what a single client consumes. With a modest amount of read
i/o in flight to each client, we'll quickly fill the socket buffer for any one client
and block waiting for it to drain while the other TCP streams sit idle.
I suspect it also means a single out-to-lunch client could stall *all* i/o on the
interface, which is another behavior I've been seeing recently when clients reboot or
otherwise go AWOL without unmounting or closing the TCP connection.
Non-blocking I/O would be the answer here, but without that...throw some more threads at
it, I guess?
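To illustrate what I mean by non-blocking I/O, here is a minimal Python sketch, not
ganesha or ntirpc code. It assumes one sender servicing two connections, one of which
never reads. The helper names (fill_until_blocked, serve_round) are made up for this
example, and socketpair() stands in for real client TCP connections.

```python
import socket

def fill_until_blocked(sock, chunk=b"x" * 4096):
    """Queue data on a non-blocking socket until the kernel buffer is full.

    Returns the number of bytes the kernel accepted before refusing more.
    """
    queued = 0
    while True:
        try:
            queued += sock.send(chunk)
        except BlockingIOError:
            # Send buffer is full; a blocking socket would stall right here.
            return queued

def serve_round(conns, payload):
    """Offer payload to every connection once, never blocking on any of them.

    Returns {index: bytes_accepted}; 0 means that peer's buffer was full.
    """
    progress = {}
    for i, sock in enumerate(conns):
        try:
            progress[i] = sock.send(payload)
        except BlockingIOError:
            progress[i] = 0  # skip the stalled peer and move on
    return progress

# Two hypothetical clients: one never reads (stalled), one is healthy.
stalled_srv, stalled_cli = socket.socketpair()
healthy_srv, healthy_cli = socket.socketpair()
for s in (stalled_srv, healthy_srv):
    s.setblocking(False)

prefilled = fill_until_blocked(stalled_srv)   # stalled client's buffer is now full
result = serve_round([stalled_srv, healthy_srv], b"y" * 4096)
received = healthy_cli.recv(4096)             # healthy client still gets its data

for s in (stalled_srv, stalled_cli, healthy_srv, healthy_cli):
    s.close()
```

With blocking sockets, the send to the stalled peer would never return and the healthy
peer would get nothing. A real server would also need poll/epoll to learn when the full
buffer becomes writable again, which this sketch leaves out.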
--Adam
-----Original Message-----
From: Kropelin, Adam [mailto:kropelin@amazon.com]
Sent: Tuesday, September 11, 2018 12:59 PM
To: devel(a)lists.nfs-ganesha.org
Subject: [NFS-Ganesha-Devel] Scalability issue with VFS FSAL and large amounts of read i/o
in flight
Hello,
I am observing a scalability issue with recent-ish versions of nfs-ganesha (including
-next) when NFS clients have a significant amount of in-flight read requests.
My test setup has a ganesha server with a single export on the VFS FSAL. I have multiple
Linux clients, all mounting that export with NFSv4.0. On the clients I run a simple read
workload using dd: 'dd if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read
the same 1 GB file. Each client is bandwidth-limited to 1 Gbps while the server has 10
Gbps available. A single client achieves ~100 MB/sec. Adding a second client brings the
aggregate throughput up to ~120 MB/sec. A third client gets the aggregate to ~130 MB/sec,
and it pretty much plateaus at that point. Clearly this is well below the aggregate
bandwidth the server is capable of.
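To put rough numbers on the gap, here is a back-of-the-envelope calculation in Python.
It uses only the figures quoted above and assumes decimal units (1 Gbps = 125 MB/sec)
and that each client could ideally sustain the ~100 MB/sec seen with a single client.
The function names are just for illustration.

```python
# Figures from the test described above.
SERVER_LINK_GBPS = 10
PER_CLIENT_MB_S = 100                      # what one client achieves alone
observed_mb_s = {1: 100, 2: 120, 3: 130}   # clients -> aggregate MB/sec seen

def gbps_to_mb_s(gbps):
    """Convert link speed to MB/sec, assuming decimal units and no protocol overhead."""
    return gbps * 1000 / 8

def ideal_aggregate(n_clients):
    """Aggregate throughput if clients scaled linearly up to the server link."""
    return min(n_clients * PER_CLIENT_MB_S, gbps_to_mb_s(SERVER_LINK_GBPS))

for n, seen in observed_mb_s.items():
    ideal = ideal_aggregate(n)
    print(f"{n} client(s): observed ~{seen} MB/sec, "
          f"ideal ~{ideal:.0f} MB/sec ({seen / ideal:.0%} of ideal)")
```

Under those assumptions the server link tops out around 1250 MB/sec, so the plateau at
~130 MB/sec is roughly a tenth of what the link could carry.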
Additionally, and this is the behavior that made me originally discover this issue in
production, while the clients are performing their read test, the server becomes extremely
slow to respond to mount requests. By "extremely slow" I mean it takes 60
seconds or more to perform a simple mount while 8 clients are running the read test.
I've ruled out external bottlenecks -- disk i/o on the server is essentially zero
during the test (as would be expected since that 1 GB file will most certainly be in page
cache). The server shows no significant CPU load at all. Using the in-kernel NFS server
with the same clients I can easily saturate the 10 Gbps network link from 8-10 clients
with no effect on mount times, so network is not a bottleneck here.
Other things of interest:
* -next and V2.5 both exhibit the issue, but V2.2 does not
* By observation on the wire I see that the Linux NFS client is submitting 16 or more 1 MB
READ RPCs at once. If I prevent that behavior by adding 'iflag=direct' to the dd
command, suddenly scalability is back where it should be. Something about having a lot of
read i/o in flight seems to matter here.
* I grabbed several core dumps of ganesha during a period where 8 clients were hitting it.
Every single thread is idle (typically pthread_cond_wait'ing for some work) except for
one rpc worker which is in writev. This is true repeatedly throughout the test. It is as
if somehow a single rpc worker thread is doing all of the network i/o to every client.
Thanks in advance for any ideas...
--Adam