Hello,
I am observing a scalability issue with recent-ish versions of nfs-ganesha (including
-next) when NFS clients have a significant number of in-flight read requests.
My test setup has a ganesha server with a single export on the VFS FSAL. I have multiple
Linux clients, all mounting that export with NFSv4.0. On the clients I run a simple read
workload using dd: 'dd if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read
the same 1 GB file. Each client is bandwidth-limited to 1 Gbps while the server has 10
Gbps available. A single client achieves ~100 MB/sec. Adding a second client brings the
aggregate throughput up to ~120 MB/sec. A third client gets the aggregate to ~130 MB/sec,
and it pretty much plateaus at that point. Clearly this is well below the aggregate
bandwidth the server is capable of.
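For completeness, the per-client steps are roughly the following (the server name and
paths are placeholders for my actual setup):

    # mount the ganesha export over NFSv4.0
    mount -t nfs -o vers=4.0 ganesha-server:/test /mnt/test

    # buffered sequential read of the shared 1 GB file
    dd if=/mnt/test/testfile of=/dev/null bs=1M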
Additionally -- and this is the behavior that originally led me to discover the issue in
production -- while the clients are performing their read test, the server becomes extremely
slow to respond to mount requests. By "extremely slow" I mean a simple mount takes 60
seconds or more while 8 clients are running the read test.
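For what it's worth, that 60+ second number is just from timing a plain mount from an
otherwise idle client while the read test is running, along the lines of (hostname and
path are placeholders):

    time mount -t nfs -o vers=4.0 ganesha-server:/test /mnt/probe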
I've ruled out external bottlenecks -- disk I/O on the server is essentially zero
during the test (as you'd expect, since that 1 GB file is almost certainly in page
cache), and the server shows no significant CPU load at all. Using the in-kernel NFS
server with the same clients I can easily saturate the 10 Gbps network link with 8-10
clients, with no effect on mount times, so the network is not the bottleneck here.
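Those disk and CPU observations come from nothing fancier than watching standard tools
on the server during the test, something like:

    iostat -x 1      # %util on the backing device stays essentially at zero
    mpstat -P ALL 1  # no core comes anywhere close to saturation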
Other things of interest:
* -next and V2.5 both exhibit the issue, but V2.2 does not
* By observing the wire I can see that the Linux NFS client submits 16 or more 1 MB
READ RPCs at once. If I prevent that behavior by adding 'iflag=direct' to the dd
command, scalability is suddenly back where it should be (see the sketch after this
list). Something about having a lot of read I/O in flight seems to matter here.
* I grabbed several core dumps of ganesha while 8 clients were hitting it. Every single
thread is idle (typically pthread_cond_wait'ing for work) except for one RPC worker,
which is in writev(), and that holds repeatedly throughout the test (inspection commands
sketched below). It is as if a single RPC worker thread is doing all of the network I/O
to every client.
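Concretely, the direct-I/O variant that restores scalability is just the same dd with
O_DIRECT, and the wire observation is from an ordinary packet capture (interface name
is a placeholder):

    # bypass the client page cache so reads are issued synchronously, one at a time
    dd if=/mnt/test/testfile of=/dev/null bs=1M iflag=direct

    # capture NFS traffic to count in-flight READ RPCs
    tcpdump -i eth0 -s 0 -w nfs-read.pcap port 2049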
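The core dump inspection is just the usual gcore + gdb thread dump, something like this
(the binary path may differ depending on how ganesha was installed):

    gcore -o /tmp/ganesha-core $(pidof ganesha.nfsd)
    gdb -batch -ex 'thread apply all bt' /usr/bin/ganesha.nfsd /tmp/ganesha-core.<pid>

That is where the "one worker in writev" observation above comes from.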
Thanks in advance for any ideas...
--Adam