We are about to do the same with v2.7 rc5 (at this time) from ~200 clients, but so far we have not seen anything from our limited test case running with 12 clients. We are waiting on replacement HW for the InfiniBand card, as we can saturate the 10G card pretty quickly.

Do you have any sysctl tuning parameters set up on the clients?
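For example, the kind of client-side settings I have in mind (illustrative values only, not a recommendation -- the right numbers depend on your setup):

    # TCP buffer sizing commonly adjusted for NFS clients on 10G+ networks
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216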

On Tue, Sep 11, 2018 at 9:59 AM Kropelin, Adam <kropelin@amazon.com> wrote:
Hello,

I am observing a scalability issue with recent-ish versions of nfs-ganesha (including -next) when NFS clients have a significant amount of in-flight read requests.

My test setup has a ganesha server with a single export on the VFS FSAL. I have multiple Linux clients, all mounting that export with NFSv4.0. On the clients I run a simple read workload using dd: 'dd if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read the same 1 GB file. Each client is bandwidth-limited to 1 Gbps while the server has 10 Gbps available. A single client achieves ~100 MB/sec. Adding a second client brings the aggregate throughput up to ~120 MB/sec. A third client gets the aggregate to ~130 MB/sec, and it pretty much plateaus at that point. Clearly this is well below the aggregate bandwidth the server is capable of.

Additionally, and this is the behavior that made me originally discover this issue in production, while the clients are performing their read test, the server becomes extremely slow to respond to mount requests. By "extremely slow" I mean it takes 60 seconds or more to perform a simple mount while 8 clients are running the read test.

I've ruled out external bottlenecks -- disk i/o on the server is essentially zero during the test (as would be expected since that 1 GB file will most certainly be in page cache). The server shows no significant CPU load at all. Using the in-kernel NFS server with the same clients I can easily saturate the 10 Gbps network link from 8-10 clients with no effect on mount times, so network is not a bottleneck here.

Other things of interest:
* -next and V2.5 both exhibit the issue, but V2.2 does not
* By observation on the wire I see that the Linux NFS client is submitting 16 or more 1 MB READ RPCs at once. If I prevent that behavior by adding 'iflag=direct' to the dd command (exact invocation below), scalability suddenly goes back to where it should be. Something about having a lot of read i/o in flight seems to matter here.
* I grabbed several core dumps of ganesha during a period where 8 clients were hitting it. Every single thread is idle (typically pthread_cond_wait'ing for some work) except for one rpc worker which is in writev. This is true repeatedly throughout the test. It is as if somehow a single rpc worker thread is doing all of the network i/o to every client.
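For reference, the direct I/O variant mentioned above is just the same dd invocation with iflag=direct added:

    # per-client read test using O_DIRECT, which keeps the client from
    # pipelining 16+ 1 MB READ RPCs at once
    dd if=/mnt/test/testfile of=/dev/null bs=1M iflag=direct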

Thanks in advance for any ideas...
--Adam


--
Regards,
Imam Toufique
213-700-5485