On Mon, Sep 17, 2018 at 9:16 PM, William Allen Simpson <william.allen.simpson@gmail.com> wrote:

[Back to the OP]

On 9/11/18 12:58 PM, Kropelin, Adam wrote:

My test setup has a ganesha server with a single export on the VFS FSAL. I have multiple Linux clients, all mounting that export with NFSv4.0.

I'm at the Bake-a-thon, and the first question the kernel folks asked
was "What version of client are you running?"

Apparently, there was a time period where the buffer estimation was
broken. It isn't supposed to send too many parallel requests, until the
outstanding data was received.

On the clients I run a simple read workload using dd: 'dd if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read the same 1 GB file. Each client is bandwidth-limited to 1 Gbps while the server has 10 Gbps available. A single client achieves ~100 MB/sec. Adding a second client brings the aggregate throughput up to ~120 MB/sec. A third client gets the aggregate to ~130 MB/sec, and it pretty much plateaus at that point. Clearly this is well below the aggregate bandwidth the server is capable of.

After consultation, this is now easy to explain. Each client is sending
very short requests. They show up nearly simultaneously. They are
passed into the VFS FSAL, each in its own thread, all at the same time.
They each make a system call (and wait). The systems calls all return at
nearly the same time. Then queued for output.

Essentially, you've turned a set of sequential reads into random reads.
Maybe you are hoping that will allow the disk scheduler to process all
the system requests in parallel, seeking in the most optimal order.

But usually, random reads are slower than sequential reads.

Additionally, and this is the behavior that made me originally discover this issue in production, while the clients are performing their read test, the server becomes extremely slow to respond to mount requests. By "extremely slow" I mean it takes 60 seconds or more to perform a simple mount while 8 clients are running the read test.

Yes. Because they are being performed in parallel, they are all
arriving at the output queue at nearly the same time. For all
the clients. Not necessarily in any order. A huge pile of data.
No system memory for it all.

The queue is full. It will not empty until the Acks have come back
from the clients. Those Acks are probably slow, because the clients
have to process the large amounts of data, and their links are
slower than the server. (Also TCP only Acks every other segment.)

Then your mount request arrives. It goes at the end of the queue.
It will wait for all the outstanding data.

Repeatedly, I proposed implementing "Weighted Fair Queuing". See:
Mon, 8 Jun 2015 15:00:56 -0400
Wed, 2 Sep 2015 10:35:38 -0400
Thu, 9 Mar 2017 02:44:29 -0500

Each time, I was reprimanded *IN* *PUBLIC* (on the list) for proposing
something beyond the then current plan.

I asked for help designing an API for the FSALs.

So now you've got async, top to bottom, in the RPC stack, simulated
where we don't have actual async operations. It's more than 400%
faster for short IOPs than the old Ganesha V2.3 crufty code.

I still think WFQ would have been helpful.... Now maybe you do too?

But the one line patch that I've proposed should help a small amount.
It's not a panacea.

MainNFSD/nfs_rpc_dispatcher_thread.c nfs_rpc_tcp_user_data(), add one
line immediately before the return (line 1239):
+ newxprt->xp_ifindex = newxprt->xp_fd;

Let us know whether it does?

_______________________________________________
Devel mailing list -- devel@lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org