From: Malahal Naineni [mailto:malahal@gmail.com]
William Allen Simpson <william.allen.simpson@gmail.com>
> If they've changed it, that would be a horrible
> waste of memory. Most sockets aren't active at the same time.
I've never worked in kernel networking, but it seems to start small and
then grow based on TCP autotuning?
> When I originally designed this code 3 years ago, I'd deliberately
> serialized the per socket transactions so this wouldn't happen.
NFS COMMITs/WRITEs take a lot more time than other commands, and the same
is true of a READDIR on a very large directory. Clients send multiple
requests at the same time, and it makes sense to process them in parallel.
I don't know whether we'd gain anything by parallelizing the COMPOUND
itself, though.
> we have a limited number of cores.
Believe it or not, a few of our customers are running with 16K NB_WORKER
threads! I was surprised as well. They experimented and found that value
better than 1K, 4K, etc. They have 200GB RAM and 50-100 cores. Also, in
some cases these are dedicated systems just for NFS.
In many (most?) cases I'd expect us to be I/O bound in the FSAL. When that's true,
additional threads allow for concurrency without requiring a core to "back" each
one. N_THREADS > N_CORES is typical for I/O bound services.
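As a rough illustration of that arithmetic (the figures below are invented
for the example, not measurements from any deployment), Little's law puts
the number of in-flight requests, and therefore the number of worker
threads needed when each request parks a thread in the FSAL, at roughly
the offered load multiplied by the per-request wait time, independent of
core count:

#include <stdio.h>

/* Back-of-envelope only: all three inputs are made-up example values. */
int main(void)
{
        double req_per_sec   = 50000.0; /* hypothetical offered load */
        double fsal_wait_sec = 0.020;   /* hypothetical time each request blocks in the FSAL */
        double cores         = 64.0;    /* cores on the server */

        /* Little's law: concurrency = arrival rate * time in system. */
        double in_flight = req_per_sec * fsal_wait_sec;

        printf("worker threads needed to cover I/O wait: ~%.0f\n", in_flight);
        printf("that is ~%.1f threads per core\n", in_flight / cores);
        return 0;
}

With those made-up numbers you land at about a thousand workers on a
64-core box; push the wait time or offered load higher and thread counts
in the thousands stop looking strange.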
>> So if you have two NFS clients; one on wifi that can only do
>> 54Mbps/sec and the other on a cable that can do 10Gbps, your entire
>> NFS server with a single sender thread will be waiting in writev()
>> for the most part due to the slow client.
> Again, that would only happen by running out of buffer space on the
> slow connection. If you're asking for more data than the buffer
> space, or queuing concurrent responses on the same socket that
> exceed the buffer space, the single thread will be stalled by the
> slowest connection.
This is essentially what is happening, although the "buffer space" is
behaving as if it's per-socket, not system-wide. I can't say whether
that's a change in kernel behavior, but that's how it's acting today.
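For anyone who wants to observe this directly, here's a minimal sketch
(not code from the ganesha/ntirpc tree) that reports the kernel's current
send buffer for a connected TCP socket; with autotuning
(net.ipv4.tcp_wmem) the value returned tends to differ per connection and
to grow as a connection ramps up, which matches the per-socket behavior
described above:

#include <stdio.h>
#include <sys/socket.h>

/* Print the kernel's current send-buffer size for one socket.
 * Intended to be called on an accepted, connected TCP fd. */
void report_sndbuf(int fd)
{
        int sndbuf = 0;
        socklen_t len = sizeof(sndbuf);

        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0)
                printf("fd %d: SO_SNDBUF = %d bytes\n", fd, sndbuf);
        else
                perror("getsockopt(SO_SNDBUF)");
}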
Locally I've experimented with two approaches to address this.

In one, we hash xprts to queues rather than using ifindex. This is similar
in concept to Malahal's old patch incrementing ifindex (which I was
unaware of until now), but it has the additional benefit of ensuring that
traffic for any given xprt always lands on the same queue. That means a
single slow client affects only one queue/thread, not all of them. Of
course, as the number of clients approaches and then exceeds IOQ_IF_SIZE,
scalability becomes limited again. To address that, I've also been testing
a queue-per-xprt patch, which moves the poolq_head into the xprt structure
(and leaves pretty much everything else unchanged). That approach is
showing excellent linear scalability and is impervious to slow or
otherwise misbehaving clients.
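To make the first approach concrete, here's the shape of it as a sketch;
the struct and function names below are placeholders rather than the
actual ntirpc ones. The output queue is chosen by hashing the xprt itself
instead of using the interface index, so all replies for a given
connection stay on one queue and a slow client can only back up that one
queue and its sender thread:

#include <stdint.h>

#define IOQ_IF_SIZE 32          /* number of output queues */

struct xprt;                    /* stand-in for the real transport handle */

/* Map an xprt to a queue index.  Any stable hash works; the point is
 * that the mapping never changes for the lifetime of the xprt, unlike
 * an incrementing ifindex. */
static inline uint32_t
ioq_index_for_xprt(const struct xprt *xprt)
{
        uint64_t p = (uint64_t)(uintptr_t)xprt;

        /* Fibonacci hashing of the pointer value. */
        return (uint32_t)((p * 0x9E3779B97F4A7C15ULL) >> 32) % IOQ_IF_SIZE;
}

The multiplier is arbitrary; anything that spreads pointer values evenly
across the IOQ_IF_SIZE buckets will do.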
Here are some numbers from before and after the queue-per-xprt patch:
Client specs: 1 Gbps network, workload is 'cp /mnt/test/4GiB.file /dev/null'
Server specs: 10 Gbps network, 32 cores, 244 GiB RAM
BEFORE:
Clients   Mount Time (sec)   Aggregate Throughput (MiB/sec)
      0               0.04                                0
      1               2.56                              112
      2               6.40                              122
      4              14.37                              125
      8              28.45                              130
     16              53.13                              132
     32    <too impatient to wait>
     64    <too impatient to wait>
    128             257.18                              140
    150    <too impatient to wait>
AFTER:
Clients   Mount Time (sec)   Aggregate Throughput (MiB/sec)
      0               0.04                                0
      1               0.04                              120
      2               0.04                              240
      4               0.04                              480
      8               0.05                              960
     16               0.46                             1217
     32               1.34                             1217
     64               2.90                             1217
    128               6.12                             1217
    150               7.19                             1217
Note that in the BEFORE case, mount latency grows rapidly and the
aggregate throughput achieved is little more than the capacity of a single
client. After the change we see linear scaling until the server NIC is
saturated, and only then do we see a modest increase in mount latency.
--Adam