Scalability issue with VFS FSAL and large amounts of read i/o in flight

Change in ffilz/nfs-ganesha[next]:...

Kropelin, Adam

Tuesday, 11 September 2018 Tue, 11 Sep '18

11:58 a.m.

Hello, I am observing a scalability issue with recent-ish versions of nfs-ganesha (including -next) when NFS clients have a significant amount of in-flight read requests. My test setup has a ganesha server with a single export on the VFS FSAL. I have multiple Linux clients, all mounting that export with NFSv4.0. On the clients I run a simple read workload using dd: 'dd if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read the same 1 GB file. Each client is bandwidth-limited to 1 Gbps while the server has 10 Gbps available. A single client achieves ~100 MB/sec. Adding a second client brings the aggregate throughput up to ~120 MB/sec. A third client gets the aggregate to ~130 MB/sec, and it pretty much plateaus at that point. Clearly this is well below the aggregate bandwidth the server is capable of. Additionally, and this is the behavior that made me originally discover this issue in production, while the clients are performing their read test, the server becomes extremely slow to respond to mount requests. By "extremely slow" I mean it takes 60 seconds or more to perform a simple mount while 8 clients are running the read test. I've ruled out external bottlenecks -- disk i/o on the server is essentially zero during the test (as would be expected since that 1 GB file will most certainly be in page cache). The server shows no significant CPU load at all. Using the in-kernel NFS server with the same clients I can easily saturate the 10 Gpbs network link from 8-10 clients with no effect on mount times, so network is not a bottleneck here. Other things of interest: * -next and V2.5 both exhibit the issue, but V2.2 does not * By observation on the wire I see that the Linux NFS client is submitting 16 or more 1 MB READ RPCs at once. If I prevent that behavior by adding 'iflag=direct' to the dd command, suddenly scalability is back where it should be. Something about having a lot of read i/o in flight seems to matter here. * I grabbed several core dumps of ganesha during a period where 8 clients were hitting it. Every single thread is idle (typically pthread_cond_wait'ing for some work) except for one rpc worker which is in writev. This is true repeatedly throughout the test. It is as if somehow a single rpc worker thread is doing all of the network i/o to every client. Thanks in advance for any ideas... --Adam

Show replies by date

Imam Toufique

Tuesday, 11 September Tue, 11 Sep

3:31 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

we are about to do the same with v2.7 rc5 ( at this time) from ~ 200 clients, but we have not seen anything yet from our limited test case running with 12 clients, so far. We waiting on replacement HW for the InfiniBand card, as we can saturate the 10G card pretty quickly. Do you have any sysctl tuning parameters setup in the clients? On Tue, Sep 11, 2018 at 9:59 AM Kropelin, Adam <kropelin(a)amazon.com> wrote:

...

-- Regards, *Imam Toufique* *213-700-5485*

Kropelin, Adam

3:45 p.m.

I did some more digging and bisected my way to V2.5-dev-11 as the first tag showing the undesired behavior. Looking deeper, the cause is this: https://github.com/nfs-ganesha/ntirpc/pull/35/commits/3643f4fe45867f4b1ad... If I'm understanding this right, it appears to be using a single thread per interface. That would explain the observation that a single thread is blocked in writev() whenever I look. It also explains why when clients are bandwidth-limited, the aggregate throughput does not scale much above that consumed by a single client. With a modest amount of read i/o in flight to each client, we'll quickly fill the socket buffer for any one client and block waiting for data to flow while the other tcp streams sit idle. I suspect it also means a single out-to-lunch client could stall *all* i/o on the interface, which is another behavior I've been seeing recently. (Due to clients rebooting or otherwise going awol without umounting or closing the tcp connection.) Non-blocking I/O would be the answer here, but without that...throw some more threads at it, I guess? --Adam -----Original Message----- From: Kropelin, Adam [mailto:kropelin@amazon.com] Sent: Tuesday, September 11, 2018 12:59 PM To: devel(a)lists.nfs-ganesha.org Subject: [NFS-Ganesha-Devel] Scalability issue with VFS FSAL and large amounts of read i/o in flight Hello, I am observing a scalability issue with recent-ish versions of nfs-ganesha (including -next) when NFS clients have a significant amount of in-flight read requests. My test setup has a ganesha server with a single export on the VFS FSAL. I have multiple Linux clients, all mounting that export with NFSv4.0. On the clients I run a simple read workload using dd: 'dd if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read the same 1 GB file. Each client is bandwidth-limited to 1 Gbps while the server has 10 Gbps available. A single client achieves ~100 MB/sec. Adding a second client brings the aggregate throughput up to ~120 MB/sec. A third client gets the aggregate to ~130 MB/sec, and it pretty much plateaus at that point. Clearly this is well below the aggregate bandwidth the server is capable of. Additionally, and this is the behavior that made me originally discover this issue in production, while the clients are performing their read test, the server becomes extremely slow to respond to mount requests. By "extremely slow" I mean it takes 60 seconds or more to perform a simple mount while 8 clients are running the read test. I've ruled out external bottlenecks -- disk i/o on the server is essentially zero during the test (as would be expected since that 1 GB file will most certainly be in page cache). The server shows no significant CPU load at all. Using the in-kernel NFS server with the same clients I can easily saturate the 10 Gpbs network link from 8-10 clients with no effect on mount times, so network is not a bottleneck here. Other things of interest: * -next and V2.5 both exhibit the issue, but V2.2 does not * By observation on the wire I see that the Linux NFS client is submitting 16 or more 1 MB READ RPCs at once. If I prevent that behavior by adding 'iflag=direct' to the dd command, suddenly scalability is back where it should be. Something about having a lot of read i/o in flight seems to matter here. * I grabbed several core dumps of ganesha during a period where 8 clients were hitting it. Every single thread is idle (typically pthread_cond_wait'ing for some work) except for one rpc worker which is in writev. This is true repeatedly throughout the test. It is as if somehow a single rpc worker thread is doing all of the network i/o to every client. Thanks in advance for any ideas... --Adam _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

William Allen Simpson

Wednesday, 12 September Wed, 12 Sep

6:12 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/11/18 4:45 PM, Kropelin, Adam wrote:

...

I suspect it also means a single out-to-lunch client could stall *all* i/o on the interface, which is another behavior I've been seeing recently. (Due to clients rebooting or otherwise going awol without umounting or closing the tcp connection.)

This is true. Once the kernel I-O buffers are all full because a TCP client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. Thus the real problem is the client asking for megabytes of data in the faint hope that will somehow be faster -- then crashing. This has been a known problem for decades. So the TCPM WG developed the TCP User Timeout option [RFC5482]. Malahal had a patch some time ago to timeout the client using another means, without depending upon the option. Didn't that go in?

...

Non-blocking I/O would be the answer here, but without that...throw some more threads at it, I guess?

Since V2.3 (before my time), we've been using IO vector zero-copy. Posix allows either iov or async, but not both in the same call. More threads won't help. It's a stall at the kernel level. In fact, one thread per interface proved to be fastest, as that minimizes locking conflicts and system calls (and improves CPU cache coherency).

Malahal Naineni

Thursday, 13 September Thu, 13 Sep

2:21 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

> Once the kernel I-O buffers are all full because a TCP

client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. I would say it is a poor design if that is really true, and I am pretty sure Linux doesn't have that issue based on what we saw in the field. Ganesha was stalled in its writev() call but all other networking stuff was working just fine.

...

> More threads won't help. It's a stall at the kernel level.

We have a proof that more threads helped! On Wed, Sep 12, 2018 at 4:42 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote:

...

On 9/11/18 4:45 PM, Kropelin, Adam wrote: > I suspect it also means a single out-to-lunch client could stall *all* > i/o on the interface, which is another behavior I've been seeing recently. > (Due to clients rebooting or otherwise going awol without umounting or > closing the tcp connection.) > > This is true. Once the kernel I-O buffers are all full because a TCP client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. Thus the real problem is the client asking for megabytes of data in the faint hope that will somehow be faster -- then crashing. This has been a known problem for decades. So the TCPM WG developed the TCP User Timeout option [RFC5482]. Malahal had a patch some time ago to timeout the client using another means, without depending upon the option. Didn't that go in? Non-blocking I/O would be the answer here, but without that...throw some > more threads at it, I guess? > > Since V2.3 (before my time), we've been using IO vector zero-copy. Posix allows either iov or async, but not both in the same call. More threads won't help. It's a stall at the kernel level. In fact, one thread per interface proved to be fastest, as that minimizes locking conflicts and system calls (and improves CPU cache coherency). _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

Imam Toufique

2:52 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

>> Once the kernel I-O buffers are all full because a TCP client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. I would say it is a poor design if that is really true, and I am pretty sure Linux doesn't have that issue based on what we saw in the field. Ganesha was stalled in its writev() call but all other networking stuff was working just fine. >> More threads won't help. It's a stall at the kernel level. We have a proof that more threads helped! On Wed, Sep 12, 2018 at 4:42 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote: > On 9/11/18 4:45 PM, Kropelin, Adam wrote: > >> I suspect it also means a single out-to-lunch client could stall *all* >> i/o on the interface, which is another behavior I've been seeing recently. >> (Due to clients rebooting or otherwise going awol without umounting or >> closing the tcp connection.) >> >> This is true. Once the kernel I-O buffers are all full because a TCP > client has stopped Ack'ing them, no other connection can send over that > interface. That's just a fact of any kernel. > > Thus the real problem is the client asking for megabytes of data in the > faint hope that will somehow be faster -- then crashing. > > This has been a known problem for decades. So the TCPM WG developed > the TCP User Timeout option [RFC5482]. > > Malahal had a patch some time ago to timeout the client using another > means, without depending upon the option. Didn't that go in? > > > Non-blocking I/O would be the answer here, but without that...throw some >> more threads at it, I guess? >> >> Since V2.3 (before my time), we've been using IO vector zero-copy. > Posix allows either iov or async, but not both in the same call. > > More threads won't help. It's a stall at the kernel level. In fact, > one thread per interface proved to be fastest, as that minimizes > locking conflicts and system calls (and improves CPU cache coherency). > > _______________________________________________ > Devel mailing list -- devel(a)lists.nfs-ganesha.org > To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org > _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

-- Regards, *Imam Toufique* *213-700-5485*

Daniel Gryniewicz

Friday, 14 September Fri, 14 Sep

8:35 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

We definitely need to fix this. It's on the list. Daniel On 09/13/2018 03:52 PM, Imam Toufique wrote:

...

We are about to start using Ganesha , just starting to put things together. If this is right, shouldn't there be a fix or an ( if any ) workaround? :-) thanks. On Thu, Sep 13, 2018 at 12:21 PM Malahal Naineni <malahal(a)gmail.com <mailto:malahal@gmail.com>> wrote: >> Once the kernel I-O buffers are all full because a TCP client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. I would say it is a poor design if that is really true, and I am pretty sure Linux doesn't have that issue based on what we saw in the field. Ganesha was stalled in its writev() call but all other networking stuff was working just fine. >> More threads won't help. It's a stall at the kernel level. We have a proof that more threads helped! On Wed, Sep 12, 2018 at 4:42 PM, William Allen Simpson <william.allen.simpson(a)gmail.com <mailto:william.allen.simpson@gmail.com>> wrote: On 9/11/18 4:45 PM, Kropelin, Adam wrote: I suspect it also means a single out-to-lunch client could stall *all* i/o on the interface, which is another behavior I've been seeing recently. (Due to clients rebooting or otherwise going awol without umounting or closing the tcp connection.) This is true. Once the kernel I-O buffers are all full because a TCP client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. Thus the real problem is the client asking for megabytes of data in the faint hope that will somehow be faster -- then crashing. This has been a known problem for decades. So the TCPM WG developed the TCP User Timeout option [RFC5482]. Malahal had a patch some time ago to timeout the client using another means, without depending upon the option. Didn't that go in? Non-blocking I/O would be the answer here, but without that...throw some more threads at it, I guess? Since V2.3 (before my time), we've been using IO vector zero-copy. Posix allows either iov or async, but not both in the same call. More threads won't help. It's a stall at the kernel level. In fact, one thread per interface proved to be fastest, as that minimizes locking conflicts and system calls (and improves CPU cache coherency). _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org> _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org> -- Regards, */Imam Toufique/* /*213-700-5485*/ _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

Imam Toufique

3:05 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

can this fix be expected in the 2.7 release? thanks. On Fri, Sep 14, 2018 at 6:35 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:

...

We definitely need to fix this. It's on the list. Daniel On 09/13/2018 03:52 PM, Imam Toufique wrote: > We are about to start using Ganesha , just starting to put things > together. If this is right, shouldn't there be a fix or an ( if any ) > workaround? :-) > > thanks. > > On Thu, Sep 13, 2018 at 12:21 PM Malahal Naineni <malahal(a)gmail.com > <mailto:malahal@gmail.com>> wrote: > > >> Once the kernel I-O buffers are all full because a TCP > client has stopped Ack'ing them, no other connection can send over that > interface. That's just a fact of any kernel. > > I would say it is a poor design if that is really true, and I am > pretty sure Linux doesn't have that issue based on what we saw in > the field. Ganesha was stalled in its writev() call but all other > networking stuff was working just fine. > >> More threads won't help. It's a stall at the kernel level. > > We have a proof that more threads helped! > > On Wed, Sep 12, 2018 at 4:42 PM, William Allen Simpson > <william.allen.simpson(a)gmail.com > <mailto:william.allen.simpson@gmail.com>> wrote: > > On 9/11/18 4:45 PM, Kropelin, Adam wrote: > > I suspect it also means a single out-to-lunch client could > stall *all* i/o on the interface, which is another behavior > I've been seeing recently. (Due to clients rebooting or > otherwise going awol without umounting or closing the tcp > connection.) > > This is true. Once the kernel I-O buffers are all full because > a TCP > client has stopped Ack'ing them, no other connection can send > over that > interface. That's just a fact of any kernel. > > Thus the real problem is the client asking for megabytes of data > in the > faint hope that will somehow be faster -- then crashing. > > This has been a known problem for decades. So the TCPM WG developed > the TCP User Timeout option [RFC5482]. > > Malahal had a patch some time ago to timeout the client using > another > means, without depending upon the option. Didn't that go in? > > > Non-blocking I/O would be the answer here, but without > that...throw some more threads at it, I guess? > > Since V2.3 (before my time), we've been using IO vector zero-copy. > Posix allows either iov or async, but not both in the same call. > > More threads won't help. It's a stall at the kernel level. In > fact, > one thread per interface proved to be fastest, as that minimizes > locking conflicts and system calls (and improves CPU cache > coherency). > > _______________________________________________ > Devel mailing list -- devel(a)lists.nfs-ganesha.org > <mailto:devel@lists.nfs-ganesha.org> > To unsubscribe send an email to > devel-leave(a)lists.nfs-ganesha.org > <mailto:devel-leave@lists.nfs-ganesha.org> > > > _______________________________________________ > Devel mailing list -- devel(a)lists.nfs-ganesha.org > <mailto:devel@lists.nfs-ganesha.org> > To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org > <mailto:devel-leave@lists.nfs-ganesha.org> > > > > -- > Regards, > */Imam Toufique/* > /*213-700-5485*/ > > > _______________________________________________ > Devel mailing list -- devel(a)lists.nfs-ganesha.org > To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org > _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

-- Regards, *Imam Toufique* *213-700-5485*

Frank Filz

3:13 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

This will have to be fixed in V2.7.1, I am in the process of completing this week’s merge and tagging V2.7.0. Frank From: Imam Toufique [mailto:techie879@gmail.com] Sent: Friday, September 14, 2018 1:05 PM To: dang(a)redhat.com Cc: devel(a)lists.nfs-ganesha.org Subject: [NFS-Ganesha-Devel] Re: Scalability issue with VFS FSAL and large amounts of read i/o in flight can this fix be expected in the 2.7 release? thanks. On Fri, Sep 14, 2018 at 6:35 AM Daniel Gryniewicz <dang(a)redhat.com <mailto:dang@redhat.com> > wrote: We definitely need to fix this. It's on the list. Daniel On 09/13/2018 03:52 PM, Imam Toufique wrote:

...

We are about to start using Ganesha , just starting to put things together. If this is right, shouldn't there be a fix or an ( if any ) workaround? :-) thanks. On Thu, Sep 13, 2018 at 12:21 PM Malahal Naineni <malahal(a)gmail.com <mailto:malahal@gmail.com> <mailto:malahal@gmail.com <mailto:malahal@gmail.com> >> wrote: >> Once the kernel I-O buffers are all full because a TCP client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. I would say it is a poor design if that is really true, and I am pretty sure Linux doesn't have that issue based on what we saw in the field. Ganesha was stalled in its writev() call but all other networking stuff was working just fine. >> More threads won't help. It's a stall at the kernel level. We have a proof that more threads helped! On Wed, Sep 12, 2018 at 4:42 PM, William Allen Simpson <william.allen.simpson(a)gmail.com <mailto:william.allen.simpson@gmail.com> <mailto:william.allen.simpson@gmail.com <mailto:william.allen.simpson@gmail.com> >> wrote: On 9/11/18 4:45 PM, Kropelin, Adam wrote: I suspect it also means a single out-to-lunch client could stall *all* i/o on the interface, which is another behavior I've been seeing recently. (Due to clients rebooting or otherwise going awol without umounting or closing the tcp connection.) This is true. Once the kernel I-O buffers are all full because a TCP client has stopped Ack'ing them, no other connection can send over that interface. That's just a fact of any kernel. Thus the real problem is the client asking for megabytes of data in the faint hope that will somehow be faster -- then crashing. This has been a known problem for decades. So the TCPM WG developed the TCP User Timeout option [RFC5482]. Malahal had a patch some time ago to timeout the client using another means, without depending upon the option. Didn't that go in? Non-blocking I/O would be the answer here, but without that...throw some more threads at it, I guess? Since V2.3 (before my time), we've been using IO vector zero-copy. Posix allows either iov or async, but not both in the same call. More threads won't help. It's a stall at the kernel level. In fact, one thread per interface proved to be fastest, as that minimizes locking conflicts and system calls (and improves CPU cache coherency). _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> <mailto:devel@lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> > To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org> <mailto:devel-leave@lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org> > _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> <mailto:devel@lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> > To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org> <mailto:devel-leave@lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org> > -- Regards, */Imam Toufique/* /*213-700-5485*/ _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org>

_______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org <mailto:devel@lists.nfs-ganesha.org> To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org <mailto:devel-leave@lists.nfs-ganesha.org> -- Regards, Imam Toufique 213-700-5485

William Allen Simpson

Wednesday, 12 September Wed, 12 Sep

6:36 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

[Going back to your Original Post] On 9/11/18 12:58 PM, Kropelin, Adam wrote:

...

* By observation on the wire I see that the Linux NFS client is submitting 16 or more 1 MB READ RPCs at once. If I prevent that behavior by adding 'iflag=direct' to the dd command, suddenly scalability is back where it should be. Something about having a lot of read i/o in flight seems to matter here.

Remember, an interface handles one packet at a time. 16 parallel read requests will give you improvements in the case the FSAL is storage I-O bound. But you also indicate they are all the same data, so it should be reading from cache. So you're going to be network I-O bound, and 16 parallel requests won't help.

...

* I grabbed several core dumps of ganesha during a period where 8 clients were hitting it. Every single thread is idle (typically pthread_cond_wait'ing for some work) except for one rpc worker which is in writev. This is true repeatedly throughout the test. It is as if somehow a single rpc worker thread is doing all of the network i/o to every client.

Since V2.6, all the input RPC paths are async. This shows that all the data has been fetched, and you're waiting for the data to be sent out over one interface. It only takes one worker to do that. In fact, one worker is optimal.

Malahal Naineni

Thursday, 13 September Thu, 13 Sep

2:09 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

As I reported here https://github.com/nfs-ganesha/ntirpc/pull/137, Linux kernel reserves send memory per socket. So if you have two NFS clients; one on wifi that can only do 54Mbps/sec and the other on a cable that can do 10Gbps, your entire NFS server with a single sender thread will be waiting in writev() for the most part due to the slow client. This issue is very easy to reproduce and my xp_ifindex patch (not the timeout one) helped in the following case: We have an NFS server that can do 40Gbps, but the connected NFS clients can only do 10Gbps. The plain ganesha2.5 code did max out one client connection with reads (throughput was close to 10Gbps) when one client was active. If we used 2 clients, the total throughput was still 10Gbps. After the fix, each client got close to 10Gbps. Regards, Malahal. On Wed, Sep 12, 2018 at 5:06 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote:

...

[Going back to your Original Post] On 9/11/18 12:58 PM, Kropelin, Adam wrote: > * By observation on the wire I see that the Linux NFS client is > submitting 16 or more 1 MB READ RPCs at once. If I prevent that behavior by > adding 'iflag=direct' to the dd command, suddenly scalability is back where > it should be. Something about having a lot of read i/o in flight seems to > matter here. > Remember, an interface handles one packet at a time. 16 parallel read requests will give you improvements in the case the FSAL is storage I-O bound. But you also indicate they are all the same data, so it should be reading from cache. So you're going to be network I-O bound, and 16 parallel requests won't help. * I grabbed several core dumps of ganesha during a period where 8 clients > were hitting it. Every single thread is idle (typically > pthread_cond_wait'ing for some work) except for one rpc worker which is in > writev. This is true repeatedly throughout the test. It is as if somehow a > single rpc worker thread is doing all of the network i/o to every client. > > Since V2.6, all the input RPC paths are async. This shows that all the data has been fetched, and you're waiting for the data to be sent out over one interface. It only takes one worker to do that. In fact, one worker is optimal. _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

William Allen Simpson

Friday, 14 September Fri, 14 Sep

2:41 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/13/18 3:09 PM, Malahal Naineni wrote:

...

As I reported here https://github.com/nfs-ganesha/ntirpc/pull/137 <https://github.com/nfs-ganesha/ntirpc/pull/137>;, Linux kernel reserves send memory per socket.

Well, that's not how it used to work, and that doesn't match the documentation. If they've changed it, that would be a horrible waste of memory. Most sockets aren't active at the same time. But then, I've not contributed to the kernel TCP stack since 2009. I'm currently planning on attending Bake-a-thon next week. We should look into this!

...

So if you have two NFS clients; one on wifi that can only do 54Mbps/sec and the other on a cable that can do 10Gbps, your entire NFS server with a single sender thread will be waiting in writev() for the most part due to the slow client.

Again, that would only happen by running out of buffer space on the slow connection. If you're asking for more data than the buffer space, or queuing concurrent responses on the same socket that exceed the buffer space, the single thread will be stalled by the slowest connection. When I originally designed this code 3 years ago, I'd deliberately serialized the per socket transactions so this wouldn't happen. But a certain someone wanted multiple requests on the same socket to be concurrent, though that would mean some responses might finish out of order. (I've never thought that was a good idea.) Recently our maintainer was proposing that NFS Compound would also be parallelized. (I've never thought that was a good idea either.) As I've said and written over and over, we have a limited number of cores. Running a lot of threads means more overhead, and reduces cache coherency.

Malahal Naineni

3:29 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

> If they've changed it, that would be a horrible > waste of memory. Most sockets aren't active at the same time.

Never worked in kernel networking, but it seems to start small and then go up based on TCP autotuning?

...

> When I originally designed this code 3 years ago, I'd deliberately > serialized the per socket transactions so this wouldn't happen.

NFS COMMITs/WRITEs take a lot more time than other commands. Same is the case with a readdir on a very large directory. Clients send multiple requests at the same time, and it makes sense to process them in parallel. I don't know if we get benefit in parallelizing the COMPOUND itself though.

...

> we have a limited number of cores.

Believe it or not, few of our customers are running with 16K NB_WORKER threads! I was surprised as well. They experimented and found that value better than 1K or 4K etc. They have 200GB RAM and 50-100 cores. Also, in some cases, these are dedicated systems just for NFS. On Fri, Sep 14, 2018 at 1:11 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote:

...

On 9/13/18 3:09 PM, Malahal Naineni wrote: > As I reported here https://github.com/nfs-ganesha/ntirpc/pull/137 < > https://github.com/nfs-ganesha/ntirpc/pull/137>;, Linux kernel reserves > send memory per socket. > Well, that's not how it used to work, and that doesn't match the documentation. If they've changed it, that would be a horrible waste of memory. Most sockets aren't active at the same time. But then, I've not contributed to the kernel TCP stack since 2009. I'm currently planning on attending Bake-a-thon next week. We should look into this! So if you have two NFS clients; one on wifi that can only do 54Mbps/sec > and the other on a cable that can do 10Gbps, your entire NFS server with a > single sender thread will be waiting in writev() for the most part due to > the slow client. > > Again, that would only happen by running out of buffer space on the slow connection. If you're asking for more data than the buffer space, or queuing concurrent responses on the same socket that exceed the buffer space, the single thread will be stalled by the slowest connection. When I originally designed this code 3 years ago, I'd deliberately serialized the per socket transactions so this wouldn't happen. But a certain someone wanted multiple requests on the same socket to be concurrent, though that would mean some responses might finish out of order. (I've never thought that was a good idea.) Recently our maintainer was proposing that NFS Compound would also be parallelized. (I've never thought that was a good idea either.) As I've said and written over and over, we have a limited number of cores. Running a lot of threads means more overhead, and reduces cache coherency.

Kropelin, Adam

9:03 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

From: Malahal Naineni [mailto:malahal@gmail.com]

...

William Allen Simpson <william.allen.simpson(a)gmail.com> > If they've changed it, that would be a horrible > waste of memory. Most sockets aren't active at the same time.

...

Never worked in kernel networking, but it seems to start small and then go up based on TCP autotuning?

...

> When I originally designed this code 3 years ago, I'd deliberately > serialized the per socket transactions so this wouldn't happen.

...

> we have a limited number of cores.

...

In many (most?) cases I'd expect us to be I/O bound in the FSAL. When that's true, additional threads allow for concurrency without requiring a core to "back" each one. N_THREADS > N_CORES is typical for I/O bound services.

...

>> So if you have two NFS clients; one on wifi that can only do 54Mbps/sec and the other on a cable that can do 10Gbps, your entire NFS server with a single sender thread will be waiting in writev() for the most part due to the slow client.

...

> Again, that would only happen by running out of buffer space on the > slow connection. If you're asking for more data than the buffer > space, or queuing concurrent responses on the same socket that > exceed the buffer space, the single thread will be stalled by the > slowest connection.

This is essentially exactly what is happening, although the "buffer space" is behaving as if it's per-socket, not system-wide. I can't say whether that's a change in behavior of the kernel, but that's how it's acting today. Locally I've experimented with two approaches to address this. In one, we hash xprts to queues rather than using ifindex. This is similar in concept to Malahal's old patch incrementing ifindex (which I was unaware of until now) but has the additional benefit of ensuring that traffic for any given xprt always lands on the same queue. This means a single slow client affects only one queue/thread, not all of them. Of course as the number of clients approaches and then exceeds IOQ_IF_SIZE, scalability becomes limited again. To address that I've also been testing a queue-per-xprt patch which moves the poolq_head to the xprt structure (and leaves pretty much everything else unchanged). This approach is showing excellent linear scalability and is impervious to slow or otherwise misbehaving clients. Here are some numbers from before/after the thread-per-xprt patch: Client specs: 1 Gbps network, workload is 'cp /mnt/test/4GiB.file /dev/null' Server specs: 10 Gpbs network, 32 cores, 244 GiB RAM BEFORE: Clients Mount Time (sec) Aggregate Throughput (MiB/sec) 0 0.04 0 1 2.56 112 2 6.40 122 4 14.37 125 8 28.45 130 16 53.13 132 32 <too impatient to wait> 64 <too impatient to wait> 128 257.18 140 150 <too impatient to wait> AFTER: Clients Mount Time (sec) Aggregate Throughput (MiB/sec) 0 0.04 0 1 0.04 120 2 0.04 240 4 0.04 480 8 0.05 960 16 0.46 1217 32 1.34 1217 64 2.90 1217 128 6.12 1217 150 7.19 1217 Note that in the BEFORE case, mount latency grows rapidly and aggregate throughput achieved is little more than the capacity of a single client. After the change we see linear scalability until the server NIC is saturated, and only then do we see a modest increase in mount latency. --Adam

Kropelin, Adam

9:11 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

-----Original Message----- Locally I've experimented with two approaches to address this. In one, we hash xprts to queues rather than using ifindex. This is similar in concept to Malahal's old patch incrementing ifindex (which I was unaware of until now) but has the additional benefit of ensuring that traffic for any given xprt always lands on the same queue.

Malahal, I re-read your patch and realize now that you rotated ifindex at creation time so it has the same behavior as my hash in terms of xprt-to-queue affinity. I originally thought you were rotating through queues at enqueue time. So I think these approaches are equivalent. I was just using 'fd & IOQ_IF_MASK' as a cheap hash so I think your approach is better in the N-clients-to-M-queues space. But I'm interested in feedback on the queue-per-xprt approach as that one is giving me the scalability and reliability I really need with zero client interference. --Adam

Malahal Naineni

9:42 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

> -----Original Message----- > Locally I've experimented with two approaches to address this. > > In one, we hash xprts to queues rather than using ifindex. This is similar in > concept to Malahal's old patch incrementing ifindex (which I was unaware of > until now) but has the additional benefit of ensuring that traffic for any given > xprt always lands on the same queue. Malahal, I re-read your patch and realize now that you rotated ifindex at creation time so it has the same behavior as my hash in terms of xprt-to-queue affinity. I originally thought you were rotating through queues at enqueue time. So I think these approaches are equivalent. I was just using 'fd & IOQ_IF_MASK' as a cheap hash so I think your approach is better in the N-clients-to-M-queues space. But I'm interested in feedback on the queue-per-xprt approach as that one is giving me the scalability and reliability I really need with zero client interference. --Adam

Matt Benjamin

10 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

That's how it was when the ioq was introduced. I think Daniel has a fix in mind, but by all means, send this too. Matt On Fri, Sep 14, 2018 at 10:42 PM, Malahal Naineni <malahal(a)gmail.com> wrote:

...

I was working with 2.5 originally and wanted to see a fix with minimal changes. The patch was supposed to be Work In Progress but was squashed down as I didn't have enough arguments to back it up (ended up with a timeout patch). Now, we have systems in the lab exhibiting this in-house and customer locations as well (I added IOQ stats and they showed 1-second latency from queuing to calling writev alone under stress). I would put the IOQ in xprt itself as you did. Please post the patch and let us review it. Regards, Malahal. On Fri, Sep 14, 2018 at 7:41 PM, Kropelin, Adam <kropelin(a)amazon.com> wrote: > > > -----Original Message----- > > Locally I've experimented with two approaches to address this. > > > > In one, we hash xprts to queues rather than using ifindex. This is > > similar in > > concept to Malahal's old patch incrementing ifindex (which I was unaware > > of > > until now) but has the additional benefit of ensuring that traffic for > > any given > > xprt always lands on the same queue. > > Malahal, I re-read your patch and realize now that you rotated ifindex at > creation time so it has the same behavior as my hash in terms of > xprt-to-queue affinity. I originally thought you were rotating through > queues at enqueue time. So I think these approaches are equivalent. I was > just using 'fd & IOQ_IF_MASK' as a cheap hash so I think your approach is > better in the N-clients-to-M-queues space. But I'm interested in feedback on > the queue-per-xprt approach as that one is giving me the scalability and > reliability I really need with zero client interference. > > --Adam > _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

-- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309

William Allen Simpson

Saturday, 15 September Sat, 15 Sep

12:45 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/14/18 11:00 PM, Matt Benjamin wrote:

...

That's how it was when the ioq was introduced. I think Daniel has a fix in mind, but by all means, send this too.

I have gone back and looked, and there is nothing in the historical code that supports this statement. There was no IOQ in the SVCXPRT. Nor would that be a good idea. The IOQ structure is designed to handle a stream made up of segments. It is utilized per nfs_req, not by the SVCXPRT.

...

On Fri, Sep 14, 2018 at 10:42 PM, Malahal Naineni <malahal(a)gmail.com> wrote: > I would put the IOQ in xprt itself as you did. Please post the patch and let > us review it. >

Putting an IOQ in the SVCXPRT would require constant lock thrashing, because each svc_req is on a separate asynchronous thread. You can have many simultaneous svc_req going.... I've spent an awful lot of time getting rid of the terrible locking problems we had in the past. Please don't do that again! (Malahal had helped identify those hot locks in the past.)

Matt Benjamin

12:54 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

You're right that isn't what changed. This behavior, however, is one created by the threading changes, in fact. Since I wrote the ioq, I do know what it doesm, in general. Matt On Sat, Sep 15, 2018 at 1:45 PM, William Allen Simpson <william.allen.simpson(a)gmail.com> wrote:

...

On 9/14/18 11:00 PM, Matt Benjamin wrote: > > That's how it was when the ioq was introduced. I think Daniel has a > fix in mind, but by all means, send this too. > I have gone back and looked, and there is nothing in the historical code that supports this statement. There was no IOQ in the SVCXPRT. Nor would that be a good idea. The IOQ structure is designed to handle a stream made up of segments. It is utilized per nfs_req, not by the SVCXPRT. > On Fri, Sep 14, 2018 at 10:42 PM, Malahal Naineni <malahal(a)gmail.com> > wrote: >> >> I would put the IOQ in xprt itself as you did. Please post the patch and >> let >> us review it. >> Putting an IOQ in the SVCXPRT would require constant lock thrashing, because each svc_req is on a separate asynchronous thread. You can have many simultaneous svc_req going.... I've spent an awful lot of time getting rid of the terrible locking problems we had in the past. Please don't do that again! (Malahal had helped identify those hot locks in the past.)

-- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309

Malahal Naineni

Monday, 17 September Mon, 17 Sep

3:25 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

Putting an IOQ in the SVCXPRT would require constant lock thrashing, because each svc_req is on a separate asynchronous thread. You can have many simultaneous svc_req going.

Today, we have one IOQ that is being used. It is just a list of responses to be sent over all xprts. How would making a list per xprt be worse than what it is today? Is it because of CPU cache-lines or something else? Regards, Malahal On Sat, Sep 15, 2018 at 11:15 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote:

...

On 9/14/18 11:00 PM, Matt Benjamin wrote: > That's how it was when the ioq was introduced. I think Daniel has a > fix in mind, but by all means, send this too. > > I have gone back and looked, and there is nothing in the historical code that supports this statement. There was no IOQ in the SVCXPRT. Nor would that be a good idea. The IOQ structure is designed to handle a stream made up of segments. It is utilized per nfs_req, not by the SVCXPRT. On Fri, Sep 14, 2018 at 10:42 PM, Malahal Naineni <malahal(a)gmail.com> > wrote: > >> I would put the IOQ in xprt itself as you did. Please post the patch and >> let >> us review it. >> >> Putting an IOQ in the SVCXPRT would require constant lock thrashing, because each svc_req is on a separate asynchronous thread. You can have many simultaneous svc_req going.... I've spent an awful lot of time getting rid of the terrible locking problems we had in the past. Please don't do that again! (Malahal had helped identify those hot locks in the past.)

William Allen Simpson

Tuesday, 18 September Tue, 18 Sep

8:21 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/17/18 4:25 PM, Malahal Naineni wrote:

...

Ah, how I love an intelligent question! :) Long answer. Yes, atomic cache lines help a lot. That's "cache coherence". I hope by saying this over and over, it will eventually sink in enough that some of the list members will read up on it. Also, "something else". The rpc_dplx_internal per transport lock is used by many things in many paths. That was one of the hot locks you identified a few years back. It has a lot fewer references now. Note that I've also kept everything out of xp_lock. It's not used anywhere anymore. But it's API visible to Ganesha, and should only be used by Ganesha. Avoiding lock inversion. Circa 2014-2015, when I started trying to figure out how to fit in RDMA, a major issue was locks. Lots of lock inversion. DanG had a big chart on his whiteboard. I'd try to fix one thing, and more issues became apparent. Finally, just re-wrote svc_xprt. And threw away rpc_dplx. And rpc_ctx. So many locking conflicts. Plus two fd rbtrees. Why do things 3 times that can be done once.... :( Furthermore, the whole thing didn't scale. Just as one thread per fd won't scale. For 1,024 fds you'll need 1,024 sync output threads, plus 3 times more memory (because each thread consumes memory), even though the "top" half of the dispatcher only needs a few hundred threads to process the incoming operations. Remember, ultimately with async FSAL calls, the FSAL provides the thread that carries the output. Are we willing to guarantee that every FSAL backend will provide a thread pool with enough threads to handle one per fd? And by supporting multiple FSALs, each one will provide 1,024 threads? The current design bounds the number of output threads at 16. If that hurts too much, we could probably make it a configurable parameter. But I've not run into many configurations with even 16 interfaces. A decade or so ago, somebody I know was working on a 1,024 link switch on a chip. But you don't see that stuff much. Well, I've written all this before. And given talks. But new folks keep coming on the project, and maybe older folks don't remember.

Malahal Naineni

9:30 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

Thank you for the info, but much of it is unrelated to the patch in question here. I do have a patch that keeps the lists in an array and uses xp_fd for hashing. For now, I made it configurable by using RPC_Ioq_ThrdMax (for now against Ganesha2.5 code). This config param probably needs a rename. Regards, Malahal. On Tue, Sep 18, 2018 at 6:51 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote:

...

On 9/17/18 4:25 PM, Malahal Naineni wrote: > Today, we have one IOQ that is being used. It is just a list of responses > to be sent over all xprts. How would making a list per xprt be worse than > what it is today? Is it because of CPU cache-lines or something else? > Ah, how I love an intelligent question! :) Long answer. Yes, atomic cache lines help a lot. That's "cache coherence". I hope by saying this over and over, it will eventually sink in enough that some of the list members will read up on it. Also, "something else". The rpc_dplx_internal per transport lock is used by many things in many paths. That was one of the hot locks you identified a few years back. It has a lot fewer references now. Note that I've also kept everything out of xp_lock. It's not used anywhere anymore. But it's API visible to Ganesha, and should only be used by Ganesha. Avoiding lock inversion. Circa 2014-2015, when I started trying to figure out how to fit in RDMA, a major issue was locks. Lots of lock inversion. DanG had a big chart on his whiteboard. I'd try to fix one thing, and more issues became apparent. Finally, just re-wrote svc_xprt. And threw away rpc_dplx. And rpc_ctx. So many locking conflicts. Plus two fd rbtrees. Why do things 3 times that can be done once.... :( Furthermore, the whole thing didn't scale. Just as one thread per fd won't scale. For 1,024 fds you'll need 1,024 sync output threads, plus 3 times more memory (because each thread consumes memory), even though the "top" half of the dispatcher only needs a few hundred threads to process the incoming operations. Remember, ultimately with async FSAL calls, the FSAL provides the thread that carries the output. Are we willing to guarantee that every FSAL backend will provide a thread pool with enough threads to handle one per fd? And by supporting multiple FSALs, each one will provide 1,024 threads? The current design bounds the number of output threads at 16. If that hurts too much, we could probably make it a configurable parameter. But I've not run into many configurations with even 16 interfaces. A decade or so ago, somebody I know was working on a 1,024 link switch on a chip. But you don't see that stuff much. Well, I've written all this before. And given talks. But new folks keep coming on the project, and maybe older folks don't remember.

Kropelin, Adam

10:17 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

-----Original Message----- From: Matt Benjamin [mailto:mbenjami@redhat.com] Sent: Friday, September 14, 2018 11:01 PM To: Malahal Naineni <malahal(a)gmail.com> Cc: Kropelin, Adam <kropelin(a)amazon.com>; William Allen Simpson <william.allen.simpson(a)gmail.com>; devel(a)lists.nfs-ganesha.org Subject: Re: [NFS-Ganesha-Devel] Re: Scalability issue with VFS FSAL and large amounts of read i/o in flight That's how it was when the ioq was introduced. I think Daniel has a fix in mind, but by all means, send this too. Matt On Fri, Sep 14, 2018 at 10:42 PM, Malahal Naineni <malahal(a)gmail.com> wrote: > I was working with 2.5 originally and wanted to see a fix with minimal > changes. The patch was supposed to be Work In Progress but was > squashed down as I didn't have enough arguments to back it up (ended > up with a timeout patch). Now, we have systems in the lab exhibiting > this in-house and customer locations as well (I added IOQ stats and > they showed 1-second latency from queuing to calling writev alone under stress). > > I would put the IOQ in xprt itself as you did. Please post the patch > and let us review it.

See attached for the changes I'm testing with. Let me know if this should go to gerrithub or elsewhere. This is against v1.5.5 but is trivially adapted to -next. --Adam

William Allen Simpson

10:17 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

1) Did you test with the one line patch that I posted? 2) What client distro are you using? What kernel version? 3) What server distro are you using? What kernel version? 4) What does your mount look like? On 9/17/18 12:40 PM, Kropelin, Adam wrote:

...

See attached for the changes I'm testing with.

Your patch turns an async thread system into sync per fd. This really shouldn't be visible to the library caller, so doesn't belong in the SVCXPRT. You can do whatever you want, but I already added an IOQ in the rpc_dplx_rec for receiving. That would be a better place. And none of this fixes the underlying problem. 1GB in 1MB chunks all processing in parallel is going to make 100,000 TCP segments, and Linux default caps the outstanding segments at 1,000 or 10,000 per interface (depending on the queue type). Multiply by the number of callers. Basically you have piggy callers. All your patch does is randomize the fd that is serviced (because we cannot control which thread will be selected by the scheduler). And leaves an awful lot of threads waiting. Maybe we should go back to the many threads model and abandon async. I've said and written repeatedly that a better use of my time would have been the zero-copy interfaces needed for RDMA (my original reason for participating on this project). We'd have seen an immediate 35% improvement in throughput, instead of this meager 5-6% over 4 years.

Kropelin, Adam

11:58 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

-----Original Message----- From: William Allen Simpson [mailto:william.allen.simpson@gmail.com] Sent: Monday, September 17, 2018 2:51 PM To: Kropelin, Adam <kropelin(a)amazon.com> Cc: devel(a)lists.nfs-ganesha.org Subject: Re: [NFS-Ganesha-Devel] Re: Scalability issue with VFS FSAL and large amounts of read i/o in flight 1) Did you test with the one line patch that I posted?

Not your precise patch, but an equivalent. It offers a solid improvement until number of clients approaches number of queues. Then aggregate throughput is again limited to the sum of the slowest client in each queue. When the "slowest client" is someone who dropped off the network, the entire queue stalls and any client unfortunate enough to land on the same queue is starved until tcp eventually decides to give up on the missing client. Malahal's timeout patch can help with that, but it's a failsafe, not a scalability solution.

...

2) What client distro are you using? What kernel version?

Various, but one example would be Amazon Linux (RHEL-derived distro available in AWS EC2) with kernel 4.14.67. This is my testbed since it allows me to conjur a number of pristine clients on-demand.

...

3) What server distro are you using? What kernel version?

Same.

...

4) What does your mount look like?

NFSv4.0, nolock: 10.2.145.170:/test on /home/ec2-user/mnt type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.3.98.13,local_lock=none,addr=10.2.145.170)

...

On 9/17/18 12:40 PM, Kropelin, Adam wrote: > See attached for the changes I'm testing with. Your patch turns an async thread system into sync per fd.

To be fair, the system was never fully async. Up to 16 threads would drive network output before. Same with the hash-by-fd patch. You can argue this is a small enough fraction of however many threads are active, but it's not real async. With my change the limit goes up to number-of-xprts, but with suggestions elsewhere in this thread to make the number of queues configurable, you're just headed for the same thing anyway. The question is just one of "how many queues is enough?" The answer is: How many clients do you have? How much scalability do you need? As an alternative, non-blocking I/O would allow for true async behavior with a small number of queues and a small number of dedicated threads, likely one per interface again. Epoll on the fds with non-zero-length queues and feed them as required. Any given worker attempts to write in-line and if it receives EWOULDBLOCK, he queues the i/o for the epoll worker to handle when the socket can take it again. Nobody ever blocks on output. But that's a much bigger architectural change.

...

This really shouldn't be visible to the library caller, so doesn't belong in the SVCXPRT. You can do whatever you want, but I already added an IOQ in the rpc_dplx_rec for receiving. That would be a better place.

Thanks for the feedback, I'll take a look at that.

...

And none of this fixes the underlying problem. 1GB in 1MB chunks all processing in parallel is going to make 100,000 TCP segments, and Linux default caps the outstanding segments at 1,000 or 10,000 per interface (depending on the queue type). Multiply by the number of callers. Basically you have piggy callers.

Piggy indeed. But I'm not instructing them to be piggy...they just are. I've purposely designed my test case to eliminate every other potential bottleneck and complicating factor. There's essentially zero disk I/O anywhere. I should see perfect scalability (and I do with thread-per-xprt). A server needs to cope with clients doing things that aren't always optimal.

...

All your patch does is randomize the fd that is serviced (because we cannot control which thread will be selected by the scheduler).

It allows us to feed more than one tcp stream concurrently. When the tcp stream itself is bandwidth-limited on the client side, and you're using blocking sockets, you have to employ multiple threads. How many threads is enough is the question. 16 isn't enough when you have a fleet of more than that number of bandwidth-limited clients.

...

And leaves an awful lot of threads waiting.

Yes, that's a down-side with few alternatives other than non-blocking I/O. --Adam

Matt Benjamin

12:05 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On Tue, Sep 18, 2018 at 12:58 PM, Kropelin, Adam <kropelin(a)amazon.com> wrote:

...

The "old crufty" code used up to 200 threads, and that was a configurable limit. We had hit a related stall in 2.0.

...

> And leaves an awful lot of threads waiting. Yes, that's a down-side with few alternatives other than non-blocking I/O.

Indeed. Matt

...

--Adam _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

-- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309

William Allen Simpson

Wednesday, 19 September Wed, 19 Sep

1:03 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/18/18 12:58 PM, Kropelin, Adam wrote:

...

Try V4.1. V4.0 is only best effort, but not performance. V3 we try not to break entirely. Also, you are overriding the read and write sizes. It shouldn't be terribly surprising that there's not enough buffer space. We're not allocating that much memory.

...

To be fair, the system was never fully async.

Especially in Ganesha V2.5. A lot of the async I-O code didn't go in until V2.6. I merely wrote it against V2.3. It took a long time to be integrated. Note I mentioned that a predecessor had decided upon writev(). All I've done is work around it. Repeatedly. Don't blame the messenger. (That old code wouldn't even pass pynfs. So you're welcome.) Maybe someday somebody will let us use sendmmsg(). It's available on FreeBSD now, too. Also, weighted fair queuing. Because we actually knew a few things in the early '90s....

...

Piggy indeed. [...] A server needs to cope with clients doing things that aren't always optimal.

Nope. The server needs to STARVE clients that aren't optimal. That's also the philosophy in the kernel TCP and net stack. And pretty much every router and switch in the network path. If you send 10 Gbps into a switch with a typical 4 way memory, and your target is 1 Gbps, you will stall. (Called head of line block.) Because the most you can drain is 3 Gbps on the other 3 links. That's why high performance routers move everything off to separate cards. Usually each with 3 or 4 interfaces. Use reasonably sized requests and wait your turn. That will give the best overall aggregate performance for all clients.

Malahal Naineni

Thursday, 20 September Thu, 20 Sep

6:27 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On Wed, Sep 19, 2018 at 11:33 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote:

...

On 9/18/18 12:58 PM, Kropelin, Adam wrote: > > NFSv4.0, nolock: > > 10.2.145.170:/test on /home/ec2-user/mnt type nfs4 > (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255 > ,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.3. > 98.13,local_lock=none,addr=10.2.145.170) > > Try V4.1. V4.0 is only best effort, but not performance. V3 we try not to break entirely.

What makes you think that his problem is going to go away with NSFv4.1? I am pretty sure that the same issues exists in all versions.

...

Also, you are overriding the read and write sizes. It shouldn't be terribly surprising that there's not enough buffer space. We're not allocating that much memory.

That is the default read/write size with GPFS FSAL. Smaller values generally perform poorer!

...

To be fair, the system was never fully async. > Especially in Ganesha V2.5. A lot of the async I-O code didn't go in until V2.6. I merely wrote it against V2.3. It took a long time to be integrated. Note I mentioned that a predecessor had decided upon writev(). All I've done is work around it. Repeatedly. Don't blame the messenger. (That old code wouldn't even pass pynfs. So you're welcome.)

Let not blame anyone here as that is NOT going to solve anything.

...

Piggy indeed. [...] A server needs to cope with clients doing things that > aren't always optimal. > > Nope. The server needs to STARVE clients that aren't optimal. That's also the philosophy in the kernel TCP and net stack. And pretty much every router and switch in the network path.

Correct, but I don't see any bad client here. Unfortunatley, we are starving good clients here as we were waiting in writev() with slow clients (I don't want to call them bad clients though). Regards, Malahal.

...

__ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

William Allen Simpson

Friday, 21 September Fri, 21 Sep

4:35 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/20/18 7:27 AM, Malahal Naineni wrote:

...

On Wed, Sep 19, 2018 at 11:33 PM, William Allen Simpson <william.allen.simpson(a)gmail.com <mailto:william.allen.simpson@gmail.com>> wrote: > > Try V4.1. V4.0 is only best effort, but not performance. V3 we try > not to break entirely. > What makes you think that his problem is going to go away with NSFv4.1? I am pretty sure that the same issues exists in all versions.

Olga suggested trying V4.1. (She had also asked for the mount parameters.) So I passed it along. DanG told me the rest.

...

That is the default read/write size with GPFS FSAL. Smaller values generally perform poorer!

Now we know why GPFS needs 4K threads and 200GB RAM. But the NFS protocol is telling you those sizes are too big. They'll only work well after you increase the system and socket buffer sizes, and feed the correctly sized data accordingly. Otherwise, you just have a huge amount of bufferbloat. Anyway, talked with Bruce about what knfsd is doing. He says that they used to enforce sizing and reject piggy clients, but there were too many badly behaved operators. So they took the checks out. You are free to self-denial-of-service. The ideal is to have just enough data arrive just in time to the queue, on one CPU that is closest feed to the interface. That will give the best performance.

...

> Nope. The server needs to STARVE clients that aren't optimal. That's > also the philosophy in the kernel TCP and net stack. And pretty much > every router and switch in the network path. > Correct, but I don't see any bad client here. Unfortunatley, we are starving good clients here as we were waiting in writev() with slow clients (I don't want to call them bad clients though).

We've been doing Fair Queuing for 30+ years. I'm old; I remember when John proposed it. :) Whereas I learned about Weighted Fair Queuing from Fred Baker in the early '90s, when he was implementing it for ACC. I was part of the bufferbloat project, and one of the earliest testers of fq_codel. That's the best practice, the current state of the art. Look, I've mentioned that I was overridden on this code design. The evidence is in the list archives. You have shipping product. It needs to work in the field. So let's massage your patch a little, and get stuff working better. We can argue another day about the best design. Bake-a-thon is about ensuring things work!

J. Bruce Fields

Wednesday, 26 September Wed, 26 Sep

8:21 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On Fri, Sep 21, 2018 at 05:35:48AM -0400, William Allen Simpson wrote:

...

On 9/20/18 7:27 AM, Malahal Naineni wrote: >On Wed, Sep 19, 2018 at 11:33 PM, William Allen Simpson <william.allen.simpson(a)gmail.com <mailto:william.allen.simpson@gmail.com>> wrote: >> >> Try V4.1. V4.0 is only best effort, but not performance. V3 we try >> not to break entirely. >> > >What makes you think that his problem is going to go away with NSFv4.1? I am pretty sure that the same issues exists in all versions. > Olga suggested trying V4.1. (She had also asked for the mount parameters.) So I passed it along. DanG told me the rest. >That is the default read/write size with GPFS FSAL. Smaller values generally perform poorer! > Now we know why GPFS needs 4K threads and 200GB RAM. But the NFS protocol is telling you those sizes are too big. They'll only work well after you increase the system and socket buffer sizes, and feed the correctly sized data accordingly. Otherwise, you just have a huge amount of bufferbloat. Anyway, talked with Bruce about what knfsd is doing. He says that they used to enforce sizing and reject piggy clients, but there were too many badly behaved operators.

I'm not sure what "badly behaved operators" means. --b. > So they took the checks out. > You are free to self-denial-of-service. > > The ideal is to have just enough data arrive just in time to the > queue, on one CPU that is closest feed to the interface. That will > give the best performance. > > > >> Nope. The server needs to STARVE clients that aren't optimal. That's > >> also the philosophy in the kernel TCP and net stack. And pretty much > >> every router and switch in the network path. > >> > > > >Correct, but I don't see any bad client here. Unfortunatley, we are starving good clients here as we were waiting in writev() with slow clients (I don't want to call them bad clients though). > > > We've been doing Fair Queuing for 30+ years. I'm old; I remember when > John proposed it. :) > > Whereas I learned about Weighted Fair Queuing from Fred Baker in the > early '90s, when he was implementing it for ACC. > > I was part of the bufferbloat project, and one of the earliest testers of > fq_codel. That's the best practice, the current state of the art. > > Look, I've mentioned that I was overridden on this code design. The > evidence is in the list archives. You have shipping product. It > needs to work in the field. So let's massage your patch a little, > and get stuff working better. > > We can argue another day about the best design. Bake-a-thon is about > ensuring things work! > _______________________________________________ > Devel mailing list -- devel(a)lists.nfs-ganesha.org > To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

William Allen Simpson

Friday, 28 September Fri, 28 Sep

9:46 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/26/18 9:21 PM, J. Bruce Fields wrote:

...

On Fri, Sep 21, 2018 at 05:35:48AM -0400, William Allen Simpson wrote: > Anyway, talked with Bruce about what knfsd is doing. He says that > they used to enforce sizing and reject piggy clients, but there > were too many badly behaved operators. I'm not sure what "badly behaved operators" means.

I wrote this shortly after talking to you, and a week has passed. If you have a better description of the knfsd experience, please tell us.

William Allen Simpson

Saturday, 15 September Sat, 15 Sep

12:26 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

For a queue per xprt, there is no need for a patch to ntirpc. Every transport calls back to nfs-ganesha via the xp_dispatch vector. These all run MainNFSD/nfs_rpc_dispatcher_thread.c nfs_rpc_tcp_user_data(). Therefore, for experimentation purposes, add one line immediately before the return (line 1239): + newxprt->xp_ifindex = newxprt->xp_fd; KISS.

Matt Benjamin

Friday, 14 September Fri, 14 Sep

7:29 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On Fri, Sep 14, 2018 at 3:41 AM, William Allen Simpson <william.allen.simpson(a)gmail.com> wrote:

...

On 9/13/18 3:09 PM, Malahal Naineni wrote: >

...

But a certain someone wanted multiple requests on the same socket to be concurrent, though that would mean some responses might finish out of order. (I've never thought that was a good idea.)

NFS clients do not require requests to finish in submit order. I don't understand what this has to do with the price of wheat, though--only one thread is sending on the socket at any time.

...

Recently our maintainer was proposing that NFS Compound would also be parallelized. (I've never thought that was a good idea either.)

What can be parallelized is something the FSAL knows, not an absolute. Matt -- Matt Benjamin Red Hat, Inc. 315 West Huron Street, Suite 140A Ann Arbor, Michigan 48103 http://www.redhat.com/en/technologies/storage tel. 734-821-5101 fax. 734-769-8938 cel. 734-216-5309

Frank Filz

10:09 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

> But a certain someone wanted multiple requests on the same socket to > be concurrent, though that would mean some responses might finish out > of order. (I've never thought that was a good idea.) NFS clients do not require requests to finish in submit order. I don't understand what this has to do with the price of wheat, though--only one thread is sending on the socket at any time. > > Recently our maintainer was proposing that NFS Compound would also be > parallelized. (I've never thought that was a good idea either.) > What can be parallelized is something the FSAL knows, not an absolute.

I want to also point out that asynchronous I/O is not very useful if the NFS4 Compound processing can't be suspended. Note that the parallelization at the NFS v4 level is NOT to parallelize the ops in a single compound (that wouldn't accomplish anything, almost always each op depends on the previous op's results). The purpose is to allow suspension of an in-progress protocol operation when it is calling an asynchronous capable FSAL method which is not immediately returning. This of course will not be useful without the ability to suspend an in-progress RPC request. Frank

William Allen Simpson

Tuesday, 18 September Tue, 18 Sep

10:17 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

[Back to the OP] On 9/11/18 12:58 PM, Kropelin, Adam wrote:

...

My test setup has a ganesha server with a single export on the VFS FSAL. I have multiple Linux clients, all mounting that export with NFSv4.0.

I'm at the Bake-a-thon, and the first question the kernel folks asked was "What version of client are you running?" Apparently, there was a time period where the buffer estimation was broken. It isn't supposed to send too many parallel requests, until the outstanding data was received.

...

On the clients I run a simple read workload using dd: 'dd if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read the same 1 GB file. Each client is bandwidth-limited to 1 Gbps while the server has 10 Gbps available. A single client achieves ~100 MB/sec. Adding a second client brings the aggregate throughput up to ~120 MB/sec. A third client gets the aggregate to ~130 MB/sec, and it pretty much plateaus at that point. Clearly this is well below the aggregate bandwidth the server is capable of.

After consultation, this is now easy to explain. Each client is sending very short requests. They show up nearly simultaneously. They are passed into the VFS FSAL, each in its own thread, all at the same time. They each make a system call (and wait). The systems calls all return at nearly the same time. Then queued for output. Essentially, you've turned a set of sequential reads into random reads. Maybe you are hoping that will allow the disk scheduler to process all the system requests in parallel, seeking in the most optimal order. But usually, random reads are slower than sequential reads.

...

Additionally, and this is the behavior that made me originally discover this issue in production, while the clients are performing their read test, the server becomes extremely slow to respond to mount requests. By "extremely slow" I mean it takes 60 seconds or more to perform a simple mount while 8 clients are running the read test.

Yes. Because they are being performed in parallel, they are all arriving at the output queue at nearly the same time. For all the clients. Not necessarily in any order. A huge pile of data. No system memory for it all. The queue is full. It will not empty until the Acks have come back from the clients. Those Acks are probably slow, because the clients have to process the large amounts of data, and their links are slower than the server. (Also TCP only Acks every other segment.) Then your mount request arrives. It goes at the end of the queue. It will wait for all the outstanding data. Repeatedly, I proposed implementing "Weighted Fair Queuing". See: Mon, 8 Jun 2015 15:00:56 -0400 Wed, 2 Sep 2015 10:35:38 -0400 Thu, 9 Mar 2017 02:44:29 -0500 Each time, I was reprimanded *IN* *PUBLIC* (on the list) for proposing something beyond the then current plan. I asked for help designing an API for the FSALs. So now you've got async, top to bottom, in the RPC stack, simulated where we don't have actual async operations. It's more than 400% faster for short IOPs than the old Ganesha V2.3 crufty code. I still think WFQ would have been helpful.... Now maybe you do too? But the one line patch that I've proposed should help a small amount. It's not a panacea. MainNFSD/nfs_rpc_dispatcher_thread.c nfs_rpc_tcp_user_data(), add one line immediately before the return (line 1239): + newxprt->xp_ifindex = newxprt->xp_fd; Let us know whether it does?

Malahal Naineni

10:42 a.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

...

> Essentially, you've turned a set of sequential reads into random reads.

If understand his email, he is reading the same file over and over, so the READ requests should NOT hit the disk. Most of it should be in the page cache. Also, same tests gave better throughput with a fix, so it is not sequential/random disk access. I also put statistics on the send queue. Time-stamped when we queue the request and calculated the wait time in the send queue when a thread actually calls svc_ioq_flushv() on the request. On our customer system, the average wait time was close to a second! I am going to give my xp_fd hash patch and see what the average send queue times say. Regards, Malahal. On Mon, Sep 17, 2018 at 9:16 PM, William Allen Simpson < william.allen.simpson(a)gmail.com> wrote:

...

[Back to the OP] On 9/11/18 12:58 PM, Kropelin, Adam wrote: > My test setup has a ganesha server with a single export on the VFS FSAL. > I have multiple Linux clients, all mounting that export with NFSv4.0. > I'm at the Bake-a-thon, and the first question the kernel folks asked was "What version of client are you running?" Apparently, there was a time period where the buffer estimation was broken. It isn't supposed to send too many parallel requests, until the outstanding data was received. On the clients I run a simple read workload using dd: 'dd > if=/mnt/test/testfile of=/dev/null bs=1M'. All clients read the same 1 GB > file. Each client is bandwidth-limited to 1 Gbps while the server has 10 > Gbps available. A single client achieves ~100 MB/sec. Adding a second > client brings the aggregate throughput up to ~120 MB/sec. A third client > gets the aggregate to ~130 MB/sec, and it pretty much plateaus at that > point. Clearly this is well below the aggregate bandwidth the server is > capable of. > > After consultation, this is now easy to explain. Each client is sending very short requests. They show up nearly simultaneously. They are passed into the VFS FSAL, each in its own thread, all at the same time. They each make a system call (and wait). The systems calls all return at nearly the same time. Then queued for output. Essentially, you've turned a set of sequential reads into random reads. Maybe you are hoping that will allow the disk scheduler to process all the system requests in parallel, seeking in the most optimal order. But usually, random reads are slower than sequential reads. Additionally, and this is the behavior that made me originally discover > this issue in production, while the clients are performing their read test, > the server becomes extremely slow to respond to mount requests. By > "extremely slow" I mean it takes 60 seconds or more to perform a simple > mount while 8 clients are running the read test. > > Yes. Because they are being performed in parallel, they are all arriving at the output queue at nearly the same time. For all the clients. Not necessarily in any order. A huge pile of data. No system memory for it all. The queue is full. It will not empty until the Acks have come back from the clients. Those Acks are probably slow, because the clients have to process the large amounts of data, and their links are slower than the server. (Also TCP only Acks every other segment.) Then your mount request arrives. It goes at the end of the queue. It will wait for all the outstanding data. Repeatedly, I proposed implementing "Weighted Fair Queuing". See: Mon, 8 Jun 2015 15:00:56 -0400 Wed, 2 Sep 2015 10:35:38 -0400 Thu, 9 Mar 2017 02:44:29 -0500 Each time, I was reprimanded *IN* *PUBLIC* (on the list) for proposing something beyond the then current plan. I asked for help designing an API for the FSALs. So now you've got async, top to bottom, in the RPC stack, simulated where we don't have actual async operations. It's more than 400% faster for short IOPs than the old Ganesha V2.3 crufty code. I still think WFQ would have been helpful.... Now maybe you do too? But the one line patch that I've proposed should help a small amount. It's not a panacea. MainNFSD/nfs_rpc_dispatcher_thread.c nfs_rpc_tcp_user_data(), add one line immediately before the return (line 1239): + newxprt->xp_ifindex = newxprt->xp_fd; Let us know whether it does? _______________________________________________ Devel mailing list -- devel(a)lists.nfs-ganesha.org To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org

William Allen Simpson

Wednesday, 19 September Wed, 19 Sep

12:24 p.m.

New subject: Scalability issue with VFS FSAL and large amounts of read i/o in flight

On 9/18/18 11:42 AM, Malahal Naineni wrote:

...

I also put statistics on the send queue. Time-stamped when we queue the request and calculated the wait time in the send queue when a thread actually calls svc_ioq_flushv() on the request. On our customer system, the average wait time was close to a second! I am going to give my xp_fd hash patch and see what the average send queue times say.

Talked with folks here, and now this sounds like some of your sockets have gone into TCP Retransmission Timeout. As I'm sure you know, we've been battling overly aggressive (piggy) clients for years. You probably have a switch or router in the path. Most everybody has one version or another of fair queuing. Or RED. Really would need a packet trace to see what's going on....

2476

days inactive

2493

days old

devel@lists.nfs-ganesha.org

Manage subscription

36 comments

8 participants

tags (0)

participants (8)

Daniel Gryniewicz
Frank Filz
Imam Toufique
J. Bruce Fields
Kropelin, Adam
Malahal Naineni
Matt Benjamin
William Allen Simpson

2025

2024

2023

2022

2021

2020

2019

2018

Scalability issue with VFS FSAL and large amounts of read i/o in flight