On 9/11/18 4:45 PM, Kropelin, Adam wrote:
> I suspect it also means a single out-to-lunch client could stall
> *all* I/O on the interface, which is another behavior I've been seeing recently. (Due
> to clients rebooting or otherwise going AWOL without umounting or closing the TCP
> connection.)
This is true. Once the kernel I/O buffers are all full because a TCP
client has stopped ACKing the data, no other connection can send over that
interface. That's just a fact of any kernel.
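
A rough sketch of the per-connection half of this, in Python for brevity. It assumes a local socketpair can stand in for the TCP connection, and the name bytes_until_stall is made up for illustration. It only shows one sender's kernel buffer filling when the peer never reads; it does not reproduce the interface-wide stall.

```python
import socket


def bytes_until_stall(chunk_size=4096, max_bytes=64 * 1024 * 1024):
    """Send to a peer that never reads; report how much the kernel accepted."""
    # Assumption: the unread end of a socketpair plays the crashed client.
    sender, receiver = socket.socketpair()
    sender.setblocking(False)
    chunk = b"x" * chunk_size
    queued = 0
    stalled = False
    try:
        while queued < max_bytes:
            try:
                queued += sender.send(chunk)
            except BlockingIOError:
                # Kernel buffers are full; a blocking send would hang here.
                stalled = True
                break
    finally:
        sender.close()
        receiver.close()
    return queued, stalled
```

The byte count at which the stall happens depends on the platform's buffer sizes, so none is quoted here.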
Thus the real problem is the client asking for megabytes of data in the
faint hope that this will somehow be faster -- and then crashing.
This has been a known problem for decades. So the TCPM WG developed
the TCP User Timeout option [RFC5482].
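
For reference, Linux exposes a local user timeout as the TCP_USER_TIMEOUT socket option. A minimal sketch in Python, assuming a Linux host; the helper name set_user_timeout and the 30-second value are illustrative only, and this configures the local side rather than showing how any particular server uses it.

```python
import socket


def set_user_timeout(sock, timeout_ms):
    """Set the TCP user timeout (milliseconds) and return the value read back."""
    # Assumption: Linux, where socket.TCP_USER_TIMEOUT is defined.
    # Sent data left unacknowledged for longer than this makes the
    # kernel drop the connection instead of retrying much longer.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)


def demo():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        return set_user_timeout(sock, 30000)  # 30 s, an arbitrary example value
    finally:
        sock.close()
```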
Malahal had a patch some time ago to time out the client using another
means, without depending upon the option. Didn't that go in?
> Non-blocking I/O would be the answer here, but without that... throw
> some more threads at it, I guess?
Since V2.3 (before my time), we've been using I/O vector (iov) zero-copy.
POSIX allows either iov or async I/O, but not both in the same call.
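
To illustrate what the iov path looks like from user space, here is a small Python sketch of a gather write: several separate buffers handed to the kernel in one sendmsg() call, without first copying them into one contiguous buffer. The function names and the header/payload split are hypothetical and not taken from the server code.

```python
import socket


def send_gathered(sock, buffers):
    """Send a list of separate buffers with a single sendmsg() call."""
    return sock.sendmsg(buffers)


def demo():
    a, b = socket.socketpair()
    try:
        header = b"HDR:"                 # hypothetical record header
        payload = bytearray(b"payload")  # hypothetical data buffer
        # memoryview passes the payload by reference, with no copy made
        # in user space to join it to the header.
        sent = send_gathered(a, [header, memoryview(payload)])
        received = b.recv(64)
        return sent, received
    finally:
        a.close()
        b.close()
```

The call is synchronous; the POSIX async interface (aio_write) takes a single buffer rather than an iovec, which is the either/or mentioned above.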
More threads won't help. It's a stall at the kernel level. In fact,
one thread per interface proved to be fastest, as that minimizes
locking conflicts and system calls (and improves CPU cache coherency).