Good morning. Since the problem didn't happen for me anymore in
the lab, we are avoiding OverlayFS: using an NFS read-only root with
certain tmpfs directories mounted on top for things like /etc, /var,
/root, etc.
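As a rough sketch of that layout (the server name, export path, and exact
set of tmpfs directories are hypothetical, not our actual configuration):

```
# fstab sketch: NFS read-only root (hypothetical server/export)
nfsserver:/images/rootfs  /      nfs    ro,vers=3   0 0
# writable tmpfs directories layered on top of the RO root
tmpfs                     /etc   tmpfs  defaults    0 0
tmpfs                     /var   tmpfs  defaults    0 0
tmpfs                     /root  tmpfs  defaults    0 0
```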
Closing the loop here.... With NFSv4, I had times when mount storms from
all nodes caused every Ganesha to exit (all 9 servers): 2592 mounts in
total, so roughly 288 requests hitting each server around the same time.
I had other times where the NFS mount requests just hung on the clients,
but trying the same mount later from an unrelated node worked. I
couldn't figure that out.
When I switched to NFSv3, the mount hangs disappeared completely.
However, for this workload it was far too slow. We also got some stale
NFS file handle errors even though nothing changed on the NFS server
side -- I say that knowing I may be doing something wrong.
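For reference, the v3 mounts were along these lines (server, export, and
mount point here are placeholders, not the customer's actual paths):

```
# force NFSv3 instead of v4 (hypothetical server/export/mountpoint)
mount -t nfs -o vers=3,proto=tcp ganesha-server:/export/data /mnt/data
```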
I tried raising the worker thread count (Nb_Worker) but it didn't seem
to help.
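What I tried was along these lines in ganesha.conf (the value 128 is
just an example of what I experimented with, not a recommendation):

```
# ganesha.conf -- raise the worker thread pool (example value)
NFS_CORE_PARAM
{
    Nb_Worker = 128;
}
```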
I can't get you what you need at this moment because the customer
doesn't allow me to copy anything off the system. Now that there are a
few different issues in play, it is getting confusing. I do plan to
request some logs, but I didn't see much in them -- even the
NIV_FULL_DEBUG output didn't seem to have much in it.
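For the record, this is roughly how I had full debug enabled in
ganesha.conf (turning it on for ALL components; you may want a narrower
component selection):

```
# ganesha.conf -- full debug logging (NIV_FULL_DEBUG)
LOG
{
    COMPONENTS
    {
        ALL = FULL_DEBUG;
    }
}
```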
My dedicated time ended, so I switched them back to their previous
configuration.
What I will do for us is this: when we get a big system in the factory
that I can access, I'll have full control of it and can collect any log
or trace, run experiments, etc.
I really want to see Ganesha work with these workloads so we can move
off of Gluster NFS.
Thank you so much for the help so far, and I'll get back to you when I
have a bigger system or when I get at least some logs cleared to send.
Erik