Good morning! Responses inline.
Thanks for taking the time to dig through all that logging!
Can you explain again what you're doing in this test that's failing?
Is it the full boot? Or are you manually running some commands from the
shell?
If I can get a better idea of what's actually failing, I can maybe figure
out what to look for in the log.
The ultimate goal is a full boot. Where I started: switching from
gluster NFS or kernel NFS to Ganesha made the boot fail. Research
showed it failing when processes ran as a non-root user. Root can read
everything, but once the UID switches, files are no longer accessible.
To make the test case easier, I switched things so I stopped in the
miniroot environment (basically a fat initrd) right after the mount
was done for what would be the eventual root we "switch_root" to.
This allowed me to "chroot in" and reproduce the problem more simply
than with the full system boot where I started. This was just a test case
isolation exercise. It's far easier to poke at the problem just from a
chroot than from an operating system boot where nobody can log in due
to polkit and dbus being unhappy (and sshd wouldn't start, etc).
So that might have led to some of the confusion. The "just chroot in" is
just a simplified test case to catch the problem at the earliest point
without starting the actual INIT/systemd startup.
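
To make the "root can read it, non-root cannot" check repeatable from
inside the chroot, a probe along these lines could help. It is only a
sketch: the function name probe_as_uid is made up for illustration, it
assumes Python is available in the chroot, and it must be started as root
to switch to a different UID. It forks, drops to the given UID/GID in the
child, tries to read one byte of a regular file, and reports the errno.

```python
import os


def probe_as_uid(path, uid, gid):
    """Try to read one byte of `path` as uid/gid in a forked child.

    Returns 0 on success, otherwise the errno of the first failing call
    (capped at 255 so it fits in an exit status).
    """
    pid = os.fork()
    if pid == 0:
        # Child: drop privileges, attempt the read, report via exit status.
        status = 255
        try:
            if os.geteuid() == 0:
                os.setgroups([])  # clear supplementary groups (root only)
            os.setgid(gid)        # set the group first, while still allowed
            os.setuid(uid)
            fd = os.open(path, os.O_RDONLY)
            os.read(fd, 1)
            os.close(fd)
            status = 0
        except OSError as exc:
            status = min(exc.errno or 255, 255)
        finally:
            os._exit(status)
    # Parent: collect the child's exit status.
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status)
```

Calling it once with UID 0 and once with an unprivileged UID on the same
file, with and without the overlay in place, would show which errno the
non-root access actually gets (for example errno.EACCES versus
errno.ESTALE), which may narrow down what to look for in the Ganesha log.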
Also, I was under the impression that you're running Ganesha on aarch64.
Is this true, or are just the compute nodes running on aarch64?
The clients are aarch64. Ganesha and Gluster are running on x86_64
servers (9, 3x3 for gluster). All RHEL 7.6.
I did note that aarch64 rhel76 is a very different kernel than x86_64
rhel76.
The important thing is that I'm using overlay to combine the RO NFS as
the lowerdir and a TMPFS as the upperdir. If I remove the overlay, I can
chroot in just fine. The overlay is critical to the solution though.
- My x86_64 clients work just fine with the same exact solution.
- If I take away overlay and just use NFS, the problem doesn't repeat.
- If I use kernel NFS or gluster NFS (we have it deployed with gluster
  NFS), the problem does not repeat.
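
For reference, the layout described above (read-only NFS underneath,
tmpfs on top, overlay combining them) can be sketched as the mount
sequence below. The server name, export and mount points are hypothetical
placeholders, and the real miniroot may pass additional NFS options; the
sketch only prints the commands and does not run them.

```python
def overlay_mount_commands(nfs_source, lower, scratch, merged):
    """Return the mount commands for an RO-NFS + tmpfs overlay root.

    nfs_source: "server:/export" of the read-only NFS root (placeholder)
    lower:      where the NFS export is mounted (overlay lowerdir)
    scratch:    where the tmpfs is mounted (holds upperdir and workdir)
    merged:     where the combined overlay is mounted
    """
    upper = f"{scratch}/upper"
    # overlayfs requires workdir on the same filesystem as upperdir
    work = f"{scratch}/work"
    overlay_opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [
        ["mount", "-t", "nfs", "-o", "ro", nfs_source, lower],
        ["mount", "-t", "tmpfs", "tmpfs", scratch],
        ["mkdir", "-p", upper, work],
        ["mount", "-t", "overlay", "overlay", "-o", overlay_opts, merged],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for argv in overlay_mount_commands(
        "ganesha-server:/export/root", "/rootfs.ro", "/rootfs.rw", "/rootfs"
    ):
        print(" ".join(argv))
```

Dropping the last command and chrooting into the NFS mount directly
corresponds to the "take away overlay" case in the list above.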
I don't have enough FS and NFS internals experience, but my gut says there is
an issue in the overlay module for aarch64 that Ganesha somehow triggers
while other NFS servers do not. I just don't know where else to turn.
I would guess something namespace-wise is confused by the overlay piece.
Erik