So, nothing really jumps out at me. The pcap has quite a few ENOENT
returns, but they appear to be standard things like looking for
libraries in LD_LIBRARY_PATH, trying to open a nonexistent .profile,
looking for /etc/X11R6, and so on. Standard boot-up type things that
you'd expect to fail. None of the ACCESS calls failed (many have no
execute permission, but all have read/write permission). All the READs
succeed. The only errors in the capture are ENOENT on lookup.
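For reference, a check like this can be sketched roughly as below. It is only an illustration, not the exact procedure used on the capture: it assumes the NFS replies have already been exported to (procedure, status) pairs by some tool, it assumes NFSv3-style status names, and the sample records are made up rather than taken from n2521.pcap.

```python
from collections import Counter

# Hypothetical input: one (procedure, status) pair per NFS reply.
# These sample values are illustrative only, not data from n2521.pcap.
replies = [
    ("LOOKUP", "NFS3ERR_NOENT"),
    ("LOOKUP", "NFS3_OK"),
    ("ACCESS", "NFS3_OK"),
    ("READ", "NFS3_OK"),
    ("LOOKUP", "NFS3ERR_NOENT"),
]


def summarize_failures(replies, ok_status="NFS3_OK"):
    """Count non-OK replies, keyed by (procedure, status)."""
    return Counter(
        (proc, status) for proc, status in replies if status != ok_status
    )


failures = summarize_failures(replies)
for (proc, status), count in sorted(failures.items()):
    print(f"{proc:8} {status:16} {count}")
```

If anything other than LOOKUP/NOENT shows up in such a summary, that is the reply worth looking at first.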
Can you explain again what you're doing in this test that's failing? Is
it the full boot? Or are you manually running some commands from the
shell? If I can get a better idea of what's actually failing, I can
maybe figure out what to look for in the log.
Also, I was under the impression that you're running Ganesha on aarch64.
Is this true, or are just the compute nodes running on aarch64?
Daniel
On 8/12/19 3:36 PM, Erik Jacobson wrote:
OK. Here is what I did.
- Note: I have the test system set up for this test case so I can easily
re-run it.
- I added a shellout in our 'miniroot' environment to give me a shell
right after it made the overlay mount to /a (/a is what becomes root).
- I did this *before* the "--move" lines we do to make switch_root happy
just in case (same result).
- I had it pause waiting for me to hit ENTER before doing the mount.
That way, I started the captures and cleared the logs right before
hitting ENTER and then stopped capture and ganesha right after the
test.
- I captured all traffic between leader1 (the ganesha server) and
n2521 (the aarch64 compute node).
- compute node IP: 172.23.0.16
- leader main ip: 172.23.0.3
- CTDB-managed IP alias (compute node mounts from there): 172.23.255.249
- I captured all traffic between leader one and the other 8 leaders
but excluded nfs and ctdb. I was concerned if I restricted just to nfs
and gluster I'd miss a port. I am happy to re-run the test
differently.
- I attached an xz-compressed tarball.
- Content:
ganesha.conf (With the suggested way to disable ACL)
ganesha-gfapi.log
ganesha.log
gluster-brick-log-readme.txt (a sample and a remark that all were
about the same)
leaders-no-ctdb-no-nfs.pcap (capture among leaders)
n2521.pcap (capture between leader/nfs server and compute node)
node-output.txt (output from the failing "su" command on the problem node)
tcpdump-cmd-lines-readme.txt (How I did the capture, how to read)
- I am game to try anything that can help resolve this. I really appreciate
your time so far. I wish I could be more helpful on my end.
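As a rough illustration of the two captures described above, the filter expressions could be built as below. This is a sketch only: the real command lines are in tcpdump-cmd-lines-readme.txt, and the port numbers (2049 for NFS, 4379 for CTDB) are assumed defaults rather than values confirmed for this setup. The leader filter also does not restrict peers to the other 8 leaders, since their addresses are not listed here.

```python
# IPs taken from the list above.
COMPUTE_NODE = "172.23.0.16"
LEADER_MAIN = "172.23.0.3"
CTDB_ALIAS = "172.23.255.249"

# Assumed default ports; adjust if the site uses different ones.
NFS_PORT = 2049
CTDB_PORT = 4379


def node_filter(compute, server_ips):
    """pcap filter: traffic between the compute node and any server IP."""
    servers = " or ".join(f"host {ip}" for ip in server_ips)
    return f"host {compute} and ({servers})"


def leader_filter(leader, excluded_ports):
    """pcap filter: traffic involving the leader, minus the given ports."""
    excluded = " and ".join(f"not port {port}" for port in excluded_ports)
    return f"host {leader} and {excluded}"


print(node_filter(COMPUTE_NODE, [LEADER_MAIN, CTDB_ALIAS]))
print(leader_filter(LEADER_MAIN, [NFS_PORT, CTDB_PORT]))
```

Either string can be passed as the filter expression to a capture tool that accepts pcap filter syntax.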
Erik