Good morning. Since the problem no longer reproduces for me in the lab,
I tried it on the ARM supercomputer during what may be my last time slot
before it moves to a location I will no longer be able to reach over the network.
I am not allowed to send logs directly until they have been cleared for
release. However, with overlayfs, NFSv4, and the ganesha config file below,
the ganesha daemons on all 9 Gluster/Ganesha servers died while the
2592 nodes spread across them were booting.
This run was at full debug with that config.
I'm going to archive these logs so that I can get them blessed for
release later.
I'll mention that the logs seem to have this at the end:
<snip> :NFS4 :F_DBG :Maximum allowed attr index = 76
I am going back through the thread to find the other logging setup you wanted
for one more run. Then I will switch away from overlayfs and use tmpfs
overmounts for the writable areas instead, which seemed more
reliable in the lab.
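To be concrete about what I mean by tmpfs overmounts: rather than an overlay,
the node image would simply mount a tmpfs over each path that needs to be
writable on top of the read-only root, roughly along the lines of these fstab
entries (the paths and sizes here are only placeholders, not my actual list):

    # example entries only; real paths/sizes differ per image
    tmpfs  /tmp      tmpfs  rw,nosuid,nodev,size=256m  0  0
    tmpfs  /var/log  tmpfs  rw,nosuid,nodev,size=64m   0  0
    tmpfs  /var/tmp  tmpfs  rw,nosuid,nodev,size=64m   0  0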
If you happen to be reading this now and need me to collect anything in
a different way, let me know; once my time slot ends I may never get access
to this specific system again. The next large systems I'll have access to,
months from now, will be x86_64-based clusters.

NFS_CORE_PARAM {
    Clustered = TRUE;
    RPC_Max_Connections = 2048;
}

NFSv4 {
    Graceless = True;
    Allow_Numeric_Owners = True;
    Only_Numeric_Owners = True;
}

EXPORT_DEFAULTS {
    Squash = none;
}

Disable_ACL = TRUE;

EXPORT {
    # Export Id (mandatory, each EXPORT must have a unique Export_Id)
    Export_Id = 10;
    Disable_ACL = TRUE;
    # Exported path (mandatory)
    Path = "/cm_shared";
    # Pseudo path (required for NFSv4)
    Pseudo = "/cm_shared";
    # Required for access (default is None)
    # Could use CLIENT blocks instead
    Access_Type = RW;
    Squash = none;
    # Security flavor supported
    SecType = "sys";
    # Exporting FSAL
    FSAL {
        Name = "GLUSTER";
        Hostname = "127.0.0.1"; # IP of one of the nodes in the trusted pool
        Volume = "cm_shared";   # Gluster volume name
    }
}

LOG {
    Components {
        ALL = FULL_DEBUG;
    }
}