No core file... Easy to run the test again... I will turn Grace back on and compare (I turned it off to make tests run faster). I won't be able to do it for a couple of hours. Thank you!!


From: Soumya Koduri <skoduri@redhat.com>
Sent: Monday, August 19, 2019 7:09:09 AM
To: Jacobson, Erik <erik.jacobson@hpe.com>; devel@lists.nfs-ganesha.org <devel@lists.nfs-ganesha.org>
Cc: Frank Filz <ffilz@redhat.com>; Daniel Gryniewicz <dang@redhat.com>
Subject: Re: [NFS-Ganesha-Devel] nfs-ganesha clients freeze, nfs ganesha daemon dies
 
Was there any core generated when the ganesha process crashed? If yes,
could you provide a backtrace?
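
In case it helps, something along these lines should produce one from a
core (the binary and core paths are just examples; adjust to your install):

    # allow cores before reproducing, in the shell that starts ganesha
    ulimit -c unlimited

    # once a core exists, open it together with the matching binary
    gdb /usr/bin/ganesha.nfsd /path/to/core

    # inside gdb, capture backtraces of all threads to a file
    (gdb) set logging file ganesha-bt.txt
    (gdb) set logging on
    (gdb) thread apply all bt
    (gdb) set logging off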

Looking at the first failure observed in the logs:

17/08/2019 17:40:07 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_25] nfs41_Session_Get_Pointer :SESSIONS :F_DBG
:Session sessionid=(16:0x010000001e82585d0100000000000000) Not Found
17/08/2019 17:40:07 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_25] nfs4_op_sequence :SESSIONS :DEBUG :SESSIONS:
DEBUG: SEQUENCE returning status NFS4ERR_BADSESSION
17/08/2019 17:40:07 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_25] LogCompoundFH :FH :F_DBG :Current FH  File
Handle V4: Len=0 <null>
17/08/2019 17:40:07 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_25] LogCompoundFH :FH :F_DBG :Saved FH    File
Handle V4: Len=0 <null>
17/08/2019 17:40:07 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_25] complete_op :NFS4 :DEBUG :Status of
OP_SEQUENCE in position 0 = NFS4ERR_BADSESSION, op response size is 4
total response size is 40
17/08/2019 17:40:07 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_25] complete_nfs4_compound :NFS4 :DEBUG :End
status = NFS4ERR_BADSESSION lastindex = 1

This session seems to have been deleted as part of the clientid expire in
CREATE_SESSION:

17/08/2019 17:40:05 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_22] nfs4_op_create_session :SESSIONS :DEBUG
:Expiring 0x7f03c8001cd0 ClientID={Epoch=0x5d58821e Counter=0x00000001}
CONFIRMED Client={0x7f03c8001bf0 name=(20:Linux NFSv4.1 (none))
refcount=3} t_delta=0 reservations=0 refcount=10
... nfs_client_id_expire :RW LOCK :F_DBG :Acquired mutex
0x7f03c8001d50 (&clientid->cid_mutex)
... epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
nfs_client_id_expire :CLIENT ID ... :Expiring {0x7f03c8001cd0
ClientID={Epoch=0x5d58821e Counter=0x00000001} CONFIRMED
Client={0x7f03c8001bf0 name=(20:Linux NFSv4.1 (none)) refcount=3}
t_delta=0 reservations=0 refcount=10}
...
17/08/2019 17:40:05 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_22] hashtable_getlatch :SESSIONS :F_DBG :Get
(null) returning Value=0x7f03c0001840 {session 0x7f03c0001840
{sessionid=(16:0x010000001e82585d0100000000000000)}}
17/08/2019 17:40:05 : epoch 5d58821e : leader1 :
ganesha.nfsd-30466[svc_22] hashtable_deletelatched :SESSIONS :F_DBG
:Delete (null) Key=0x7f03c0001840
{sessionid=(16:0x010000001e82585d0100000000000000)} Value=0x7f03c0001840
{session 0x7f03c0001840
{sessionid=(16:0x010000001e82585d0100000000000000)}} index=1 rbt_hash=1
was removed


I see a potential bug in nfs4_op_create_session --

156                 rc = nfs_client_id_get_confirmed(clientid, &conf);
172                 client_record = conf->cid_client_record;
173                 found = conf;

366         /* add to head of session list (encapsulate?) */
367         PTHREAD_MUTEX_lock(&found->cid_mutex);
368         glist_add(&found->cid_cb.v41.cb_session_list,
369                   &nfs41_session->session_link);
370         PTHREAD_MUTEX_unlock(&found->cid_mutex);
371

427         if (conf != NULL && conf->cid_clientid != clientid) {
428                 /* Old confirmed record - need to expire it */
429                 if (isDebug(component)) {
430                         char str[LOG_BUFF_LEN] = "\0";
431                         struct display_buffer dspbuf = {sizeof(str),
                                                            str, str};
432
433                         display_client_id_rec(&dspbuf, conf);
434                         LogDebug(component, "Expiring %s", str);
435                 }

The old clientid is expired only after the newly created session has been
added to its cb_session_list. So when the old clientid gets cleaned up, it
may delete even the newly created, valid session. This seems to have
caused the ERR_BADSESSION errors, and probably the ERR_EXPIRED and
ERR_STALE_CLIENTID errors as well.
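
For illustration only, a rough sketch of the kind of reordering I mean,
using just the lines quoted above (the nfs_client_id_expire() call and its
arguments are assumed here, and refcount/error handling is omitted):

        /* Expire the stale confirmed record first, so that tearing it
         * down cannot remove the session we are about to link. */
        if (conf != NULL && conf->cid_clientid != clientid) {
                /* Old confirmed record - need to expire it */
                nfs_client_id_expire(conf, false);
        }

        /* Only then add the new session to the session list */
        PTHREAD_MUTEX_lock(&found->cid_mutex);
        glist_add(&found->cid_cb.v41.cb_session_list,
                  &nfs41_session->session_link);
        PTHREAD_MUTEX_unlock(&found->cid_mutex);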

Also, since Graceless is set to TRUE, clients were not able to reclaim
their previous state. I am not sure whether this can cause application
errors, but ideally the grace period should be at least as long as the
lease period, so that clients can recover their lost state.
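
For example, something like this in the NFSv4 block of ganesha.conf (the
numbers are only placeholders, not a recommendation):

NFSv4
{
        # keep the server in grace at least as long as the lease so
        # clients can reclaim their state after a restart
        Graceless = False;
        Lease_Lifetime = 60;
        Grace_Period = 90;
}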

Request Frank/Dan to comment.

Thanks,
Soumya

On 8/18/19 4:18 AM, Erik Jacobson wrote:
> Hello. I'm starting a new thread for this problem.
>
> I have a 3x3 Gluster Volume and I'm trying to use Ganesha for NFS
> services.
>
> Of the 9 server nodes, I have enabled the Ganesha NFS server on one.
>
> The volume is being used to host clients with NFS roots.
>
> A long separate thread shows how I got to this point, but what works on
> the client side is:
>
> RHEL 7.6 aarch64 4.14.0-115.el7a.aarch64
> RHEL 7.6 x86_64 3.10.0-957.el7.x86_64
>
> OverlayFS - NFS v4 underdir with TMPFS overlay.
>
> The Ganesha server has
>        Allow_Numeric_Owners = True;
>        Only_Numeric_Owners = True;
>        Disable_ACL = True;
>
>
> Disable_ACL is required for the aarch64 overlay to properly read
> non-root files. (Strangely, however, Disable_ACL must be false for
> aarch64 if you are using NFS v3.)
>
> The x86_64 node fully boots through full init/systemd startup to the
> login prompt.
>
> When I start up the aarch64 node, it gets to varying degrees of done...
> then both NFS clients freeze up 100%.
>
> Restarting nfs-ganesha gets them going for a moment, then they freeze
> again. It turned out that in some cases the nfs-ganesha daemon was still
> present during the freeze but no longer serving the nodes. However, the
> more common case (and the captured one) is that nfs-ganesha is gone.
>
> I will attach a tarball with a bunch of information on the problem
> including the config file I used, debugging logs, and some traces.
>
> Ganesha 2.8.2
>   - Ganesha, Gluster servers x86_64
>
> Since the aarch64 node causes Ganesha to crash early, and the debug
> log can get to 2GB quickly, I set up a test case as follows:
>
> Tracing starts...
>   - x86_64 fully nfs-root-booted, it comes up fine.
>      * Actively using nfs for root during tests below
>   - aarch64 node - boot to the miniroot env (a "fat" initrd that has
>     more tools and from which we do the NFS mount)
>   - It stops before switching control to init, so I can run tests
>     like the ones below.
>   - cp'd /dev/null to ganesha log here
>   - started the tcpdump to the problem node
>   - Ran the following. Ganesha died at 'wc -l'; also notice the
>     Input/output error on the first attempt:
>
> bash-4.2# bash reset4.sh
> + umount /a
> umount: /a: not mounted
> + umount /root_ro_nfs
> umount: /root_ro_nfs: not mounted
> + umount /rootfs.rw
> + mount -o ro,nolock 172.23.255.249:/cm_shared/image/images_ro_nfs/rhel76-aarch64-newkernel /root_ro_nfs
> + mount -t tmpfs -o mpol=interleave tmpfs /rootfs.rw
> + mkdir /rootfs.rw/upperdir
> + mkdir /rootfs.rw/work
> + mount -t overlay overlay -o lowerdir=/root_ro_nfs,upperdir=/rootfs.rw/upperdir,workdir=/rootfs.rw/work /a
> bash-4.2# chroot /a
> chroot: failed to run command '/bin/sh': Input/output error
> bash-4.2# chroot /a
> sh: no job control in this shell
> sh-4.2# ls /usr/bin|wc -l
>
> - When the above froze and ganesha died, I stopped tcpdump and collected
>    the pieces into a tarball.
>
> See attached.
>
> Erik
>
>
> _______________________________________________
> Devel mailing list -- devel@lists.nfs-ganesha.org
> To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org
>