I turned off 'Graceless'.
I ran the daemon by hand instead of via systemd; still no core dump.
/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -F
To reproduce, I just chroot into the image and run 'su - erikj' a few times
in a row. The daemon then exits. No message in -F mode, no core dump.
[root@leader1 ~]# ulimit -c
unlimited
[root@leader1 ~]# /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT -F
[root@leader1 ~]#
Would you like a new collection with Graceless off? If there is anything
else I can run to help, let me know. The problem is easy to reproduce, so
it is no big deal to gather more data.
Was there any core generated when the ganesha process crashed? If
yes, could you provide a backtrace?
Looking at the first failure observed in the logs:
17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
nfs41_Session_Get_Pointer :SESSIONS :F_DBG :Session
sessionid=(16:0x010000001e82585d0100000000000000) Not Found
17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
nfs4_op_sequence :SESSIONS :DEBUG :SESSIONS: DEBUG: SEQUENCE returning
status NFS4ERR_BADSESSION
17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
LogCompoundFH :FH :F_DBG :Current FH File Handle V4: Len=0 <null>
17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
LogCompoundFH :FH :F_DBG :Saved FH File Handle V4: Len=0 <null>
17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
complete_op :NFS4 :DEBUG :Status of OP_SEQUENCE in position 0 =
NFS4ERR_BADSESSION, op response size is 4 total response size is 40
17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
complete_nfs4_compound :NFS4 :DEBUG :End status = NFS4ERR_BADSESSION
lastindex = 1
This session seems to have been deleted as part of the clientid expiry in
CREATE_SESSION:
17/08/2019 17:40:05 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
nfs4_op_create_session :SESSIONS :DEBUG :Expiring 0x7f03c8001cd0
ClientID={Epoch=0x5d58821e Counter=0x00000001} CONFIRMED
Client={0x7f03c8001bf0 name=(20:Linux NFSv4.1 (none)) refcount=3} t_delta=0
reservations=0 refcount=10
0x7ffff22b2da4, WSTOPPED, NULL) = ? ERESTARTSYS (To be restarted if
SA_RESTART is set)_id_expire :RW LOCK :F_DBG :Acquired mutex 0x7f03c8001d50
(&clientid->cid_mutewait4(1469, 0x7ffff22b2da4, WSTOPPED, NULL) = ?
ERESTARTSYS (To be restarted if SA_RESTART is set)5d58821e : leader1 :
ganesha.nfsd-30466[svc_22] nfs_client_id_expire :CLIENT Iwait4(1469, piring
{0x7f03c8001cd0 ClientID={Epoch=0x5d58821e Counter=0x00000001} CONFIRMED
Client={0x7f03c8001bf0 name=(20:Linux NFSv4.1 (none)) refcount=3} t_delta=0
reservations=0 refcount=10}
...
17/08/2019 17:40:05 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
hashtable_getlatch :SESSIONS :F_DBG :Get (null) returning
Value=0x7f03c0001840 {session 0x7f03c0001840
{sessionid=(16:0x010000001e82585d0100000000000000)}}
17/08/2019 17:40:05 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
hashtable_deletelatched :SESSIONS :F_DBG :Delete (null) Key=0x7f03c0001840
{sessionid=(16:0x010000001e82585d0100000000000000)} Value=0x7f03c0001840
{session 0x7f03c0001840 {sessionid=(16:0x010000001e82585d0100000000000000)}}
index=1 rbt_hash=1 was removed
I see a potential bug in nfs4_op_create_session --
156 rc = nfs_client_id_get_confirmed(clientid, &conf);
172 client_record = conf->cid_client_record;
173 found = conf;
366 /* add to head of session list (encapsulate?) */
367 PTHREAD_MUTEX_lock(&found->cid_mutex);
368 glist_add(&found->cid_cb.v41.cb_session_list,
369 &nfs41_session->session_link);
370 PTHREAD_MUTEX_unlock(&found->cid_mutex);
371
427 if (conf != NULL && conf->cid_clientid != clientid) {
428 /* Old confirmed record - need to expire it */
429 if (isDebug(component)) {
430 char str[LOG_BUFF_LEN] = "\0";
431 struct display_buffer dspbuf = {sizeof(str), str, str};
432
433 display_client_id_rec(&dspbuf, conf);
434 LogDebug(component, "Expiring %s", str);
435 }
The old clientid is expired after the newly created session has been added
to its cb_session_list. So when this old clientid is cleaned up, it may
delete the newly created, valid session as well. This seems to have caused
NFS4ERR_BADSESSION and probably the NFS4ERR_EXPIRED and
NFS4ERR_STALE_CLIENTID errors too.
Also, since Graceless is set to TRUE, clients were not able to reclaim their
previous state. I am not sure whether this can cause any application error,
but ideally the grace period should be as long as the lease period, to allow
clients to recover their lost state.
Request Frank/Dan to comment.
Thanks,
Soumya
On 8/18/19 4:18 AM, Erik Jacobson wrote:
> Hello. I'm starting a new thread for this problem.
>
> I have a 3x3 Gluster Volume and I'm trying to use Ganesha for NFS
> services.
>
> I have enabled the Ganesha NFS server on one of the 9 server nodes.
>
> The volume is being used to host clients with NFS roots.
>
> A long separate thread shows how I got to this point but what works on
> the client side is:
>
> RHEL 7.6 aarch64 4.14.0-115.el7a.aarch64
> RHEL 7.6 x86_64 3.10.0-957.el7.x86_64
>
> OverlayFS - NFS v4 underdir with TMPFS overlay.
>
> The Ganesha server has
> Allow_Numeric_Owners = True;
> Only_Numeric_Owners = True;
> Disable_ACL = TRUE;
>
>
> Disable_ACL is required for the aarch64 overlay to properly read
> non-root files. (Strangely, however, Disable_ACL must be false for
> aarch64 if you are using NFS v3.)
>
> The x86_64 node fully boots through full init/systemd startup to the
> login prompt.
>
> When I start up the aarch64 node, it gets various degrees of done... then
> both NFS clients freeze up 100%.
>
> Restarting nfs-ganesha gets them going for a moment, then they freeze
> again. It turned out that in some cases the nfs-ganesha daemon was still
> present during the freeze but no longer serving the nodes. However, the
> more common case (and the captured one) is that nfs-ganesha is gone.
>
> I will attach a tarball with a bunch of information on the problem
> including the config file I used, debugging logs, and some traces.
>
> Ganesha 2.8.2
> - Ganesha, Gluster servers x86_64
>
> Since the aarch64 node causes Ganesha to crash early, and the debug
> log can get to 2GB quickly, I set up a test case as follows:
>
> Tracing starts...
> - x86_64 fully nfs-root-booted, it comes up fine.
> * Actively using nfs for root during tests below
> - aarch64 node - boot to the miniroot env (a "fat" initrd that has
> more tools and from which we do the NFS mount)
> - It stops before switching control to init, so that the tests below
> can be run.
> - cp'd /dev/null to ganesha log here
> - started the tcpdump to the problem node
> - Ran the following. Ganesha died at 'wc -l', also notice the
> Input/Output error on the first attempt:
>
> bash-4.2# bash reset4.sh
> + umount /a
> umount: /a: not mounted
> + umount /root_ro_nfs
> umount: /root_ro_nfs: not mounted
> + umount /rootfs.rw
> + mount -o ro,nolock 172.23.255.249:/cm_shared/image/images_ro_nfs/rhel76-aarch64-newkernel /root_ro_nfs
> + mount -t tmpfs -o mpol=interleave tmpfs /rootfs.rw
> + mkdir /rootfs.rw/upperdir
> + mkdir /rootfs.rw/work
> + mount -t overlay overlay -o lowerdir=/root_ro_nfs,upperdir=/rootfs.rw/upperdir,workdir=/rootfs.rw/work /a
> bash-4.2# chroot /a
> chroot: failed to run command '/bin/sh': Input/output error
> bash-4.2# chroot /a
> sh: no job control in this shell
> sh-4.2# ls /usr/bin|wc -l
>
> - When the above froze and ganesha died, I stopped tcpdump and collected
> the pieces in to a tarball.
>
> See attached.
>
> Erik
>
>
Erik Jacobson
Software Engineer
erik.jacobson(a)hpe.com
+1 612 851 0550 Office
Eagan, MN
hpe.com