You can easily run Ganesha in gdb, which should give you a backtrace
even if coredumps aren't being generated. Just run it like this:
gdb /usr/bin/ganesha.nfsd
(gdb) run -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf
-N NIV_EVENT -F
Then, when it exits, gdb will tell you whether it crashed or exited
normally. If it crashed, you can get a backtrace. (You'll need the
debug packages for Ganesha and ntirpc to get valid symbols.)
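
Once gdb reports the crash, something along these lines should dump the
stacks (exact output will vary):
(gdb) bt
(gdb) thread apply all bt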
Daniel
On 8/19/19 11:20 AM, Erik Jacobson wrote:
I turned off 'Graceless'.
I ran the daemon by hand instead of via systemd; still no core dump.
/usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f /etc/ganesha/ganesha.conf -N
NIV_EVENT -F
To repeat, I just chroot into the image and do 'su - erikj' a few times
in a row. The daemon then exits: no message in -F mode, no core dump.
[root@leader1 ~]# ulimit -c
unlimited
[root@leader1 ~]# /usr/bin/ganesha.nfsd -L /var/log/ganesha/ganesha.log -f
/etc/ganesha/ganesha.conf -N NIV_EVENT -F
[root@leader1 ~]#
Would you like a new collection with Graceless off? If there is anything
else I can run to help, let me know. It's easy to reproduce the problem,
so it's no big deal to gather more data.
> Was there any core generated when the ganesha process crashed? If yes, could
> you provide a backtrace?
>
> Looking at the first failure observed in the logs:
>
> 17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
> nfs41_Session_Get_Pointer :SESSIONS :F_DBG :Session
> sessionid=(16:0x010000001e82585d0100000000000000) Not Found
> 17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
> nfs4_op_sequence :SESSIONS :DEBUG :SESSIONS: DEBUG: SEQUENCE returning
> status NFS4ERR_BADSESSION
> 17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
> LogCompoundFH :FH :F_DBG :Current FH File Handle V4: Len=0 <null>
> 17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
> LogCompoundFH :FH :F_DBG :Saved FH File Handle V4: Len=0 <null>
> 17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
> complete_op :NFS4 :DEBUG :Status of OP_SEQUENCE in position 0 =
> NFS4ERR_BADSESSION, op response size is 4 total response size is 40
> 17/08/2019 17:40:07 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_25]
> complete_nfs4_compound :NFS4 :DEBUG :End status = NFS4ERR_BADSESSION
> lastindex = 1
>
> This session seems to have been deleted as part of the clientid expiry in
> CREATE_SESSION:
>
> 17/08/2019 17:40:05 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
> nfs4_op_create_session :SESSIONS :DEBUG :Expiring 0x7f03c8001cd0
> ClientID={Epoch=0x5d58821e Counter=0x00000001} CONFIRMED
> Client={0x7f03c8001bf0 name=(20:Linux NFSv4.1 (none)) refcount=3} t_delta=0
> reservations=0 refcount=10
> [Note: strace output -- wait4(1469, 0x7ffff22b2da4, WSTOPPED, NULL) = ?
> ERESTARTSYS (To be restarted if SA_RESTART is set) -- was interleaved with
> the next two log lines in the capture:]
> ... nfs_client_id_expire :RW LOCK :F_DBG :Acquired mutex 0x7f03c8001d50
> (&clientid->cid_mutex)
> ... epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
> nfs_client_id_expire :CLIENT ID :... :Expiring {0x7f03c8001cd0
> ClientID={Epoch=0x5d58821e Counter=0x00000001} CONFIRMED
> Client={0x7f03c8001bf0 name=(20:Linux NFSv4.1 (none)) refcount=3} t_delta=0
> reservations=0 refcount=10}
> ...
> 17/08/2019 17:40:05 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
> hashtable_getlatch :SESSIONS :F_DBG :Get (null) returning
> Value=0x7f03c0001840 {session 0x7f03c0001840
> {sessionid=(16:0x010000001e82585d0100000000000000)}}
> 17/08/2019 17:40:05 : epoch 5d58821e : leader1 : ganesha.nfsd-30466[svc_22]
> hashtable_deletelatched :SESSIONS :F_DBG :Delete (null) Key=0x7f03c0001840
> {sessionid=(16:0x010000001e82585d0100000000000000)} Value=0x7f03c0001840
> {session 0x7f03c0001840 {sessionid=(16:0x010000001e82585d0100000000000000)}}
> index=1 rbt_hash=1 was removed
>
>
> I see a potential bug in nfs4_op_create_session --
>
> 156 rc = nfs_client_id_get_confirmed(clientid, &conf);
> 172 client_record = conf->cid_client_record;
> 173 found = conf;
>
> 366 /* add to head of session list (encapsulate?) */
> 367 PTHREAD_MUTEX_lock(&found->cid_mutex);
> 368 glist_add(&found->cid_cb.v41.cb_session_list,
> 369 &nfs41_session->session_link);
> 370 PTHREAD_MUTEX_unlock(&found->cid_mutex);
> 371
>
> 427 if (conf != NULL && conf->cid_clientid != clientid) {
> 428 /* Old confirmed record - need to expire it */
> 429 if (isDebug(component)) {
> 430 char str[LOG_BUFF_LEN] = "\0";
> 431 struct display_buffer dspbuf = {sizeof(str),
> str, str};
> 432
> 433 display_client_id_rec(&dspbuf, conf);
> 434 LogDebug(component, "Expiring %s", str);
> 435 }
>
> The old clientid is expired after the newly created session has been added
> to its cb_session_list. Thus, when the old clientid is cleaned up, it may
> delete even the newly created, valid session. This seems to have caused the
> ERR_BADSESSION errors, and probably the ERR_EXPIRED and ERR_STALE_CLIENTID
> errors as well.
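>
> To make that ordering concrete, here is a toy, standalone sketch (not
> Ganesha code; the types and names are made up) of the sequence described
> above:
>
> /* Toy model: the new session is linked onto the OLD confirmed client
>  * record's session list before that record is expired, so expiring the
>  * old record also tears down the session that was just created. */
> #include <stdio.h>
> #include <stdlib.h>
>
> struct session {
>         int id;
>         struct session *next;     /* stands in for session_link */
> };
>
> struct client_rec {
>         const char *name;
>         struct session *sessions; /* stands in for cb_session_list */
> };
>
> /* Stands in for nfs_client_id_expire(): destroys every session still
>  * linked to the client record. */
> static void expire_client(struct client_rec *c)
> {
>         while (c->sessions != NULL) {
>                 struct session *s = c->sessions;
>
>                 c->sessions = s->next;
>                 printf("expire(%s): destroying session %d\n",
>                        c->name, s->id);
>                 free(s);
>         }
> }
>
> int main(void)
> {
>         struct client_rec old_conf = { "old confirmed clientid", NULL };
>         struct session *new_sess;
>
>         /* CREATE_SESSION builds the new session... */
>         new_sess = malloc(sizeof(*new_sess));
>         new_sess->id = 42;
>
>         /* ...links it onto the OLD confirmed record (cf. lines
>          * 366-370 above)... */
>         new_sess->next = old_conf.sessions;
>         old_conf.sessions = new_sess;
>
>         /* ...and only then expires that record (cf. lines 427+),
>          * which destroys session 42 as well. A later SEQUENCE on
>          * that session would now get NFS4ERR_BADSESSION. */
>         expire_client(&old_conf);
>
>         return 0;
> }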
>
> Also, since Graceless is set to TRUE, clients were not able to reclaim their
> previous state. I am not sure whether this can cause application errors, but
> ideally the grace period should be at least as long as the configured lease
> period, to allow clients to recover their lost state.
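>
> As a rough sketch (parameter names from the NFSV4 config block; exact
> values and defaults should be checked against your version), something
> like:
>
> NFSV4 {
>         Graceless = false;     # allow clients to reclaim state
>         Lease_Lifetime = 60;   # seconds
>         Grace_Period = 90;     # at least as long as Lease_Lifetime
> }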
>
> Request Frank/Dan to comment.
>
> Thanks,
> Soumya
>
> On 8/18/19 4:18 AM, Erik Jacobson wrote:
>> Hello. I'm starting a new thread for this problem.
>>
>> I have a 3x3 Gluster Volume and I'm trying to use Ganesha for NFS
>> services.
>>
>> I have enabled the Ganesha NFS server on one of the 9 server nodes.
>>
>> The volume is being used to host clients with NFS roots.
>>
>> A long separate thread shows how I got to this point, but what works on
>> the client side is:
>>
>> RHEL 7.6 aarch64 4.14.0-115.el7a.aarch64
>> RHEL 7.6 x86_64 3.10.0-957.el7.x86_64
>>
>> OverlayFS - NFS v4 lowerdir with a TMPFS overlay.
>>
>> The Ganesha server has
>> Allow_Numeric_Owners = True;
>> Only_Numeric_Owners = True;
>> Disable_ACL = TRUE;
>>
>>
>> Disable_ACL is required for the aarch64 overlay to properly read
>> non-root files. (Strangely, however, Disable_ACL must be false for
>> aarch64 if you are using NFS v3.)
>>
>> The x86_64 node fully boots through full init/systemd startup to the
>> login prompt.
>>
>> When I start up the aarch64 node, it gets varying degrees of the way
>> through boot... then both NFS clients freeze up 100%.
>>
>> Restarting nfs-ganesha gets them going for a moment, then they freeze
>> again. It turned out that in some cases the nfs-ganesha daemon was present
>> during the freeze but no longer serving the nodes. However, the more
>> common case (and the one captured here) is that nfs-ganesha is gone.
>>
>> I will attach a tarball with a bunch of information on the problem,
>> including the config file I used, debugging logs, and some traces.
>>
>> Ganesha 2.8.2
>> - Ganesha, Gluster servers x86_64
>>
>> Since the aarch64 node causes Ganesha to crash early, and the debug
>> log can get to 2GB quickly, I set up a test case as follows:
>>
>> Tracing starts...
>> - x86_64 fully nfs-root-booted, it comes up fine.
>> * Actively using nfs for root during tests below
>> - aarch64 node - boot to the miniroot env (a "fat" initrd that has
>> more tools and from which we do the NFS mount)
>> - It stops before switching control to init, so I can run the tests
>> below.
>> - cp'd /dev/null to ganesha log here
>> - started the tcpdump to the problem node
>> - Ran the following. Ganesha died at 'wc -l'; also notice the
>> Input/output error on the first attempt:
>>
>> bash-4.2# bash reset4.sh
>> + umount /a
>> umount: /a: not mounted
>> + umount /root_ro_nfs
>> umount: /root_ro_nfs: not mounted
>> + umount /rootfs.rw
>> + mount -o ro,nolock
>> 172.23.255.249:/cm_shared/image/images_ro_nfs/rhel76-aarch64-newkernel /root_ro_nfs
>> + mount -t tmpfs -o mpol=interleave tmpfs /rootfs.rw
>> + mkdir /rootfs.rw/upperdir
>> + mkdir /rootfs.rw/work
>> + mount -t overlay overlay -o
>> lowerdir=/root_ro_nfs,upperdir=/rootfs.rw/upperdir,workdir=/rootfs.rw/work /a
>> bash-4.2# chroot /a
>> chroot: failed to run command '/bin/sh': Input/output error
>> bash-4.2# chroot /a
>> sh: no job control in this shell
>> sh-4.2# ls /usr/bin|wc -l
>>
>> - When the above froze and ganesha died, I stopped tcpdump and collected
>> the pieces into a tarball.
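>>
>> For reference, reset4.sh is roughly the following (reconstructed from
>> the trace above):
>>
>> #!/bin/bash
>> set -x
>> umount /a
>> umount /root_ro_nfs
>> umount /rootfs.rw
>> mount -o ro,nolock 172.23.255.249:/cm_shared/image/images_ro_nfs/rhel76-aarch64-newkernel /root_ro_nfs
>> mount -t tmpfs -o mpol=interleave tmpfs /rootfs.rw
>> mkdir /rootfs.rw/upperdir
>> mkdir /rootfs.rw/work
>> mount -t overlay overlay -o lowerdir=/root_ro_nfs,upperdir=/rootfs.rw/upperdir,workdir=/rootfs.rw/work /a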
>>
>> See attached.
>>
>> Erik
>>
>>
Erik Jacobson
Software Engineer
erik.jacobson(a)hpe.com
+1 612 851 0550 Office
Eagan, MN
hpe.com