Hello,
 
We noticed the following FATAL trace, after which ganesha exited (we are using ganesha 2.7.3 with some of our patches):
ganesha.nfsd-2609[reaper] release_openstate :STATE :FATAL :Could not complete cleanup of lock state for lock owner STATE_OPEN_OWNER_NFSV4 0x3ffea404a8a0: clientid={0x3ffcc001f960 ClientID={Epoch=0x000602e1 Counter=0x00000013} EXPIRED Client={} t_delta=113 reservations=0 refcount=4 cb_prog=1073741824 r_addr=192.168.117.159.141.56 r_netid=tcp} owner=(24:0x6f70656e2069643a000000290000000000175a07e7bb26ec) confirmed=1 seqid=1667 refcount=3
 
I replaced the exit() in Fatal() with an abort() to get a coredump and learn more about this.
 
I observed the following in the coredump:
(gdb) t 1
[Switching to thread 1 (Thread 0x3fff52ffe820 (LWP 737))]
#0  0x00003fff82f99614 in .raise () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00003fff82f99614 in .raise () from /lib64/libpthread.so.0
#1  0x0000000010056bec in crash_handler (signo=6, info=0x3fff52ffb478, ctx=0x3fff52ffa700) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/MainNFSD/nfs_init.c:243
#2  <signal handler called>
#3  0x00003fff82d5129c in .raise () from /lib64/libc.so.6
#4  0x00003fff82d534a8 in .abort () from /lib64/libc.so.6
#5  0x0000000010198bdc in Fatal () at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/log/log_functions.c:310
#6  0x000000001019cf50 in display_log_component_level (component=COMPONENT_STATE, file=0x10221ac8 "/root/rpmbuild/BUILD/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_state.c",
    line=861, function=0x10222178 <__func__.20812> "release_openstate", level=NIV_FATAL, format=0x10221fc0 "Could not complete cleanup of lock state for lock owner %s",
    arguments=0x3fff52ffbae0 "") at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/log/log_functions.c:1530
#7  0x000000001019cfe4 in DisplayLogComponentLevel (component=COMPONENT_STATE, file=0x10221ac8 "/root/rpmbuild/BUILD/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_state.c",
    line=861, function=0x10222178 <__func__.20812> "release_openstate", level=NIV_FATAL, format=0x10221fc0 "Could not complete cleanup of lock state for lock owner %s")
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/log/log_functions.c:1704
#8  0x0000000010127c44 in release_openstate (owner=0x3ffeac00e730) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_state.c:859
#9  0x000000001011f2c4 in nfs_client_id_expire (clientid=0x3ffd30004e60, make_stale=true) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_clientid.c:1071
#10 0x00000000101326f8 in nfs_release_v4_client (ip=0x1002c4d3e26 "192.168.11.78") at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_recovery.c:805
#11 0x000000001012efac in nfs_start_grace (gsp=0x3fff52ffd908) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_recovery.c:196
#12 0x0000000010063e44 in admin_dbus_grace (args=0x3fff52ffda20, reply=0x1002c4118e0, error=0x3fff52ffda00)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/MainNFSD/nfs_admin_thread.c:173
#13 0x00000000101dd77c in dbus_message_entrypoint (conn=0x1002c413790, msg=0x1002c413bd0, user_data=0x10261160 <admin_interfaces>)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/dbus/dbus_server.c:562
#14 0x00003fff833658c4 in ._dbus_object_tree_dispatch_and_unlock () from /lib64/libdbus-1.so.3
#15 0x00003fff83350d60 in .dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#16 0x00003fff83351108 in ._dbus_connection_read_write_dispatch () from /lib64/libdbus-1.so.3
#17 0x00000000101de708 in gsh_dbus_thread (arg=0x0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/dbus/dbus_server.c:795
#18 0x00003fff82f8c95c in .start_thread () from /lib64/libpthread.so.0
#19 0x00003fff82e33bbc in .__clone () from /lib64/libc.so.6
(gdb) f 8
#8  0x0000000010127c44 in release_openstate (owner=0x3ffeac00e730) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_state.c:859
859                     LogFatal(COMPONENT_STATE,

 
The FATAL trace was logged while ganesha was in the GRACE period.
release_openstate() has a while (errcnt < STATE_ERR_MAX) loop which is working on state=0x3ffc780103e0. It calls get_state_obj_export_owner_refs(state, &obj, &export, NULL), which returns with obj=NULL. Thread 10 is trying to close this obj, as can be seen in its backtrace below, but it is waiting for the obj's 'obj_lock', which is held by thread 215. Thread 215 is doing a commit, which is probably taking time.
After the while (errcnt < STATE_ERR_MAX) loop in release_openstate() has spun for 100 iterations, the FATAL trace is logged.
 
Inside the while (errcnt < STATE_ERR_MAX) loop, errcnt is incremented whenever get_state_obj_export_owner_refs() fails. If it fails STATE_ERR_MAX times, the FATAL trace is logged. This looks suspicious, because the while loop can run through all of its iterations very quickly. Please take a look. Thank you.
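
For reference, below is a minimal, standalone sketch of the loop shape described above (hypothetical stub types, and a stub in place of get_state_obj_export_owner_refs(); this is not the real nfs4_state.c code). It shows how errcnt can reach STATE_ERR_MAX almost instantly, since nothing in the loop waits between failed iterations:

#include <stdbool.h>
#include <stdio.h>

#define STATE_ERR_MAX 100   /* same limit used by the SAL code */

/* Hypothetical stand-ins for the real ganesha types. */
struct state_t { int unused; };
struct fsal_obj_handle { int unused; };

/* Stand-in for get_state_obj_export_owner_refs(): returns false while
 * another thread (here, the reaper in thread 10) is still tearing the
 * object down - the situation seen in the coredump. */
static bool get_refs_stub(struct state_t *state, struct fsal_obj_handle **obj)
{
    (void)state;
    *obj = NULL;
    return false;
}

static void release_openstate_sketch(struct state_t *state)
{
    int errcnt = 0;

    while (errcnt < STATE_ERR_MAX) {
        struct fsal_obj_handle *obj;

        if (!get_refs_stub(state, &obj)) {
            /* No back-off or sleep here: just count and retry,
             * so 100 failures can accumulate in microseconds. */
            errcnt++;
            continue;
        }

        /* ... normal path: take obj_lock, state_del_locked(), ... */
    }

    /* After STATE_ERR_MAX failures the real code calls LogFatal(). */
    printf("errcnt reached %d -> LogFatal() would fire here\n", errcnt);
}

int main(void)
{
    struct state_t st = { 0 };

    release_openstate_sketch(&st);
    return 0;
}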
 
(gdb) t 215
[Switching to thread 215 (Thread 0x3ffe171fe820 (LWP 10425))]
#0  0x00003fff82e25400 in .__ioctl () from /lib64/libc.so.6
(gdb) bt
#0  0x00003fff82e25400 in .__ioctl () from /lib64/libc.so.6
#1  0x00003fff80ff55c0 in gpfs_ganesha (op=123, oarg=0x3ffe171fc6b8) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/FSAL_GPFS/gpfsext.c:207
#2  0x00003fff80fe5fa4 in gpfs_commit_fd (my_fd=3654, obj_hdl=0x3ffc94002a40, offset=0, len=0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/FSAL_GPFS/file.c:1058
#3  0x00003fff80fe61d0 in gpfs_commit2 (obj_hdl=0x3ffc94002a40, offset=0, len=0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/FSAL_GPFS/file.c:1112
#4  0x00000000101c67c0 in mdcache_commit2 (obj_hdl=0x3ffc94020748, offset=0, len=0)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:741
#5  0x0000000010086f10 in fsal_commit (obj=0x3ffc94020748, offset=0, len=0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/include/fsal.h:424
#6  0x00000000100870c0 in nfs4_op_commit (op=0x3ffd080089c0, data=0x3ffe171fcb38, resp=0x3ffd08003d00)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/Protocols/NFS/nfs4_op_commit.c:95
#7  0x00000000100830a4 in nfs4_Compound (arg=0x3ffd0800cb48, req=0x3ffd0800c440, res=0x3ffd080080c0)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/Protocols/NFS/nfs4_Compound.c:942
#8  0x000000001007e978 in nfs_rpc_process_request (reqdata=0x3ffd0800c440) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/MainNFSD/nfs_worker_thread.c:1328
#9  0x000000001007f34c in nfs_rpc_valid_NFS (req=0x3ffd0800c440) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/MainNFSD/nfs_worker_thread.c:1538
#10 0x00003fff8347b450 in svc_vc_decode (req=0x3ffd0800c440) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/libntirpc/src/svc_vc.c:829
#11 0x000000001006a868 in nfs_rpc_decode_request (xprt=0x3ffc5822c4e0, xdrs=0x3ffd08001650)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
#12 0x00003fff8347b2c0 in svc_vc_recv (xprt=0x3ffc5822c4e0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/libntirpc/src/svc_vc.c:802
#13 0x00003fff83475b88 in svc_rqst_xprt_task (wpe=0x3ffc5822c6f8) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/libntirpc/src/svc_rqst.c:769
#14 0x00003fff834761fc in svc_rqst_epoll_events (sr_rec=0x1002c4fb740, n_events=1) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/libntirpc/src/svc_rqst.c:941
#15 0x00003fff83476598 in svc_rqst_epoll_loop (sr_rec=0x1002c4fb740) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/libntirpc/src/svc_rqst.c:1014
#16 0x00003fff834766cc in svc_rqst_run_task (wpe=0x1002c4fb740) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/libntirpc/src/svc_rqst.c:1050
#17 0x00003fff83484118 in work_pool_thread (arg=0x3ffdb0007070) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/libntirpc/src/work_pool.c:181
#18 0x00003fff82f8c95c in .start_thread () from /lib64/libpthread.so.0
#19 0x00003fff82e33bbc in .__clone () from /lib64/libc.so.6
(gdb) f 3
#3  0x00003fff80fe61d0 in gpfs_commit2 (obj_hdl=0x3ffc94002a40, offset=0, len=0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/FSAL_GPFS/file.c:1112
1112                    status = gpfs_commit_fd(out_fd->fd, obj_hdl, offset, len);
(gdb) l -
1101            status = fsal_reopen_obj(obj_hdl, false, false, FSAL_O_WRITE,
1102                                     (struct fsal_fd *)&myself->u.file.fd,
1103                                     &myself->u.file.share,
1104                                     gpfs_open_func, gpfs_close_func,
1105                                     (struct fsal_fd **)&out_fd, &has_lock,
1106                                     &closefd);
1107
1108            if (!FSAL_IS_ERROR(status)) {
1109
1110                    fsal_set_credentials(op_ctx->creds);
1111
(gdb) p has_lock
$1 = true

 
In gpfs_commit2(), fsal_reopen_obj() was called, which acquired the lock 'obj_hdl->obj_lock' (note that has_lock is true).
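
For context, here is a rough, hypothetical sketch (stand-in names, not the verbatim file.c code) of the caller-side pattern suggested by the listing and the has_lock value above: the helper takes obj_hdl->obj_lock and reports that via has_lock, and the caller drops the lock only after the commit returns, so the lock is held for the full duration of the slow GPFS ioctl:

#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the object handle and its obj_lock. */
struct obj_handle_stub {
    pthread_rwlock_t obj_lock;
};

/* Stand-in for fsal_reopen_obj(): takes obj_lock (shown here as a read
 * lock; the exact mode does not change the picture) and tells the
 * caller about it through *has_lock. */
static int reopen_obj_stub(struct obj_handle_stub *hdl, bool *has_lock)
{
    pthread_rwlock_rdlock(&hdl->obj_lock);
    *has_lock = true;
    return 0;
}

/* Stand-in for the slow gpfs_commit_fd() ioctl seen in thread 215. */
static void slow_commit_stub(void)
{
    sleep(5);   /* the commit "taking time" */
}

/* Caller-side pattern: obj_lock stays held across the whole commit and
 * is released only at the end, which is why gpfs_close2() in thread 10
 * cannot get its write lock in the meantime. */
static void commit2_pattern(struct obj_handle_stub *hdl)
{
    bool has_lock = false;

    if (reopen_obj_stub(hdl, &has_lock) == 0)
        slow_commit_stub();

    if (has_lock)
        pthread_rwlock_unlock(&hdl->obj_lock);
}

int main(void)
{
    struct obj_handle_stub hdl;

    pthread_rwlock_init(&hdl.obj_lock, NULL);
    commit2_pattern(&hdl);
    pthread_rwlock_destroy(&hdl.obj_lock);
    return 0;
}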
 
 
(gdb) t 10
[Switching to thread 10 (Thread 0x3fff51ffe820 (LWP 739))]
#0  0x00003fff82f914c8 in .__pthread_rwlock_wrlock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00003fff82f914c8 in .__pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x00003fff80fe7b48 in gpfs_close2 (obj_hdl=0x3ffc94002a40, state=0x3ffc780103e0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/FSAL_GPFS/file.c:1464
#2  0x00000000101c6b00 in mdcache_close2 (obj_hdl=0x3ffc94020748, state=0x3ffc780103e0)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_file.c:830
#3  0x00000000101245b0 in _state_del_locked (state=0x3ffc780103e0, func=0x10222178 <__func__.20812> "release_openstate", line=840)
    at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_state.c:403
#4  0x00000000101279c8 in release_openstate (owner=0x3ffeac00e730) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_state.c:840
#5  0x000000001011f2c4 in nfs_client_id_expire (clientid=0x3ffd30004e60, make_stale=false) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/SAL/nfs4_clientid.c:1071
#6  0x000000001005ae48 in reap_hash_table (ht_reap=0x1002c414b40) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/MainNFSD/nfs_reaper_thread.c:109
#7  0x000000001005b814 in reaper_run (ctx=0x1002c4cf2f0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/MainNFSD/nfs_reaper_thread.c:232
#8  0x00000000101754ec in fridgethr_start_routine (arg=0x1002c4cf2f0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/support/fridgethr.c:550
#9  0x00003fff82f8c95c in .start_thread () from /lib64/libpthread.so.0
#10 0x00003fff82e33bbc in .__clone () from /lib64/libc.so.6
(gdb) f 1
#1  0x00003fff80fe7b48 in gpfs_close2 (obj_hdl=0x3ffc94002a40, state=0x3ffc780103e0) at /usr/src/debug/nfs-ganesha-2.7.3-ibm050.05FatalAbort/FSAL/FSAL_GPFS/file.c:1464
1464                    PTHREAD_RWLOCK_wrlock(&obj_hdl->obj_lock);
 
Here, reaper_run() leads to a call to _state_del_locked(), which calls obj->obj_ops->close2(). gpfs_close2() is waiting for the same 'obj_hdl->obj_lock' that has been taken by thread 215.
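
To illustrate why thread 10 stalls here, a small standalone pthread example (hypothetical names, nothing ganesha-specific): while one thread holds obj_lock across a long commit, a second thread asking for the same lock as a writer, as gpfs_close2() does, simply blocks until the commit completes:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t obj_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Models thread 215: holds obj_lock across a long commit. */
static void *commit_thread(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&obj_lock);
    printf("commit: holding obj_lock, commit in progress...\n");
    sleep(5);           /* the slow GPFS commit ioctl */
    pthread_rwlock_unlock(&obj_lock);
    printf("commit: done, obj_lock released\n");
    return NULL;
}

/* Models thread 10 (gpfs_close2): needs obj_lock as a writer. */
static void *close_thread(void *arg)
{
    (void)arg;
    sleep(1);           /* let the commit start first */
    printf("close2: waiting for obj_lock (wrlock)...\n");
    pthread_rwlock_wrlock(&obj_lock);   /* blocks until the commit ends */
    printf("close2: got obj_lock, closing\n");
    pthread_rwlock_unlock(&obj_lock);
    return NULL;
}

int main(void)
{
    pthread_t t_commit, t_close;

    pthread_create(&t_commit, NULL, commit_thread, NULL);
    pthread_create(&t_close, NULL, close_thread, NULL);
    pthread_join(t_commit, NULL);
    pthread_join(t_close, NULL);
    return 0;
}

With the sleeps above, the second thread prints its "waiting" line and then sits in pthread_rwlock_wrlock() until the first thread finishes, which is the state thread 10 is captured in. Meanwhile the grace thread keeps failing get_state_obj_export_owner_refs() and runs errcnt up to STATE_ERR_MAX.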
 
Thanks,
Madhu Thorat.