Thanks for taking a look at this, Daniel. Below is the only interesting bit from the Ceph MDS log at the time of the crash, but I suspect the slow requests are a result of the Ganesha crash rather than the cause of it. Copying in the Ceph list in case anyone has any ideas.

2019-07-15 15:06:54.624007 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 6 slow requests, 5 included below; oldest blocked for > 34.588509 secs
2019-07-15 15:06:54.624017 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 33.113514 seconds old, received at 2019-07-15 15:06:21.510423: client_request(client.16140784:5571174 setattr mtime=2019-07-15 14:59:45.642408 #0x10009079cfb 2019-07-15 14:59:45.642408 caller_uid=1161, caller_gid=1131{}) currently failed to xlock, waiting
2019-07-15 15:06:54.624020 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 34.588509 seconds old, received at 2019-07-15 15:06:20.035428: client_request(client.16129440:1067288 create #0x1000907442e/filePathEditorRegistryPrefs.melDXAtss 2019-07-15 14:59:53.694087 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to wrlock, waiting
2019-07-15 15:06:54.624025 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 34.583918 seconds old, received at 2019-07-15 15:06:20.040019: client_request(client.16140784:5570551 getattr pAsLsXsFs #0x1000907443b 2019-07-15 14:59:44.171408 caller_uid=1161, caller_gid=1131{}) currently failed to rdlock, waiting
2019-07-15 15:06:54.624028 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 34.580632 seconds old, received at 2019-07-15 15:06:20.043305: client_request(client.16129440:1067293 unlink #0x1000907442e/filePathEditorRegistryPrefs.melcdzxxc 2019-07-15 14:59:53.701964 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to wrlock, waiting
2019-07-15 15:06:54.624032 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 34.538332 seconds old, received at 2019-07-15 15:06:20.085605: client_request(client.16129440:1067308 create #0x1000907442e/filePathEditorRegistryPrefs.melHHljMk 2019-07-15 14:59:53.744266 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to wrlock, waiting
2019-07-15 15:06:55.014073 7f5fdcdc0700  1 mds.mds01 Updating MDS map to version 68166 from mon.2
2019-07-15 15:06:59.624041 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 7 slow requests, 2 included below; oldest blocked for > 39.588571 secs
2019-07-15 15:06:59.624048 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 30.495843 seconds old, received at 2019-07-15 15:06:29.128156: client_request(client.16129440:1072227 create #0x1000907442e/filePathEditorRegistryPrefs.mel58AQSv 2019-07-15 15:00:02.786754 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to wrlock, waiting
2019-07-15 15:06:59.624053 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 39.432848 seconds old, received at 2019-07-15 15:06:20.191151: client_request(client.16140784:5570649 mknod #0x1000907442e/filePathEditorRegistryPrefs.mel3HZLNE 2019-07-15 14:59:44.322408 caller_uid=1161, caller_gid=1131{}) currently failed to wrlock, waiting
2019-07-15 15:07:03.014108 7f5fdcdc0700  1 mds.mds01 Updating MDS map to version 68167 from mon.2
2019-07-15 15:07:04.624096 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 8 slow requests, 1 included below; oldest blocked for > 44.588632 secs
2019-07-15 15:07:04.624103 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 34.904077 seconds old, received at 2019-07-15 15:06:29.719983: client_request(client.16129440:1072228 getattr pAsLsXsFs #0x1000907443b 2019-07-15 15:00:03.378512 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to rdlock, waiting
2019-07-15 15:07:07.013972 7f5fdcdc0700  1 mds.mds01 Updating MDS map to version 68168 from mon.2
2019-07-15 15:07:09.624166 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 10 slow requests, 2 included below; oldest blocked for > 49.588693 secs
2019-07-15 15:07:09.624173 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 32.689838 seconds old, received at 2019-07-15 15:06:36.934283: client_request(client.16129440:1072271 getattr pAsLsXsFs #0x1000907443b 2019-07-15 15:00:10.592734 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to rdlock, waiting
2019-07-15 15:07:09.624177 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 34.962719 seconds old, received at 2019-07-15 15:06:34.661402: client_request(client.16129440:1072256 getattr pAsLsXsFs #0x1000907443b 2019-07-15 15:00:08.319912 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to rdlock, waiting
2019-07-15 15:07:11.519928 7f5fdcdc0700  1 mds.mds01 Updating MDS map to version 68169 from mon.2
2019-07-15 15:07:19.624272 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 11 slow requests, 1 included below; oldest blocked for > 59.588812 secs
2019-07-15 15:07:19.624278 7f5fda5bb700  0 log_channel(cluster) log [WRN] : slow request 32.164260 seconds old, received at 2019-07-15 15:06:47.459980: client_request(client.16129440:1072326 getattr pAsLsXsFs #0x1000907443b 2019-07-15 15:00:21.118372 caller_uid=1161, caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,}) currently failed to rdlock, waiting
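
If more detail on the blocked requests would help, I can pull them from the MDS admin socket, something like the below (from memory; assuming mds01 is still the active MDS and the admin socket is in its default location on that host):

# show the requests currently stuck on the MDS and what each one is waiting for
ceph daemon mds.mds01 dump_ops_in_flight
# list client sessions, to map client.16129440 / client.16140784 back to mounts
ceph daemon mds.mds01 session ls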


On Tue, Jul 16, 2019 at 1:18 PM Daniel Gryniewicz <dang@redhat.com> wrote:
This is not one I've seen before, and from a quick look the code involved seems strange. The only assert in that bit checks that the parent is a directory, but the parent directory isn't something that was passed in by Ganesha; it's something that was looked up internally in libcephfs. This is beyond my expertise at this point. Maybe some Ceph logs would help?
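
If you can capture them, verbose client-side logs from libcephfs inside the Ganesha process would probably be the most telling. Something along these lines in the ceph.conf that Ganesha reads should do it (the log path here is just an example; make sure the ganesha user can write to it):

[client]
# verbose libcephfs (client) and messenger logging from the Ganesha process
debug client = 20
debug ms = 1
log file = /var/log/ceph/ceph-client.ganesha.log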

Daniel

On 7/15/19 10:54 AM, David C wrote:
>
> Hi All
>
> I'm running nfs-ganesha 2.7.3 with the CEPH FSAL to export CephFS (Luminous). It
> ran well for a few days and then crashed. I have a coredump; could someone
> assist me in debugging this, please?
>
> (gdb) bt
> #0  0x00007f04dcab6207 in raise () from /lib64/libc.so.6
> #1  0x00007f04dcab78f8 in abort () from /lib64/libc.so.6
> #2  0x00007f04d2a9d6c5 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib64/ceph/libceph-common.so.0
> #3  0x00007f04d2a9d844 in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib64/ceph/libceph-common.so.0
> #4  0x00007f04cc807f04 in Client::_lookup_name(Inode*, Inode*, UserPerm const&) () from /lib64/libcephfs.so.2
> #5  0x00007f04cc81c41f in Client::ll_lookup_inode(inodeno_t, UserPerm const&, Inode**) () from /lib64/libcephfs.so.2
> #6  0x00007f04ccadbf0e in create_handle (export_pub=0x1baff10, desc=<optimized out>, pub_handle=0x7f0470fd4718, attrs_out=0x7f0470fd4740) at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/FSAL_CEPH/export.c:256
> #7  0x0000000000523895 in mdcache_locate_host (fh_desc=0x7f0470fd4920, export=export@entry=0x1bafbf0, entry=entry@entry=0x7f0470fd48b8, attrs_out=attrs_out@entry=0x0) at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:1011
> #8  0x000000000051d278 in mdcache_create_handle (exp_hdl=0x1bafbf0, fh_desc=<optimized out>, handle=0x7f0470fd4900, attrs_out=0x0) at /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1578
> #9  0x000000000046d404 in nfs4_mds_putfh (data=data@entry=0x7f0470fd4ea0) at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:211
> #10 0x000000000046d8e8 in nfs4_op_putfh (op=0x7f03effaf1d0, data=0x7f0470fd4ea0, resp=0x7f03ec1de1f0) at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:281
> #11 0x000000000045d120 in nfs4_Compound (arg=<optimized out>, req=<optimized out>, res=0x7f03ec1de9d0) at /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
> #12 0x00000000004512cd in nfs_rpc_process_request (reqdata=0x7f03ee5ed4b0) at /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
> #13 0x0000000000450766 in nfs_rpc_decode_request (xprt=0x7f02180c2320, xdrs=0x7f03ec568ab0) at /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
> #14 0x00007f04df45d07d in svc_rqst_xprt_task (wpe=0x7f02180c2538) at /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
> #15 0x00007f04df45d59a in svc_rqst_epoll_events (n_events=<optimized out>, sr_rec=0x4bb53e0) at /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
> #16 svc_rqst_epoll_loop (sr_rec=<optimized out>) at /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1014
> #17 svc_rqst_run_task (wpe=0x4bb53e0) at /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1050
> #18 0x00007f04df465123 in work_pool_thread (arg=0x7f044c0008c0) at /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/work_pool.c:181
> #19 0x00007f04dda05dd5 in start_thread () from /lib64/libpthread.so.0
> #20 0x00007f04dcb7dead in clone () from /lib64/libc.so.6
>
> Package versions:
>
> nfs-ganesha-2.7.3-0.1.el7.x86_64
> nfs-ganesha-ceph-2.7.3-0.1.el7.x86_64
> libcephfs2-14.2.1-0.el7.x86_64
> librados2-14.2.1-0.el7.x86_64
>
> I notice my Ceph log has a bunch of slow requests around the time
> it went down; I'm not sure whether they're a symptom of Ganesha segfaulting
> or a contributing factor.
>
> Thanks,
> David
>