If it happens regularly, then maybe we can instrument with ASAN? If
it's a refcount bug, then lttng would be very helpful.
On Fri, Jan 31, 2020 at 1:48 AM Madhu P Punjabi
<madhu.punjabi(a)in.ibm.com> wrote:
Hi Daniel,
Thank you very much for the response.
>> There appears to be an unexport in process. The entry is in CLEANUP,
and that only happens when the export containing it is being removed,
but the entry is in use (in this case by being the parent of a chunk).
So it should be cleaning up here, but apparently the sub-handle was
freed out from under the entry.
- The customer sees crash frequently in the same code path, sometimes while accessing the
"entry->sub_handle" in mdcache_lru_clean(..) and sometimes when
mdcache_lru_clean(..) calls fsal_obj_handle_fini(&entry->obj_handle). Will check if
an unexport was in progress. But some of the entry's fields have unexpected values, so
wondering if we can trust the current value of entru->lru.qid (here LRU_ENTRY_CLEANUP)
or not ? For example, entry->obj_handle has some garbage values:
(gdb) p entry->obj_handle
$27 = {handles = {next = 0x2886290afa5892f8, prev = 0x3d01000001000000}, fs = 0x3013836,
fsal = 0x6e0000000074e963, obj_ops = 0x8400000003013833, obj_lock = {__data = {
__lock = 318767104, __nr_readers = 3, __readers_wakeup = 20461166, __writer_wakeup
= 448, __nr_readers_queued = 1, __nr_writers_queued = 18, __writer = 1146909539,
__shared = 1751476595, __pad1 = 7596798812954843694, __pad2 = 4848124998864338967,
__flags = 1146905448},
__size =
"\023\000\000\000\000\000\000\003\001\070\066n\000\000\001\300\000\000\000\001\000\000\000\022D\\XXXXXXXX(a)XXXX.XX\000\000\000\000\000\027D\\chab-p",
__align = 1369094286720630787}, type = 1752789805, fsid = {major =
4995697438821140480, minor = 0}, fileid = 512, state_hdl = 0x5dfc43d5}
>> Are there any logs associated with this incident?
Don't have logs around the time when the crash was seen. Will ask for logs.
Thanks,
Madhu Thorat
IBM-India Software Labs, Pune.
----- Original message -----
From: Daniel Gryniewicz <dang(a)redhat.com>
To: Madhu P Punjabi <madhu.punjabi(a)in.ibm.com>
Cc: devel(a)lists.nfs-ganesha.org, ffilzlnx(a)mindspring.com
Subject: [EXTERNAL] Re: Crash in mdcache_lru_clean(..) with 2.7.6
Date: Thu, Jan 30, 2020 7:40 PM
There appears to be an unexport in process. The entry is in CLEANUP,
and that only happens when the export containing it is being removed,
but the entry is in use (in this case by being the parent of a chunk).
So it should be cleaning up here, but apparently the sub-handle was
freed out from under the entry.
Are there any logs associated with this incident?
Daniel
On 1/29/20 1:02 PM, Madhu P Punjabi wrote:
> Hi Daniel,
> Thank you very much for the response. Our customer is seeing the crash
> with ganesha 2.7.6 which already has the below patches you referred. We
> have the customer coredump, so can provide more information using it if
> needed..
> >> commit f0d5b8d4f6dcce4597459c3ebaf36d1e96e4645b
> >> MDCACHE - Fix chunk refcounting in readdir
>
> >> Also this one:
> >> 11e0e375e40658267cbf449afacaa53a136f7097
> >> MDCACHE - Fix race between lru functions for the chunk and the parent
> Thanks,
> Madhu Thorat
> IBM-India Software Labs, Pune.
>
> ----- Original message -----
> From: Daniel Gryniewicz <dang(a)redhat.com>
> To: Madhu P Punjabi <madhu.punjabi(a)in.ibm.com>,
> devel(a)lists.nfs-ganesha.org, ffilzlnx(a)mindspring.com
> Cc:
> Subject: [EXTERNAL] Re: Crash in mdcache_lru_clean(..) with 2.7.6
> Date: Wed, Jan 29, 2020 7:21 PM
> Best guess is that this is related to readdir chunking refcounting.
> Pulling all the chunking fixes is likely to fix this. For example, this
> one:
>
> commit f0d5b8d4f6dcce4597459c3ebaf36d1e96e4645b
> MDCACHE - Fix chunk refcounting in readdir
>
> Also this one:
> 11e0e375e40658267cbf449afacaa53a136f7097
> MDCACHE - Fix race between lru functions for the chunk and the parent
>
> Daniel
>
> On 1/29/20 6:23 AM, Madhu P Punjabi wrote:
> > Hi All,
> > A customer using 2.7.6 reported a crash inmdcache_lru_clean(..) when
> > "entry->sub_handle" had an unexpected address.
> > It is happening when lru_reap_chunk_impl(..) calls mdcache_put(..) in
> > the following code:
> > / 815 if (entry == parent ||
> > 816
> pthread_rwlock_trywrlock(&entry->content_lock)
> > == 0) {
> > ...
> > 819 if (entry != parent) {
> > ...
> > 823 (void)
> > atomic_inc_int32_t(&entry->lru.refcnt);
> > 824 }
> > 825
> > 826 /* Dequeue the chunk so it won't
> show up
> > anymore */
> > 827 CHUNK_LRU_DQ_SAFE(lru, lq);
> > 828 chunk->chunk_lru.qid = LRU_ENTRY_NONE;
> > ...
> > 839 mdcache_clean_dirent_chunk(chunk);
> > 840
> atomic_clear_uint32_t_bits(&entry->mde_flags,
> > 841
> >
> MDCACHE_DIR_POPULATED);
> > 842
> > 843 if (entry != parent) {
> > ...
> > 848
> >
>
PTHREAD_RWLOCK_unlock(&entry->content_lock);
> > * 849 mdcache_put(entry);*
> > 850 }/
> > At the time of crash:
> > /*entries_hiwat = 3000000, entries_used = 3000000, chunks_hiwat =
> > 100000, chunks_used = 100000*/
> > Has anybody seen this kind of crash ? Any patches that may help
> to fix
> > the crash ? We have the customer coredump, so can provide more
> > information using it.
> > Backtrace for reference:
> > /(gdb) bt
> > #0 0x00003fff955294a4 in .raise () from /lib64/libpthread.so.0
> > #1 0x000000001005b564 in crash_handler (signo=11,
> info=0x3ffd73eebee8,
> > ctx=0x3ffd73eeb170) at
> > /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/MainNFSD/nfs_init.c:244
> > #2 <signal handler called>
> > #3 0x00000000101cf5fc in mdcache_lru_clean (entry=0x3ffd78f1d540) at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:581
> > #4 0x00000000101d66ec in _mdcache_lru_unref (entry=0x3ffd78f1d540,
> > flags=0, func=0x10262100 <__func__.21092> "mdcache_put",
line=196)
> > at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2023
> > #5 0x00000000101cda88 in mdcache_put (entry=0x3ffd78f1d540) at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:196
> > #6 0x00000000101d0f3c in lru_reap_chunk_impl (qid=LRU_ENTRY_L2,
> > parent=0x3ffee854d080)
> > at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:849
> > #7 0x00000000101d1314 in mdcache_get_chunk (parent=0x3ffee854d080,
> > prev_chunk=0x0, whence=0)
> > at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:893
> > #8 0x00000000101f6c78 in mdcache_populate_dir_chunk
> > (directory=0x3ffee854d080, whence=0, dirent=0x3ffd73eec988,
> > prev_chunk=0x0, eod_met=0x3ffd73eec990)
> > at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:2572
> > #9 0x00000000101f90a4 in mdcache_readdir_chunked
> > (directory=0x3ffee854d080, whence=0, dir_state=0x3ffd73eecc18,
> > cb=@0x1028aaa8: 0x10044310 <populate_dirent>, attrmask=122830,
> > eod_met=0x3ffd73eecd64) at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:3009
> > #10 0x00000000101dfeb4 in mdcache_readdir (dir_hdl=0x3ffee854d0b8,
> > whence=0x3ffd73eeccd0, dir_state=0x3ffd73eecc18, cb=@0x1028aaa8:
> > 0x10044310 <populate_dirent>,
> > attrmask=122830, eod_met=0x3ffd73eecd64) at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:559
> > #11 0x0000000010044ff4 in fsal_readdir (directory=0x3ffee854d0b8,
> > cookie=0, nbfound=0x3ffd73eecd60, eod_met=0x3ffd73eecd64,
> attrmask=122830,
> > cb=@0x102907c8: 0x100dcfe8 <nfs3_readdirplus_callback>,
> > opaque=0x3ffd73eecd80) at
> > /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/fsal_helper.c:1160
> > #12 0x00000000100dcd64 in nfs3_readdirplus (arg=0x3ff9e8fa79c8,
> > req=0x3ff9e8fa72c0, res=0x3ff9e8ca4b50)
> > at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/Protocols/NFS/nfs3_readdirplus.c:310
> > #13 0x00000000100867f0 in nfs_rpc_process_request
> > (reqdata=0x3ff9e8fa72c0) at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/MainNFSD/nfs_worker_thread.c:1325
> > #14 0x0000000010087340 in nfs_rpc_valid_NFS (req=0x3ff9e8fa72c0) at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/MainNFSD/nfs_worker_thread.c:1545
> > #15 0x00003fff95a0d448 in svc_vc_decode (req=0x3ff9e8fa72c0) at
> > /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/libntirpc/src/svc_vc.c:829
> > #16 0x0000000010070f74 in nfs_rpc_decode_request
> (xprt=0x3ffcb0eeba00,
> > xdrs=0x3ff9e811f090)
> > at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
> > #17 0x00003fff95a0d2a0 in svc_vc_recv (xprt=0x3ffcb0eeba00) at
> > /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/libntirpc/src/svc_vc.c:802
> > #18 0x00003fff95a07884 in svc_rqst_xprt_task (wpe=0x3ffcb0eebc58) at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/libntirpc/src/svc_rqst.c:769
> > #19 0x00003fff95a07f20 in svc_rqst_epoll_events
> (sr_rec=0x1002bca2390,
> > n_events=1) at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/libntirpc/src/svc_rqst.c:941
> > #20 0x00003fff95a082e4 in svc_rqst_epoll_loop
> (sr_rec=0x1002bca2390) at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/libntirpc/src/svc_rqst.c:1014
> > #21 0x00003fff95a08428 in svc_rqst_run_task (wpe=0x1002bca2390) at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/libntirpc/src/svc_rqst.c:1050
> > #22 0x00003fff95a168fc in work_pool_thread (arg=0x3ffd80001f40) at
> >
> /usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/libntirpc/src/work_pool.c:181
> > #23 0x00003fff9551c99c in .start_thread () from
> /lib64/libpthread.so.0
> > #24 0x00003fff953c781c in .__clone () from /lib64/libc.so.6/
> >
> > /(gdb) frame 3
> > #3 0x00000000101cf5fc in mdcache_lru_clean (entry=0x3ffd78f1d540) at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:581
> > 581 subcall(
> > (gdb) list
> > 576 LogCrit(COMPONENT_CACHE_INODE_LRU,
> > 577 "Error closing file in
> cleanup: %s",
> > 578 fsal_err_txt(status));
> > 579 }
> > 580
> > 581 subcall(
> >
> 582
entry->sub_handle->obj_ops->release(entry->sub_handle)
> > 583 );
> > 584 entry->sub_handle = NULL;
> > 585
> > *(gdb) p entry->sub_handle
> > $5 = (struct fsal_obj_handle *) 0x1e489f3800000000 -------->
> this has
> > unexpected address, accessing it causing crash*/
> > /(gdb) frame 6
> > #6 0x00000000101d0f3c in lru_reap_chunk_impl (qid=LRU_ENTRY_L2,
> > parent=0x3ffee854d080)
> > at
> >
>
/usr/src/debug/nfs-ganesha-2.7.5-ibm053.02/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:849
> > 849 mdcache_put(entry);
> > (gdb) p entry
> > $13 = (mdcache_entry_t *) 0x3ffd78f1d540
> > (gdb) p parent
> > $14 = (mdcache_entry_t *) 0x3ffee854d080/
> > /*(gdb) p lru_state
> > $15 = {entries_hiwat = 3000000, entries_used = 3000000,
> chunks_hiwat =
> > 100000, chunks_used = 100000, fds_system_imposed = 160000,
> > fds_hard_limit = 144000, fds_hiwat = 96000,
> > fds_lowat = 32000, futility = 0, per_lane_work = 50,
> biggest_window =
> > 64000, prev_fd_count = 38875, prev_time = 1580013456, fd_state = 0}*/
> >
> > /(gdb) p *entry
> > $7 = {attr_lock = {__data = {__lock = 16381, __nr_readers =
> 2024405152,
> > __readers_wakeup = 16381, __writer_wakeup = 2019009936,
> > __nr_readers_queued = 0,
> > __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 =
> > 2860448219136, __pad2 = 228761534566, __flags = 805318656},
> > __size =
"\000\000?\375x\251\370\240\000\000?\375xW\245\220",
> > '\000' <repeats 18 times>,
> >
>
"\002\232\000\000\000\000\000\000\000\065C@\000f0\000\060\000\n\000\002",
> > __align = 70357883680928}, obj_handle = {handles = {next =
> > 0x2886290afa5892f8, prev = 0x3d01000001000000}, fs = 0x3013836,
> fsal =
> > 0x6e0000000074e963,
> > obj_ops = 0x8400000003013833, obj_lock = {__data = {__lock =
> > 318767104, __nr_readers = 3, __readers_wakeup = 20461166,
> > __writer_wakeup = 448, __nr_readers_queued = 1,
> > __nr_writers_queued = 18, __writer = 1146909539, __shared =
> > 1751476595, __pad1 = 7596798812954843694, __pad2 =
> 4848124998864338967,
> > __flags = 1146905448},
> > __size =
> >
>
"\023\000\000\000\000\000\000\003\001\070\066n\000\000\001\300\000\000\000\001\000\000\000\022D\\XXXXXXXX(a)XXX.XX\000\000\000\000\000\027D\\chab-p",
> > __align = 1369094286720630787}, type = 1752789805, fsid =
> {major
> > = 4995697438821140480, minor = 0}, fileid = 512, state_hdl =
> 0x5dfc43d5},
> > sub_handle = 0x1e489f3800000000, attrs = {request_mask =
> > 6677607623807894576, valid_mask = 1236774467, supported = 3, type =
> > 20461166, filesize = 41, fsid = {
> > major = 9667270145471608893, minor = 932007903232}, acl = 0x0,
> > fileid = 31387111933, mode = 384, numlinks = 2, owner = 172197,
> group =
> > 194403, rawdev = {major = 0,
> > minor = 0}, atime = {tv_sec = 1580013344, tv_nsec =
> 192799000},
> > creation = {tv_sec = 0, tv_nsec = 0}, ctime = {tv_sec = 1580013344,
> > tv_nsec = 197238000}, mtime = {
> > tv_sec = 1580013344, tv_nsec = 195122884}, chgtime = {tv_sec =
> > 1580013344, tv_nsec = 197238000}, spaceused = 512, change =
> 1777251344,
> > generation = 0,
> > expire_time_attr = 60, fs_locations = 0x0, sec_label =
> {slai_lfs =
> > {lfs_lfs = 0, lfs_pi = 0}, slai_data = {slai_data_len = 0,
> slai_data_val
> > = 0x0}}}, fh_hk = {node_k = {
> > left = 0x0, right = 0x0, parent = 70361210201962}, key = {hk =
> > 18326489524089677003, fsal = 0x3fff90ec0000 <GPFS>, kv = {addr =
0x0,
> > len = 0}}, inavl = false},
> > mde_flags = 2, attr_time = 1580013344, acl_time = 1580013344,
> > fs_locations_time = 0, lru = {q = {next = 0x0, prev = 0x0}, qid =
> > LRU_ENTRY_CLEANUP, refcnt = 0, flags = 3,
> > lane = 66, cf = 0}, export_list = {next = 0x3ffd78f1d790, prev =
> > 0x3ffd78f1d790}, first_export_id = -1, content_lock = {__data =
> {__lock
> > = 0, __nr_readers = 0,
> > __readers_wakeup = 0, __writer_wakeup = 0,
> __nr_readers_queued =
> > 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0,
> > __pad2 = 0, __flags = 0},
> > __size = '\000' <repeats 55 times>, __align = 0}, fsobj
= {hdl =
> > {state_lock = {__data = {__lock = 0, __nr_readers = 0,
> __readers_wakeup
> > = 0, __writer_wakeup = 0,
> > __nr_readers_queued = 0, __nr_writers_queued = 0,
> __writer =
> > 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size =
> '\000'
> > <repeats 55 times>, __align = 0},
> > no_cleanup = false, {file = {obj = 0x3ffd78f1d578,
> list_of_states
> > = {next = 0x3ffd78f1d828, prev = 0x3ffd78f1d828}, layoutrecall_list =
> > {next = 0x3ffd78f1d838,
> > prev = 0x3ffd78f1d838}, lock_list = {next =
> 0x3ffd78f1d848,
> > prev = 0x3ffd78f1d848}, nlm_share_list = {next = 0x3ffd78f1d858,
> prev =
> > 0x3ffd78f1d858},
> > write_delegated = false, fdeleg_stats =
> {fds_curr_delegations
> > = 0, fds_deleg_type = OPEN_DELEGATE_NONE, fds_delegation_count = 0,
> > fds_recall_count = 0,
> > fds_avg_hold = 0, fds_last_delegation = 0,
> fds_last_recall
> > = 0, fds_num_opens = 0, fds_first_open = 0}, anon_ops = 0}, dir =
> > {junction_export = 0x3ffd78f1d578,
> > export_roots = {next = 0x3ffd78f1d828, prev =
> > 0x3ffd78f1d828}, exp_root_refcount = 16381}}}, fsdir = {chunks =
> {next =
> > 0x0, prev = 0x0}, detached = {next = 0x0,
> > prev = 0x0}, spin = 0, detached_count = 0, dhdl =
> {state_lock =
> > {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0,
> > __writer_wakeup = 0,
> > __nr_readers_queued = 0, __nr_writers_queued = 0,
> __writer
> > = 16381, __shared = 2029114744, __pad1 = 70357888391208, __pad2 =
> > 70357888391208, __flags = 16381},
> > __size = '\000' <repeats 26 times>,
> >
>
"?\375x\361\325x\000\000?\375x\361\330(\000\000?\375x\361\330(\000\000?\375x\361\330\070",
> > __align = 0}, no_cleanup = false, {
> > file = {obj = 0x3ffd78f1d848, list_of_states = {next =
> > 0x3ffd78f1d848, prev = 0x3ffd78f1d858}, layoutrecall_list = {next =
> > 0x3ffd78f1d858, prev = 0x0}, lock_list = {
> > next = 0x0, prev = 0x0}, nlm_share_list = {next = 0x0,
> > prev = 0x0}, write_delegated = false, fdeleg_stats =
> > {fds_curr_delegations = 0,
> > fds_deleg_type = OPEN_DELEGATE_NONE,
> fds_delegation_count
> > = 0, fds_recall_count = 0, fds_avg_hold = 0, fds_last_delegation = 0,
> > fds_last_recall = 0,
> > fds_num_opens = 0, fds_first_open = 0}, anon_ops = 0},
> > dir = {junction_export = 0x3ffd78f1d848, export_roots = {next =
> > 0x3ffd78f1d848, prev = 0x3ffd78f1d858},
> > exp_root_refcount = 16381}}}, parent = {addr = 0x0,
> len =
> > 0}, parent_time = 1580009483, first_ck = 0, avl = {t = {root = 0x0,
> > cmp_fn = @0x10299918: 0x101fbfa0 <avl_dirent_name_cmpf>,
> > height = -1, first = 0x0, last = 0x0, size = 0}, ck = {root = 0x0,
> > cmp_fn = @0x10299928: 0x101fc08c <avl_dirent_ck_cmpf>,
> height
> > = -1, first = 0x0, last = 0x61, size = 70357874528096}, sorted =
> {root =
> > 0x3ffd7857a9a0,
> > cmp_fn = @0x10299938: 0x101fc140
<avl_dirent_sorted_cmpf>,
> > height = -1, first = 0x0, last = 0x0, size = 0}, collisions = 0}}}} /
> >
> > Thanks,
> > Madhu Thorat.
> >
>
>