[NFS-Ganesha-Devel] Ganesha crashed in mdcache_clean_dirent_chunk

Thursday, 7 March 2019

(gdb) bt
#0  0x00007f0e1193759b in raise () from /lib64/libpthread.so.0
#1  0x0000000000448dd4 in crash_handler (signo=11,
info=0x7f0e0d3fcdf0, ctx=0x7f0e0d3fccc0) at
/usr/src/debug/nfs-ganesha-2.7.1/MainNFSD/nfs_init.c:246
#2  <signal handler called>
#3  0x000000000054777e in mdcache_clean_dirent_chunk
(chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:448
#4  0x00000000005386c2 in lru_clean_chunk (chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2078
#5  0x000000000053883b in mdcache_lru_unref_chunk
(chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2097
#6  0x00000000005369ac in chunk_lru_run_lane (lane=3) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1509
#7  0x0000000000536d46 in chunk_lru_run (ctx=0x7f0e0dc0f080) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1563
#8  0x0000000000508af9 in fridgethr_start_routine (arg=0x7f0e0dc0f080)
at /usr/src/debug/nfs-ganesha-2.7.1/support/fridgethr.c:550
#9  0x00007f0e1192fe25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f0e11237bad in clone () from /lib64/libc.so.6

(gdb) f 3
#3  0x000000000054777e in mdcache_clean_dirent_chunk
(chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:448
448 glist_for_each_safe(glist, glistn, &chunk->dirents) {

(gdb) info locals
glist = 0x0
glistn = 0x0
parent = 0x7f0dd8c68200

In the case above, I can see that the content lock on the parent is not held.
A parallel rmdir/rename on the dirent that glist points to could set
glist->next to NULL, triggering the crash.
In chunk_lru_run_lane() we release the qlock and then reacquire it
in mdcache_lru_unref_chunk().
Without the content lock held in that window,
mdc_try_get_cached() could sneak in, take the qlock to bump the
chunk's LRU position, and return the cached dirent.
Also, in 2.6, mdcache_clean_dirent_chunk() was always called with the
content lock held by both of its callers, lru_reap_chunk_impl and
lru_remove_chunk. That is probably why this crash was not seen with 2.6.
Does this seem to be a plausible scenario?
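
To make the suspected sequence concrete, here is a minimal single-threaded
sketch of it. It is not Ganesha code: struct list_node, list_del(),
walk_checked() and the hook are hypothetical stand-ins, and it assumes that
deleting an entry clears that entry's next/prev pointers and that the "safe"
iterator saves the successor before visiting each entry. The hook simulates
another thread running between two loop steps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list node. */
struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_node *head, struct list_node *node)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink a node and clear its links (assumption: delete behaves this way). */
static void list_del(struct list_node *node)
{
	assert(node->next != NULL && node->prev != NULL);
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = NULL;
	node->prev = NULL;
}

typedef void (*visit_hook)(struct list_node *succ, struct list_node *head);

/*
 * Walk the list the way a "safe" iterator does: save the successor, visit,
 * then step to the saved successor. Returns the number of entries visited,
 * or -1 at the point where an unchecked iterator would dereference NULL.
 */
static int walk_checked(struct list_node *head, visit_hook hook)
{
	struct list_node *node = head->next;
	struct list_node *noden;
	int visited = 0;

	if (node == NULL)
		return -1;
	noden = node->next;
	while (node != head) {
		visited++;
		if (hook != NULL)
			hook(noden, head);	/* simulated concurrent thread */
		node = noden;
		if (node == NULL)
			return -1;	/* unchecked code would crash on node->next */
		noden = node->next;
	}
	return visited;
}

static int deleted_once;

/* Simulated racing rmdir/rename: delete the saved successor exactly once. */
static void delete_successor_once(struct list_node *succ,
				  struct list_node *head)
{
	if (!deleted_once && succ != NULL && succ != head) {
		list_del(succ);
		deleted_once = 1;
	}
}

/* No interference: all three entries are visited. */
int demo_no_race(void)
{
	struct list_node head, a, b, c;

	list_init(&head);
	list_add_tail(&head, &a);
	list_add_tail(&head, &b);
	list_add_tail(&head, &c);
	return walk_checked(&head, NULL);
}

/* Successor deleted mid-walk: the iterator steps onto a NULL link. */
int demo_race(void)
{
	struct list_node head, a, b, c;

	deleted_once = 0;
	list_init(&head);
	list_add_tail(&head, &a);
	list_add_tail(&head, &b);
	list_add_tail(&head, &c);
	return walk_checked(&head, delete_successor_once);
}
```

In this sketch, at the point where walk_checked() returns -1, both node and
noden are NULL, which would match the glist = 0x0 and glistn = 0x0 seen in
frame 3 if the real iterator behaves as assumed here.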

Ashish
