(gdb) bt
#0 0x00007f0e1193759b in raise () from /lib64/libpthread.so.0
#1 0x0000000000448dd4 in crash_handler (signo=11,
info=0x7f0e0d3fcdf0, ctx=0x7f0e0d3fccc0) at
/usr/src/debug/nfs-ganesha-2.7.1/MainNFSD/nfs_init.c:246
#2 <signal handler called>
#3 0x000000000054777e in mdcache_clean_dirent_chunk
(chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:448
#4 0x00000000005386c2 in lru_clean_chunk (chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2078
#5 0x000000000053883b in mdcache_lru_unref_chunk
(chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2097
#6 0x00000000005369ac in chunk_lru_run_lane (lane=3) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1509
#7 0x0000000000536d46 in chunk_lru_run (ctx=0x7f0e0dc0f080) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1563
#8 0x0000000000508af9 in fridgethr_start_routine (arg=0x7f0e0dc0f080)
at /usr/src/debug/nfs-ganesha-2.7.1/support/fridgethr.c:550
#9 0x00007f0e1192fe25 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f0e11237bad in clone () from /lib64/libc.so.6
(gdb) f 3
#3 0x000000000054777e in mdcache_clean_dirent_chunk
(chunk=0x7f0df5883470) at
/usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:448
448 glist_for_each_safe(glist, glistn, &chunk->dirents) {
(gdb) info locals
glist = 0x0
glistn = 0x0
parent = 0x7f0dd8c68200
In the case above, the content lock on the parent is not held.
A parallel rmdir/rename on the dirent pointed to by glist could set
glist->next to NULL, triggering the crash.
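
To illustrate why a NULL next pointer is fatal in this kind of loop, here is a minimal sketch. It is not the ganesha source: the struct, the macro and every demo_* name are simplified stand-ins written for this example, and it assumes that unlinking a node clears that node's own next/prev pointers. A "safe" iteration macro prefetches the successor before the loop body runs, so it tolerates the body removing the current node, but it cannot tolerate another thread unlinking a node the iterator still points at.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list head/node. */
struct demo_list {
    struct demo_list *next;
    struct demo_list *prev;
};

static void demo_init(struct demo_list *head)
{
    head->next = head;
    head->prev = head;
}

static void demo_add_tail(struct demo_list *head, struct demo_list *node)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Assumption for this sketch: unlinking clears the node's own pointers. */
static void demo_del(struct demo_list *node)
{
    assert(node->next != NULL && node->prev != NULL);
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = NULL;
    node->prev = NULL;
}

/* Simplified "safe" iteration: reads node->next before the body runs. */
#define demo_for_each_safe(node, noden, head)          \
    for (node = (head)->next, noden = node->next;      \
         node != (head);                               \
         node = noden, noden = node->next)

/* Count entries with the macro; only valid while nobody else edits the list. */
static int demo_count_safe(struct demo_list *head)
{
    struct demo_list *node, *noden;
    int n = 0;

    demo_for_each_safe(node, noden, head) {
        n++;
    }
    return n;
}

/*
 * Walk from `start` until `head` is reached, checking every link.
 * Returns the number of nodes visited, or -1 at the point where the
 * unchecked macro would have dereferenced a NULL pointer.
 */
static int demo_walk_from(struct demo_list *start, struct demo_list *head)
{
    struct demo_list *node = start;
    int n = 0;

    while (node != head) {
        if (node == NULL)
            return -1;
        n++;
        node = node->next;
    }
    return n;
}
```

If an iterator is parked on a node and something else calls demo_del() on that node, demo_walk_from() reports -1 at the step where the unchecked macro would fault. Holding a lock that excludes the unlink for the whole walk avoids this.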
In chunk_lru_run_lane() we release the qlock and then reacquire it
in mdcache_lru_unref_chunk().
Without the content lock held in the above scenario,
mdc_try_get_cached() could sneak in, get the qlock for bumping the
chunk's lru and return the cached dirent.
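
The window described above can be replayed deterministically in a single thread. This is a toy model, not ganesha code: the locks are plain flags, every demo_* name is hypothetical, and it ignores lock ordering entirely. It only shows the ordering problem: once the lane thread drops the qlock, a lookup that needs only the qlock can still hand out a dirent from a chunk that is about to be cleaned, whereas a lookup excluded by the parent's content lock cannot.

```c
#include <stdbool.h>

/* Toy model of a chunk and the two locks discussed; flags stand in for real locks. */
struct demo_chunk {
    bool qlocked;        /* models the LRU lane qlock */
    bool content_locked; /* models the parent's content lock held by the cleaner */
    int dirents;         /* number of cached dirents in the chunk */
};

/* Models a cached lookup: takes the qlock to bump the chunk, then returns a dirent. */
static bool demo_lookup(struct demo_chunk *c)
{
    if (c->content_locked || c->qlocked)
        return false;            /* excluded: a real lookup would block here */
    c->qlocked = true;           /* bump the chunk's LRU position */
    bool found = c->dirents > 0;
    c->qlocked = false;
    return found;
}

/*
 * Replays the interleaving. Returns true if the lookup obtained a dirent
 * that the cleaner freed immediately afterwards.
 */
static bool demo_replay(bool cleaner_takes_content_lock)
{
    struct demo_chunk c = { .dirents = 3 };

    c.qlocked = true;                     /* lane thread selects the chunk */
    if (cleaner_takes_content_lock)
        c.content_locked = true;
    c.qlocked = false;                    /* lane thread drops the qlock */

    bool handed_out = demo_lookup(&c);    /* lookup runs inside the window */

    c.qlocked = true;                     /* unref path retakes the qlock */
    c.dirents = 0;                        /* chunk is cleaned */
    c.qlocked = false;
    c.content_locked = false;

    return handed_out;
}
```

In this model demo_replay(false) lets a dirent escape before the clean, while demo_replay(true) does not.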
Also, in 2.6, mdcache_clean_dirent_chunk() was always called with the
content lock held by both of its callers, lru_reap_chunk_impl() and
lru_remove_chunk(). That is probably why this crash was not seen with 2.6.
Does this seem like a plausible scenario?
Ashish