Something like this:
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/447484
On Thu, Mar 7, 2019 at 7:25 AM Ashish Sangwan <ashishsangwan2(a)gmail.com> wrote:
>
> (gdb) bt
> #0 0x00007f0e1193759b in raise () from /lib64/libpthread.so.0
> #1 0x0000000000448dd4 in crash_handler (signo=11,
> info=0x7f0e0d3fcdf0, ctx=0x7f0e0d3fccc0) at
> /usr/src/debug/nfs-ganesha-2.7.1/MainNFSD/nfs_init.c:246
> #2 <signal handler called>
> #3 0x000000000054777e in mdcache_clean_dirent_chunk
> (chunk=0x7f0df5883470) at
> /usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:448
> #4 0x00000000005386c2 in lru_clean_chunk (chunk=0x7f0df5883470) at
> /usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2078
> #5 0x000000000053883b in mdcache_lru_unref_chunk
> (chunk=0x7f0df5883470) at
> /usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2097
> #6 0x00000000005369ac in chunk_lru_run_lane (lane=3) at
> /usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1509
> #7 0x0000000000536d46 in chunk_lru_run (ctx=0x7f0e0dc0f080) at
> /usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1563
> #8 0x0000000000508af9 in fridgethr_start_routine (arg=0x7f0e0dc0f080)
> at /usr/src/debug/nfs-ganesha-2.7.1/support/fridgethr.c:550
> #9 0x00007f0e1192fe25 in start_thread () from /lib64/libpthread.so.0
> #10 0x00007f0e11237bad in clone () from /lib64/libc.so.6
>
> (gdb) f 3
> #3 0x000000000054777e in mdcache_clean_dirent_chunk
> (chunk=0x7f0df5883470) at
> /usr/src/debug/nfs-ganesha-2.7.1/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:448
> 448 glist_for_each_safe(glist, glistn, &chunk->dirents) {
>
> (gdb) info locals
> glist = 0x0
> glistn = 0x0
> parent = 0x7f0dd8c68200
>
> I can see that in the above case, the content lock on the parent is not held.
> A parallel rmdir/rename on the dirent pointed to by glist could set
> glist->next to NULL, triggering the crash.
> In chunk_lru_run_lane() we release the qlock and then reacquire it
> in mdcache_lru_unref_chunk().
> Without the content lock held in the above scenario,
> mdc_try_get_cached() could sneak in, take the qlock to bump the
> chunk's LRU position, and return the cached dirent.
> Also, in 2.6, mdcache_clean_dirent_chunk() was always called with the
> content lock held by both callers, lru_reap_chunk_impl and
> lru_remove_chunk. That is probably why this crash was not seen with 2.6.
> Does this seem to be a plausible scenario?
>
> Ashish