Hello,
We saw a crash at the following line in mdcache_get_chunk(), because prev_chunk's dirent list was empty:
    chunk->reload_ck = glist_last_entry(&prev_chunk->dirents,
                                        mdcache_dir_entry_t,
                                        chunk_list)->ck;
The backtrace of the coredump is at the end of the mail.
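To illustrate the failure mode, here is a minimal, self-contained sketch of reading the last cookie of a chunk while guarding against an empty dirent list. The demo_* types and functions are hypothetical stand-ins, not Ganesha's glist macros or mdcache structures, and this is not necessarily what the eventual fix does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for a circular doubly linked list head. */
struct glist_head {
	struct glist_head *next;
	struct glist_head *prev;
};

/* Hypothetical, cut-down dirent: only the fields this sketch needs. */
struct demo_dirent {
	uint64_t ck;
	struct glist_head chunk_list;
};

/* Hypothetical, cut-down chunk. */
struct demo_chunk {
	struct glist_head dirents;
};

static void demo_list_init(struct glist_head *head)
{
	head->next = head;
	head->prev = head;
}

static int demo_list_empty(const struct glist_head *head)
{
	return head->next == head;
}

static void demo_list_add_tail(struct glist_head *head,
			       struct glist_head *node)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Return the last dirent of the chunk, or NULL if it has none. */
static struct demo_dirent *demo_last_dirent(struct demo_chunk *chunk)
{
	if (demo_list_empty(&chunk->dirents))
		return NULL;
	return (struct demo_dirent *)((char *)chunk->dirents.prev -
			offsetof(struct demo_dirent, chunk_list));
}

/* Store the last cookie of prev in *reload_ck.
 * Returns 1 on success, 0 if prev has no dirents (nothing is written).
 */
static int demo_reload_ck(struct demo_chunk *prev, uint64_t *reload_ck)
{
	struct demo_dirent *last;

	assert(prev != NULL && reload_ck != NULL);
	last = demo_last_dirent(prev);
	if (last == NULL)
		return 0;
	*reload_ck = last->ck;
	return 1;
}
```

The point of the sketch is only that the caller gets an explicit "no dirents" result instead of dereferencing whatever the last-entry lookup yields for an empty list.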
I could reproduce a similar crash as follows:
1. In mdcache_readdir_chunked(), insert a sleep(30) after the content_lock is released and before it is re-acquired for writing:
again:
	/* Get here on first pass, retry if we don't hold the write lock,
	 * and repeated passes if we need to fetch another chunk.
	 */

	LogFullDebugAlt(COMPONENT_NFS_READDIR, COMPONENT_CACHE_INODE,
			"Readdir chunked next_ck=0x%"PRIx64" look_ck=%"PRIx64,
			next_ck, look_ck);

	if (look_ck == 0 ||
	    !mdcache_avl_lookup_ck(directory, look_ck, &dirent)) {
		fsal_status_t status;
		/* This starting position isn't in our cache...
		 * Go populate the cache and process from there.
		 */
		if (!has_write) {
			/* Upgrade to write lock and retry just in case
			 * another thread managed to populate this cookie
			 * in the meantime.
			 */
			PTHREAD_RWLOCK_unlock(&directory->content_lock);
			sleep(30); // Sleep here
			PTHREAD_RWLOCK_wrlock(&directory->content_lock);
			has_write = true;
			goto again;
		}
2. From the first client, run 'ls' inside a mounted directory of an export. The 'ls' blocks because of the sleep() in mdcache_readdir_chunked().
3. Immediately, from a second client, remove all the entries inside the same directory of the same export.
4. Once the sleep() ends, ganesha crashes because 'prev_chunk' is no longer valid in mdcache_readdir_chunked().
The backtrace from the core dump follows for reference. The code is ganesha 2.5 with the 'readdir' patches taken from https://github.com/dang/nfs-ganesha/tree/v2.5-readdir; its mdcache_readdir_chunked() looks similar to the one in 2.8.
#0 0x00007fae938dc4ab in raise () from /lib64/libpthread.so.0
#1 0x000000000045549e in crash_handler (signo=11,
info=0x7fae25f48eb0, ctx=0x7fae25f48d80) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/MainNFSD/nfs_init.c:225
#2 <signal handler called>
#3 mdcache_get_chunk (parent=0x7faa1001a290,
prev_chunk=0x7fade0206350, whence=2147483647) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:909
#4 0x000000000054fbe9 in mdcache_populate_dir_chunk
(directory=0x7faa1001a290, whence=2147483647, dirent=0x7fae25f49680,
prev_chunk=0x7fade0206350, eod_met=0x7fae25f4967f) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:2659
#5 0x0000000000551767 in mdcache_readdir_chunked
(directory=0x7faa1001a290, whence=2147483647,
dir_state=0x7fae25f49990, cb=0x43310f <populate_dirent>, attrmask=0,
eod_met=0x7fae25f49e8b) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:3053
#6 0x000000000053f39f in mdcache_readdir (dir_hdl=0x7faa1001a2c8,
whence=0x7fae25f49970, dir_state=0x7fae25f49990, cb=0x43310f
<populate_dirent>, attrmask=0, eod_met=0x7fae25f49e8b) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:639
#7 0x00000000004339f3 in fsal_readdir (directory=0x7faa1001a2c8,
cookie=2147483647, nbfound=0x7fae25f49e8c, eod_met=0x7fae25f49e8b,
attrmask=0, cb=0x495d70 <nfs3_readdir_callback>,
opaque=0x7fae25f49e40) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/FSAL/fsal_helper.c:1502
#8 0x0000000000495b57 in nfs3_readdir (arg=0x7fa8b4f75e80,
req=0x7fa8b4f75678, res=0x7faad82c8c70) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/Protocols/NFS/nfs3_readdir.c:289
#9 0x000000000044ccde in nfs_rpc_execute (reqdata=0x7fa8b4f75650) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1290
#10 0x000000000044d4e8 in worker_run (ctx=0x4926600) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1562
#11 0x000000000050c57f in fridgethr_start_routine (arg=0x4926600) at
/usr/src/debug/nfs-ganesha-2.5.3-ibm031.00-0.1.1-Source/support/fridgethr.c:550
(gdb) frame 3
(gdb) p *prev_chunk
$9 = {chunks = {next = 0x7faad808a570, prev = 0x7fade00000d8},
  dirents = {next = 0x7fade0206360, prev = 0x7fade0206360},
  parent = 0x0,
  chunk_lru = {q = {next = 0x0, prev = 0x0}, qid = LRU_ENTRY_L1,
    refcnt = 0, flags = 0, lane = 534, cf = 0},
  reload_ck = 1453366958, next_ck = 0, num_entries = 112}
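As a sanity check on the gdb output above: both dirents pointers are 0x7fade0206360, which is prev_chunk (0x7fade0206350) plus 16 bytes. If chunks is the first member and pointers are 8 bytes, that is the address of the dirents list head itself, which is what an empty circular list looks like. The small sketch below redoes that arithmetic; the demo_* layout is an assumption inferred from the field order gdb printed, not copied from the Ganesha headers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit pointers modeled as uint64_t so the layout is the same on
 * any build host. Hypothetical types, for illustration only.
 */
struct demo_list64 {
	uint64_t next;
	uint64_t prev;
};

/* Only the two leading members of the dumped structure are modeled. */
struct demo_chunk64 {
	struct demo_list64 chunks;
	struct demo_list64 dirents;
};

/* Two 8-byte pointers precede 'dirents'. */
static_assert(offsetof(struct demo_chunk64, dirents) == 16,
	      "dirents is expected 16 bytes into the chunk");

/* Return 1 if the dumped next/prev values both point back at the
 * dirents list head of the chunk at chunk_addr, 0 otherwise.
 */
static int demo_dirents_empty(uint64_t chunk_addr, uint64_t next,
			      uint64_t prev)
{
	uint64_t head = chunk_addr + offsetof(struct demo_chunk64, dirents);

	return next == head && prev == head;
}
```

Calling demo_dirents_empty() with the three addresses from the dump reports an empty list, which matches the crash site even though num_entries still reads 112.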
To fix this I have posted a patch:
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/440079
Thanks,
Madhu Thorat.