Testing with version 2.7.2 of Ganesha with the following patch applied.
https://github.com/nfs-ganesha/nfs-ganesha/commit/23c05a5a3e37a8bd960073e...
Evidence in the log files points to the crash occurring because mdc_lookup_uncached can
free the dirent that mdcache_readdir_chunked is still using. mdcache_avl_insert finds an
existing dirent with the same name, so it removes and frees the old dirent and inserts the
new one. mdcache_readdir_chunked still holds a pointer to the dirent that was freed.
Suggested change in mdcache_readdir_chunked (circa line 3046): if the call to
mdc_lookup_uncached returns success, jump back to the "again" label to fetch the dirent
again. (I am not sure which state needs to be restored before jumping back to "again".)
  status = mdc_lookup_uncached(directory, dirent->name,
                               &entry, NULL);
  if (FSAL_IS_ERROR(status)) {
      . . .
      return status;
  }
+ mdcache_put(entry);
+ first_pass = true;
+ chunk = NULL;
+ goto again;
}
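The hazard and the suggested retry can be modeled outside Ganesha with a small,
self-contained sketch. Everything in it (toy_dirent, toy_dir, toy_insert,
toy_lookup_uncached, toy_readdir_one) is a hypothetical stand-in, not Ganesha code: the
one-slot "directory" replaces the real AVL trees and chunks, and the key comparison replaces
the failed lookup by key. It only illustrates the pattern of dropping the cached pointer
and fetching it again after a call that may free it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cached directory entry (not Ganesha's struct). */
struct toy_dirent {
	char name[32];
	unsigned long key;	/* stands in for the dirent's cache key */
};

/* Hypothetical one-slot directory cache. */
struct toy_dir {
	struct toy_dirent *slot;
	unsigned frees;		/* number of dirents freed by toy_insert */
};

/*
 * Models the behaviour described above for the insert path: if a dirent
 * already exists and its key does not match, the old dirent is freed and
 * a new one is inserted in its place.
 */
static void toy_insert(struct toy_dir *dir, const char *name, unsigned long key)
{
	struct toy_dirent *old = dir->slot;
	struct toy_dirent *fresh;

	if (old != NULL) {
		if (strcmp(old->name, name) == 0 && old->key == key)
			return;		/* identical dirent: keep it */
		free(old);		/* any pointer to 'old' is now dangling */
		dir->frees++;
	}
	fresh = calloc(1, sizeof(*fresh));
	if (fresh == NULL)
		abort();
	strncpy(fresh->name, name, sizeof(fresh->name) - 1);
	fresh->key = key;
	dir->slot = fresh;
}

/* Models a lookup by name that refreshes the cache from the backend. */
static void toy_lookup_uncached(struct toy_dir *dir, const char *name,
				unsigned long backend_key)
{
	toy_insert(dir, name, backend_key);
}

/* Reader that never touches a dirent pointer across the refreshing call. */
static unsigned long toy_readdir_one(struct toy_dir *dir,
				     unsigned long backend_key,
				     unsigned *passes)
{
	struct toy_dirent *dirent;
	char name[32];

again:
	(*passes)++;
	dirent = dir->slot;		/* (re)fetch the pointer from the cache */
	assert(dirent != NULL);
	if (dirent->key != backend_key) {
		/* Copy the name first: the call below may free 'dirent'. */
		snprintf(name, sizeof(name), "%s", dirent->name);
		toy_lookup_uncached(dir, name, backend_key);
		goto again;		/* discard the possibly stale pointer */
	}
	return dirent->key;
}
```

In this model, a directory seeded with a stale key for created-on-rd-1 takes two passes:
the first pass frees and replaces the dirent, and the second pass reads the replacement.
Which real state (first_pass, chunk, locks, references) must be reset before the jump in
mdcache_readdir_chunked is not answered by this sketch.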
Relevant lines from the log file:
mdcache_handle.c:557 :mdcache_readdir :NFS READDIR :DEBUG :NFS READDIR: DEBUG: Calling
mdcache_readdir_chunked whence=0
mdcache_helpers.c:2934 :mdcache_readdir_chunked :NFS READDIR :F_DBG :NFS READDIR:
FULLDEBUG: found dirent in cached chunk 0x51e375e0 dirent 0x4a2eadf0 created-on-rd-1
mdcache_helpers.c:2976 :mdcache_readdir_chunked :NFS READDIR :F_DBG :NFS READDIR:
FULLDEBUG: Lookup by key for created-on-rd-1 failed, lookup by name now
(gets the write lock and repeats)
mdcache_helpers.c:663 :mdcache_new_entry :INODE :DEBUG :Adding a REGULAR_FILE,
entry=0x17ab9150
mdcache_helpers.c:758 :mdcache_new_entry :INODE :F_DBG :New entry 0x17ab9150 added with
fh_hk.key hk=d68582e4df05b9f5 fsal=0x7f4e48b0ed20 key=0xfb956a03000000000100
mdcache_handle.c:112 :mdcache_alloc_and_check_handle :INODE :F_DBG :lookup Created entry
0x17ab9150 FSAL FOO for created-on-rd-1
mdcache_helpers.c:1447 :mdcache_dirent_add :INODE :F_DBG :Add dir entry created-on-rd-1
mdcache_avl.c:327 :mdcache_avl_insert :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Insert
dir entry 0x51731610 created-on-rd-1
mdcache_avl.c:385 :mdcache_avl_insert :NFS READDIR :DEBUG :NFS READDIR: DEBUG: Already
existent when inserting new dirent on entry=0x329efe50 name=created-on-rd-1
mdcache_avl.c:406 :mdcache_avl_insert :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Keys
for created-on-rd-1 don't match v=hk=d68582e4df05b9f5 fsal=0x7f4e48b0ed20
key=0xfb956a03000000000100 v2=hk=41227e96857f4696 fsal=0x7f4e48b0ed20
key=0xe08a5903000000000100
mdcache_avl.c:171 :unchunk_dirent :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Unchunking
0x4a2eadf0 created-on-rd-1
mdcache_avl.c:249 :mdcache_avl_remove :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Just
freed dirent 0x4a2eadf0 from chunk 0x51e375e0 parent 0x329efe50
mdcache_avl.c:373 :mdcache_avl_insert :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG:
Inserted dirent created-on-rd-1 with ckey hk=d68582e4df05b9f5 fsal=0x7f4e48b0ed20
key=0xfb956a03000000000100
Seen in long-running tests with a Windows client doing a readdir over 5 million files while
other threads do I/O.
Reproduced with fsal, nfs_readdir, and inode_cache debug turned on.
I am happy to provide any additional debug info from the logs that will help.
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `bin/ganesha.nfsd -f etc/ganesha/ganesha.conf -p var/run/ganesha.pid
-F'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000005146e8 in display_opaque_bytes (dspbuf=0x7f4e416a35b0,
value=0x1f72a01d349d6820,
len=1219554592) at /src/src/log/display.c:364
364 /src/src/log/display.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install sgw-nfs-ganesha-2.0.32.0-1.x86_64
(gdb) bt
#0 0x00000000005146e8 in display_opaque_bytes (dspbuf=0x7f4e416a35b0,
value=0x1f72a01d349d6820,
len=1219554592) at /src/src/log/display.c:364
#1 0x000000000053a8be in display_mdcache_key (dspbuf=0x7f4e416a35b0, key=0x6096508)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:858
#2 0x000000000053a9cd in mdcache_find_keyed_reason (key=0x6096508,
entry=0x7f4e416a3738,
reason=MDC_REASON_SCAN) at
/src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:890
#3 0x0000000000542603 in mdcache_readdir_chunked (directory=0x329efe50, whence=0,
dir_state=0x7f4e416a3900,
cb=0x43225c <populate_dirent>, attrmask=122830, eod_met=0x7f4e416a3ffb)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:2968
#4 0x0000000000530387 in mdcache_readdir (dir_hdl=0x329efe88, whence=0x7f4e416a38e0,
dir_state=0x7f4e416a3900, cb=0x43225c <populate_dirent>, attrmask=122830,
eod_met=0x7f4e416a3ffb)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:559
#5 0x0000000000432b83 in fsal_readdir (directory=0x329efe88, cookie=0,
nbfound=0x7f4e416a3ffc,
eod_met=0x7f4e416a3ffb, attrmask=122830, cb=0x493020
<nfs3_readdirplus_callback>, opaque=0x7f4e416a3fb0)
at /src/src/FSAL/fsal_helper.c:1164
#6 0x0000000000492e79 in nfs3_readdirplus (arg=0x458a3d48, req=0x458a3640,
res=0xee64330)
at /src/src/Protocols/NFS/nfs3_readdirplus.c:310
#7 0x0000000000457d0e in nfs_rpc_process_request (reqdata=0x458a3640)
at /src/src/MainNFSD/nfs_worker_thread.c:1328
(gdb) select-frame 3
(gdb) print *dirent
$1 = {chunk_list = {next = 0x0, prev = 0xc1}, chunk = 0x41e3a6a0, node_name = {left =
0x4a2eade0, right = 0x0,
parent = 0}, node_ck = {left = 0x0, right = 0x2, parent = 0}, node_sorted = {left =
0x0, right = 0x0,
parent = 0}, ck = 0, eod = false, namehash = 0, ckey = {hk = 0, fsal =
0xc9e3c5e3b0667c56, kv = {
addr = 0x1f72a01d349d6820, len = 139974203731232}}, flags = 0, name = 0x0,
name_buffer = 0x6096538 ""}