I am testing with version 2.7.2 of Ganesha, with the following patch applied:

https://github.com/nfs-ganesha/nfs-ganesha/commit/23c05a5a3e37a8bd960073ef591ace718fca050a

 

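Evidence in the log files points to the following cause of the crash: mdc_lookup_uncached can free the dirent that mdcache_readdir_chunked continues to use.  During the uncached lookup, mdcache_avl_insert finds an existing dirent with the same name but a different key, so it removes and frees the old dirent and inserts the new one.  mdcache_readdir_chunked still holds a pointer to the freed dirent.  In the log excerpt below, the dirent found in the cached chunk (0x4a2eadf0) is the same one that mdcache_avl_remove later reports as freed.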

 

I suggest a change in mdcache_readdir_chunked (around line 3046): if the call to mdc_lookup_uncached succeeds, jump back to the "again" label and look the dirent up afresh instead of continuing with the old pointer.  (I am not sure which state needs to be restored before jumping back to "again".)

                status = mdc_lookup_uncached(directory, dirent->name,
                                             &entry, NULL);

                if (FSAL_IS_ERROR(status)) {
                        . . .
                        return status;
                }

+               mdcache_put(entry);
+               first_pass = true;
+               chunk = NULL;
+               goto again;
        }
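
To show the pattern outside of Ganesha, here is a minimal, self-contained C sketch of "restart the search after a call that may have replaced the entry". It is only a toy model: every name in it (toy_dirent, toy_dir, toy_insert, toy_lookup_uncached, toy_readdir_one) is hypothetical, and it has no locks, refcounts, or chunks, so it says nothing about which state the real mdcache_readdir_chunked must restore. It only assumes what the log shows: an insert by name can free an existing dirent whose key does not match.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_DIR_SLOTS 8

/* Toy stand-in for a cached dirent; NOT the real mdcache structure. */
struct toy_dirent {
	char name[32];
	unsigned long key;	/* stand-in for the cache key */
};

struct toy_dir {
	struct toy_dirent *slots[TOY_DIR_SLOTS];
	unsigned long backend_key;	/* key the backend currently reports */
};

static struct toy_dirent *toy_find(struct toy_dir *dir, const char *name)
{
	for (int i = 0; i < TOY_DIR_SLOTS; i++) {
		if (dir->slots[i] && strcmp(dir->slots[i]->name, name) == 0)
			return dir->slots[i];
	}
	return NULL;
}

/* Insert by name; an existing dirent with the same name is freed. */
static void toy_insert(struct toy_dir *dir, const char *name,
		       unsigned long key)
{
	struct toy_dirent *fresh = calloc(1, sizeof(*fresh));
	int i;

	if (fresh == NULL)
		abort();
	strncpy(fresh->name, name, sizeof(fresh->name) - 1);
	fresh->key = key;

	for (i = 0; i < TOY_DIR_SLOTS; i++) {
		struct toy_dirent *old = dir->slots[i];

		if (old != NULL && strcmp(old->name, name) == 0) {
			free(old);	/* pointers to old now dangle */
			dir->slots[i] = fresh;
			return;
		}
	}
	for (i = 0; i < TOY_DIR_SLOTS; i++) {
		if (dir->slots[i] == NULL) {
			dir->slots[i] = fresh;
			return;
		}
	}
	abort();		/* toy table is full */
}

/* Models a lookup by name that re-inserts the dirent with the current key. */
static void toy_lookup_uncached(struct toy_dir *dir, const char *name)
{
	toy_insert(dir, name, dir->backend_key);
}

/* Returns the key for name, or 0 if absent; *passes counts the searches. */
static unsigned long toy_readdir_one(struct toy_dir *dir, const char *name,
				     int *passes)
{
	struct toy_dirent *dirent;

	*passes = 0;
again:
	(*passes)++;
	dirent = toy_find(dir, name);
	if (dirent == NULL)
		return 0;

	if (dirent->key != dir->backend_key) {
		/* "Lookup by key" failed; the refresh may free dirent. */
		toy_lookup_uncached(dir, name);
		dirent = NULL;	/* never touch the old pointer again */
		goto again;
	}
	return dirent->key;
}

static void toy_dir_clear(struct toy_dir *dir)
{
	for (int i = 0; i < TOY_DIR_SLOTS; i++) {
		free(dir->slots[i]);
		dir->slots[i] = NULL;
	}
}
```

In this sketch the caller owns the name string, so the refresh never reads from the dirent it is about to free. A caller would seed a toy_dir with a stale entry via toy_insert, call toy_readdir_one, and release everything with toy_dir_clear.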

 

Relevant lines from the log file:

mdcache_handle.c:557 :mdcache_readdir :NFS READDIR :DEBUG :NFS READDIR: DEBUG: Calling mdcache_readdir_chunked whence=0

mdcache_helpers.c:2934 :mdcache_readdir_chunked :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: found dirent in cached chunk 0x51e375e0 dirent 0x4a2eadf0 created-on-rd-1

mdcache_helpers.c:2976 :mdcache_readdir_chunked :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Lookup by key for created-on-rd-1 failed, lookup by name now

(gets the write lock and repeats)

mdcache_helpers.c:663 :mdcache_new_entry :INODE :DEBUG :Adding a REGULAR_FILE, entry=0x17ab9150

mdcache_helpers.c:758 :mdcache_new_entry :INODE :F_DBG :New entry 0x17ab9150 added with fh_hk.key hk=d68582e4df05b9f5 fsal=0x7f4e48b0ed20 key=0xfb956a03000000000100

mdcache_handle.c:112 :mdcache_alloc_and_check_handle :INODE :F_DBG :lookup Created entry 0x17ab9150 FSAL FOO for created-on-rd-1

mdcache_helpers.c:1447 :mdcache_dirent_add :INODE :F_DBG :Add dir entry created-on-rd-1

mdcache_avl.c:327 :mdcache_avl_insert :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Insert dir entry 0x51731610 created-on-rd-1

mdcache_avl.c:385 :mdcache_avl_insert :NFS READDIR :DEBUG :NFS READDIR: DEBUG: Already existent when inserting new dirent on entry=0x329efe50 name=created-on-rd-1

mdcache_avl.c:406 :mdcache_avl_insert :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Keys for created-on-rd-1 don't match v=hk=d68582e4df05b9f5 fsal=0x7f4e48b0ed20 key=0xfb956a03000000000100 v2=hk=41227e96857f4696 fsal=0x7f4e48b0ed20 key=0xe08a5903000000000100

mdcache_avl.c:171 :unchunk_dirent :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Unchunking 0x4a2eadf0 created-on-rd-1

mdcache_avl.c:249 :mdcache_avl_remove :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Just freed dirent 0x4a2eadf0 from chunk 0x51e375e0 parent 0x329efe50

mdcache_avl.c:373 :mdcache_avl_insert :NFS READDIR :F_DBG :NFS READDIR: FULLDEBUG: Inserted dirent created-on-rd-1 with ckey hk=d68582e4df05b9f5 fsal=0x7f4e48b0ed20 key=0xfb956a03000000000100

 

The crash occurs in long-running tests in which a Windows client does a readdir over 5 million files while other threads do I/O.

It was reproduced with fsal, nfs_readdir, and inode_cache debug logging turned on.

I am happy to provide any additional debug info from the logs that will help.

 

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `bin/ganesha.nfsd -f etc/ganesha/ganesha.conf -p var/run/ganesha.pid -F'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000005146e8 in display_opaque_bytes (dspbuf=0x7f4e416a35b0, value=0x1f72a01d349d6820,
    len=1219554592) at /src/src/log/display.c:364
364    /src/src/log/display.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install sgw-nfs-ganesha-2.0.32.0-1.x86_64
(gdb) bt
#0  0x00000000005146e8 in display_opaque_bytes (dspbuf=0x7f4e416a35b0, value=0x1f72a01d349d6820,
    len=1219554592) at /src/src/log/display.c:364
#1  0x000000000053a8be in display_mdcache_key (dspbuf=0x7f4e416a35b0, key=0x6096508)
    at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:858
#2  0x000000000053a9cd in mdcache_find_keyed_reason (key=0x6096508, entry=0x7f4e416a3738,
    reason=MDC_REASON_SCAN) at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:890
#3  0x0000000000542603 in mdcache_readdir_chunked (directory=0x329efe50, whence=0, dir_state=0x7f4e416a3900,
    cb=0x43225c <populate_dirent>, attrmask=122830, eod_met=0x7f4e416a3ffb)
    at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:2968
#4  0x0000000000530387 in mdcache_readdir (dir_hdl=0x329efe88, whence=0x7f4e416a38e0,
    dir_state=0x7f4e416a3900, cb=0x43225c <populate_dirent>, attrmask=122830, eod_met=0x7f4e416a3ffb)
    at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:559
#5  0x0000000000432b83 in fsal_readdir (directory=0x329efe88, cookie=0, nbfound=0x7f4e416a3ffc,
    eod_met=0x7f4e416a3ffb, attrmask=122830, cb=0x493020 <nfs3_readdirplus_callback>, opaque=0x7f4e416a3fb0)
    at /src/src/FSAL/fsal_helper.c:1164
#6  0x0000000000492e79 in nfs3_readdirplus (arg=0x458a3d48, req=0x458a3640, res=0xee64330)
    at /src/src/Protocols/NFS/nfs3_readdirplus.c:310
#7  0x0000000000457d0e in nfs_rpc_process_request (reqdata=0x458a3640)
    at /src/src/MainNFSD/nfs_worker_thread.c:1328

 

(gdb) select-frame 3
(gdb) print *dirent
$1 = {chunk_list = {next = 0x0, prev = 0xc1}, chunk = 0x41e3a6a0, node_name = {left = 0x4a2eade0, right = 0x0,
    parent = 0}, node_ck = {left = 0x0, right = 0x2, parent = 0}, node_sorted = {left = 0x0, right = 0x0,
    parent = 0}, ck = 0, eod = false, namehash = 0, ckey = {hk = 0, fsal = 0xc9e3c5e3b0667c56, kv = {
      addr = 0x1f72a01d349d6820, len = 139974203731232}}, flags = 0, name = 0x0, name_buffer = 0x6096538 ""}