Daniel,
It is very reproducible in my test setup and I am happy to run a version with additional
debugging and provide you the logs.
Test setup:
File share with 5 million files. (2500 files in 2000 directory)
One thread doing a readdir over the entire listing.
Multiple other threads doing a variety of create/write/read/delete for a set of content.
It seems to reproduce easier and more frequently when the test is running from a Windows
client.
I looked at multiple cores - and all of them had a similar bt.
Thanks,
Vandana
On 6/19/19, 9:06 AM, "Daniel Gryniewicz" <dang(a)redhat.com> wrote:
On 6/15/19 12:15 PM, Rungta, Vandana wrote:
Updated to the final version of
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/457970
and also included
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/458124/1
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/458125/1
Following crash:
mdcache_lru_unref_chunk on line 3310 in mdcache_helper.c is called after releasing the
content lock on the directory. Does it need to be called with the write lock held (based
on the comment for _mdcache_lru_unref_chunk)? What about when has_write is false?
Yes and no. That unref shouldn't ever free the chunk, so it doesn't
really need the write lock (although for correctness it should be up
above the unlock).
The real problem here is that that's somehow the last ref on this chunk,
which it shouldn't be.
Can you characterize the workload? We may need to add some debugging
and get logs from your run, assuming I can't reproduce it.
Daniel
(gdb) bt
#0 0x00007f8777cd3277 in raise () from /lib64/libc.so.6
#1 0x00007f8777cd4968 in abort () from /lib64/libc.so.6
#2 0x00007f8777d15d97 in __libc_message () from /lib64/libc.so.6
#3 0x00007f8777d1e4f9 in _int_free () from /lib64/libc.so.6
#4 0x0000000000543eac in gsh_free (p=0x3ffd1a80) at
/src/src/include/abstract_mem.h:246
#5 0x0000000000544b1c in mdcache_avl_remove (parent=0x52c27e0, dirent=0x3ffd1a80)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_avl.c:240
#6 0x0000000000539f93 in mdcache_clean_dirent_chunk (chunk=0x3ac82990)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:538
#7 0x000000000052aecd in lru_clean_chunk (chunk=0x3ac82990) at
/src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2033
#8 0x000000000052b060 in _mdcache_lru_unref_chunk (chunk=0x3ac82990, func=0x598a00
<__func__.23678> "mdcache_readdir_chunked",
line=3310) at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2062
#9 0x0000000000543a21 in mdcache_readdir_chunked (directory=0x52c27e0, whence=0,
dir_state=0x7f8771a50af0,
cb=0x4323ed <populate_dirent>, attrmask=0, eod_met=0x7f8771a50feb)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:3310
#10 0x000000000053054c in mdcache_readdir (dir_hdl=0x52c2818, whence=0x7f8771a50ad0,
dir_state=0x7f8771a50af0,
cb=0x4323ed <populate_dirent>, attrmask=0, eod_met=0x7f8771a50feb)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:559
On 6/14/19, 11:23 AM, "Daniel Gryniewicz" <dang(a)redhat.com> wrote:
Can you update to a newer version of
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/457970 I've made
several changes to it.
Daniel
On 6/14/19 12:16 AM, Rungta, Vandana wrote:
> Nfs-ganesha V2.7.4 with the following MDCACHE patches applied:
>
>
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/457970
>
>
https://github.com/nfs-ganesha/nfs-ganesha/commit/c98ad1e238f5db9db5ab8db...
>
>
https://github.com/nfs-ganesha/nfs-ganesha/commit/136df4f262c3f9bc29df456...
>
> Crash when freeing a dirent.
>
> (gdb) bt
>
> #00x00007fe0a2084277 in raise () from /lib64/libc.so.6
>
> #10x00007fe0a2085968 in abort () from /lib64/libc.so.6
>
> #20x00007fe0a20c6d97 in __libc_message () from /lib64/libc.so.6
>
> #30x00007fe0a20cf4f9 in _int_free () from /lib64/libc.so.6
>
> #40x0000000000543e9c in gsh_free (p=0x1b72140) at
> /src/src/include/abstract_mem.h:246
>
> #50x0000000000544b0c in mdcache_avl_remove (parent=0x1b17000,
> dirent=0x1b72140)
>
> at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_avl.c:240
>
> #60x0000000000545e4d in mdcache_avl_clean_trees (parent=0x1b17000)
>
> at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_avl.c:614
>
> #70x000000000053a11b in mdcache_dirent_invalidate_all (entry=0x1b17000)
>
> at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:606
>
> #80x0000000000541b15 in mdcache_readdir_chunked (directory=0x1b17000,
> whence=5012127, dir_state=0x7fe097debaf0,
>
> cb=0x4323ed <populate_dirent>, attrmask=0, eod_met=0x7fe097debfeb)
>
> at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:2864
>
> #90x000000000053054c in mdcache_readdir (dir_hdl=0x1b17038,
> whence=0x7fe097debad0, dir_state=0x7fe097debaf0,
>
> cb=0x4323ed <populate_dirent>, attrmask=0, eod_met=0x7fe097debfeb)
>
> at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:559
>
> #10 0x0000000000432d14 in fsal_readdir (directory=0x1b17038,
> cookie=5012127, nbfound=0x7fe097debfec,
>
> eod_met=0x7fe097debfeb, attrmask=0, cb=0x491d35
<nfs3_readdir_callback>,
> opaque=0x7fe097debfa0)
>
> at /src/src/FSAL/fsal_helper.c:1164
>
> #11 0x0000000000491b1d in nfs3_readdir (arg=0x1e12508, req=0x1e11e00,
> res=0x1be41e0)
>
> at /src/src/Protocols/NFS/nfs3_readdir.c:289
>
> #12 0x0000000000457c16 in nfs_rpc_process_request (reqdata=0x1e11e00) at
> /src/src/MainNFSD/nfs_worker_thread.c:1328
>
> #13 0x00000000004583d5 in nfs_rpc_valid_NFS (req=0x1e11e00) at
> /src/src/MainNFSD/nfs_worker_thread.c:1548
>
> #14 0x00007fe0a3821034 in svc_vc_decode (req=0x1e11e00) at
> /src/src/libntirpc/src/svc_vc.c:829
>
> #15 0x000000000044adc5 in nfs_rpc_decode_request (xprt=0x1b040e0,
> xdrs=0x1be4000)
>
> at /src/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
>
> #16 0x00007fe0a3820f45 in svc_vc_recv (xprt=0x1b040e0) at
> /src/src/libntirpc/src/svc_vc.c:802
>
> #17 0x00007fe0a381d689 in svc_rqst_xprt_task (wpe=0x1b042f8) at
> /src/src/libntirpc/src/svc_rqst.c:769
>
> #18 0x00007fe0a381dae6 in svc_rqst_epoll_events (sr_rec=0x1ae45f0,
> n_events=1)
>
> at /src/src/libntirpc/src/svc_rqst.c:941
>
> #19 0x00007fe0a381dd7b in svc_rqst_epoll_loop (sr_rec=0x1ae45f0) at
> /src/src/libntirpc/src/svc_rqst.c:1014
>
> ---Type <return> to continue, or q <return> to quit---q
>
> Quit
>
> (gdb) select-frame 5
>
> (gdb) info args
>
> parent = 0x1b17000
>
> dirent = 0x1b72140
>
> (gdb) info locals
>
> chunk = 0x1b4b510
>
> __func__ = "mdcache_avl_remove"
>
> (gdb) print *chunk
>
> $1 = {chunks = {next = 0x1b172a0, prev = 0x1b172a0}, dirents = {next =
> 0x1b29930, prev = 0x1bc2840},
>
> parent = 0x1b17000, chunk_lru = {q = {next = 0x0, prev = 0x0}, qid =
> LRU_ENTRY_L1, refcnt = 0, flags = 0,
>
> lane = 16, cf = 0}, reload_ck = 5012127, next_ck = 0, num_entries = 297}
>
> (gdb) print *dirent
>
> $2 = {chunk_list = {next = 0x0, prev = 0x0}, chunk = 0x0, node_name =
> {left = 0x0, right = 0x0, parent = 28852730},
>
> node_ck = {left = 0x1b78ff0, right = 0x1b68970, parent = 28718322},
> node_sorted = {left = 0x0, right = 0x0,
>
> parent = 0}, ck = 5019809, eod = false, namehash = 6207075547366218433,
> ckey = {hk = 15362064496009940491,
>
> fsal = 0x7fe09f949080 <FOO>, kv = {addr = 0x0, len = 0}}, flags = 0,
> entry = 0x0, name = 0x1b721f0 "random.37",
>
> name_buffer = 0x1b721f0 "random.37"}
>
> (gdb)
>
> I am happy to provide any additional debug info you want from the core.
>
> Thanks,
>
> Vandana
>
>
> _______________________________________________
> Devel mailing list -- devel(a)lists.nfs-ganesha.org
> To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org
>
_______________________________________________
Devel mailing list -- devel(a)lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org
_______________________________________________
Devel mailing list -- devel(a)lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org