Yes, it does. Good catch.
Does the following code path need chunk to be set to NULL before the goto restart? After the restart, chunk could be passed to mdcache_populate_dir_chunk, which uses it as prev_chunk and assumes it holds a ref count on it.
status = mdc_lookup_uncached(directory, dirent->name,
                             &entry, NULL);

if (FSAL_IS_ERROR(status)) {
        PTHREAD_RWLOCK_unlock(&directory->content_lock);
        LogFullDebugAlt(COMPONENT_NFS_READDIR,
                        COMPONENT_CACHE_INODE,
                        "lookup by name failed status=%s",
                        fsal_err_txt(status));
        if (status.major == ERR_FSAL_STALE)
                mdcache_kill_entry(directory);
        mdcache_lru_unref_chunk(chunk);
        return status;
}
if (!test_mde_flags(directory, MDCACHE_TRUST_CONTENT |
                    MDCACHE_TRUST_DIR_CHUNKS)) {
        /* mdc_lookup_uncached() replaced the dirent,
         * because the key for the old and new entries
         * don't match. If the new dirent couldn't be
         * placed, we need to restart readdir.
         */
>>>>>>>> mdcache_lru_unref_chunk(chunk);
>>>>>>>> goto restart;
}
Thanks,
Vandana
On 7/1/19, 7:25 AM, "Daniel Gryniewicz" <dang(a)redhat.com> wrote:
Thanks. The chunk shouldn't be invalidated, because we have a ref on
it, so it should be safe. You can try moving the unref above the
unlock, if you want, but it shouldn't make a difference. More likely,
we have a corner case of bad refcounting.
Daniel
On 7/1/19 9:56 AM, Rungta, Vandana wrote:
> Daniel,
>
> Attached is a diff file with the Mdcache patches I have applied to the V2.7.4. If this does not work, I'll create a branch on github in a new repo.
>
> Patches applied:
>
> https://github.com/nfs-ganesha/nfs-ganesha/commit/136df4f262c3f9bc29df456...
>
> https://github.com/nfs-ganesha/nfs-ganesha/commit/c98ad1e238f5db9db5ab8db...
>
> https://github.com/nfs-ganesha/nfs-ganesha/commit/f0d5b8d4f6dcce4597459c3...
>
> https://github.com/nfs-ganesha/nfs-ganesha/commit/e11cd5b9f8d6ffbb46061e7...
>
> https://github.com/nfs-ganesha/nfs-ganesha/commit/5439f63b8a1e00f6ff56a4d...
>
> https://github.com/nfs-ganesha/nfs-ganesha/commit/e306c01685b74970ada2259...
>
> Here are the code snippets around the lines of code in the core, after mdcache_find_keyed_reason gets an error.
>
> Could this happen because, after releasing the content_lock, another thread acquired the lock and invalidated the chunk before this thread acquired the lock again?
> Potential solution: unref the chunk before releasing the read lock, since it can no longer be trusted after releasing the lock?
>
> if (!has_write) {
>         /* We will have to re-find this dirent after we
>          * re-acquire the lock.
>          */
>         look_ck = dirent->ck;
>
>         PTHREAD_RWLOCK_unlock(&directory->content_lock);
>         PTHREAD_RWLOCK_wrlock(&directory->content_lock);
>         has_write = true;
>
>         /* Dropping the content_lock may have
>          * invalidated some or all of the dirents and/or
>          * chunks in this directory. We need to start
>          * over from this point. look_ck is now correct
>          * if the dirent is still cached, and we haven't
>          * changed next_ck, so it's still correct for
>          * reloading the chunk.
>          */
>         first_pass = true;
> >>>>>>>>>> Line 3133: mdcache_lru_unref_chunk(chunk);
>         chunk = NULL;
>
>         /* Now we need to look for this dirent again.
>          * We haven't updated next_ck for this dirent
>          * yet, so it is the right whence to use for a
>          * repopulation readdir if the chunk is
>          * discarded.
>          */
>         goto again;
>
> Thanks,
> Vandana
>
>
>
> On 7/1/19, 5:40 AM, "Daniel Gryniewicz" <dang(a)redhat.com> wrote:
>
> Hi, Rungta
>
> Because of the patches you've added, none of the line numbers match any code I have right now. Can you put a branch on github with all your patches applied, and point me to it?
>
> Thanks,
> Daniel
>
> On 6/29/19 8:24 PM, Rungta, Vandana wrote:
> > Daniel,
> >
> > Core in mdcache when mdcache_lru_unref_chunk. Ref count is 0.
> >
> > I started these tests 10 days ago with all the mdcache patches, including the one moving the unref_chunk in readdir_chunked to inside the lock. The test continues to readdir over 5 million entries while writing/reading/deleting content. I have 2 test setups – both using a Windows client that hit this core.
> >
> > Let me know if there is additional information you need from the cores.
> >
> > Thanks,
> >
> > Vandana
> >
> > [Thread debugging using libthread_db enabled]
> >
> > Using host libthread_db library "/lib64/libthread_db.so.1".
> >
> > Core was generated by `bin/ganesha.nfsd -f etc/ganesha/ganesha.conf -p var/run/ganesha.pid -F'.
> >
> > Program terminated with signal 11, Segmentation fault.
> >
> > #0  0x00007fbd14d39c40 in pthread_mutex_lock () from /lib64/libpthread.so.0
> >
> > Missing separate debuginfos, use: debuginfo-install sgw-nfs-ganesha-2.0.93.0-1.x86_64
> >
> > (gdb) bt
> >
> > #0  0x00007fbd14d39c40 in pthread_mutex_lock () from /lib64/libpthread.so.0
> >
> > #1  0x000000000052af4a in _mdcache_lru_unref_chunk (chunk=0x3836d1a0, func=0x598a00 <__func__.23678> "mdcache_readdir_chunked", line=3133) at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:2058
> >
> > #2  0x0000000000542d84 in mdcache_readdir_chunked (directory=0x664fa280, whence=37935356, dir_state=0x7fbd0a3dfaf0, cb=0x4323ed <populate_dirent>, attrmask=0, eod_met=0x7fbd0a3dffeb) at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:3133
> >
> > #3  0x000000000053054c in mdcache_readdir (dir_hdl=0x664fa2b8, whence=0x7fbd0a3dfad0, dir_state=0x7fbd0a3dfaf0, cb=0x4323ed <populate_dirent>, attrmask=0, eod_met=0x7fbd0a3dffeb) at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:559
> >
> > #4  0x0000000000432d14 in fsal_readdir (directory=0x664fa2b8, cookie=37935356, nbfound=0x7fbd0a3dffec, eod_met=0x7fbd0a3dffeb, attrmask=0, cb=0x491d35 <nfs3_readdir_callback>, opaque=0x7fbd0a3dffa0) at /src/src/FSAL/fsal_helper.c:1164
> >
> > #5  0x0000000000491b1d in nfs3_readdir (arg=0xa8fc5d48, req=0xa8fc5640, res=0xa32fbb30) at /src/src/Protocols/NFS/nfs3_readdir.c:289
> >
> > #6  0x0000000000457c16 in nfs_rpc_process_request (reqdata=0xa8fc5640) at /src/src/MainNFSD/nfs_worker_thread.c:1328
> >
> > #7  0x00000000004583d5 in nfs_rpc_valid_NFS (req=0xa8fc5640) at /src/src/MainNFSD/nfs_worker_thread.c:1548
> >
> > #8  0x00007fbd15d14034 in svc_vc_decode (req=0xa8fc5640) at /src/src/libntirpc/src/svc_vc.c:829
> >
> > #9  0x000000000044adc5 in nfs_rpc_decode_request (xprt=0x7ae54c90, xdrs=0x36434390) at /src/src/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
> >
> > #10 0x00007fbd15d13f45 in svc_vc_recv (xprt=0x7ae54c90) at /src/src/libntirpc/src/svc_vc.c:802
> >
> > #11 0x00007fbd15d10689 in svc_rqst_xprt_task (wpe=0x7ae54ea8) at /src/src/libntirpc/src/svc_rqst.c:769
> >
> > #12 0x00007fbd15d10ae6 in svc_rqst_epoll_events (sr_rec=0x1acd4d0, n_events=1) at /src/src/libntirpc/src/svc_rqst.c:941
> >
> > #13 0x00007fbd15d10d7b in svc_rqst_epoll_loop (sr_rec=0x1acd4d0) at /src/src/libntirpc/src/svc_rqst.c:1014
> >
> > #14 0x00007fbd15d10e2e in svc_rqst_run_task (wpe=0x1acd4d0) at /src/src/libntirpc/src/svc_rqst.c:1050
> >
> > #15 0x00007fbd15d197f6 in work_pool_thread (arg=0x75491e70) at /src/src/libntirpc/src/work_pool.c:181
> >
> > #16 0x00007fbd14d37de5 in start_thread () from /lib64/libpthread.so.0
> >
> > #17 0x00007fbd1463ef1d in clone () from /lib64/libc.so.6
> >
> > (gdb) select-frame 1
> >
> > (gdb) info locals
> >
> > rc = 32701
> >
> > refcnt = 171833376
> >
> > lane = 268435456
> >
> > qlane = 0xe007e24e0
> >
> > __func__ = "_mdcache_lru_unref_chunk"
> >
> > (gdb) info args
> >
> > chunk = 0x3836d1a0
> >
> > func = 0x598a00 <__func__.23678> "mdcache_readdir_chunked"
> >
> > line = 3133
> >
> > (gdb) print *chunk
> >
> > $1 = {chunks = {next = 0xed43cd15, prev = 0xa386010002000000}, dirents = {next = 0x300000003000000, prev = 0x1c00000001000000}, parent = 0x7000000aaaaaaaa, chunk_lru = {q = {next = 0x6e776f6e6b6e75, prev = 0xfefffffffeffffff}, qid = LRU_ENTRY_NONE, refcnt = 0, flags = 0, lane = 268435456, cf = 16777283}, reload_ck = 1099511627778, next_ck = 7236828793636126720, num_entries = 841903471}
> >
> > (gdb) select-frame 2
> >
> > (gdb) info locals
> >
> > status = {major = ERR_FSAL_NOENT, minor = 0}
> >
> > cb_result = DIR_CONTINUE
> >
> > entry = 0x0
> >
> > attrs = {request_mask = 0, valid_mask = 1433550, supported = 1433582, type = REGULAR_FILE, filesize = 51682, fsid = {major = 0, minor = 0}, acl = 0x0, fileid = 37935372, mode = 493, numlinks = 1, owner = 4294967294, group = 4294967294, rawdev = {major = 0, minor = 0}, atime = {tv_sec = 1561811134795000, tv_nsec = 0}, creation = {tv_sec = 0, tv_nsec = 0}, ctime = {tv_sec = 1561811134795000, tv_nsec = 0}, mtime = {tv_sec = 1561795757, tv_nsec = 842000000}, chgtime = {tv_sec = 1561811134795000, tv_nsec = 0}, spaceused = 51682, change = 1561811134795000000, generation = 0, expire_time_attr = 60, fs_locations = 0x0, sec_label = {slai_lfs = {lfs_lfs = 0, lfs_pi = 0}, slai_data = {slai_data_len = 0, slai_data_val = 0x0}}}
> >
> > dirent = 0xa57f3030
> >
> > has_write = true
> >
> > set_first_ck = false
> >
> > next_ck = 37935375
> >
> > look_ck = 37935377
> >
> > chunk = 0x3836d1a0
> >
> > first_pass = true
> >
> > eod = false
> >
> > reload_chunk = false
> >
> > __func__ = "mdcache_readdir_chunked"
> >
> > __PRETTY_FUNCTION__ = "mdcache_readdir_chunked"
> >
> > (gdb) info args
> >
> > directory = 0x664fa280
> >
> > whence = 37935356
> >
> > dir_state = 0x7fbd0a3dfaf0
> >
> > cb = 0x4323ed <populate_dirent>
> >
> > attrmask = 0
> >
> > eod_met = 0x7fbd0a3dffeb
> >
> > (gdb)
> >
>
>
>
_______________________________________________
Devel mailing list -- devel(a)lists.nfs-ganesha.org
To unsubscribe send an email to devel-leave(a)lists.nfs-ganesha.org