Sounds plausible, and the qlane lock can be taken with the latch held,
so it should be okay...
Daniel
On 11/15/19 3:46 PM, Rungta, Vandana wrote:
Daniel,
Consider the following race condition.
Thread 41: Currently in mdcache_new_entry after adding the new entry to the hash and
before the lru_insert.
rc = cih_set_latched(nentry, &latch,
op_ctx->fsal_export->fsal, &fh_desc,
CIH_SET_UNLOCK | CIH_SET_HASHED);
.....
......
mdcache_lru_insert(nentry, reason);
Meanwhile Thread 1:
Finds the entry in the hash and is processing it while lruq->qid is still LRU_ENTRY_NONE.
Suggest holding the latch in mdcache_new_entry until after the mdcache_lru_insert.
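Something like this, untested and from memory (I'm assuming the latch can
simply be released with cih_hash_release() once the insert is done, and
that dropping CIH_SET_UNLOCK has no other side effects):
rc = cih_set_latched(nentry, &latch,
                     op_ctx->fsal_export->fsal, &fh_desc,
                     CIH_SET_HASHED);   /* keep the latch held */
.....
mdcache_lru_insert(nentry, reason);     /* LRU queue id is now valid */
cih_hash_release(&latch);   /* only now let other threads find the entry */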
Thanks,
Vandana
******
(gdb) thread 41
[Switching to thread 41 (Thread 0x7f60c25bf700 (LWP 7286))]
#0 mdcache_new_entry (export=0x1c7a070, sub_handle=0x1d10900, attrs_in=0x7f60c25bd030,
attrs_out=0x0,
new_directory=false, entry=0x7f60c25bd1c8, state=0x0, reason=MDC_REASON_DEFAULT)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:847
847 /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c: No such file or directory.
(gdb) print nentry
$1 = (mdcache_entry_t *) 0x1e2da00
******
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f60be49c700 (LWP 23957))]
#0 0x00000000005280e9 in mdcache_lru_cleanup_push (entry=0x1e2da00)
at /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:969
969 /src/src/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c: No such file or directory.
(gdb) print entry
$2 = (mdcache_entry_t *) 0x1e2da00
****
On 11/13/19, 2:52 PM, "Rungta, Vandana" <vrungta(a)amazon.com> wrote:
Daniel,
I don't have any debug logs, and unfortunately I have not been able to reproduce
it. I still have the core, if there is anything else that might be useful to dump. The
total number of files on the share is less than 20, which makes it even more unusual that
the entry was reaped. It happened while copying a 10GB file (reading and writing the file
from the share). I only had 2 such copies in progress. The FSAL was returning
ERR_FSAL_DELAY as I was overloading the subsystem that the share lives on.
(gdb) print lru_state
$2 = {entries_hiwat = 500000, entries_used = 11, chunks_hiwat = 100000,
  chunks_used = 2, fds_system_imposed = 400000, fds_hard_limit = 396000,
  fds_hiwat = 360000, fds_lowat = 200000, futility = 0, per_lane_work = 50,
  biggest_window = 160000, prev_fd_count = 1, prev_time = 1573511968, fd_state = 0}
Would this fix have any bearing here? (I'm running 2.7.6 so I don't have this fix.)
https://github.com/nfs-ganesha/nfs-ganesha/commit/2f1f87143458d7564588d8f...
Thanks,
Vandana
On 11/13/19, 11:48 AM, "Daniel Gryniewicz" <dgryniew(a)redhat.com> wrote:
It looks like the entry was somehow reaped (the only way we ever set
LRU_ENTRY_NONE) while it was in use. This should not be possible, as
mdc_read_cb() takes a ref around this use. And, in fact, you can see
that the refcnt is 3, so it shouldn't be reaped. Nothing else should be
cleaning out the LRU fields. The rest of the fields in the entry look
fine, so it's probably a valid entry (and not, say, a use-after-free).
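For reference, the reaper is only supposed to claim an entry whose
refcount says nobody else is using it; the logic is roughly this
(paraphrased pseudo-C from memory, not the exact source):
/* intended invariant in the reap path */
refcnt = atomic_inc_int32_t(&entry->lru.refcnt);
if (refcnt == LRU_SENTINEL_REFCOUNT + 1) {
        /* unreferenced: safe to recycle, qid becomes LRU_ENTRY_NONE */
} else {
        /* in use (your core shows refcnt == 3), so back off */
        atomic_dec_int32_t(&entry->lru.refcnt);
}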
Do you have logs from this run? CACHE_INODE on FULL_DEBUG would be very
helpful.
Daniel