I raised an issue here:
https://github.com/nfs-ganesha/nfs-ganesha/issues/963
We are using 4.0.2, but I went through the latest version/code; this should be
reproducible even with the latest version. The code I've used in the issue is
from the latest version.
________________________________
From: Frank Filz <ffilzlnx(a)mindspring.com>
Sent: Saturday, June 24, 2023 8:57 PM
To: Deepak Arumugam Sankara Subramanian <deepakarumugam.s(a)nutanix.com>;
devel(a)lists.nfs-ganesha.org <devel(a)lists.nfs-ganesha.org>
Subject: RE: [NFS-Ganesha-Devel] MDCACHE LRU reaper thread issues
What Ganesha version are you using?
Could you open a github issue for this?
Thanks
Frank
From: Deepak Arumugam Sankara Subramanian [mailto:deepakarumugam.s@nutanix.com]
Sent: Saturday, June 24, 2023 1:25 PM
To: devel(a)lists.nfs-ganesha.org; Frank Filz <ffilzlnx(a)mindspring.com>
Subject: [NFS-Ganesha-Devel] MDCACHE LRU reaper thread issues
ISSUE
We recently hit some issues where the number of Ganesha mdcache entries grew to 6-9
million while the mdcache high-water mark was set to 450,000.
ROOT CAUSE
Upon further debugging, we found that readdir was not playing well with garbage
collection. The major issue comes from the fact that we keep a temporary pointer from
the dirent to the inode mdcache entry.
Today, when readdir fills up a chunk, it takes a temporary reference from the dirent to
the inode cache entry. This happens inside mdc_readdir_chunk_object:

    new_dir_entry->entry = new_entry;
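For reference, a simplified view of the link that does the pinning (condensed for
illustration; the real struct mdcache_dir_entry has more fields than shown here):

    /* Simplified model of the dirent-to-entry link. While entry is
     * non-NULL, the dirent holds a ref that pins the inode cache
     * entry against reaping. */
    struct mdcache_dir_entry_model {
        char *name;              /* name within the directory */
        mdcache_entry_t *entry;  /* temporary pointer + ref, set in
                                  * mdc_readdir_chunk_object */
    };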
Then we clear it out inside mdcache_readdir_chunked:

    if (has_write && dirent->entry) {
        /* If we get here, we have the write lock, have an
         * entry, and took a ref on it above. The dirent also
         * has a ref on the entry. Drop that ref now. This can
         * only be done under the write lock. If we don't have
         * the write lock, then this was not the readdir that
         * took the ref, and another readdir will drop the ref,
         * or it will be dropped when the dirent is cleaned up.
         */
        mdcache_lru_unref(dirent->entry, LRU_FLAG_NONE);
        dirent->entry = NULL;
    }
But this means each in-flight readdir can hold a refcount on up to 128 entries (one
chunk's worth) in the inode cache at a time.
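As a back-of-the-envelope sketch (the chunk size of 128 is from the discussion above;
the concurrency figure is purely hypothetical), the floor of entries pinned by
in-flight readdirs alone is:

    #include <stdio.h>

    #define CHUNK_SIZE 128  /* dirents per readdir chunk, as above */

    int main(void)
    {
        int parallel_readdirs = 100;  /* hypothetical load */

        /* Each in-flight readdir holds an extra ref on up to one
         * chunk's worth of entries; none of these are reapable. */
        printf("pinned entries: %d\n",
               parallel_readdirs * CHUNK_SIZE);
        return 0;
    }

And when the refs are never dropped (the second bug below), this floor stops being
temporary and the cache grows without bound.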
This leads to a scenario where the following happens:
1. We have multiple readdirs running in parallel, each reading a different directory
with 1 million entries. Each readdir ends up holding a refcount of 2 on 128 entries,
some of which end up being the entries at the LRU edge. Because of this, the reaper
thread is not able to find an entry to reap (a simplified model of this follows the
list).
2. There is also another bug where we are not hitting the ref-dropping code path in
mdcache_readdir_chunked. We don't have an RCA on where this bug is yet, but we often
see systems where all the LRU edges of the inode cache queues are filled with entries
at a refcount of 2 even when there is no load on the system, and all of those entries
are referenced by their dirents, which accounts for the extra refcount.
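Here is a minimal sketch of why a pinned LRU edge starves reclamation. This is a
simplified model, not Ganesha's actual lru_reap_impl; the point is only that the
reaper considers the entry at the LRU edge of a lane and cannot reclaim one that holds
anything beyond its sentinel ref:

    /* Simplified model of per-lane reaping (illustrative only). */
    struct model_entry {
        int refcnt;                /* 1 == sentinel ref only */
        struct model_entry *next;  /* toward the MRU end */
    };

    /* Returns the reapable edge entry, or NULL if the edge is pinned. */
    static struct model_entry *try_reap_lane(struct model_entry *lru_edge)
    {
        if (lru_edge == NULL || lru_edge->refcnt > 1)
            return NULL;  /* e.g. refcount of 2 from a dirent ref */
        return lru_edge;
    }

With every lane's edge sitting at a refcount of 2, every attempt comes back empty and
the cache sails past the high-water mark, which matches what we observe.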
POSSIBLE RESOLUTIONS/QUESTIONS
1. Can we remove this temporary pointer from the dirent to the normal mdcache entry?
This optimization is only helpful in the cache-miss code path. I think it
accomplishes two things:
(a) We don't have to do a lookup from dirent to inode entry in the readdir cache-miss
code path. But since we are already in the readdir cache-miss code path, I don't
think the performance penalty would even be visible.
(b) It prevents the inode cache entries from being flushed out while a readdir is in
progress. I wonder if we should just use the LRU machinery to accomplish this, which
brings me to the second question.
2. Can we always move an entry to the MRU of L2, even during scans? Today the
behavior is inconsistent: we insert new entries created during readdir at the MRU of
L2, but we don't adjust existing entries to the MRU. I'm not sure whether this is
intended behaviour or a bug.
Inside mdc_readdir_chunk_object we call mdcache_new_entry with LRU_FLAG_NONE:

    status = mdcache_new_entry(export, sub_handle, attrs_in, false, NULL,
                               false, &new_entry, NULL, LRU_FLAG_NONE);
If it's an inode cache hit, it follows this path:

    status = mdcache_find_keyed_reason(&key, entry, flags);
    ...
    if (likely(*entry)) {
        fsal_status_t status;

        /* Initial Ref on entry */
        mdcache_lru_ref(*entry, flags);

Since flags is not LRU_REQ_INITIAL, the entry does not get adjusted to the MRU of L2
or the LRU of L1.
But if it's an inode cache miss, it follows this path:

    mdcache_lru_insert(nentry, flags);

Since flags is LRU_FLAG_NONE, mdcache_lru_insert inserts the entry at the MRU of L2.
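To make the second question concrete, here is a hedged sketch of what the cache-hit
path could do instead. Both readdir_touch_entry and lru_adjust_to_mru_l2 are
hypothetical names for illustration, not the existing Ganesha API:

    /* Hypothetical helper for the readdir cache-hit path: take the
     * ref as today, but also place the entry where a cache miss
     * would, so scans keep recently-read entries away from the LRU
     * edge without needing a long-lived dirent ref. */
    static void readdir_touch_entry(mdcache_entry_t *entry)
    {
        mdcache_lru_ref(entry, LRU_FLAG_NONE);  /* existing behavior */
        lru_adjust_to_mru_l2(entry);  /* hypothetical: mirror what
                                       * mdcache_lru_insert does on
                                       * a miss */
    }

That would make hits and misses behave the same with respect to LRU placement, and
might remove the need for the dirent ref from question 1 altogether.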
We would really appreciate your comments on the questions/resolutions,
Thanks,
Deepak