ISSUE
We recently hit some issues where the number of ganesha mdcache entries grew to 6-9 Million when the mdcache hiwater mark was at 450,000.

ROOT CAUSE

Upon further debugging we found that readdir was not playing well with garbage collection. The major issue comes from the fact that we have a temporary pointer from the dirent to the inode mdcache entry.

Today when readdir fills up a chunk it puts a temporary reference from dirent to the inode cache entry. This happens inside mdc_readdir_chunk_object

new_dir_entry->entry = new_entry;

Then we clear it out inside mdcache_readdir_chunked,

if (has_write && dirent->entry) {

/* If we get here, we have the write lock, have an

* entry, and took a ref on it above. The dirent also

* has a ref on the entry. Drop that ref now. This can

* only be done under the write lock. If we don't have

* the write lock, then this was not the readdir that

* took the ref, and another readdir will drop the ref,

* or it will be dropped when the dirent is cleaned up.

* */

mdcache_lru_unref(dirent->entry, LRU_FLAG_NONE);

dirent->entry = NULL;

}
But this means each readdir can hold a refcount on 128 entries in the inode cache at a time.

This leads to a scenario where the following happens

We have multiple readdirs going in parallel each reading different directories with 1 Million entries. Each readdir ends up holding a refcount of 2 on 128 entries - some of which end up being the entries at the LRU edge. Because of this the reaper thread is not able to find an entry to reap
There is also another bug where we are not hitting the clearing out code path in mdcache_readdir_chunked. We don't have an RCA on where this bug is - but we often see systems where all the LRU edges of the inode cache queues are filled with a refcount of 2 even when there is no load on the system. And all the entries are being referenced by their dirents which is the cause of the refcount.

POSSIBLE RESOLUTIONS/QUESTIONS

Can we remove this temporary pointer from dirent to the normal mdcache entry? This optimization is only helpful in the cache miss code path. I think this optimization accomplishes two things,
(a) We don't have to do a lookup from dirent to inode entry in the readdir cache miss code path. But since we are already in the readdircache miss code path I don't think the performance penalty will even be visible.
(b) It prevents the inode cache entries from being flushed out when a readdir is in progress. I wonder if we should just use the LRU way to accomplish this - which brings me to the second question
Can we always move an entry to the MRU of L2 even during scans? Today this is a bit of an inconsistent behavior - we insert new entries created during readdir at the MRU of L2 but we don't adjust old entries to the MRU. Not sure if this is an intended behaviour or bug

Inside mdc_readdir_chunk_object we call mdcache_new_entry with LRU_FLAG_NONE

status = mdcache_new_entry(export, sub_handle, attrs_in, false, NULL,
false, &new_entry, NULL, LRU_FLAG_NONE);

if its a inode cache hit it follows this path

status = mdcache_find_keyed_reason(&key, entry, flags);

s.. if (likely(*entry)) {
fsal_status_t status;

/* Initial Ref on entry */
mdcache_lru_ref(*entry, flags);

since flags is not LRU_REQ_INITIAL this entry doesn't get adjusted to the MRU of L2 or LRU of L1

but if its a inode cache miss it follows this path

s.. mdcache_lru_insert(nentry, flags);

if flags has LRU_FLAG_NONE mdcache_lru_insert inserts the entry into the MRU of L2

We would really appreciate your comments on the questions/resolutions,

Thanks,

Deepak