On Mon, Jun 27, 2022 at 9:58 PM Pradeep <pradeepthomas@gmail.com> wrote:


On Mon, Jun 27, 2022 at 5:27 AM Daniel Gryniewicz <dang@redhat.com> wrote:


On 6/24/22 17:13, Pradeep wrote:
>
>
> On Fri, Jun 24, 2022 at 6:44 AM Daniel Gryniewicz <dang@redhat.com> wrote:
>
>     The problem with trying a fixed number of entries is that it doesn't
>     help, unless that number is large enough to be essentially unlimited.
>     The design of the reap algorithm is that it's O(1) in the number of
>     entries (O(n) in the number of queues), so that we can invoke it in the
>     fast path.  If we search a small number of entries on each queue, that
>
>
> You are right. We would like the reap to be O(1) in the data path. We
> also try to reap from the garbage collector (lru_run); there, we could
> potentially scan more entries from the LRU end, though this may make
> the code more complex and harder to read.
>
>
>     doesn't ensure any more than searching 1 that we find something,
>     especially with large readdir workloads.  I think Frank is right, we
>     need to do something else.
>
>     The LRU currently serves two purposes:
>
>     1. A list of entries in LRU order, so that we can reap and reuse them.
>     2. A divided list that facilitates closing open global FDs
>
> Does (1) mean we will keep the LRU order even if the entry is ref'd by
> readdir, which would mean readdir could temporarily promote the entry?

Unsure.  We changed the readdir case specifically because it was
clogging up the MRU end of the LRU, and making reaping hard.  I think
I'd start with leaving that aspect the way it is for now.

When an entry from L2 is promoted, it goes to the LRU end of L1, so you are right
that we can't reap from L1. That may be OK as long as we are not over the limit; once we are
over the limit, reclaimable entries get moved by lru_run to the MRU end of L2.
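
To make sure we mean the same movement, here is a toy sketch of the two queues and those two moves (simplified types and helpers of my own, not the actual MDCache structures):

/* Toy model of one lane's queues -- not the real MDCache types. */
#include <stddef.h>

struct entry {
	struct entry *next;	/* toward the MRU end */
	struct entry *prev;	/* toward the LRU end */
	int refcnt;		/* 1 == only the queue holds a reference */
};

struct queue {
	struct entry *lru;	/* reaping looks only here */
	struct entry *mru;
	size_t size;
};

/* Insert at the LRU end: where a promotion out of L2 lands in L1. */
static void insert_lru_end(struct queue *q, struct entry *e)
{
	e->prev = NULL;
	e->next = q->lru;
	if (q->lru)
		q->lru->prev = e;
	else
		q->mru = e;
	q->lru = e;
	q->size++;
}

/* Insert at the MRU end: where an lru_run-style demotion puts a
 * reclaimable L1 entry when it moves it to L2. */
static void insert_mru_end(struct queue *q, struct entry *e)
{
	e->next = NULL;
	e->prev = q->mru;
	if (q->mru)
		q->mru->next = e;
	else
		q->lru = e;
	q->mru = e;
	q->size++;
}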

The problem is with the entries at the LRU end of L2. If those become active because of readdir,
we keep them there, and that prevents any reaping from happening. If the workload is a mix of
readdir + stat, you can end up with a huge number of entries in MDCache. Once the workload stops,
these entries can be reclaimed (though in a multi-user environment, workloads run pretty much 24x7).
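
To spell out why a single active entry stalls everything, here is a simplified sketch of the reap check, continuing the toy types above (hypothetical code, not the actual lru_reap_impl):

/* O(1) reap attempt on one queue: only the entry at the LRU end is
 * examined.  If it is still referenced (e.g. pinned by an in-flight
 * readdir), the whole lane is skipped, even if every entry behind it
 * is reclaimable. */
static struct entry *reap_queue(struct queue *q)
{
	struct entry *e = q->lru;

	if (e == NULL || e->refcnt > 1)
		return NULL;		/* nothing reaped from this lane */

	q->lru = e->next;
	if (q->lru)
		q->lru->prev = NULL;
	else
		q->mru = NULL;
	q->size--;
	return e;			/* caller recycles this entry */
}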

So far, we have discussed and rejected these:
1. Have lru_run() look beyond the first entry at the LRU end and free entries until the count drops below hiwat (rough sketch below).
2. Have the readdir path move entries to the MRU end of the queue so that entries at the LRU end can be reclaimed.
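
For concreteness, (1) would have looked roughly like this, again with the toy types above (a hypothetical sketch, not the actual lru_run code):

/* Garbage-collection pass over one queue: scan a bounded number of
 * entries from the LRU end, skipping active ones, and free reclaimable
 * entries until the size drops below the high-water mark. */
static size_t gc_queue(struct queue *q, size_t hiwat, int scan_budget)
{
	struct entry *e = q->lru;
	size_t freed = 0;

	while (e != NULL && q->size > hiwat && scan_budget-- > 0) {
		struct entry *next = e->next;

		if (e->refcnt == 1) {
			/* unlink as in reap_queue() and recycle */
			if (e->prev)
				e->prev->next = e->next;
			else
				q->lru = e->next;
			if (e->next)
				e->next->prev = e->prev;
			else
				q->mru = e->prev;
			q->size--;
			freed++;
		}
		e = next;
	}
	return freed;
}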

Is there anything else we can do to keep the least active entries at the LRU end?

Thanks,
Pradeep


Daniel

>
> Thanks,
> Pradeep
>
>     It seems clear to me that 1 is more important than 2, especially since
>     NFSv4 will rarely use global FDs, and some of our most important FSALs
>     don't use system FDs at all.
>
>     What do we think is a good design for handling these two cases?
>
>     Daniel
>
>     On 6/23/22 16:44, Frank Filz wrote:
>      > Hmm, what's the path to taking a ref without moving to MRU of L2
>     or LRU of L1?
>      >
>      > There are a number of issues coming up with sticky entries blowing
>      > up the cache size that really need to be resolved.
>      >
>      > I wonder if we should remove any entry with refcount >1 from the
>     LRU, and note where we should place it when refcount is reduced to
>     1. That would take out the pinned entries as well as temporarily in
>     use entries. The trick would be the refcount bump for the actual LRU
>     processing.
>      >
>      > Frank
>      >
>      >> -----Original Message-----
>      >> From: Pradeep Thomas [mailto:pradeepthomas@gmail.com]
>      >> Sent: Thursday, June 23, 2022 1:36 PM
>      >> To: devel@lists.nfs-ganesha.org
>      >> Subject: [NFS-Ganesha-Devel] Unclaimable MDCache entries at the LRU end of L2 queue.
>      >>
>      >> Hello,
>      >>
>      >> I'm hitting a scenario where the entry at the LRU end of the L2 queue
>      >> becomes active, but we don't move it to L1 - likely because the entry
>      >> becomes active in the context of a readdir. The cache keeps growing to
>      >> the point where the kernel will invoke the OOM killer to terminate the
>      >> ganesha process.
>      >>
>      >> When we reap entries (lru_reap_impl), could we look beyond the LRU end -
>      >> perhaps try a fixed number of entries? Another option is to also garbage
>      >> collect the L2 queue and free claimable entries beyond the LRU end of the
>      >> queue (through mdcache_lru_release_entries()). Any other thoughts?
>      >> In the instance below, MDCache is supposed to be capped at 100K entries,
>      >> but it grows to > 5 million entries (~17*310K).
>      >>
>      >> sudo gdb -q -p $(pidof ganesha.nfsd) -batch -ex 'p LRU[0].L1' -ex 'p LRU[0].L2' -ex 'p LRU[1].L1' -ex 'p LRU[1].L2' -ex 'p LRU[2].L1' -ex 'p LRU[2].L2' -ex 'p LRU[3].L1' -ex 'p LRU[3].L2' -ex 'p LRU[4].L1' -ex 'p LRU[4].L2' -ex 'p LRU[5].L1' -ex 'p LRU[5].L2' -ex 'p LRU[6].L1' -ex 'p LRU[6].L2'
>      >>
>      >> $1 = {q = {next = 0x7fe16a6adc30, prev = 0x7fe066775d30}, id = LRU_ENTRY_L1, size = 37}
>      >> $2 = {q = {next = 0x7fe0cd6d1130, prev = 0x7fdd595e2030}, id = LRU_ENTRY_L2, size = 310609}
>      >> $3 = {q = {next = 0x7fe222cc7930, prev = 0x7fe0e8afaf30}, id = LRU_ENTRY_L1, size = 37}
>      >> $4 = {q = {next = 0x7fdfa2022d30, prev = 0x7fe01c386b30}, id = LRU_ENTRY_L2, size = 310459}
>      >> $5 = {q = {next = 0x7fdfdd8acb30, prev = 0x7fe233849b30}, id = LRU_ENTRY_L1, size = 31}
>      >> $6 = {q = {next = 0x7fdf014e7e30, prev = 0x7fdd90fd7430}, id = LRU_ENTRY_L2, size = 310297}
>      >> $7 = {q = {next = 0x7fde79a4f030, prev = 0x7fe233a4aa30}, id = LRU_ENTRY_L1, size = 32}
>      >> $8 = {q = {next = 0x7fe061388430, prev = 0x7fdd24b5cf30}, id = LRU_ENTRY_L2, size = 310659}
>      >> $9 = {q = {next = 0x7fe1e96ce430, prev = 0x7fe0b3b4b130}, id = LRU_ENTRY_L1, size = 34}
>      >> $10 = {q = {next = 0x7fe00d84ff30, prev = 0x7fdd685b1530}, id = LRU_ENTRY_L2, size = 310635}
>      >> $11 = {q = {next = 0x7fdf9df4fb30, prev = 0x7fe2414aaa30}, id = LRU_ENTRY_L1, size = 33}
>      >> $12 = {q = {next = 0x7fe165e82d30, prev = 0x7fdf1d2b8a30}, id = LRU_ENTRY_L2, size = 310566}
>      >> $13 = {q = {next = 0x7fe159e55a30, prev = 0x7fde3f973d30}, id = LRU_ENTRY_L1, size = 41}
>      >> $14 = {q = {next = 0x7fdf4fbb9030, prev = 0x7fdea8ca0730}, id = LRU_ENTRY_L2, size = 310460}
>      >>
>      >> The first entry has a refcount of 2, but the next entries are actually claimable.
>      >>
>      >> sudo gdb -q -p $(pidof ganesha.nfsd) -batch -ex 'p *(mdcache_lru_t *)LRU[0].L2.q.next'
>      >> $1 = {q = {next = 0x7fe0fbff0c30, prev = 0x7fe250de2960 <LRU+32>}, qid = LRU_ENTRY_L2, refcnt = 2, flags = 0, lane = 0, cf = 0}
>      >>
>      >> sudo gdb -q -p $(pidof ganesha.nfsd) -batch -ex 'p *(mdcache_lru_t *)0x7fe0fbff0c30'
>      >> $1 = {q = {next = 0x7fe0c2c5a130, prev = 0x7fe0cd6d1130}, qid = LRU_ENTRY_L2, refcnt = 1, flags = 0, lane = 0, cf = 0}
>      >>
>      >> sudo gdb -q -p $(pidof ganesha.nfsd) -batch -ex 'p *(mdcache_lru_t *)0x7fe0c2c5a130'
>      >> $1 = {q = {next = 0x7fe06dfeac30, prev = 0x7fe142936430}, qid = LRU_ENTRY_L2, refcnt = 1, flags = 0, lane = 0, cf = 0}
>      >> _______________________________________________
>      >> Devel mailing list -- devel@lists.nfs-ganesha.org
>      >> To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org
>      >
>