On Mon, Apr 22, 2019 at 7:18 AM Daniel Gryniewicz <dang(a)redhat.com> wrote:
On 4/20/19 11:39 AM, Pradeep wrote:
>
>
> On Wed, Apr 17, 2019 at 5:50 AM Daniel Gryniewicz <dang(a)redhat.com
> <mailto:dang@redhat.com>> wrote:
>
> On 4/17/19 12:49 AM, Pradeep wrote:
> > On 4/11/19, Frank Filz <ffilzlnx(a)mindspring.com
> <mailto:ffilzlnx@mindspring.com>> wrote:
> >>> From: Daniel Gryniewicz [mailto:dang@redhat.com
> <mailto:dang@redhat.com>]
> >>> Sent: Thursday, April 11, 2019 5:28 AM
> >>>
> >>> Okay, I misunderstood the situation. That fix won't work,
because
> >>> mdcache_lru_unref_chunk() takes the qlane lock, and it also
> takes the
> >>> content
> >>> lock, which must be taken before the qlane lock.
> >>>
> >>> The problem here is that the parent has a pointer to the chunk
> and a ref
> >>> on it,
> >>> and the chunk has a pointer to the parent but no ref on it.
> >>> This means that the parent refcount can go to zero while the
> chunk has a
> >>> pointer. I think we need to force remove all the chunks when
> we recycle
> >>> the
> >>> parent, and null out the parent pointer in the chunk as well.
> We'll have
> >>> to audit
> >>> all the uses of the parent pointer and make sure the call path
> has a ref
> >>> to the
> >>> parent, or that it can handle a null parent pointer.
> >
> > Hi Daniel,
> >
> > By force cleaning, did you have something like this in mind?
> >
> >
>
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/451311/1/src/FSAL/Stack...
> >
>
> I was thinking something more like this:
>
>
https://paste.fedoraproject.org/paste/2CJh-0heaj-nQQeb7Ou5KQ
>
> Your solution may work; but it worries me to add an unbounded loop
into
> the mix, especially since this involves multiple locks.
>
> Note, I haven't fully audited all the callpaths to make sure that
they
> can handle parent being NULL. I believe all the ones that can't
have a
> ref on parent, so it should be okay.
>
>
> Hi Daniel,
>
> There may still be one issue with this. The chunk->parent is accessed
> without any locks. So by the time the local variable ("parent") is used,
> chunk->parent might have been set to NULL by the reuse path. The danger
> is this does : (parent && lock(parent->content_lock)). If parent is
> reused, parent->content_lock is destroyed and reinitialized. Not sure
> what the behavior will be at this point.
>
> I'm wondering if we still need content_lock in
> mdcache_lru_unref_chunk(). Earlier it was needed to release dirents. Now
> that dirents are released by mdcache_dirent_invalidate_all() path before
> calling mdcache_lru_unref_chunk(), the parent is guaranteed to be NULL
> if refcnt on chunk is zero, right?
>
> Another question is do we really need chunk_lru_run_lane() to hold a ref
> on the chunk? It only moves the chunk between L1 to L2 with qlane lock
> held. So no other thread could free the chunk during this operation
> because those threads has to wait for qlane lock. So wondering what
> scenario would the ref on the chunk is needed.
>
> Similar race exists in lru_reap_chunk_impl(). At any point if
> mdcache_lru_unref_chunk() is called without any locks, it has the risk
> of parent going away. For example, say you don't get content_lock with
> pthread_rwlock_trywrlock(), then qlane lock is dropped and
> mdcache_lru_unref_chunk() is called. During this time, another thread
> could reuse the parent. Here also you have both qlane lock and
> content_lock till the chunk is out of LRU queue; so not sure if ref on
> chunk is needed?
>
Do Ashish's changes cover this? Or is more necessary?
Thanks Daniel/Ashish. With this patch, do we need the previous patch (now
that all calls to mdcache_lru_unref_chunk() already has content_lock)?