Thanks Daniel.
The issue is not reproducible at will.
When I checked the code, I found that state_unlock() can itself call put_ref().
The call path is nlm4_Unlock() -> state_unlock():
        /* If the lock list has become zero; decrement the pin ref count pt
         * placed. Do this here just in case subtract_lock_from_list has made
         * list empty even if it failed.
         */
        if (glist_empty(&obj->state_hdl->file.lock_list))
                obj->obj_ops.put_ref(obj);

Should we check whether put_ref() was already called in state_unlock() and, if so, skip calling put_ref() in nlm4_Unlock()?
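
To make the question concrete, here is a rough standalone sketch of the check I have in mind. It is not Ganesha code: toy_obj, toy_put_ref(), toy_state_unlock() and toy_nlm4_unlock() are made-up stand-ins, and the "dropped" out-parameter is a hypothetical addition. It also assumes the two put_ref() calls release the same reference; if the pin ref and the caller's ref are really separate references, skipping the second put would leak a ref instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object handle; not the real fsal_obj_handle. */
struct toy_obj {
	int refcnt;     /* references currently held */
	int lock_count; /* stand-in for the length of file.lock_list */
};

static void toy_put_ref(struct toy_obj *obj)
{
	assert(obj->refcnt > 0); /* an extra put would trip this */
	obj->refcnt--;
}

/* Stand-in for state_unlock(): removes one lock and, if the lock list is
 * now empty, drops a reference. Whether it did so is reported through
 * *dropped (hypothetical out-parameter).
 */
static void toy_state_unlock(struct toy_obj *obj, bool *dropped)
{
	*dropped = false;

	if (obj->lock_count > 0)
		obj->lock_count--;

	if (obj->lock_count == 0) {
		toy_put_ref(obj);
		*dropped = true;
	}
}

/* Stand-in for nlm4_Unlock(): only drops its reference if the callee has
 * not already dropped one. Returns the remaining refcount.
 */
static int toy_nlm4_unlock(struct toy_obj *obj)
{
	bool dropped;

	toy_state_unlock(obj, &dropped);
	if (!dropped)
		toy_put_ref(obj);

	return obj->refcnt;
}
```

With this shape, exactly one reference is dropped per unlock call, whether or not the lock list became empty.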

On Wed, Jun 27, 2018 at 10:33 AM, Daniel Gryniewicz <dang@redhat.com> wrote:
So, it looks like some codepath has an extra put_ref() in it.  The
handle in question had its refcount go to zero, but still had inavl
set.  Since inavl is tied to the sentinel refcount, this shouldn't
happen.
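
Roughly, the invariant in question can be sketched as below. This is a toy model with made-up names (toy_entry, toy_invariant_ok(), toy_unref_checked()), not the MDCACHE code, and it assumes only what is stated above: while inavl is set, the sentinel reference is still held, so the refcount should be at least 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a cache entry; not the real mdcache_entry_t. */
struct toy_entry {
	int32_t refcnt; /* includes the sentinel ref while inavl is set */
	bool inavl;     /* entry is still reachable through the hash/AVL */
};

/* The invariant: an entry that is still in the AVL holds at least the
 * sentinel reference.
 */
static bool toy_invariant_ok(const struct toy_entry *e)
{
	return !e->inavl || e->refcnt >= 1;
}

/* Drop one reference and report whether the entry is still consistent.
 * A false return means some codepath did an extra put.
 */
static bool toy_unref_checked(struct toy_entry *e)
{
	if (e->refcnt <= 0)
		return false; /* already over-released */

	e->refcnt--;
	return toy_invariant_ok(e);
}
```

A check of this kind on the unref path would flag the extra put at the point where it happens, rather than later during cleanup.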

This isn't an error I remember seeing before, so it's likely to be in
next as well.  Is there a reproducer for this case?  MDCACHE has good
refcount debugging via LTTng, but only if I can reproduce it somehow.

Daniel

On Tue, Jun 26, 2018 at 6:33 AM, Sachin Punadikar
<punadikar.sachin@gmail.com> wrote:
>
> ---------- Forwarded message ----------
> From: Sachin Punadikar <punadikar.sachin@gmail.com>
> Date: Tue, Jun 26, 2018 at 3:57 PM
> Subject: Ganesha 2.5, crash /segfault while executing nlm4_Unlock
> To: nfs-ganesha-devel <nfs-ganesha-devel@lists.sourceforge.net>
>
>
> Hi All,
> Recently a crash was reported by a customer for Ganesha 2.5.
> (gdb) where
> #0  0x00007f475872900b in pthread_rwlock_wrlock () from
> /lib64/libpthread.so.0
> #1  0x000000000041eac9 in fsal_obj_handle_fini (obj=0x7f4378028028) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/commonlib.c:192
> #2  0x000000000053180f in mdcache_lru_clean (entry=0x7f4378027ff0) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:589
> #3  0x0000000000536587 in _mdcache_lru_unref (entry=0x7f4378027ff0, flags=0,
> func=0x5a9380 <__func__.23209> "cih_remove_checked", line=406)
>     at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1921
> #4  0x0000000000543e91 in cih_remove_checked (entry=0x7f4378027ff0) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_hash.h:406
> #5  0x0000000000544b26 in mdc_clean_entry (entry=0x7f4378027ff0) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:235
> #6  0x000000000053181e in mdcache_lru_clean (entry=0x7f4378027ff0) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:592
> #7  0x0000000000536587 in _mdcache_lru_unref (entry=0x7f4378027ff0, flags=0,
> func=0x5a70af <__func__.23112> "mdcache_put", line=190)
>     at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1921
> #8  0x0000000000539666 in mdcache_put (entry=0x7f4378027ff0) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:190
> #9  0x000000000053f062 in mdcache_put_ref (obj_hdl=0x7f4378028028) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1709
> #10 0x000000000049bf0f in nlm4_Unlock (args=0x7f4294165830,
> req=0x7f4294165028, res=0x7f43f001e0e0)
>     at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/Protocols/NLM/nlm_Unlock.c:128
> #11 0x000000000044c719 in nfs_rpc_execute (reqdata=0x7f4294165000) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1290
> #12 0x000000000044cf23 in worker_run (ctx=0x3c200e0) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1562
> #13 0x000000000050a3e7 in fridgethr_start_routine (arg=0x3c200e0) at
> /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/support/fridgethr.c:550
> #14 0x00007f4758725dc5 in start_thread () from /lib64/libpthread.so.0
> #15 0x00007f4757de673d in clone () from /lib64/libc.so.6
>
> A closer look at the backtrace indicates that there was a cyclic flow of
> execution, as below:
> nlm4_Unlock -> mdcache_put_ref -> mdcache_put -> _mdcache_lru_unref ->
> mdcache_lru_clean -> fsal_obj_handle_fini and then mdc_clean_entry ->
> cih_remove_checked ->   (purposely continuing the flow on the line below)
>
> -> _mdcache_lru_unref -> mdcache_lru_clean -> fsal_obj_handle_fini
> (currently crashing here)
>
> Do we see any code issue here? Any hints on how to root-cause this issue?
> Thanks in advance.
>
> --
> with regards,
> Sachin Punadikar
>
>
> _______________________________________________
> Devel mailing list -- devel@lists.nfs-ganesha.org
> To unsubscribe send an email to devel-leave@lists.nfs-ganesha.org
>



--
with regards,
Sachin Punadikar