So, the dec_state_t_ref() is okay, since the ref to the state was taken
in nlm_process_parameters(), and a state has to have a ref to its
object, since it has a pointer to the object.
The last put_ref() is also okay, since nlm_process_parameters() calls
nfs3_FhandleToCache(), which calls create_handle(), which gets a ref on
the object. So these two also seem to be okay.
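
To make that pairing concrete, here is a toy model in C of the references described above. It is only a sketch under stated assumptions: `toy_obj`, `toy_state` and every `toy_*` function are hypothetical stand-ins, not Ganesha's structures or API. It assumes the state holds one ref on its object for as long as it points at it, and that a request takes one ref on the state and one on the object during setup, then drops exactly those two on the way out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the real Ganesha types. */
struct toy_obj {
	int refcnt;
};

struct toy_state {
	int refcnt;
	struct toy_obj *obj;	/* this pointer is backed by a ref on obj */
};

static void toy_obj_get(struct toy_obj *o)
{
	o->refcnt++;
}

static void toy_obj_put(struct toy_obj *o)
{
	assert(o->refcnt > 0);	/* an unbalanced extra put trips here */
	o->refcnt--;
}

struct toy_state *toy_state_new(struct toy_obj *o)
{
	struct toy_state *s = malloc(sizeof(*s));

	if (s == NULL)
		abort();
	s->refcnt = 1;		/* ref owned by whoever created the state */
	s->obj = o;
	toy_obj_get(o);		/* the state holds a ref on its object */
	return s;
}

static void toy_state_get(struct toy_state *s)
{
	s->refcnt++;
}

void toy_state_put(struct toy_state *s)
{
	assert(s->refcnt > 0);
	if (--s->refcnt == 0) {
		toy_obj_put(s->obj);	/* drop the state's ref on the object */
		free(s);
	}
}

/* Model of one request: setup takes two refs, teardown drops the same two.
 * Returns the object's refcount once the request is finished.
 */
int toy_request(struct toy_obj *obj, struct toy_state *st)
{
	toy_state_get(st);	/* taken during parameter processing */
	toy_obj_get(obj);	/* taken when the handle is looked up */

	/* ... the actual unlock work would happen here ... */

	toy_state_put(st);	/* matches the state ref above */
	toy_obj_put(obj);	/* matches the object ref above */
	return obj->refcnt;
}
```

In this model the object's count after `toy_request()` equals its count before the call, so neither of the two final drops is the unbalanced one; the extra unref has to come from some other path.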
Can you run this with LTTng enabled? If so, you can enable just the
mdcache:mdc_lru_ref and mdcache:mdc_lru_unref tracepoints. These should
help you in tracking down where the extra unref is.
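
As a rough illustration of why per-call ref/unref records help: the backtrace shows _mdcache_lru_unref() receiving the caller's function name and line, so a trace of those calls can be replayed to find where unrefs start to outnumber refs. The sketch below is a toy model of that idea only. The `toy_*` names and the in-memory record array are assumptions made for illustration; this is not LTTng and not the MDCACHE tracepoint code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_TRACE_MAX 64

/* One record per ref/unref call, tagged with the caller's location. */
struct toy_trace_rec {
	const char *func;
	int line;
	int delta;		/* +1 for ref, -1 for unref */
	int refcnt_after;	/* running count after this call */
};

struct toy_entry {
	int refcnt;
	int ntrace;
	struct toy_trace_rec trace[TOY_TRACE_MAX];
};

static void toy_record(struct toy_entry *e, int delta,
		       const char *func, int line)
{
	assert(e != NULL);
	e->refcnt += delta;
	if (e->ntrace < TOY_TRACE_MAX) {
		struct toy_trace_rec *r = &e->trace[e->ntrace++];

		r->func = func;
		r->line = line;
		r->delta = delta;
		r->refcnt_after = e->refcnt;
	}
}

/* Macros so that __func__ and __LINE__ name the caller, not the helper. */
#define toy_ref(e)   toy_record((e), +1, __func__, __LINE__)
#define toy_unref(e) toy_record((e), -1, __func__, __LINE__)

/* Index of the first record at which the running count went negative,
 * or -1 if refs and unrefs stayed balanced.
 */
int toy_first_underflow(const struct toy_entry *e)
{
	for (int i = 0; i < e->ntrace; i++) {
		if (e->trace[i].refcnt_after < 0)
			return i;
	}
	return -1;
}

/* Print the recorded history, one call per line. */
void toy_dump(const struct toy_entry *e)
{
	for (int i = 0; i < e->ntrace; i++) {
		const struct toy_trace_rec *r = &e->trace[i];

		printf("%s:%d %+d -> %d\n",
		       r->func, r->line, r->delta, r->refcnt_after);
	}
}
```

With a zero-initialised `struct toy_entry`, calling `toy_ref()` once and `toy_unref()` twice makes `toy_first_underflow()` return 2, and `toy_dump()` shows which caller issued that third call.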
Daniel
On 06/29/2018 07:32 AM, Sachin Punadikar wrote:
Thanks Daniel.
I am wondering which path is leading to the crash.
I found that we may be invoking dec_state_owner_ref() twice while
executing nlm4_Unlock(). I think this may not be the reason for the crash,
but I wanted to bring it to notice.
    if (state != NULL)
            dec_state_t_ref(state);  /* if this calls dec_nlm_state_ref(), then
                                      * internally it calls dec_state_owner_ref()
                                      */
    dec_nsm_client_ref(nsm_client);
    dec_nlm_client_ref(nlm_client);
    dec_state_owner_ref(nlm_owner);  /* this can be an additional invocation */
    obj->obj_ops.put_ref(obj);
- Sachin.
On Thu, Jun 28, 2018 at 9:00 AM, Daniel Gryniewicz <dang(a)redhat.com> wrote:
No, that put_ref is fine. It's a ref for an entire list, and so is
taken when the first entry is put on the list, and released when the
last entry is removed from the list. It should be safe.
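
The "one ref for the whole list" rule can be sketched in C as follows. This is a minimal toy model, assuming the ref is taken on the empty to non-empty transition and dropped on the non-empty to empty transition; `toy_file`, `toy_lock_add()` and `toy_lock_remove()` are hypothetical names, and a simple counter stands in for the real lock list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an object that carries a lock list. */
struct toy_file {
	int refcnt;
	size_t nlocks;	/* stand-in for the length of the lock list */
};

void toy_lock_add(struct toy_file *f)
{
	if (f->nlocks == 0)
		f->refcnt++;	/* first entry: take the list's single ref */
	f->nlocks++;
}

void toy_lock_remove(struct toy_file *f)
{
	assert(f->nlocks > 0);
	f->nlocks--;
	if (f->nlocks == 0) {
		assert(f->refcnt > 0);
		f->refcnt--;	/* last entry gone: release the list's ref */
	}
}
```

However many locks are added and removed, the list contributes at most one ref to the object, so the put_ref() that fires when the list becomes empty balances the ref taken when the list first became non-empty.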
Daniel
On Wed, Jun 27, 2018 at 8:58 AM, Sachin Punadikar
<punadikar.sachin(a)gmail.com> wrote:
> Thanks Daniel.
> The issue is not reproducible at will.
> When I checked the code, there is a chance of executing put_ref() in routine
> state_unlock().
> nlm4_Unlock() -> state_unlock()
>         /* If the lock list has become zero; decrement the pin ref count pt
>          * placed. Do this here just in case subtract_lock_from_list has made
>          * list empty even if it failed.
>          */
>         if (glist_empty(&obj->state_hdl->file.lock_list))
>                 obj->obj_ops.put_ref(obj);
>
> Should we check whether put_ref() was already called in state_unlock() and
> accordingly skip calling put_ref() in nlm4_Unlock()?
>
> On Wed, Jun 27, 2018 at 10:33 AM, Daniel Gryniewicz <dang(a)redhat.com> wrote:
>>
>> So, it looks like some codepath has an extra put_ref() in it. The
>> handle in question had its refcount go to zero, but still had inavl
>> set. Since inavl is tied to the sentinel refcount, this shouldn't
>> happen.
>>
>> This isn't an error I remember seeing before, so it's likely to be in
>> next as well. Is there a reproducer for this case? MDCACHE has good
>> refcount debugging via LTTng, but only if I can reproduce it somehow.
>>
>> Daniel
>>
>> On Tue, Jun 26, 2018 at 6:33 AM, Sachin Punadikar
>> <punadikar.sachin(a)gmail.com> wrote:
>> >
>> > ---------- Forwarded message ----------
>> > From: Sachin Punadikar <punadikar.sachin(a)gmail.com>
>> > Date: Tue, Jun 26, 2018 at 3:57 PM
>> > Subject: Ganesha 2.5, crash /segfault while executing nlm4_Unlock
>> > To: nfs-ganesha-devel <nfs-ganesha-devel(a)lists.sourceforge.net>
>> >
>> >
>> > Hi All,
>> > Recently a crash was reported by a customer for Ganesha 2.5.
>> > (gdb) where
>> > #0 0x00007f475872900b in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
>> > #1 0x000000000041eac9 in fsal_obj_handle_fini (obj=0x7f4378028028)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/commonlib.c:192
>> > #2 0x000000000053180f in mdcache_lru_clean (entry=0x7f4378027ff0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:589
>> > #3 0x0000000000536587 in _mdcache_lru_unref (entry=0x7f4378027ff0, flags=0,
>> >     func=0x5a9380 <__func__.23209> "cih_remove_checked", line=406)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1921
>> > #4 0x0000000000543e91 in cih_remove_checked (entry=0x7f4378027ff0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_hash.h:406
>> > #5 0x0000000000544b26 in mdc_clean_entry (entry=0x7f4378027ff0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_helpers.c:235
>> > #6 0x000000000053181e in mdcache_lru_clean (entry=0x7f4378027ff0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:592
>> > #7 0x0000000000536587 in _mdcache_lru_unref (entry=0x7f4378027ff0, flags=0,
>> >     func=0x5a70af <__func__.23112> "mdcache_put", line=190)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.c:1921
>> > #8 0x0000000000539666 in mdcache_put (entry=0x7f4378027ff0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_lru.h:190
>> > #9 0x000000000053f062 in mdcache_put_ref (obj_hdl=0x7f4378028028)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1709
>> > #10 0x000000000049bf0f in nlm4_Unlock (args=0x7f4294165830,
>> >     req=0x7f4294165028, res=0x7f43f001e0e0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/Protocols/NLM/nlm_Unlock.c:128
>> > #11 0x000000000044c719 in nfs_rpc_execute (reqdata=0x7f4294165000)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1290
>> > #12 0x000000000044cf23 in worker_run (ctx=0x3c200e0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1562
>> > #13 0x000000000050a3e7 in fridgethr_start_routine (arg=0x3c200e0)
>> >     at /usr/src/debug/nfs-ganesha-2.5.3-ibm013.00-0.1.1-Source/support/fridgethr.c:550
>> > #14 0x00007f4758725dc5 in start_thread () from /lib64/libpthread.so.0
>> >
>> > A closer look at the backtrace indicates that there was a cyclic flow of
>> > execution, as below:
>> > nlm4_Unlock -> mdcache_put_ref -> mdcache_put -> _mdcache_lru_unref ->
>> > mdcache_lru_clean -> fsal_obj_handle_fini and then mdc_clean_entry ->
>> > cih_remove_checked -> (purposely copying the next flow onto the line below)
>> >
>> > -> _mdcache_lru_unref -> mdcache_lru_clean -> fsal_obj_handle_fini
>> > (currently crashing here)
>> >
>> > Do we see any code issue here? Any hints on how to RCA this issue?
>> > Thanks in advance.
>> >
>> > --
>> > with regards,
>> > Sachin Punadikar
>> >
>> >
>> >
>> > --
>> > with regards,
>> > Sachin Punadikar
>> >
>> >
>
>
>
>
> --
> with regards,
> Sachin Punadikar
--
with regards,
Sachin Punadikar