I am occasionally getting a crash when running I/O in parallel with the scenario mentioned above.

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007fc1fd6427cb in opr_rbtree_insert (head=0x7fc1615edf08, node=0x7fc1f2010c30)
    at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/rbtree.c:271
#2  0x00007fc1fd63cde4 in clnt_req_setup (cc=0x7fc1f2010c00, timeout=...)
    at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/clnt_generic.c:538
#3  0x00000000004a2bce in nsm_unmonitor (host=0x7fc1615e3080) at /usr/src/debug/nfs-ganesha-2.7.1/Protocols/NLM/nsm.c:219
#4  0x00000000004e6a7f in dec_nsm_client_ref (client=0x7fc1615e3080) at /usr/src/debug/nfs-ganesha-2.7.1/SAL/nlm_owner.c:857
#5  0x00000000004e73dd in free_nlm_client (client=0x7fc1615dfa40) at /usr/src/debug/nfs-ganesha-2.7.1/SAL/nlm_owner.c:1039
#6  0x00000000004e7753 in dec_nlm_client_ref (client=0x7fc1615dfa40) at /usr/src/debug/nfs-ganesha-2.7.1/SAL/nlm_owner.c:1130
#7  0x00000000004e7f36 in free_nlm_owner (owner=0x7fc161616200) at /usr/src/debug/nfs-ganesha-2.7.1/SAL/nlm_owner.c:1314
#8  0x00000000004c87bb in free_state_owner (owner=0x7fc161616200) at /usr/src/debug/nfs-ganesha-2.7.1/SAL/state_misc.c:818
#9  0x00000000004c8d56 in dec_state_owner_ref (owner=0x7fc161616200) at /usr/src/debug/nfs-ganesha-2.7.1/SAL/state_misc.c:968
#10 0x000000000049d906 in nlm4_Unlock (args=0x7fc1f2022f08, req=0x7fc1f2022800, res=0x7fc160e522c0)
    at /usr/src/debug/nfs-ganesha-2.7.1/Protocols/NLM/nlm_Unlock.c:119
#11 0x000000000045cacb in nfs_rpc_process_request (reqdata=0x7fc1f2022800)
    at /usr/src/debug/nfs-ganesha-2.7.1/MainNFSD/nfs_worker_thread.c:1329
#12 0x000000000045d399 in nfs_rpc_valid_NLM (req=0x7fc1f2022800)
    at /usr/src/debug/nfs-ganesha-2.7.1/MainNFSD/nfs_worker_thread.c:1581
#13 0x00007fc1fd658d9c in svc_vc_decode (req=0x7fc1f2022800) at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/svc_vc.c:825
#14 0x000000000044fc82 in nfs_rpc_decode_request (xprt=0x7fc1f8453800, xdrs=0x7fc1f8444c00)
    at /usr/src/debug/nfs-ganesha-2.7.1/MainNFSD/nfs_rpc_dispatcher_thread.c:1341
#15 0x00007fc1fd658cad in svc_vc_recv (xprt=0x7fc1f8453800) at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/svc_vc.c:798
#16 0x00007fc1fd6553fe in svc_rqst_xprt_task (wpe=0x7fc1f8453a18)
    at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/svc_rqst.c:767
#17 0x00007fc1fd655878 in svc_rqst_epoll_events (sr_rec=0x7fc1f84c3b10, n_events=2)
    at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/svc_rqst.c:939
#18 0x00007fc1fd655b0d in svc_rqst_epoll_loop (sr_rec=0x7fc1f84c3b10)
    at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/svc_rqst.c:1012
#19 0x00007fc1fd655bc0 in svc_rqst_run_task (wpe=0x7fc1f84c3b10)
    at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/svc_rqst.c:1048
#20 0x00007fc1fd65e510 in work_pool_thread (arg=0x7fc1f240e020) at /usr/src/debug/nfs-ganesha-2.7.1/libntirpc/src/work_pool.c:181
#21 0x00007fc1fbbd1dd5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007fc1fb4d8ead in clone () from /lib64/libc.so.6
(gdb) 


Regards,
Gaurav


On Wed, Feb 13, 2019 at 9:06 PM gaurav gangalwar <gaurav.gangalwar@gmail.com> wrote:
Using Ganesha 2.7.1
I did this sequence with NFSv3:
1>nlm lock from client
2>Restart statd
3>nlm unlock from client

The svc xprt created by nsm_connect during the NLM lock gets destroyed when statd restarts; this happens through svc_rqst_epoll_events.
But the global nsm_clnt still points to the destroyed svc xprt.
We do check the svc xprt flags to see whether it has already been destroyed, but that check does not help if the memory gets reallocated in the meantime, and we could end up corrupting memory.
Here are log snippets.
xprt destroyed through epoll:
13/02/2019 03:25:58 : epoch 5c63d058 : centos7 : ganesha.nfsd-33933[svc_13] rpc :TIRPC :F_DBG :svc_vc_wait: 0x7f5d783f0400 fd 34 recv closed (will set dead)
13/02/2019 03:25:58 : epoch 5c63d058 : centos7 : ganesha.nfsd-33933[svc_21] rpc :TIRPC :F_DBG :svc_vc_destroy_task() 0x7f5d783f0400 fd 34 xp_refcnt 0

nsm unmonitor accessing destroyed xprt:
13/02/2019 03:26:51 : epoch 5c63d058 : centos7 : ganesha.nfsd-33933[svc_21] rpc :TIRPC :F_DBG :WARNING! already destroying!() 0x7f5d783f0400 fd -1 xp_refcnt 0 af 2 port 58327 @svc_ioq_write:233
13/02/2019 03:26:54 : epoch 5c63d058 : centos7 : ganesha.nfsd-33933[svc_12] nsm_unmonitor :NLM :CRIT :Unmonitor ::ffff:10.53.91.67 SM_MON failed: RPC: Timed out


I am not sure this is the right way to use the NSM RPC client, since it points at the svc xprt without taking an extra reference.
Is this a refcount issue with the NSM RPC client, i.e. should we take an extra ref for it?
Or should we not keep a global NSM RPC client at all, and instead do nsm_connect/nsm_disconnect for every MON/UNMON call?

I tried a fix that takes an extra ref, and it seems to work.

Regards,
Gaurav