Hi,
I've found a bug that affects only some platforms. It is caused by a union of two lock
types in src/include/sal_data.h:
union {
	/** Lock protecting state */
	pthread_mutex_t st_lock;
	/** Lock protecting export junctions */
	pthread_rwlock_t jct_lock;
};
When I ran Ganesha on macOS with the MEM FSAL and tried to mount the share, the debug log
showed the same address being used first as jct_lock and then as st_lock:
$ grep 0x7feb0200a080 ~/ganesha.log
02/10/2020 17:02:28 : epoch 5f77bf94 : <snip> : ganesha.nfsd-95907[main]
state_hdl_init :RW LOCK :F_DBG :Init rwlock 0x7feb0200a080 (&ostate->jct_lock) at
/Users/matvore/ganesha/src/include/sal_functions.h:141
02/10/2020 17:02:28 : epoch 5f77bf94 : <snip> : ganesha.nfsd-95907[main]
init_export_root :RW LOCK :F_DBG :Got write lock on 0x7feb0200a080
(&obj->state_hdl->jct_lock) at
/Users/matvore/ganesha/src/support/exports.c:2595
02/10/2020 17:02:28 : epoch 5f77bf94 : <snip> : ganesha.nfsd-95907[main]
init_export_root :RW LOCK :F_DBG :Unlocked 0x7feb0200a080
(&obj->state_hdl->jct_lock) at
/Users/matvore/ganesha/src/support/exports.c:2605
02/10/2020 17:03:59 : epoch 5f77bf94 : <snip> : ganesha.nfsd-95907[svc_9]
nfs4_op_getattr :RW LOCK :CRIT :Error 22, acquiring mutex 0x7feb0200a080
(&(obj)->state_hdl->st_lock) at
/Users/matvore/ganesha/src/Protocols/NFS/nfs4_op_getattr.c:122
Error 22 in the last message is EINVAL, and that message is printed only when acquiring a
pthread mutex (not a rwlock). I think the root directory of the filesystem is sometimes
treated as a normal directory and sometimes as a junction, so the same storage is
initialized as one lock type and later used as the other. The following program "works"
on Linux but prints 22 on its second line on macOS:
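#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Same layout as the Ganesha union: both locks share storage. */
	union {
		pthread_mutex_t mut;
		pthread_rwlock_t rwl;
	} foo;

	memset(&foo, 0, sizeof(foo));
	/* Initialize the shared storage as a rwlock... */
	printf("ok if 0: %d\n", pthread_rwlock_init(&foo.rwl, NULL));
	/* ...then lock it as a mutex; on macOS this returns EINVAL (22). */
	printf("ok if 0: %d\n", pthread_mutex_lock(&foo.mut));
	return 0;
}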
I'll keep looking into this, but I'm posting now in case anyone has hints. As a naive fix, I
removed the union and treated the lock as a rwlock everywhere (taking a write lock where a
mutex lock was taken before). That eliminated the EINVAL error and the abort, but something
else may still be wrong: I hit a deadlock once while mounting, and I noticed that the lock
ordering differs depending on which member of the union is in use.
Thanks,
Matt