If a client starts writing a large file, a READDIR on the containing directory appears to
hang. Debugging further, it appears to be caused by lock contention between the FSAL merge
function (vfs_merge()) and the FSAL write and commit functions.
- Since the write started first, the OPEN would have created a new MDCACHE entry.
- As part of READDIR, MDCACHE tries to create a new entry for the same file, so
mdcache_new_entry() reaches obj_ops->merge(). This seems to get stuck trying to take
the obj_lock in write mode.
- Since the client keeps sending WRITEs and COMMITs, the READDIR thread seems to stay
stuck until it can get the write lock in vfs_merge().
This can be reproduced with the following commands (latest Ganesha, VFS FSAL), where
/gsh4 is the mount point:
- mkdir /gsh4/dir.1
- touch /gsh4/dir.1/file.{1..100}
- dd if=/dev/zero of=/gsh4/dir.1/largefile bs=1M count=100000 &
- ls /gsh4/dir.1
Note that any other client doing a READDIR will also be slow.
Is there a way to optimize this path? Wouldn't share_counters be zero in the readdir
path (meaning there isn't anything to merge)?
Thanks,
Pradeep