Hello,
I have a long-running test which keeps failing after 2-3 days. The workload runs fine
for 2-3 days, and then the Ganesha process hangs completely for file creation.
Mounting, unmounting, directory listing, and file reads still work. However, new file
creation fails completely. The process does not allow file creation until it is
restarted. I have waited 2-3 days to check whether the process recovers, but it does not.
Ganesha version : 3.5
Client : Windows 2016
NFS version : 3
VFS : custom
Workload
Concurrent access from multiple threads in a single client. One thread continuously (in
a loop) runs python os.walk (i.e., readdir) over the entire filesystem, roughly 5M files
in total.
The client runs a robocopy command that copies a folder to the NFS share with 10 threads.
The folder contains a random number of files (between 1 and 1000) of random sizes between
1 KB and 50 MB. When the writes complete, a single thread verifies the written content and
then deletes it. The robocopy then starts again.
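
A minimal sketch of the kind of walker loop described above, for anyone who wants to reproduce the readdir side of the workload. The function names and the `passes` parameter are hypothetical (they are not taken from the actual test script), and the robocopy side is not modelled here.

```python
import os


def walk_once(root):
    """Walk the whole tree under root once; return (directories, files) seen."""
    dirs_seen = 0
    files_seen = 0
    # os.walk lists every directory it visits, which is the readdir load
    # described in the workload when root is an NFS mount.
    for _dirpath, _dirnames, filenames in os.walk(root):
        dirs_seen += 1
        files_seen += len(filenames)
    return dirs_seen, files_seen


def walk_loop(root, passes=None):
    """Repeat walk_once; passes=None keeps looping, as in the long-running test."""
    done = 0
    while passes is None or done < passes:
        walk_once(root)
        done += 1
    return done
```

Calling `walk_loop(root)` with the mount point of the share as `root` loops until the process is stopped; a finite `passes` value is only there to make short trial runs possible.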
Debugging
- I looked at the logs and the debugger: file creation fails because
`mdcache_lru_fds_available` evaluates to false every time.
(https://github.com/nfs-ganesha/nfs-ganesha/blob/V3.5/src/FSAL/Stackable_F...)
- I checked the "lru_state" variable, and "open_fd_count" has hit the
limit, so Ganesha sends an `ERR_FSAL_DELAY` response back each time.
(gdb) p lru_state
$1 = {entries_hiwat = 500000, entries_used = 500000, entries_release_size = 100,
chunks_hiwat = 25000, chunks_used = 610, fds_system_imposed = 400000, fds_hard_limit =
396000, fds_hiwat = 360000, fds_lowat = 200000,
futility = 1, per_lane_work = 50, biggest_window = 160000, prev_fd_count = 396000,
prev_time = 1676657325, fd_state = 3}
- However, this situation continues indefinitely. The "lru_thread" is actively trying
to reap, but it is not making any progress, as the debug logs show. I could probably
increase the open fd limits, but it feels like the process would just hit the new limit
after a while.
lru_run :INODE LRU :F_DBG :formeropen=396000 totalwork=0 workpass=37 totalclosed:0
Actually processed 6 entries on lane 8 closing 0 descriptors
- Even unmounting the client does not help.
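
The numbers from the gdb dump and the debug log can be put together in a short sketch. All values are copied from the output above. The availability check is a simplified model that assumes creation is refused once the open fd count reaches `fds_hard_limit`; it is not a copy of the Ganesha source. `prev_fd_count` is used as a stand-in for `open_fd_count`, which is not part of the dump, and the helper names are hypothetical.

```python
import re

# Values copied from the gdb dump of lru_state.
LRU_STATE = {
    "fds_system_imposed": 400000,
    "fds_hard_limit": 396000,
    "fds_hiwat": 360000,
    "fds_lowat": 200000,
    "prev_fd_count": 396000,  # stand-in for open_fd_count (assumption)
}

# Debug output copied from the lru_run log, joined into one string.
LRU_LOG = (
    "lru_run :INODE LRU :F_DBG :formeropen=396000 totalwork=0 workpass=37 "
    "totalclosed:0 Actually processed 6 entries on lane 8 closing 0 descriptors"
)


def percent_of_system_limit(state, field):
    """Express one fd threshold as a whole percentage of fds_system_imposed."""
    return 100 * state[field] // state["fds_system_imposed"]


def fds_available(open_fd_count, state):
    """Simplified model (assumption): refuse once the count reaches the hard limit."""
    return open_fd_count < state["fds_hard_limit"]


def parse_lru_counters(text):
    """Collect name=value and name:value integer pairs from the debug line."""
    return {name: int(value) for name, value in re.findall(r"(\w+)[=:](\d+)", text)}


def reaper_stalled(counters):
    """True when descriptors were open before the pass but none were closed."""
    return counters["formeropen"] > 0 and counters["totalclosed"] == 0


if __name__ == "__main__":
    for field in ("fds_hard_limit", "fds_hiwat", "fds_lowat"):
        print(field, percent_of_system_limit(LRU_STATE, field), "%")
    print("fds available:", fds_available(LRU_STATE["prev_fd_count"], LRU_STATE))
    print("reaper stalled:", reaper_stalled(parse_lru_counters(LRU_LOG)))
```

Under this model the count sits exactly at the hard limit, so every creation is refused, and a pass that starts with 396000 open descriptors and closes 0 never brings the count back under it.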
I have a server ready in this state. I can run any debug commands and provide logs.
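
As one example of such a check, here is a minimal sketch that summarizes what a process's open descriptors point to. It assumes the server is Linux and that `/proc/<pid>/fd` of the Ganesha process is readable by the user running it; the function names and the bucketing by leading path components are hypothetical choices for illustration.

```python
import os
from collections import Counter


def classify_target(target, depth=2):
    """Reduce a descriptor target to a coarse bucket."""
    if target.startswith("/"):
        # Regular path: keep only the first `depth` path components.
        return "/".join(target.split("/")[: depth + 1])
    # Non-file targets look like "socket:[1234]" or "pipe:[5678]".
    return target.split(":", 1)[0]


def summarize_fd_targets(fd_dir, depth=2):
    """Count the symlink targets found in fd_dir, grouped by bucket."""
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor may have been closed between listdir and readlink.
            continue
        counts[classify_target(target, depth)] += 1
    return counts
```

Calling `summarize_fd_targets("/proc/<pid>/fd")` with the pid of the Ganesha process, followed by `.most_common(10)` on the result, would show whether the descriptors are concentrated under particular directories of the export.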
Thanks in advance for the help.