I have been testing Ganesha 2.5.4 with the default mdcache settings, and it
starts showing issues after prolonged I/O runs. Once it exhausts all the
allowed fds, it gets stuck returning ERR_FSAL_DELAY for every client op.
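For context on the failure mode: once open_fd_count reaches fds_hard_limit,
mdcache refuses to open further file descriptors and the operation is failed
with ERR_FSAL_DELAY (NFS4ERR_DELAY to the client). A minimal sketch of that
guard, with simplified stand-ins for the real globals (names and types here
are illustrative, not the exact 2.5.4 source):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the counters in the dump below. */
static atomic_size_t open_fd_count;       /* currently open fds  */
static size_t fds_hard_limit = 4055;      /* 99% of system limit */

/* Sketch of the guard: while open_fd_count sits at or above
 * fds_hard_limit, every open attempt is refused and the caller
 * surfaces ERR_FSAL_DELAY, i.e. "retry later". If the LRU reaper
 * never manages to close fds, the refusal becomes permanent. */
static bool fds_available(void)
{
        return atomic_load(&open_fd_count) < fds_hard_limit;
}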
A snapshot of the mdcache LRU state:
open_fd_count = 4055
lru_state = {
entries_hiwat = 100000,
entries_used = 323,
chunks_hiwat = 100000,
chunks_used = 9,
fds_system_imposed = 4096,
fds_hard_limit = 4055,
fds_hiwat = 3686,
fds_lowat = 2048,
futility = 109,
per_lane_work = 50,
biggest_window = 1638,
prev_fd_count = 4055,
prev_time = 1529013538,
fd_state = 3
}
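The thresholds in this snapshot are consistent with the default percentage
parameters applied to the system-imposed fd limit (I believe the defaults
are FD_Limit_Percent=99, FD_HWMark_Percent=90, FD_LWMark_Percent=50,
Biggest_Window=40). A quick check, using C integer division:

#include <stdio.h>

int main(void)
{
        int fds_system_imposed = 4096;

        /* Reproduces the dumped values from the default percentages. */
        printf("fds_hard_limit = %d\n", fds_system_imposed * 99 / 100); /* 4055 */
        printf("fds_hiwat      = %d\n", fds_system_imposed * 90 / 100); /* 3686 */
        printf("fds_lowat      = %d\n", fds_system_imposed * 50 / 100); /* 2048 */
        printf("biggest_window = %d\n", fds_system_imposed * 40 / 100); /* 1638 */
        return 0;
}

So the limits themselves look sane; the problem is that open_fd_count never
comes back down.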
[cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055
entries used count:327 fdrate:0 threadwait=9
[cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
[cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055
entries used count:327 fdrate:0 threadwait=90
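My reading of these messages (hedged, since I may be misreading the reaper):
fdrate:0 means the pass closed nothing, futility counts consecutive fruitless
passes (109 in the dump above), and threadwait growing from 9 to 90 is the
thread backing off. A rough sketch of that shape, with invented threshold and
back-off values where I am unsure of the real ones:

#include <stdbool.h>

/* Simplified model of the reaper's back-off, NOT the real lru_run:
 * every pass that fails to close an fd bumps `futility`; past some
 * threshold the thread only logs and sleeps longer. If each cached
 * entry pins its fd (or the closes silently fail), open_fd_count
 * never drops and the guard shown earlier rejects opens forever. */
static int futility;

static void after_lru_pass(bool closed_any_fd, int *threadwait)
{
        if (closed_any_fd) {
                futility = 0;           /* progress: reset back-off  */
        } else if (++futility > 8) {    /* threshold is illustrative */
                *threadwait *= 10;      /* would match 9 -> 90 above */
        }
}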
I have killed the NFS clients, so no new I/O is being received. But even
after a couple of hours I don't see lru_run making any progress:
open_fd_count remains at 4055, and not even a single file open can be
served. The server is essentially stuck.
I have these changes patched on top of the 2.5.4 code:

e2156ad3feac841487ba89969769bf765457ea6e Replace cache_fds parameter and handling with better logic
667083fe395ddbb4aa14b7bbe7e15ffca87e3b0b MDCACHE - Change and lower futility message
37732e61985d919e6ca84dfa7b4a84163080abae Move open_fd_count from MDCACHE to FSALs (https://review.gerrithub.io/#/c/391267/)
Any suggestions on how to resolve this?