I have been testing NFS-Ganesha 2.5.4 with the default mdcache settings, and it starts showing issues after prolonged I/O runs.
Once it exhausts all of the allowed fds, it gets stuck returning ERR_FSAL_DELAY for every client op.

A snapshot of the mdcache LRU state:

open_fd_count = 4055
lru_state = {
  entries_hiwat = 100000,
  entries_used = 323,
  chunks_hiwat = 100000,
  chunks_used = 9,
  fds_system_imposed = 4096,
  fds_hard_limit = 4055,
  fds_hiwat = 3686,
  fds_lowat = 2048,
  futility = 109,
  per_lane_work = 50,
  biggest_window = 1638,
  prev_fd_count = 4055,
  prev_time = 1529013538,
  fd_state = 3
}
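
For reference, the watermarks above appear to be derived from the system-imposed fd limit. Here is a minimal sketch of that arithmetic, assuming the usual percentage defaults (99% hard limit, 90% high water, 50% low water, 40% window); this is my own illustration, not the Ganesha source:

    #include <stdio.h>

    int main(void)
    {
            /* From getrlimit(RLIMIT_NOFILE); matches fds_system_imposed above. */
            unsigned long fds_system_imposed = 4096;

            /* Assumed percentage defaults: 99% hard, 90% high water,
             * 50% low water, 40% reap window. */
            unsigned long fds_hard_limit = fds_system_imposed * 99 / 100; /* 4055 */
            unsigned long fds_hiwat      = fds_system_imposed * 90 / 100; /* 3686 */
            unsigned long fds_lowat      = fds_system_imposed * 50 / 100; /* 2048 */
            unsigned long biggest_window = fds_system_imposed * 40 / 100; /* 1638 */

            printf("hard=%lu hiwat=%lu lowat=%lu window=%lu\n",
                   fds_hard_limit, fds_hiwat, fds_lowat, biggest_window);
            return 0;
    }

Note that open_fd_count in the snapshot sits exactly at fds_hard_limit (4055).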

[cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055  entries used count:327 fdrate:0 threadwait=9
[cache_lru] lru_run :INODE LRU :INFO :lru entries: 327 open_fd_count:4055
[cache_lru] lru_run :INODE LRU :INFO :After work, open_fd_count:4055  entries used count:327 fdrate:0 threadwait=90

I have killed the NFS clients, so no new I/O is being received, but even after a couple of hours lru_run makes no progress: open_fd_count remains at 4055, so not even a single file open can be served. The server is effectively stuck.
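
My reading of the 2.5-era mdcache (a paraphrase with approximate names, not the verbatim source) is that every open is gated by a check roughly like the one below, which would explain why every op now fails with ERR_FSAL_DELAY:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct lru_state {
            uint64_t fds_hard_limit;
    };

    static struct lru_state lru_state = { .fds_hard_limit = 4055 };
    static uint64_t open_fd_count = 4055; /* pinned at the hard limit, per the snapshot */

    /* Paraphrase of the fd-availability gate (names approximate): the
     * caller maps a false return to ERR_FSAL_DELAY, telling the client
     * to retry later. */
    static bool fds_available(void)
    {
            return open_fd_count < lru_state.fds_hard_limit;
    }

    int main(void)
    {
            /* With open_fd_count stuck at 4055 and lru_run closing
             * nothing, this prints "available: no" forever. */
            printf("available: %s\n", fds_available() ? "yes" : "no");
            return 0;
    }

Since nothing ever decrements open_fd_count once the clients are gone, the gate never reopens.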

I have the following changes patched on top of the 2.5.4 code:
e2156ad3feac841487ba89969769bf765457ea6e Replace cache_fds parameter and handling with better logic
667083fe395ddbb4aa14b7bbe7e15ffca87e3b0b MDCACHE - Change and lower futility message
37732e61985d919e6ca84dfa7b4a84163080abae Move open_fd_count from MDCACHE to FSALs (https://review.gerrithub.io/#/c/391267/)

Any suggestions on how to resolve this?