I'm not sure what's happening here. Your LRU_Run_Interval is 60, which
leaves a window for a workload that starts at the wrong time and opens
many FDs per second; but once we pass the HWMark (which is 50% for you,
so there is plenty of headroom left when it triggers), we wake up the
LRU thread, and it should run.
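Roughly, the idea behind that wakeup is the following (a simplified
sketch only, not the actual mdcache_lru.c code; all names here are
illustrative stand-ins):

#include <stdatomic.h>
#include <stdio.h>

/* Illustrative stand-ins for the mdcache LRU state; these are NOT
 * the real NFS-Ganesha symbols. */
static atomic_size_t open_fd_count;
static const size_t fds_hiwat = 32768;  /* 50% of a 65536 FD limit */

static void wake_lru_thread(void)
{
        /* In Ganesha this would signal the background LRU thread;
         * here it is just a stub. */
        printf("waking LRU thread early\n");
}

/* Called whenever an FD is opened: once we cross the high water mark,
 * wake the LRU thread immediately instead of waiting out the
 * LRU_Run_Interval timer. */
static void fd_opened(void)
{
        if (atomic_fetch_add(&open_fd_count, 1) + 1 > fds_hiwat)
                wake_lru_thread();
}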
You could change a few of the log messages in lru_run() and/or
lru_run_lane() to LogEvent(), so that they show up in the log without
needing to turn on FULL_DEBUG (which would likely slow things down
enough to affect the test). That should tell you what the LRU is doing
during that time; there's an example of the change after the commit
list below. Several fixes went in post-2.8 that you might be interested
in, some of which may apply here:
aa07c9b9886b5c7f5fce32138e58f71345a3696e
MDCACHE: fix lru_run thread don't get scheduling when thread_delay
adjusted to zero
a49352d63b407010ac0d5817a8772787c7691b73
MDCACHE - Put ref on root
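As an example of the LogEvent() change suggested above (LogEvent() and
COMPONENT_CACHE_INODE_LRU are the real macro and component; the message
text itself is illustrative, not a verbatim string from the source):

/* In lru_run_lane(), promote a per-lane progress message from
 * LogDebug() to LogEvent() so it appears at the default log level: */
LogEvent(COMPONENT_CACHE_INODE_LRU,
         "lru_run_lane: lane %zd done, closed %zd FDs",
         lane, closed);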
It may just be best to update to 2.8.3 and try again.
Daniel
On 3/4/20 5:08 AM, des(a)vmware.com wrote:
We are using V2.8.2, but the last time we pulled was 7 months ago.
Frank, the fix of yours I mentioned above did fix the Filebench hang in the 'fileserver' workload.
But now we are seeing a new issue in the 'videoserver' workload.
We are seeing a flood of these errors in the ganesha log:
2020-03-03T21:08:53Z : epoch 5e5bfd3c : fsvm23 : ganesha.nfsd-33[::ffff:172.30.0.121] [svc_3546] 288 :vdfs_filehandle_open :FSAL :vdfs_open failed: could not get attributes: Input/output error (5)
2020-03-03T21:08:53Z : epoch 5e5bfd3c : fsvm23 : ganesha.nfsd-33[::ffff:172.30.0.121] [svc_3546] 84 :posix2fsal_error :FSAL :Mapping 5 to ERR_FSAL_IO, rlim_cur=65536 rlim_max=65536
There are roughly 200k lines of these errors within a few minutes. I think all operations are
failing with this error.
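For what it's worth, the rlim_cur/rlim_max values in the
posix2fsal_error message look like the process FD limit; here is a
minimal standalone check of that limit using POSIX getrlimit(2)
(plain C, not Ganesha code):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        /* RLIMIT_NOFILE is the per-process open-FD limit, the same
         * soft/hard 65536 values shown in the log above. */
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
                perror("getrlimit");
                return 1;
        }
        printf("rlim_cur=%llu rlim_max=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        return 0;
}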
So does this mean Ganesha's LRU threads could not close the FDs fast enough, and we are
therefore hitting the underlying filesystem's resource limits?
ulimit -n on our server is 65536, and ganesha.conf is configured with:

CACHEINODE {
        LRU_Run_Interval = 60;
        FD_Limit_Percent = 75;
        FD_HWMark_Percent = 50;
        FD_LWMark_Percent = 20;
        Entries_HWMark = 65536;
        Reaper_Work_Per_Lane = 100;
}
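Working the math against ulimit -n = 65536 (assuming the percentages
apply to that value; the annotations are my understanding of what each
threshold controls):

FD_LWMark_Percent:  65536 * 20% = ~13107  (LRU can relax below this)
FD_HWMark_Percent:  65536 * 50% = 32768   (LRU thread is woken above this)
FD_Limit_Percent:   65536 * 75% = 49152   (Ganesha's own FD ceiling)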
So we should have hit the high water mark before the underlying filesystem's limit and closed
the FDs, right?