For issue 1, there are only two things I can think of that could make
different clients see listings differently: timing and the client cache.
Obviously, while the data is mutating, listing it will show it
changing, and listing at different times will show different subsets of
the changes. Ganesha caches directory contents, so what you see depends
on when it was last cached, and which up-calls have been received.
However, this is all server side, and two clients listing at the same
time will get the same results. Clients listing at slightly different
times will get different results, and the differences won't go away
until the actual data changes are completed.
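A quick way to separate timing from caching (hostnames and paths below are
placeholders, adjust to your setup): capture a sorted listing from two
clients at roughly the same moment and diff them. Differences that persist
long after the bulk change has finished would point at caching (see below)
rather than timing.

    ssh client1 'ls -la /mnt/nfs/dir | sort' > /tmp/client1.txt
    ssh client2 'ls -la /mnt/nfs/dir | sort' > /tmp/client2.txt
    diff /tmp/client1.txt /tmp/client2.txt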
The other thing is the client cache. NFS clients cache metadata (and
sometimes data), so different clients will always have slightly
different views of the metadata and data as those caches are populated,
invalidated, and refreshed. Ganesha has no control over this aspect of
the issue.
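If you want to rule the client cache in or out, one test (for debugging
only, since it hurts performance; the mount options assume the Linux
kernel NFS client) is to mount with attribute and lookup caching disabled
and see whether the missing/stale entries still show up:

    mount -t nfs -o vers=4.2,noac,actimeo=0,lookupcache=none server:/export /mnt/test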
For issue 2, there's some other problem here. Ganesha does not do any
data caching; all read/write calls are passed straight through to the
underlying FSAL. So, if data is corrupted, it's either Gluster's fault
or the client's fault. Ganesha is just a pass-through for data.
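One way to narrow down where the corruption is introduced (paths below are
placeholders): checksum the same file from each NFS client and from a
Gluster FUSE mount on one of the servers. If the FUSE mount already shows
the bad data, the problem is below Ganesha; if only one NFS client shows
it, suspect that client's cache.

    # on each NFS client
    md5sum /mnt/nfs/path/to/suspect-file
    # on a node with the volume mounted via the Gluster FUSE client
    md5sum /mnt/glusterfs/path/to/suspect-file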
I've never heard of a setup like yours, with two separate synced
Gluster clusters fronted by NFS. I don't know anything about Gluster's
sync processes, so I don't know how that process interacts with the
up-calls that Ganesha needs to help provide consistent access to files.
My guess is that there's some issue there, but I can't say for sure. It's
also possible that the sync process isn't atomic with respect to the
APIs that Ganesha uses, and that this is the source of some of the
issues. I'm just guessing here, however.
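It would at least be worth confirming that the upcall-related volume
options are enabled on both clusters, something along these lines (the
volume name is a placeholder; the option names are the ones the
Gluster/Ganesha integration docs call for):

    gluster volume get <volname> features.cache-invalidation
    gluster volume set <volname> features.cache-invalidation on
    gluster volume set <volname> features.cache-invalidation-timeout 600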
Daniel
On 5/13/21 2:23 PM, philipmjewell(a)gmail.com wrote:
Looking to see if there is any config that needs to be set to prevent
this issue, or if any logs need to be provided for further debugging. Any assistance is
appreciated.
To help paint a picture of our implementation, we have created the following Miro board:
https://miro.com/app/board/o9J_lE0WOvs=/
We make a large change to the NFS filesystem from one of the NFS clients by
deleting and adding 8,000+ files of no particular file size.
During this bulk file update, we see a missing or corrupt file appear on one of the NFS
clients. The problem file(s) differ each time these operations happen. All other
NFS clients show the file correctly.
We have been able to consistently replicate this issue over NFS 3 as well as NFS 4.2.
The only thing that seems to get logged when enabling rpcdebug for NFS is the
pattern below for the problem file(s), where the permission check seems to use an ID of
'1' instead of the ID associated with the given file, unlike all the other files that
successfully go through this process. (We are not sure whether this log is relevant, but it
seems to be displayed when we see the error.)
NFS: nfs_update_inode(0:104/11627611431409627202 fh_crc=0x4d950748 ct=1 info=0x27e7f)
NFS: nfs_lookup_revalidate_done(<MISSING_FILE_HERE>) is valid
NFS: dentry_delete(<MISSING_FILE_HERE>, 48084c)
NFS: permission(0:104/1), mask=0x1, res=0
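(For anyone trying to reproduce: the 'NFS:' messages above come from the kernel
NFS client's debug logging, enabled with something like the following and read
from the kernel log.)

    rpcdebug -m nfs -s all   # turn on NFS client debug flags
    dmesg -w                 # watch the messages
    rpcdebug -m nfs -c all   # clear the flags afterwards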
Note: Gluster configuration has up-call enabled.
Gluster version: glusterfs 8.2
Ganesha version: NFS-Ganesha Release = V3.3