ESX has a somewhat odd NFS implementation, for sure. At Bakeathon, all
the servers have more issues with ESX than with every other client
combined. That said, it's not necessarily a bug in ESX.
When the IP fails over, the client is supposed to detect the change,
treat it as a server reboot, and initiate a graceful restart. It then
recovers all of its state and continues correctly.
Ganesha and CephFS both support this, and it works, but to my
knowledge it has only been tested with the Linux NFS client.
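For reference, the takeover path relies on Ganesha's clustered recovery
backend. A ceph-deployed cluster like yours should set this up
automatically, but the relevant pieces of ganesha.conf look roughly like
this (a sketch based on the rados_cluster recovery docs; the pool and
namespace values are placeholders):

    # ganesha.conf fragment (illustrative; check your version's docs)
    NFSv4 {
            RecoveryBackend = rados_cluster;
            Minor_Versions = 1, 2;
    }
    RADOS_KV {
            pool = "<recovery-pool>";
            namespace = "<cluster-namespace>";
            nodeid = "c51a";        # unique and stable per ganesha node
    }

If the recovery backend isn't clustered (or the nodeid changes across a
failover), the reclaim won't work even for clients that handle the IP
move correctly.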
I found a whitepaper from VMware claiming that ESX supports HA,
including VIP failover. However, it seems to indicate that some
configuration must be done on the ESX side to make it work. The paper
is pretty general, so it's hard to say exactly what is required. Check
the config, and see if something related to HA or VIP failover can be
set?
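One other thing that may be worth testing: the ESXi NFS 4.1 client can
be given several server addresses for a single datastore and will
handle path failover itself. Something like this (syntax per the
esxcli docs; the volume name here is made up, and the addresses and
share are taken from your output below):

    esxcli storage nfs41 add -H 172.16.8.160,172.16.8.161 -s /ceph -v cephfs_ds

Note that this only helps if both addresses present the same server
identity, which ties into the error discussed below.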
Here's the whitepaper I found:
https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpap...
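One more data point: the "Cluster Mismatch due to different server
Major or scope" message in your log is the ESX client complaining that
the EXCHANGE_ID reply changed after the failover, i.e. the two Ganesha
nodes are presenting different server_owner/server_scope values. I
believe Ganesha derives those from the local host by default, so two
independent instances won't match. You can confirm it by capturing the
mount traffic against each node and comparing the EXCHANGE_ID replies
in Wireshark (a sketch; the interface name is an assumption):

    # on a client, capture an NFSv4.1 mount against each node in turn
    tcpdump -i eth0 -w exid-c51a.pcap host 172.16.8.160 and tcp port 2049
    tcpdump -i eth0 -w exid-c51b.pcap host 172.16.8.161 and tcp port 2049
    # then diff the server_owner / server_scope fields of the
    # EXCHANGE_ID replies in the two captures

If they differ, check whether your Ganesha version offers a way to
override them.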
Daniel
On 12/11/20 5:45 AM, rhuerta(a)edicomgroup.com wrote:
Hello
We have a shared CephFS volume exported with nfs-ganesha on two nodes.
[root@c51a ~]$ ceph fs status
esx - 2 clients
===
RANK  STATE   MDS              ACTIVITY    DNS  INOS
 0    active  esx.c51b.cyxkod  Reqs: 0 /s   33    32
      POOL        TYPE      USED   AVAIL
cephfs.esx.meta   metadata  530M   23.4T
cephfs.esx.data   data      148G   23.4T
STANDBY MDS
esx.c51a.hfqjyo
ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
[root@c51a ~]$ ceph nfs cluster info
{
  "esx": [
    {
      "hostname": "c51a",
      "ip": [
        "172.16.8.160"
      ],
      "port": 2049
    },
    {
      "hostname": "c51b",
      "ip": [
        "172.16.8.161"
      ],
      "port": 2049
    }
  ]
}
[root@c51a ~]$ ceph nfs export ls esx --detailed
[
  {
    "export_id": 1,
    "path": "/",
    "cluster_id": "esx",
    "pseudo": "/ceph",
    "access_type": "RW",
    "squash": "no_root_squash",
    "security_label": true,
    "protocols": [
      4
    ],
    "transports": [
      "TCP"
    ],
    "fsal": {
      "name": "CEPH",
      "user_id": "esx1",
      "fs_name": "esx",
      "sec_label_xattr": ""
    },
    "clients": []
  }
]
On both nodes, port 2049 listens on all IPs:
[root@c51a ~]$ netstat -tulnp | grep 2049
tcp6   0   0   :::2049   :::*   LISTEN   1511372/ganesha.nfs
udp6   0   0   :::2049   :::*            1511372/ganesha.nfs
We have a floating service IP managed with keepalived.
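For reference, the keepalived VRRP definition for the floating IP looks
roughly like this (interface name, router id, and priorities
simplified):

    vrrp_instance NFS_VIP {
            state MASTER            # BACKUP on the other node
            interface eth0
            virtual_router_id 51
            priority 100            # lower on the backup node
            virtual_ipaddress {
                    172.16.1.222
            }
    }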
If we mount that NFS export on a Linux client and switch the IP from one
node to the other, the client keeps listing the contents of the mount
without problems: there is a small delay in the ls when the IP moves,
but it ends up reconnecting.
However, on ESX, adding a new NFS datastore works correctly, but when
the IP switches it loses the datastore and is no longer able to
reconnect, giving the following error:
2020-12-09T13:35:05.472Z cpu34:2099634)WARNING: NFS41: NFS41FSAPDNotify:6100: Lost connection to the server 172.16.1.222 mount point nfs_c51, mounted as 39a1079b-e140bb96-0000-000000000000 ("/ceph")
2020-12-09T13:35:05.474Z cpu34:2099632)WARNING: NFS41: NFS41ProcessExidResult:2460: Cluster Mismatch due to different server Major or scope. Probable server bug. Remount data store to access
We do not know whether this is a problem in the Ganesha configuration or
in the implementation of the ESX NFS client, since a Linux NFS client
(unlike ESX) is able to reconnect to whichever host holds the service
IP.
Thank you for your help.
Regards,
Roberto