ESX has a somewhat odd NFS implementation, for sure.  At Bakeathon, all 
the servers have more issues with ESX than with every other client 
combined.  That said, it's not necessarily a bug in ESX.
When the IP fails over, the client is supposed to detect the change, treat 
it as a server reboot, and initiate a graceful restart.  It then recovers 
all its state and continues correctly.
Ganesha supports this, and CephFS supports this, and it works, but to my 
knowledge it has only been tested with the Linux NFS client.
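For reference, the window the client gets to do that reclaim is the 
standard grace/lease tuning in the NFSv4 block of ganesha.conf.  The values 
below are only an illustrative sketch (roughly the usual defaults), not 
something you need to change:

    NFSv4 {
        # window after a (real or apparent) restart during which clients may reclaim state
        Grace_Period = 90;
        # NFSv4 lease interval the clients keep renewing
        Lease_Lifetime = 60;
    }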
I found a whitepaper from VMware claiming that ESX supports HA, 
including VIP failover.  However, it seems to indicate that some 
configuration must be done on the ESX host to make it work.  It's 
pretty general, so it's hard to say.  Check the config, and see if 
something related to HA or VIP failover can be set?
Here's the whitepaper I found:
https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpap...
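If it helps, the ESX-side NFS 4.1 datastore settings are visible through 
esxcli; something like the sketch below (share path and datastore name 
taken from your log, the server list purely illustrative) shows where they 
live.  Whether pointing the 4.1 client at both server addresses instead of 
a single VIP behaves any better with Ganesha is a separate question I 
haven't tested.

    # list the NFS 4.1 datastores and the server addresses they were mounted with
    esxcli storage nfs41 list
    # example of how a 4.1 datastore is added; hosts/share/name here are illustrative
    esxcli storage nfs41 add -H 172.16.8.160,172.16.8.161 -s /ceph -v nfs_c51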
Daniel
On 12/11/20 5:45 AM, rhuerta(a)edicomgroup.com wrote:
 Hello
 
 We have a shared cephfs volume exported with nfs-ganesha in two nodes.
 
 [root@c51a ~]$ ceph fs status
 esx - 2 clients
 ===
 RANK  STATE         MDS           ACTIVITY     DNS    INOS
   0    active  esx.c51b.cyxkod  Reqs:    0 /s    33     32
        POOL         TYPE     USED  AVAIL
 cephfs.esx.meta  metadata   530M  23.4T
 cephfs.esx.data    data     148G  23.4T
    STANDBY MDS
 esx.c51a.hfqjyo
 ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
 
 [root@c51a ~]$ ceph nfs cluster info
 {
      "esx": [
          {
              "hostname": "c51a",
              "ip": [
                  "172.16.8.160"
              ],
              "port": 2049
          },
          {
              "hostname": "c51b",
              "ip": [
                  "172.16.8.161"
              ],
              "port": 2049
          }
      ]
 }
 
 [root@c51a ~]$ ceph nfs export ls esx --detailed
 [
    {
      "export_id": 1,
      "path": "/",
      "cluster_id": "esx",
      "pseudo": "/ceph",
      "access_type": "RW",
      "squash": "no_root_squash",
      "security_label": true,
      "protocols": [
        4
      ],
      "transports": [
        "TCP"
      ],
      "fsal": {
        "name": "CEPH",
        "user_id": "esx1",
        "fs_name": "esx",
        "sec_label_xattr": ""
      },
      "clients": []
    }
 ]
 
 On both nodes, port 2049 listens on all IPs:
 
 [root@c51a ~]$ netstat -tulnp | grep 2049
 tcp6       0      0 :::2049                 :::*                    LISTEN      1511372/ganesha.nfs
 udp6       0      0 :::2049                 :::*                                1511372/ganesha.nfs
 
 
 We have a service IP that is failed over between the two nodes with keepalived.
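 For reference, it is a plain VRRP virtual IP; a minimal keepalived sketch of that kind
 (interface name, router id and priorities below are illustrative, not our exact config)
 looks like:
 
     vrrp_instance VI_NFS {
         state MASTER                  # BACKUP on the second node
         interface eth0                # illustrative interface name
         virtual_router_id 51
         priority 150                  # lower value on the backup node
         advert_int 1
         virtual_ipaddress {
             172.16.1.222/24           # floating service IP the clients mount (mask illustrative)
         }
     }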
 
 If we mount that NFS export on a Linux client and switch the IP from one node to the
 other, the client keeps listing the contents of the mount without problems; when the IP
 switches there is a small delay in the ls, but it ends up reconnecting.
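 (The Linux-side test is just a plain NFS v4 mount of the service IP, along these lines,
 with an illustrative mount point and assuming v4.1:)
 
     mount -t nfs -o vers=4.1 172.16.1.222:/ceph /mnt/test
     ls /mnt/test    # keeps working across the IP switch, after a short pause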
 
 However, on ESX, adding a new NFS datastore works correctly, but when the IP switches
 over the host loses the datastore and is no longer able to reconnect, giving the
 following error:
 
 2020-12-09T13:35:05.472Z cpu34:2099634)WARNING: NFS41: NFS41FSAPDNotify:6100: Lost connection to the server 172.16.1.222 mount point nfs_c51, mounted as 39a1079b-e140bb96-0000-000000000000 ("/ceph")
 2020-12-09T13:35:05.474Z cpu34:2099632)WARNING: NFS41: NFS41ProcessExidResult:2460: Cluster Mismatch due to different server Major or scope. Probable server bug. Remount data store to access
 
 We do not know whether this is a problem in the Ganesha configuration or in the ESX NFS
 client implementation, since a non-ESX Linux NFS client is able to reconnect to whichever
 host holds the service IP.
 
 
 Thank you for your help.
 
 Regards,
 Roberto
 _______________________________________________
 Support mailing list -- support(a)lists.nfs-ganesha.org
 To unsubscribe send an email to support-leave(a)lists.nfs-ganesha.org