- I was able to rebuild the 2 ESXi hosts that comprise my vSAN cluster with no data loss
- Recovery was pretty easy once I understood the actual process
- Understanding the command line options of esxcli vsan cluster unicastagent was key
- Don’t mess with the Witness Appliance networking, even in a 2-node direct connect
So There I Was…
It was a banner maker of bad days. My wife tested positive for Strep Throat. One of my daughters (4) was battling some mutant cold that had her miserable. My other daughter (2ish) was fully melting down in the corner. I was having issues getting NSX reinstalled in my lab after pulling it out for some testing a few weeks back - Hosts wouldn’t prep. All day I had been trying various tricks to resolve the problem. I had been trying to manually install the VIBs.
It was nearing bedtime for the kiddos, and I went to connect to vCenter so I could pull the host out of maintenance mode. I noticed I couldn’t get to it anymore via the hostname. Odd. IP not working either. vCenter is down. Doubly odd. Then it hits me…my DNS isn’t working. I can’t log into the primary host (currently in maintenance mode); the UI keeps crashing. I log into the other host, and find a terrible sight…all my VMs on that host are throwing invalid errors next to them. Unable to reach datastore. Crap. This isn’t good. I had done the maintenance mode cycle many times throughout the day without issues. Since there were no active VMs on the first host, I decided to throw a reboot on it. And then the world came crashing down…
Nuclear Launch Detected…
When the host came back from reboot, it was in a strange state. Since I couldn’t ping it or connect to the UI, I hopped on iDRAC to take a look. The IP was set as static, 0.0.0.0. Clearly this wasn’t correct. I attempted to go in and correct the network settings, but the network options could not be selected. Hadn’t seen this one before! I rebooted again, same behavior. So at this point, I know a few things…
- One host is currently totally out of commission - it needs to be rebuilt or reset completely
- Other host is having problems prepping in NSX as well
- Management host is having the same problem regarding NSX
- Looking at the functional host, I see data still present on the disks. When I run esxcli vsan storage list, the disk group shows healthy and is still intact
Given that I’m having problems already - let’s use the tactical and careful response…
I want to be 100% clear here - I was NOT nuking the hosts to solve the vSAN issue only. The NSX problems, as well as a few other hiccups in my environment, made me feel like this was a good opportunity to start clean. Plus, this would allow me to test if I had a solid grip on recovering my environment. After doing some digging, and stumbling upon a post by a “relatively well known blogger” named William Lam, I was certain that my data would be safe after a reinstall. Ever appreciating the opportunity to test new things in the lab, LET’S ROLL. My process is going to look like the following…
- Reinstall both vSAN hosts
- Recover my vSAN via black magic and voodoo (without a vCenter)
- Recover the VMs
- Once stable, reinstall my Management Host (where my Witness Appliance lives; uses my Synology for storage)
- Get back to living life
The Hosts Are Back, But Without a vCenter
Once I finished the install of the hosts, I knew I had to reconfigure vSAN without the use of a GUI. There was a lot of content “out there”; but most of it was around setting up a fresh vSAN implementation. I didn’t want this obviously. I wanted to just recover my existing configuration.
When I started researching, I was able to find a great guide for most of the commands on This Blog. Some modifications were needed; but this was a solid start.
To confirm everything was solid, I ran esxcli vsan storage list on both hosts to ensure the diskgroups were intact and healthy. Sample of the output below.
The diskgroups are intact; so I didn’t want to do anything to mess that up. Time to start surgery. First things first, I needed to handle the networking. The diagram below shows a pretty simple overview of what my endstate needed to be restored to.
In order to accomplish this, I had to tag the ESXi vmkernels appropriately. My original configuration had a direct connect setup for the vSAN traffic over its own vmkernel, and sent my Witness traffic over the standard vmkernel that my management network was on. To accomplish this, I used the following commands
esxcli vsan network ip add -i vmk0 -T=witness
esxcli vsan network ip add -i vmk1 -T=vsan
Once this was done, I ran esxcli vsan network list to confirm everything was set up correctly.
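Since the tagging has to be repeated on each host, it can help to generate the commands once and review them before pasting. What follows is a minimal sketch, not a definitive tool: the helper name tag_cmds is hypothetical (it is not part of esxcli), it assumes the vmk0/vmk1 layout described above, and it only prints the commands rather than running them.

```shell
# Print (do not run) the esxcli commands that tag vmkernel interfaces.
# Each argument is "<vmk>:<traffic type>", e.g. vmk0:witness vmk1:vsan.
# tag_cmds is a hypothetical helper name used for illustration only.
tag_cmds() {
    for pair in "$@"; do
        vmk=${pair%%:*}       # text before the first colon
        ttype=${pair#*:}      # text after the first colon
        case "$ttype" in
            vsan|witness) ;;  # the two traffic types used in this post
            *) echo "unknown traffic type: $ttype" >&2; return 1 ;;
        esac
        echo "esxcli vsan network ip add -i $vmk -T=$ttype"
    done
    # Finish with the verification command from the text.
    echo "esxcli vsan network list"
}
```

Calling tag_cmds vmk0:witness vmk1:vsan prints the two tagging commands followed by the list command, ready to paste into an SSH session on each host.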
I needed to make sure these commands were run on both hosts. I also needed to drop my existing witness host from the “previous” vSAN cluster - since that cluster has the 2 “original” hosts in it. I SSH’d into the Witness host, and ran the following command…
esxcli vsan cluster leave
Which produced errors about being unable to leave the cluster. At this point, I figured I had already rebuilt 2 hosts. I might as well redeploy a new Witness server as well. I deployed a new one in parallel - not destroying the old one but powering it off completely. When it finished deploying, I gave it a new IP address.
Once this is complete, we’ll have the following configuration
- 2 standalone ESXi hosts, each having a configured diskgroup. Not in a vSAN cluster.
- A standalone Witness host, not in a vSAN cluster.
Borrowing from William Lam’s post I referenced earlier, I run the following command on the first of the 2 nodes.
esxcli vsan cluster join -u $(python -c 'import uuid; print(str(uuid.uuid4()));')
esxcli vsan cluster get
From the result, we grab the Sub-Cluster UUID, and copy it. On the second host, I ran the following:
esxcli vsan cluster join -u [UUID]
esxcli vsan faultdomain get
This joined the second host to the vSAN cluster and also gave me the fault domain ID which I needed to join my Witness back into this vSAN cluster.
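Copying UUIDs by hand out of the esxcli vsan cluster get output is easy to get wrong, so here is a small sketch for pulling a single field out of that output. It assumes the output is indented "Key: value" text, as it was on my hosts. The helper name cluster_field is hypothetical, and the field names should be checked against what your build actually prints.

```shell
# Read "Key: value" text on stdin and print the value for one exact key.
# cluster_field is a hypothetical helper name, not an esxcli feature.
# Intended use on a host (not run here):
#   esxcli vsan cluster get | cluster_field 'Sub-Cluster UUID'
cluster_field() {
    awk -v key="$1" '{
        line = $0
        sub(/^[ \t]+/, "", line)        # drop leading indentation
        prefix = key ": "
        if (index(line, prefix) == 1) { # line starts with "<key>: "
            print substr(line, length(prefix) + 1)
            exit
        }
    }'
}
```

Matching on the whole key means 'Sub-Cluster UUID' will not accidentally pick up a longer key that merely contains similar words. The same helper can be pointed at 'Local Node UUID', which is the value each host needs from its peer in the unicastagent step.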
Here’s Where a “Learning Opportunity” Mistake Happened
Since I knew I was doing a 2 node direct connect configuration, I knew that my witness traffic was going over my management network. I had thought that I would need to SSH into the Witness node, and tag the vmkernel traffic to be Witness. THIS WAS WRONG. I made the change, and progressed forward; which caused me a ton of headache later on.
To join the Witness Appliance to the cluster, I ran the below command.
esxcli vsan cluster join -u [UUID] -w -t -p [Fault Domain ID]
Now, I had expected that here is where the magic would happen. I logged into one of my hosts - and the vSAN Datastore was blank and still half the size I expected. Clearly not functional. I started digging and discovered this blog post. This enlightened me around the need to use the esxcli to interact with the unicastagent configuration and add the nodes.
On each host I ran the following…
esxcli vsan cluster unicastagent add -a [remote storage ip] -U true -u [remote host local node UUID] -t node
Once this is complete, I ran
esxcli vsan cluster unicastagent list
And was able to see my datastore again. I re-added my vCenter to the environment, powered it up, and added the hosts to the cluster.
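The part that tripped me up is that each host needs an entry pointing at the other host, not at itself. The sketch below lays that out for a 2-node setup. It is a dry run that only prints the commands; the helper name unicast_plan is hypothetical, and the IPs and UUIDs you pass in are placeholders for your own vSAN vmkernel IPs and each host's Local Node UUID.

```shell
# Print (do not run) the unicastagent commands for a 2-node cluster.
# Arguments: IP_A UUID_A IP_B UUID_B, where IP is the vSAN vmkernel IP
# and UUID is that host's Local Node UUID.
# unicast_plan is a hypothetical helper name used for illustration only.
unicast_plan() {
    if [ "$#" -ne 4 ]; then
        echo "usage: unicast_plan IP_A UUID_A IP_B UUID_B" >&2
        return 1
    fi
    echo "# on host A (peer is B):"
    echo "esxcli vsan cluster unicastagent add -a $3 -U true -u $4 -t node"
    echo "# on host B (peer is A):"
    echo "esxcli vsan cluster unicastagent add -a $1 -U true -u $2 -t node"
    echo "# on both hosts, to verify:"
    echo "esxcli vsan cluster unicastagent list"
}
```

Printing rather than executing keeps the helper safe to run anywhere and gives you one last look at which UUID is going to which host.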
Remember That Earlier Mistake?
Once everything was re-added, I proceeded to check the vSAN configuration, which was flooded with errors. Inability to communicate with the Witness node was mentioned a number of times - along with data being in an unprotected state. One of the errors indicated I needed to clean up the vSAN database as it still had lingering entries from the previous hosts. This was an easy single button click to remediate. From here, I logged into my old Witness host and checked to see if it was still in the vSAN cluster. It wasn’t. The cleanup operation had dropped it from its original cluster. I made sure the networking matched the new Witness host I stood up (with its original IP address). This was Another Mistake and served no purpose. I then ran a “Change Witness Host” operation from vCenter and was rewarded with the same results.
A major lesson here - Don’t mess with purpose built appliances unless you have a VERY WELL THOUGHT OUT AND DOCUMENTED REASON to. After consulting with the great folks on the vExpert Slack channel - I was advised that I should NOT have to change any networking configurations on the Witness Appliance.
I re-ran the tagging command to set the traffic back to vSAN traffic on vmk1 for the witness appliance.
esxcli vsan network ip set -i vmk1 -T=vsan
Behold, The Glory of vSAN
Once I flipped that tag back to vSAN traffic, magic happened. The vCenter UI refreshed, and all my communication errors went away. I was able to watch as the counter of “Fault Domain Compliant” VMs grew rapidly. My hosts were functional again. The storm was over.
I learned a lot as a result of this little exercise. Also, a lot of things I had been told previously but never experienced were proven.
- vSAN has some awesome resiliency built into it. I was able to completely rebuild my hosts and still ultimately end up in a place where the data on the disks was accessible
- The setup really is pretty easy, especially in a 2-node direct connect
- Having a standalone management host that ran some out of band services, as well as the requirement of the Witness host, was crucial
- Moving away from multicast as a requirement was a great move for vSAN - but it certainly left a requirement of needing a way to discover other nodes. Figuring out the unicastagent add bits was obviously required. If I had a vCenter up and running - vCenter would’ve handled all of that
- The Witness Appliance IS a Witness. You don’t need to change its networking configuration to do that. Leave it alone; it’s built the way it is for a reason.
Ultimately, I’m a pretty big novice when it comes to vSAN. This was a great learning experience for me around managing server failure in a vSAN cluster and how to recover completely.
Huge thanks to Jase McCarty, Jeff Wong, and John Nicholson for explaining some of the finer details to me along the way. Great folks to be able to bounce ideas off of in the vSAN Channel on the vExpert Slack!