Today I lost networking on 3 of the 4 XCP-ng hosts in a pool. I logged in to XCP-ng Center from a Windows workstation and couldn’t connect to the pool. I logged into the pool master with PuTTY and had no issues, yet when I ran commands against the pool it would say “lost connection to the server”. The XCP-ng Center logs on that workstation said “The server that you are talking to is a slave” during that time period.
I thought maybe this was an issue with XCP-ng Center and went to a laptop I had, but it exhibited the same issues, even after removing the configuration and trying to re-add the server. Then, just as on the workstation, it noted that this server belongs to a pool and asked whether I would like to import the pool master. I said yes, but it wouldn’t import the pool or the system, so the operation never completed.
I performed an xe-toolstack-restart and still had the same issues. I finally logged into the physical console, and all of the network information was gone from it, yet I could still reach the system via SSH with no issue, connected and able to run commands.
This whole time, all of my VMs on this system, the pool master, were up and working just fine. The other two hosts didn’t have VMs on them at this point, as I was transitioning away from them.
On the physical console I went into the “Local Command Shell”, and ifconfig showed all my physical interfaces; I could ping out on any of them. After a reboot I could still SSH into the system, but there were still no management interfaces. I tried running xe-toolstack-restart several more times, and that’s when I started seeing physical interfaces disappear from the local shell. After a few attempts, only my local loopback was left!
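For anyone hitting something similar, this is roughly the diagnostic sequence from the local shell. The interface name and ping target below are placeholders, not my actual setup:

```shell
# List the interfaces dom0 can still see (loopback only = bad sign)
ip addr show

# Test outbound reachability on a specific physical NIC
# (eth0 and the target address are examples)
ping -c 3 -I eth0 192.168.1.1

# Ask xapi what it thinks the PIFs look like; this may fail with
# "lost connection to the server" if the toolstack is unhealthy
xe pif-list params=uuid,device,management,IP
```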
Of course, my VMs on this pool master were now down and not operating.
Finally, on the one XCP-ng host that was still up, I logged in via SSH and made it the pool master with “xe pool-emergency-transition-to-master”, and that worked. I could now connect to the pool, with the one host in it, from XCP-ng Center.
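For reference, the promotion is run on the surviving host itself. The second command is one I didn’t need here, but it exists for pointing any still-reachable members at the new master:

```shell
# On the surviving member, promote it to pool master
xe pool-emergency-transition-to-master

# Optionally, tell any still-reachable members who the new master is
xe pool-recover-slaves
```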
Then I wanted to get the VMs up on the new pool master, so I ran “xe host-list”, then “xe vm-list resident-on=<UUID-pool-master>”, then “xe vm-reset-powerstate” against those VMs. I could now see the VMs I wanted to start on the new pool master, but couldn’t start them because the one remaining host with working networking was in maintenance mode.
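Spelled out, the recovery steps were along these lines. The UUIDs are placeholders, and the final xe host-enable is how I’d take the host out of maintenance mode from the CLI rather than from XCP-ng Center:

```shell
# Find the new master's UUID
xe host-list params=uuid,name-label

# List the VMs still marked as resident on it
xe vm-list resident-on=<UUID-pool-master>

# Clear the stale power state for each VM marked running but actually halted
xe vm-reset-powerstate uuid=<vm-uuid> --force

# Take the host out of maintenance mode so VMs can be started
xe host-enable uuid=<UUID-pool-master>

# Then start a VM
xe vm-start uuid=<vm-uuid>
```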
I finally re-installed the original master from an XCP-ng 8.1 ISO, got the networking back, and had it rejoin the pool as a slave. Everything started to look good again, and the new pool master came out of maintenance mode. I started my VMs and they are working. I am still concerned that I lost the network configuration on 3 of my 4 XCP-ng hosts, though.
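Rejoining the rebuilt host as a slave is a one-liner run from that host; the master address and credentials here are placeholders:

```shell
# Run on the freshly re-installed host, pointing at the new pool master
xe pool-join master-address=<master-ip> master-username=root master-password=<password>
```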
The only thing I can think of is that about a week and a half ago I transitioned from an older pool master to a new one using “xe pool-designate-new-master host-uuid=<host-uuid>”. Since I was moving to newer, more up-to-date hardware, I then powered down 2 hosts, leaving the 2 I wanted to keep; I was planning to remove the old hosts from the pool this weekend. These were my only changes, and I observed no issues until this morning. These servers have been part of the same pool for months now.
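For comparison, that earlier hand-off used the orderly command, which requires a healthy pool. The eject step is the part I had postponed to the weekend; with hindsight, it would have looked something like this:

```shell
# Orderly master hand-off (pool must be healthy and members reachable)
xe pool-designate-new-master host-uuid=<host-uuid>

# Cleanly remove a retired host from the pool before powering it down
xe pool-eject host-uuid=<old-host-uuid>
```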
The good thing is that once I decided to just re-install the former pool master and recreate the networking (thank goodness I didn’t have a complex networking or VLAN scheme), everything worked well. Having the storage for my virtual systems on FreeNAS, utilizing ZFS, gives you great reliability.
Earlier this week the dual-port Intel 10 Gigabit XF SR Server Adapter in my Windows workstation went dead, which has made it a really strange week in my computer lab. Thank goodness I had a spare, and I was able to recover my virtual environment!