Restoring NSX-T components from a site failure, and tips to bring down your RTO

Introduction

This article explains how the NSX-T components are backed up and restored in case of a disaster. The procedure is fairly straightforward, but there are a few gotchas to keep in mind before you restore a backup.

Backing up the NSX-T Manager Cluster

General backup procedure:

Go to the backup page of the NSX-T Manager.

When you have an NSX-T Manager Cluster, you can connect to any NSX-T Manager instance, or simply use the cluster VIP (by IP address or FQDN).

When you are there, your screen looks like the one below.

I have already clicked on the “Backup Now” button so you can see that the “Cluster Backup is In Progress”:

NOTE: In NSX-T version 2.4.0 a passphrase is OPTIONAL for the backup, but MANDATORY for the restore. So please set a passphrase for the backup as well, because you will need it during the restore.

Wait for a successful finish:
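The same "Backup Now" action and result check can also be scripted against the REST API. The sketch below is a minimal example under a few assumptions that are not part of the original walkthrough: that the one-time backup is triggered with `POST /api/v1/cluster?action=backup_to_remote`, that the results are available at `GET /api/v1/cluster/backups/history`, and that the history response contains the lists `cluster_backup_statuses`, `node_backup_statuses` and `inventory_backup_statuses` with `end_time` and `success` fields. Please verify these against the API guide of your NSX-T version. The helper names are hypothetical.

```python
import base64
import urllib.request

# Assumed endpoint paths; confirm them in the API guide for your NSX-T version.
BACKUP_NOW = "/api/v1/cluster?action=backup_to_remote"
BACKUP_HISTORY = "/api/v1/cluster/backups/history"


def build_request(host, path, user, password, method="GET"):
    """Build (but do not send) a basic-auth request for an NSX-T Manager."""
    req = urllib.request.Request(f"https://{host}{path}", method=method)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")
    return req


def latest_backup_results(history):
    """Map each backup type to the success flag of its most recent run.

    A backup type with no recorded runs maps to None.
    """
    results = {}
    for key in ("cluster_backup_statuses", "node_backup_statuses",
                "inventory_backup_statuses"):
        runs = history.get(key) or []
        if runs:
            # Assumed field: end_time is a sortable timestamp per run.
            newest = max(runs, key=lambda run: run.get("end_time", 0))
            results[key] = bool(newest.get("success"))
        else:
            results[key] = None
    return results
```

To use it against a live manager, pass `build_request(host, BACKUP_NOW, user, password, "POST")` to `urllib.request.urlopen`, then fetch `BACKUP_HISTORY` the same way and feed the parsed JSON to `latest_backup_results`. A lab manager with a self-signed certificate would additionally need a suitable `ssl` context.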

We can also verify that the backup succeeded by browsing to the files on the SFTP server itself:

You can see that the backup was made with the “FQDN setting” on.

In the directory name you will see the name ‘ih-dc1-nsxm-12.iwan.local’, and you can see the IP address in the name of another directory (from previous backups).

NOTE: The backup is done with the DNS name ‘ih-dc1-nsxm-12.iwan.local’, which means we will also need to create a NEW NSX-T Manager appliance with the same IP address and FQDN as the one from which the backup was taken.
 

The SFTP server for backups and restores:

When you select an SFTP server, make sure it is FIPS compliant. If the SFTP server is not FIPS compliant, the NSX-T Manager will not perform the backup.

You can test this by opening an SSH session from the NSX-T Manager to the SFTP server. If the server is not FIPS compliant, you will get this error:
 
root@nsxmanager:~# ssh sftpuser@ih-sftp-nonfips.iwan.local
aes_misc.c(74): OpenSSL internal error, assertion failed: Low level API call to cipher AES forbidden in FIPS mode!
Aborted (core dumped)

Restoring the NSX-T Manager Cluster

Before we can restore the backup, we need to simulate a site failure. We do this by powering off all the NSX-T Managers and the Edge VMs. Here they are, still running:

And here we have them powered off:

Now it’s time to deploy a new NSX-T Manager; the Summary page of the OVF deployment looks like this:

Once this NSX-T Manager is deployed and the FQDN that was used in the backup points to the IP address of the new NSX-T Manager (this can be the same IP address or a different one), we are ready to restore the backup from the restore screen:
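Since the restore depends on the backup FQDN pointing at the new appliance, it can be worth confirming the DNS record before opening the restore screen. The following is a minimal sketch using only the Python standard library; the function name and the IP address in the usage note are hypothetical, and the check only covers IPv4 records.

```python
import socket


def fqdn_points_to(fqdn, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if expected_ip is among the IPv4 addresses the FQDN resolves to.

    The resolver argument exists so the lookup can be swapped out in tests.
    """
    try:
        _name, _aliases, addresses = resolver(fqdn)
    except OSError:
        # Covers resolution failures such as an unknown host name.
        return False
    return expected_ip in addresses
```

For example, `fqdn_points_to("ih-dc1-nsxm-12.iwan.local", "10.0.0.12")` should return True once the record points at the new manager, where `10.0.0.12` stands in for whatever address you gave the appliance. Run it from a machine that uses the same DNS servers as the NSX-T environment, otherwise the answer may not reflect what the manager itself sees.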

Note that the passphrase field here looks optional, but it is not; if you leave it empty, the NSX-T Manager will complain that it needs one anyway. Click on Save.

Click OK on the Fingerprint Warning:

The moment we click on OK, the NSX-T Manager will list the available backups for this NSX-T Manager based on the FQDN:

Check the box of the backup you want to restore:

And click on Restore to start the process. A warning will be presented first:

NOTE: It is important that you ONLY do a restore on a blank NSX-T Manager instance; restoring to an existing NSX-T manager can result in a broken environment.

Once we click Continue, we can start tracking the progress:

During the restore a message will be presented with a warning that the Edge Transport Nodes cannot be found. In a typical full-site failure, you will also lose your Edge VMs (and this is also simulated by turning them off).

Check the box and click Resume.

A second warning will be presented that is related to the NSX-T Managers.

The warning here tells you that before you continue with the restore process, you will first need to set up the other two required NSX-T Managers which will be a part of this cluster.

I did not read it at all, and decided to ignore it and continue. Not a wise thing to do!

I checked the box and clicked Resume.

Then I was presented with a further warning related to the previous one, again telling me to deploy the remaining NSX-T Managers before continuing:

I checked the box (again) and clicked Resume.

Eventually, I got the message that my NSX-T Manager is successfully restored:

Now when we verify the nodes on the dashboard, we see that only one node is active — the one we just deployed:

From this page, we need to deploy the two remaining nodes and conduct the steps we should have done during the restore process.

We can now put in the required information for both nodes, and can deploy them both at the same time.

 

We should now see both deployments in progress:

Once deployed, a Sync needs to be done across the nodes:

And eventually the Sync is completed, and we are all good:
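The "all good" state can also be checked from a script rather than the dashboard. The sketch below interprets the parsed JSON of a cluster status call. It assumes the status is fetched with `GET /api/v1/cluster/status` and that the response carries `mgmt_cluster_status` (with `status`, `online_nodes` and `offline_nodes`) and `control_cluster_status` (with `status`), where `STABLE` is the healthy value. These names are assumptions to verify against your version's API guide, and the function name is hypothetical.

```python
def cluster_is_healthy(status, expected_nodes=3):
    """Return True if the parsed cluster status looks stable with all nodes online.

    Field names follow the assumed shape of GET /api/v1/cluster/status.
    """
    mgmt = status.get("mgmt_cluster_status", {})
    control = status.get("control_cluster_status", {})
    online = mgmt.get("online_nodes") or []
    offline = mgmt.get("offline_nodes") or []
    return (
        mgmt.get("status") == "STABLE"
        and control.get("status") == "STABLE"
        and len(online) >= expected_nodes
        and not offline
    )
```

Right after the restore, with only the first manager deployed, a check like this would be expected to fail on the node count; it should only pass once the two remaining nodes are deployed and synced.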

We can quickly check whether our Host Transport Nodes are detected:

And they are fine…

Now we verify that our Edge Nodes are also back online:

And they are not! This is because the Edge Nodes are not backed up at all, while the configuration of the T0 and T1 Gateways is backed up.

 

Backup and restore of the Edge VMs (clusters)

So how do we back up the Edge VMs? We don’t (we can’t).

We will need to redeploy the Edges from the NSX-T Manager and put in the configuration (like hostname, IP address, etc.) again.

Once they are redeployed, we may need to select them as members of their Edge Clusters again if some of the configuration parameters have changed.

Enable FQDN Setting

When we want to create backups (and do restores) based on the FQDN, we will need to enable the “FQDN” setting in the NSX-T Manager through a REST API call.

First, we check what the status is with a “GET” call:

Then we enable FQDN with a “PUT” call, using the request body shown in the screenshot:

Finally, we verify this with another “GET” call to determine that it has been changed:
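For readers who cannot see the screenshots, the GET / PUT / GET sequence can be sketched as follows. This is a minimal example and not necessarily the exact calls shown in the screenshots: it assumes the setting lives at `/api/v1/configs/management`, that the flag is called `publish_fqdns`, and that the PUT body must echo back the `_revision` value returned by the GET. Check these against the API guide for your NSX-T version. The helper names are hypothetical.

```python
import base64
import json
import urllib.request

# Assumed path of the management config resource; check your version's API guide.
CONFIG_PATH = "/api/v1/configs/management"


def fqdn_payload(current_config, enabled=True):
    """Build the PUT body, echoing back the _revision returned by the GET."""
    return {"publish_fqdns": enabled, "_revision": current_config["_revision"]}


def build_put(host, payload, user, password):
    """Build (but do not send) the PUT request that applies the payload."""
    req = urllib.request.Request(
        f"https://{host}{CONFIG_PATH}",
        data=json.dumps(payload).encode(),
        method="PUT",
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    return req
```

In practice you would first GET `CONFIG_PATH`, pass the parsed response to `fqdn_payload`, send the result of `build_put` with `urllib.request.urlopen`, and then GET again to confirm that `publish_fqdns` now reads `true`. Building the body from the GET response rather than hard-coding it avoids sending a stale `_revision`.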

 

Key Takeaways:

  • Use a passphrase when you create a backup (even though it is optional).
  • Use a FIPS compliant SFTP server.
  • Create a blank NSX-T Manager with the same FQDN as the one from which the backup was taken.
  • Deploy the remaining two NSX-T Managers during the restore process of the first NSX-T Manager and NOT afterwards.
  • Edge VMs are not backed up, and need to be manually redeployed.

 


Iwan is a seasoned network engineer with strengths in different areas of networking infrastructures such as implementing, troubleshooting and maintaining mixed vendor enterprise networks — in both fixed and deployed environments. He regularly pursues higher education for new types of technologies.

 

 
