
How High Performance and Scientific Computing Benefit from vSphere 8’s Distributed Services Engine

Data center technology is evolving on multiple fronts, from faster NVMe-based storage, to higher central processing unit (CPU) core counts and larger system memory, to faster interconnects such as 400GbE. These innovations are being adopted by applications and platforms like big data, HPC, and AI/ML, which push the boundaries to get the best performance possible. Many of these newer applications and platforms benefit not only from faster interconnects, lower-latency storage, and more compute power but also from hardware offloads like GPUs and FPGAs.

In the current data center architecture, the CPU oversees, coordinates, and controls most of the functionality and capabilities of the hardware and of the software-defined services that run on the operating systems (e.g., NSX). Unfortunately, this CPU-centric approach has proven to be a bottleneck that limits how far architectures can scale.

We need to start changing the design approach for software-defined data centers to one that tightly couples the hardware with the operating systems. A good approach would be to offload specific management plane, networking, security, and storage services to dedicated hardware, effectively freeing up the more expensive x86 CPU cores for production workloads and providing other benefits, such as security isolation.

This article will discuss the advantages of leveraging data processing units (DPUs) with the latest vSphere 8.0 release and how that will benefit running scientific and research workloads.

What is a DPU?

A DPU is a programmable system-on-a-chip (SoC) with an industry-standard CPU (ARM-based in most cases), multiple programmable acceleration engines, and high-performance network interfaces.

With the recent introduction of vSphere 8.0, we saw the new vSphere Distributed Services Engine, which takes advantage of these DPUs in the form of a SmartNIC, offering multiple capabilities like vSwitch offloading, secure isolation (secure boot, root of trust, etc.), network overlay offloading, RDMA, and precision timing accelerators for telco/NFV workloads. In addition, they are also capable of running operating systems or general-purpose applications/workloads.

Figure 1. DPU Diagram

Most current-generation DPUs come fully equipped with onboard local persistent storage, a large amount of DDR RAM, multi-level caches, PCIe root complex, virtualized device functions (VF – SR-IOV), and IO capabilities. Because of this, DPUs could be used for many more use cases beyond network and security acceleration.

vSphere Distributed Services Engine

vSphere Distributed Services Engine introduces multiple changes from an ESXi perspective. These include an instance of ESXi running directly on the DPU, enabling numerous use cases like direct offload of network acceleration to the DPU. Communication between the “main” ESXi (installed on x86) and the second instance of ESXi running on the DPU happens over a private IPv4 channel. This approach frees up resources from the x86 CPUs and provides a clear demarcation line between production workloads (running on VMs and containers) and infrastructure. It serves as an isolation layer, reducing the possibility of infrastructure services being affected by a security breach in the running workloads.

Figure 2. Security demarcation between DPU and x86

The first phase of vSphere Distributed Services Engine introduces network processing offload with EDP (enhanced data path), meaning the data path itself will be optimized. This also opens the possibility of accelerating NSX distributed routing services, DFW, and security services like IDS/IPS by running those directly on the DPU (requires NSX enterprise licensing). EDP is not new; this network stack mode is already used for NFV and other types of workloads that benefit from lower latency and higher throughput, but on a conventional host it requires CPU resources from the x86 server (effectively reserving cores for this datapath mode). By leveraging a DPU, we can instead use the ARM processing units for this purpose.
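To make the core-reservation point concrete, here is a minimal Python sketch of the arithmetic. The function name, its parameters, and the example numbers are illustrative assumptions, not measurements or VMware sizing guidance. It also simplifies by treating the DPU case as reserving no x86 cores at all, which ignores any residual x86 work a given mode may still need.

```python
def cores_for_workloads(total_x86_cores: int,
                        edp_reserved_cores: int,
                        dpu_offload: bool) -> int:
    """Return the x86 cores left for production workloads.

    Hypothetical model: without a DPU, EDP keeps `edp_reserved_cores`
    x86 cores busy; with a DPU, that work is assumed to move entirely
    to the DPU's ARM cores.
    """
    if not 0 <= edp_reserved_cores <= total_x86_cores:
        raise ValueError("reserved cores must be between 0 and the total")
    reserved = 0 if dpu_offload else edp_reserved_cores
    return total_x86_cores - reserved


# Illustrative numbers only: a 64-core host with 8 cores reserved for EDP.
print(cores_for_workloads(64, 8, dpu_offload=False))  # 56
print(cores_for_workloads(64, 8, dpu_offload=True))   # 64
```

In this toy example, offloading returns the eight reserved cores to the workload pool; the real gain depends on the host, the traffic profile, and the datapath mode in use.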

EDP can be used in conjunction with UPTv2 to provide the best of both worlds: an accelerated datapath with hypervisor bypass, without sacrificing workload manageability features like DRS, HA, and vMotion, which cannot be used with existing passthrough technologies like SR-IOV. However, UPTv2 requirements are strict: it needs a full VM memory reservation and a specific VMXNET3 driver version.

UPTv2 is not the only mode that works with EDP. We can also leverage MUX mode, which is not a full hypervisor passthrough like UPTv2 but does not impose the same rigorous requirements. MUX is the default mode, but it still uses some of the x86 processing power for TX/RX packet tagging, so to get the best performance, we should look at UPTv2.
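The choice between the two modes can be sketched as a simple prerequisite check. The Python below is a minimal illustration, not a VMware API: the `VmNetworkConfig` class, the `choose_edp_mode` function, and the driver version tuples are all hypothetical. The minimum VMXNET3 driver version is deliberately left as a caller-supplied placeholder, since the real value has to come from VMware's compatibility documentation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class VmNetworkConfig:
    """Hypothetical summary of the VM properties relevant to mode selection."""
    memory_mb: int                            # configured VM memory
    memory_reservation_mb: int                # reserved VM memory
    vmxnet3_driver_version: Tuple[int, ...]   # guest driver version


def choose_edp_mode(vm: VmNetworkConfig,
                    min_upt_driver: Tuple[int, ...]) -> str:
    """Return "UPTv2" when both prerequisites hold, otherwise "MUX".

    `min_upt_driver` is a placeholder; this sketch does not know the
    actual minimum driver version.
    """
    fully_reserved = vm.memory_reservation_mb >= vm.memory_mb
    driver_ok = vm.vmxnet3_driver_version >= min_upt_driver
    return "UPTv2" if fully_reserved and driver_ok else "MUX"


# Example with made-up values: full reservation and a new-enough driver.
vm = VmNetworkConfig(memory_mb=16384,
                     memory_reservation_mb=16384,
                     vmxnet3_driver_version=(2, 0, 0))
print(choose_edp_mode(vm, min_upt_driver=(1, 0, 0)))  # UPTv2
```

Falling back to MUX rather than failing mirrors the text above: MUX is the default and has looser requirements, so a VM that misses either UPTv2 prerequisite still gets the EDP datapath.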

How high performance and scientific computing can benefit from Distributed Services Engine

As already mentioned, a CPU-centric architecture has multiple downsides, augmented by industry macro trends such as:

  • The data that we generate is growing at a highly accelerated pace.
  • Moore’s ‘law’ is no longer holding, and we are experiencing a slowdown in the improvement of storage, networking, and compute performance.
  • Intra-data center (east-west) traffic is growing exponentially.

These trends make it clear that there is a growing need to shift away from CPU-centric architectures towards a disaggregated architecture approach for components like networking, storage, and GPUs. Allowing businesses to bring resources closer to workloads on an on-demand basis will result in improved resource planning for future scalability and separate lifecycle management of infrastructure services and workload domains.

Disaggregated architectures can benefit from DPUs in several ways:

  • Offloading workloads from x86 processors – Infrastructure services such as software-defined storage (vSAN), network I/O processing, and security services (NSX and third parties) no longer need to “compete” with production workloads since these services are executed on the DPU.
  • Common security posture across multiple platforms – DPUs will enable VMware to independently run NSX’s security services like DFW on a common platform. Whether we are talking about VMs, containers, or bare metal environments, any traffic going in and out of the host is inspected at the DPU level, making it simpler to design, implement, and adopt a common security posture across different platforms and clouds.
  • Improved network performance – By leveraging EDP (either MUX or UPTv2 mode) and having dedicated resources from the DPU for capabilities including packet processing, overlays, security inspections, and more, we will be able to keep up with the growing amount of traffic (and its associated services) already taking place in the data center.
  • Storage acceleration – DPUs will enable a broad set of storage use cases like NVMe-oF initiator offload (DPU taking the initiator role), PSA offloading, etc. It is important to note that the first phase of vSphere Distributed Services Engine does not yet cover any of these use cases; new features are expected to be integrated over time.

These benefits will clearly impact the performance of virtualized HPC, ML, AI, NFV, and other platforms like HFT, where latency and performance are critical for business operations. Stay tuned for future performance test studies that we will be publishing.

How to leverage EDP with DPU-enabled hosts

The first step is to perform a unified ESXi install, which installs ESXi on your x86 server and on the available DPUs.

Once your ESXi servers are ready, we need to create a Distributed Switch version 8.0 and select the right network offload compatibility (NVIDIA BlueField or AMD Pensando).

Next, we need to prepare the DPUs as part of the NSX-T Fabric, which means that NSX-T Manager will install NSX’s components into the ESXi running within the DPU. This can be accomplished with vLCM if needed. As part of this process, we will use a transport node profile that has the required datapath mode defined.

Once the fabric is ready, we can start getting the benefits of DPU offloading. If UPTv2 is required, VM settings must be modified by selecting “Use UPT Support” under the vNIC.
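The sequence above can be summarized as an ordered checklist. The Python sketch below is only a way to reason about the order of operations. It does not call any vSphere or NSX API, and the state keys and the `next_step` helper are hypothetical names introduced for illustration.

```python
from typing import Dict, Optional

# Ordered host-level steps from this section; the keys are hypothetical labels.
STEPS = [
    ("unified_esxi_install",
     "Perform a unified ESXi install on the x86 server and its DPUs"),
    ("vds_8_with_offload",
     "Create a Distributed Switch version 8.0 with network offload compatibility"),
    ("nsx_dpu_prepared",
     "Prepare the DPUs in NSX with a transport node profile that defines the datapath mode"),
]


def next_step(host_state: Dict[str, bool]) -> Optional[str]:
    """Return the first step not yet completed, or None when all are done."""
    for key, description in STEPS:
        if not host_state.get(key, False):
            return description
    return None


# Example: ESXi is installed, but the Distributed Switch has not been created.
print(next_step({"unified_esxi_install": True}))
```

Enabling “Use UPT Support” is left out of the list on purpose because it is a per-VM setting that only applies when UPTv2 is required.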


With this first phase of vSphere Distributed Services Engine, we start moving towards a disaggregated architecture model. Resources like GPU are pooled and consumed on demand, while networking and security services are abstracted from the CPU-centric architecture and executed on the DPUs. This ultimately allows your data center to scale and adjust to the ever-growing industry demands and operating models.

Future versions of Distributed Services Engine will bring exciting new features to vSphere, like running DPUs as NSX transport nodes, leaving the x86 compute resources intact for bare metal workloads like HPC. This will enable researchers to consume different types of compute platforms (VMs, containers, or bare metal servers) based on their specific requirements, with the same management framework, security posture, and network services.

