Today, many industries take advantage of high-performance computing (HPC) and machine learning (ML) to accelerate innovation. They apply high-definition simulation and modeling to enable drug discovery, autonomous driving, financial risk analysis, anomaly detection, and more. With massive computing power, high-speed networking, and high-performance storage, HPC systems are valuable assets that teams and departments within an organization often compete for. That is why multi-tenancy, the ability to support multiple users or user groups simultaneously, is a much-desired feature on HPC systems. Our HPC/ML team within VMware’s Office of the Chief Technology Officer (OCTO) has architected a virtualization-based solution to bring true, secure multi-tenancy to HPC and ML.
In the reference architecture above, we achieve multi-tenancy by constructing multiple virtual clusters on a shared physical cluster. Each virtual cluster consists of a set of virtual machines that span the whole physical cluster. Because each tenant gets an exclusive virtual cluster, its workload is isolated from other tenants' workloads running on the same set of physical nodes. Each tenant has the illusion of owning the cluster exclusively, while you configure the underlying ESXi hypervisor either to share the physical resources fairly among tenants or to enforce differentiated classes of service.
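To make the difference between fair sharing and a differentiated class of service concrete, here is a minimal Python sketch of proportional-share allocation under contention. It is an illustration only, not ESXi's actual scheduler: the function name allocate_by_shares, the tenant names, the share values, and the 100,000 MHz capacity are all assumptions made up for this example.

```python
def allocate_by_shares(capacity_mhz, shares):
    """Split a contended resource among tenants in proportion to their shares.

    Hypothetical helper for illustration; it does not call any VMware API.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one tenant must hold a positive share")
    return {tenant: capacity_mhz * s / total for tenant, s in shares.items()}


# Fair share: every tenant holds the same number of shares.
fair = allocate_by_shares(100_000, {"tenant_a": 1000, "tenant_b": 1000})

# Differentiated class of service: tenant_a holds twice the shares of the others.
tiered = allocate_by_shares(
    100_000, {"tenant_a": 2000, "tenant_b": 1000, "tenant_c": 1000}
)

print(fair)
print(tiered)
```

With equal shares each tenant is entitled to the same slice of the contended capacity; giving one tenant more shares raises its entitlement relative to the others without changing what any tenant sees inside its own virtual cluster.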
Another important component of the reference architecture is network security. In clinical research, chip design, and other sensitive research areas, full isolation of project files and data is mandatory at every stage of development, whether the data resides on a compute node or travels over the network. We leverage VMware’s network virtualization and security platform (NSX-T Data Center) to achieve security isolation on the network path. By providing a distributed, stateful firewall at per-workload granularity together with network-overlay-based isolation, NSX-T delivers micro-segmentation that is ideal for a multi-tenant environment.
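The idea behind per-workload micro-segmentation can be sketched in a few lines of Python. This is a conceptual model only and does not use or imitate the NSX-T API: the Workload class, the is_allowed function, and the workload and tenant names are hypothetical, and the policy shown (allow traffic within a tenant, deny everything else) is an assumed example of a default-deny rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """A virtual machine tagged with the tenant that owns it (hypothetical model)."""
    name: str
    tenant: str


def is_allowed(src, dst):
    """Default-deny policy: permit traffic only between workloads of the same tenant."""
    return src.tenant == dst.tenant


# Two tenants whose VMs may share the same physical hosts.
a1 = Workload("a-compute-1", "tenant_a")
a2 = Workload("a-compute-2", "tenant_a")
b1 = Workload("b-compute-1", "tenant_b")

print(is_allowed(a1, a2))  # same tenant
print(is_allowed(a1, b1))  # different tenants
```

Because the decision is made per workload rather than per physical host or subnet, two tenants' virtual machines can sit on the same node and still be unable to reach each other over the network.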
Performance and best practices
When it comes to HPC and ML, performance is always top of mind in our design. To ensure that performance is not compromised, we apply best practices drawn from our many years of experience and conduct a rigorous performance evaluation. Experimental results demonstrate that the overhead incurred by virtualization and NSX-T security is often negligible, enabling HPC/ML providers to support multiple tenants on consolidated infrastructure without compromising security or performance.
Read more details in our white paper, “Secure Networking for Multi-Tenant High-Performance Computing and Machine Learning.”