Welcome to Part II of my overview of virtualization for HPC. In Part I, I introduced myself, defined my terms and then began to describe the primary use-cases for virtualization in HPC. In this final section, I cover the remaining use-cases and then turn to a discussion of application performance.

We continue here with the remaining primary use-cases.

Checkpoint / Restart

Checkpoint / Restart – the ability to save and restore the state of a running job to disk – has been a much-sought-after capability in HPC for decades. Many attempts have been made with varying degrees of success. As a lack of application resiliency has come to be recognized as one of the largest barriers to increased application scaling on future systems, it has become more critical to find effective ways to safeguard application state in the presence of failing hardware and software. Virtualization offers the potential of a better way to checkpoint, based on the snapshot functionality already available for virtual machines. By working in conjunction with an MPI implementation, it is possible to drain in-flight messages and then write a checkpoint as a set of coordinated virtual machine snapshots. A very basic version of this capability has been prototyped using Open MPI running over TCP.
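
To make the idea concrete, here is a minimal sketch of what a coordinated checkpoint might look like from inside the job, written in Python with mpi4py. The request_vm_snapshot call is purely hypothetical (it stands in for whatever out-of-band interface the hypervisor would expose), and a simple barrier stands in for a real quiescing protocol.

```python
from mpi4py import MPI

def request_vm_snapshot(rank, tag):
    """Hypothetical hook: ask the hypervisor to snapshot this rank's VM.

    In a real system this would be an out-of-band call into the
    virtualization layer; here it is only a placeholder."""
    print(f"rank {rank}: snapshot requested ({tag})")

def coordinated_checkpoint(comm, tag):
    rank = comm.Get_rank()
    # 1. Quiesce: every rank reaches a known point with no outstanding
    #    sends or receives (a barrier is the simplest, if coarse, way to
    #    drain in-flight traffic for this sketch's communication pattern).
    comm.Barrier()
    # 2. Snapshot each rank's VM while the network is quiet.
    request_vm_snapshot(rank, tag)
    # 3. Resume only after all snapshots are taken, so no rank sends
    #    messages to a VM that is still being snapshotted.
    comm.Barrier()

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # ... application work ...
    coordinated_checkpoint(comm, tag="step-0100")
    # ... more application work ...
```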

Dynamic Workload Migration

Some of the most exciting potential uses of virtualization for HPC revolve around the creative use of dynamic workload migration – VMotion. Leveraging this one capability adds significant flexibility to virtualized HPC environments.

First, migration can be used for power management by consolidating running workload onto a subset of nodes when utilization drops, allowing other nodes to be powered off or placed in a low-power state.
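
As a rough illustration, the sketch below consolidates guests from lightly loaded hosts onto busier ones and powers down whatever empties out. The Host and VM objects, migrate_vm and power_down are hypothetical stand-ins for a real management interface; only the shape of the policy matters here.

```python
from dataclasses import dataclass, field

@dataclass
class VM:
    name: str
    demand: float                    # fraction of a host this guest needs

@dataclass
class Host:
    name: str
    capacity: float = 1.0
    vms: list = field(default_factory=list)

    @property
    def utilization(self):
        return sum(vm.demand for vm in self.vms) / self.capacity

    @property
    def free_capacity(self):
        return self.capacity - sum(vm.demand for vm in self.vms)

def migrate_vm(vm, source, target):  # placeholder for a live-migration call
    source.vms.remove(vm)
    target.vms.append(vm)
    print(f"migrate {vm.name}: {source.name} -> {target.name}")

def power_down(host):                # placeholder for a power-management call
    print(f"power down {host.name}")

UTILIZATION_THRESHOLD = 0.30         # consolidate hosts that are below 30% busy

def consolidate_for_power(hosts):
    for host in sorted(hosts, key=lambda h: h.utilization):
        if host.utilization >= UTILIZATION_THRESHOLD or not host.vms:
            continue
        for vm in list(host.vms):
            # Prefer the busiest host that still has room, to empty this one.
            targets = [h for h in hosts
                       if h is not host and h.free_capacity >= vm.demand]
            if not targets:
                break
            migrate_vm(vm, host, max(targets, key=lambda h: h.utilization))
        if not host.vms:             # host is now empty and can be turned off
            power_down(host)

if __name__ == "__main__":
    hosts = [Host("node1", vms=[VM("job-a", 0.6)]),
             Host("node2", vms=[VM("job-b", 0.2)]),
             Host("node3", vms=[VM("job-c", 0.1)])]
    consolidate_for_power(hosts)
```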

Second, migration can be used to rearrange running workload on a cluster to make room for high-priority jobs whose resource requirements cannot be met with the current workload placement. This is a significant advance beyond what is possible with current-generation HPC distributed resource managers, which place jobs within OS instances and have no subsequent ability to revisit or revise those placement decisions. In a virtualized environment, workload can dynamically shift across the cluster as resource requirements change.
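
A minimal sketch of that "make room" step might look like the following; again, the Host and VM objects and the migration call are hypothetical placeholders rather than any real scheduler API.

```python
from dataclasses import dataclass, field

@dataclass
class VM:
    name: str
    cores: int
    priority: int                    # lower number = lower priority

@dataclass
class Host:
    name: str
    cores: int
    vms: list = field(default_factory=list)

    @property
    def free_cores(self):
        return self.cores - sum(vm.cores for vm in self.vms)

def migrate_vm(vm, source, target):  # placeholder for a live-migration (VMotion) call
    source.vms.remove(vm)
    target.vms.append(vm)
    print(f"migrate {vm.name}: {source.name} -> {target.name}")

def make_room(hosts, needed_cores):
    """Try to free needed_cores on one host by migrating lower-priority guests."""
    for host in hosts:
        if host.free_cores >= needed_cores:
            return host                          # already fits, nothing to move
    for host in hosts:
        for vm in sorted(host.vms, key=lambda v: v.priority):
            if host.free_cores >= needed_cores:
                break
            target = next((h for h in hosts
                           if h is not host and h.free_cores >= vm.cores), None)
            if target is not None:
                migrate_vm(vm, host, target)
        if host.free_cores >= needed_cores:
            return host
    return None                                  # could not free enough room

if __name__ == "__main__":
    hosts = [Host("node1", 8, [VM("batch-a", 4, 1)]),
             Host("node2", 8, [VM("batch-b", 2, 1)])]
    target = make_room(hosts, needed_cores=8)
    print("high-priority job goes on:", target.name if target else "nowhere")
```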

Third, while checkpoint/restart is an important capability, it is a very expensive operation that requires the state of all virtual machines and their applications be written to disk periodically. In addition, for highly-scaled systems the data must be written quickly to avoid experiencing hardware failures before the checkpoint has been safely written to disk. As systems and their memories become larger and as failures become more frequent due to increased component counts, checkpointing becomes more problematic.

Virtualization in conjunction with fault management agents running on the cluster can mitigate this problem by creating a proactive approach to application resiliency. In such a system, a virtual machine could be migrated from a node that has been predicted to fail to a healthy node without having to take a checkpoint of the full application. While this capability would be quite challenging to implement, it could offer some significant advantages in the future as systems and applications continue to scale. It should be noted that researchers have successfully migrated individual MPI processes of a running MPI job using this technique, again with MPI running over TCP. Handling the more general case will be much more challenging.
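
Conceptually, the fault-management agent amounts to a sweep like the sketch below. The health signals, the toy risk model and the live_migrate call are all assumptions made for illustration, not a real VMware or Xen interface; real predictors and platform hooks would be far more involved.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    vms: list = field(default_factory=list)

FAILURE_RISK_THRESHOLD = 0.8

def failure_risk(signals):
    """Toy predictor: fold a few health signals into a 0..1 risk score."""
    risk = 0.0
    if signals.get("correctable_ecc_errors", 0) > 100:
        risk += 0.5
    if signals.get("cpu_temp_c", 0) > 90:
        risk += 0.3
    if signals.get("fan_failures", 0) > 0:
        risk += 0.3
    return min(risk, 1.0)

def live_migrate(vm, source, target):   # placeholder for the real migration call
    source.vms.remove(vm)
    target.vms.append(vm)
    print(f"migrating {vm} off {source.name} to {target.name}")

def proactive_sweep(hosts, read_health_signals):
    for host in hosts:
        if failure_risk(read_health_signals(host)) < FAILURE_RISK_THRESHOLD:
            continue
        # Candidate targets are the hosts that are not themselves at risk.
        healthy = [h for h in hosts
                   if h is not host
                   and failure_risk(read_health_signals(h)) < FAILURE_RISK_THRESHOLD]
        if not healthy:
            continue                    # nowhere to go; fall back to checkpointing
        target = min(healthy, key=lambda h: len(h.vms))
        for vm in list(host.vms):
            # Move guests off the suspect node before it actually fails,
            # avoiding a full checkpoint of the running application.
            live_migrate(vm, host, target)

if __name__ == "__main__":
    telemetry = {"node1": {"correctable_ecc_errors": 500, "cpu_temp_c": 95},
                 "node2": {}, "node3": {}}
    hosts = [Host("node1", vms=["job-a.vm"]), Host("node2"), Host("node3")]
    proactive_sweep(hosts, lambda h: telemetry[h.name])
```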

Performance

Performance is obviously the key question: the benefits described above need to be assessed in terms of their value to a particular site, and that value needs to be weighed against the performance cost of virtualization for the applications of interest. As we’ve just begun our HPC effort, I do not yet have VMware-generated HPC performance numbers. We have several efforts underway to get this data, among them a proof-of-concept engagement with a university partner that is running a wide variety of traditional HPC benchmarks and applications both natively and virtualized so comparisons can be made. I am also identifying additional benchmarks to run and acquiring the hardware to do so, including a variety of interconnects so we can carefully assess interconnect performance.

Having said that, I’d like to share a few graphs from research papers that looked at various aspects of performance for HPC workloads. These results were generated using Xen. I have used them in the past to illustrate the plausibility of vHPC, but of course they are not a substitute for generating our own numbers, which we will do. These numbers should not be taken at face value – use them, as I did, to conclude that good performance is plausibly achievable in a virtualized environment.

Figure 7: The Impact of Paravirtualized Memory Hierarchy on Linear Algebra Kernels and Software, Youseff et al., HPDC ’08

The first paper looked at both compute and memory performance for linear algebra kernels. Figure 7 shows the floating-point performance of double-precision BLAS routines over a range of scenarios. For our purposes, the specific configurations aren’t important beyond the fact that some are native and some are virtual. The fact that each fat histogram bar is essentially flat on top indicates that no significant performance difference was seen when running these kernels natively or virtualized. The paper presents a much more in-depth analysis of both memory and CPU performance and essentially concludes that no significant performance differences were found.

While this is certainly promising and a useful result when considering single-threaded application performance, it will still be important for us to measure the performance of highly threaded applications within a single virtual machine to assess the scalability of our infrastructure. Nonetheless, the results presented in this paper are encouraging.
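
For readers who want a first-order feel for this kind of comparison on their own systems, a tiny DGEMM timing script such as the one below, run once natively and once inside a VM, reproduces the flavor of the Figure 7 measurement. It assumes NumPy is linked against an optimized BLAS; the matrix size and repeat count are arbitrary choices, not anything taken from the paper.

```python
import time
import numpy as np

def dgemm_gflops(n=2048, repeats=5):
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    np.dot(a, b)                      # warm-up (page-in, BLAS thread spin-up)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.dot(a, b)
        best = min(best, time.perf_counter() - start)
    flops = 2.0 * n ** 3              # multiplies plus adds in an n x n DGEMM
    return flops / best / 1e9

if __name__ == "__main__":
    print(f"DGEMM: {dgemm_gflops():.1f} GFLOP/s")
```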

Based on these results, as well as performance results published by two other companies (I need to be vague until I have permission to share their results) that examined several single-process, parallel HPC applications and found generally small slowdowns, I believe virtualization can be deployed today in certain carefully chosen HPC vertical markets where large throughput workloads are processed and applications do not require MPI. Issues related to MPI are addressed below.

Figure 8: High Performance VMM-Bypass I/O in Virtual Machines, Liu et al., USENIX ’06

A second paper examined the use of MPI and InfiniBand in a virtualized environment. More specifically, the researchers prototyped the equivalent of an OS-bypass mechanism for Xen that allows the guest operating system to gain direct access to the underlying InfiniBand hardware and achieve the maximum possible performance. The left-hand graph in Figure 8 demonstrates that native and virtual latencies in this test framework were the same over a range of small message sizes. It is important to note that this result was generated with MVAPICH in polling mode. As can be seen in the right-hand netperf test, performance can be lower in the virtualized case if interrupts are used to handle transfer completions. This is essentially because interrupts currently cannot be passed directly to the guest OS. For correctness, they must pass through the virtualization infrastructure, which can add latency and decrease performance at small message sizes. Again, we need to measure all of this ourselves with our own products to assess where we stand and where improvements are needed.
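
A simple way to generate comparable small-message numbers on your own hardware is an MPI ping-pong microbenchmark in the spirit of the test behind Figure 8. The sketch below uses mpi4py; whether completions are handled by polling or interrupts is a property of the MPI library and interconnect configuration, not of the script itself.

```python
import numpy as np
from mpi4py import MPI

def pingpong(comm, nbytes, iters=1000):
    rank = comm.Get_rank()
    buf = np.zeros(nbytes, dtype="u1")
    comm.Barrier()
    start = MPI.Wtime()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1, tag=0)
            comm.Recv(buf, source=1, tag=0)
        else:
            comm.Recv(buf, source=0, tag=0)
            comm.Send(buf, dest=0, tag=0)
    elapsed = MPI.Wtime() - start
    return elapsed / iters / 2 * 1e6      # one-way latency in microseconds

if __name__ == "__main__":
    # Run with exactly two ranks, e.g.: mpirun -np 2 python pingpong.py
    comm = MPI.COMM_WORLD
    assert comm.Get_size() == 2, "run with exactly two ranks"
    for size in (1, 64, 1024, 4096):
        latency = pingpong(comm, size)
        if comm.Get_rank() == 0:
            print(f"{size:5d} bytes: {latency:8.2f} us one-way")
```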

It is also important to note that the bypass approach described above punches right through the virtual machine abstraction and lets a guest operating system see a real piece of hardware in the system. Doing so either degrades or destroys the ability to live-migrate workloads from machine to machine, which in turn makes it impossible to deliver several of the valuable HPC capabilities outlined earlier in this piece. Whether we can mitigate that impact, and do so in a way that is acceptable to VMware engineering and business owners, remains to be seen. More study is needed in this area specifically.

The above observations regarding interrupts are also valid in a broader context that includes storage and networking. When an application spends a significant fraction of its runtime transferring primarily fine-grained messages, interrupt handling overhead can likewise degrade application performance.

Summary

In this introductory piece, I’ve given you the flavor of the argument for how virtualization might be used today for some parts of HPC and also touched on some of the performance issues, both promising and concerning, related to virtualizing HPC. In subsequent posts I’ll share more details on performance, on discoveries we make as we continue to experiment, and on progress we are making on delivering the value of virtualization for HPC. We are at the beginning of a journey to a very interesting future for HPC – please join me!