VMware vSphere Large Memory Support and AMD EPYC™ Series Processors – Part 1

Customers need more and more memory as their workload demands increase, so both enterprise hardware and software must keep pace. Below, we detail how VMware, in collaboration with Micron, AMD, and HPE, configured and qualified a 4TiB server to help meet these expanding memory requirements.

vSphere Support for AMD EPYC™ 7001 Series Processors and Large Memory Hosts

In June of 2017, AMD announced the launch of the AMD EPYC™ 7001 series high-performance datacenter processors side-by-side with VMware and other industry partners. VMware first announced support for the AMD EPYC™ 7001 series in vSphere 6.5 Update 1, which supports up to 12TiB of DRAM per host. VMware later released vSphere 6.7, which also supports the AMD EPYC™ 7001 series processors and increased the maximum memory support to 16TiB of DRAM per host.

You can learn more about vSphere configuration maximums, including DRAM per host, with our online tool found here:

https://configmax.vmware.com/

In our initial support of the AMD EPYC™ 7001 series processors, we enabled a variety of server vendors and platforms with a mix of DRAM configurations. In our internal validation efforts, we generally use the standard guideline that a system must have a minimum of 2GiB of DRAM per thread. For a 2-socket AMD EPYC™ 7501/7551P/7601 server, that means 128 threads (2 sockets x 64 threads/socket) and 256GiB of DRAM or more. This allows us to create a variety of virtual machine (VM)-based workloads to test new platforms with high confidence that they, in combination with vSphere, are enterprise ready.
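The 2GiB-per-thread guideline reduces to simple arithmetic. The following is a minimal sketch in Python; the function name and its parameters are illustrative only and are not part of any VMware tool.

```python
def min_validation_dram_gib(sockets: int, threads_per_socket: int,
                            gib_per_thread: int = 2) -> int:
    """Minimum DRAM (in GiB) for a host under the per-thread guideline."""
    total_threads = sockets * threads_per_socket
    return total_threads * gib_per_thread


# 2-socket server with 64 threads per socket, as in the example above.
print(min_validation_dram_gib(sockets=2, threads_per_socket=64))  # 256
```

The same call with a single socket gives 128GiB, which shows how the minimum scales with thread count rather than with a fixed per-host figure.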

While VMware only used memory configurations of approximately 256GiB during the enablement of the AMD EPYC™ 7001 series processor-based platforms, our support is not limited to that capacity. The VMware vSphere Server Certification program allows our partners to expand the amount of DRAM supported in their platform up to each host’s physical capabilities. It also aligns the amount of DRAM with each vSphere release’s maximum limits (up to 12TiB or 16TiB as noted above). Server vendors are given clear guidelines within the Server Certification program for when tests need to be executed to increase their DRAM support. In the vSphere 6.5 Update 1 and 6.7 releases, which support AMD EPYC™ 7001 series processor-based servers, support for more than 2TiB of DRAM requires additional prescribed certification testing by our partners.

vSphere Increases Default Host Memory Support

Earlier this year, VMware reevaluated its certification policies with the goal of raising the baseline support from 2TiB to 4TiB. We recognized a clear pattern of more servers emerging with support for higher DRAM capacities, as the following data from our VMware Compatibility Guide indicates. In the following chart, the vSphere versions are sorted in chronological order of their release dates, from the oldest (on the left) to the newest (on the right). Over the past four years, the share of servers on the VMware Compatibility Guide with at least 4TiB of DRAM certified by our partners grew from 6% to 15%.

vSphere Certified Servers with 4TiB+

To address this growing trend, VMware opted to update our Server Program’s Large Memory and Large DRAM certification policy for newer processors, including the AMD EPYC™ 7001 series processor. The amended policy raised the DRAM capacity supported without additional certification to 4TiB and applies to the following VMware vSphere releases (and all subsequent releases):

  • ESXi 6.7 Update 1
  • ESXi 6.5 Update 2
  • ESXi 6.0 Update 3 + latest patch

Collaboration with Micron, AMD, HPE, and VMware Labs

In support of this policy change, VMware reached out to its industry partners to fully validate a 4TiB configuration of an AMD EPYC™ 7001 series processor platform. With up to 32 multithreaded cores and 8 memory channels supporting up to 2TiB of DRAM per socket, we wanted to prove that AMD EPYC™ 7001 series processor-powered single- and dual-socket servers running VMware vSphere can meet the needs of memory-demanding applications with more system resources and greater flexibility.

VMware, in collaboration with Micron, AMD, and HPE, built a 4TiB configuration using an HPE DL385 Gen10 dual-socket server with AMD EPYC™ 7451 processors and Micron’s 128GB LRDIMM technology. This system was placed in VMware’s QE (Quality Engineering) lab and used to run a battery of system- and memory-focused tests to ensure that the combined solution supported the change in policy.
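The 4TiB total follows from how the DIMMs populate the memory channels. The sketch below works through that arithmetic in Python under two assumptions that the text above does not state: that each memory channel holds two DIMMs, and that a "128GB" DIMM provides 128GiB of capacity. The function name is hypothetical.

```python
def host_dram_tib(sockets: int, channels_per_socket: int,
                  dimms_per_channel: int, dimm_gib: int) -> float:
    """Total host DRAM in TiB for a fully populated server."""
    dimm_count = sockets * channels_per_socket * dimms_per_channel
    return dimm_count * dimm_gib / 1024  # 1 TiB = 1024 GiB


# Dual socket, 8 channels per socket, 128GiB DIMMs.
# dimms_per_channel=2 is an assumption, not a figure from the text.
print(host_dram_tib(sockets=2, channels_per_socket=8,
                    dimms_per_channel=2, dimm_gib=128))  # 4.0
```

Under the same assumptions, a single socket yields 2TiB, which matches the per-socket maximum quoted earlier.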

VMware’s QE teams focused on two key areas. First, we checked whether this configuration could pass the rigorous stress testing by our Hardware Enablement QE, which covers a broad range of functional areas. Second, we evaluated whether this configuration could pass the broader VM-operation at-scale stress testing from our System Test QE.

Test Coverage

Hardware Enablement QE

The checks selected for the SUT (system under test) ranged from basic bring-up and boot tests all the way through to full-blown host-wide stress tests. The suite consisted of almost 100 test cases across vSphere 6.5 Update 2 and 6.7 Update 1, both of which support AMD EPYC™ 7001 series processors.

While we cannot detail every test case here, one example of the deeper testing is our Multi-GOS Combination Stress test suite. In this testing, we execute CPU, memory, storage I/O, and network I/O workloads simultaneously, with multiple VMs running different Guest Operating Systems (GOS), to stress host resources for multiple days. The workloads continuously generate:

  • High CPU utilization
  • High memory usage/churn
  • High local storage load
  • Wire-speed inbound network traffic

System Test QE

Our other set of tests focused on use cases that consume and exercise the 4TiB of memory through two different VM-centric approaches.

The first test case focuses on what we refer to as a Monster VM. In this scenario, the Monster VM’s virtual hardware configuration is as close as possible to the size of the physical host without over-committing it. The Monster VM is deployed running CPU and memory stress workloads. Next, the VM is put through a series of VM operations (Reset, Suspend-Resume), Host operations (Power-on after PSOD/Reboot), and Cluster operations (over-commit 2X CPU and 1.2X memory in cluster). This cycle is repeated for multiple days.

Similarly, the second test case consumes the physical host’s resources, but instead of a single Monster VM, it uses 100 smaller VMs, each with 2 vCPUs and 40GB of vRAM. This stresses the memory and the system in a different way: over multiple days, the VM operations, Host operations, and Cluster operations generate stress from the volume of VMs rather than from the size of a single VM.
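To see how closely the 100-VM layout fills the host, the sizing can be worked out as below. This is a rough sketch only: it assumes "40GB" means 40GiB, it ignores any per-VM and hypervisor memory overhead, and the function name is hypothetical.

```python
def vm_fleet_memory(vm_count: int, vram_gib: int,
                    host_gib: int) -> tuple[int, int, float]:
    """Return (total vRAM, remaining host memory, fraction of host used)."""
    total = vm_count * vram_gib
    return total, host_gib - total, total / host_gib


# 100 VMs with 40GiB each on a 4TiB (4096GiB) host.
total, headroom, fraction = vm_fleet_memory(100, 40, 4096)
print(total, headroom, f"{fraction:.1%}")  # 4000 96 97.7%
```

Under these assumptions, the configured vRAM stays just below the physical memory, consistent with consuming the host without over-committing it.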

Conclusions

The combined solution (host, increased DRAM, and both the 6.5 Update 2 and 6.7 Update 1 vSphere releases) successfully completed all test cases and validated the change in DRAM policy. This demonstrates that AMD EPYC™ 7001 series processor platforms from partners like HPE, with up to 4TiB of memory from vendors like Micron, are vSphere-validated technologies ready for demanding enterprise applications in a virtualized environment. VMware hopes to continue collaborating with our valued partners, and we look forward to updating readers on the validation and performance analysis done on next-generation CPUs, memory, and server platforms.

Acknowledgements

Many thanks to our partners at Micron, AMD, and HPE, as well as the engineers and QE teams at VMware who enable and validate this technology.