Challenges for IT supporting research computing environments
What is most important to an IT department supporting university research? Based on conversations with many of our customers, we have repeatedly heard the following:
- Satisfying the varied and sometimes conflicting requirements of different research groups, which may include compliance issues for bioinformatics research
- Demonstrating the value of centralized IT resources compared with buying isolated equipment or outsourcing to a public infrastructure provider
- Meeting these goals on a limited budget
Each independent research team typically requires its own environment, with its own set of simulation and analytics tools, often including different operating systems and other middleware components. For example, genomic research uses different environments and tools than reservoir simulation does. Satisfying these disparate requirements is often not possible in a traditional, un-virtualized HPC environment because of the uniform software environment imposed across such facilities. Software tools that conflict may require dedicated compute nodes or may not be supportable at all in traditional, centralized HPC clusters. Storage requirements may also conflict. For example, funding may come with strict data segregation and security requirements that cannot easily be satisfied without buying isolated equipment for every project.
Using multiple, dedicated HPC environments provides maximum flexibility, but comes with higher costs and lower overall resource utilization. On the other hand, a single centralized, homogeneous environment sacrifices flexibility in order to reduce infrastructure costs. Outsourcing to a public cloud provider may offer self-service, but can cost more because the time and money spent cannot be shared across collaborating departments. This state of affairs limits the ability of IT to achieve the above goals.
However, if an IT organization can meet these goals and provide multi-tenant research computing capabilities—including high performance, high availability, and regulatory compliance, together with self-service and dynamic provisioning of resources—then this agility can position university departments to win more competitive HPC funding and attract more researchers, furthering advanced research within the university.
How virtualization helps research computing environments
We believe that virtualization can help university IT organizations achieve the above goals.
Virtualization for research computing enables the best of both worlds because multiple, heterogeneous, virtual private environments can be maintained simultaneously on the same shared, physical infrastructure. These environments can be automatically provisioned and refreshed according to individual teams’ needs without causing software or hardware conflicts.
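To make the provisioning step concrete, here is a minimal sketch that clones one node of a team’s private virtual cluster from a template using pyVmomi, VMware’s open-source Python SDK for the vSphere API. The vCenter address, credentials, and all inventory names below are hypothetical placeholders:

```python
# Minimal sketch: clone one node of a team's private virtual cluster
# from a template using pyVmomi. The vCenter address, credentials, and
# all inventory names below are hypothetical placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab-only: skips certificate checks
si = SmartConnect(host="vcenter.example.edu", user="provisioner",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first inventory object of the given type with this name."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

template = find_by_name(vim.VirtualMachine, "bioinfo-node-template")
folder = find_by_name(vim.Folder, "genomics-team")
pool = find_by_name(vim.ResourcePool, "genomics-pool")

# Clone into the team's folder and resource pool, powered on immediately.
spec = vim.vm.CloneSpec(location=vim.vm.RelocateSpec(pool=pool), powerOn=True)
task = template.Clone(folder=folder, name="genomics-node-01", spec=spec)
```

Repeating the clone call per node, or driving it from an orchestration tool, yields a complete private cluster in minutes rather than the weeks a hardware procurement can take.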
Principal investigators (‘PIs’) and their associated universities can directly benefit from virtualized HPC resources because their research teams can create private, customized clusters that meet all of their requirements without interfering with other researchers’ environments. And instead of waiting to procure and install their own physical infrastructure, which could delay their research, their HPC research environment can be provisioned immediately, to their specifications, using virtualization. In addition, even while sharing large-scale virtualized infrastructure with many other research teams, PIs can continue to compete for funding opportunities that carry government-mandated compliance requirements and conditions, which would be difficult to satisfy on shared, non-virtualized infrastructure.
Augmenting the end-to-end lifecycle of the research computing environment
In addition to the benefits outlined above, vSphere® also simplifies operational management by handling fault isolation and high availability of clustered physical resources. Clusters of servers can be presented as pools of CPU and memory for running individual virtual machines, with a given resource pool accommodating homogeneous or heterogeneous workload requirements. vSphere clusters create an automated fault domain into which servers can be added or removed without downtime for any workloads already running in the cluster. Infrastructure administrators gain the flexibility to perform maintenance and resize the environment without interrupting existing research.
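As an illustration of how such pools can be carved out programmatically, the hedged sketch below creates a per-group resource pool under a cluster with pyVmomi, reusing the connection (si) and find_by_name helper from the previous sketch; the cluster name, reservations, and share levels are hypothetical:

```python
# Minimal sketch: carve a per-group resource pool out of a vSphere
# cluster with pyVmomi, reusing si and find_by_name from the previous
# sketch. Cluster name, reservations, and share levels are hypothetical.
from pyVmomi import vim

cluster = find_by_name(vim.ClusterComputeResource, "research-cluster")

spec = vim.ResourceConfigSpec(
    cpuAllocation=vim.ResourceAllocationInfo(
        reservation=20000,             # MHz guaranteed to this group
        limit=-1,                      # no upper cap
        expandableReservation=True,
        shares=vim.SharesInfo(level=vim.SharesInfo.Level.high)),
    memoryAllocation=vim.ResourceAllocationInfo(
        reservation=131072,            # MB guaranteed to this group
        limit=-1,
        expandableReservation=True,
        shares=vim.SharesInfo(level=vim.SharesInfo.Level.normal)))

# The new pool becomes a child of the cluster's root resource pool.
genomics_pool = cluster.resourcePool.CreateResourcePool(
    name="genomics-pool", spec=spec)
```

Reservations guarantee a floor for each group, while shares govern how contended capacity is divided, which is how fair-share multi-tenancy is expressed at the infrastructure layer.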
Traditional HPC schedulers such as Univa Grid Engine, Torque, and SLURM can be used simultaneously within distinct groups of virtual machines to handle job scheduling. In addition, vSphere High Availability (‘HA’) can handle server failure by restarting affected virtual machines on other servers. Further, vSphere vMotion® can avoid the downtime that would otherwise be required for hardware upgrades or maintenance by non-disruptively moving virtual machines between hosts. vSphere’s built-in scheduler, the Distributed Resource Scheduler (DRS), looks for an optimal distribution of CPU and memory and uses vMotion to balance VMs dynamically across physical hosts. Finally, the VMware hypervisor dynamically adjusts the amount of resources allocated to each virtual machine so that each research group receives its fair share of the underlying physical infrastructure.
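Operations like vMotion can also be driven programmatically. The hedged sketch below triggers a live migration with pyVmomi, again reusing si and find_by_name from the earlier sketches; the VM and host names are hypothetical:

```python
# Minimal sketch: trigger a live vMotion migration with pyVmomi,
# reusing si and find_by_name from the earlier sketches. The VM and
# host names are hypothetical.
from pyVmomi import vim

vm = find_by_name(vim.VirtualMachine, "genomics-node-01")
target = find_by_name(vim.HostSystem, "esxi-07.example.edu")

# For a powered-on VM this performs a live (vMotion) migration;
# pool=None keeps the VM in its current resource pool.
task = vm.Migrate(pool=None, host=target,
                  priority=vim.VirtualMachine.MovePriority.defaultPriority)
```

Evacuating a host this way, one VM at a time, is what lets administrators take hardware down for maintenance without interrupting running research workloads.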
While virtualization takes the operational and compliance management burden away from the PIs, IT still needs to manage those requirements. VMware vRealize™ Operations Manager provides end-to-end operational management across all of the virtual and physical resources, analyzing and correlating performance metrics from servers, virtual machines, storage, and networks to determine resource health. To ensure that a high-performance application continues to perform well, all of its resource dependencies need to be monitored, since an application performance issue may be caused by internal (in-guest) or external (infrastructure) factors. Internally, the cause may be memory swapping, process contention, or poorly written custom code; externally, it may be overcommitted resources, storage latency, or network congestion. Only with a complete picture can IT provide optimal performance. vRealize Hyperic® provides a guest agent that delivers the guest-aware performance monitoring needed to achieve this full view of performance.
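One concrete way to observe an external factor such as hypervisor-level memory swapping is to query the vSphere PerformanceManager directly. The hedged sketch below does this with pyVmomi, reusing si and find_by_name from the earlier sketches; the VM name is hypothetical, and a full monitoring solution such as vRealize Operations Manager correlates many such counters rather than just one:

```python
# Minimal sketch: query one counter that helps separate internal from
# external causes (hypervisor-level memory swapping) via the vSphere
# PerformanceManager. Reuses si and find_by_name from earlier sketches;
# the VM name is hypothetical.
from pyVmomi import vim

perf = si.RetrieveContent().perfManager
vm = find_by_name(vim.VirtualMachine, "genomics-node-01")

# Map counter names such as 'mem.swapped.average' to numeric IDs.
ids = {"%s.%s.%s" % (c.groupInfo.key, c.nameInfo.key, c.rollupType): c.key
       for c in perf.perfCounter}

query = vim.PerformanceManager.QuerySpec(
    entity=vm,
    metricId=[vim.PerformanceManager.MetricId(
        counterId=ids["mem.swapped.average"], instance="")],
    intervalId=20,   # 20-second real-time samples
    maxSample=1)

result = perf.QueryPerf(querySpec=[query])
swapped_kb = result[0].value[0].value[0]
print("Guest memory currently swapped by the hypervisor: %d KB" % swapped_kb)
```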
In addition to immediate performance management and operational considerations, vRealize Operations Manager can analyze “what-if” scenarios for capacity planning. This helps administrators understand where resources may become constrained and how to plan and budget for expansion, and it provides the data needed to justify requests for additional hardware.
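The underlying idea of a what-if scenario can be shown with a deliberately simplified calculation. The toy Python function below, using invented figures, estimates how many more standard research VMs would fit before and after adding hosts; vRealize Operations Manager’s actual models go much further, accounting for HA failover headroom, overcommit policy, and observed demand:

```python
# Toy "what-if" illustration with invented figures: how many more
# 8-vCPU / 64 GB research VMs fit if two hosts are added? Real capacity
# models also account for HA headroom, overcommit, and observed demand.
def remaining_vms(hosts, cores_per_host, ram_gb_per_host,
                  used_cores, used_ram_gb, vm_cores=8, vm_ram_gb=64):
    free_cores = hosts * cores_per_host - used_cores
    free_ram = hosts * ram_gb_per_host - used_ram_gb
    # Whichever resource runs out first limits the VM count.
    return min(free_cores // vm_cores, free_ram // vm_ram_gb)

today = remaining_vms(hosts=16, cores_per_host=32, ram_gb_per_host=512,
                      used_cores=400, used_ram_gb=6800)
what_if = remaining_vms(hosts=18, cores_per_host=32, ram_gb_per_host=512,
                        used_cores=400, used_ram_gb=6800)
print("Headroom today: %d VMs; after adding 2 hosts: %d VMs"
      % (today, what_if))
```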
There may be other constraints, such as compliance requirements, that need to be taken into account for planning. For example, some bioinformatics research requires healthcare compliance. Without virtualization, IT would need to constrain and physically separate each compute environment in order to secure and audit access to servers and data. Having more equipment and policies to manage and enforce can require more dedicated personnel and reduce the value of resource centralization. With virtualization, such a compliant environment can be instantiated on hardware resources that are shared with other, unrelated projects: this sensitive research and data can be restricted in its own secure and audited virtual environment.
vRealize Operations Manager uses vRealize Configuration Manager™ to understand an environment’s compliance posture. vRealize Configuration Manager can check compliance against security and regulatory baselines such as HIPAA, SOX, and PCI, and it also includes the Defense Information Systems Agency (DISA) Security Technical Implementation Guide (STIG) toolkit. DISA STIGs are used by the US military to ensure a high level of information assurance and security across its infrastructure assets. At the infrastructure layer, numerous vSphere versions have undergone rigorous Common Criteria evaluation and validation. vRealize Configuration Manager can also prepare audit and remediation reports for each virtual cluster or across the entire environment.
Perhaps most important, all of these activities—from operational and performance management to capacity planning, security, and compliance—are handled by VMware at a finer granularity than non-virtualized solutions allow. By using the virtual machine as the basic unit of allocation, resources can be sized below the level of a whole host, with custom amounts of CPU, memory, or both for different workloads. Environments can be relatively static, or they can be more dynamic to further optimize the efficiency of the HPC environment. All of the VMware solutions discussed allow for multi-tenancy without sacrificing the granularity at which administrators can apply their policies and PIs can address their research requirements.
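As a final illustration of this per-VM granularity, resizing one group’s virtual machine is a single reconfigure call that leaves everything else on the shared hosts untouched. The hedged sketch below uses pyVmomi with the same hypothetical names and helpers as the earlier sketches:

```python
# Minimal sketch: resize one group's VM in place with pyVmomi, reusing
# si and find_by_name from the earlier sketches. The VM name and the
# new sizing are hypothetical.
from pyVmomi import vim

vm = find_by_name(vim.VirtualMachine, "genomics-node-01")
spec = vim.vm.ConfigSpec(numCPUs=16, memoryMB=131072)  # 16 vCPUs, 128 GB
task = vm.ReconfigVM_Task(spec=spec)
```

If CPU and memory hot-add are enabled for the VM, the change can even be applied while the workload is running.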
For these and other reasons, to be described in more detail in subsequent blog articles, virtualization can have a significant positive impact on the high-performance computing solutions used for many kinds of research. The next blog article in this series will cover more in-depth operational and management considerations for automation and self-service of high-performance research environments.