Data science and analytics are becoming increasingly important to our customers. At the same time, many organizations are pursuing multi-cloud strategies. Together, these two trends create the need to rapidly stand up consistent data-science environments across private cloud, multi-cloud (including hybrid cloud), and edge environments. The Multi-Cloud Analytics Solution (MCAS), created jointly by VMware and Intel, addresses this need with the release of our third-generation reference architecture.
Before discussing MCAS in detail, it’s important to understand machine-learning operations (MLOps) and the crucial role that IT operations (ITOps) plays in making MLOps possible. MCAS sits at the intersection of these two disciplines.
MLOps emerged as a discipline that applies development operations (DevOps) practices to the machine-learning (ML) lifecycle, streamlining it so that organizations can realize the value of machine learning and artificial intelligence.
Figure 1 shows the main elements of the ML lifecycle. ML development is a process in which a model is iteratively developed and validated prior to deployment. Once deployed, the model requires continuous monitoring to ensure it is delivering both good performance and accurate predictions. Eventually, most ML models must be updated to account for patterns emerging in newer data that was not available when the model was developed.
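As a minimal sketch of the monitoring step in this lifecycle, the following Python class tracks the rolling accuracy of a deployed model and flags when it has dropped far enough below its validation baseline to warrant an update. The class name, window size, and tolerance are hypothetical choices for illustration; production monitoring usually tracks more signals than accuracy alone.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy of a deployed model over its most recent predictions."""

    def __init__(self, baseline_accuracy: float, window: int = 100, max_drop: float = 0.05):
        self.baseline_accuracy = baseline_accuracy  # accuracy measured during validation
        self.max_drop = max_drop                    # hypothetical tolerance before retraining
        self.outcomes = deque(maxlen=window)        # 1 = correct prediction, 0 = incorrect

    def record(self, predicted, actual) -> None:
        """Store whether one prediction matched the label that arrived later."""
        self.outcomes.append(1 if predicted == actual else 0)

    def rolling_accuracy(self) -> float:
        """Accuracy over the current window, or the baseline if nothing is recorded yet."""
        if not self.outcomes:
            return self.baseline_accuracy
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self) -> bool:
        """Flag a retrain only once the window is full and accuracy has dropped too far."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.baseline_accuracy - self.rolling_accuracy() > self.max_drop
```

Waiting for a full window before deciding keeps the monitor from reacting to a handful of unlucky predictions right after deployment.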
Getting MLOps right is challenging. According to VentureBeat, up to 87% of ML models are never deployed in production. Lack of collaboration, which is endemic in siloed organizations, is one reason cited for this low success rate. Effective MLOps requires collaboration among multiple areas, such as data science, business operations, application development, and MLOps engineering. Each of these areas has direct or indirect IT infrastructure requirements, as shown in Figure 2.
Core IT requirements for successful MLOps:
- High-performance computing services are needed for data scientists to train and test ML models. In addition, IT infrastructure must be ready to support parallel computing systems such as Spark, Dask, and Ray.
- Containers are needed by DevOps teams to encapsulate models so they can be deployed as components of cloud-native applications.
- Scale-up and scale-down of cloud-infrastructure resources are needed as part of the model-monitoring process, so that adequate performance is delivered at an appropriate cost.
- End-to-end security is a fundamental attribute for any ML platform, especially when models are handling sensitive or protected data.
- Responsiveness and high-accuracy predictions must be continuously delivered to any required location and device to ensure a good end-user experience.
ITOps is therefore at the center of MLOps. It has the potential to deliver infrastructure that crosses functional silos in a timely and cost-effective manner. The key to achieving this is a flexible infrastructure layer that runs consistently across clouds, such as Kubernetes (K8s).
Enterprise distributions of K8s can deliver the infrastructure services necessary to enable every step of the MLOps lifecycle. For example:
- As a container orchestrator, K8s is well suited to running the many ML and analytics components that rely on containers as a practical packaging format for developing, distributing, and deploying ML models.
- On-demand K8s clusters can dynamically provide the resources that ML platforms need to develop, deploy, and monitor ML models.
- K8s clusters can be expanded or reduced, depending on the computing needs of ML workloads.
- K8s clusters can be installed on premises, in public clouds, and on edge endpoints, which makes it easy to develop ML models in one location and deploy them anywhere.
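The scaling point in the list above applies both to cluster nodes and to the pods running on them. As a minimal sketch of the pod-level case, the function below implements the proportional rule that the Kubernetes documentation describes for the Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name and parameters are hypothetical, and the 0.1 tolerance mirrors the documented default rather than anything specific to MCAS.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Proportional scaling rule: ceil(current_replicas * current_metric / target_metric)."""
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("need at least one replica and a positive target")
    ratio = current_metric / target_metric
    # Leave the replica count alone when the ratio is close to 1.0, to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return max(1, math.ceil(current_replicas * ratio))
```

For example, four inference pods averaging 90% CPU against a 60% target would be scaled to six, while the same pods at 30% would be scaled down to two.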
Figure 3 shows some of the leading commercial and open-source ML platforms and tools that run on K8s.
At VMware, we have heavily invested in developing an enterprise-grade K8s platform that you can trust and use in multi-cloud settings. Tanzu Kubernetes Grid (TKG), which is at the center of our offering, is an upstream-compatible K8s distribution that VMware maintains and supports. In addition, VMware has built a robust portfolio of management and security solutions that complement TKG to meet enterprise requirements.
One of the critical advantages of TKG is that it supports multi-cloud architectures. You can start from an on-premises vSphere or VMware Cloud Foundation (VCF) deployment and extend its capacity to include VMware-based solutions available from public-cloud vendors. TKG can be deployed across this infrastructure to create a single, multi-cloud deployment. TKG can also be deployed directly on native AWS or Azure instances.
The MCAS solution developed by Intel and VMware uses TKG at its core to create an infrastructure solution that enables ITOps to support multi-cloud MLOps activities.
The current version of MCAS supports on-premises VCF, VMware Cloud on AWS, the Azure VMware Solution (AVS), and edge deployments. With this latest release, MCAS supports deploying ML workloads wherever it makes the most sense: in the core data center, in the cloud, or at the edge.
MCAS supports critical MLOps use cases, including but not limited to:
- Machine-learning inference. Once a model has been developed, it can be deployed and used to make predictions on new data. Because these inference services are compute-intensive, they can benefit from innovations such as Intel® Deep Learning Boost (Intel® DL Boost) technology, including Vector Neural Network Instructions (VNNI). These instructions are available to workloads starting with vSphere 7, which is at the core of the VMware Cloud Foundation 4.2 platform.
- Data warehousing and analytics. Data warehouses are considered one of the core components of business intelligence. They act as a central location for storing both current and historical data from one or more disparate sources. The VMware multi-cloud platform supports data warehousing, including industry-proven solutions based on Microsoft SQL Server 2019 or Oracle Database 19.
- Edge computing. For retail stores, healthcare, and other industries, running workloads closer to customers and closer to where data is gathered can improve performance significantly, leading to increased customer satisfaction. VCF makes it easy to deploy and manage remote workloads, using the same technology for public- and private-cloud workloads.
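To make the inference use case above more concrete, the following sketch emulates in plain Python what a VNNI-style fused instruction computes for one 32-bit lane: four unsigned 8-bit by signed 8-bit products are summed and added to a 32-bit accumulator in a single step. It is modeled on the non-saturating VPDPBUSD form and is an illustration only. The function names are hypothetical, and real inference frameworks reach these instructions through optimized libraries rather than code like this.

```python
def _as_int32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit value (non-saturating accumulate)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def vnni_dot_accumulate(acc: int, activations: list[int], weights: list[int]) -> int:
    """One 32-bit lane: acc + sum of four (unsigned 8-bit * signed 8-bit) products."""
    if len(activations) != 4 or len(weights) != 4:
        raise ValueError("each lane consumes exactly four byte pairs")
    if not all(0 <= a <= 255 for a in activations):
        raise ValueError("activations must be unsigned 8-bit values")
    if not all(-128 <= w <= 127 for w in weights):
        raise ValueError("weights must be signed 8-bit values")
    return _as_int32(acc + sum(a * w for a, w in zip(activations, weights)))


def int8_dot(activations: list[int], weights: list[int]) -> int:
    """Dot product of two equal-length quantized vectors, four elements per fused step."""
    if len(activations) != len(weights):
        raise ValueError("vectors must have the same length")
    # Zero-pad to a multiple of four so every step fills a whole lane.
    padding = (-len(activations)) % 4
    a = list(activations) + [0] * padding
    w = list(weights) + [0] * padding
    acc = 0
    for i in range(0, len(a), 4):
        acc = vnni_dot_accumulate(acc, a[i:i + 4], w[i:i + 4])
    return acc
```

Quantized inference spends most of its time in dot products of this shape, which is why collapsing the multiply, widen, and accumulate steps into one instruction helps throughput.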
The MCAS reference architecture comprises Intel and VMware technologies fit for ML use cases running in multi-cloud settings. MCAS core components include:
- Intel hardware accelerators for ML. The 2nd and 3rd generations of Intel® Xeon® Scalable processors include Intel DL Boost with VNNI, which improves artificial-intelligence performance by combining three instructions into one, thereby optimizing compute resources, utilizing the cache more effectively, and avoiding potential bandwidth bottlenecks. The MCAS reference architecture also includes other acceleration technologies: Intel® Optane™ persistent memory (PMem), Intel Optane DC SSDs, Intel® SSD D7 and D5 Series, and Intel® Ethernet products.
- VMware Cloud Foundation supports both traditional enterprise and modern applications and provides a complete set of highly secure software-defined services for compute, storage, network, security, Kubernetes, and cloud management.
- The Tanzu Kubernetes Grid Service (TKGS) is deployed into VCF workload domains and supports creating and operating Tanzu Kubernetes clusters natively in vSphere with Tanzu. The Kubernetes CLI can be used to invoke the TKGS to provision and manage Tanzu Kubernetes clusters.
- For edge deployments, VMware Cloud Foundation Remote Clusters enables the deployment of a workload domain or cluster at a remote site through the SDDC Manager running in a central location. This makes it possible to deploy and manage a full stack at remote sites using a single, centralized SDDC Manager.
- For edge networking, VMware SD-WAN enables enterprises to support application growth, network agility, and simplified branch and endpoint implementations. It also delivers high-performance, reliable, and more secure access to cloud services, private data centers, and software-as-a-service (SaaS)-based enterprise applications. In addition, VMware’s SASE solution delivers secure, optimal, and automated access to applications and workloads in the cloud by extending software-defined networking and security to the doorstep of major IaaS and SaaS providers.
- Public-cloud access is provided via the VMware Cloud on AWS and Azure VMware Solution infrastructures, which are delivered by the same vSphere-based SDDC stack that is used on premises. These solutions take advantage of existing tools, processes, and familiar VMware technologies, along with native integration with AWS or Microsoft Azure services.
- Finally, VMware Tanzu Mission Control provides a centralized management platform for consistently operating and securing your Kubernetes infrastructure and modern applications across teams and clouds.
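To illustrate how the Kubernetes CLI is used with TKGS in the component list above, the following sketch builds a declarative cluster manifest in Python and serializes it to JSON. It assumes the `run.tanzu.vmware.com/v1alpha1` `TanzuKubernetesCluster` format; the helper name and every value passed to it (cluster name, namespace, VM class, storage class, version) are hypothetical placeholders, so check the TKGS documentation for the API version and fields that match your release.

```python
import json


def tanzu_cluster_manifest(name: str, namespace: str, k8s_version: str,
                           vm_class: str, storage_class: str,
                           control_plane_nodes: int = 3, worker_nodes: int = 3) -> dict:
    """Build a TanzuKubernetesCluster manifest (assumed v1alpha1 layout) as a dict."""
    # The same VM class and storage class are reused for both node roles in this sketch.
    node_pool = {"class": vm_class, "storageClass": storage_class}
    return {
        "apiVersion": "run.tanzu.vmware.com/v1alpha1",
        "kind": "TanzuKubernetesCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "distribution": {"version": k8s_version},
            "topology": {
                "controlPlane": {"count": control_plane_nodes, **node_pool},
                "workers": {"count": worker_nodes, **node_pool},
            },
        },
    }


if __name__ == "__main__":
    # All argument values below are placeholders, not names from the reference architecture.
    manifest = tanzu_cluster_manifest(
        name="ml-inference",
        namespace="data-science",
        k8s_version="v1.20",
        vm_class="best-effort-medium",
        storage_class="example-storage-policy",
    )
    print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` while logged in to the Supervisor Cluster would ask TKGS to provision the cluster. Changing the worker count and re-applying is the declarative way to grow or shrink it.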
To learn more about MCAS, see VMworld 2021 session MCL1594, “5 Key Elements of an Effective Multi-Cloud Platform for Data and Analytics.” You can also download the MCAS reference architecture from VMware.