
VMware Tanzu + cnvrg.io: Enabling Enterprise-Grade MLOps

The unrealized promise of machine learning

Very few Machine Learning (ML) models make it to production. For example, in a recent KDnuggets poll, data scientists reported that fewer than 20% of ML models are deployed in production. This low number implies that companies are wasting compute power, human talent, and opportunities to use ML to enhance their operations, services, and products.

Training and deployment are key for MLOps

Continuous training and deployment capabilities have become recurring topics in conversations with data science teams, regardless of industry or company size. They continually surface as the primary obstacles to deploying and managing ML models as part of business applications. Machine Learning Operations (MLOps) has emerged as the enabler of continuous training and deployment for ML models. This article introduces the core concepts of MLOps and explores how the combination of VMware Tanzu and cnvrg.io can help you adopt MLOps processes, so your teams can deploy ML models in production at the speed your business requires.

Bridging the chasm between ML experimentation and production

In this article, Chris Gully (VMware), Bob Glithero (cnvrg.io), and Enrique Corro (VMware) show how Machine Learning Operations (MLOps) processes and tools can help you bridge the chasm between ML experimentation and production. We strongly believe that MLOps is the quickest path for a business to take advantage of ML to enhance services, products, and the overall customer experience.

The article is divided into three sections. Section one covers the key concepts of MLOps and why they matter. Section two discusses how Kubernetes can provide the infrastructure services to run ML workloads at any point in a multi-cloud environment. Finally, section three explains how MLOps orchestrators offer the ML pipeline automation and model management instruments needed to continuously train, deploy, and update ML models in production. Chasm bridged.

What is MLOps, and how can it help you promote ML models from labs into production chains?

To understand MLOps, we need to consider that it is an adaptation of DevOps to meet ML model development requirements. It is important to recall that DevOps is about culture and methods that enable the “rapid delivery of stable, high-quality software from concept to the customer.” A DevOps culture aims to automate software development, testing, and deployment through continuous delivery. Adopting DevOps requires the alignment of three dimensions of the software development lifecycle: people, processes, and tools.

MLOps builds on DevOps methods, considering that ML models are more than just code. Models are a tight combination of the statistical patterns latent in historical data (datasets) and codified mathematical functions that get incrementally shaped during a model’s training process. When the training process is completed the right way, ML models can apply their learned functions to make predictions on new (unseen) data as long as the statistical patterns latent in this new data remain close enough to those from historical data.

After some time, the statistical patterns latent in new (unseen) data will drift considerably from those in the historical data, leading the model to make wrong predictions at an unacceptable rate. When this happens, it is time to retrain the model with new data and deploy the updated version to production.
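One way to make "drifted considerably" measurable is a distribution-distance score computed on a single feature. The sketch below uses the Population Stability Index (PSI), which is our choice for illustration and is not a method named by the tools discussed in this article. The function names, the ten equal-width bins, and the 0.2 threshold (a commonly cited rule of thumb) are all assumptions you would tune for your own data.

```python
import math


def bin_fractions(values, edges):
    """Return the fraction of values that fall into each bin defined by edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Values outside the reference range are clamped into the first or last bin.
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    eps = 1e-6  # floor for empty bins so the logarithm below is always defined
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(reference, current, n_bins=10):
    """PSI between historical (reference) data and new (current) data."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_retraining(reference, current, threshold=0.2):
    """Flag drift when PSI exceeds the threshold (assumed value, not a standard)."""
    return population_stability_index(reference, current) > threshold
```

In practice you would run a check like this per feature on a schedule and combine it with direct monitoring of prediction quality, since input drift does not always translate into worse predictions.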

MLOps’ mission is to streamline the ML development lifecycle, involving the continuous execution of activities such as data extraction, model training, model deployment, and model monitoring.
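To make the lifecycle concrete, here is a deliberately tiny sketch of one pass through extraction, monitoring, training, and deployment. It is a toy under stated assumptions: the "model" is just the mean of the data, "deployment" is returning the model object, and every name (`Model`, `extract`, `train`, `monitor`, `run_cycle`) is hypothetical rather than part of any orchestrator's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Model:
    version: int
    mean: float  # the only "learned" parameter of this toy model


def extract(raw_batch: List[Optional[float]]) -> List[float]:
    """Data extraction: drop missing values from the raw feed."""
    return [x for x in raw_batch if x is not None]


def train(data: List[float], version: int) -> Model:
    """Model training: fit the toy model (the mean) on the extracted data."""
    return Model(version=version, mean=sum(data) / len(data))


def monitor(model: Model, live_data: List[float], tolerance: float) -> bool:
    """Model monitoring: True while mean absolute error stays within tolerance."""
    error = sum(abs(x - model.mean) for x in live_data) / len(live_data)
    return error <= tolerance


def run_cycle(deployed: Optional[Model], raw_batch, tolerance: float = 1.0) -> Model:
    """One pass of the loop: keep the deployed model if healthy, else retrain."""
    data = extract(raw_batch)
    if deployed is not None and monitor(deployed, data, tolerance):
        return deployed
    next_version = 1 if deployed is None else deployed.version + 1
    return train(data, next_version)  # returning the model stands in for deployment
```

Calling `run_cycle` on every new batch keeps the current model while it performs within tolerance and produces a new version when it does not. An orchestrator automates this same loop with real data sources, training jobs, and serving endpoints.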

If an organization does not build its MLOps competence, it will be tough to connect data science work with business outcomes.

Like DevOps, MLOps is about organizational culture, people, and roles. MLOps requires a lot of collaborative work among different personas and functions, for example:

  • Data engineers build the pipelines needed to extract data from diverse sources. These data pipelines often carry out a series of transformations necessary to get data in the right shape for ML modeling.
  • Data scientists determine if ML is suitable to solve a business problem. They use historical data to run experiments with multiple ML techniques and configurations, looking for the one that best balances accuracy, inference latency, and the infrastructure constraints of the serving endpoint.
  • ML (or MLOps) engineers define the ML pipeline processes and control the transition of ML models from experimentation to production. They also monitor models’ performance and determine the conditions to update and deploy the new version of a model.
  • Business stakeholders help connect data science work with business outcomes and help justify the investment in data science people and tools.
  • Compliance officers enforce data governance and adherence to company policies (e.g., data privacy and security) and industry regulations in data science processes.
  • IT Operations (ITOps) are at the center of this ecosystem of roles as they provide the infrastructure and technical support required to deploy ML systems and tools.

Figure 1: Personas, activities, and tools in a typical ML model development pipeline.

Please take note of the following:

  • There are many activities that data engineers, data scientists, and ML engineers need to perform and coordinate to produce ML models that deliver the intended business outcomes. Therefore, it is essential to select the orchestration tools that facilitate the automation and coordination of those activities as part of a unified process.
  • ML is a very innovative, experimental, and evolving field that will invite people to try the latest ML tools and libraries. For convenience, ML libraries and tools often get packaged in containers you can deploy in Kubernetes; however, for security purposes, you’ll have to scan and validate those containers.
  • ML pipeline orchestrators are at the center of the ML system. There are multiple open-source options like MLflow and Kubeflow; however, deploying these in production with the proper security and stability can be daunting.

Here is where the combination of enterprise-grade platforms like VMware Tanzu (multi-cloud Kubernetes platform) and cnvrg.io (MLOps orchestration) can simplify the deployment and management of ML systems and workloads so your data science teams can focus on getting MLOps processes done the right way.

Running your ML workloads on VMware Tanzu

Kubernetes, MLOps, GPUs, oh my! Today, IT admins and operators are being asked to move at the speed of light to implement and operationalize key technologies that help drive innovation. These modern technologies require a new set of skills, and they are often a departure from what many IT shops have standardized on to achieve the scale and resiliency that the business and its customers demand. As if that were not enough, there is also a rising culture of data specialists who are often siloed because their IT departments lack mainstream support for workloads like Machine Learning (ML). This bifurcation creates a space where each team develops its own solution, making it harder to collaborate and ultimately get models into production.

Another element working against IT shops is the rise of multi-cloud infrastructure. Today, infrastructure consumption spans a diverse set of private clouds, hyperscalers, and hybrid cloud solutions. This fragmented consumption lacks a common architecture, a common cost model, and, most importantly, a standard API (Application Programming Interface). VMware addresses this by maintaining a presence across all the major hyperscalers and on-premises private clouds. This enables VMware to tame a chaotic landscape that spans from on-premises to the cloud to the Edge with a consistent control plane, operating model, and common API set. But having a consistent infrastructure model is not enough for today’s transformative climate.

Given these challenges and the feedback we hear from customers, VMware is motivated to streamline and simplify the implementation and management of Kubernetes and of accelerator resources like GPUs, and to become the platform of choice for Machine Learning Operations (MLOps).

One outcome of the effort to streamline operations and shorten Time to Value (TTV) for our customers was a collaboration with NVIDIA to develop a platform that couples best-in-class performance with a rich ecosystem for users and operators. NVIDIA AI Enterprise is an enterprise-class AI/ML solution that was co-engineered for performance, ease of use, and ease of implementation. It is fully integrated to give customers the confidence to run their AI workloads.

Highlights of the solution include:

  • Certified Systems that give customers confidence
  • Solution-level support from VMware and NVIDIA
  • Performance-optimized containers
  • Fully integrated Network & GPU Operators
  • Tanzu Kubernetes Grid (native K8s for vSphere)
  • Curated NGC catalog and applications

Figure 2: High-level architecture of an end-to-end ML development environment.

Dell’s Validated Design for AI is a concrete example of this reference architecture to develop ML models with cnvrg.io and VMware infrastructure.

To further shorten TTV for customers, VMware also developed the Service Installer for Tanzu, which is available in the VMware Marketplace. The Service Installer is a toolkit that helps automate Tanzu Kubernetes deployments. Its main purpose is to provide an easy-to-use UI that captures the data required for successful Tanzu Kubernetes Grid (TKG) deployments. To complement the UI, there is also a command-line tool to run the different phases of a deployment. Building on industry adoption and ecosystem enablement, the Service Installer is primarily based on Terraform modules and configuration files, which makes the utility easier to modify and maintain as its capabilities grow. Figure 3 shows the UI provided by this utility.

Figure 3: Main Screen of the Service Installer for Tanzu.

Okay, so now you have hardware and software that are certified, validated, and integrated to ensure easy implementation and to build on existing Centers of Excellence around VMware technology. So what could you possibly need next? To bring it all together, the convergence of Data Engineering, ML Engineering, and DevOps practices created the need for a comprehensive and mature MLOps practice that ensures frictionless data curation, model creation, model tuning, model delivery, and, lastly, model lifecycle management. The next section dives into how cnvrg.io addresses all of these with its simple-to-use but robust MLOps solution.

cnvrg.io: A full stack machine learning operating system

With an increasing demand for rapid insights from data science to keep up with business and market changes, enterprises need faster value from their AI investments. As a result, most are trying to streamline experimentation, prototyping, training, and testing. But optimizing the development and training workflows is only half of the job. It’s essential to have a consistent, repeatable process for deploying and retraining models at scale.

Figure 4: Typical issues in machine learning projects

One of the biggest challenges in repeatable ML is in the diversity of the pipelines. ML pipelines can include many versions of libraries, dependencies, and frameworks that support diverse training, testing, and production tasks.

Moreover, each pipeline task may better fit a specific type of compute resource – CPUs, GPUs, or AI accelerators. These can be deployed across various clouds and data centers, each with its own operational and monitoring idiosyncrasies that blur a holistic view of the ML portfolio.

Because ML performance is a function of model code, data, and hyperparameter choices, ML models have different monitoring needs from other types of software. Kubernetes observability down to the container level isn’t enough to track the health and performance of individual jobs. With potentially hundreds of pipelines using different combinations of model code, datasets, hyperparameters, and resources, it’s essential to have monitoring in place that can track and compare model performance among the various runs to efficiently select the most performant models.
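The idea of tracking runs by their combination of code, data, and hyperparameters can be sketched in a few lines. This is an illustrative in-memory registry, not the cnvrg.io API: `run_id`, `RunRegistry`, and the metric names are hypothetical, and a real platform would persist runs and capture far more metadata.

```python
import hashlib
import json


def run_id(code_version, dataset_version, hyperparameters):
    """Derive a stable identifier from the three inputs that define a run."""
    payload = json.dumps(
        {"code": code_version, "data": dataset_version, "hp": hyperparameters},
        sort_keys=True,  # key order must not change the identifier
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class RunRegistry:
    """Keeps runs in memory so they can be compared on any logged metric."""

    def __init__(self):
        self._runs = {}

    def log(self, code_version, dataset_version, hyperparameters, metrics):
        rid = run_id(code_version, dataset_version, hyperparameters)
        self._runs[rid] = {
            "code": code_version,
            "data": dataset_version,
            "hp": dict(hyperparameters),
            "metrics": dict(metrics),
        }
        return rid

    def best(self, metric, higher_is_better=True):
        """Return (run_id, run) for the best run on a metric, or None if no run logged it."""
        candidates = [
            (rid, run) for rid, run in self._runs.items() if metric in run["metrics"]
        ]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda item: item[1]["metrics"][metric])
```

Because the identifier is derived from all three inputs, two runs that share model code but differ in dataset version or hyperparameters remain distinct, which is what container-level observability alone cannot tell you.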

In response, cnvrg.io has created a Kubernetes-based MLOps platform to manage the lifecycle of AI models at scale: training, testing, deploying, monitoring, and continuous learning. Figure 5 shows the key elements of the cnvrg.io platform.

The cnvrg.io control plane follows the Kubernetes operator pattern and contains all the user-facing logic needed to manage code, jobs, and the state and health of jobs and resources. VMware Tanzu provides a consistent Kubernetes API across clouds and data centers for the control plane to manage ML worker nodes. You can deploy the control plane to the Tanzu Kubernetes runtime with Helm charts. Of course, if you’re using Metacloud, cnvrg.io’s managed service, all of these operational details are abstracted away from you.

Figure 5: cnvrg.io MLOps platform

Using Kubernetes as an orchestration layer makes jobs portable across environments and simplifies scaling resources up and down on demand. cnvrg.io can also use Kubernetes’ native mechanisms, such as taints and tolerations, to place workloads only on appropriate nodes. For example, we can ensure that jobs requiring GPUs for training don’t land on nodes with only CPUs. cnvrg.io also created an additional job scheduler that manages the lifecycle of individual ML tasks to augment the pod and container management provided by VMware Tanzu.
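As a rough illustration of the placement mechanics, the sketch below builds a Pod manifest as a plain Python dictionary and applies a simplified version of the toleration-matching rule. Several things here are assumptions: the taint key and value on GPU nodes, the image name, and the helper names are hypothetical, and the matcher only considers `NoSchedule` taints. Note that a toleration only permits a pod to run on a tainted node, which keeps CPU-only jobs off GPU nodes. It is the `nvidia.com/gpu` resource limit (the extended resource exposed by the NVIDIA device plugin) that keeps a GPU job from landing on a node without GPUs.

```python
# Assumed taint that a cluster admin has placed on GPU nodes.
GPU_TAINT = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}


def gpu_training_pod(name, image, gpus=1):
    """Build a Pod manifest for a training job that needs GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,
                    # Requesting the GPU resource restricts scheduling to nodes that have it.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
            # The toleration lets this pod onto nodes carrying the GPU taint.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
        },
    }


def tolerates(toleration, taint):
    """Simplified form of the Kubernetes rule for matching a toleration to a taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value") == taint.get("value")
    )


def can_schedule(pod, node_taints):
    """True if every NoSchedule taint on the node is tolerated by the pod."""
    tolerations = pod["spec"].get("tolerations", [])
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )
```

The dictionary can be serialized with `json.dumps` and submitted with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.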

Finally, one of cnvrg.io’s key value propositions is heterogeneous compute and storage. As discussed above, even in a single pipeline, developers may need different types of compute and storage resources at different stages. cnvrg.io abstracts and templatizes compute and storage into a utility by making it seamless to connect a variety of resources, such as CPUs, GPUs, specialized AI accelerators, and different classes of storage. Developers can select marketplace resources from a menu of containerized OEM and partner-provided options and consume them in a cloud-native way, even if they’re on-premises.

MLOps — combining processes, tools, people, and personas for maximum impact

MLOps is the cornerstone to successfully transitioning ML models from experimentation to production. MLOps is not a product but a combination of processes, tools, and (more importantly) people & personas. Effective ML pipeline orchestrators like cnvrg.io allow organizations to connect people to collaborate and put the processes in place that make ML development efficient and effective. Please keep the following ideas in mind when making decisions about infrastructure and tools to instantiate ML development pipelines:

  • Most ML development libraries, tools, and orchestrators have evolved to run on Kubernetes. As a result, Kubernetes has become the universal infrastructure service to run ML workloads on-premises, in the public cloud, and at the edge.
  • An enterprise Kubernetes platform such as VMware Tanzu provides the performance, security, management, and observability features needed to meet the requirements of production environments.
  • The combination of cnvrg.io and VMware Tanzu allows you to simplify the deployment of well-orchestrated ML pipelines at any point in a multi-cloud setting, wherever it makes the most sense for your business.

To learn more, please watch the replay of VMware Explore 2022 session VIB1345US, “One Multi-Cloud Grid to Run all Machine Learning Operations Stages.” The video walks you through the key MLOps concepts and shows how easily Tanzu and cnvrg.io can be integrated to get a complete ML development system up and running in no time.
