To stay competitive, enterprise organizations are engaged in an ongoing drive to optimize and scale the delivery of their products and services. Collecting and understanding data has become a critical component of these efforts.
The growing number of use cases and corresponding scale require new data-management applications that go beyond traditional databases. In addition to data storage, modern application deployments require new capabilities, such as caching, enrichment, real-time analysis, search, and visualization to better manage and utilize large data sets. Such capabilities require application teams to deploy and integrate multiple software tools as part of a single data platform. Examples include data-ingest services supported by caching, data preparation, and analysis; storage tiers for real-time and historical data access; and data search and analytics dashboards.
At the same time, as part of enterprise cloud adoption, application architecture and deployment patterns have evolved into distributed systems with interconnected software runtimes (i.e., microservices). Similarly, as per cloud-native deployment principles, modern data-platform implementations require deploying a distributed system of stateful software runtimes. So despite the benefits of horizontal scale and deployment agility that they bring to the enterprise, modern data platforms also introduce new deployment challenges.
At VMware’s Office of the CTO, the Cloud Data Platforms team focuses on developing deployment solutions for data platforms. This blog post provides an overview of the data platform deployment technologies we use. These solutions can enable you to deploy your own data platform within minutes, instead of hours or days.
Modern data platforms
Open-source and third-party middleware are often useful in data management, specifically for ingestion, distribution, processing, and storage. Examples include distributed databases and data-caching software. When deployed and integrated as a data platform, these software stacks can support a shared set of data services for integration with application business logic.
Implementation of a data platform typically involves the deployment and replication of multiple types of data software runtimes across various server nodes to handle scale and resiliency. As a result, data platforms are multi-node systems supporting distributed (and often interconnected) software runtimes. A data platform deployed on cloud infrastructure creates a new distributed, functional layer of data services that integrate with application runtimes via APIs and data function calls (see Figure 1).
In addition to self-managed deployments with virtualized and containerized software runtimes, new data platform architectures are emerging that cover both hosted and hybrid cloud (services) deployments. These modern data platforms are easier to implement, easier to scale, and easier to manage than their predecessors. And by leveraging cloud-native technologies and deployment patterns, modern data platforms are more cost-effective and resilient, as well.
To support operational agility and cost efficiency, cloud data platforms require the following deployment capabilities:
- System-level deployment automation: Implementation of modern data platforms requires deployment of a combination of different data software stacks. Deployment automation must cover the installation and configuration of multiple data software runtimes, as well as their placement across server and cloud infrastructure. Deployment manifests that capture data software configuration and placement details enable repeatable deployment patterns that can be automated.
- Infrastructure capacity elasticity: The types and amount of data that a data platform must process will change, and are likely to increase, over time. As a result, a modern data platform must have a scale-out architecture that enables incremental storage- and processing-capacity upgrades via the deployment of additional data platform nodes and software runtimes.
- Right-sized resource utilization: Data platform deployments consume infrastructure resources, and that consumption is likely to grow over time as processing requirements increase. Controls must be put in place that match the sizing of data platform runtime components with the resources available in underlying infrastructure server nodes and services. This helps minimize underutilization and avoid resource bottlenecks.
- Full-stack, system-level, and data observability: Cloud-native monitoring principles for infrastructure and application runtimes also apply to modern data platforms. In addition to full-stack monitoring and platform-level status visibility, observability instrumentation must be in place for service-level objective (SLO) monitoring of end-to-end data flows within a data platform.
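As a rough illustration of the right-sizing capability above, the following Python sketch compares the total resource requests of a platform's runtimes with the capacity of the cluster nodes. The class names, the runtime names, and the 50% low-water mark are all assumptions made for this example. The check is aggregate only: it does not model how individual replicas are packed onto nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_millicores: int
    memory_mib: int


@dataclass(frozen=True)
class Runtime:
    name: str            # hypothetical runtime name, e.g. "kafka-broker"
    replicas: int
    request: Resources   # resource request per replica


def total_request(runtimes):
    """Sum the resource requests of every replica of every runtime."""
    cpu = sum(r.replicas * r.request.cpu_millicores for r in runtimes)
    mem = sum(r.replicas * r.request.memory_mib for r in runtimes)
    return Resources(cpu, mem)


def sizing_status(runtimes, node, node_count, low_water=0.5):
    """Classify a deployment as "over", "under", or "ok".

    "over"  : requests exceed total cluster CPU or memory capacity
    "under" : both CPU and memory utilization fall below low_water
    "ok"    : anything in between
    """
    total = total_request(runtimes)
    cpu_util = total.cpu_millicores / (node.cpu_millicores * node_count)
    mem_util = total.memory_mib / (node.memory_mib * node_count)
    if cpu_util > 1.0 or mem_util > 1.0:
        return "over"
    if cpu_util < low_water and mem_util < low_water:
        return "under"
    return "ok"


# Hypothetical platform: three brokers and two workers on identical nodes.
platform = [
    Runtime("kafka-broker", replicas=3, request=Resources(1000, 4096)),
    Runtime("spark-worker", replicas=2, request=Resources(2000, 8192)),
]
node = Resources(cpu_millicores=4000, memory_mib=16384)
print(sizing_status(platform, node, node_count=3))
```

Running the same check with different node counts shows how a fixed set of requests can be a bottleneck on a small cluster and wasteful on a large one.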
In addition to increasing agility and scale, the adoption of cloud infrastructure and containerization has created a new set of deployment challenges for data platforms (specifically for stateful software runtimes and data workloads).
- Stitching together multiple software runtimes: Most data platforms require the deployment of multiple data software stacks. For each software stack, multiple container runtime instances are deployed to meet defined scale and resiliency requirements, creating even more complexity. All of this requires numerous deployment steps, which can easily take hours (or even days) to complete.
- Software runtime placement: Data platform implementations require deployment of multiple types and instances of data software runtimes. How these software components are deployed across server compute and cloud infrastructure matters. For example, to ensure node-failure resiliency, software runtime instances need to be placed on separate server nodes. At the same time, there is an incentive to stack multiple runtime instances together to optimize resource utilization of underlying server nodes. Handling these (sometimes competing) runtime placement requirements creates additional deployment complexity and risk. For example, when placement affinity rules are defined incorrectly, multiple data runtimes can inadvertently be placed into the same fault domain, creating a single point of failure.
- Resource sizing: Correctly allocating memory and compute resources is essential for the optimal execution of data software runtimes (tuning parameters for JVM heap memory, for example). At the same time, minimizing the overallocation of resources requires the configuration and tuning of resource (request) sizes and limits, potentially at multiple (cloud) infrastructure layers. These various levels of deployment tuning create new operational complexities, especially for mature data platform deployments where resource capacity and performance tuning are essential to optimize infrastructure costs.
- Implementing data observability: Managing data platforms requires monitoring the operational state and performance of multiple (often many) server nodes and the data software runtimes running on them. To derive the overall functional status of a data platform, you must collect, aggregate, and process observability metrics from potentially many runtime and server endpoints, which often introduces scaling and time-series query complexity. In addition, measuring data SLO throughput characteristics for data platforms requires end-to-end tracing supported by data software and application runtimes, which involves custom code and additional runtime deployment configurations.
- Maintaining open-source data software: Unfortunately, it’s usually necessary to manually download open-source software for implementing data platforms every time new software upgrades or fixes are released. Each release must be tracked separately for every open-source software component of the data platform deployment, which creates operational risk and overhead.
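The placement risk described above, where replicas of the same runtime land in one fault domain, can be checked mechanically. The Python sketch below is a minimal illustration under assumed inputs: a list of (runtime, node) placements and a mapping from node to fault domain. The node, zone, and runtime names are hypothetical.

```python
from collections import defaultdict


def colocated_replicas(placements, node_domains):
    """Find runtimes with two or more replicas in the same fault domain.

    placements:   list of (runtime_name, node_name) pairs, one per replica
    node_domains: dict mapping node_name -> fault domain (e.g. a zone)
    Returns {runtime_name: {fault_domain: replica_count}} containing only
    the domains that hold more than one replica of that runtime.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for runtime, node in placements:
        counts[runtime][node_domains[node]] += 1

    violations = {}
    for runtime, per_domain in counts.items():
        shared = {domain: n for domain, n in per_domain.items() if n > 1}
        if shared:
            violations[runtime] = shared
    return violations


# Hypothetical cluster: node-a and node-b share a fault domain.
node_domains = {"node-a": "zone-1", "node-b": "zone-1", "node-c": "zone-2"}
placements = [
    ("kafka", "node-a"), ("kafka", "node-b"), ("kafka", "node-c"),
    ("solr", "node-a"), ("solr", "node-c"),
]
print(colocated_replicas(placements, node_domains))
```

In this example the two Kafka replicas on node-a and node-b are flagged because both nodes sit in zone-1, while the Solr replicas pass because they are spread across zones.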
New application catalog offerings and deployment-automation tools have simplified software runtime installs and reduced the complexity of software lifecycle management for data platforms. In addition to validated software runtimes, application catalogs can also offer deployment-automation manifests and configuration guidelines.
- DIY open source: A wide variety of public software repositories are available for downloading open-source data-management software. Once downloaded (and repackaged), software runtimes can be deployed manually or automatically, which gives application teams both choice and flexibility. However, along with this flexibility comes the complexity and ongoing burden of tracking and managing new software releases and testing patch upgrades to fix security vulnerabilities. All of this requires technical expertise, time, and resources, which complicates day-two operations.
- Trusted open source: Application catalog and software marketplace offerings simplify day-two operations by making validated and packaged software runtimes and related artifacts available to application teams. This makes searching for and testing new data software releases and patch updates easier, simplifying software lifecycle management for data platforms. Today, there is a growing list of software catalog offerings available, such as the Bitnami Application Catalog and VMware Marketplace.
- Deployment blueprints: Deployment automation focuses on installing and configuring software runtimes and includes resource sizing, security settings, and placement rules. With the adoption of declarative software-automation solutions, deployment blueprints have become a tool to capture many of these (complex) configuration parameters. Especially with data platform deployments — where multiple data software runtimes must be stitched together and configured — blueprints simplify initial implementations as part of day-one operations and ongoing day-two software patching and upgrade activities. Blueprints have become part of the software-deployment artifacts available in application catalogs.
You can eliminate many of the operational challenges associated with self-managed open-source and third-party software deployments by using a centralized application catalog with validated data software runtimes and deployment artifacts. In addition, blueprints expand the automation semantics of software runtime implementations, simplifying deployments of distributed data platforms with multiple software stacks.
The use of blueprints reduces the need for reconfigurations after initial data platform deployments. Moreover, right-sized blueprint configurations reduce the amount of software runtime tuning work required throughout the remaining data platform deployment lifecycle. Carefully engineered blueprint configurations can replace post-deployment tuning activities, enabling right-sized deployments of data platforms on day one.
You can find validated blueprint designs — including blueprints for building containerized data platforms with Kafka, Apache Spark, Solr, and Elasticsearch — in the Bitnami Application Catalog and in the VMware Marketplace.
These engineered and tested data platform blueprints are implemented via Helm charts. They capture security and resource settings, affinity placement parameters, and observability endpoint configurations for data software runtimes. Using the Helm CLI or the Kubeapps tool, Helm charts enable the single-step, production-ready deployment of a data platform in a Kubernetes cluster, covering the automated installation and configuration of multiple containerized data software runtimes.
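To make the single-step deployment concrete, here is a small Python sketch that assembles a `helm install` command line without running it. The `--namespace`, `--create-namespace`, and `--set` flags are standard Helm 3 options. The release name, chart reference, and value key are placeholders invented for this example, not the names used by the actual blueprints, so check the chart's README for the real ones.

```python
import shlex


def helm_install_command(release, chart, namespace, overrides=None):
    """Build the argument list for a `helm install` call (does not run it)."""
    argv = ["helm", "install", release, chart,
            "--namespace", namespace, "--create-namespace"]
    # Emit value overrides in sorted order so the command is reproducible.
    for key, value in sorted((overrides or {}).items()):
        argv += ["--set", f"{key}={value}"]
    return argv


cmd = helm_install_command(
    release="my-data-platform",            # hypothetical release name
    chart="example-repo/data-platform",    # placeholder, not a real chart
    namespace="data-platform",
    overrides={"kafka.replicaCount": 3},   # hypothetical value key
)
print(shlex.join(cmd))
```

Building the argument list separately from executing it keeps the command easy to review or log before anything is applied to a cluster. Passing the list to `subprocess.run` would be one way to execute it.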
Each data platform blueprint comes with Kubernetes cluster node and resource configuration guidelines to ensure the optimized sizing and utilization of underlying Kubernetes cluster compute, memory, and storage resources. For example, the README.md for the Kafka, Apache Spark, and Solr blueprint covers its Kubernetes deployment guidelines.
We are always interested in getting feedback from users! You can submit questions and issues via the Bitnami GitHub page and use the Bitnami Community page to submit enhancement requests and ideas for new data platform blueprints.
Geeta Kulkarni is a Staff Engineer in VMware’s Office of the CTO. She is currently focusing on building data platforms on cloud-native infrastructure.