Data engineers spend time not only on activities that require their core data-engineering skills (such as SQL and Python) but also on tasks that require system knowledge — knowledge about configuration, monitoring, testing, deployment, and automation. They also need to acquire and maintain a deep technical understanding of the underlying infrastructure. All of these non-core activities add to the total cost of the end product, whether that is an interactive dashboard or a new dataset. The more time data engineers spend on non-core activities, the less efficient they become.
But now there is a solution!
We recently released the Versatile Data Kit (VDK), which we built to boost data engineers’ efficiency. VDK was initially built as part of a self-service data-analytics platform called SuperCollider. It is a framework that helps data engineers develop, troubleshoot, run, and manage data-processing workloads (data jobs). We found it so useful that we decided to release it as an open-source project. This post will explain the genesis of this project, as well as the results we’ve seen using VDK.
How we ended up here
Ninety percent of the data that exists in the world today was created within the last three years. To accommodate this massive growth, traditional data-warehousing solutions gave way to data lakes, which were cheaper to scale. This meant that to maintain high-quality, consistent datasets, data engineers had to think about enormous amounts of data — but also needed to understand the underlying infrastructure in deep technical detail. A decade ago, a data engineer needed excellent SQL knowledge and some scripting. These days, they also need to understand things like HDFS, S3, AWS, and Apache Parquet.
VMware is no exception. For example, with disparate sources and heterogeneous infrastructure, data engineers and data analysts had to learn that a single query against Amazon Redshift could indefinitely block all other queries and pipelines. Or that while querying data through Presto, somebody else might be changing the underlying .parquet files on storage, which could produce query errors or inconsistent results…and much more. These new classes of problems made our work far less efficient than it was with the good old ACID-transactional RDBMSs behind traditional data warehouses. Given that we have hundreds of full-time data engineers, analysts, and data scientists, it made sense to optimize their day-to-day work and help VMware navigate faster in this data-driven world. So, we decided to make data engineering more efficient.
So why is data engineering the focus of this article? Well, let’s look at what is common across roles. Data analysts are supposed to drag and drop in an interactive business-intelligence (BI) tool and occasionally write SQL queries. Data engineers use SQL and scripts for transformation and data integration. Data scientists are supposed to work on model training, yet they spend a lot of their time on data preparation. The common denominator is the data-engineering work: data integration and transformation. All three roles need to work directly with the underlying data infrastructure. The bottom line is that one cannot simply write code without accounting for the underlying technologies, systems, and processes.
Let’s look at a simple example: imagine you implement a Tableau report that is refreshed daily. However, the data source — a Presto cluster — is deployed in another geographic location. Initially, the report works and is responsive. Over time, though, the source data grows exponentially (as expected). Soon, the daily report refreshes start failing, because the data that feeds the report becomes so big that a) the SQL query that produces it starts running slowly, and b) there’s not enough network bandwidth to transfer the result within the Tableau-defined limits for refresh duration. And of course, you cannot simply ask the Tableau admins to increase the maximum refresh duration, because this would affect the refresh performance of other Tableau tenants, as well.
To restate: data engineers cannot simply write SQL. They also need a working knowledge of the underlying systems.
Before and after
SuperCollider, which I referred to earlier, has thousands of active users. Tenants in SuperCollider are grouped into self-organized teams. There are more than 100 different teams from all the various departments in the company. Teams vary in size (they might be one person or tens of people) and in skillset (data analysts, data engineers, data scientists, software developers, product managers). Looking more closely, it became evident that data-engineering activities took up most of the time. Even data scientists — who are supposed to work on model training — actually spend most of their time in data-engineering activities, such as data integration, curation, enrichment, and other kinds of processing. We really needed to make data engineering more efficient.
Next, we analyzed which parts of the data-engineering activities were the most time-consuming. It would be great if those were the core data-engineering activities…but that wasn’t the case. Troubleshooting, monitoring, and other system-related activities were consuming the majority of these employees’ time.
Before we implemented VDK, data engineers were using all sorts of data-integration and data-processing tools. Because the tools were different, it was usually not possible for one team to reuse the work of other data engineers (for example, how do you reuse a Sqoop job if your team’s tool of choice is Jupyter notebooks?). There was no common source-code repository. In fact, usually there was no source-code repository at all: the data product (report, dataset, pipeline, etc.) had only one copy, on the actual production system. The same was true for everything around monitoring, troubleshooting, and operations. Every team (and often every team member) had to implement building blocks such as credential management, connections to systems, and retry logic. Even the standard Kimball update strategies had to be reimplemented for every system.
With VDK, our mantra was “data-engineering efficiency.” To let data engineers focus on their core activities, we abstracted away system-related activities such as deployment, execution, monitoring, and security. VDK lets you save your SQL statement(s) and/or Python scripts to a folder and schedule them for execution from the command line with `vdk deploy`, and that’s it. Scheduling, execution, monitoring, security, state management, connections, alerts…everything is handled by the VDK runtime. Your SQL is also versioned and saved in a code repository, so colleagues can benefit from your work, as well. For cases where SQL is not enough, VDK supports Python, including first- and third-party libraries, which lets you implement logic as complex as it gets.
After introducing VDK to SuperCollider users, we noticed efficiency improvements all over the place! Here are the numbers for the first team (five data engineers and two data scientists) that was onboarded to VDK in 2018:
- Data jobs’ execution time: This was reduced from hours to minutes — not because VDK makes the underlying system faster, but because with VDK it is possible for less-experienced developers (data engineers) to provide more efficient solutions. For example, VDK has built-in Kimball templates for Slowly Changing Dimensions. Also, a developer can see other developers’ code or reuse a plugin/library implemented by other developers, which saves time.
- Data jobs code: We saw a 5x reduction in SQL and Python code length. Code is now modularized and reused. Ingestion is now a one-liner. All the code is in a common code repository, which is available to all VMware employees. Code-review practices are now part of the process — greatly improving quality, collaboration, and knowledge sharing.
- Report / dataset refresh time: This metric typically ranged from two to 20 minutes at the outset and dropped to between two and 20 seconds. Again, this is because with VDK, data engineers can reuse already proven, efficient code and adopt better practices, producing a more efficient data model.
- End-to-end use-case development time: This activity used to take weeks or months. VDK reduced this time to hours or days.
- Troubleshooting: VDK lets you develop locally using an integrated development environment (IDE), which includes benefits such as breakpoints and debugging. It has a built-in smart error classification that helps you perform root-cause analysis even faster.
- Operating expenses: We went from hundreds of heterogeneous, unmanaged scripts whose maintenance consumed several person-years per year to homogeneous data jobs that are managed, monitored, and easier to troubleshoot, at a cost of about one person-month per year. Whenever a problem occurs, the affected stakeholders (configured at deploy time) are notified (via email, Slack, etc.) with a message containing the actual error (for Python, a full stack trace) and a link to a centralized logging system, where all the logs for the given data-job execution are filtered and displayed.
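The Kimball-style templates from the first bullet are invoked from an ordinary Python step. The sketch below assumes a template named "scd2" and these particular argument keys for illustration; the real names depend on the VDK plugin that provides the template:

```python
# Hypothetical use of a built-in Slowly Changing Dimension (type 2) template.
# The template name ("scd2") and the argument keys below are assumptions for
# illustration; they are not taken from the VDK documentation.

def run(job_input):
    job_input.execute_template(
        template_name="scd2",
        template_args={
            "source_schema": "staging",
            "source_view": "customers",
            "target_schema": "warehouse",
            "target_table": "dim_customers",
        },
    )
```

Compared with hand-writing the SCD merge logic for every engine, a shared, already-optimized template is one reason less-experienced developers could ship efficient solutions.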
Looking at its architecture, one might suppose that VDK and third-party ETL UI tools are mutually exclusive. In truth, an ETL implementation that connects to the underlying infrastructure via the VDK library achieves better SLAs and is easier to troubleshoot. For example, imagine a Pentaho instance talking to a query engine that offers 99.9% availability. When Pentaho talks to the query engine via VDK (the VDK library runs as a Python application inside Pentaho), the success rate is 99.99%! For those familiar with SLAs, adding an extra “9” means a 10x reduction in failures, and achieving it usually costs far more than 10x. With VDK’s sophisticated error detection and automatic retries, you get it for free. And because everything in VDK is extensible, you can augment the connector to your system in any way you want.
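To illustrate why automatic retries move the needle on availability, here is a toy classify-and-retry wrapper of the kind a connector can apply around idempotent queries. This is a sketch of the idea, not VDK’s actual implementation, and the error classification is deliberately simplistic:

```python
import time

# Toy stand-in for error classification: transient infrastructure errors are
# worth retrying; anything else (e.g. a syntax error in the SQL) fails fast.
RETRYABLE = (ConnectionError, TimeoutError)

def run_with_retries(query_fn, attempts=3, backoff_seconds=1.0):
    """Call query_fn, retrying transient failures with linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return query_fn()
        except RETRYABLE:
            if attempt == attempts:
                raise  # exhausted retries: surface the transient error
            time.sleep(backoff_seconds * attempt)

# A flaky backend that fails twice, then succeeds:
state = {"calls": 0}

def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("transient backend hiccup")
    return "ok"
```

Retrying only transient errors, while failing fast on user errors, is what turns a backend with occasional hiccups into a much higher effective success rate as seen by the calling tool.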
At VMware, we already use VDK in several systems, to mutual benefit. VDK’s extensible design means it can be used by data systems with different technology stacks. That versatility both justified releasing VDK as open-source software and battle-tested its extensibility in heterogeneous production environments.
We can give you many more examples that highlight VDK’s power and its simplicity of use. But the best way to benefit from VDK is to try it yourself. You can see examples and tutorials at https://github.com/vmware/versatile-data-kit/wiki/Examples.