Hillview: A Tool for Browsing Huge Datasets
Analyzing and understanding data is a key part of the success of many enterprises. There is a myriad of commercial and free tools available today to help enterprises do just that. Only a few of these tools can handle large-scale datasets, and all of those require deep technical expertise to use. Moreover, the explosion of the Internet of Things (IoT) will tremendously increase the amount of data produced worldwide.
The Hillview project is developing a set of tools for the exploration of very large datasets (“big data”). Our goal is to build a spreadsheet for big data that is easy to use and helps users browse and visualize data quickly. The motto of our project is “Big data for the 99% of enterprises.” The goal is to handle datasets with a billion rows interactively. Let’s take a quick look at how we are accomplishing this.
Ease of use – Today, every tool that can analyze a billion rows requires users to write code. Hillview provides a browser-based point-and-click interface. It cannot get much easier.
Visualization – How can you show information about a billion data items on a display that has only a few million pixels? You must somehow extract the essence of the data. As an analogy, consider web-based mapping services which show the route to travel from Point A to Point B. The route is initially shown at a very high level, but as you drill down, you see more details down to the bends in the road. Hillview uses the same principle: the initial display provides the gist of the data, and the user can deep dive into various data regions, looking for anomalies, trends, etc. This technique enables Hillview to compute the images to display much more efficiently since it only needs to compute a high-level view of the results initially. Hillview combines the visualization and computing engines into one.
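The drill-down idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hillview's code: the function name pixel_histogram and the sample delays are made up for the example. The point it shows is that the summary holds one bucket per horizontal pixel, so its size depends on the display rather than on the number of rows, and zooming in simply reruns the same computation over a narrower range.

```python
def pixel_histogram(values, lo, hi, width_px):
    """Count the values in [lo, hi] into width_px equal-width buckets.

    Hypothetical helper for illustration; not part of Hillview.
    """
    if hi <= lo or width_px <= 0:
        raise ValueError("need hi > lo and width_px > 0")
    counts = [0] * width_px
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi lands in the last bucket.
            idx = min(int((v - lo) * width_px / (hi - lo)), width_px - 1)
            counts[idx] += 1
    return counts


# Made-up departure delays in minutes.
delays = [-56, -10, -3, 0, 0, 2, 5, 15, 45, 120, 600, 1440]

overview = pixel_histogram(delays, -60, 1440, 4)  # the gist of the whole range
zoomed = pixel_histogram(delays, -60, 60, 4)      # drill down near zero delay
```

In the overview almost every flight falls into the first bucket; the zoomed view spends the same four buckets on a two-hour window and reveals the detail around zero delay.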
Response time – Hillview is implemented as a cloud-based service; this enables it to use many machines to load the data concurrently from many storage devices. This method can be much faster than using a single machine, in the same way that search engines can find a phrase on the entire web faster than you can search your own computer.
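One way to picture how many machines can cooperate on a single chart is sketched below. It is an assumption-laden illustration, not a description of Hillview's internals: the in-memory lists stand in for data stored on separate machines, Python threads stand in for the machines themselves, and the names partial_counts, merge, and parallel_histogram are hypothetical. Each worker reduces its shard to a small list of counts, and the small lists are added together.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_counts(shard, lo, hi, buckets):
    """Summarize one shard as a small, fixed-size list of bucket counts."""
    counts = [0] * buckets
    for v in shard:
        if lo <= v <= hi:
            idx = min(int((v - lo) * buckets / (hi - lo)), buckets - 1)
            counts[idx] += 1
    return counts


def merge(a, b):
    """Combine two partial summaries by adding matching buckets."""
    return [x + y for x, y in zip(a, b)]


def parallel_histogram(shards, lo, hi, buckets):
    """Summarize every shard concurrently, then merge the partial results."""
    total = [0] * buckets
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        parts = pool.map(lambda s: partial_counts(s, lo, hi, buckets), shards)
        for part in parts:
            total = merge(total, part)
    return total


# Three made-up shards of departure delays in minutes.
shards = [[-5, 0, 12], [3, 47, 90], [-20, 8, 61]]
result = parallel_histogram(shards, -30, 90, 4)
```

Because only the small per-shard summaries travel between workers, the amount of data exchanged does not grow with the number of rows. Threads in Python will not speed up this particular loop; they are used here only to show the shape of the computation.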
Our goal is not to replicate features available in other tools, but to augment them. Hillview is designed for very large datasets; once the user homes in on a small subset of interest, Hillview allows the user to export the data for use in conventional tools.
The following are screenshots from Hillview using a dataset of all flights in the United States from January and February 2016. Figure 1 shows a tabular format of the data. This example displays three columns, sorted on the flight departure delay.
Figure 1: Tabular View of FAA Data Sorted by Departure Delay
The data shows that some flights leave as much as 56 minutes early. This is surprising. By looking at the origin city, you notice most of these flights are departing from Alaska in the winter; we guess that is one place where people do not mind leaving early.
With just a few clicks you can display the distribution of flights organized by their departure delay, shown as a histogram in Figure 2. This figure indicates that 66% of flights leave on time or early. How come our flights never seem to be among them?
Figure 2: Histogram of Departure Delays by Minutes
Hillview can also display the data in a pair of columns as a heat map. Figure 3 is a graph of the arrival delay vs. departure delay, where color is used to display the number of flights (magenta is small, green is very high). The graph looks like what you might expect: most flights that leave on time arrive on time, and flights that leave late arrive late. It is amazing to see that some flights are delayed by more than a day (1,440 minutes).
Figure 3: Heat Map of Arrival Delay vs. Departure Delay
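The counting behind a heat map like this one can be sketched as a two-dimensional version of a histogram. The code below is a minimal illustration under our own assumptions, not Hillview's implementation: the function heat_map_counts, the grid size, and the sample flights are all hypothetical. Each pair of delays is assigned to a grid cell, and the number of flights per cell is what a color scale would then encode.

```python
from collections import Counter


def heat_map_counts(pairs, x_range, y_range, cells):
    """Count (x, y) pairs per grid cell; keys are (column, row) tuples."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    grid = Counter()
    for x, y in pairs:
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi:
            col = min(int((x - x_lo) * cells / (x_hi - x_lo)), cells - 1)
            row = min(int((y - y_lo) * cells / (y_hi - y_lo)), cells - 1)
            grid[(col, row)] += 1
    return grid


# Made-up (departure delay, arrival delay) pairs in minutes.
flights = [(5, 10), (12, 3), (20, 30), (-10, -15), (70, 65), (1440, 1450)]

# 26 cells per axis over [-60, 1500] gives cells that are 60 minutes wide.
grid = heat_map_counts(flights, (-60, 1500), (-60, 1500), 26)
```

Cells near the diagonal fill up when late departures lead to late arrivals, and the lone flight delayed by a full day lands in the far corner of the grid.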
Hillview is still an early prototype, but we are excited by its capabilities. We decided to keep this as an open-source project, to make it readily accessible to the maximum number of users and contributors. You can find Hillview on GitHub, https://github.com/vmware/hillview.
If you are attending VMworld 2017 in Las Vegas and want to learn more about Hillview and other Machine Learning/Big Data projects, look for the Big Data for the 99% of Enterprises Panel Discussion in the “Future Trends” track.
Hillview Research Team
The Hillview Research Team is part of VMware Research. It consists of several researchers devoting part of their time to the interesting problem of data exploration over large datasets. The team members are Udi Wieder, Mihai Budiu, Parikshit Gopalan, and Lalith Suresh.