Hillview: A Tool for Browsing Huge Datasets

A key part of the success of many enterprises involves analyzing and understanding data. A myriad of commercial and free tools is available today to help enterprises do just that. Only a few of these tools can handle large-scale datasets, and all of those require deep technical expertise to use. Meanwhile, the explosion of the Internet of Things (IoT) will tremendously increase the amount of data produced worldwide.

The Hillview project is developing a set of tools for the exploration of very large datasets (“big data”). Our goal is to build a spreadsheet for big data that is easy to use and helps users browse and visualize data quickly. The motto of our project is “Big data for the 99% of enterprises.” Concretely, we aim to handle datasets with a billion rows interactively. Let’s take a quick look at how we are accomplishing this.

Ease of use – Currently, all existing tools that can analyze a billion rows require users to write code. Hillview provides a browser-based point-and-click interface. It cannot get much easier.

Visualization – How can you show information about a billion data items on a display that has only a few million pixels? You must somehow extract the essence of the data. As an analogy, consider web-based mapping services which show the route to travel from Point A to Point B. The route is initially shown at a very high level, but as you drill down, you see more details down to the bends in the road. Hillview uses the same principle: the initial display provides the gist of the data, and the user can deep dive into various data regions, looking for anomalies, trends, etc. This technique enables Hillview to compute the images to display much more efficiently since it only needs to compute a high-level view of the results initially. Hillview combines the visualization and computing engines into one.
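The drill-down principle can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hillview's actual code: the function name `bucket_counts`, the bucket count, and the sample delays are all invented for the example. The point is that the summary has a fixed size, bounded by what the screen can show, and zooming simply recomputes the same summary over a narrower range.

```python
from typing import List, Sequence


def bucket_counts(values: Sequence[float], lo: float, hi: float,
                  buckets: int) -> List[int]:
    """Count the values in [lo, hi) using `buckets` equal-width buckets.

    The size of the result depends only on `buckets` (think: horizontal
    pixels available), never on how many values there are.
    """
    if buckets <= 0 or hi <= lo:
        raise ValueError("need buckets > 0 and hi > lo")
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        if lo <= v < hi:
            # min() guards against floating-point rounding at the upper edge.
            index = min(int((v - lo) / width), buckets - 1)
            counts[index] += 1
    return counts


# Hypothetical departure delays in minutes (illustrative data only).
delays = [-56, -10, -5, -3, 0, 0, 2, 7, 15, 45, 120, 1500]

overview = bucket_counts(delays, -60, 1540, 4)  # coarse first view
zoomed = bucket_counts(delays, -60, 60, 4)      # drill into one region
print(overview, zoomed)
```

In the coarse view almost everything lands in the first bucket; the zoomed view spreads the same region over all four buckets and reveals its shape.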

Response time – Hillview is implemented as a cloud-based service; this enables it to use many machines to load the data concurrently from many storage devices. This method can be much faster than using a single machine, in the same way that search engines can find a phrase on the entire web faster than you can search your own computer.
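One way to picture this division of labor is the sketch below. It is an assumption-laden illustration rather than Hillview's implementation: threads stand in for machines, in-memory lists stand in for data partitions, and the names `partial_histogram`, `merge`, and `parallel_histogram` are hypothetical. Each worker summarizes only its own partition, and the small partial summaries are then added together.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def partial_histogram(chunk: Sequence[float], lo: float, hi: float,
                      buckets: int) -> List[int]:
    """Summarize one partition into a fixed-size list of bucket counts."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in chunk:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), buckets - 1)] += 1
    return counts


def merge(a: List[int], b: List[int]) -> List[int]:
    """Combine two partial summaries by adding counts bucket by bucket."""
    return [x + y for x, y in zip(a, b)]


def parallel_histogram(partitions: Sequence[Sequence[float]], lo: float,
                       hi: float, buckets: int) -> List[int]:
    """Summarize every partition concurrently, then merge the results."""
    with ThreadPoolExecutor() as pool:  # threads stand in for machines
        partials = list(pool.map(
            lambda p: partial_histogram(p, lo, hi, buckets), partitions))
    total = [0] * buckets
    for part in partials:
        total = merge(total, part)
    return total


# Hypothetical partitions of a delay column (illustrative data only).
partitions = [[-5, 0, 3], [10, 20], [70, -30]]
total = parallel_histogram(partitions, -40, 80, 3)
print(total)
```

Because merging is plain addition, the result does not depend on how the rows are split across workers, and only the small summaries, not the raw rows, need to travel over the network.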

Our goal is not to replicate features found in other tools, but to augment them. Hillview is designed for very large datasets; once the user homes in on a small subset of interest, Hillview allows the user to export the data for use in conventional tools.

The following are screenshots from Hillview using a dataset of all flights in the United States from January and February 2016. Figure 1 shows a tabular format of the data. This example displays three columns, sorted on the flight departure delay.

Figure 1: Tabular View of FAA Data Sorted by Departure Delay

The data shows that some flights leave as much as 56 minutes early. This is surprising. By looking at the origin city, you notice most of these flights are departing from Alaska in the winter; we guess that is one place where people do not mind leaving early.

With just a few clicks you can display the distribution of flights organized by their departure delay, shown as a histogram in Figure 2. This figure indicates that 66% of flights leave on time or early. How come our flights never seem to be among them?

Figure 2: Histogram of Departure Delays by Minutes

Hillview can also display data in a pair of columns as a heat map. Figure 3 is a graph of the arrival delay vs. departure delay, where color is used to display the number of flights (magenta means few flights, green means very many). The graph looks like what you might expect: most flights that leave on time arrive on time, and flights that leave late arrive late. It is amazing to see that some flights are delayed by more than a day (1,440 minutes).

Figure 3: Heat Map of Arrival Delay vs. Departure Delay
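The counting behind a heat map like this one can be sketched as two-dimensional bucketing. The Python below is a minimal illustration, not Hillview's code: the function name `heat_map_counts`, the grid size, and the sample flights are assumptions made for the example. Each cell counts the flights whose pair of delays falls inside it, and a renderer would then map each count to a color.

```python
from typing import List, Sequence, Tuple


def heat_map_counts(points: Sequence[Tuple[float, float]],
                    x_range: Tuple[float, float],
                    y_range: Tuple[float, float],
                    x_bins: int, y_bins: int) -> List[List[int]]:
    """Count points per cell of an x_bins-by-y_bins grid.

    grid[row][col] holds the number of points whose x value falls in
    column `col` and whose y value falls in row `row`.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    x_width = (x_hi - x_lo) / x_bins
    y_width = (y_hi - y_lo) / y_bins
    grid = [[0] * x_bins for _ in range(y_bins)]
    for x, y in points:
        if x_lo <= x < x_hi and y_lo <= y < y_hi:
            col = min(int((x - x_lo) / x_width), x_bins - 1)
            row = min(int((y - y_lo) / y_width), y_bins - 1)
            grid[row][col] += 1
    return grid


# Hypothetical (departure delay, arrival delay) pairs in minutes.
flights = [(0, -2), (5, 3), (60, 55), (62, 70), (-4, -8), (90, 20)]
grid = heat_map_counts(flights, (-30, 90), (-30, 90), 2, 2)
print(grid)
```

In this toy data the counts concentrate on the diagonal cells, the same pattern the figure shows: flights that leave on time tend to arrive on time, and flights that leave late tend to arrive late. The last sample flight falls outside the chosen range and is not counted.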

Hillview is still an early prototype, but we are excited by its capabilities. We decided to keep this as an open-source project, to make it readily accessible to the maximum number of users and contributors. You can find Hillview on GitHub, https://github.com/vmware/hillview.

If you are attending VMworld 2017 in Las Vegas and want to learn more about Hillview and other Machine Learning/Big Data projects, look for the Big Data for the 99% of Enterprises Panel Discussion in the “Future Trends” track.

Hillview Research Team

The Hillview Research Team is part of VMware Research. It consists of several researchers devoting part of their time to the interesting problem of data exploration over large datasets. The team members are Udi Wieder, Mihai Budiu, Parikshit Gopalan, and Lalith Suresh.
