When you think of summertime, images of the beach, sun, and vacation come into mind. Not for the members of the VMware Research Group, they think “INTERNS.” It is the time of year when top-notch computer science Ph.D. students join our team’s quest to cause disruption in the state of the field. As we are welcoming a new crop of interns, we thought we would take a moment to highlight some of the groundbreaking results from interns past. As you will see, the bar has been set high, very high.
In summer 2016, intern Heidi Howard of the University of Cambridge came to VMware to work on optimizing systems built on Paxos to make it more efficient and scalable. As she studied the Paxos algorithm, she became perplexed. She did not know why all quorums in Paxos had to intersect. It was evident to her that intersection was required across phases, but she just did not understand why they were necessary for the same phase. Heidi decided to take her observation to her mentor, Dahlia Malkhi who took the time to digest and understand the implications. In a word, Dahlia deemed the observation “brilliant.” Dahlia’s role as mentor took on a whole new direction at that point.
To grasp the significance of this observation you need to know that Paxos is an algorithm for the reliability of distributed systems first introduced in 1989. It has been studied, tweaked, and implemented for years. Paxos uses two phases, leader election and replication. The heart of Paxos requires a quorum of participants to agree and accept a proposed value from a leader. At this point, the leader deems the value committed. Heidi’s proposal to generalize Paxos such that majority quorums are only necessary across Paxos phases was named Flexible Paxos.
The goal for the remainder of the summer was to take the AHA moment and prove it. Working with her mentor and VMware Research intern, Sasha Spiegelman of Technion University, the trio implemented a prototype of the Flexible Paxos algorithm. The results showed that Flexible Paxos is reliable and efficient as both latency and throughput improved. It was significant enough to be archived in arXiv. So for your summertime reading list, dig into Flexible Paxos: Quorum intersection revisited.
Berkeley Churchill of Stanford University, VMware Summer Intern 2016, worked with Eric Schkufza to come up with a breakthrough toolchain insensitive technique to identify third party libraries in unmodified binaries. As often happens with discoveries, it is a project/experiment that does not succeed that leads to a giant step forward. Berkeley’s original project was in the x86 binary size reduction domain, and he was not satisfied with the results. Eric, serving as his mentor, was able to use his internal VMware product team relationships to see if there was anything useful or applicable to Berkley’s work. In the ensuing discussions with the product teams, a reoccurring pain point emerged, third party library identification. Re-energized, Berkeley repurposed his original project, and ProcHasher was born.
Techniques to be able to audit your real world software to determine which third party libraries are contained in the binary are theoretically simple. You build a database of all the possible libraries and do a lookup based on a hash code. The not so simple part is determining which hash function to use. Current techniques rely on using the machine code text in the binaries. The problem with this approach is its sensitivity to the toolchain. Every compiler with its multitude of options requires a new database entry for the same library. As you can imagine the size of the database dramatically increases if you try to build a database of all possible compiler/options combinations.
Berkeley realized when looking at the side effects a binary has on the heap, a hash function based on this property was both promising and toolchain insensitive. Now, this is the point in the tale where principled math takes over. Similar to techniques used for plagiarism detection you can treat successful hash lookups as evidence, not proof, that the component exists in the binary. With mathematical techniques used in the document classification domain, you can statistical classify the probability a given library is in a binary.
ProcHasher’s prototype implementation produced significantly more accurate results than existing techniques. The size of the database was also manageable. ProcHasher was 4.3X less sensitive to toolchain variation and achieved up to 31X improvement in recall, the percentage of components that were identified correctly. It is also proved to be robust against the presence of multiple components in a single binary.
We are looking forward to the disruptions this year’s interns will bring to the table. Check back in the Fall to see if they raised the bar even higher.