Computational biology datasets tend to be big. Really big. The raw genomic data for a single cell can run into the hundreds of gigabytes. And if you want to search through thousands of datasets looking for a gene or other snippet of DNA that you are interested in, then you can be staring down petabytes of data. The sheer size of computational biology datasets means that many problems can only be solved on beefy servers with lots of RAM and fast disks.
VMware Senior Researcher Rob Johnson has recently published a string of papers showing how new data structures can be used to cut this data down to size. By shrinking the data, Johnson’s data structures enable many problems that have typically required a server to now be solved on a laptop or even a high-end phone. “One of our motivating visions from the beginning was to eliminate the computational, storage, and memory obstacles to creating a portable device that doctors and researchers could take into the field to quickly analyze and diagnose unknown diseases and outbreaks,” says Johnson. “And our most recent paper shows how a national or international organization, such as NIH or WHO, could use our data structures to create and maintain an efficiently searchable archive of genomic datasets for researchers around the world.” A searchable archive of genomic data could enable researchers to quickly find instances of a gene they have identified, or to test whether a gene is correlated with some observable characteristic, such as susceptibility to cancer.
But how did VMware get involved in computational biology?
The connection is in the underlying data structures.
Johnson’s computational biology work builds on the counting quotient filter, a data structure he originally developed as a replacement for the widely used, decades-old Bloom filter. A Bloom filter is a very space-efficient way of representing a set of items. Applications can insert new items into a Bloom filter, and they can query whether an item is in the filter. Bloom filters save space by allowing a small false-positive rate, i.e. they will occasionally indicate that an item is in the set even though it is not. Many applications can tolerate a small false positive rate, and so Bloom filters are now frequently used in networking, databases, file systems, storage systems, and computational biology, among many other domains.
The counting quotient filter improves on the Bloom filter in several ways. For example it is smaller and faster than a Bloom filter, and it supports several operations that Bloom filters do not, such as counting the number of times that each item appears in the set. The counting quotient filter also scales out of RAM more efficiently than Bloom filters, since many counting quotient filter operations can be implemented using sequential IO, which is particularly efficient for data stored on disks and SSDs.
These improvements are important because applications often have to work around the performance and functionality limitations of Bloom filters, resulting in greater complexity and reduced performance. Counting quotient filters, on the other hand, enable streamlined, high-performance solutions to many problems that up to now have been solved by using Bloom filters in conjunction with other, less space-efficient data structures.
Johnson’s three recent papers demonstrate how counting quotient filters can pay off in the domain of computational biology.
In the first paper, Johnson describes Squeakr, a tool for performing k-mer counting, which is often the first step in any kind of analysis of raw sequencing data. State-of-the-art k-mer counting tools use Bloom filters to remove errors from the raw data, but resort to a conventional, large hash table to perform the actual counting. Squeakr exploits the counting ability of the counting quotient filter to perform both tasks faster and in less space. Squeakr uses as little as one third the RAM as comparable tools and, for some applications, one sixth the time.
The second paper describes deBGR, a tool for compactly representing the de Bruijn Graph of a collection of raw sequencing data. De Bruijn Graphs are widely used to perform more complex analyses of sequencing data; for example, to identify complex mutations and errors in raw data, to disentangle data from multiple sources, and two compare data from different organisms. deBGR shows how to use the counting quotient filter to represent not only the basic de Bruijn Graph, but also how often each snippet of DNA in the graph occurs in the underlying raw data set. This information can be essential to many of the advanced analyses mentioned above.
Finally, Johnson’s third paper shows how to use counting quotient filters to build an index for quickly searching through thousands of raw datasets, constituting terabytes of data. The ability to perform such searches is so essential that BLAST, an earlier tool for searching through genomic data, is one of the most cited papers in the entire scientific literature. But BLAST works only on assembled data, not raw data. And the process of converting raw data into assembled data is expensive, so the vast majority of raw data never gets assembled, making it inaccessible to BLAST-based searches. Johnson’s system, Mantis, overcomes this limitation. Mantis outperforms Bloom-filter-based solutions to this problem on several dimensions. For example, previous solutions had large error rates; Mantis has none. Mantis also builds the index in about one eighth the time, performs queries about 100 times faster, and uses 20% less space than the best Bloom-filter-based systems.
“I believe these projects show the universality of good algorithms and data structures work,” says Johnson. “Although we started out working on the quotient filter for databases, dedupe systems, and the like, we ended up making a big impact on computational biology, effectively building an entirely new tool-chain around a data structure originally created to solve systems problems.”
The papers describing this research have appeared in top venues for big data and computational biology research, including SIGMOD, RECOMB, ISMB, and Bioinformatics. This NSF-funded work is in collaboration with Fatemah Almodaresi, Michael Bender, Mike Ferdman, Prashant Pandey, and Rob Patro of Stony Brook University
- A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, and Rob Patro. SIGMOD 2017.
- Squeakr: an exact and approximate k-mer counting system. Prashant Pandey, Michael A. Bender, Rob Johnson, and Rob Patro. Bioinformatics, Volume 34, Issue 4.
- deBGR: an efficient and near-exact representation of the weighted de Bruijn graph. Prashant Pandey, Michael A. Bender, Rob Johnson, and Rob Patro. Bioinformatics, Volume 33, Issue 14.
- Mantis: A Fast, Small, and Exact Large-Scale Sequence Search Index. Prashant Pandey, Fatemeh Almodaresi, Michael A. Bender, Michael Ferdman, Rob Johnson, and Rob Patro. RECOMB 2018.