2013 Predictions for Big Data
Over the last few years we’ve seen a frenzy of interest and buzz around the area of Big Data. Beyond the hype, there is a solid base of growing use cases, which are taking center stage at most businesses. 2011 was the year of awareness. There was a great amount of sharing from the early core developers of the analytic platforms – showing the rest of the world the capabilities of the tools and platforms that had been developed for special-purpose, high-scale analytics. The big names at the core of open source analytics development include Facebook, eBay, LinkedIn, and Twitter – all blazing the trail with new approaches. These companies brought along with them a new and expanding interest in leveraging the same technologies for commercial use.
In 2012, I saw much more activity within core enterprise and business. A growing number of enterprises are already heavily invested in the use cases – but by volume, most customers now have some form of big data proof of concept underway. These proofs of concept typically start with a thesis of how competitive advantage can be gained through insight from the data. A proof of concept can quickly validate the theory and help sell further investment in the analytics platform, and it snowballs from there.
VMware made awesome progress this year in making vSphere a great platform for big data, with the mission of allowing all varieties of big data storage and analytics frameworks to run on a common virtual infrastructure platform. In support of this, we’ve teamed up with the Hadoop community to validate virtual infrastructure as a differentiated Hadoop platform and make the combination of Hadoop and virtualization better than the sum of its parts. The highlights for this year include:
- Our engineering team showcased performance results of Hadoop running on virtual infrastructure with little or no overhead.
- Through Project Serengeti, we were able to make Hadoop dead-simple to provision and scale, allowing us to deploy a new Hadoop cluster in less than 10 minutes.
- We enabled Hadoop to be elastic so that it can co-exist with other big-data workloads, allowing dynamic grow/shrink of Hadoop nodes in concert with other big-data workloads – ultimately creating elastic Hadoop on demand.
Now onto my predictions of how 2013 will unfold. Drumroll, please!
Prediction #5: We will all know at least one colleague who is bragging about a Petabyte stockpile of new data
We’re seeing a growing list of new sources of data, most of it machine generated. It’s estimated that in 2013, we’ll produce 4 Zettabytes (that’s 4 million petabytes) of new data. Over 80% of that will be unstructured – in the form of files, documents, media, logs, and other types. That will amount to a jaw-dropping 1 quintillion new objects.
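To make the scale concrete, here is a minimal sketch of the unit arithmetic behind those figures. It assumes decimal (SI) units, treats "over 80%" as exactly 80%, and reads "1 quintillion" as 10^18 objects making up the unstructured portion; the variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes.
PETABYTE = 10**15
ZETTABYTE = 10**21

# Estimate quoted in the post: 4 Zettabytes of new data in 2013.
new_data_bytes = 4 * ZETTABYTE
new_data_pb = new_data_bytes // PETABYTE

# "Over 80%" unstructured, taken here as exactly 80% (an assumption).
unstructured_bytes = new_data_bytes * 80 // 100
unstructured_pb = unstructured_bytes // PETABYTE

# Assumption: 1 quintillion (short scale, 10**18) objects hold the
# unstructured data, which implies an average object size in bytes.
objects = 10**18
avg_object_bytes = unstructured_bytes // objects

print(f"New data:          {new_data_pb:,} PB")
print(f"Unstructured:      {unstructured_pb:,} PB")
print(f"Avg. object size:  {avg_object_bytes:,} bytes")
```

Under these assumptions the average object is only a few kilobytes, which fits a picture dominated by small, machine-generated records rather than large media files.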
Current research shows a growth rate of between 50% and 60% per year for these new types of data. As an example, one customer I’ve been working with is building out an architecture to store every single key-click, mouse-over and application log event for every user for two years. This will give them tremendous insight into what their customers’ interests are, and allow them to do sophisticated targeted marketing. Keeping this data amounts to an estimated storage stockpile of 200 Petabytes!
The economics alone is a forcing function towards new storage architectures. If we store 1 Petabyte today in a regular storage system, that’s typically a storage investment of several million dollars. The challenges are the costs of storage, the administrative overhead of managing this much data, and bringing enough computation to the data in a way that we can reasonably filter, organize and analyze the data.
Prediction #4: ‘Delete’ will become a forbidden word
There’s definitely a mindset change about keeping data – a shift from storing important data to keeping ALL data. The problem is that we don’t know up front what questions we want to ask of the data, so if we don’t keep that information we are precluded from doing whatever insightful analysis could have been the “killer use case”. If we keep all data, then we keep all options open for interesting analytics. The data scientists can develop new theories and models, and go back in time to understand these new models.
I believe we’ll see a growing number of companies who follow the same path. They will set up sufficiently large-scale data stores and scale-out analytic tools so that keeping all data is affordable and practical.
Prediction #3: There will be a mad dash for software-defined storage
I predict we’ll see a flurry of new technologies and companies that will claim to offer different renditions of software-defined storage, aimed at storing this mass of data. The traditional model of whole-system storage hardware will change in light of the volume of data tilting heavily towards new data types, and a blurring of the line between compute and data.
Traditional data (customer records, transactions, history) just doesn’t grow at anywhere near the rate of the new data. Traditional enterprise data is only growing at 20% or so per year – but as we saw, the amount of new data being stored is growing on the order of 50% year over year. This means that there will be two key shifts within the storage industry – a move towards more commodity-based storage that can potentially take the place of traditional storage, and a new set of high-scale storage architectures aimed at storing all this new data.
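The gap between those two growth rates compounds quickly. The sketch below projects both categories forward, assuming constant annual rates of 20% and 50% and, purely for illustration, equal starting sizes of 1 unit; the `project` helper is hypothetical.

```python
def project(start, annual_growth, years):
    """Return the size at the end of each year, beginning with year 0."""
    return [start * (1 + annual_growth) ** y for y in range(years + 1)]

YEARS = 5
traditional = project(1.0, 0.20, YEARS)  # roughly 20% per year
new_data = project(1.0, 0.50, YEARS)     # roughly 50% per year

for year, (t, n) in enumerate(zip(traditional, new_data)):
    # The ratio shows how far new data pulls ahead of traditional data.
    print(f"year {year}: traditional x{t:.2f}, new x{n:.2f}, ratio {n / t:.2f}")
```

With these assumed rates, new data is about three times the size of traditional data after five years, even when both start out equal.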
The chase will come from multiple dimensions:
- Software-defined SAN, to provide cheaper places to store blocks. A software-defined storage approach means that we can use software to provide highly reliable, feature-rich storage services on commodity hardware. There are a number of startups entering this space with either pure-software or software-hardware appliance combinations, and as announced at VMworld in 2012, VMware is also developing technology in this space. These storage solutions will provide moderate scale storage, and be aimed at providing the storage feature set (things like data recovery, replication, policy-based placement). They will make heavy use of SSD and flash to deliver high IOPS performance.
- Software-defined NAS, in particular scale-out NAS. Software-defined NAS can run on commodity servers, and is able to scale up to the order of 10 Petabytes. I expect the NAS market will experience a resurgence in the big-data space. Scale-out NAS architectures meet the capacity requirements of big-data by clustering many nodes together to provide multi-petabyte configurations. Scale-out NAS offerings will have an advantage in their traditional access methods – making it easy to ingest and export data from the system through standard mechanisms such as NFS. Scale-out NAS combined with virtualization means that you can bring compute to the data, making it a viable platform for data-parallel workloads such as big-data.
- Software-defined object stores. These are radical new object stores that claim to scale to 100 Petabytes, with interesting cloud-replication capabilities. These storage architectures trade off some of the typical constraints that limit scale, including POSIX semantics and consistency. For big-data, we often don’t need those fine-grained semantics, but we do need to store data at scale. It’s clear that Hadoop’s HDFS is one strong player in this space with the ability to cluster several hundred servers together. A growing number of technologies exist in this space and will be positioned as commercial solutions to big-data storage, with a strong emphasis on scale and multi-cloud replication.
Prediction #2: The default infrastructure for Big Data will change
We should expect a tipping point in network infrastructure, with 10GbE networks and high-bandwidth switch topologies. Falling costs will allow the majority of new big-data installations to take advantage of 10GbE, resulting in a different set of assumptions about optimal big-data systems. Cross-sectional bandwidth within a rack of 1 Tbit/s will ease the focus on data locality, and put the emphasis more on designing storage topologies for availability. In 2013, data and compute can be anywhere in a switch domain with little or no performance difference. Beyond 2013 we’ll see more interesting flat networks evolve, which will even further relax the locality requirement.
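A rough back-of-the-envelope calculation illustrates why faster links relax the locality requirement. This sketch assumes an idealized link running at full line rate with no protocol overhead or contention, and a hypothetical 1 TB data set; real throughput will be lower.

```python
TERABYTE = 10**12  # decimal terabyte, in bytes

def transfer_seconds(data_bytes, link_gbit_per_s):
    """Idealized time to move data over a link at full line rate."""
    bits = data_bytes * 8
    return bits / (link_gbit_per_s * 10**9)

t_1g = transfer_seconds(TERABYTE, 1)    # 1GbE link
t_10g = transfer_seconds(TERABYTE, 10)  # 10GbE link

print(f"1 TB over 1GbE:  {t_1g / 3600:.1f} hours")
print(f"1 TB over 10GbE: {t_10g / 60:.1f} minutes")
```

Moving a remote terabyte drops from a couple of hours to under a quarter of an hour in this idealized case, which is the kind of shift that makes remote data far more tolerable.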
Additionally, the decreasing cost of flash and the increasing availability of software to take advantage of multiple tiers of storage will mean that flash will be an integral part of every storage architecture. Hot blocks will be placed automatically on SSD, and writes will be buffered by SSD to give much lower latencies. In some cases, entire application data sets will be moved to flash-based storage tiers.
Prediction #1: The focus on big data use cases will shift heavily towards real-time
Businesses are starting to realize they now have a significant and new competitive advantage with the ability to make real-time decisions based on their own data.
A few of the top use cases include:
- Personalized, targeted marketing (such as new retail): Rather than just acting on buying patterns, retailers will be able to mash up large amounts of historical data and recent real-time events (what did you buy just a few minutes ago, where are you, where did you come from, what did you tweet?) and deliver customized offers targeted accurately at buyers’ needs.
- The predictive enterprise: Real-time decisions replace age-old processes for running the business and become the enterprise “brain.” Stock and shipping calculations become dynamic and adaptive, responding to predicted trends and swings based on real-time inputs. Stocks can be reduced, and pricing can become dynamic based on spot markets and needs.
- Automatic failure analysis and predictive maintenance through a closed feedback loop from embedded sensors and metrics: The same technology that is used for elite cases such as nuclear reactor monitoring is being applied to everyday uses – your car and home appliances become part of a predictive failure analysis system and will proactively alert for upcoming situations requiring attention. In addition, enterprises are increasingly looking to use this technology to improve performance and availability of their applications (in fact, VMware’s vCenter Operations is a good example of this type of analytics).
As a result, in 2013 I predict we’ll see an emergence of the frameworks and technologies required to implement these systems. The significant components will include:
- Real-time in-memory databases: These databases will be able to ingest the extreme rates of events that come from sources such as social media (Twitter and Facebook feeds), machine-generated metrics, and large-scale user-driven interactions. These databases are able to incorporate this real-time data with learned behavior, and react in real time. Examples include Spark and Shark from UC Berkeley and GemFire from VMware.
- Frameworks for programming event-driven actions: Examples include the Storm project from Twitter and some new entrants based on NoSQL, such as Continuuity.
- Frameworks for implementing machine learning: The programming models for machine learning typically involve iterating over steps in data in rapid succession, where subsets of the data reside in memory. Platforms such as Spark from UC Berkeley provide the resilient distributed datasets and iterative programming models that are needed.
Almost every application being built to incorporate these techniques is hand-rolled. In 2013, we’ll see startups emerging with new PaaS-like frameworks to aid in the development of these real-time applications.
As the need shifts from a monolithic map-reduce powered platform to a hybrid of real-time, batch and machine learning, there will be a strong need for running multiple framework types on the same cluster. We believe that virtualization will play a central role in creating that common distributed platform, and we see a growing number of enterprises in 2013 standardizing on virtualization as the platform for their big-data solutions. I can’t wait to see how all this plays out next year!