New Year, New Assignment: Data
Cloud was so yesterday; you have to be into big data to be hip ;-). So…
I’ve been spending a growing amount of time in the data space. This has been mostly as a strategic planning activity for VMware’s product investments as we extend into the data space, and in 2012, I’ll take the lead on our big-data development efforts. This is an important and growing area for the industry as it drives significant new value to business. We see a meaningful role for cloud and virtualization as this corner of IT begins to take real shape.
In this blog I will share highlights of our team’s work, including some exciting new developments around Hadoop.
The Path to Simplification and Optimization
Over the last decade we’ve seen a significant reduction in complexity of computing systems. At the infrastructure level, we’ve leveraged server consolidation and standard cloud infrastructure to achieve significant cost savings and incredible advances in the speed and agility of delivering applications. We all now expect infrastructure to be abstracted and delivered as a service.
More recently, we’ve seen the same pattern applied further up the stack to tackle the complexity of the operating system and application server. We’ve replaced the need for hand-weaving complex stacks with platform as a service (PaaS), allowing rapid delivery of new applications without having to stop and think about installing Linux, application servers or network load balancers. Applications can be pushed from code to production through a single command. The Cloud Foundry PaaS is a leading example of this layer of simplification.
We’re now seeing the need for the same pattern at the data level. The opportunity is to simplify the deployment and management of data systems.
The Data Renaissance
We are undergoing a significant data management reform. We’re seeing a new set of choices in data management systems, new cloud-oriented data topologies, and a quantum shift in data-oriented infrastructure.
It was once true that the relational database was the one place to store all application data, but this is rapidly becoming a false assumption. Rather, a new breed of data store that is better suited to today’s applications is becoming popular.
At the same time, we are drastically changing the way we develop, deploy and manage our applications. We expect to deploy our applications in seconds rather than months, and have a choice of application runtime destinations — in the public and private cloud.
As developers, we want choice of data models and APIs, without being told by the DBA and infrastructure team that we must store data in their system of preference. As application architects, we want our applications to run in the best cloud destination, free from the constraint that an application can only run where its data is.
Additionally, infrastructure is getting faster and cheaper, opening options for significantly more agile, scalable and economical alternatives to traditional SAN and NAS mainstays. As high throughput and big-data become competitive differentiators, we are compelled to find storage architectures with one or two orders of magnitude better price performance.
Problems with previous data systems
Several key drivers are fueling the data renaissance:
- Flexible data models: Developers of cloud applications find rigid schemas too constraining when deploying agile applications where change is a constant. Additionally, the constrained normalized form of relational databases makes it more difficult for programmers using modern dynamic languages to store simple objects in, and retrieve them from, databases.
- Scalability: Traditional relational data systems are inherently hard to scale — a single vertically scaled system requires a significant amount of effort and investment in expensive high performance disk hardware to increase throughput. Applications that start small on relational systems quickly hit a performance ceiling, requiring specialist tuning and cost to resolve. Customers need scalable data solutions, so they can start small and grow to extreme levels.
- Cloud vs. Data: Modern applications are often consumer or public facing, and must be designed to scale across multiple geographies so that they can be accessed locally and can tolerate failures of networking or entire data centers. This places a significant burden on the data architecture, since traditional data systems can only reside in one location. Layering replication on top of existing data solutions is complex and error prone, and only partially solves the problem. Client devices have also shifted away from browsers to multiple semi-connected devices such as iPhones and iPads, and application data often needs to extend to these end-user devices through replication and multi-geo access. Finally, applications should be portable to multiple clouds: through PaaS, it is now possible to run an application on different clouds, move an application between clouds, or federate an application to co-exist on more than one cloud. Today’s data systems don’t allow this, since data is tethered to a single location, preventing mobility.
- Analytics: The scale and rate of data required in today’s applications far exceed the capabilities of traditional mainstay BI/DW systems. To cater for the volumes and throughput of these new analytic workloads, a variety of new distributed data solutions are appearing with different flavors of scale, speed and throughput.
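To make the flexible-data-models point above concrete, here is a minimal sketch in plain Python standing in for a document store (the record shapes and field names are invented for illustration). A schema-less store absorbs records of different shapes, where a fixed relational schema would require a migration:

```python
import json

# Two records for the same logical collection: a schema-less store
# accepts both, while a fixed relational schema would need a migration
# to add the new "tags" field. (Field names here are hypothetical.)
user_v1 = {"id": 1, "name": "Ada"}
user_v2 = {"id": 2, "name": "Grace", "tags": ["analytics", "hadoop"]}

store = {}  # stand-in for a document store, keyed by id

def save(doc):
    store[doc["id"]] = json.dumps(doc)  # serialize whatever shape arrives

def load(doc_id):
    return json.loads(store[doc_id])

save(user_v1)
save(user_v2)
assert "tags" not in load(1)  # the older record keeps its old shape
assert load(2)["tags"] == ["analytics", "hadoop"]
```

The point is that the application, not the schema, owns the shape of the data; the store simply persists what it is given.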
One size no longer fits all
The big takeaway is that there is no longer a one-size-fits-all approach to data.
This is great for developers: the breadth of data-system choices is enabling many exciting new systems to be built.
This, however, is having the effect of increasing complexity for operations. In the traditional enterprise, IT often enforced a single database so that the data management team could focus on running that one system well, at high efficiency.
Like it or not, we’re now faced with operating multiple data systems.
Additionally, those new data systems are typically much more complex than the old single-node databases. Almost every new technology is horizontally scaled, uses local disks for cost efficiency, and has some form of partitioning and load balancing that is database specific.
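As a sketch of the database-specific partitioning these horizontally scaled systems perform, here is a minimal hash-sharding example (the node names are made up; real systems typically use consistent hashing so that adding a node doesn’t reshuffle every key):

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(key: str) -> str:
    """Map a key to a node with a stable hash (simple modulo sharding)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Every client computes the same placement, so reads and writes for a
# given key always land on the same node, with no central lookup.
placement = {key: node_for(key) for key in ["user:1", "user:2", "order:9"]}
```

Each database implements its own variant of this idea, with its own rebalancing and replication rules, which is exactly why operating several of them at once is hard.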
The challenge and opportunity looking forward is to simplify the data management stack. I’m a believer that this isn’t going to happen by someone inventing the magic universal database, rather that we can abstract some commonality in the layers around those systems.
Why Big Data is Important
I see three big trends that are driving the rapid adoption of big data systems.
Firstly, the volume of data being generated is growing at an increasing rate; we’re now seeing roughly 60 percent year-over-year growth in the volume of data generated. It’s now predicted that by the year 2030 we’ll store a yottabyte of data.
The second big trend is the unquestionable amount of value that is being realized from this data.
We’re seeing a whole new field unfold around data science, and the emergence of a new profession: the data scientist. These are the people who know how to extract significant value from the data, which is fueling the aggressive investments in these systems. A few good examples of systems that generate value include:
- Sentiment Analysis: Mining social data feeds to infer information. Take a look at mombo.com. By running analytics in real time on the Twitter firehose and analyzing the commentary, they are able to predict the ratings of a movie with reasonable accuracy before it debuts. Sentiment analysis will drive huge dollars; it’s rapidly becoming a standard and instant way of understanding how a new product is perceived in the market.
- Fraud and intrusion detection. The throughput and scale of new big data systems allows a new level of monitoring and intelligent analysis in real-time. Many new fraud detection systems are being enabled.
- Health and Medical: We’re seeing many new developments of medical diagnosis powered by big-data. Doctors will be able to rapidly mine the symptoms of millions of similar patients and significantly improve the speed and accuracy of diagnosis of complex situations.
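As a toy illustration of the sentiment-analysis idea above, here is a word-list scorer (the word lists are invented for illustration; production systems use trained statistical models rather than fixed lists):

```python
# Invented word lists for illustration; real systems use trained models.
POSITIVE = {"great", "love", "awesome", "hilarious"}
NEGATIVE = {"boring", "awful", "hate", "flop"}

def sentiment(text: str) -> int:
    """Positive-word count minus negative-word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("love this awesome movie"))  # 2
print(sentiment("boring awful flop"))        # -3
```

Apply a scorer like this to millions of tweets per hour and aggregate the scores per movie, and you have the skeleton of a firehose-scale sentiment pipeline.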
Finally, the amount of compute resource that can be assembled for big-data systems is growing rapidly, thanks to the availability of software that can take advantage of commodity hardware.
Traditional infrastructure to store data is expensive, not very scalable, and also single-homed. In many cases, storage capacity requirements are large, putting a new emphasis on the cost and performance economics of storing big-data. The current cost of storing data on SAN/NAS is about $1-$10/gigabyte. The analytics systems being built at scale are often in the $0.05 to $0.50/gigabyte range.
BI/DW systems are typically deployed on big iron, which can approach $40,000 per CPU. Scale-out big-data systems can run on clusters of commodity hardware, often at around $1k per CPU. This is one of the reasons Hive (a SQL engine optimized for large queries) on Hadoop is becoming so popular. It might be less efficient than a SQL engine running on big iron, but the hardware cost is so much lower that it’s far more economical. In addition, the scale available is much higher, since potentially thousands of nodes can be applied to a single SQL query.
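A quick back-of-envelope calculation, using only the rough cost figures quoted above (the article’s ranges, not vendor quotes), shows why the economics are so compelling:

```python
# Back-of-envelope storage economics using the rough figures cited above.
PETABYTE_GB = 1_000_000

san_cost  = 1.00 * PETABYTE_GB  # low end of $1-$10/GB on SAN/NAS
scale_out = 0.05 * PETABYTE_GB  # low end of $0.05-$0.50/GB scale-out storage

print(f"1 PB on SAN/NAS: ${san_cost:,.0f}")              # $1,000,000
print(f"1 PB scale-out:  ${scale_out:,.0f}")             # $50,000
print(f"cost ratio:      {san_cost / scale_out:.0f}x")   # 20x
```

Even at the low end of both ranges, that is a 20x difference per petabyte, and the gap widens at the high end of the SAN/NAS range.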
Unified Analytics Cloud
In the next few posts, I will walk through a layered approach to building a multi-tenant big-data stack on a common infrastructure.
In thinking about stacks for big data, we find that there are four levels of functionality:
- Cloud Infrastructure: Common infrastructure for running different types of data systems and application stacks. A common infrastructure management layer should be used here; Hadoop or virtualization are common approaches.
- Cloud Data Platform: Common platform for managing multiple Databases. VMware’s Data Director is a good example.
- Data Systems: Systems for structuring, querying and storing data. HBase, Hadoop’s HDFS, Cassandra and VMware’s GemFire are good examples.
- Analytics Developer frameworks: Programming frameworks for data intensive and data-parallel applications. In the analytics community, the popular frameworks are Hadoop, MPI, SQL, R and Python.
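For a flavor of the data-parallel programming model these frameworks expose, here is the canonical MapReduce word count, sketched in single-process Python (Hadoop runs the same map and reduce phases distributed across a cluster):

```python
from collections import Counter
from functools import reduce

# The canonical MapReduce "word count": the map phase emits per-record
# counts, and the reduce phase merges them. Hadoop distributes these
# same two phases across many nodes.
def map_phase(line: str) -> Counter:
    return Counter(line.split())

def reduce_phase(a: Counter, b: Counter) -> Counter:
    return a + b

lines = ["big data big cluster", "data data"]
totals = reduce(reduce_phase, map(map_phase, lines), Counter())
print(totals["data"], totals["big"])  # 3 2
```

Because both phases operate on independent chunks, the same program scales from a laptop to thousands of nodes without changing its logic.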
There is a significant opportunity to reduce complexity by making these levels as generic as possible, which is often most practically achieved from the bottom up. For example, building a common shared infrastructure for Hadoop, HBase, MPI and your favorite big-SQL database will allow much greater use of the cluster, and make it possible to deploy analytics sandboxes for users and data scientists much more quickly.
Stay tuned for more on this blog… I’ll expand on what’s possible at these layers, with more detail about which technologies help at each level.