Saturday, August 18, 2012
The last 16 years or so of my professional life have been dedicated to working on problems (and solutions!) in transactional middleware - by this, I mean systems that provide strong consistency guarantees: reliable queueing, distributed 2PC engines, higher-level quality of service guarantees for lower-level protocols (IIOP, RMI, SOAP, etc.), and workflow engines in particular. Much of our thinking was influenced by the world of the relational database, and as a consequence the relational model with full consistency guarantees was the normal working model. For a sense of how strong this tendency was, check out Jim Gray’s paper “Queues are Databases”.
In the early 2000s, we struggled with relaxing those guarantees in various ways to facilitate transaction-like behavior in internet contexts. Even in the simple cases that superficially appeared to succeed, I think we failed: for example, BPEL has SAGA-like behavior for compensations, but the truth is very few people use it. Perhaps one of the most interesting exercises in that area was the Business Transaction Protocol: an unfortunate development in the standards world, but a revealing exercise in that it showed that many problems involving loosely coupled systems could not be handled with conventional solutions. But it was in group communication protocols that I really became convinced that the problem of scale required re-thinking many fundamental assumptions: there, gossip protocols and other weakly consistent models began to find more applications. Clustering techniques that valued scale and speed over coordinated transactions were becoming important, even if their applications looked suspiciously like LINDA systems at times. And it was hard to ignore some of the more interesting work coming out of Ken Birman’s group (and subsequently introduced at Amazon: there is a reason AWS was an early success).
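To make the contrast with coordinated transactions concrete, here is a minimal sketch of why gossip scales: a push-style epidemic simulation in Python (not any particular protocol from the literature; node counts and the single-originator setup are my own illustrative assumptions). Each round, every node that has heard an update forwards it to one randomly chosen peer - no coordinator, no 2PC vote, and dissemination still completes in roughly logarithmic rounds.

```python
import random

def gossip_rounds(n_nodes, seed=0):
    """Simulate push-style epidemic gossip: each round, every node that
    already knows the update forwards it to one random peer.
    Returns the number of rounds until all nodes are informed."""
    random.seed(seed)
    informed = {0}            # node 0 originates the update
    rounds = 0
    while len(informed) < n_nodes:
        for node in list(informed):
            peer = random.randrange(n_nodes)
            informed.add(peer)
        rounds += 1
    return rounds

# Dissemination typically completes in O(log n) rounds - the reason
# gossip scales where coordinated, transactional broadcast does not.
print(gossip_rounds(1000))
```

The trade-off, of course, is that nodes converge eventually rather than atomically - exactly the kind of relaxed guarantee the transactional world resisted.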
Meanwhile, we were seeing conventional middleware pushed to the limit on both volume and throughput. Large-scale relational databases have continued to find ways to scale to a point, but with increasingly expensive hardware (and license) requirements. In working on very high throughput scenarios in workflow systems (hundreds of millions to billions of processes a day), I was becoming convinced that alternative infrastructure models would become not just more important, but critical. Those alternatives had to provide scale at a reasonable expense. Before leaving Oracle I was looking at ways to layer BPEL over HBase and HDFS, as opposed to the conventional use of the relational database: my working model was to eliminate most of the database requirements as an option and to fundamentally rethink process analytics by using Hadoop as a wholesale replacement for BAM style solutions. As a general pattern, some form of stream processing over Hadoop looks likely to emerge as a fundamental pattern.
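The analytics half of that working model can be sketched in miniature. The idea is that workflow events land as an append-only log in HDFS and a batch map/reduce pass computes the aggregates a BAM dashboard would otherwise query live. Everything below is illustrative - the event tuples, state names, and function names are my own stand-ins, and this is plain Python rather than actual Hadoop code - but the map/reduce shape is the point:

```python
from collections import defaultdict

# Hypothetical event records: (process_id, state) pairs as they would
# land in HDFS from a high-throughput workflow engine.
events = [
    ("p1", "STARTED"), ("p2", "STARTED"), ("p1", "COMPLETED"),
    ("p3", "STARTED"), ("p2", "FAULTED"), ("p3", "COMPLETED"),
]

def map_phase(records):
    # Map: emit a (state, 1) pair for every event seen.
    for _, state in records:
        yield (state, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts per state - the batch equivalent of a
    # BAM-style "processes by status" query.
    counts = defaultdict(int)
    for state, n in pairs:
        counts[state] += n
    return dict(counts)

print(reduce_phase(map_phase(events)))
# {'STARTED': 3, 'COMPLETED': 2, 'FAULTED': 1}
```

Because the log is immutable and the aggregation is embarrassingly parallel, the same job runs unchanged whether the input is six events or six billion - which is what made the Hadoop substitution plausible for the throughput numbers above.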
At some point, I came to terms with the fact that parallel, shared-nothing architectures were the key to much of the future of data management. So today I am no longer spending much time on strictly transactional systems. Those systems are well understood and have earned their place as a “permanent” part of the IT landscape - they will continue to be important components of applications. But they will not provide the architecture or engine to deal with the vast amount of data that is being generated either within those applications or stored in other systems - from unstructured content to log files to data feeds. The requirements for scale, for capacity and - this is just as important - low cost cannot be met by the relational database model. The new generation of “cloud scale” technologies that are being built today are the future for most of that work. It is becoming increasingly obvious that Hadoop will provide the building blocks for the industry to store, manage and process these workloads.
It may seem obvious, but let me point out a few simple answers to “Why Hadoop in particular?” Companies are going to use what is available and has been shown to work, as long as it is an open platform. Hadoop can scale out. It can also scale down. It is free and open source. It can work with multi-structured data and loose schema constraints. It also runs on relatively inexpensive hardware. It has critical mass in uptake and adoption. No one company owns it - which is critical: as the industry rallies around it, the community itself will be strengthened along with the platform. My bet is that Hortonworks is right: 50% of the enterprise’s data will be processed in Hadoop in the next 5 years.