Saturday, May 31, 2014

Memories of the way we were...

The fascinating thing about Hadoop is the obviousness of its evolutionary needs. For example, MapReduce coupled with reliable scale-out storage had a powerful - even revolutionary - effect for organizations with both large volumes of data and multi-structured data. Out of the gate, Hadoop unlocked data "applications" that were, for all intents and purposes, unimplementable before. At the same time, it didn't take much imagination to see that separating the compute model from resource management would be essential for future applications that did not fit well with MapReduce itself. It took a lot of work and care to get YARN defined, implemented and hardened, but the need for YARN itself was fairly obvious. Now it is here, and Hadoop is no longer just about "batch" data processing.

Note, however, that it takes a lot of work to make the evolutionary changes available. In some cases, bolt-on solutions have emerged to fill the gap. For key-value data management, HBase is a perfect example. Several years ago, Eric Baldeschwieler was pointing out that HDFS could have filled that role. I think he was right, but the path to getting "HBase-type" functionality implemented via HDFS would have been a very long one indeed. In that case, the community filled the gap with HBase, and it is being "back-integrated" into Hadoop via YARN in a way that will make for a happier coexistence.

Right now we are seeing multiple new bolt-on attempts to add functionality to Hadoop. For example, there are projects to add MPP databases on top of Hadoop itself. It's pretty obvious that this is at best a stopgap again - and one that comes at a pretty high price. I don't know of anyone who seriously thinks that a bolt-on MPP is ultimately the right model for the Hadoop ecosystem. Since the open source alternatives look to be several years away from being "production ready", that raises an interesting question: is Hadoop evolution moving ahead at a similar or even more rapid rate to provide a native solution - a solution that will be more scalable, more adaptive, and more open to a wider range of use cases and applications, including alternative declarative languages and compute models?

I think the answer is yes: while SQL on Hadoop via Hive is really the only open source game in town for production use cases - and it's gotten some amazing performance gains in the first major iteration on Tez that we'll talk more about in the coming days - it's clear that the Apache communities are beginning to deliver a new series of building blocks for data management at scale and speed: Optiq's cost-based optimizer; Tez for structuring multi-node operator execution; ORC and vectorization for optimal storage and compute; HCat for DDL. But what's missing? Memory management. And man, has it ever been missing - that should have been obvious as well (and it was - one reason that so many people are interested in Spark for efficient algorithm development).
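
To make the "vectorization for optimal storage and compute" point a little more concrete, here is a toy, hypothetical sketch - not Hive or ORC code, and every name and constant in it is invented for illustration - contrasting row-at-a-time evaluation with batched evaluation over a column vector:

```java
// Toy sketch of why vectorized execution helps: instead of interpreting one
// row at a time, an operator works on a batch of column values, which keeps
// the inner loop tight and cache/CPU friendly. This is NOT Hive or ORC code;
// the class, methods, and batch size below are invented for illustration.
public class VectorizationSketch {

    static final int BATCH_SIZE = 1024;   // a common vectorized batch size

    // Row-at-a-time style: touch a whole row per value examined.
    static long sumPositiveRowAtATime(long[][] rows, int col) {
        long sum = 0;
        for (long[] row : rows) {
            if (row[col] > 0) {
                sum += row[col];
            }
        }
        return sum;
    }

    // Vectorized style: the same work over a single column vector for a batch.
    // A real engine would also track null masks and selection vectors.
    static long sumPositiveVectorized(long[] column, int size) {
        long sum = 0;
        for (int i = 0; i < size; i++) {
            long v = column[i];
            sum += (v > 0) ? v : 0;        // simple, branch-light inner loop
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] rows = new long[BATCH_SIZE][2];
        long[] column = new long[BATCH_SIZE];
        for (int i = 0; i < BATCH_SIZE; i++) {
            rows[i][0] = i - 512;          // half negative, half positive
            column[i] = rows[i][0];
        }
        // Both produce the same answer; the difference is data layout and
        // per-value overhead, which is where vectorization wins.
        System.out.println(sumPositiveRowAtATime(rows, 0));
        System.out.println(sumPositiveVectorized(column, BATCH_SIZE));
    }
}
```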

What we've seen so far has been two extremes when it comes to supporting memory management (especially for SQL) - all disk and all memory. An obvious point here is that neither is ultimately right for Hadoop. This is a long-winded intro to two interrelated pieces by Julian Hyde and Sanjay Radia unveiling a model, being introduced across multiple components, called Discardable In-memory Materialized Query (DIMMQ). Once you see this model, it becomes obvious that the future of Hadoop for SQL - and not just SQL - is being implemented in real time. Check out both blog posts (a rough sketch of the "discardable" idea follows the links):

http://hortonworks.com/blog/dmmq/

http://hortonworks.com/blog/ddm/
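
For a flavor of what "discardable" means, here is a minimal, hypothetical sketch - again, not the actual DIMMQ design or API described in those posts; the class and names are invented - of a materialized query result that stays in memory only while there is room for it, and is transparently recomputed if the runtime has thrown it away:

```java
import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Hypothetical sketch of a discardable in-memory materialized query result.
// The JVM may clear a SoftReference under memory pressure, so the cached
// copy is purely an optimization: if it has been discarded, we fall back to
// re-running the query against the base data. This illustrates the idea,
// not the DIMMQ implementation described in the linked posts.
public class DiscardableMaterializedResult<T> {

    private final Supplier<T> query;       // how to (re)materialize the result
    private volatile SoftReference<T> cached = new SoftReference<>(null);

    public DiscardableMaterializedResult(Supplier<T> query) {
        this.query = query;
    }

    public T get() {
        T value = cached.get();
        if (value == null) {                // never materialized, or discarded
            value = query.get();            // recompute from the base data
            cached = new SoftReference<>(value);
        }
        return value;
    }

    public static void main(String[] args) {
        DiscardableMaterializedResult<long[]> topSellers =
            new DiscardableMaterializedResult<>(() -> {
                // Stand-in for an expensive aggregation over data on disk.
                long[] result = new long[10];
                for (int i = 0; i < result.length; i++) {
                    result[i] = (long) i * i;
                }
                return result;
            });

        // First call materializes; later calls reuse the in-memory copy
        // unless the JVM has discarded it to relieve memory pressure.
        System.out.println(topSellers.get()[9]);
        System.out.println(topSellers.get()[9]);
    }
}
```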

