I normally avoid anything that smacks of competitive discussion in what I consider to be a space for personal reflection. So while I want to disclose that I am not disinterested, professionally, in the points I am making, my main interest is to frame some architecture points that I think are extremely important for the maturation and success of the Hadoop ecosystem.
A few weeks back, Mike Olson of Cloudera spoke at Spark Summit on how Spark relates to the future of Hadoop. The presentation can be found here:
In particular, I want to draw attention to the statement made at 1:45 in the presentation that describes Spark as the "natural successor to MapReduce". It becomes clear very quickly that what Olson is talking about is batch processing. This is fascinating, as everyone I've talked to immediately points out one obvious thing: Spark isn't a general-purpose batch processing framework. That is not its design center. The whole point of Spark is to enable fast data access and interactivity.
The guys that clearly "get" Spark, unsurprisingly, are Databricks. In talking with Ion and company, it's clear they understand the use cases where Spark shines: data-scientist-driven data exploration and algorithm development, machine learning, and the like, things that take advantage of the in-memory capabilities and speed of the framework. And they have offered an online service that allows users to rapidly extract value from cloud-friendly datasets, which is smart.
Cloudera's idea of pushing SQL, Pig and other frameworks onto Spark is actually a step backwards. It is a proposal to recreate all the problems of MapReduce 1, and it fails to understand the power of refactoring resource management away from the compute model. Spark would have to reinvent and mature models for multi-tenancy, resource management, scheduling, security, scale-out, etc. that are frankly already there today for Hadoop 2 with YARN.
The announcement of an intent to lead an implementation of Hive on Spark got some attention. This was something that I looked at carefully with my colleagues almost 2 years ago, so I'd like to make a few observations on why we didn't take this path then.
The first was maturity: of the Spark implementation, of Hive itself, and of Shark. Candidly, we knew Hive itself worked at scale but needed significant enhancement and refactoring, both for new features on the SQL front and to work at interactive speeds. And we wanted to do all this in a way that did not compromise Hive's ability to work at scale, for real big data problems. So we focused on the mainstream of Hive and on the development of a Dryad-like runtime for optimal execution of the operators in SQL physical plans, in a way that meshed deeply with YARN. That model took the learnings of the database community and of scale-out big data solutions and built on them "from the inside out", so to speak.
Anyone who has been tracking Hadoop for, oh, the last 2-3 years will understand intuitively that the right architectural approach needs to be based on YARN. What I mean is that query execution must, at the query task level, be composed of tasks that are administered directly by YARN. This is absolutely critical for multi-workload systems (and is one reason why a bolt-on MPP solution is a mistake for Hadoop: it is at best a tactical model while the system evolves). This is why we are working with the community on Tez, a low-level framework for enabling YARN-native, domain-specific execution engines. For Hive-on-Tez, Hive is the engine and Tez provides the YARN-level integration for resource negotiation and coordination for DAG execution: a DAG of native operators analogous to the execution model found in the MPP world. (When people compare Tez and Spark, they are fundamentally confused. Spark could, for example, be run on Tez for a much deeper integration with Hadoop 2.) This model allows the full range of use cases, from interactive to massive batch, to be administered in a deeply integrated, YARN-native way.
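To make the task-level point a little more concrete, here is a minimal sketch in Python. It is a toy model, not the YARN or Tez API: `ToyResourceManager`, `run_dag` and the task names in `query_dag` are all hypothetical, invented purely for illustration. What it tries to show is the shape of the idea: the query is a DAG of tasks, and every individual task has to obtain its container from a shared scheduler rather than from a pool owned by the query engine itself.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide scheduler (NOT the YARN API)."""
    total_containers: int
    in_use: int = 0
    grants: list = field(default_factory=list)  # log of (workload, task) grants

    def allocate(self, workload: str, task: str) -> bool:
        # Grant a container only if the shared cluster has capacity left.
        if self.in_use >= self.total_containers:
            return False
        self.in_use += 1
        self.grants.append((workload, task))
        return True

    def release(self) -> None:
        self.in_use -= 1


def run_dag(rm: ToyResourceManager, workload: str, dag: dict) -> list:
    """Run tasks in dependency order; each task asks `rm` for its own container.

    `dag` maps a task name to the set of task names it depends on.
    """
    executed = []
    for task in TopologicalSorter(dag).static_order():
        if not rm.allocate(workload, task):
            raise RuntimeError(f"no capacity for task {task!r}")
        try:
            executed.append(task)  # real operator work would happen here
        finally:
            rm.release()  # hand the container back to the shared scheduler
    return executed


# Hypothetical physical plan for a join-then-aggregate query.
query_dag = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

rm = ToyResourceManager(total_containers=2)
executed = run_dag(rm, "hive-query-1", query_dag)
```

Because every grant passes through the one scheduler, its log (`rm.grants`) sees each task of each workload, which is the property that makes multi-tenant scheduling across interactive and batch work possible. A real system would also run independent tasks concurrently and queue rather than fail when capacity runs out; the sketch leaves both out for brevity.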
Spark will undoubtedly mature into a great tool for what it is designed for: in-memory, interactive scenarios, generally script driven. It will likely grow to subsume new use cases we aren't anticipating today. It is, however, exactly the wrong choice for scale-out big data batch processing in anything like the near term; worse still, returning to a monolithic general-purpose compute framework for all Hadoop models would be a huge regression and is a disastrously bad idea.