Organizations are increasingly focusing on building enterprise data applications on top of their Hadoop and NoSQL infrastructure. But even as that's happening, Hadoop itself is becoming much more diverse and complex. That's a potential headache for developers seeking to build applications on top of that data infrastructure, but data application platform specialist Concurrent, primary sponsor of the open source Cascading application framework, sees it as an opportunity.
While Apache Hadoop began as a combination of Hadoop Distributed File System (HDFS) for file storage and MapReduce for compute, there is now a growing number of compute options in Hadoop, including Apache Tez (a framework for near real-time big data processing), as well as the soon-to-be-released Apache Spark (a framework for in-memory cluster computing) and Apache Storm (a distributed computation framework for stream processing). Hadoop distribution vendor MapR even offers an alternative to HDFS in its distribution.
"Thinking in MapReduce is one thing, but then having to think in Tez is something else," says Chris Wensel, founder and CTO of Concurrent and original author of Cascading. "It's a huge challenge."
"Hadoop is balkanizing and fracturing," he adds. "There is no more Hadoop. There's HDFS and whatever runs on top of it."
Cascading Is a Software Abstraction Layer for Hadoop
Cascading is a software abstraction layer for Apache Hadoop that is intended to allow developers to write their data applications once and then deploy those applications on any big data infrastructure, regardless of the components in use. That's what has allowed Concurrent to win big Web 2.0 customers like eBay, LinkedIn, Twitter and Pinterest (as well as a slew of others), and what now contributes to more than 150,000 user downloads a month. Customers use it to build a wide range of applications: enterprise IT uses like ETL and operational analysis; corporate apps like HR analytics; telecom apps like location-based services; marketing apps like funnel analysis and ad optimization; consumer/entertainment apps like music recommendations; finance apps like fraud and anomaly detection; and health/biotech apps like veterinary diagnostics and next-generation genomics.
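To make the "write once, deploy anywhere" idea concrete, here is a minimal sketch in Python. It is not the Cascading API (Cascading itself is a Java framework); every name in it (`word_count_flow`, `LocalFabric`, `MapReduceStyleFabric`) is hypothetical and chosen only for illustration. The application is described once as a list of logical steps, and two interchangeable backends execute that same description.

```python
from collections import Counter

# Hypothetical names throughout: this is an illustration, not the Cascading API.

def word_count_flow():
    """Describe the application once, as a list of logical steps."""
    return [
        ("each", lambda line: line.lower().split()),  # split each line into words
        ("group_count", None),                        # group by word and count
    ]

class LocalFabric:
    """Runs the flow in memory, counting in a single pass."""

    def run(self, flow, lines):
        records = list(lines)
        for kind, fn in flow:
            if kind == "each":
                records = [out for rec in records for out in fn(rec)]
            elif kind == "group_count":
                records = sorted(Counter(records).items())
        return records

class MapReduceStyleFabric:
    """Runs the same flow as explicit map, shuffle and reduce phases."""

    def run(self, flow, lines):
        records = list(lines)
        for kind, fn in flow:
            if kind == "each":
                records = [out for rec in records for out in fn(rec)]
            elif kind == "group_count":
                mapped = [(word, 1) for word in records]   # map phase
                shuffled = {}
                for key, value in mapped:                  # shuffle phase
                    shuffled.setdefault(key, []).append(value)
                records = sorted(                          # reduce phase
                    (key, sum(values)) for key, values in shuffled.items()
                )
        return records

if __name__ == "__main__":
    lines = ["a b a", "B"]
    for fabric in (LocalFabric(), MapReduceStyleFabric()):
        print(type(fabric).__name__, fabric.run(word_count_flow(), lines))
```

The application code in `word_count_flow` never mentions a backend, so swapping the execution engine does not require rewriting it. That is the property the article attributes to Cascading, shown here only in toy form.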
Wensel says he originally wrote Cascading in anger -- after using MapReduce once, he was determined that no one would have to use it directly again. Now, with Cascading 3.0, announced today, the framework will go even further -- it's not just about MapReduce anymore.
Cascading 3.0 Will Support Emerging Big Data Fabrics
Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner, says Wensel. When released it will support local in-memory, Apache MapReduce and Apache Tez out of the gate, with support for Apache Spark and Apache Storm soon to follow. The idea is to allow enterprises to standardize on one API that will allow them to build data applications to solve a variety of business problems ranging from simple to complex, regardless of latency or scale. In addition, Wensel says third-party products, data applications, frameworks and dynamic programming languages built on Cascading (like Scalding or Cascalog) will immediately benefit from the portability.
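The article does not describe how the new query planner works internally, so the following is only a toy Python sketch of what planning one logical flow for different fabrics can mean. It assumes a flow is a list of operation names, and all identifiers (`FLOW`, `plan_mapreduce`, `plan_dag`, `plan`) are hypothetical. It relies on one general fact: a MapReduce job has a single shuffle, so a flow with several group-by steps has to be cut into chained jobs, whereas a DAG-oriented engine such as Tez can run the whole flow as one graph.

```python
# Hypothetical logical flow: operation names only, not Cascading classes.
FLOW = ["read", "filter", "group_by", "count", "group_by", "top_n", "write"]

def plan_mapreduce(flow):
    """Cut the flow into jobs so that each job has at most one group_by (one shuffle)."""
    jobs, current = [], []
    for op in flow:
        # A second group_by cannot fit in the same job, so start a new one.
        if op == "group_by" and "group_by" in current:
            jobs.append(current)
            current = []
        current.append(op)
    if current:
        jobs.append(current)
    return jobs

def plan_dag(flow):
    """A DAG-style fabric can take the whole flow as a single unit of work."""
    return [list(flow)]

# Assumed registry of planners, keyed by an illustrative fabric name.
PLANNERS = {"mapreduce": plan_mapreduce, "dag": plan_dag}

def plan(flow, fabric):
    """Pick the planner for the requested fabric and apply it to the flow."""
    return PLANNERS[fabric](flow)

if __name__ == "__main__":
    for fabric in PLANNERS:
        print(fabric, plan(FLOW, fabric))
```

Supporting a new fabric in this sketch means registering one more planner function; the flow definition itself is untouched. Whether Cascading 3.0 is structured this way is an assumption made for illustration.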
Concurrent has also forged close strategic partnerships with Hortonworks (one of the primary sponsors of Apache Hadoop) and Databricks (the primary sponsor of Apache Spark). Hortonworks will now integrate the Cascading SDK with its Hortonworks Data Platform (HDP) distribution of Hadoop, and will certify and support the SDK with HDP. Cascading will also support Apache Spark in a future release, and Concurrent notes that companies using Cascading will be able to seamlessly run their applications on Spark.
Concurrent says Cascading 3.0 will be available early this summer and freely licensable under the Apache 2.0 License Agreement.