Hadoop distribution specialist Hortonworks has joined forces with HP Labs, HP Enterprise's central research organization, in an effort to dramatically improve the performance of Spark workloads.
At a press conference in San Francisco Tuesday, the two companies announced a collaboration that has already borne some fruit:
- Enhanced shuffle engine technologies. Faster sorting and in-memory computations, which have the potential to dramatically improve Spark performance.
- Better memory utilization. Improved performance and usage for broader scalability, which will help enable new large-scale use cases.
"We're hoping to enable the Spark community to derive insight more rapidly from much larger data sets without having to change a single line of code," says Martin Fink, executive vice president and CTO of HP Enterprise and a member of Hortonworks' board. "We're very pleased to be able to work with Hortonworks to broaden the range of challenges that Spark can address."
Fink explains that HP Labs had been researching the efficiency and scale of memory in the enterprise, as well as ways to enhance memory utilization.
"Part of that research activity is we rewrote the shuffle engine from Java to C++," he says. "We saw that we had rewritten a bunch of algorithms to make much more efficient use of memory and enabled ways that you could scale memory even more."
In fact, some customers that leveraged HP Labs' work found that it increased the performance of certain workloads by 5X to 15X.
"I've been around for a long time," Fink says. "It's not often that you come out with 15X performance increases on certain workloads. We knew we needed this to be part of a greater whole."
Fink notes that HP Enterprise chose to open source its research with the help of Hortonworks due to Hortonworks' dedication to openness and collaboration.
"This collaboration indicates our mutual support of and commitment to the growing Spark community and its solutions," adds Scott Gnau, CTO of Hortonworks. "We will continue to focus on the integration of Spark into broad data architectures supported by Apache YARN as well as enhancements for performance and functionality and better access points for applications like Apache Zeppelin."
Zeppelin is an incubating Apache project that provides a Web-based notebook for interactive data analytics.