IBM said it will throw its weight behind Apache Spark, an open source community developing a processing engine for large-scale datasets, putting thousands of internal developers to work on Spark-related projects and contributing its machine learning technology to the code ecosystem.
Spark, an Apache open source project born in 2009, is essentially an engine that can process vast amounts of data very quickly. It runs in Hadoop clusters through YARN or as a standalone deployment and can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat; it currently supports Scala, Java and Python.
It is designed for general data processing in the style of MapReduce, but it can also handle newer workloads such as streaming data, interactive queries, and machine learning. That makes it a good match for Internet of Things applications, which is why IBM is so keen to support the project.
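For readers unfamiliar with the MapReduce style of processing mentioned above, the idea can be roughly illustrated with a word count in plain Python. This is only a sketch of the general pattern and does not use Spark's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, chosen for illustration, and the example runs on a single machine rather than a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: combine the values for each key into a single total."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    """Chain the three steps together (illustrative, single-process only)."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["spark is fast", "Spark is open"]))
```

In an engine such as Spark, the same three steps would typically be spread across many machines, with the grouping step moving data between them; the sketch above only shows the logical shape of the computation.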
The company said the technology brings huge advances when processing massive datasets generated by Internet of Things devices, improving the performance of data-dependent apps.
“IBM has been a decades long leader in open source innovation. We believe strongly in the power of open source as the basis to build value for clients, and are fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way,” said Beth Smith, general manager, analytics platform, IBM Analytics.
“Our clients will benefit as we help them embrace Spark to advance their own data strategies to drive business transformation and competitive differentiation,” Smith said.
In addition to joining the Spark community, IBM said it would build the technology into the majority of its big data offerings and offer Spark-as-a-Service on Bluemix. It also said it will open source its IBM SystemML machine learning technology and collaborate with Databricks, a Spark-as-a-Service provider, to advance Spark’s machine learning capabilities.