Category Archives: Spark

Apache Spark reportedly outgrowing Hadoop as users move to cloud

Apache Spark is breaking down the barriers between data scientists and engineers, making machine learning easier and outgrowing Hadoop as an open source framework for cloud computing developments, a new report claims.

The 2015 Spark User Survey was conducted by Databricks, the company founded by the creators of Apache Spark.

Spark adoption is growing quickly because users are finding it easy to use, reliably fast, and aligned for future growth in analytics, the report claims, with 91 per cent of respondents citing performance as their reason for adoption. Other reasons given were ease of programming (77 per cent), easy deployment (71 per cent), advanced analytics (64 per cent) and the capacity for real-time streaming (52 per cent).

The report, based on a survey of 1,400 Spark stakeholders, claims that the number of Spark users with no Hadoop components doubled between 2014 and 2015. The study set out to identify how the data analytics and processing engine is being used by developers and organisations.

The Spark growth claim is based on the finding that 48 per cent of users are running Spark in standalone mode while 40 per cent run it on Hadoop’s YARN resource manager. At present 11 per cent of users are running Spark on Apache Mesos. The survey also found that 51 per cent of respondents run Spark on a public cloud.

The number of contributors to Spark rose from 315 to 600 in the last 12 months, which the report authors claim makes this the most active open source project in Big Data. Additionally, more than 200 organisations contribute code to Spark, which they claim makes it ‘one of’ the largest communities of engaged developers to date.

According to the report, Spark is being used for increasingly diverse applications, with data scientists particularly focused on machine learning, streaming and graph analysis projects. Spark was used to create streaming applications 56 per cent more frequently in 2015 than in 2014. The use of advanced analytics libraries, like MLlib for machine learning and GraphX for graph processing, is becoming increasingly common, the report says.

According to the study, 41 per cent of those surveyed identified themselves as data engineers, while 22 per cent said they are data scientists. The most common languages used for open source-based big data projects in cloud computing are Scala (used by 71 per cent of respondents), Python (58 per cent), SQL (36 per cent), Java (31 per cent) and R (18 per cent).

Intel, BlueData partner on big data following $20m funding round

Intel and BlueData are collaborating on big data

Hadoop specialist BlueData announced a strategic collaboration with Intel this week after the chip company’s venture capital arm helped lead a $20m funding round for the startup.

BlueData offers virtualised Hadoop-as-a-Service software for on-premise infrastructure that speeds up Hadoop cluster deployment and model prototyping. The partnership will see the two companies integrate BlueData’s big data software with Intel’s Xeon processor technology, which Intel said builds on its existing big data integration initiatives with Cloudera and Apache Hadoop.

“Intel architecture provides a high-performance, secure, robust foundation for big data analytics,” said Brian Krzanich, Intel chief executive. “BlueData’s innovative software delivers the simplicity, agility and efficiency of big data-as-a-service in an on-premises model. Together, we are focused on bringing big data into the mainstream and unlocking the value for our enterprise customers.”

Kumar Sreekanti, co-founder and chief executive of BlueData, said: “This strategic collaboration with Intel will help advance BlueData’s mission of making it easy to deploy big data infrastructure. Our software platform simplifies the complexity, reduces the cost and delivers faster time to value for big data initiatives.”

“Our go-to-market relationship and joint product development with Intel will allow enterprises to accelerate their deployment of Hadoop and Spark, and deliver on the promise of big data analytics,” he added.

The move comes as Intel Capital, the chip giant’s venture capital arm, led a $20m series C funding round for BlueData, with participation from existing investors Amplify Partners, Atlantic Bridge, and Ignition Partners.

As part of the funding round Doug Fisher, senior vice president of Intel and general manager of its Software and Services Group, will join BlueData’s board of directors.

The BlueData partnership is one of a number of high-profile big data deals Intel has inked of late. Less than a week ago the firm partnered with Oregon Health & Science University (OHSU) to develop a big data platform that can help diagnose and treat individuals for cancer based on their genetic predispositions.

IBM calls Apache Spark “most important new open source project in a decade”

IBM is throwing its weight behind Apache Spark in a bid to bolster its IoT strategy

IBM said it will throw its weight behind Apache Spark, an open source community developing a processing engine for large-scale datasets, putting thousands of internal developers to work on Spark-related projects and contributing its machine learning technology to the code ecosystem.

Spark, an Apache open source project born in 2009, is essentially an engine that can process vast amounts of data very quickly. It runs in Hadoop clusters through YARN or as a standalone deployment and can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat; it currently supports Scala, Java and Python.

It is designed to perform general data processing (like MapReduce), but one of the exciting things about Spark is that it can also handle newer workloads like streaming data, interactive queries, and machine learning. That makes it a good match for Internet of Things applications, which is why IBM is so keen to go big on supporting the project.
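For readers unfamiliar with the MapReduce-style processing mentioned above, the idea can be sketched in a few lines of plain Python. This is only an illustration of the map, group and reduce steps on a single machine; it does not use Spark's or Hadoop's actual APIs, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical ones chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(lines):
    """Run the three phases in sequence (hypothetical helper for this sketch)."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["spark runs on yarn", "spark runs standalone"]
    print(word_count(sample))
```

In a real cluster the map and reduce steps would run in parallel across many machines, with the framework handling the grouping step over the network; the sketch only shows the shape of the computation.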

The company said the technology brings huge advances when processing massive datasets generated by Internet of Things devices, improving the performance of data-dependent apps.

“IBM has been a decades long leader in open source innovation. We believe strongly in the power of open source as the basis to build value for clients, and are fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way,” said Beth Smith, general manager, analytics platform, IBM Analytics.

“Our clients will benefit as we help them embrace Spark to advance their own data strategies to drive business transformation and competitive differentiation,” Smith said.

In addition to joining the Spark community, IBM said it would build the technology into the majority of its big data offerings, and offer Spark-as-a-Service on Bluemix. It also said it will open source its IBM SystemML machine learning technology, and collaborate with Databricks, a Spark-as-a-Service provider, to advance Spark’s machine learning capabilities.