Apache Spark is breaking down the barriers between data scientists and engineers, making machine learning easier, and outgrowing Hadoop as an open source framework for cloud computing development, a new report claims.
The 2015 Spark User Survey was conducted by Databricks, the company founded by the creators of Apache Spark.
Spark adoption is growing quickly because users find it easy to use, reliably fast, and well placed for future growth in analytics, the report claims, with 91 per cent of respondents citing performance as their reason for adoption. Other reasons given were ease of programming (77 per cent), easy deployment (71 per cent), advanced analytics (64 per cent) and the capacity for real-time streaming (52 per cent).
The report, based on a survey of 1,400 Spark stakeholders, claims that the number of Spark users with no Hadoop components doubled between 2014 and 2015. The study set out to identify how the data analytics and processing engine is being used by developers and organisations.
The Spark growth claim is based on the finding that 48 per cent of users are running Spark in standalone mode, while 40 per cent run it on Hadoop’s YARN resource manager. At present, 11 per cent of users are running Spark on Apache Mesos. The survey also found that 51 per cent of respondents run Spark on a public cloud.
The number of contributors to Spark rose from 315 to 600 in the last 12 months, which the report authors claim makes it the most active open source project in Big Data. Additionally, more than 200 organisations contribute code to Spark, which they claim makes it ‘one of’ the largest communities of engaged developers to date.
According to the report, Spark is being used for increasingly diverse applications, with data scientists particularly focused on machine learning, streaming and graph analysis projects. Spark was used to create streaming applications 56 per cent more frequently in 2015 than in 2014. The use of advanced analytics libraries, such as MLlib for machine learning and GraphX for graph processing, is becoming increasingly common, the report says.
According to the study, 41 per cent of those surveyed identified themselves as data engineers, while 22 per cent said they were data scientists. The most common languages used for open source big data projects in cloud computing are Scala (used by 71 per cent of respondents), Python (58 per cent), SQL (36 per cent), Java (31 per cent) and R (18 per cent).