Google launches Dataproc after successful beta trials

Google cloud platformGoogle has announced that its big data analysis tool Dataproc is now on general release. The utility, which was one of the factors that persuaded Spotify to choose Google’s Cloud Platform over Amazon Web Services is a managed tool based on the Hadoop and Spark open source big data software.

The service first became available in beta in September and was tested by global music streaming service Spotify, which was evaluating whether it should move its music files away from its own data centres and into the public cloud – and which cloud service could support it. Dataproc in its beta form supported the MapReduce engine, the Pig platform for writing programmes and the Hive data warehousing software. Google says it has added new features and sharpened the tool since then.

While in its beta testing phase, Cloud Dataproc added features such as property tuning, VM metadata and tagging and cluster versioning. “In general availability new versions of Cloud Dataproc will be frequently released with new features, functions and software components,” said Google product manager James Malone.

Cloud Dataproc aims to minimise cost and complexity, which are the two major distractions of data processing, according to Malone.

“Spark and Hadoop should not break the bank and you should pay for what you actually use,” he said. As a result, Cloud Dataproc is priced at 1 cent per virtual CPU per hour. Billing is by the minute with a 10-minute minimum.

Analysis should run faster, Malone said, because clusters in Cloud Dataproc can start and stop operations in less than 90 seconds, where they take minutes in other big data systems. This can make analyses run up to ten times faster. The new general release of Cloud Dataproc will have better management, since clusters don’t need specialist administration people or software.

Cloud Dataproc also tackles two other data processing bugbears, scale and productivity, promised Malone. This tool complements a separate service called Google Cloud Dataflow for batch and stream processing. The underlying technology for the service has been accepted as an Apache incubator project under the name Apache Beam.