Category Archives: Hadoop

What is the promise of big data? Computers will be better than humans

Big data as a concept has in fact been around longer than computer technology, which would surprise a number of people.

Back in 1944, Wesleyan University librarian Fremont Rider wrote a paper estimating that American university libraries were doubling in size every sixteen years; at that rate a collection grows 64-fold over the 96 years to 2040, by which point Rider reckoned the Yale Library would occupy over 6,000 miles of shelves. This is not big data as most people would know it, but the vast and rapid increase in the quantity and variety of information in the Yale library follows the same principle.

The concept was not known as big data back then, but technologists today face the same challenge of how to handle such a vast amount of information. Not necessarily how to store it, but how to make use of it. The promise of big data, and of data analytics more generally, is to provide intelligence, insight and predictability, but only now are we reaching a stage where technology is advanced enough to capitalise on the vast amount of information available to us.

Back in 2003 and 2004 Google published papers on its Google File System and MapReduce, which are generally credited as the beginning of the Apache Hadoop platform. At that point, few people could have anticipated the explosion of technology we have witnessed since. Cloudera Chairman and CSO Mike Olson is one of those few, and he also leads a company regularly cited as one of the go-to organizations for the Apache Hadoop platform.

“We’re seeing innovation in CPUs, in optical networking all the way to the chip, in solid state, highly affordable, high performance memory systems, we’re seeing dramatic changes in storage capabilities generally. Those changes are going to force us to adapt the software and change the way it operates,” said Olson, speaking at the Strata + Hadoop event in London. “Apache Hadoop has come a long way in 10 years; the road in front of it is exciting but is going to require an awful lot of work.”

Analytics was previously seen as an opportunity for companies to look back at their performance over a defined period, and develop lessons for employees on how future performance could be improved. Today, advanced analytics is applied to improving performance in real time: a company can react instantly to shift the focus of a marketing campaign, or alter a production line to improve the outcome. The promise of big data and IoT is predictability and data-defined decision making, which can shift a business from a reactive position to a predictive one. Understanding trends can create proactive business models which advise decision makers on how to steer a company. But what comes next?

Cloudera Chairman and CSO Mike Olson

For Olson, machine learning and artificial intelligence are where the industry is heading. We’re at a stage where big data and analytics can be used to automate processes and replace humans for simple tasks. In a short period of time we’ve seen some significant advances in applications of the technology, most notably Google’s AlphaGo beating world Go champion Lee Se-dol and Facebook’s use of AI in picture recognition.

Although computers taking on humans in games of strategy is nothing new as a PR stunt (IBM’s Deep Blue defeated chess world champion Garry Kasparov in 1997), this is a very different proposition. While chess is a game which relies on strategy, Go is another beast. The vast number of permutations means strategies within the game rely on intuition and feel, a far harder task for the Google team to model. The fact that AlphaGo won the match demonstrates how far researchers have progressed in making machine learning and artificial intelligence a reality.

“In narrow but very interesting domains, computers have become better than humans at vision and we’re going to see that piece of innovation absolutely continue,” said Olson. “Big Data is going to drive innovation here.”

This may be difficult for a number of people to comprehend, but big data has entered the business world; true AI and automated, data-driven decision making may not be too far behind. Data is driving the direction of businesses, whether through a better understanding of the customer, improved security for an organization or a clearer view of the risk associated with any business decision. Big data is no longer a theory, but an established business strategy.

Olson is not saying computers will replace humans, but the number and variety of processes which can be handed over to machines is certainly growing, and growing faster every day.

Wipro open sources big data offering

Wipro has announced it has open sourced its big data solution, Big Data Ready Enterprise (BDRE), partnering with California-based Hortonworks to push the initiative forward.

The company claims the BDRE offering addresses the complete lifecycle of managing data across enterprise data lakes, allowing customers to ingest, organize, enrich, process, analyse, govern and extract data at a faster pace. BDRE is released under the Apache License v2.0 and hosted on GitHub. Teaming up with Hortonworks will also give the company additional clout in the market, as Hortonworks is generally considered one of the top three Hadoop distribution vendors.

“Wipro takes pride in being a significant contributor to the open source community, and the release of BDRE reinforces our commitment towards this ecosystem,” said Bhanumurthy BM, COO at Wipro. “BDRE will not only make big data technology adoption simpler and effective, it will also open opportunities across industry verticals that organizations can successfully leverage. Being at the forefront of innovation in big data, we are able to guide organizations that seek to benefit from the strategic, financial, organizational and technological benefits of adopting open source technologies.”

Companies open sourcing their own technologies has become something of a trend in recent months, as product owners move towards a service model rather than a traditional vendor one. According to ‘The Open Source Era’, an Oxford Economics study commissioned by Wipro, 64% of respondents believe that open source will drive Big Data efforts in the next three years.

The report also claims open source has become a foundation stone of many businesses’ technology roadmaps: 75% of respondents believe integration between legacy and open source systems is one of the main challenges, and 52% said open source is already supporting the development of new products and services.

MapR gets converged data platform patented

California-based open source big data specialist MapR Technologies has been granted patent protection for its technique for converging open source software, enterprise storage, NoSQL databases and event streams.

The United States Patent and Trademark Office recognised the detailed differentiation of the Hadoop specialist’s work within the free, Java-based Hadoop programming framework. Though the technology derives from work by the open source oriented Apache Software Foundation, the patent office judged that MapR’s performance, data protection, disaster recovery and multi-tenancy features merit a recognisable level of differentiation.

The key components of the patent claims include a design based on containers, self-contained autonomous units with their own operating system and application software. Containers can ring-fence data against loss, optimise replication techniques and create a system that can tolerate multiple node failures in a cluster.

Other vital components of the system are transactional read-write-update semantics with cluster-wide consistency, plus recovery and update techniques. The recovery features can reconcile the divergence of replicated data after node failure, even while transactional updates continue to arrive. The update techniques allow for extreme variations in performance and scale while supporting familiar application programming interfaces (APIs).

MapR claims its Converged Data Platform allows clients to innovate with open source, provides a foundation for analytics (by converging all the data), creates enterprise grade reliability in one open source platform and makes instant, continuous data processing possible.

It’s the differentiation of the core combined with standard APIs that makes it stand out from other Apache projects, MapR claims. Meanwhile the system’s ability to use a single cluster that can handle converged workloads makes it easier to manage and secure, the company says.

“The patent details how our platform gives us an advantage in the big data market. Some of the most demanding enterprises in the world are solving their business challenges using MapR,” said Anil Gadre, MapR Technologies’ senior VP of product management.

WANdisco’s new Fusion system aims to take the fear out of cloud migration

Software vendor WANdisco has announced six new products to make cloud migration easier and less dangerous as companies plan to move away from DIY computing.

The vendor claims its latest Fusion system creates a safety net of continuous availability and streaming back-up. Building on that, the platform offers uninterrupted migration and gives hybrid cloud systems the capacity to expand across both private and public clouds if necessary. These four fundamental capabilities are built on six new software plug-ins designed to make the transition from production systems to live cloud systems smoother, says DevOps specialist WANdisco.

The backbone of Fusion is WANdisco’s replication technology, which ensures that all servers and clusters are fully readable and writeable, always in sync and can recover automatically from each other after planned or unplanned downtime.

The plug-ins that address continuous availability, data consistency and disaster recovery are named Active-Active Disaster Recovery, Active-Active Hive and Active-Active HBase. The first guarantees data consistency with failover and automated recovery over any network, preventing Hadoop cluster downtime and data loss. The second ensures consistent Hive query results across all clusters and locations. The third aims to deliver continuous availability and consistency for HBase across all locations.

Three further plug-ins address the heightened exposure created when companies move their systems from behind a company firewall onto a public cloud. These are named Active Back-up, Active Migration and Hybrid Cloud. To supplement these offerings, WANdisco has also introduced the Fusion Software Development Kit (SDK) so that enterprise IT departments can program their own modifications.

“Ease of use isn’t the first thing that comes to mind when one thinks about Big Data, so WANdisco Fusion sets out to simplify the Hadoop crossing,” said WANdisco CEO David Richards.

Cloudera announces tighter security measures for Hadoop

Cloudera has announced a new open source project that aims to enable real-time analytic applications in Hadoop, along with an open source security layer for unified access control enforcement.

Kudu, a new columnar storage engine for Hadoop, aims to give developers more choice and stop their options being limited. Currently developers must choose between fast analytics with HDFS or the ability to update data with HBase. Combining the two, according to Cloudera, is prohibitively complex for most developers who try, since both systems are highly complex.

Cloudera says Kudu eliminates the complexities involved in processes like time series analysis, machine data analytics and online reporting. It does this by supporting high-performance sequential and random reads and writes, enabling fast analytics on changing data.
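
To make that concrete, here is a minimal sketch of the pattern Kudu targets, written against the kudu-spark connector’s DataFrame API (the exact format string varies between connector versions). The master address, table name and columns are hypothetical, and this is an illustration rather than Cloudera’s own example code.

```scala
import org.apache.kudu.spark.kudu._
import org.apache.spark.sql.SparkSession

object KuduSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-sketch").getOrCreate()

    // Read a Kudu table as a DataFrame; Kudu serves both large scans and point lookups.
    val metrics = spark.read
      .options(Map(
        "kudu.master" -> "kudu-master:7051",  // hypothetical master address
        "kudu.table"  -> "machine_metrics"))  // hypothetical table name
      .format("kudu")
      .load()

    // Fast analytics over data that is still being updated.
    metrics.createOrReplaceTempView("machine_metrics")
    spark.sql("SELECT host, avg(cpu_load) FROM machine_metrics GROUP BY host").show()

    // Random writes go through KuduContext instead of rewriting immutable HDFS files.
    // Re-upserting a slice of the same rows here simply to show the call shape.
    val kuduContext = new KuduContext("kudu-master:7051", spark.sparkContext)
    kuduContext.upsertRows(metrics.limit(10), "machine_metrics")
  }
}
```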

Cloudera co-authored Kudu with Intel, which helped it make better use of in-memory hardware and Intel’s 3D XPoint technology. Other contributors included Xiaomi, AtScale, Splice Machine and Zoomdata.

“Our infrastructure team has been working with Cloudera to develop Kudu, taking advantage of its unique ability to support columnar scans and fast inserts and updates to continue to expand our Hadoop ecosystem footprint,” Baoqiu Cui, chief architect at smartphone developer Xiaomi, told CIO magazine. “Using Kudu, alongside interactive SQL tools like Impala, has allowed us to build a next-generation data analytics platform for real-time analytics and online reporting.”

Meanwhile a new core security layer for Hadoop has been launched. RecordService aims to provide unified access control enforcement for Hadoop: it sits as a new layer between Hadoop’s storage and compute engines and consistently enforces the role-based access controls defined by Sentry. RecordService also provides dynamic data masking across Hadoop, protecting sensitive data as it is accessed.

“Security is a critical part of Hadoop, but for it to evolve the security needs to become universal across the platform. With RecordService, the Hadoop community fulfils the vision of unified fine-grained access controls for every Hadoop access path,” said Mike Olson, co-founder and chief strategy officer at Cloudera.

Criteo to build giant private big data platform on Huawei servers

Performance marketing specialist Criteo has chosen Huawei to supply 700 servers for its new Hadoop cluster data centre in Pantin, Seine-Saint-Denis, near Paris.

Huawei won the tender after its FusionServer RH2288H V3 impressed in a strict comparative study, it says. The servers were chosen for their abundance of high-capacity disks, which give the Criteo data centre a better storage density and cut energy consumption by 10 per cent, it claims.

The new Hadoop platform of Huawei servers will boost Criteo’s processing performance by 30 per cent, it’s claimed. In the first stage of the project, the 700 machines in the Paris data centre outperformed Criteo’s Amsterdam data centre, in terms of computing power and storage, even though the Dutch site has 1,200 servers at its disposal, according to Criteo’s Senior Engineering Manager for Infrastructure Operations, Matthieu Blumberg.

“This is the biggest private Hadoop platform in Europe as of today,” said Blumberg, “Huawei has undeniably good ICT solutions and extensive knowledge of Big Data. We were really impressed.”

As a result, Criteo now plans to install up to 5,000 servers, taking up 350 square meters of rack space, at its Pantin data centre. The total power consumption will rise to 2 MW as the power of the Huawei server estate grows, according to Blumberg.

“We are proud to have built this partnership with Criteo: this is the kind of project we love to develop,” said Robert Yang, Head of the Huawei France Enterprise Business Group.

Apache Spark reportedly outgrowing Hadoop as users move to cloud

Apache Spark is breaking down the barriers between data scientists and engineers, making machine learning easier, and outgrowing Hadoop as an open source framework for cloud computing developments, a new report claims.

The 2015 Spark User Survey was conducted by Databricks, the company founded by the creators of Apache Spark.

Spark adoption is growing quickly because users find it easy to use, reliably fast, and aligned with future growth in analytics, the report claims, with 91 per cent of respondents citing performance as a reason for adoption. Other reasons given were ease of programming (77 per cent), ease of deployment (71 per cent), advanced analytics (64 per cent) and the capacity for real-time streaming (52 per cent).

The report, based on the findings of a survey of 1,400 Spark stakeholders, claims that the number of Spark users running no Hadoop components doubled between 2014 and 2015. The study set out to identify how the data analytics and processing engine is being used by developers and organisations.

The Spark growth claim is based on the finding that 48 per cent of users run Spark in standalone mode, while 40 per cent run it on Hadoop’s YARN resource manager and 11 per cent run it on Apache Mesos. The survey also found that 51 per cent of respondents run Spark on a public cloud.
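
As a rough illustration of those deployment choices (not drawn from the survey itself), the same Spark application can target a standalone master, YARN or Mesos just by changing its master URL; the host names below are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeploymentModes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("deployment-sketch")
      // Standalone mode: connect directly to a Spark master, no Hadoop required.
      .setMaster("spark://spark-master:7077")     // hypothetical standalone master
      // .setMaster("yarn")                        // or run inside a Hadoop cluster via YARN
      // .setMaster("mesos://mesos-master:5050")   // or on an Apache Mesos cluster

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 1000).sum())  // trivial job to prove the cluster works
    sc.stop()
  }
}
```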

The number of contributors to Spark rose from 315 to 600 over the last 12 months, which the report’s authors claim makes it the most active open source project in Big Data. Additionally, more than 200 organisations contribute code to Spark, which they claim makes it ‘one of’ the largest communities of engaged developers to date.

According to the report, Spark is being used for increasingly diverse applications, with data scientists particularly focused on machine learning, streaming and graph analysis projects. Spark was used to create streaming applications 56 per cent more frequently in 2015 than in 2014. The use of advanced analytics libraries, like MLlib for machine learning and GraphX for graph processing, is becoming increasingly common, the report says.
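
For a flavour of those advanced-analytics workloads, here is a minimal MLlib clustering sketch in Scala; the HDFS path and the whitespace-separated feature format are hypothetical.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mllib-sketch"))

    // Parse whitespace-separated numeric features from a (hypothetical) HDFS file.
    val points = sc.textFile("hdfs:///data/features.txt")
      .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
      .cache()

    // Cluster the points into 5 groups, with 20 iterations of Lloyd's algorithm.
    val model = KMeans.train(points, 5, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```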

According to the study, 41 per cent of those surveyed identified themselves as data engineers, while 22 per cent said they are data scientists. The most common languages used for open source based big data projects in cloud computing are Scala (used by 71 per cent of respondents), Python (58 per cent), SQL (36 per cent), Java (31 per cent) and R (18 per cent).

SAP announces improvements to cloud platform and Vora analytics software

SAP has released new software that it claims will make analytics easier for users of open source Hadoop software.

SAP HANA Vora is a new in-memory query engine that improves the performance of the Apache Spark execution framework. As a result, anyone running data analysis should get better interaction with their data if it is held on Hadoop, and companies will benefit from more useful intelligence.

SAP claims the new software will overcome the general ‘lack of business process awareness’ that exists across enterprise apps, analytics, big data and Internet of Things (IoT) sources. The software will make it easier for data scientists and developers to reach the right information by simplifying access to corporate and Hadoop data.
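
Vora plugs into the Spark SQL layer; as a generic illustration of that access pattern (plain Spark SQL over Hadoop-resident data, not Vora’s own API), a drill-down query looks something like this, with a hypothetical path and schema.

```scala
import org.apache.spark.sql.SparkSession

object HadoopSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hadoop-sql-sketch").getOrCreate()

    // Register a file already sitting in HDFS as a queryable table.
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///warehouse/transactions.csv")  // hypothetical path
      .createOrReplaceTempView("transactions")

    // The kind of business-context drill-down a query engine like Vora accelerates.
    spark.sql(
      """SELECT customer_id, SUM(amount) AS total
        |FROM transactions
        |GROUP BY customer_id
        |ORDER BY total DESC
        |LIMIT 10""".stripMargin).show()
  }
}
```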

SAP HANA Vora will bring most benefit in industries where Big Data analytics in a business process context is paramount; SAP identified financial services, telecommunications, healthcare and manufacturing as target markets. The savings created by the new software will come from a number of areas, it said. In the financial sector, the return on investment will come from mitigating risk and fraud by detecting new anomalies in financial transactions and customer history data.

Telecoms companies will benefit from optimising their bandwidth, SAP claims, as telcos use the software to analyse traffic patterns to avoid network bottlenecks and improve quality of service. Manufacturers will benefit from preventive maintenance and improved product recall processes as a result of SAP HANA Vora’s newly delivered powers of analysis across bills of material, service records and sensor data.

The use of Hadoop and SAP HANA to manage large unstructured data sets has left room for improvement, according to user Aziz Safa, Intel VP of IT Enterprise Applications and Application Strategy. “One of the key requirements is better analyses of big data,” said Safa, “but mining these large data sets for contextual information in Hadoop is a challenge.”

SAP HANA Vora will be released by the end of September, when a cloud-based developer edition will also be available.

Intel, BlueData partner on big data following $20m funding round

Intel and BlueData are collaborating on big data

Hadoop specialist BlueData announced a strategic collaboration with Intel this week after the chip company’s venture capital arm helped lead a $20m funding round for the startup.

BlueData offers virtualised Hadoop-as-a-Service software for on-premise infrastructure that speeds up Hadoop cluster deployment and model prototyping. The partnership will see the two companies integrate BlueData’s big data software with Intel’s Xeon processor technology, which Intel said builds on its existing big data integration initiatives with Cloudera and Apache Hadoop.

“Intel architecture provides a high-performance, secure, robust foundation for big data analytics,” said Brian Krzanich, Intel chief executive. “BlueData’s innovative software delivers the simplicity, agility and efficiency of big data-as-a-service in an on-premises model. Together, we are focused on bringing big data into the mainstream and unlocking the value for our enterprise customers.”

Kumar Sreekanti, co-founder and chief executive of BlueData, said: “This strategic collaboration with Intel will help advance BlueData’s mission of making it easy to deploy big data infrastructure. Our software platform simplifies the complexity, reduces the cost and delivers faster time to value for big data initiatives.”

“Our go-to-market relationship and joint product development with Intel will allow enterprises to accelerate their deployment of Hadoop and Spark, and deliver on the promise of big data analytics,” he added.

The move comes as Intel Capital, the chip giant’s venture capital arm, led a $20m series C funding round for BlueData, with participation from existing investors Amplify Partners, Atlantic Bridge and Ignition Partners.

As part of the funding round Doug Fisher, senior vice president of Intel and general manager of its Software and Services Group, will join BlueData’s board of directors.

The BlueData partnership is one of a number of high-profile big data deals Intel has inked of late. Less than a week ago the firm partnered with Oregon Health & Science University (OHSU) to develop a big data platform that can help diagnose and treat individuals for cancer based on their genetic predispositions.

IBM calls Apache Spark “most important new open source project in a decade”

IBM is throwing its weight behind Apache Spark in a bid to bolster its IoT strategy

IBM said it will throw its weight behind Apache Spark, an open source community developing a processing engine for large-scale datasets, putting thousands of internal developers to work on Spark-related projects and contributing its machine learning technology to the code ecosystem.

Spark, an open source project born at UC Berkeley in 2009 and now developed under the Apache Software Foundation, is essentially an engine that can process vast amounts of data very quickly. It runs in Hadoop clusters through YARN or as a standalone deployment, and can process data in HDFS, HBase, Cassandra, Hive and any Hadoop InputFormat; it currently supports Scala, Java and Python.

It is designed to perform general data processing (like MapReduce), but one of the exciting things about Spark is that it can also handle newer workloads such as streaming data, interactive queries and machine learning, making it a good match for Internet of Things applications, which is why IBM is so keen to go big on supporting the project.
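
To make the “general data processing (like MapReduce)” point concrete, here is the canonical Spark word count in Scala; the HDFS input and output paths are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))

    // The same map/shuffle/reduce shape as classic MapReduce, in a few lines.
    sc.textFile("hdfs:///input/corpus.txt")         // hypothetical input path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///output/word-counts") // hypothetical output path

    sc.stop()
  }
}
```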

The company said the technology brings huge advances when processing massive datasets generated by Internet of Things devices, improving the performance of data-dependent apps.

“IBM has been a decades long leader in open source innovation. We believe strongly in the power of open source as the basis to build value for clients, and are fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way,” said Beth Smith, general manager, analytics platform, IBM Analytics.

“Our clients will benefit as we help them embrace Spark to advance their own data strategies to drive business transformation and competitive differentiation,” Smith said.

In addition to joining the Spark community, IBM said it would build the technology into the majority of its big data offerings and offer Spark-as-a-Service on Bluemix. It also said it will open source its IBM SystemML machine learning technology and collaborate with Databricks, a Spark-as-a-Service provider, to advance Spark’s machine learning capabilities.