Category Archives: Hadoop

Google reveals Bigtable, a NoSQL service based on what it uses internally

Google has punted another big data service, a variant of what it uses internally, into the wild

Search giant Google announced Bigtable, a fully managed NoSQL database service the company said combines its own internal database technology with open source Apache HBase APIs.

The company that helped give birth to MapReduce and its sister Hadoop is now making available the same non-relational database tech driving a number of its services including Google Search, Gmail, and Google Analytics.

Google said Bigtable is extensible through the HBase API (which provides real-time read/write access capabilities).
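To make the HBase-style access model a little more concrete, here is a minimal in-memory sketch of a wide-column table: rows are kept in key order, each cell is addressed by a "family:qualifier" column name, and every write is stored as a timestamped version. The class name `ToyWideColumnTable` and its methods are hypothetical and invented for this illustration; this is not the HBase client API or Google's implementation, only a toy model of the idea.

```python
import time
from collections import defaultdict


class ToyWideColumnTable:
    """Toy model only: row key -> column -> list of (timestamp, value)."""

    def __init__(self):
        # rows[row_key][column] holds versions sorted oldest-first.
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        """Write one cell version; the timestamp defaults to the current time."""
        ts = time.time() if timestamp is None else timestamp
        versions = self._rows[row_key][column]
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row_key, column):
        """Return the newest value for a cell, or None if the cell is absent."""
        row = self._rows.get(row_key)
        if not row or column not in row:
            return None
        return row[column][-1][1]

    def scan(self, start_key, stop_key):
        """Yield (row_key, {column: newest value}) for start_key <= key < stop_key."""
        for row_key in sorted(self._rows):
            if start_key <= row_key < stop_key:
                newest = {col: vers[-1][1] for col, vers in self._rows[row_key].items()}
                yield row_key, newest


table = ToyWideColumnTable()
table.put("user#001", "profile:name", "Ada", timestamp=1)
table.put("user#001", "profile:name", "Ada L.", timestamp=2)  # newer version wins on read
table.put("user#002", "profile:name", "Grace", timestamp=1)
table.put("video#9", "meta:title", "Intro", timestamp=1)

print(table.get("user#001", "profile:name"))
print(list(table.scan("user#", "user$")))  # only rows whose key starts with "user#"
```

Because rows are ordered by key, a range scan over a key prefix touches only neighbouring rows, which is why row-key design matters so much for ingestion and serving workloads of the kind the article describes.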

“Google Cloud Bigtable excels at large ingestion, analytics, and data-heavy serving workloads. It’s ideal for enterprises and data-driven organizations that need to handle huge volumes of data, including businesses in the financial services, AdTech, energy, biomedical, and telecommunications industries,” explained Cory O’Connor, product manager at Google.

O’Connor said the service, which is now in beta, can deliver over two times the performance of its direct competition (which will likely depend on the use case), and has a TCO of less than half that of its direct competitors.

“As businesses become increasingly data-centric, and with the coming age of the Internet of Things, enterprises and data-driven organizations must become adept at efficiently deriving insights from their data. In this environment, any time spent building and managing infrastructure rather than working on applications is a lost opportunity.”

Bigtable is Google’s latest move to bolster its data services, a central pillar of its strategy to attract new customers to its growing platform. Last month the company announced the beta launch of Google Cloud Dataflow, a Java-based service that lets users build, deploy and run data processing pipelines for other applications like ETL, analytics, real-time computation, and process orchestration, while abstracting away all the other infrastructure bits like cluster management.

Hortonworks buys SequenceIQ to speed up cloud deployment of Hadoop

SequenceIQ will help boost Hortonworks’ position in the Hadoop ecosystem

Hortonworks has acquired SequenceIQ, a Hungary-based startup delivering infrastructure agnostic tools to improve Hadoop deployments. The company said the move will bolster its ability to offer speedy cloud deployments of Hadoop.

SequenceIQ’s flagship offering, Cloudbreak, is a Hadoop as a Service API for multi-tenant clusters that applies some of the capabilities of Blueprint (which lets you create a Hadoop cluster without having to use the Ambari Cluster Install Wizard) and Periscope (autoscaling for Hadoop YARN) to help speed up deployment of Hadoop on different cloud infrastructures.

The two companies have partnered extensively in the Hadoop community, and Hortonworks said the move will enhance its position among a growing number of Hadoop incumbents.

“This acquisition enriches our leadership position by providing technology that automates the launching of elastic Hadoop clusters with policy-based auto-scaling on the major cloud infrastructure platforms including Microsoft Azure, Amazon Web Services, Google Cloud Platform, and OpenStack, as well as platforms that support Docker containers. Put simply, we now provide our customers and partners with both the broadest set of deployment choices for Hadoop and quickest and easiest automation steps,” Tim Hall, vice president of product management at Hortonworks, explained.

“As Hortonworks continues to expand globally, the SequenceIQ team further expands our European presence and firmly establishes an engineering beachhead in Budapest. We are thrilled to have them join the Hortonworks team.”
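As a rough illustration of what "policy-based auto-scaling" can mean in practice, the sketch below sizes a cluster from a utilization reading and a small policy object. The names `ScalingPolicy` and `desired_nodes`, and every threshold used, are hypothetical choices made for this example; they are not taken from Cloudbreak or Periscope, whose actual policies are not described in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All fields are hypothetical knobs chosen for this illustration.
    min_nodes: int
    max_nodes: int
    scale_up_above: float    # utilization (0..1) above which nodes are added
    scale_down_below: float  # utilization (0..1) below which nodes are removed
    step: int = 1            # nodes added or removed per decision


def desired_nodes(current_nodes: int, utilization: float, policy: ScalingPolicy) -> int:
    """Return the node count the policy asks for, clamped to its bounds."""
    target = current_nodes
    if utilization > policy.scale_up_above:
        target = current_nodes + policy.step
    elif utilization < policy.scale_down_below:
        target = current_nodes - policy.step
    return max(policy.min_nodes, min(policy.max_nodes, target))


policy = ScalingPolicy(min_nodes=3, max_nodes=10,
                       scale_up_above=0.80, scale_down_below=0.30, step=2)

print(desired_nodes(4, 0.92, policy))   # busy cluster: grows to 6
print(desired_nodes(4, 0.10, policy))   # idle cluster: shrinks, but stops at min_nodes (3)
print(desired_nodes(10, 0.95, policy))  # already at max_nodes: stays at 10
```

Keeping the policy as plain data, separate from the decision function, is one way to let the same rules apply across different cloud providers, since only the code that actually adds or removes nodes would need to be provider-specific.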

Hall said the company also plans to contribute the Cloudbreak code back into the Apache Software Foundation sometime this year, though whether it will do so as part of an existing project or as a standalone one has yet to be decided.

Hortonworks’ bread and butter is in supporting enterprise adoption of Hadoop and bringing the services component to the table, but it’s interesting to see the company commit to feeding the Cloudbreak code – which could, at least temporarily, give it a competitive edge – back into the ecosystem.

“This move is in line with our belief that the fastest path to innovation is through open source developed within an open community,” Hall explained.

The big data M&A space has seen more consolidation over the past few months, with Hitachi Data Systems acquiring big data and analytics specialist Pentaho and Infosys acquiring Panaya for $200m.

Quest Software Announces Hadoop-Centric Software Analytics

 

Quest Software, Inc. (now part of Dell) announced three significant product releases today aimed at helping customers more quickly adopt Hadoop and exploit their Big Data:

  • Kitenga Analytics – Based on the recent acquisition of Kitenga,
    Quest Software now enables customers to analyze structured,
    semi-structured and unstructured data stored in Hadoop. Available
    immediately, Kitenga Analytics delivers sophisticated capabilities,
    including text search, machine learning, and advanced visualizations,
    all from an easy-to-use interface that does not require understanding
    of complex programming or the Hadoop stack itself. With Kitenga
    Analytics and the Quest Toad®
    Business Intelligence Suite, an organization has a complete
    self-service analysis environment that empowers business and systems
    analysts across a variety of backgrounds and job roles.
  • Toad for Hadoop – Quest Software expands support for Hadoop in
    the upcoming release of Toad® for Hadoop. With more than two million
    users, and ranked No. 1 in Database Development and Optimization for
    three consecutive years by IDC [1], Toad has been enhanced to help
    database developers and DBAs bridge the gap between what they already
    know about relational database management systems and the new world of
    Hadoop. Toad will provide query and data management functionality for
    Hadoop, as well as an interface to perform data transfers using the
    Quest Hadoop Connector. Like Toad for any other platform, Toad for
    Hadoop makes the lives of developers, DBAs, and analysts easier and
    more productive.
  • SharePlex with Hadoop Capabilities – Quest Software adds Hadoop
    capabilities to the next release of SharePlex® for Oracle,
    its robust, high-performance Oracle-to-Oracle database replication
    technology. For enterprise mission-critical systems that must always
    be available, the new release will seamlessly create multiple copies
    of Oracle data for movement simultaneously to both another Oracle
    environment and Hadoop, with no downtime. Customers can choose how
    they optimize Oracle and Hadoop environments based on data
    requirements, such as high availability; analytics and reporting;
    image and text processing; and general archiving. The architecture
    allows for scalable data distribution on-premise, in the cloud, and
    across multiple data centers without a single point of failure.


Google’s Dremel is the Holy Grail of Big Data: Really Big, Really Fast, Really Simple

First Google created, and wrote papers on, the Google File System and MapReduce, which got reverse-engineered into Hadoop, the current state of the art for Big Data.

But Google has moved on to Dremel, and the rest of the world is slow in catching up.

With BigQuery, Google offers a simple-to-use service that doesn’t sacrifice Big Data scale OR speed.

As Armando Fox, a professor of computer science at the University of California, Berkeley who specializes in these sorts of data-center-sized software platforms, put it in a Wired article:

“This is unprecedented. Hadoop is the centerpiece of the “Big Data” movement, a widespread effort to build tools that can analyze extremely large amounts of information. But with today’s Big Data tools, there’s often a drawback. You can’t quite analyze the data with the speed and precision you expect from traditional data analysis or “business intelligence” tools. But with Dremel, Fox says, you can.

“They managed to combine large-scale analytics with the ability to really drill down into the data, and they’ve done it in a way that I wouldn’t have thought was possible,” he says. “The size of the data and the speed with which you can comfortably explore the data is really impressive. People have done Big Data systems before, but before Dremel, no one had really done a system that was that big and that fast.

“Usually, you have to do one or the other. The more you do one, the more you have to give up on the other. But with Dremel, they did both.”


NextBio, Intel Collaborate to Optimize Hadoop for Genomics Big Data

NextBio and Intel announced today a collaboration aimed at optimizing and stabilizing the Hadoop stack and advancing the use of Big Data technologies in genomics. As a part of this collaboration, the NextBio and Intel engineering teams will apply experience they have gained from NextBio’s use of Big Data technologies to the improvement of HDFS, Hadoop, and HBase. Any enhancements that NextBio engineers make to the Hadoop stack will be contributed to the open-source community. Intel will also showcase NextBio’s use of Big Data.

“NextBio is positioned at the intersection of Genomics and Big Data. Every day we deal with the three V’s (volume, variety, and velocity) associated with Big Data – we, our collaborators, and our users are adding large volumes of a variety of molecular data to NextBio at an increasing velocity,” said Dr. Satnam Alag, chief technology officer and vice president of engineering at NextBio. “Without the implementation of our algorithms in the MapReduce framework, operational expertise in HDFS, Hadoop, and HBase, and investments in building our secure cloud-based infrastructure, it would have been impossible for us to scale cost-effectively to handle this large-scale data.”

“Intel is firmly committed to the wide adoption and use of Big Data technologies such as HDFS, Hadoop, and HBase across all industries that need to analyze large amounts of data,” said Girish Juneja, CTO and General Manager, Big Data Software and Services, Intel. “Complex data requiring compute-intensive analysis needs not only Big Data open source, but a combination of hardware and software management optimizations to help deliver needed scale with a high return on investment. Intel is working closely with NextBio to deliver this showcase reference to the Big Data community and life science industry.”

“The use of Big Data technologies at NextBio enables researchers and clinicians to mine billions of data points in real-time to discover new biomarkers, clinically assess targets and drug profiles, optimally design clinical trials, and interpret patient molecular data,” Dr. Alag continued. “NextBio has invested significantly in the use of Big Data technologies to handle the tsunami of genomic data being generated and its expected exponential growth. As we further scale our infrastructure to handle this growing data resource, we are excited to work with Intel to make the Hadoop stack better and give back to the open-source community.”


Qubole Exits Stealth Mode, Introduces Auto-Scaling Big Data Platform

Qubole exited stealth mode today to introduce its auto-scaling Big Data platform, “combining the power of Apache Hadoop and Hive with the simplicity of a Cloud platform in order to accelerate time-to-value from Big Data.” Qubole, a Silver Sponsor of next week’s Hadoop Summit 2012 conference, also invites business analysts, data scientists, and data engineers to participate in the Qubole early access program.

While best known as creators of Apache Hive and long-time contributors to Apache Hadoop, Qubole’s founders Ashish Thusoo and Joydeep Sen Sarma also managed the Facebook data infrastructure team that was responsible for nearly 25PB of compressed data. The data services built by this team are used across business and engineering teams who submit tens of thousands of jobs, queries and ad hoc analysis requests every day. Thusoo and Sen Sarma applied their experiences and learnings to create the industry’s next generation big data platform for the cloud. With Qubole, organizations can literally begin uncovering new insights from their structured and unstructured data sources within minutes.

“We believe a new approach is needed – one that hides the complexity associated with storing and managing data and instead provides a fast, easy path to analysis and insights for business analysts, data scientists and data engineers,” said Joydeep Sen Sarma, Co-Founder of Qubole. “We gained significant experience helping a web-scale company build and manage a complex Big Data platform. We don’t want our customers to worry about choosing a flavor of Hadoop, or spinning up clusters, or trying to optimize performance. Qubole will manage all of that so that users can focus on their data and their algorithms.”

Qubole Auto-Scaling Big Data Platform for the Cloud Benefits Include:

  • Fastest Path to Big Data Analytics –
    Qubole handles all infrastructure complexities behind the scenes so
    users can begin doing ad hoc analysis and creating data pipelines
    using SQL and MapReduce within minutes.
  • Scalability “On the Fly” – Qubole
    features the industry’s first auto-scaling Hadoop clusters so users
    can get the right amount of computing power for each and every project.
  • Fast Query Authoring Tools – Qubole
    provides fast access to sample data so that queries can be authored
    and validated quickly.
  • Fastest Hadoop and Hive Service in the Cloud
    – Using advanced caching and query acceleration techniques, Qubole has
    demonstrated query speeds up to five times faster than other
    Cloud-based Hadoop solutions.
  • Quick Connection to Data – Qubole
    provides mechanisms to work with data sets stored in any format in
    Amazon S3. It also allows users to easily export data to S3 or to
    databases like MySQL.
  • Integrated Data Workflow Engine – Qubole
    provides mechanisms to easily create data pipelines so users can run
    their queries periodically with a high degree of reliability.
  • Enhanced Debugging Abilities – Qubole
    provides features that help users get to errors in Hadoop/Hive jobs
    fast, thus saving time in debugging queries.
  • Easy Collaboration with Peers – Qubole’s
    Cloud-based architecture makes it ideal for analysts working in a
    geographically distributed environment to share information and
    analysis.

“Companies are increasingly moving to the Cloud and for good reason. Applications hosted in the Cloud are much easier to use and manage, especially for companies without very large IT organizations. While Software as a Service (SaaS) is now the standard for many different types of applications, it has not yet been made easy for companies to use the Cloud to convert their ever-increasing volume and variety of data into useful business and product insights. Qubole makes it much easier and faster for companies to analyze and process more of their Big Data, and they will benefit tremendously,” said Ashish Thusoo, Co-Founder of Qubole.

To join the early access program, please visit www.qubole.com. Qubole is looking to add a select number of companies for early access to its service, with the intention of making the service more generally available in Q4 2012. People interested in seeing a demo of the platform can visit Qubole at the Hadoop Summit June 13 – 14 at the San Jose Convention Center, kiosk #B11.


Lucid Imagination Combines Search, Analytics and Big Data to Tackle the Problem of Dark Data

Organizations today have little to no idea how much lost opportunity is hidden in the vast amounts of data they’ve collected and stored.  They have entered the age of total data overload driven by the sheer amount of unstructured information, also called “dark” data, which is contained in their stored audio files, text messages, e-mail repositories, log files, transaction applications, and various other content stores.  And this dark data is continuing to grow, far outpacing the ability of the organization to track, manage and make sense of it.

Lucid Imagination, a developer of search, discovery and analytics software based on Apache Lucene and Apache Solr technology, today unveiled LucidWorks Big Data. LucidWorks Big Data is the industry’s first fully integrated development stack that combines the power of multiple open source projects including Hadoop, Mahout, R and Lucene/Solr to provide search, machine learning, recommendation engines and analytics for structured and unstructured content in one complete solution available in the cloud.

With LucidWorks Big Data, Lucid Imagination equips technologists and business users with the ability to initially pilot Big Data projects utilizing technologies such as Apache Lucene/Solr, Mahout and Hadoop, in a cloud sandbox. Once satisfied, the project can remain in the cloud, be moved on premise or executed within a hybrid configuration.  This means they can avoid the staggering overhead costs and long lead times associated with infrastructure and application development lifecycles prior to placing their Big Data solution into production.

The product is now available in beta. To sign up for inclusion in the beta program, visit http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data.

How big is the problem of dark data? The total amount of digital data in the world will reach 2.7 zettabytes in 2012, a 48 percent increase from 2011.* Ninety percent of this data will be unstructured or “dark” data. Worldwide, 7.5 quintillion bytes of data, enough to fill over 100,000 Libraries of Congress, are generated every day. Yet that deep volume of data can help predict the weather, uncover consumer buying patterns or even ease traffic problems – if discovered and analyzed proactively.
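The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, a 365-day year, and the short-scale meaning of "quintillion" (10^18); the variable names are made up for this example, and the derived values are implications of the quoted numbers rather than figures stated in the announcement.

```python
ZETTABYTE = 10**21  # bytes in one decimal (SI) zettabyte; an assumption about units

total_2012_zb = 2.7        # quoted worldwide total for 2012
growth_over_2011 = 0.48    # quoted increase from 2011
unstructured_share = 0.90  # quoted share of unstructured ("dark") data
bytes_per_day = 7.5e18     # quoted daily volume: 7.5 quintillion bytes

# Derived values (not stated in the source, only implied by it).
implied_2011_zb = total_2012_zb / (1 + growth_over_2011)
dark_2012_zb = total_2012_zb * unstructured_share
yearly_from_daily_zb = bytes_per_day * 365 / ZETTABYTE

print(f"Implied 2011 total:      {implied_2011_zb:.2f} ZB")
print(f"Implied dark data, 2012: {dark_2012_zb:.2f} ZB")
print(f"Daily rate over a year:  {yearly_from_daily_zb:.1f} ZB")
```

Under these assumptions the daily rate adds up to roughly the quoted 2.7 zettabytes over a year, so the two headline numbers appear to describe the same quantity, with about 2.4 zettabytes of it being dark data.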

“We see a strong opportunity for search to play a key role in the future of data management and analytics,” said Matthew Aslett, research manager, data management and analytics, 451 Research. “Lucid’s Big Data offering, and its combination of large-scale data storage in Hadoop with Lucene/Solr-based indexing and machine-learning capabilities, provides a platform for developing new applications to tackle emerging data management challenges.”

Data analytics has traditionally been the domain of business intelligence technologies. Most of these tools, however, have been designed to handle structured data, such as that stored in SQL databases, and cannot easily tap into the broad range of data types that can be used in a Big Data application. With the announcement of LucidWorks Big Data, organizations will be able to utilize a single platform for their Big Data search, discovery and analytics needs. LucidWorks Big Data is the only complete platform that:

  • Combines the real time, ad hoc data accessibility of LucidWorks (Lucene/Solr) with compute and storage capabilities of Hadoop
  • Delivers commonly used analytic capabilities along with Mahout’s proven, scalable machine learning algorithms for deeper insight into both content and users
  • Tackles data, both big and small, with ease, seamlessly scaling while minimizing the impact of provisioning Hadoop, LucidWorks and other components
  • Supplies a single, coherent, secure and well documented REST API for both application integration and administration
  • Offers fault tolerance with data safety baked in
  • Provides choice and flexibility, via on premise, cloud hosted or hybrid deployment solutions
  • Is tested, integrated and fully supported by the world’s leading experts in open source search
  • Includes powerful tools for configuration, deployment, content acquisition, security, and search experience that are packaged in a convenient, well-organized application

Lucid Imagination’s Open Search Platform uncovers real-time insights from any enterprise data, whether structured in databases, unstructured in formats such as emails or social channels, or semi-structured from sources such as websites.  The company’s rich portfolio of enterprise-grade solutions is based on the same proven open source Apache Lucene/Solr technology that powers many of the world’s largest e-commerce sites. Lucid Imagination’s on-premise and cloud platforms are quicker to deploy, cost less than competing products and are more easily tailored to specific needs than business intelligence solutions because they leverage innovation from the open source community.

“We’re allowing a broad set of enterprises to test and implement data discovery and analysis projects that have historically been the province of large multinationals with large data centers. Cloud computing and LucidWorks Big Data finally level the field,” said Paul Doscher, CEO of Lucid Imagination. “Large companies, meanwhile, can use our Big Data stack to reduce the time and cost associated with evaluating and ultimately implementing big data search, discovery and analysis. It’s their data – now they can actually benefit from it.”