Category Archives: Apache Hadoop

Real-Time Processing Solutions for Big Data Application Stacks – Integration of GigaSpaces XAP, Cassandra DB

Guest post by Yaron Parasol, Director of Product Management, GigaSpaces

GigaSpaces Technologies has developed infrastructure solutions for more than a decade and in recent years has been enabling Big Data solutions as well. The company’s latest platform release – XAP 9.5 – helps organizations that need to process Big Data fast. XAP harnesses the power of in-memory computing to enable enterprise applications to function better, whether in terms of speed, reliability, scalability or other business-critical requirements. With the new version of XAP, increased focus has been placed on real-time processing of big data streams, through improved data grid performance, better manageability and end-user visibility, and integration with other parts of your Big Data stack – in this version, integration with Cassandra.

XAP-Cassandra Integration

To build a real-time Big Data application, you need to consider several factors.

First – can you process your Big Data in true real time, in order to get instant, relevant business insights? Batch processing can take too long for transactional data, though that doesn't mean you no longer rely on batch processing in many ways.

Second – can you preprocess and transform your data as it flows into the system, so that the relevant data is made digestible and routed to your batch processor, making batch processing more efficient as well? Finally, you also want to make sure the huge volumes of data you send to long-term storage are available for both batch processing and ad hoc querying, as needed.

XAP and Cassandra together can make all of the above happen. With built-in event processing capabilities, full data consistency, and high-speed in-memory data access and local caching, XAP handles the real-time side with ease, while Cassandra is well suited to storing massive volumes of data, querying it ad hoc, and processing it offline.
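
To make the real-time side concrete, below is a minimal sketch of an XAP event-processing listener: it takes raw events as they stream into the data grid, enriches them in memory, and writes the result back to the grid, with asynchronous persistence to Cassandra left to XAP's write-behind mirror (configured separately, not shown). The annotation and class names follow the OpenSpaces API as we understand it, and the "MarketTick" document type and its properties are invented for illustration.

    // Illustrative sketch only: a polling event container that takes raw "MarketTick"
    // documents from the XAP data grid as they arrive, enriches them in memory, and
    // writes the processed result back to the grid. Asynchronous persistence of the
    // processed entries to Cassandra is handled by XAP's write-behind mirror, which
    // is configured separately and not shown here.
    import org.openspaces.core.GigaSpace;
    import org.openspaces.events.EventDriven;
    import org.openspaces.events.EventTemplate;
    import org.openspaces.events.adapter.SpaceDataEvent;
    import org.openspaces.events.polling.Polling;

    import com.gigaspaces.document.SpaceDocument;

    @EventDriven
    @Polling
    public class TickProcessor {

        // Only unprocessed ticks are handed to this listener.
        @EventTemplate
        public SpaceDocument unprocessed() {
            SpaceDocument template = new SpaceDocument("MarketTick");
            template.setProperty("processed", false);
            return template;
        }

        // Called for each matching document taken from the space.
        @SpaceDataEvent
        public SpaceDocument process(SpaceDocument tick, GigaSpace gigaSpace) {
            double price = (Double) tick.getProperty("price");
            tick.setProperty("vwap", price);   // placeholder enrichment
            tick.setProperty("processed", true);
            return tick;                       // written back to the space
        }
    }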

Several hurdles had to be overcome to make the integration truly seamless and easy for end users. XAP's document-oriented model had to be mapped onto Cassandra's columnar data model so that data can move between the two smoothly. Consistency also differs: XAP offers immediate consistency together with high performance, while Cassandra trades consistency against performance. With Cassandra as the Big Data store behind XAP's processing tier, both consistency and performance are maintained.

Together with the Cassandra integration, XAP offers further enhancements. These include:

Data Grid Enhancements

To further optimize your queries over the data grid, XAP now includes compound indices, which enable you to index multiple attributes together. This way the grid scans one index instead of several to find query result candidates faster.
On the query side, new projection support enables you to query only the attributes you're interested in rather than whole objects/documents. Together these optimizations dramatically reduce latency and increase the throughput of the data grid in common scenarios.
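
As a hedged sketch of how these two features look from the Java API (the annotation and method names reflect our reading of the XAP 9.5 API, and the Trade class is invented for illustration):

    // Hedged sketch (annotation and method names follow the XAP 9.5 Java API as we
    // understand it; the Trade class is invented for illustration).
    import com.gigaspaces.annotation.pojo.CompoundSpaceIndex;
    import com.gigaspaces.annotation.pojo.CompoundSpaceIndexes;
    import com.gigaspaces.annotation.pojo.SpaceId;
    import com.j_spaces.core.client.SQLQuery;

    // A compound index over (symbol, venue) lets the grid answer queries that filter
    // on both attributes by scanning a single index instead of intersecting two.
    @CompoundSpaceIndexes({ @CompoundSpaceIndex(paths = { "symbol", "venue" }) })
    class Trade {
        private String id;
        private String symbol;
        private String venue;
        private Double price;

        @SpaceId(autoGenerate = true)
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getSymbol() { return symbol; }
        public void setSymbol(String s) { this.symbol = s; }
        public String getVenue() { return venue; }
        public void setVenue(String v) { this.venue = v; }
        public Double getPrice() { return price; }
        public void setPrice(Double p) { this.price = p; }
    }

    class ProjectionExample {
        static SQLQuery<Trade> lastPrices() {
            SQLQuery<Trade> query =
                new SQLQuery<Trade>(Trade.class, "symbol = ? AND venue = ?", "GOOG", "NASDAQ");
            // Projection: return only the price attribute instead of whole Trade objects.
            query.setProjections("price");
            return query;
        }
    }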

The enhanced change API includes the ability to change multiple objects using a SQL query or POJO template. Replication of change operations over the WAN has also been streamlined, and it now replicates only the change commands instead of whole objects. Finally, a hook in the Space Data Persister interface enables you to optimize your DB SQL statements or ORM configuration for partial updates.
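
For instance, a partial update applied to every entry matching a SQL query might look roughly like the following (a sketch assuming the GigaSpaces ChangeSet client API; the "Order" type and its fields are illustrative only):

    // Hedged sketch: change all matching entries in place. Only the field-level change
    // commands are replicated to backups, WAN gateways and the mirror, not whole objects.
    import org.openspaces.core.GigaSpace;

    import com.gigaspaces.client.ChangeSet;
    import com.gigaspaces.document.SpaceDocument;
    import com.j_spaces.core.client.SQLQuery;

    public class ChangeApiExample {
        public static void rerouteStaleOrders(GigaSpace gigaSpace) {
            SQLQuery<SpaceDocument> stale =
                new SQLQuery<SpaceDocument>("Order", "status = ?", "PENDING");
            gigaSpace.change(stale, new ChangeSet()
                    .set("status", "REROUTED")
                    .increment("retryCount", 1));
        }
    }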

Visibility and Manageability Enhancements

A new web UI gives XAP users deep visibility into important aspects of the data grid, including event containers, client-side caches, and multi-site replication gateways.

Managing a low-latency, high-throughput, distributed application is always a challenge due to the number of moving parts. The new, enhanced UI helps users maintain agility when managing their applications.

The result is a powerful platform that offers the best of all worlds, while maintaining ease of use and simplicity.

Yaron Parasol is Director of Product Management for GigaSpaces, a provider of end-to-end scaling solutions for distributed, mission-critical application environments, and cloud enabling technologies.

Quest Software Announces Hadoop-Centric Software Analytics

 

Quest Software, Inc. (now part of Dell) announced three significant product releases today aimed at helping customers more quickly adopt Hadoop and exploit their Big Data:

  • Kitenga Analytics – Based on the recent acquisition of Kitenga,
    Quest Software now enables customers to analyze structured,
    semi-structured and unstructured data stored in Hadoop. Available
    immediately, Kitenga Analytics delivers sophisticated capabilities,
    including text search, machine learning, and advanced visualizations,
    all from an easy-to-use interface that does not require understanding
    of complex programming or the Hadoop stack itself. With Kitenga
    Analytics and the Quest Toad®
    Business Intelligence Suite, an organization has a complete
    self-service analysis environment that empowers business and systems
    analysts across a variety of backgrounds and job roles.
  • Toad for Hadoop – Quest Software expands support for Hadoop in
    the upcoming release of Toad® for Hadoop. With more than two million
    users, and ranked No. 1 in Database Development and Optimization for
    three consecutive years by IDC [1], Toad has been enhanced to help
    database developers and DBAs bridge the gap between what they already
    know about relational database management systems and the new world of
    Hadoop. Toad will provide query and data management functionality for
    Hadoop, as well as an interface to perform data transfers using the
    Quest Hadoop Connector. Like Toad for any other platform, Toad for
    Hadoop makes the lives of developers, DBAs, and analysts easier and
    more productive.
  • SharePlex with Hadoop Capabilities – Quest Software adds Hadoop
    capabilities to the next release of SharePlex® for Oracle,
    its robust, high-performance Oracle-to-Oracle database replication
    technology. For enterprise mission-critical systems that must always
    be available, the new release will seamlessly create multiple copies
    of Oracle data for movement simultaneously to both another Oracle
    environment and Hadoop, with no downtime. Customers can choose how
    they optimize Oracle and Hadoop environments based on data
    requirements, such as high availability; analytics and reporting;
    image and text processing; and general archiving. The architecture
    allows for scalable data distribution on-premise, in the cloud, and
    across multiple data centers without a single point of failure.


Google’s Dremel is the Holy Grail of Big Data: Really Big, Really Fast, Really Simple

First Google created, and wrote papers on, MapReduce and the Google File System, which got reverse-engineered into Hadoop, the current state of the art for Big Data.

But Google has moved on to Dremel, and the rest of the world is slow in catching up.

With BigQuery, Google offers a simple-to-use service that doesn't sacrifice Big Data scale or speed.

As Armando Fox, a professor of computer science at the University of California, Berkeley, who specializes in these sorts of data-center-sized software platforms, put it in a Wired article:

“This is unprecedented.” Hadoop is the centerpiece of the “Big Data” movement, a widespread effort to build tools that can analyze extremely large amounts of information. But with today’s Big Data tools, there’s often a drawback: you can’t quite analyze the data with the speed and precision you expect from traditional data analysis or “business intelligence” tools. With Dremel, Fox says, you can.

“They managed to combine large-scale analytics with the ability to really drill down into the data, and they’ve done it in a way that I wouldn’t have thought was possible,” he says. “The size of the data and the speed with which you can comfortably explore the data is really impressive. People have done Big Data systems before, but before Dremel, no one had really done a system that was that big and that fast.

“Usually, you have to do one or the other. The more you do one, the more you have to give up on the other. But with Dremel, they did both.”


NextBio, Intel Collaborate to Optimize Hadoop for Genomics Big Data

NextBio and Intel announced today a collaboration aimed at optimizing and stabilizing the Hadoop stack and advancing the use of Big Data technologies in genomics. As a part of this collaboration, the NextBio and Intel engineering teams will apply experience they have gained from NextBio’s use of Big Data technologies to the improvement of HDFS, Hadoop, and HBase. Any enhancements that NextBio engineers make to the Hadoop stack will be contributed to the open-source community. Intel will also showcase NextBio’s use of Big Data.

“NextBio is positioned at the intersection of Genomics and Big Data. Every day we deal with the three V’s (volume, variety, and velocity) associated with Big Data – We, our collaborators, and our users are adding large volumes of a variety of molecular data to NextBio at an increasing velocity,” said Dr. Satnam Alag, chief technology officer and vice president of engineering at NextBio. “Without the implementation of our algorithms in the MapReduce framework, operational expertise in HDFS, Hadoop, and HBase, and investments in building our secure cloud-based infrastructure, it would have been impossible for us to scale cost-effectively to handle this large-scale data.”

“Intel is firmly committed to the wide adoption and use of Big Data technologies such as HDFS, Hadoop, and HBase across all industries that need to analyze large amounts of data,” said Girish Juneja, CTO and General Manager, Big Data Software and Services, Intel. “Complex data requiring compute-intensive analysis needs not only Big Data open source, but a combination of hardware and software management optimizations to help deliver needed scale with a high return on investment. Intel is working closely with NextBio to deliver this showcase reference to the Big Data community and life science industry.”

“The use of Big Data technologies at NextBio enables researchers and clinicians to mine billions of data points in real-time to discover new biomarkers, clinically assess targets and drug profiles, optimally design clinical trials, and interpret patient molecular data,” Dr. Alag continued. “NextBio has invested significantly in the use of Big Data technologies to handle the tsunami of genomic data being generated and its expected exponential growth. As we further scale our infrastructure to handle this growing data resource, we are excited to work with Intel to make the Hadoop stack better and give back to the open-source community.”


Actuate and Pervasive Software Team to Provide Interactive Visualization of Big Data Analytics

 

Actuate Corporation today announced an alliance with Pervasive Software Inc. that will enable business data analysts to rapidly review, prepare and analyze Big Data, and to display intuitive data visualizations that help users make efficient business decisions.

By speeding Big Data-based decision making, powering predictive analytics and decreasing capital and operating costs, ActuateOne and Pervasive RushAnalyzer “will make Big Data analytics and powerful visualizations available to business users in any industry and to the BIRT developer community.”

“Pervasive RushAnalyzer, the first predictive analytics product to run natively on Hadoop, enables users to rapidly transform and analyze terabytes of data on commodity hardware, and ActuateOne provides the advanced visualization capabilities to support insights and more productive conclusions,” said Mike Hoskins, CTO and general manager of Pervasive, Big Data Products and Solutions. “Pervasive’s seamless integration with Actuate, via BIRT, puts advanced Big Data analytic insights and actionable intelligence into the hands of multiple roles within an organization.”

“Big Data analytics has traditionally been the realm of data scientists,” said Nobby Akiha, Senior Vice President of Marketing for Actuate. “By teaming with Pervasive, we are changing the game to ensure business users are in the driver’s seat to analyze Big Data sources so that they can operationalize and deliver insights to everyday users.”

ActuateOne – an integrated suite of standard and cloud software built around BIRT – enables easy visualization of data trends through customizable BIRT-based dashboards and Google-standard plug-and-play gadgets. Pervasive RushAnalyzer lets data analysts build and deploy predictive analytics solutions on multiple platforms, including Hadoop clusters and high-performance servers, to rapidly discover data patterns, build operational analytics and deliver predictive analytics. The drag-and-drop graphical interface speeds data preparation with direct access to multiple databases and file formats, as well as a prebuilt library of data mining and analytic operators, leading to simpler data manipulation, mining and visualization.


Qubole Exits Stealth Mode, Introduces Auto-Scaling Big Data Platform

Qubole exited stealth mode today to introduce its auto-scaling Big Data platform, “combining the power of Apache Hadoop and Hive with the simplicity of a Cloud platform in order to accelerate time-to-value from Big Data.” Qubole, a Silver Sponsor of next week’s Hadoop Summit 2012 conference, also invites business analysts, data scientists, and data engineers to participate in the Qubole early access program.

While best known as creators of Apache Hive and long-time contributors to Apache Hadoop, Qubole’s founders, Ashish Thusoo and Joydeep Sen Sarma, also managed the Facebook data infrastructure team responsible for nearly 25PB of compressed data. The data services built by this team are used across business and engineering teams that submit tens of thousands of jobs, queries and ad hoc analysis requests every day. Thusoo and Sen Sarma applied that experience to create the industry’s next-generation Big Data platform for the cloud. With Qubole, organizations can begin uncovering new insights from their structured and unstructured data sources within minutes.

“We believe a new approach is needed – one that hides the complexity associated with storing and managing data and instead provides a fast, easy path to analysis and insights for business analysts, data scientists and data engineers,” said Joydeep Sen Sarma, Co-Founder of Qubole. “We gained significant experience helping a web-scale company build and manage a complex Big Data platform. We don’t want our customers to worry about choosing a flavor of Hadoop, or spinning up clusters, or trying to optimize performance. Qubole will manage all of that so that users can focus on their data and their algorithms.”

Qubole Auto-Scaling Big Data Platform for the Cloud Benefits Include:

  • Fastest Path to Big Data Analytics –
    Qubole handles all infrastructure complexities behind the scenes so
    users can begin doing ad hoc analysis and creating data pipelines
    using SQL and MapReduce within minutes.
  • Scalability “On the Fly” – Qubole
    features the industry’s first auto-scaling Hadoop clusters so users
    can get the right amount of computing power for each and every project.
  • Fast Query Authoring Tools – Qubole
    provides fast access to sample data so that queries can be authored
    and validated quickly.
  • Fastest Hadoop and Hive Service in the Cloud
    – Using advanced caching and query acceleration techniques, Qubole has
    demonstrated query speeds up to five times faster than other
    Cloud-based Hadoop solutions.
  • Quick Connection to Data – Qubole
    provides mechanisms to work with data sets stored in any format in
    Amazon S3. It also allows users to easily export data to S3 or to
    databases like MySQL.
  • Integrated Data Workflow Engine – Qubole
    provides mechanisms to easily create data pipelines so users can run
    their queries periodically with a high degree of reliability.
  • Enhanced Debugging Abilities – Qubole
    provides features that help users get to errors in Hadoop/Hive jobs
    fast, saving time when debugging queries.
  • Easy Collaboration with Peers – Qubole’s
    Cloud-based architecture makes it ideal for analysts working in a
    geographically distributed environment to share information and
    analysis.

“Companies are increasingly moving to the Cloud and for good reason. Applications hosted in the Cloud are much easier to use and manage, especially for companies without very large IT organizations. While Software as a Service (SaaS) is now the standard for many different types of applications, it has not yet been made easy for companies to use the Cloud to convert their ever-increasing volume and variety of data into useful business and product insights. Qubole makes it much easier and faster for companies to analyze and process more of their Big Data, and they will benefit tremendously,” said Ashish Thusoo, Co-Founder of Qubole.

To join the early access program, please visit www.qubole.com. Qubole is looking to add a select number of companies for early access to its service, with the intention of making the service more generally available in Q4 2012. People interested in seeing a demo of the platform can visit Qubole at the Hadoop Summit June 13 – 14 at the San Jose Convention Center, kiosk #B11.


Actuate, Hortonworks Collaborate to Visualize Big Data

Actuate Corporation, the open source Business Intelligence (BI) vendor behind BIRT, today announced a collaboration between Actuate BIRT and the Hortonworks Data Platform to enable Big Data visualization. The Hortonworks Data Platform is a completely open source, tightly integrated and tested distribution of Apache Hadoop, backed by extensive customer support and training.

The ActuateOne integrated product suite, built around BIRT, uses native Hive query access to leverage MapReduce functions and extract data from Hadoop, pulling those data sets into customizable BIRT-based dashboards and scorecards for interactive visualization and analysis.
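
As a rough illustration of what native Hive access means in practice (this is not Actuate's integration code; the driver class, JDBC URL, host and table names are assumptions and vary by Hive version and deployment), a client can issue HiveQL that Hive compiles into MapReduce jobs over data stored in HDFS:

    // Illustrative only: how a BI layer might pull an aggregate out of Hadoop through
    // Hive, which compiles the SQL into MapReduce jobs. Driver class and JDBC URL vary
    // by Hive version; host, port, and table names here are made up.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveDashboardFeed {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection conn =
                DriverManager.getConnection("jdbc:hive://hive-server:10000/default", "", "");
            Statement stmt = conn.createStatement();
            // Hive turns this GROUP BY into one or more MapReduce jobs over HDFS data.
            ResultSet rs = stmt.executeQuery(
                "SELECT region, COUNT(*) AS orders FROM web_orders GROUP BY region");
            while (rs.next()) {
                System.out.println(rs.getString("region") + "\t" + rs.getLong("orders"));
            }
            conn.close();
        }
    }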

“We have dedicated significant resources to make Apache Hadoop more robust and easier to integrate, extend, deploy and use,” said John Kreisa, VP of Marketing at Hortonworks. “Our partnership with open source BI leader Actuate enables more users to cost effectively analyze vast amounts of data stored in Hadoop using open source technologies.”

“Actuate’s collaboration with Hortonworks will ease the transition from Big Data hype to Big Data usefulness,” said Nobby Akiha, Senior Vice President of Marketing at Actuate. “We believe the key to success with Big Data lies in building the right infrastructure to manage it. Teaming with Hortonworks will further our goal of helping organizations figure out how best to leverage and integrate Big Data sources to enable better decision making.”

Large organizations are increasingly turning to Apache Hadoop for the storage and management of massive amounts of data and thus need scalable ways to explore, analyze and visualize the insights stored within it. The combination of the Hortonworks Data Platform’s distributed processing of Hadoop data sources of any size, with Actuate’s scalable infrastructure and intuitive data visualization capabilities, enables organizations to more effectively operationalize Big Data for thousands of customers, partners and employees.