Category Archives: Database

Google launches Dataproc after successful beta trials

Google has announced that its big data analysis tool Dataproc is now on general release. The utility, which was one of the factors that persuaded Spotify to choose Google’s Cloud Platform over Amazon Web Services, is a managed tool based on the Hadoop and Spark open source big data software.

The service first became available in beta in September and was tested by global music streaming service Spotify, which was evaluating whether to move its music files away from its own data centres and into the public cloud – and which cloud service could support it. Dataproc in its beta form supported the MapReduce engine, the Pig platform for writing programs and the Hive data warehousing software. Google says it has added new features and sharpened the tool since then.

While in its beta testing phase, Cloud Dataproc added features such as property tuning, VM metadata and tagging, and cluster versioning. “In general availability new versions of Cloud Dataproc will be frequently released with new features, functions and software components,” said Google product manager James Malone.

Cloud Dataproc aims to minimise cost and complexity, which are the two major distractions of data processing, according to Malone.

“Spark and Hadoop should not break the bank and you should pay for what you actually use,” he said. As a result, Cloud Dataproc is priced at 1 cent per virtual CPU per hour. Billing is by the minute with a 10-minute minimum.
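As a rough illustration of how that pricing works, the short Python sketch below estimates the Dataproc premium for a hypothetical job. Only the 1 cent per vCPU-hour rate and the 10-minute minimum come from the announcement; the cluster size and runtimes are invented, and the separate Compute Engine VM charges are ignored.

```python
# Hypothetical Cloud Dataproc cost estimate. Only the $0.01 per vCPU-hour
# premium and the 10-minute billing minimum come from Google's announcement;
# the cluster size and job lengths are invented for illustration, and the
# underlying Compute Engine VM charges are not included.

DATAPROC_RATE_PER_VCPU_HOUR = 0.01  # announced Dataproc premium
MINIMUM_MINUTES = 10                # billing minimum per cluster run


def dataproc_premium(vcpus: int, runtime_minutes: float) -> float:
    """Return the Dataproc premium for one cluster run."""
    billed_minutes = max(runtime_minutes, MINIMUM_MINUTES)
    return vcpus * (billed_minutes / 60.0) * DATAPROC_RATE_PER_VCPU_HOUR


if __name__ == "__main__":
    # Example: a 10-node cluster of 4-vCPU machines running a 25-minute job.
    vcpus = 10 * 4
    print(f"Dataproc premium: ${dataproc_premium(vcpus, 25):.2f}")
    # A 3-minute job is still billed for the 10-minute minimum.
    print(f"Short job premium: ${dataproc_premium(vcpus, 3):.2f}")
```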

Analysis should run faster, Malone said, because clusters in Cloud Dataproc can start and stop operations in less than 90 seconds, where other big data systems take minutes. This can make analyses run up to ten times faster. The new general release of Cloud Dataproc will also be easier to manage, since clusters don’t need specialist administrators or software.

Cloud Dataproc also tackles two other data processing bugbears, scale and productivity, promised Malone. This tool complements a separate service called Google Cloud Dataflow for batch and stream processing. The underlying technology for the service has been accepted as an Apache incubator project under the name Apache Beam.

IBM Watson Health buys Truven Health Analytics for $2.6B

IBM Watson Health has announced an agreement to acquire cloud-based big data specialist Truven Health Analytics. The deal, valued at $2.6 billion, will give the IBM Watson Health portfolio an additional 8,500 clients and information on 215 million new patients, subject to the merger being concluded. Upon completion of due diligence, IBM will buy Truven from its current owner Veritas Capital.

Truven Health Analytics has a client list that includes US federal and state government agencies, employers, health plans, hospitals, clinicians and life sciences companies. The 215 million records of patient lives from Truven will be added to data from previous IBM Watson Health acquisitions of big data companies. These include 50 million patient case histories that came with its acquisition of cloud-based health care intelligence company Explorys and 45 million records owned by population health analyser Phytel. IBM Watson Health has also bought medical imaging expert Merge Healthcare. In total, IBM Watson Health now has 310 million records of ‘patient lives’ which, IBM claims, gives it a health cloud housing ‘one of the world’s largest and most diverse collections of health-related data’.

In September BCN reported how two new cloud services, IBM Watson Health Cloud for Life Sciences Compliance and IBM Watson Care Manager, had been created to unblock the big data bottlenecks in clinical research. The first service helps biomedical companies bring their inventions to market more efficiently, while the Care Manager system gives medical professionals a wider perspective on the factors they need to consider for personalised patient engagement programmes.

According to IBM, it has now invested over $4 billion in buying health data and systems and will have 5,000 staff in its Watson Health division, including clinicians, epidemiologists, statisticians, healthcare administrators, policy experts and consultants.

Truven’s cloud-based technology, systems and health claims data, currently housed in offices and data centres across Michigan, Denver, Chicago, Carolina and India, are to be integrated with the Watson Health Cloud.

IBM has invited partners to build text, speech and image recognition capacity into their software and systems and 100 ecosystem partners have launched their own Watson-based apps. IBM opened a San Francisco office for its Watson developer cloud in September 2015 and is also building a new Watson data centre there, which is due to open in early 2016.

IBM and Microsoft race to develop Blockchain-as-a-Service

IBM has made 44,000 lines of code available to the Linux Foundation’s open source Hyperledger Project in a bid to speed the development of a Blockchain ledger for secure distributed online financial transactions.

IBM is now competing with a number of vendors, such as Microsoft Azure and Digital Asset, to bring Blockchain services to market, either as a Bitcoin cryptocurrency enabler or for wider applications in financial services trading and even the IoT.

In a statement IBM said it wants to help create a new class of distributed ledger applications by letting developers use IBM’s new blockchain services available on Bluemix, where they can get DevOps tools to create and run blockchain apps on the IBM Cloud or z Systems mainframes. New application programming interfaces mean Blockchain apps will now be able to access existing transactions on these systems to support new payment, settlement, supply chain and business processes.

IBM also unveiled plans to put Blockchain technology to use on the Internet of Things (IoT) through its Watson IoT Platform. Information from RFID-based locations, barcode scans or device-reported data could be managed through IBM’s version of Blockchain, with devices communicating with the ledger to update or validate smart contracts. Under the scheme, the movement of an IoT-connected package through multiple distribution points could be managed and updated on a Blockchain system to give a more accurate and timely record of events in the supply chain.
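To make the ledger idea concrete, the sketch below shows a minimal hash-chained log of package-scan events in Python. It is a generic illustration of why a blockchain-style ledger makes retroactive tampering detectable, not IBM’s Watson IoT Platform API or the Hyperledger code; all names and events are hypothetical.

```python
import hashlib
import json
import time

# Minimal hash-chained ledger of supply-chain events. Generic illustration
# only: not IBM's Watson IoT Platform or the Hyperledger codebase.


def _hash_block(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


class PackageLedger:
    def __init__(self):
        self.chain = [{"index": 0, "event": "genesis",
                       "prev_hash": "0" * 64, "timestamp": 0.0}]

    def record_event(self, package_id: str, location: str) -> dict:
        """Append a scan event, linking it to the previous block's hash."""
        block = {
            "index": len(self.chain),
            "event": {"package": package_id, "location": location},
            "prev_hash": _hash_block(self.chain[-1]),
            "timestamp": time.time(),
        }
        self.chain.append(block)
        return block

    def verify(self) -> bool:
        """Any retroactive edit breaks the hash chain and is detected here."""
        return all(self.chain[i]["prev_hash"] == _hash_block(self.chain[i - 1])
                   for i in range(1, len(self.chain)))


if __name__ == "__main__":
    ledger = PackageLedger()
    ledger.record_event("PKG-42", "warehouse-london")
    ledger.record_event("PKG-42", "hub-frankfurt")
    print("ledger intact:", ledger.verify())          # True
    ledger.chain[1]["event"]["location"] = "tampered"
    print("after tampering:", ledger.verify())        # False
```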

The vendor intends to foster greater levels of Blockchain app design activity through its new IBM Garages that will open in London, New York, Singapore and Tokyo.

In Tokyo IBM and the Japan Exchange Group have agreed to test the potential of blockchain technology for trading in low-transaction markets. As the Linux Foundation’s Hyperledger Project evolves, the joint IBM and JPX evaluation work will adapt to use the code produced by that effort.

Meanwhile, Microsoft is to launch its own Blockchain-as-a-service (BaaS) offering within its Azure service portfolio, with a certified version of the online ledger scheduled to launch in April.

In January 2016, Microsoft announced that it was developing Blockchain-related services in its Azure DevTest Labs. In November BCN reported that Microsoft had launched a cloud-based ledger system for would-be bitcoin traders.

Microsoft is also inviting potential service provider partners to pioneer the use of Blockchain technology in the cloud.

MapR gets converged data platform patented

California-based open source big data specialist MapR Technologies has been granted patent protection for its technique for converging open source software, enterprise storage, NoSQL and event streams.

The United States Patent and Trademark Office recognised the differentiation of the Hadoop specialist’s work within the free, Java-based Hadoop programming framework. Though the platform is derived from technology created by the open source oriented Apache Software Foundation, the patent office judged that MapR’s performance, data protection, disaster recovery and multi-tenancy features merit a recognisable level of differentiation.

The key components of the patent claims include a design based on containers, self-contained autonomous units with their own operating system and app software. Containers can ring fence data against loss, optimise replication techniques and create a system that can cater for multiple node failures in a cluster.

Other vital components of the system are transactional read-write-update semantics with cluster-wide consistency, recovery techniques and update techniques. The recovery features can reconcile the divergence of replicated data after node failure, even while transactional updates are continuously being added. The update techniques allow for extreme variations of performance and scale while supporting familiar application programming interfaces (APIs).

MapR claims its Converged Data Platform allows clients to innovate with open source, provides a foundation for analytics (by converging all the data), creates enterprise grade reliability in one open source platform and makes instant, continuous data processing possible.

It’s the differentiation of the core combined with standard APIs that makes it stand out from other Apache projects, MapR claims. Meanwhile, the system’s ability to use a single cluster that can handle converged workloads makes it easier to manage and secure, it claims.

“The patent details how our platform gives us an advantage in the big data market. Some of the most demanding enterprises in the world are solving their business challenges using MapR,” said Anil Gadre, MapR Technologies’ senior VP of product management.

Paradigm4 puts oncology in the cloud with Onco-SciDB

Boston-based cloud database specialist Paradigm4 has launched a new system designed to speed up the process of cancer research among biopharmaceutical companies.

The new Onco-SciDB (oncology scientific database) features a graphical user interface designed for exploring data from The Cancer Genome Atlas (TCGA) and other relevant public data sources.

The Onco application runs on top of Paradigm4’s SciDB database management system, devised for analysing multi-dimensional data in the cloud. The management system was built by database pioneer Michael Stonebraker to use the cloud for massively parallel processing and to offer an elastic supply of computing resources.
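To illustrate the kind of multi-dimensional access an array database is built for, the short NumPy sketch below slices a hypothetical genes-by-samples expression matrix along both dimensions. It is a conceptual stand-in for that style of workload, not SciDB’s own query language or API, and all data and labels are invented.

```python
import numpy as np

# Conceptual stand-in for an array-database workload: a genes x samples
# expression matrix, sliced and aggregated along both dimensions.
# Not SciDB's API; the data and labels are invented for illustration.

rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(1000)]
samples = [f"tumour_{j}" for j in range(200)]
expression = rng.normal(size=(len(genes), len(samples)))  # 2-D array

# "Which samples over-express gene_42?" -- a slice along the gene axis.
over_expressed = [samples[j] for j in np.where(expression[42] > 2.0)[0]]

# "Mean expression of the first 100 genes per sample" -- an aggregate
# along one dimension, the kind of operation an array DBMS parallelises
# across many nodes.
per_sample_mean = expression[:100].mean(axis=0)

print(len(over_expressed), per_sample_mean.shape)
```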

A cloud-based database system gives research departments cost control and the capacity to ramp up production when needed, according to Paradigm4 CEO Marilyn Matz. “The result is that research teams spend less time curating and accessing data and more time on interactive exploration,” she said.

Currently, the bioinformatics industry lacks the requisite analytical tools and user interfaces to deal with the growing mass of molecular, image, functional, and clinical data, according to Matz. By simplifying the day-to-day challenge of working with multiple lines of evidence, Paradigm4 claims that SciDB supports clinical guidance for programmes like precision anti-cancer chemotherapy drug treatment. By making massively parallel processing possible in the cloud, it claims, it can provide sufficient affordable computing power for budget-constrained research institutes to trawl through petabytes of information and create hypotheses over the various sources of molecular, clinical and image data.

Database management system SciDB serves as the foundation for the 1000 Genomes Project and is used by bio-tech companies such as Novartis, Complete Genomics, Agios and Lincoln Labs. A custom version of Onco-SciDB has been beta tested at cancer research institute Foundation Medicine.

Industry veteran Stonebraker, the original creator of the Ingres and Postgres systems that formed the basis of products including IBM’s Informix and EMC’s Greenplum, won the Association for Computing Machinery’s Turing Award, together with its $1 million prize funded by Google, for his pioneering work on database design.

Microsoft acquires Metanautix with Quest for intelligent cloud

Microsoft has bought Californian start-up Metanautix for an undisclosed fee in a bid to improve the flow of analytics data as part of its ‘intelligent cloud’ strategy.

The Palo Alto vendor was launched by Theo Vassilakis and Toli Lerios in 2014 with $7 million in funding. The Google and Facebook veterans had impressed venture capitalists with their plans for deeper analysis of disparate data. The strategy was to integrate the data supply chains of enterprises by building a data computing engine, Quest, that provides scalable SQL access to any data.

Modern corporations aspire to data-driven strategies but have far too much information to deal with, according to Metanautix. With so many sources of data, only a fraction can be analysed, often because too many information silos are impervious to query tools.

Metanautix uses SQL, the most popular query language, to interrogate sources as diverse as data warehouses, open source databases, business systems and in-house, on-premises systems. The upshot is that all data is equally accessible, whether it’s from Salesforce or SQL Server, Teradata or MongoDB.
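As a conceptual sketch of that “one SQL query across silos” idea, the Python snippet below loads two toy tables, standing in for data pulled from different systems (say, a CRM export and a usage log), into a single engine and joins them with plain SQL. This is not the Metanautix Quest engine or its connector syntax; all names and data are invented.

```python
import sqlite3

# Two toy tables stand in for data drawn from separate silos; once both
# are reachable from one SQL engine, a single join answers a question
# that would otherwise need two tools. Not Metanautix Quest; everything
# here is invented for illustration.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_accounts (account_id INTEGER, name TEXT);
    INSERT INTO crm_accounts VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');

    CREATE TABLE usage_events (account_id INTEGER, bytes_processed INTEGER);
    INSERT INTO usage_events VALUES (1, 120), (1, 340), (2, 90), (3, 55);
""")

query = """
    SELECT a.name, SUM(u.bytes_processed) AS total_bytes
    FROM crm_accounts a
    JOIN usage_events u ON u.account_id = a.account_id
    GROUP BY a.name
"""
for name, total in conn.execute(query):
    print(name, total)
```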

“As someone who has led complex, large scale data warehousing projects myself, I am excited about building the intelligent cloud and helping to realize the full value of data,” said Joseph Sirosh, corporate VP of Microsoft’s Data Group, announcing the take-over on the company website.

Metanautix’s technology, which promises to connect to all data regardless of type, size or location, will no longer be available as a branded product or service. Microsoft will initially integrate it into its SQL Server and Cortana Analytics systems, with details of integration with the rest of Microsoft’s service portfolio to be announced in the coming months, Sirosh said.

The blog posting from Metanautix CEO Theo Vassilakis hinted at further developments. “We look forward to being part of Microsoft’s important efforts with Azure and SQL Server to give enterprise customers a unified view of all of their data across cloud and on-premises systems,” he said.

Google upgrades Cloud SQL, promises managed MySQL offerings

Google has announced the beta availability of a new improved Cloud SQL for Google Cloud Platform – and an alpha version of its much anticipated Content Delivery Network offering.

In a blog post Brett Hesterberg, Product Manager for Google’s Cloud Platform, said the second generation of Cloud SQL will aim to give better performance and more ‘scalability per dollar’.

In Google’s internal testing, the second generation Cloud SQL proved seven times faster than the first generation and it now scales to 10TB of data, 15,000 IOPS and 104GB of RAM per instance, Hesterberg said.

The upshot is that transactional databases now have a flexibility that was unachievable with traditional relational databases. “With Cloud SQL we’ve changed that,” Hesterberg said. “Flexibility means easily scaling a database up and down.”

Databases can now ramp up and down in size and in the number of queries handled per day. The allocation of resources such as CPU cores and RAM can be adapted more precisely with Cloud SQL, using a variety of tools such as MySQL Workbench, Toad and the MySQL command line. Another promised improvement is that any client can be used for access, including Compute Engine, Managed VMs, Container Engine and workstations.

In the new cloud environment, databases need to be easy to stop and restart if they are only used occasionally for brief or infrequent tasks, according to Hesterberg. Cloud SQL now caters for these increasingly common uses of database technology through the Cloud Console, the command line within Google’s gCloud SDK or a RESTful API. This makes administration a scriptable job and minimises costs by only running the databases when necessary.
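As a rough sketch of what scripting that stop/start cycle might look like, the snippet below calls the Cloud SQL Admin REST API from Python. It assumes the v1beta4 instances.patch method, its settings.activationPolicy field (‘ALWAYS’ to run, ‘NEVER’ to stop) and application-default credentials; the project and instance names are hypothetical, so check the current API reference before relying on any of it.

```python
# Rough sketch only: assumes the Cloud SQL Admin API (v1beta4), its
# instances.patch method and the settings.activationPolicy field, plus
# application-default credentials. Project and instance names are
# hypothetical; verify against the current API documentation.
from googleapiclient import discovery

PROJECT = "my-project"       # hypothetical project ID
INSTANCE = "reporting-db"    # hypothetical Cloud SQL instance name


def set_activation(policy: str) -> dict:
    service = discovery.build("sqladmin", "v1beta4")
    body = {"settings": {"activationPolicy": policy}}
    return service.instances().patch(
        project=PROJECT, instance=INSTANCE, body=body).execute()


if __name__ == "__main__":
    set_activation("ALWAYS")   # bring the instance up for an occasional job
    # ... run the infrequent reporting task here ...
    set_activation("NEVER")    # stop it again to avoid paying for idle time
```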

Cloud SQL will create more manageable MySQL databases, claims Hesterberg, since Google will apply patches and updates to MySQL, manage backups, configure replication and provide automatic failover for High Availability (HA) in the event of a zone outage. “It means you get Google’s operational expertise for your MySQL database,” said Hesterberg. Subscribers signed up to Google Cloud Platform can now get a $300 credit to test drive Cloud SQL, Google announced.

Meanwhile, in another blog post, Google announced an alpha release of its own content delivery network, Google Cloud CDN. The alpha may behave inconsistently and is not recommended for production use, Google warned.

Google Cloud CDN will speed up its cloud services using distributed edge caches to bring content closer to users, in a bid to compensate for Google’s relatively sparse global data centre coverage compared with rivals AWS and Azure.

MapR claims world’s first converged data platform with Streams

Apache Hadoop system specialist MapR Technologies claims it has invented a new system to make sense of all the disjointed streams of real-time information flooding into big data platforms. The new MapR Streams system will, it says, blend everything from system logs to sensor feeds to social media streams, whether transactional or tracking data, and manage it all under one converged platform.

Streams is described as a stream processing tool that offers real-time event handling and high scalability. When combined with other MapR offerings, it can harmonise existing storage data and NoSQL tools to create a converged data platform. This, it says, is the first of its kind in the cloud industry.

Starting from early 2016, when the technology becomes available, cloud operators can combine Streams with MapR-FS for storage and the MapR-DB in-Hadoop NoSQL database, to build a MapR Converged Data Platform. This will liberate users from having to monitor information from streams, file storage, databases and analytics, the vendor says.

Since it can handle billions of messages per second and join clusters from separate data centres across the globe, the tool could be of particular interest to cloud operators, according to Michael Brown, CTO at comScore. “Our system analyses over 65 billion new events a day, and MapR Streams is built to ingest and process these events in real time, opening the doors to a new level of product offerings for our customers,” he said.

While traditional workloads are being optimised, new workloads from emerging IoT dataflows present far greater challenges that need to be solved in a fraction of the time, claims MapR. MapR Streams will help companies deal with the volume, variety and speed at which data has to be analysed while simplifying the multiple layers of hardware stacks, networking and data processing systems, according to the vendor. Blending MapR Streams into a converged data system eliminates multiple silos of data for streaming, analytics and traditional systems of record, MapR claimed.

MapR Streams supports standard application programming interfaces (APIs) and integrates with other popular stream processors like Spark Streaming, Storm, Flink and Apex. When available, the MapR Converged Data Platform will be offered as a free to use Community Edition to encourage developers to experiment.
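As a rough illustration of the publish/subscribe pattern those standard streaming APIs imply, here is a minimal producer and consumer written against the open source kafka-python client. It is a generic Kafka-style example rather than MapR’s own client library, and the broker address, topic name and payload are invented.

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Generic Kafka-style publish/subscribe sketch, illustrating the kind of
# standard streaming API MapR says Streams supports. This uses the open
# source kafka-python client, not a MapR library; the broker, topic and
# payload are invented for illustration.

BROKER = "localhost:9092"      # hypothetical broker address
TOPIC = "sensor-readings"      # hypothetical topic


def publish_reading(sensor_id: str, value: float) -> None:
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"sensor": sensor_id, "value": value})
    producer.flush()


def consume_readings() -> None:
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)   # e.g. {'sensor': 'turbine-7', 'value': 81.3}


if __name__ == "__main__":
    publish_reading("turbine-7", 81.3)
```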

Microsoft launches cloud-based Blockchain tech for would-be bitcoin traders

Microsoft has launched a cloud-based ledger system for would-be bitcoin traders. The Azure service provider aims to ease traditional bankers and financial service companies into a new, increasingly legitimised market as it gains currency in the world’s finance centres.

The system uses blockchain technology provided by New York based financial technology start-up ConsenSys. It will give financial institutions the means to create affordable testing and proof-of-concept models as they examine the feasibility of bitcoin trading.

Blockchain technology’s ability to secure and validate any exchange of data will help convince compliance-constrained finance institutions that this form of trading is no more dangerous than any other high speed automated trading environment, according to Microsoft. In a bitcoin system the ConsenSys blockchain will be used as a large decentralised ledger which keeps track of every bitcoin transaction.

Cloud technology could aggregate sufficient processing power to cater for all fluctuations in demand for capacity in online trading. In turn this means that the IT service provider, whether internal or external, can guarantee that every transaction is verified and shared across a global computer network. The omnipresence of the blockchain reporting system makes it impossible for outside interference to go unmonitored.

The Microsoft blockchain service, launched on November 10th, also uses Ethereum’s programmable blockchain technology which will be delivered to existing banks and insurance clients already using Microsoft’s Azure cloud service. According to Microsoft four global financial institutions have already subscribed to the service.

Until now blockchain has been the ‘major pain point’ in bitcoin trading, according to Marley Gray, Microsoft’s director of tech strategy for financial services. Gray told Reuters that cloud delivery had made the technology affordable and easy enough to adopt. According to Gray it now takes only 20 minutes and no previous experience to spin up a private blockchain. Microsoft said it has simplified the system with templates it created, used in combination with its cloud-based flexible computing model.

The new testing systems made possible by ConsenSys create a ‘fail fast, fail cheap’ model that allows finance companies to explore the full range of possibilities of this new type of trading, said Gray.

Veritas warns of ‘databerg’ hidden dangers

Backup specialist Veritas Technologies claims European businesses waste billions of euros on huge stores of useless information which are growing every year. By 2020, it claims, the damage caused by this excessive data will cost over half a trillion pounds (£576bn) a year.

According to the Veritas Databerg Report 2015, 59% of data stored and processed by UK organisations is invisible and could contain hidden dangers. From this it has estimated that the average mid-sized UK organisation holding 1,000 terabytes of information spends £435k annually on Redundant, Obsolete or Trivial (ROT) data. According to its estimate, just 12% of the cost of data storage is justifiably spent on business-critical intelligence.
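To show what those figures imply in practice, the back-of-the-envelope Python sketch below scales the report’s numbers to a hypothetical organisation. The 59%, 12% and £435k-per-1,000-terabyte figures come from the report as cited above; the example estate size and total storage budget are invented, and the linear scaling is an assumption.

```python
# Back-of-the-envelope reading of the report's UK figures. The 59%, 12%
# and GBP 435k-per-1,000 TB numbers come from the article; the example
# organisation's size and total storage budget are invented, and linear
# scaling with estate size is assumed.

INVISIBLE_SHARE = 0.59          # share of data whose value or risk is unknown
BUSINESS_CRITICAL_SPEND = 0.12  # share of storage spend that is justified
ROT_COST_PER_1000_TB = 435_000  # GBP per year, per the report


def rot_cost(terabytes_held: float) -> float:
    """Scale the report's ROT figure linearly to an estate of a given size."""
    return terabytes_held / 1000.0 * ROT_COST_PER_1000_TB


if __name__ == "__main__":
    estate_tb = 2500  # hypothetical organisation holding 2,500 TB
    print(f"Estimated annual ROT spend: £{rot_cost(estate_tb):,.0f}")
    print(f"Data of unknown value: {INVISIBLE_SHARE * estate_tb:,.0f} TB")

    # If only 12% of a (hypothetical) £1.2m storage budget is justified,
    # the remaining 88% goes on data of unknown or negligible value.
    budget = 1_200_000
    print(f"Unjustified spend: £{budget * (1 - BUSINESS_CRITICAL_SPEND):,.0f}")
```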

The report blames employees and management for the waste. The first group treats corporate IT systems as their own personal infrastructure, while management are too reliant on cloud storage, which leaves them open to compliance violations and a higher risk of data loss.

The survey identified three major causes for Databerg growth, which stem from volume, vendor hype and the values of modern users. These root causes create problems in which IT strategies are based on data volumes not business value. Vendor hype, in turn, has convinced users to become increasingly reliant on free storage in the cloud and this consumerisation has led to a growing disregard for corporate data policies, according to the report’s authors.

As a result, big data and cloud computing could lead corporations to hit the databerg and incur massive losses. They could also sink under a prosecution for compliance failings, according to the key findings of the Databerg Report 2015.

It’s time to stop the waste, said Matthew Ellard, Senior VP for EMEA at Veritas. “Companies invest a significant amount of resources to maintain data that is totally redundant, obsolete and trivial.” This ‘ROT’ costs a typical midsize UK company, which can expect to hold 500 terabytes of data, nearly a million pounds a year on photos, personal ID documents, music and videos.

The study was based on a survey answered by 1,475 respondents in 14 countries, including 200 in the UK.