At the company’s annual Build conference this week Microsoft unveiled among other things an Azure Data Lake service, which the company is pitching as a hyperscale big data repository for all kinds of data.
The data lake concept is a fairly new one, the gist of it being that data of varying types and structures is created at such a high velocity and in such large volumes that it’s prompting a necessary evolution in the applications and platforms required to handle that data.
It’s really about being able to store all that data in a volume-optimised (and cost-efficient) way that maintains the integrity of that information when you go to shift it someplace else, whether that be an application / analytics or a data warehouse.
“While the potential of the data lake can be profound, it has yet to be fully realized. Limits to storage capacity, hardware acquisition, scalability, performance and cost are all potential reasons why customers haven’t been able to implement a data lake,” explained Microsoft’s product marketing manager, Hadoop, big data and data warehousing Oliver Chiu.
The company is pitching the Azure Data Lakes service as a means of running Hadoop and advanced analytics using Microsoft’s own Azure HDInsight, as well as Revolution-R Enterprise and other Hadoop distributions developed by Hortonworks and Cloudera.
It’s built to support “massively parallel queries” so information is discoverable in a timely fashion, and built to handly high volumes of small writes, which the company said makes the service ideal for Internet of Things applications.
“Microsoft has been on a journey for broad big data adoption with a suite of big data and advanced analytics solutions like Azure HDInsight, Azure Data Factory, Revolution R Enterprise and Azure Machine Learning. We are excited for what Azure Data Lake will bring to this ecosystem, and when our customers can run all of their analysis on exabytes of data,” Chiu explained.
Pivotal is also among a handful of vendors seriously bought into the concept of data lakes. However, although Chiu alluded to cost and performance issues associated with the data lakes approach, many enterprises aren’t yet at a stage where the variety, velocity and volume of data their systems ingest are prompting a conceptual change in how that data is being perceived, stored or curated; in a nutshell, many enterprises are still too siloed – not the least of which in how they treat data.