
It’s time to take your big data strategy beyond the data lake


A well-established – and unsettling – metric we hear anecdotally from organisations is that analysts spend 70% or more of their time hunting and gathering data, and less than 30% of their time deriving insights.

The promise of big data is to reverse this 70/30 ratio. Organisations striving to build a data lake and use data for competitive advantage are seduced by the opportunity to discover new insights to better target customers, improve operations, lower costs, and make new scientific breakthroughs.

“Big data” is no longer on the horizon; it’s here. According to Gartner, 40% of organisations have already deployed big data. But a closer look reveals that only 13% have put their big data solution into a production environment.

Why are organisations struggling to implement big data?  We’ve spoken with customers across industries from around the world, and the barriers to deploying big data in a production environment tend to fall into three categories: findability of data; simplifying data access and making the data more consumable for users; and protecting data privacy and security.

Findability

Big data systems let organisations ingest any type of data, such as social media, clickstream, wearable devices, images, emails, documents, and data from relational databases. This ability to break down silos and gather a wide variety of data at speed is one of the key capabilities Hadoop enables.

Our research, however, suggests that organisations are struggling with how to convert raw, unstructured, and semi-structured information into linkable, consistently defined, analytics-ready digital assets. For example, how does a hospital link a particular gene variant with a patient population or a manufacturer find Tier 1 customers who are dissatisfied with their product? 

To solve this problem, a big data solution needs to be able to automatically decorate data with rich and extensible metadata including data definitions, data quality metrics, retention policies, digital rights, and provenance.  Moreover, a big data system needs to build powerful indexes to let users interactively explore vast and diverse data, generating results with sub-second performance.
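To make that concrete, here is a minimal Python sketch of what such metadata decoration might look like. The class, field names, and catalog structure are assumptions chosen for illustration, not any particular product's API.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical metadata record attached to each dataset at ingest time.
# The fields mirror the kinds of metadata mentioned above (definitions,
# quality, retention, rights, provenance); the names are illustrative only.
@dataclass
class DatasetMetadata:
    name: str
    definition: str                 # business definition of the data
    quality_score: float            # e.g. fraction of rows passing validation
    retention_until: date           # retention policy
    digital_rights: str             # usage / licensing rights
    provenance: list = field(default_factory=list)  # lineage of sources

catalog: dict = {}                  # toy stand-in for a searchable metadata index

def register_dataset(meta: DatasetMetadata) -> None:
    """Decorate an ingested dataset with metadata and index it by name."""
    catalog[meta.name] = meta

register_dataset(DatasetMetadata(
    name="patient_gene_variants",
    definition="Gene variants linked to a de-identified patient population",
    quality_score=0.97,
    retention_until=date(2030, 1, 1),
    digital_rights="internal-research-only",
    provenance=["ehr_extract", "sequencing_lab_feed"],
))
```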

Simplifying data access and consumability

The secondary use of data has traditionally been the domain of data scientists and analysts who produce business intelligence reports and analytics. But as organisations strive to become more agile and data-driven, analysts are increasingly being pushed to the limit.

Organisations we speak with are looking to empower knowledge workers with self-service access to information. Intuitive visualisation and analytics tools like Tableau and QlikView are an important part of the solution. But for self-service data to be truly consumable, knowledge workers need a catalog of curated datasets they can draw from with a simple point-and-click user interface, eliminating the requirement for advanced SQL and complex schema expertise.
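As a rough sketch of what that kind of consumability could look like behind the scenes, the snippet below filters a curated catalog by keyword so a knowledge worker never touches SQL. The dataset names and tags are invented for the example.

```python
# A toy curated catalog: each entry carries a friendly name and tags so a
# knowledge worker can find data by keyword rather than by writing SQL.
curated_catalog = [
    {"name": "tier1_customer_complaints", "tags": ["customers", "support", "tier 1"]},
    {"name": "patient_gene_variants",     "tags": ["genomics", "clinical"]},
    {"name": "clickstream_daily_summary", "tags": ["web", "marketing"]},
]

def search_catalog(keyword: str) -> list:
    """Return curated datasets whose name or tags contain the keyword."""
    keyword = keyword.lower()
    return [d for d in curated_catalog
            if keyword in d["name"].lower()
            or any(keyword in tag.lower() for tag in d["tags"])]

print(search_catalog("tier 1"))   # finds the dissatisfied Tier 1 customers dataset
```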

Protecting privacy and security

Ironically, one of the most common concerns we come across in discussions with large organisations – and perhaps the biggest barrier to deploying big data in a production environment – is this: “If I can easily consolidate my organisation’s data for secondary use, what prevents anyone from seeing everything?”

Big data systems are notoriously weak in managing information privacy and security. Following an extensive review of industry best practices, we believe that the globally ratified Privacy by Design framework presents a powerful seven-point model for scalable system design.

A big data system should be able to control who is allowed to see and do what with the data.  For example, a CEO may be allowed to view but not download top secret data; the Accounts Receivable department can view a fully identified dataset but download only de-identified data; and a researcher may be able to view a de-identified dataset and collaborate with an external partner who can only see a subset of the de-identified data.  More advanced systems can also enforce who can see and do what on mobile devices or from outside the corporate firewall.
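A minimal sketch of those rules, assuming a simple role-to-entitlement mapping: the roles and data views below follow the example above, but the structure itself is an assumption for illustration.

```python
from typing import Optional

# Each role maps an action ("view" / "download") to the data view it is
# entitled to. Missing entries mean the action is denied.
POLICIES = {
    "ceo":                 {"view": "identified"},                               # view only, no download
    "accounts_receivable": {"view": "identified", "download": "de-identified"},
    "researcher":          {"view": "de-identified"},
    "external_partner":    {"view": "de-identified-subset"},
}

def authorise(role: str, action: str) -> Optional[str]:
    """Return the data view a role is entitled to for an action, or None if denied."""
    return POLICIES.get(role, {}).get(action)

assert authorise("ceo", "download") is None                  # may view but not download
assert authorise("researcher", "view") == "de-identified"
```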

A scalable and secure system should be able to de-identify information on the fly, and control who is allowed to see and do what, depending on user authorisations. A robust security model that reduces the risk of a data breach should be able to enforce policy rules universally and consistently.
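And a hedged sketch of on-the-fly de-identification: before a record reaches a user whose authorisation only covers de-identified data, sensitive fields are replaced with one-way pseudonyms. The field names and the hashing choice are illustrative assumptions, not a complete de-identification scheme.

```python
import hashlib

SENSITIVE_FIELDS = {"name", "email", "national_id"}   # illustrative field list

def de_identify(record: dict) -> dict:
    """Mask sensitive fields with a short one-way pseudonym so records remain linkable."""
    out = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

def fetch(record: dict, authorised_view: str) -> dict:
    """Apply the caller's authorised view before returning the record."""
    return record if authorised_view == "identified" else de_identify(record)

patient = {"name": "Jane Doe", "email": "jane@example.org", "gene_variant": "BRCA1"}
print(fetch(patient, "de-identified"))   # name and email are pseudonymised
```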

Start at the data lake – but don’t stop there

So by all means, start your big data strategy and implement your data lake. Just don’t stop there. Look for a solution that has a flexible, comprehensive metadata infrastructure out of the box that lets you quickly find and link the right information; gives your end users self-service access to data without requiring them to become experts in SQL and complex database schemas; and universally and consistently enforces fine-grained privacy and security.

Does this sound hard? It doesn’t have to be. You can certainly grow your internal development team to build your data lake from a commercial Hadoop distribution, but there are also players who understand the problem and are building a supported best-of-breed solution. Whichever way you go, it’s time to become more data-driven. So go ahead and enjoy the lake. The water’s fine.