How to create a software engineering approach to big data analytics

Big data analysis can provide continuous business intelligence to managers, enabling sophisticated monitoring of current activities and helping them make smarter, faster decisions based on factual data and on hidden trends and patterns. But while managers may think it’s cool, rushing to implement a “big data analytics” program can get you into trouble. There are many architectures, tools, and algorithms to sort through, and you’ll need to manage stakeholder expectations. Software engineers deal with these kinds of problems regularly, so it’s helpful to take their approach in building data analytics solutions.

Management questions

First, it is important to manage customer expectations. Software engineers would seek to discover the true intent of the analytics program by asking the following questions: What is the real problem that is being solved? Is analytics really the answer? Is it a technology problem or a political problem in disguise? Assuming the problem can be solved by analytics, there may be constraints that need to be addressed. For example, is the required data scattered across many databases and locations in the organization? Are the legacy systems up to the task? Are there data governance issues related to ownership, and what are the privacy, security and trust issues? What risks is the manager willing to take to relax any constraints? Buy-in from all stakeholders in the organization is also needed; without it, you may face political difficulties from uncooperative colleagues.

Finally, you need to ask: what is the budget for this project? While many of the tools enabling data analytics are open source, commercial tools may be needed for some aspects. There could be significant costs for purchasing or leasing hardware, for hosting (depending on the architectural model), and for giving staff release time to set up the system and to handle training and support.

Software architecture

Software engineers seek generalizable solutions and compatibility across the enterprise and industry. In seeking an appropriate and reusable architecture for data analytics, the focus is on efficient, cost-effective utilization and sharing of resources. Architectural decisions include whether to query in batch mode or in real time, whether to host on site or use third-party provisioning, or whether to use some combination of these.

Hadoop is frequently a first choice as a data analytics platform, but there are many alternatives. While large organizations may have the hardware infrastructure to collect, process and analyze massive amounts of data, smaller organizations may not. Whatever architecture is chosen, problems may arise when the database lacks in situ analytics, when the analytics are too slow, or when the system can’t scale in terms of data load time. Some platforms may choke on the vast amounts of data that are frequently updated from live feeds from social media, websites, mobile applications and even sensors in cyber-physical systems.

Tools

Data analytics can involve vast amounts of data, possibly petabytes’ worth. This data will likely come from different platforms, data stores and sources, and data quality can vary greatly. Data can come from multiple internal and external sources including email text, sales records, Web server logs, internet clickstream data, mobile phone data and even sensor data from devices connected to the Internet of Things. This data will be variously structured, unstructured, or semi-structured, and much of it will be redundant and inconsistent. Special tools, then, will be needed to clean, compress, format and visualize this data before, during and after analysis.
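
As a small illustration of the cleaning step, the following Python sketch normalizes and deduplicates records merged from two sources. The field names ("email", "name"), the sample rows and the rule of keeping the first record per email address are all assumptions made for this example; a real pipeline would need rules agreed with the data owners.

```python
from typing import Iterable


def normalize(record: dict) -> dict:
    """Trim whitespace and lowercase the email so equivalent records compare equal."""
    return {
        "email": record.get("email", "").strip().lower(),
        "name": " ".join(record.get("name", "").split()),
    }


def deduplicate(records: Iterable[dict]) -> list:
    """Keep the first record seen for each normalized email; drop records without one."""
    seen = set()
    cleaned = []
    for raw in records:
        rec = normalize(raw)
        if not rec["email"] or rec["email"] in seen:
            continue  # skip unusable or redundant rows
        seen.add(rec["email"])
        cleaned.append(rec)
    return cleaned


# Hypothetical rows merged from a sales system and a web sign-up log.
merged = [
    {"email": "Ana@Example.com ", "name": "Ana  Silva"},
    {"email": "ana@example.com", "name": "Ana Silva"},
    {"email": "", "name": "Unknown"},
    {"email": "bo@example.com", "name": " Bo Chen"},
]
cleaned = deduplicate(merged)
```

Here the two spellings of the same address collapse into one record and the row with no address is dropped. At petabyte scale the same logic would run inside a distributed framework rather than in a single in-memory loop.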

There are hundreds of tools for various purposes to choose from; many are open source. Choosing the right tool is an important software engineering problem. Considerations include compatibility with the operating environment, the support provided (if any) and the programming language needed to interface with the tool. Building data analytics solutions will likely involve more than one programming language, typically C, Python, Java, SQL and others. Is your development team prepared for this challenge?
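
One way to make the tool comparison explicit is a weighted scoring matrix. The Python sketch below is only an illustration: the three criteria mirror the considerations listed above, but the weights, the 1 to 5 ratings and the tool names are hypothetical and would need to come from your own team.

```python
# Hypothetical weights for the three considerations; they sum to 1.0.
WEIGHTS = {"compatibility": 0.5, "support": 0.2, "language_fit": 0.3}


def score(ratings: dict) -> float:
    """Weighted sum of a tool's ratings (1 = poor, 5 = excellent)."""
    return sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)


# Hypothetical candidates and ratings, for illustration only.
candidates = {
    "tool_a": {"compatibility": 4, "support": 2, "language_fit": 5},
    "tool_b": {"compatibility": 5, "support": 4, "language_fit": 3},
}

# Highest-scoring tool first.
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

The value of the exercise is less in the final number than in forcing the team to agree on the criteria and weights before committing to a tool.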

These tools will also need to be configured for the data set and analytical problem, and seamless integration of these tools into a one-button solution for managers isn’t always easy. Finally, you need to consider how reconfigurable the solution will be for different kinds of related problems. A single-purpose analytics solution isn’t going to be cost-effective.

Analytics

Finding the right machine learning algorithm to apply to data sets in search of patterns and relationships, whether for situational analysis or for predictive analytics, is a significant challenge. There are numerous desktop data miners, deep learning libraries and cognitive toolkits to choose from, but deep learning using multilayer neural networks is computationally expensive. Failure to consider performance and throughput at full scale can lead to customer dissatisfaction.
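
A rough way to check full-scale performance early is to time the analysis on a sample and extrapolate. The Python sketch below assumes processing time grows linearly with the number of records and that work divides evenly across workers; both assumptions are optimistic for many algorithms, so treat the result as a lower bound. The function name and the figures in the example are hypothetical.

```python
def estimated_hours(total_records: int, sample_records: int,
                    sample_seconds: float, workers: int = 1) -> float:
    """Linearly extrapolate a timed sample run to the full data set, in hours."""
    if sample_records <= 0 or sample_seconds <= 0 or workers <= 0:
        raise ValueError("sample size, sample time and workers must be positive")
    throughput = sample_records / sample_seconds  # records per second, one worker
    return total_records / (throughput * workers) / 3600


# Hypothetical measurement: 1 million records processed in 50 seconds,
# extrapolated to a 10-billion-record data set.
single = estimated_hours(10_000_000_000, 1_000_000, 50.0)
parallel = estimated_hours(10_000_000_000, 1_000_000, 50.0, workers=10)
```

With these made-up figures a single worker would need roughly 139 hours and ten workers roughly 14, which is the kind of estimate worth sharing with stakeholders before the system is built.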

Fortunately, software engineers and related professionals are working together to solve these kinds of problems. For more information, see the IEEE Big Data Initiative and the NIST Big Data Public Working Group.