
Cutting the Gordian Knot of Operational Data: Do You Know How Well Your Pipelines Work?

McKinsey predicts that by 2025, data-driven enterprises will see data embedded in every decision, interaction and process across an organisation. That’s a nice idea – but as someone who is intimately familiar with how companies build and manage data pipelines today, it sure looks like wishful thinking to me. 

This prediction assumes that companies have reasonably good data pipelines that flow in a manageable way. In reality, the vast majority of these pipelines are tangled messes that rival the infamous Gordian Knot.

 

One size does not fit all

Traditionally, a data pipeline handles the connectivity to business applications, controls the requests and flow of data into new data environments and then manages the steps needed to cleanse, organise and present a refined data product to consumers, inside or outside your business walls. These results empower decision-makers to drive their business forward every day. 

We’ve become familiar with so-called ‘big data’ success stories: how companies like Netflix build pipelines that manage over a petabyte of data every single day, or how social media firms like Meta analyse over 300 petabytes (that’s 300 quadrillion bytes!) of clickstream data inside their analytics platforms. It’s easy to assume that we’ve already solved all the hard problems once we’ve reached this scale.

Unfortunately, it’s not that simple. Just ask anyone who works with pipelines for operational data – they will be the first to tell you that one size definitely does not fit all.

When it comes to operational data – i.e., the data that underpins the core parts of a business that create value, such as financials, supply chain, human resources, etc. – organisations routinely fail to deliver value from analytics pipelines that were designed in a way that resembles ‘big data’ environments. Why? Because they are trying to solve a fundamentally different data challenge with essentially the same approach, and it doesn’t work (at least not well enough to enable McKinsey’s image of ‘data-driven enterprises’ in 2025).

Let’s be clear: the issue here is not the size of the data, but how complex it is. 

Leading social or digital streaming platforms often store large data sets as a series of simple, ordered events. One row of data gets captured in a data pipeline for a user watching a TV show, another records each ‘Like’ button that gets clicked on a social media profile. All this data gets processed through data pipelines at tremendous speed and scale using cloud technology. The data sets themselves are large, and that’s okay because the underlying data is extremely well-ordered and managed to begin with. The highly organised structure of clickstream data means that billions upon billions of records can be analysed in no time.

Operational systems, on the other hand, present a very different data landscape. Chief among them are the ERP (enterprise resource planning) platforms that most organisations use to run their essential day-to-day processes.

Since their introduction in the 1970s, ERP systems have evolved to optimise every ounce of performance for capturing raw transactions from the business environment. Every sales order, every financial ledger entry, every item of supply chain inventory has to be captured and processed as fast as possible. To achieve this performance, the design for ERP systems evolved into a system with tens of thousands of individual database tables that track business data elements and even more relationships between those objects. This data architecture is effective at ensuring a customer or supplier’s records are consistent over time. 

But, as it turns out, what’s great for transaction speed within that business process typically isn’t so great for analytics performance. Instead of the clean, straightforward, well-organised tables that modern online applications create, we have a spaghetti-like mess of data, spread across a complex, real-time, mission-critical application. For instance, analysing a single financial transaction posted to a company’s books might require data from upwards of fifty distinct tables in the backend ERP database, often with multiple lookups and calculations. This data is like the Gordian Knot, which was so twisted and entangled that no one could possibly untie it.
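To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names are hypothetical, and a real ERP schema would involve far more tables than the four shown; the point is only that a flat event table is answered with one scan, while even a single ledger posting needs several joins before it is readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Clickstream style: one flat, append-only table of events.
    CREATE TABLE view_events (user_id INTEGER, show_title TEXT, watched_at TEXT);
    INSERT INTO view_events VALUES
        (1, 'Show A', '2023-01-01'),
        (2, 'Show A', '2023-01-01'),
        (1, 'Show B', '2023-01-02');

    -- ERP style (hypothetical, heavily simplified): one ledger posting
    -- is spread across several normalised tables.
    CREATE TABLE ledger_entry (entry_id INTEGER, account_id INTEGER,
                               cost_centre_id INTEGER, currency_id INTEGER,
                               amount REAL);
    CREATE TABLE account (account_id INTEGER, name TEXT);
    CREATE TABLE cost_centre (cost_centre_id INTEGER, name TEXT);
    CREATE TABLE currency (currency_id INTEGER, code TEXT, rate_to_usd REAL);
    INSERT INTO ledger_entry VALUES (100, 1, 10, 2, 200.0);
    INSERT INTO account VALUES (1, 'Travel');
    INSERT INTO cost_centre VALUES (10, 'Sales EMEA');
    INSERT INTO currency VALUES (2, 'EUR', 1.25);
""")

# Flat events: a single scan with a GROUP BY answers the question.
views_per_show = conn.execute(
    "SELECT show_title, COUNT(*) FROM view_events "
    "GROUP BY show_title ORDER BY show_title"
).fetchall()

# Operational data: every attribute of the posting lives in another table,
# so each one costs a join (and here a currency calculation as well).
ledger_rows = conn.execute("""
    SELECT a.name, c.name, cur.code, e.amount * cur.rate_to_usd
    FROM ledger_entry AS e
    JOIN account AS a ON a.account_id = e.account_id
    JOIN cost_centre AS c ON c.cost_centre_id = e.cost_centre_id
    JOIN currency AS cur ON cur.currency_id = e.currency_id
""").fetchall()

print(views_per_show)
print(ledger_rows)
```

With three lookups the join is still trivial. The difficulty described above comes from repeating this pattern across dozens of tables for every question an analyst asks.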

 

The right tools for the right jobs

To answer questions that span hundreds of tables and relationships, business analysts must write increasingly complex queries that often take hours to return results. Too often, the answers arrive too late, leaving the business flying blind at a critical moment in its decision-making.

To solve this, organisations attempt to further engineer the design of their data pipelines, routing data into increasingly simplified business views that reduce the complexity of queries and make them easier to run.

This might work in theory, but it comes at the cost of oversimplifying the data itself. Rather than enabling analysts to ask and answer any question with data, this approach frequently summarises or reshapes the data to boost performance. In effect, it means that analysts can get fast answers to predefined questions and long wait times for everything else.
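The trade-off can be shown with a small sketch. The order lines, field names and the `build_summary_view` helper below are all hypothetical; they stand in for whatever simplified business view a pipeline might produce.

```python
from collections import defaultdict

# Hypothetical raw order lines as they might leave an operational system.
order_lines = [
    {"region": "EMEA", "product": "Widget", "month": "2023-01", "revenue": 100.0},
    {"region": "EMEA", "product": "Gadget", "month": "2023-01", "revenue": 50.0},
    {"region": "APAC", "product": "Widget", "month": "2023-01", "revenue": 70.0},
]


def build_summary_view(lines):
    """Pre-aggregate revenue by (region, month); product detail is discarded."""
    summary = defaultdict(float)
    for line in lines:
        summary[(line["region"], line["month"])] += line["revenue"]
    return dict(summary)


summary_view = build_summary_view(order_lines)

# Predefined question: answered instantly from the summary.
emea_january = summary_view[("EMEA", "2023-01")]

# New question: revenue by product. The summary keys carry no product
# information, so the only option is to go back to the raw lines.
revenue_by_product = defaultdict(float)
for line in order_lines:
    revenue_by_product[line["product"]] += line["revenue"]
revenue_by_product = dict(revenue_by_product)

print(emea_january)
print(revenue_by_product)
```

The summary is fast precisely because it threw detail away, which is why any question outside its original design ends up back at the source data.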

With inflexible data pipelines, asking new questions means going back to the source system, which is time-consuming and quickly becomes expensive. And if anything changes within the ERP application, the pipeline breaks completely.

 

Deliberately design for the data you have

Rather than applying a static, ‘one size fits all’ pipeline model that can’t respond effectively to data that is more interconnected, we have to design with this level of connection in mind from the start. 

Rather than making pipelines ever smaller to break up the problem, the design should encompass those connections instead. In practice, this means starting out by addressing the fundamental reason behind the pipeline itself: making data accessible to users without the time and cost overheads associated with expensive analytical queries. 

Every connected table in a complex analysis puts additional pressure on both the underlying platform and those tasked with maintaining business performance through tuning and optimising these queries. To reimagine the approach you have to look at how everything is optimised when the data is loaded – but, importantly, before any queries are made. This is generally referred to as ‘query acceleration’ and it provides a useful shortcut.

This ‘query acceleration’ approach delivers many multiples of the performance of traditional data analysis. It achieves this without needing the data to be reshaped or modelled in advance. By scanning the entire data set and preparing that data before a query is run, there are fewer limitations on how questions can be answered. This also improves the usefulness of the query by delivering the full scope of the raw business data that is available for exploration.
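The article does not say how any particular product implements query acceleration, so the following is only a toy illustration of the general idea: do the relationship work once, at load time, while keeping every raw row. The `PreparedStore` class, its methods and the sample records are hypothetical.

```python
from collections import defaultdict


class PreparedStore:
    """Toy store that indexes relationships at load time and keeps raw rows."""

    def __init__(self, customers, orders):
        # Load-time work: build lookups once, before any query is asked.
        self.customers = {c["customer_id"]: c for c in customers}
        self.orders = list(orders)  # raw rows are kept, nothing is summarised
        self.orders_by_customer = defaultdict(list)
        for order in self.orders:
            self.orders_by_customer[order["customer_id"]].append(order)

    def revenue_for_country(self, country):
        """Follow the prepared customer-to-order links instead of re-joining."""
        return sum(
            order["amount"]
            for customer_id, customer in self.customers.items()
            if customer["country"] == country
            for order in self.orders_by_customer.get(customer_id, [])
        )

    def orders_above(self, threshold):
        """An ad hoc question still works, because row-level detail survives."""
        return [o["order_id"] for o in self.orders if o["amount"] > threshold]


store = PreparedStore(
    customers=[
        {"customer_id": 1, "country": "UK"},
        {"customer_id": 2, "country": "DE"},
    ],
    orders=[
        {"order_id": "A1", "customer_id": 1, "amount": 120},
        {"order_id": "A2", "customer_id": 1, "amount": 30},
        {"order_id": "B1", "customer_id": 2, "amount": 200},
    ],
)

print(store.revenue_for_country("UK"))
print(store.orders_above(100))
```

Unlike a pre-aggregated view, nothing here is discarded: the indexes only shorten the path to the raw rows, so questions nobody anticipated can still be answered.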

This shift in focus helps cut the proverbial Gordian Knot around operational data and analytics by eliminating traditional data pipeline bottlenecks, such as the need to move, reshape or remodel data from the ERP source in order to reach the analytics tools. 

As an example, the world’s largest coffee retailer wanted to monitor 20,000 product SKUs in incredible detail. With previous analytics, the retailer saw a picture of each of their global stores but only on a day-to-day basis. Pre-loading the data and understanding the relationships involved meant that the team could analyse data more efficiently and make the results available to each store manager in 32,000 locations.

 

Planning ahead around your real-world data problems

By questioning the fundamental assumptions in how we acquire, process and analyse our operational data, it’s possible to simplify and streamline the steps needed to move from high-cost, fragile data pipelines to faster business decisions. 

Without this mindset, we won’t be able to embed all the data that we need into those business processes. Decisions would then be made on incomplete data sets, which would defeat the entire purpose of being data-driven. 

From the simple and highly structured data that modern internet platforms provide, through to the complex data that big business-critical applications create, it’s important to understand that there are multiple types of data that businesses need to engage with. The tools used to analyse one type of data are not necessarily the right ones for the others. That’s why thinking through the goals, objectives and overall approach is necessary in order to succeed. 

Just as with the Gordian Knot, we have to take different approaches to cut through different data problems, rather than simply implementing ever longer pipelines. 

This post originally appeared as a contributed article on The Stack, a UK-based business technology publication founded in 2020 by Ed Targett. 

 
