Data governance means different things to different people. To some, it involves security and access to data. To others, it’s about how consistent the data is from system to system. And to others, it revolves around master data management. In a large multifaceted organization, it could mean all of the above.
Data governance originated with transactional systems, such as point-of-sale and shop floor management, and was primarily centered on security and access. As new transactional systems such as ERPs and CRMs came online, naming conventions and standardization across processes were needed to minimize redundant data and duplicative data entry work. Ensuring the quality of the data was also important, so automated checks were placed on inputs to keep entry errors from being passed to downstream processes.
A well-designed data governance program today might be run by a dedicated data management team. It could also include executives and business leaders chartered with setting the rules and processes for managing data availability, usability, integrity, and security across the entire enterprise.
Regardless of how it’s done, the aim of data governance is to provide the organization with secure, high quality, high confidence data to fuel business processes, analysis, and decision making.
However, due to traditional data preparation practices and technical limitations, data governance within analytic pipelines has been limited. This has prevented organizations from using certain types of data and from doing certain types of analysis.
Let’s unpack the last of those, security, because this limitation is so common that most people have simply come to accept it, even though its causes and implications may not be completely obvious.
In a transactional system, data security is enforced through the kinds of actions a user can take and the types of data they can access. For example, while everyone can see an organization chart, only HR managers can make changes in the HR system.
In addition, security is often enforced at the item level. Think of rows, columns, and individual cells in a spreadsheet. There are rules to govern who can see each piece of data. Health data is a good example. Just a few people need to see your prescription history, a somewhat larger group of people need to know your health plan coverages, and to everyone else you’re just a statistic.
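To make item-level security concrete, here is a minimal, hypothetical sketch in Python: each field of a record carries the set of roles allowed to read it, and a read returns only the fields the viewer’s role permits. All role and field names are illustrative, not taken from any real system.

```python
# Hypothetical sketch of item-level (cell/column) security in a
# transactional system: each field carries the set of roles allowed
# to read it, and reads are filtered per viewer role.

RECORD = {
    "employee_id":   {"value": "E1001",        "roles": {"hr", "pharmacy", "analyst"}},
    "health_plan":   {"value": "PPO-Standard", "roles": {"hr"}},
    "prescriptions": {"value": ["drug-a"],     "roles": {"pharmacy"}},
}

def visible_fields(record, role):
    """Return only the fields this role is allowed to see."""
    return {name: cell["value"]
            for name, cell in record.items()
            if role in cell["roles"]}

print(visible_fields(RECORD, "pharmacy"))  # employee_id + prescriptions
print(visible_fields(RECORD, "analyst"))   # employee_id only
```

The pharmacy role sees the prescription history, HR sees the plan coverage, and everyone else sees only the anonymized identifier — matching the health data example above.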
But these security policies do not necessarily carry over to analytic systems, because of the way we’ve traditionally built analytics pipelines. By transforming and reshaping data to optimize it for analytics, we lose the data detail needed to replicate the security controls present in the transactional system. As a result, we limit the data we can analyze, the kinds of analysis we can do, and the trust we have in the data.
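A small, hypothetical sketch of how that detail gets lost: the source system enforces a row-level rule (viewers see only their own department), but a traditional ETL step pre-aggregates the data, discarding the very columns the rule keys on. All names and figures are made up for illustration.

```python
# Hypothetical sketch: pre-aggregation in an ETL step destroys the
# columns that row-level security depends on.

rows = [
    {"employee": "A", "department": "eng",   "salary": 100},
    {"employee": "B", "department": "eng",   "salary": 120},
    {"employee": "C", "department": "sales", "salary": 90},
]

# Source-system policy: a viewer sees only their own department's rows.
def secure_view(rows, viewer_department):
    return [r for r in rows if r["department"] == viewer_department]

# Traditional ETL reshaping: a company-wide average, detail discarded.
warehouse_table = {"avg_salary": sum(r["salary"] for r in rows) / len(rows)}

print(secure_view(rows, "eng"))  # policy enforceable on the source rows
print(warehouse_table)           # no department column left to filter on
```

Once only `warehouse_table` reaches the analytic system, there is nothing left to apply the department filter to — the policy cannot follow the data.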
So one of two things happens with sensitive data: either it never goes into an analytic system and you simply don’t analyze it, or it gets put into a system restricted to a very small number of people. This frequently happens with HR data, sourcing information for components that are trade secrets, and PII, for example.
These are important data sets, so there are some significant implications to not being able to analyze them. Let’s say you conduct employee satisfaction surveys on a regular basis, and you want to see if there’s a correlation between satisfaction and salary. Or you want to do a vendor cost analysis but you don’t know where the goods are coming from. You can’t do it because there are too many restrictions and controls on these data sets.
This leaves a lot of dark corners in enterprise analytics where data can't be shared or blended with your other organizational data, or even seen by most people.
Fortunately, times and technology have changed, and we can approach data preparation and storage for analysis in a new way that preserves good data governance while providing access to data. The basic principles are:

- Keep the original source data untouched, without reshaping or pre-aggregating it for analytics.
- Apply analytical semantics, structure, and security policies when the data is read, not when it is loaded.
Now, if you tell a data professional that we can all run analytics on unprocessed original data (still in its transactional shape, typically third normal form), they will say you are out of your mind. This is because traditional data pipelines, with ETL and data warehouses, can’t do this.
But with the right foundation — including a hybrid architecture leveraging data lakes, columnar storage, an in-memory analytics engine, and some advanced optimization techniques — we can shift the paradigm and gain the best of all worlds.
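As one illustration of why columnar storage suits this kind of workload, here is a hypothetical toy sketch: pivoting rows into columns lets an aggregate scan a single contiguous column and skip everything else. This is a model of the idea only, not how any particular engine stores data.

```python
# Hypothetical sketch: the same table in row-oriented and
# column-oriented layouts. An aggregate over one column in the
# columnar layout reads one contiguous list and skips the rest.

row_store = [
    {"order_id": 1, "region": "EMEA", "amount": 250.0},
    {"order_id": 2, "region": "APAC", "amount": 410.0},
    {"order_id": 3, "region": "EMEA", "amount": 130.0},
]

# The same data, pivoted into columns.
column_store = {
    "order_id": [1, 2, 3],
    "region":   ["EMEA", "APAC", "EMEA"],
    "amount":   [250.0, 410.0, 130.0],
}

# Summing `amount` touches one list; order_id and region are never read.
total = sum(column_store["amount"])
print(total)  # 790.0
```

In a row store, the same sum would have to walk every field of every row, which is why analytic engines favor the columnar layout.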
Incorta is an example of just such a system.
With Incorta, you bring in transactional data — direct and from multiple sources — for analysis without having to reshape or pre-aggregate it. Analytical semantics and structure are applied as the data is queried. This brings a host of benefits:

- The data detail needed to enforce source-system security controls is preserved, so item-level policies can carry over into analytics.
- Sensitive data sets such as HR data, trade-secret sourcing information, and PII can be analyzed without opening them up to everyone.
- Analysis runs on high quality, high confidence data rather than on reshaped copies.
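Under this query-time model, a read can apply the source-system security filter and the analytic aggregation together over untouched detail rows. A hypothetical sketch, with all names and figures illustrative:

```python
# Hypothetical sketch: detail rows are stored untouched, and both the
# security filter and the aggregation are applied at query time, so no
# governance detail is lost to pre-processing.

detail_rows = [
    {"employee": "A", "department": "eng",   "salary": 100},
    {"employee": "B", "department": "eng",   "salary": 120},
    {"employee": "C", "department": "sales", "salary": 90},
]

def query_avg_salary(rows, viewer_department):
    """Filter by the source-system policy, then aggregate, all on read."""
    allowed = [r for r in rows if r["department"] == viewer_department]
    return sum(r["salary"] for r in allowed) / len(allowed)

print(query_avg_salary(detail_rows, "eng"))  # 110.0
```

Because the department column survives all the way to the query, the row-level policy that held in the transactional system still holds in the analytic one.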
While data governance was originally designed for transactional systems to yield quality data for analysis, the traditional way of building data pipelines limited the application of good data governance practices to analytics. This prevented organizations from doing deep analysis, or any analysis at all, on some very valuable data sets.
We now have technology that supports best practices such as keeping the original data untouched and applying data semantics and security upon read. Good data governance practices are preserved, and we can deliver more high quality, high confidence data for analysis and decision making.
Ready to find out what this new paradigm can do for your organization? Spin up a free trial today and try it for yourself.