Data mapping is the process of extracting data fields from one or more sources or databases, identifying them, and mapping them to target fields in a destination database. Data mapping is widely used in analytics, business intelligence and data science applications to gather and organize data so that it can be easily queried.
Approaches to data mapping include manual mapping, in which data from a source is mapped to the destination using code or SQL. While this offers the most flexibility, the process is time-consuming and requires changes to the code or SQL scripts whenever mappings change. Semi-automated mapping uses a combination of coding and a visual user interface to create custom schemas, while fully automated mapping requires no coding and can easily be changed and scaled as needed.
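As a concrete illustration of the manual approach, here is a minimal Python sketch of a hand-coded field mapping. The source and target field names (cust_nm, customer_name and so on) are hypothetical, chosen only to show the pattern:

```python
# Minimal sketch of manual data mapping in Python.
# Source and target field names are hypothetical examples.

# Source-to-target field mapping maintained by hand in code
FIELD_MAP = {
    "cust_nm": "customer_name",
    "cust_email": "email_address",
    "ord_amt": "order_amount",
}

def map_record(source_record: dict) -> dict:
    """Rename source fields to their target names, dropping unmapped fields."""
    return {target: source_record[source]
            for source, target in FIELD_MAP.items()
            if source in source_record}

# Example usage
source = {"cust_nm": "Ada Lovelace", "cust_email": "ada@example.com", "ord_amt": 125.50}
print(map_record(source))
# {'customer_name': 'Ada Lovelace', 'email_address': 'ada@example.com', 'order_amount': 125.5}
```

Every change to the source or target schema means editing code like this by hand, which is why manual mapping becomes hard to maintain as the number of sources grows.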
To obtain actionable insights from data, analysts typically need to gather and aggregate information from multiple sources. For example, having a customer’s contact and demographic information can be helpful — but suppose you can combine that information with details such as prior purchases, browsing activity and chat sessions with a customer service representative. You can understand the customer much better with access to more complete information. This puts you in a better position to mount effective marketing campaigns that will capture their interest, strengthen the relationship, and increase future sales.
Define your data
The first challenge that data engineers often face when gathering and mapping data is making sense of it. Source field formats and encodings can vary widely. For example, the same date might be represented as “2022-01-23”, “Jan 23, 2022” or as the Unix timestamp “1642896000”. Data may also use different character encodings, be inconsistent or imprecise, or contain missing or duplicated values. When mapping data from a source to a destination, data engineers frequently need to account for these differences and transform the data to a standard representation.
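As a small sketch of this kind of normalization, the Python snippet below converts the three example date representations above to a single ISO 8601 form. The function name and the small set of formats it handles are illustrative assumptions, not a complete solution:

```python
from datetime import datetime, timezone

def normalize_date(value: str) -> str:
    """Normalize a few common date representations to ISO 8601 (YYYY-MM-DD)."""
    # Unix timestamp in seconds, e.g. "1642896000"
    if value.isdigit():
        return datetime.fromtimestamp(int(value), tz=timezone.utc).strftime("%Y-%m-%d")
    # Known string formats; a real pipeline would handle many more
    for fmt in ("%Y-%m-%d", "%b %d, %Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

for raw in ("2022-01-23", "Jan 23, 2022", "1642896000"):
    print(raw, "->", normalize_date(raw))  # all three normalize to 2022-01-23
```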
Differences between measures and dimensions
Another important consideration in understanding data is distinguishing between measures and dimensions. A measure is a field that can be aggregated in some way. For example, a numeric field called “order_amount” in an orders table is a measure because it can be summed, averaged or counted. Dimensions, on the other hand, are fields that can be used to segment or bucket data. An analyst might want to examine various measures and roll them up by dimensions such as gender, sales_region or country. Understanding data in these terms is essential when mapping it into a central repository such as a data warehouse: the better data engineers understand the data, the better they can anticipate how it will be used and structure it accordingly.
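To make the distinction concrete, the following Python (pandas) sketch rolls a measure up by a dimension. The tiny data set and its values are invented for illustration; only the column names order_amount and sales_region come from the examples above:

```python
import pandas as pd

# Invented sample data: "order_amount" is a measure, "sales_region" is a dimension.
orders = pd.DataFrame({
    "sales_region": ["EMEA", "EMEA", "AMER", "APAC", "AMER"],
    "order_amount": [120.0, 80.0, 200.0, 150.0, 50.0],
})

# Roll the measure up by the dimension: total and average order amount per region.
summary = orders.groupby("sales_region")["order_amount"].agg(["sum", "mean"])
print(summary)
```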
Challenges of using ETL pipelines for data mapping
Organizations have traditionally addressed the challenge of data mapping with ETL pipelines that extract data from various sources, transform it, and load it into a destination repository. The problem with ETL pipelines is that they typically must be created and maintained by data engineers, and they need to be updated and revalidated whenever source or destination schemas change. ETL pipelines are typically expensive and complex, and can take a long time to execute on large data sets. Also, they run only periodically, so data is often out of date by the time it lands in the data warehouse. ETL pipelines also typically involve creating intermediate data sets, further increasing costs and complicating security and governance.
A better approach: Direct Data Mapping™ tools
Advances such as new columnar file formats and in-memory query engines now allow organizations to efficiently gather data from multiple sources and query it without the need for cumbersome ETL pipelines. Incorta, a unified data analytics platform, employs a revolutionary approach called Direct Data Mapping®. Incorta automatically examines source data and stores it in a fashion that delivers exceptional query performance — even when data sources involve billions of rows, hundreds of joins and thousands of tables.
Incorta’s Direct Data Mapping technology essentially eliminates the need for costly and time-consuming ETL processing. Incorta rapidly loads data using high-throughput data connectors and performs schema introspection, automatically recognizing dimensions and measures. Data is stored in highly efficient, industry-standard Parquet files, where it can be analyzed within Incorta or queried using existing BI tools. By using Incorta with Direct Data Mapping, organizations can reduce costs, dramatically cut data engineering workload, and achieve faster query performance for better insights.
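Incorta’s internal storage engine is not shown here, but Parquet itself is an open, columnar format that ordinary tools can read and write. As a generic sketch (the file name and columns are hypothetical), pandas with pyarrow can round-trip such a file:

```python
import pandas as pd  # requires pyarrow (or fastparquet) for Parquet support

# Hypothetical mapped data set
orders = pd.DataFrame({
    "customer_name": ["Ada Lovelace", "Grace Hopper"],
    "sales_region": ["EMEA", "AMER"],
    "order_amount": [125.50, 310.00],
})

# Write the data as a columnar Parquet file
orders.to_parquet("orders.parquet", index=False)

# Any Parquet-aware tool (pandas, Spark, BI engines) can query the same file;
# requesting specific columns reads only those columns from disk.
print(pd.read_parquet("orders.parquet", columns=["sales_region", "order_amount"]))
```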