The cloud data warehouse market is dominated by names like Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, IBM, Oracle, and Teradata, to name a few. These systems ingest massive amounts of data, from many sources and of many types, to support business intelligence and analytics for organizations. One important problem plaguing them is the lack of an affordable Master Data Management solution that scales with growing volumes and is agile enough to respond to real-time data requests.
Many data warehouses adopt a dimensional model that includes various dimensions such as Customer, Product, Account, Asset, etc. Typically, these dimensions are linked to the transactional data via a fact table. However, if there is no Master Data Management (MDM) system like OpenDQ, processes will create duplicate and erroneous data in the dimension tables. Reporting and analytics that run on this data will have errors in them.
Matching, the core of an MDM system, is the critical process that performs identity resolution: identifying the same customer both across sources and within a single source. Most of the time, matching relies on fuzzy algorithms, since the records rarely share a common identifier and the same value can be written in several different ways.
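To make fuzzy matching concrete, here is a minimal sketch in Python using only the standard library. It is not OpenDQ's matching engine: the field names (name, address), the equal weighting of the two scores, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower()
    )
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Treat two records as the same entity when the average of the
    name and address similarity reaches the (assumed) threshold."""
    name_score = similarity(rec_a["name"], rec_b["name"])
    addr_score = similarity(rec_a["address"], rec_b["address"])
    return (name_score + addr_score) / 2 >= threshold


# Hypothetical records from two source systems with no shared identifier.
crm = {"name": "Jon Smith", "address": "12 Main St."}
billing = {"name": "John Smith", "address": "12 Main St"}
other = {"name": "Mary Jones", "address": "48 Oak Avenue"}

print(is_match(crm, billing))  # spelling and punctuation variations still match
print(is_match(crm, other))    # unrelated records do not
```

In practice the similarity function, the fields compared, and the threshold would all be tuned to the data being mastered.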
OpenDQ, the scalable, zero-license-cost MDM solution, frees you from licensing constraints whether your matching process runs on a single node or on 10,000 nodes. This is an important distinction from competing solutions, which charge licensing fees based on the volume of records, nodes, CPUs, and cores and can be restrictive for customers.
There are two typical use cases for building and maintaining a dimension table in a cloud data warehouse:
Initial full load:
In this step, Customer, Product, Location, and Asset data is extracted from transactional data and then sent through standardization steps into the Matching engine to identify matches. A unique ID is assigned to the cluster of matched records.
Matched records are merged based on survivorship rules.
A Universal ID or a Global ID is assigned to the Master/Surviving Records.
Mastered curated data is published to the Cloud Data warehouse.
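The full-load steps above can be sketched end to end in a few lines of Python. This is an illustrative outline under stated assumptions, not OpenDQ's implementation: the field names, the matching rule (same email, or name similarity of at least 0.85), the "newest non-empty value wins" survivorship rule, and the use of a random UUID as the Global ID are all hypothetical.

```python
import uuid
from difflib import SequenceMatcher


def standardize(record: dict) -> dict:
    """Standardization step: normalize case and whitespace."""
    out = dict(record)
    out["name"] = " ".join(record["name"].lower().split())
    out["email"] = record.get("email", "").strip().lower()
    return out


def same_entity(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Assumed matching rule: identical email or similar name."""
    if a["email"] and a["email"] == b["email"]:
        return True
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold


def cluster(records: list) -> list:
    """Group matching records with union-find; returns one cluster ID per record."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if same_entity(records[i], records[j]):
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]


def survive(members: list) -> dict:
    """Assumed survivorship rule: the newest non-empty value wins per field."""
    master = {}
    for rec in sorted(members, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value not in ("", None):
                master[field] = value
    return master


def full_load(raw: list) -> list:
    records = [standardize(r) for r in raw]
    groups = {}
    for rec, cluster_id in zip(records, cluster(records)):
        groups.setdefault(cluster_id, []).append(rec)
    masters = []
    for members in groups.values():
        master = survive(members)
        master["global_id"] = str(uuid.uuid4())  # Global ID for the surviving record
        master["source_count"] = len(members)
        masters.append(master)
    return masters


raw = [
    {"name": "Acme Corp", "email": "info@acme.com", "phone": "", "updated": "2023-01-10"},
    {"name": "ACME  Corp.", "email": "INFO@ACME.COM", "phone": "555-0100", "updated": "2024-03-02"},
    {"name": "Globex Inc", "email": "sales@globex.com", "phone": "555-0199", "updated": "2023-07-21"},
]
masters = full_load(raw)
for m in masters:
    print(m["global_id"], m["name"], m["phone"], m["source_count"])
```

The pairwise comparison in cluster() is quadratic in the number of records, which is acceptable for a sketch but is exactly the cost that a production matching engine has to avoid at full-load volumes.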
Incremental data loads:
This step must compare the incoming incremental data against the existing master data set and match the two sets together.
Incremental data loads can range from a few thousand records to millions of records.
OpenDQ can use a partition-index-based approach or a cluster-based matching process to identify matches between data records. You can choose the option based on the volume of the data. This approach gives complete flexibility in meeting SLAs for the delivery of data.
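As an illustration of the general idea behind partition-based matching (often called blocking), the sketch below indexes records by a partition key and compares only records that share a key. It does not describe OpenDQ's internal design: the key (first letter of the last name plus the first three characters of the postal code), the field names, and the threshold are assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record: dict) -> tuple:
    """Hypothetical partition key: surname initial plus postal code prefix."""
    return (record["last_name"][:1].lower(), record["postal_code"][:3])


def build_index(records: list) -> dict:
    """Partition index: key -> list of records in that partition."""
    index = defaultdict(list)
    for rec in records:
        index[blocking_key(rec)].append(rec)
    return index


def candidate_pairs(index: dict):
    """Only records inside the same partition are ever compared."""
    for bucket in index.values():
        yield from combinations(bucket, 2)


def find_matches(records: list, threshold: float = 0.85) -> list:
    matches = []
    for a, b in candidate_pairs(build_index(records)):
        score = SequenceMatcher(
            None, a["last_name"].lower(), b["last_name"].lower()
        ).ratio()
        if score >= threshold:
            matches.append((a["id"], b["id"]))
    return matches


records = [
    {"id": 1, "last_name": "Andersen", "postal_code": "94107"},
    {"id": 2, "last_name": "Anderson", "postal_code": "94110"},
    {"id": 3, "last_name": "Alvarez", "postal_code": "94105"},
    {"id": 4, "last_name": "Baker", "postal_code": "10001"},
]
print(find_matches(records))
```

With four records, an all-pairs comparison would evaluate six pairs, while the partition index here evaluates three. The trade-off is that a poorly chosen key can place true duplicates in different partitions, so the key has to tolerate the variations expected in the data.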
One big difference between incremental and full loads is that when a record matches an existing master record, data from the incremental record is generally attached to the master record.
There can be rules on how this data is attached to the master record; some incremental data can overwrite existing data or simply be appended to master data.
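The attach rules described above can be sketched as follows. This is a simplified illustration, not OpenDQ's rule engine: which fields overwrite and which append, the field names, and the name-similarity matching with a 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

OVERWRITE_FIELDS = {"email", "address"}   # assumed: incoming value replaces the old one
APPEND_FIELDS = {"phones", "source_ids"}  # assumed: incoming values accumulate


def attach(master: dict, incoming: dict) -> dict:
    """Return a copy of the master with the incoming record attached."""
    merged = dict(master)
    for field, value in incoming.items():
        if value in ("", None, []):
            continue  # never let an empty value erase master data
        if field in OVERWRITE_FIELDS:
            merged[field] = value
        elif field in APPEND_FIELDS:
            combined = list(merged.get(field, []))
            for item in value:
                if item not in combined:
                    combined.append(item)
            merged[field] = combined
    return merged


def best_master_index(masters: list, incoming: dict,
                      threshold: float = 0.85) -> Optional[int]:
    """Index of the most similar master at or above the threshold, else None."""
    best, best_score = None, threshold
    for i, master in enumerate(masters):
        score = SequenceMatcher(
            None, master["name"].lower(), incoming["name"].lower()
        ).ratio()
        if score >= best_score:
            best, best_score = i, score
    return best


def apply_increment(masters: list, increment: list) -> list:
    result = list(masters)
    for rec in increment:
        idx = best_master_index(result, rec)
        if idx is None:
            result.append(dict(rec))  # no match: the record becomes a new master
        else:
            result[idx] = attach(result[idx], rec)
    return result


masters = [
    {"name": "Jane Doe", "email": "jane@old.example",
     "phones": ["555-0100"], "source_ids": ["crm-1"]},
]
increment = [
    {"name": "Jane Doe", "email": "jane@new.example",
     "phones": ["555-0100", "555-0111"], "source_ids": ["web-9"]},
    {"name": "Raj Patel", "email": "raj@example.com",
     "phones": [], "source_ids": ["web-10"]},
]
updated = apply_increment(masters, increment)
for m in updated:
    print(m)
```

Here the matched record updates the existing master in place of creating a duplicate, while the unmatched record is added as a new master. Keeping the overwrite and append rules in data (the two sets at the top) makes them easy to adjust per dimension.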
In summary, OpenDQ provides a high-performance matching framework. This solution can help your organization meet demanding Service Level Agreements (SLAs) despite growing data volumes and shorter time frames to deliver data in cloud-based data warehouses.
OpenDQ offers Data Quality, Data Governance, and Master Data Management solutions with Zero licensing cost to our customers. Contact us today for a demo of our product!
All product names, logos, and brands are the property of their respective owners.