DEV Community

ramya N


Why Data Quality Dimensions Are the Secret Ingredient for Data-Driven Success

In today’s fast-paced world of data collection, we are generating and accumulating more data than ever before. Much of this data remains unorganized and poorly defined, which prevents business professionals from extracting valuable insights from it. Let’s explore why data quality dimensions are needed in this article.

Identification of the Dimensions

Organizations should start their data quality journey by identifying the need and understanding the problems they are facing as a whole. Deciding on the data quality dimensions is the most important part of the roadmap. It sets the tone for the insights and how we understand our data from the statistics provided.

There are generic data dimensions that can fit into any data quality (DQ) solution, which we will discuss below. Not all dimensions need to be applied to all datasets. For example, the Timeliness dimension is not meaningful for data entities that are not refreshed from their sources frequently. Domain experts should help identify the dimensions appropriate to each source, considering load frequency, type of data, data requirements, etc.

Data Quality Dimensions

Some of the commonly used dimensions include:

  • Completeness measures how fully populated a data entity is, indicating how much of the data is actually usable. A column with many null values signals missing data; in the worst cases, the data procurement and ETL processes must be analyzed to find the root cause.
  • Timeliness measures how recently and how frequently the data is loaded. It typically involves checking last-updated timestamps or load logs to verify when the data was last refreshed.
  • Uniqueness ensures that a column contains no duplicate values, preserving the integrity of the table. It is most often applied to primary keys and other unique keys defined on the table.
  • Conformity verifies whether imported data is valid. Data sometimes contains unsupported or garbage symbols that can cause significant issues downstream; this dimension ensures values match the required format.
  • Accuracy assesses how well the data aligns with the stated requirements. For instance, if a table contains an address and a country code, and the country code does not match the address, the record is considered inaccurate. This check ensures the data is true and usable.
  • Connectedness verifies the referential integrity of the table: it checks whether each foreign key value exists in the referenced table, ensuring the integrity of the relationships between tables as a whole.
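As a rough illustration (not tied to any particular DQ tool), the first few dimensions above can be expressed as simple checks over a column of values. The sample records, column names, and email-format rule here are all hypothetical:

```python
import re

# Hypothetical sample records: (id, email)
records = [
    (1, "a@example.com"),
    (2, None),
    (3, "not-an-email"),
    (3, "b@example.com"),  # note the duplicate id
]

ids = [r[0] for r in records]
emails = [r[1] for r in records]

# Completeness: share of non-null values in the email column
completeness = sum(e is not None for e in emails) / len(emails)

# Uniqueness: share of distinct values in the id column
uniqueness = len(set(ids)) / len(ids)

# Conformity: share of values matching an assumed email format
email_re = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
conformity = sum(bool(email_re.match(e)) for e in emails if e is not None) / len(emails)

print(completeness, uniqueness, conformity)  # 0.75 0.75 0.5
```

Real DQ platforms run checks like these at scale and aggregate the results, but the per-dimension logic is essentially this simple ratio of passing values to total values.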

Other dimensions include consistency, validity, reliability, usability, availability, etc. The dimensions should be selected based on business requirements.

How dimensions affect the Data Quality Score

There are datasets to which many of these dimensions simply do not apply. For such datasets, the applicable dimensions must be identified and applied selectively; otherwise, the average DQ score can end up misleadingly low. For example, a free-text field that collects symbols from user input cannot be evaluated under typical conformity rules, as doing so would produce an artificially low DQ score.
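A quick sketch of why this matters when dimension scores are averaged into an overall DQ score. The dimension names and numbers below are hypothetical:

```python
# Hypothetical per-dimension scores for one dataset (scale 0..1)
scores = {
    "completeness": 0.95,
    "uniqueness":   1.00,
    "conformity":   0.10,  # field holds free-form symbols, so the rule misfires
}

# Naive average: the inapplicable conformity check drags the score down
naive = sum(scores.values()) / len(scores)

# Adjusted average: drop dimensions flagged as inapplicable for this dataset
applicable = {k: v for k, v in scores.items() if k != "conformity"}
adjusted = sum(applicable.values()) / len(applicable)

print(round(naive, 3), round(adjusted, 3))  # 0.683 0.975
```

The same data goes from "poor" to "excellent" purely based on whether an inapplicable dimension is included, which is why dimension selection should come before scoring.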

By carefully selecting and applying the appropriate data quality dimensions, organizations can ensure their data is reliable, actionable, and ready to drive impactful decisions.

Overlapping of Dimensions

While calculating the conformity score for a field, the results might be skewed if null values are included in the denominator of the data quality (DQ) score. To address this, overlapping dimensions can provide a more accurate DQ score.

For instance, consider a field labeled Optional that contains many null values because it is not mandatory. When calculating the conformity score for this field, null values should be excluded to avoid artificially lowering the score. Instead, we should calculate conformity based only on non-null values.

Example table

Here’s an example table (the values are illustrative, assuming the field is expected to match an `ABC-<number>` format):

| Value |
| ----- |
| ABC-1 |
| ABC-2 |
| xyz   |
| null  |
| null  |

In this table, the completeness DQ score would be 3/5 (three non-null values out of five total entries). The conformity score, calculated without overlapping, would be 2/5 (two valid entries out of five total entries). However, if we overlap dimensions by excluding null values, the conformity score becomes 2/3 (two valid entries out of three non-null entries), providing a more accurate representation of the data’s quality.
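The two ways of computing conformity can be sketched in a few lines. The five values and the `ABC-<number>` format rule are illustrative, matching the counts described above (three non-null values, two of them valid):

```python
import re

# Five values: three non-null, two conforming to the assumed format
values = ["ABC-1", "ABC-2", "xyz", None, None]

pattern = re.compile(r"^ABC-\d+$")  # hypothetical expected format

# Completeness: non-null values over all entries
completeness = sum(v is not None for v in values) / len(values)  # 3/5

# Conformity without overlapping: nulls stay in the denominator
naive_conformity = sum(
    bool(pattern.match(v)) for v in values if v is not None
) / len(values)  # 2/5

# Overlapped conformity: score only the non-null values
non_null = [v for v in values if v is not None]
overlapped_conformity = sum(
    bool(pattern.match(v)) for v in non_null
) / len(non_null)  # 2/3
```

The overlapped version answers the more useful question: of the values that were actually supplied, how many are in the right format? Missing data is already penalized by the completeness score, so there is no need to penalize it twice.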

Conclusion

The choice of data quality dimensions and the order in which they are applied result in different data quality scores. Identifying the DQ dimensions is like crossing half of the ocean: once they are identified, organizations can surf through the metadata and gain valuable insights.

In this article, we explored data quality in terms of dimensions and highlighted the importance of tailoring DQ checks to specific datasets. By understanding and applying these dimensions effectively, organizations can ensure their data is reliable, actionable, and ready to drive impactful decisions. Start by assessing your data quality strategy today and take the first step toward unlocking the true potential of your data.

Check out my blog here.
