Simplifying Data Governance with DataHub and PostgreSQL Integration

In the digital age, data is growing at an unprecedented rate. Terabytes of new data are created—almost as effortlessly as you take a breath. This rapid influx of data presents both opportunities and challenges for businesses, organizations, and data professionals alike. The need to manage, understand, and make sense of this data has never been more crucial.

As data continues to grow, tools for managing and governing it, like DataHub, become essential. DataHub stands out by empowering organizations to centralize and organize their metadata, making data more accessible and useful. Whether you're working with small datasets or large enterprise databases, DataHub offers the capabilities needed to ensure your data is not only discoverable but also governed and well-documented.

Managing a database is just a small part of the larger picture, much like how DataHub plays a key role in the vast ocean of data management. In this blog, we’ll take a closer look at how you can ingest a PostgreSQL database into DataHub, explore its key features, and see how it can help you manage and visualize your data with ease.

Prerequisites

(Note: the steps below assume an Ubuntu environment.)
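Before starting, you will typically need Docker, Docker Compose, Python 3, and the DataHub CLI on your machine. A minimal setup sketch on Ubuntu follows (package names and versions are indicative; check the official DataHub quickstart guide for your environment):

```bash
# Install Docker and Python tooling from Ubuntu's repositories
sudo apt-get update
sudo apt-get install -y docker.io docker-compose python3-pip

# Install the DataHub CLI
python3 -m pip install --upgrade acryl-datahub

# Verify the CLI installed correctly
datahub version
```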

Overview

When ingesting a PostgreSQL database into DataHub, the process begins by connecting DataHub to the database using a configured source connector; in practice, you create a .yml recipe file that acts as a bridge between DataHub and the PostgreSQL database. DataHub then extracts metadata such as the database structure, schemas, tables, columns, and relationships. Key features like table-level lineage, data profiling, and classification can be enabled during this process to enhance data discovery and governance.

The ingestion process also captures additional metadata such as column types, foreign key relationships, and constraints. Once ingested, this metadata is indexed and made available for searching, browsing, and visualization within the DataHub platform.

The overall goal is to centralize the metadata, improve data discoverability, and facilitate collaboration across teams by providing a comprehensive view of the database structure and data flow.

After running the command

datahub docker quickstart

you will be able to access the DataHub UI at http://localhost:9002

Source Connector

To ingest metadata from PostgreSQL into DataHub, a source connector is used, defined in a configuration file (typically a .yml file). For instance, create a file called postgresql_ingestion.yml that contains all the connection details and parameters required for the ingestion process.

The configuration file acts as a bridge between your PostgreSQL database and DataHub. It includes essential details like the host, database name, username, and the DataHub server link to ensure that metadata is successfully extracted and ingested into the platform.
For instance, the recipe could look like the example below.
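The following is a minimal sketch of postgresql_ingestion.yml; the host, database name, and credentials are placeholders to adapt to your own setup:

```yaml
# postgresql_ingestion.yml -- example recipe (connection values are placeholders)
source:
  type: postgres
  config:
    host_port: "localhost:5432"        # PostgreSQL host and port
    database: weather_db               # database to ingest
    username: postgres
    password: "${POSTGRES_PASSWORD}"   # resolved from an environment variable
    include_tables: true
    include_views: true

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"    # DataHub GMS endpoint from the quickstart
```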

After setting up the source connector in the .yml configuration file, the next step is to ingest your database into DataHub. This process involves running the ingestion command, which uses the configuration file to extract metadata from your database and push it to your DataHub instance.

datahub ingest -c postgresql_ingestion.yml

With the dataset now ingested into DataHub, all its information is seamlessly organized and made available for efficient management and governance through a suite of advanced features.

Features

Let's understand these features using the weather data project example I discussed in the Data Orchestration blog post. This project involves a schema containing hourly weather data, daily weather summaries, and global averages, now seamlessly integrated into DataHub for enhanced data management and governance.

Containers

Containers represent the weather data as logical groupings like databases, schemas, and tables. For example, hourly weather data can live in one schema while daily and global data occupy others, enabling clear categorization and streamlined navigation.


Classification

By tagging sensitive weather data or business-critical metrics (e.g., identifying PII (Personally Identifiable Information) such as location, date, or address, or other compliance-relevant fields), we can classify the data effectively to enforce security, regulatory compliance, and usage guidelines.
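If your DataHub release's PostgreSQL connector supports automated classification, it can be switched on in the recipe; a sketch follows (field names per DataHub's classification config, which may vary by version):

```yaml
source:
  type: postgres
  config:
    # ... connection details as in postgresql_ingestion.yml ...
    classification:
      enabled: true
      classifiers:
        - type: datahub   # built-in classifier that proposes info types such as Email or Phone Number
```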

Data Profiling

Through SQL profiling, DataHub can generate comprehensive statistics for tables, such as average temperatures, the frequency of anomalies, and missing data percentages.
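Profiling is enabled in the same recipe; a minimal sketch (the column-level option shown is optional and can be costly on large tables):

```yaml
source:
  type: postgres
  config:
    # ... connection details as in postgresql_ingestion.yml ...
    profiling:
      enabled: true                      # compute row counts and column statistics during ingestion
      profile_table_level_only: false    # set to true to skip per-column statistics
```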

Description

Each table, column, or dataset can include metadata descriptions (e.g., "Hourly Weather Data includes temperature, humidity, and wind speed per hour"). This ensures clear context, enabling users and systems to understand the data easily without repeated explanations.


Detect Deleted Entities

If data is removed (e.g., outdated weather stations or deprecated data tables) from the PostgreSQL database, DataHub detects and reflects these changes, reducing clutter and ensuring that outdated references no longer impact the system.
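Stale-entity detection relies on DataHub's stateful ingestion; a sketch of the relevant recipe settings (the pipeline_name value is a placeholder, but some stable name is required so state can be compared across runs):

```yaml
# A stable pipeline_name at the top level of the recipe lets DataHub compare runs
pipeline_name: weather_postgres_ingestion
source:
  type: postgres
  config:
    # ... connection details as in postgresql_ingestion.yml ...
    stateful_ingestion:
      enabled: true                  # remember what was ingested in previous runs
      remove_stale_metadata: true    # soft-delete entities that disappeared from the source
```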

Domains

Domains allow grouping datasets by their purpose, like "Forecasting," "Historical Weather," and "Climate Analytics." This structure simplifies data governance by enabling domain-specific visibility and control.
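Datasets can be assigned to domains directly from the recipe; a sketch, assuming domains for Forecasting and Historical Weather already exist in DataHub and that the table patterns below match your schema:

```yaml
source:
  type: postgres
  config:
    # ... connection details as in postgresql_ingestion.yml ...
    domain:
      "urn:li:domain:forecasting":          # assign matching tables to the Forecasting domain
        allow:
          - "weather_db.public.hourly_.*"
      "urn:li:domain:historical_weather":   # assign matching tables to the Historical Weather domain
        allow:
          - "weather_db.public.daily_.*"
```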


Platform Instance

Platform-specific metadata allows you to tag and distinguish between different instances of the PostgreSQL database, such as production and staging environments. This makes it easier to track where the data is coming from and avoids any confusion between test data and live production data.
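A platform instance is set per recipe, so a production and a staging database can each run their own ingestion with a distinct label; a sketch (the instance name is a placeholder):

```yaml
source:
  type: postgres
  config:
    # ... connection details as in postgresql_ingestion.yml ...
    platform_instance: weather_prod   # label distinguishing this server from, e.g., a staging instance
```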


Schema Metadata

The weather project’s database schema, with details on tables, columns, and relationships, is automatically extracted and indexed, enabling quick discovery of how data is organized. For instance, knowing the relationship between hourly data and daily summaries aids in analysis automation.

Table-Level Lineage

Tracking lineage maps the entire journey of the data, showing how the hourly weather dataset is aggregated into daily summaries and further processed to generate global averages. This transparency ensures trust in the results and makes it easier to identify and resolve any issues.


So...

After exploring all the features, it's clear that integrating DataHub with your PostgreSQL database can really take data management to the next level. For example, lineage tracking allows you to easily trace how your weather data flows, from hourly temperature readings to daily summaries and even global averages. This transparency not only builds trust in the data but also makes it much easier to fix any issues if something goes wrong.

Another powerful feature is data profiling, which helps you monitor the quality of your data. Whether it's checking for missing values or spotting unusual patterns, it ensures the data you're working with is reliable—something that's especially important when you're forecasting weather trends or working with large datasets.

By using these features, you're not just managing data—you're ensuring that it’s secure, accurate, and trustworthy. DataHub is like an extra layer of safety and efficiency for your weather project, streamlining workflows while giving you confidence that the data you’re using is always up to par.

Adopting DataHub isn't just about tools; it's about improving how we handle data so that we can make better decisions and keep things running smoothly.
