Iceberg is gaining traction, and Collibra is expanding its presence as a data governance tool, primarily in Europe. Meanwhile, Snowflake remains a well-known data lakehouse solution. I previously wrote a blog on using AWS Iceberg tables within Snowflake: Snowflake and Iceberg.
Collibra Governance/Catalog
Collibra offers a suite of tools for data governance. At a high level, it provides the following capabilities:
- Importing metadata from databases
- Classifying data (e.g., identifying and classifying as PII, Confidential, Public)
- Creating datasets in the Business Catalog
For more information, visit the Collibra Website.
Data Access Rules & Collibra Protect
Creating and maintaining fine-grained access control is a complex task. Under GDPR, employees should only have access to the data they actually need. But how do you handle a dataset with 100 columns when only two of them are GDPR-sensitive, such as name and birthday? Do you block access to the whole dataset, create a new view without those two columns, or mask them?
Collibra Protect offers a solution. It enables data stewards to create rules based on datasets, classification, and groups to define access. Collibra Protect then translates these rules into grants within the underlying data lakehouse. With Snowflake as the target, this allows data stewards to create complex grants with masking and row/column-level access control.
SQL Example from Collibra Protect
Here’s an example that creates a masking policy which SHA2-hashes a value whenever the current role is USERS_NON_PII, and attaches it to the EMAIL column of the ACCOUNT table.
CREATE OR REPLACE MASKING POLICY "GLUE_CATALOG"."MYDATABASE"."COLLIBRA/MASKING_POLICY/01932628-ebd2-7136-bfc4-19b62f760a88"
  AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() = 'USERS_NON_PII' THEN sha2(val)
    ELSE val
  END;

ALTER TABLE IDENTIFIER('GLUE_CATALOG.MYDATABASE.ACCOUNT')
  MODIFY COLUMN "EMAIL"
  SET MASKING POLICY "GLUE_CATALOG"."MYDATABASE"."COLLIBRA/MASKING_POLICY/01932628-ebd2-7136-bfc4-19b62f760a88";
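To sanity-check the policy, you can query the column under different roles. The snippet below is illustrative only; the ANALYST role is a hypothetical role that is not covered by the masking condition.

-- With the USERS_NON_PII role active, EMAIL comes back as a SHA2 hash
USE ROLE USERS_NON_PII;
SELECT EMAIL FROM GLUE_CATALOG.MYDATABASE.ACCOUNT LIMIT 5;

-- With any other role, the value is returned in clear text
USE ROLE ANALYST;
SELECT EMAIL FROM GLUE_CATALOG.MYDATABASE.ACCOUNT LIMIT 5;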
So far, so good with standard Snowflake tables.
Collibra and Snowflake Iceberg Tables
Snowflake Setup
Snowflake supports Iceberg tables it manages itself, as well as Iceberg tables managed by an external catalog such as AWS Glue. See the article for more details.
CREATE OR REPLACE ICEBERG TABLE glue_db.glue_schema.my_iceberg_table
  EXTERNAL_VOLUME = 'iceberg_ext_vol'     -- placeholder: external volume pointing at s3://your-bucket-name/path/to/iceberg/data/
  CATALOG = 'glue_catalog_int'            -- placeholder: catalog integration for the AWS Glue catalog
  CATALOG_NAMESPACE = 'your_glue_database'
  CATALOG_TABLE_NAME = 'your_glue_table';
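For completeness: the statement above assumes a catalog integration and an external volume already exist. A minimal sketch of those prerequisite objects, with placeholder names, account IDs, role ARNs, and region, could look like this:

CREATE EXTERNAL VOLUME iceberg_ext_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'iceberg-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://your-bucket-name/path/to/iceberg/data/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
    )
  );

CREATE CATALOG INTEGRATION glue_catalog_int
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'your_glue_database'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
  GLUE_CATALOG_ID = '123456789012'
  GLUE_REGION = 'eu-west-1'
  ENABLED = TRUE;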
A standard table:
CREATE TABLE my_table (
  id INT AUTOINCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  email VARCHAR(255),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Note the difference between CREATE ICEBERG TABLE and CREATE TABLE. I’ll return to this later.
Collibra Catalog Import
When configuring Collibra Catalog ingestion, you need to set up a JDBC connection and the JDBC ingestion capability; refer to the Collibra manual for details. Although there are many steps on the Collibra side, under the hood Collibra executes a substantial amount of SQL against Snowflake to gather metadata. The crucial query is:
SELECT
  table_name
FROM IDENTIFIER('glue_catalog.information_schema.tables')
WHERE
  table_schema = 'MYDATABASE'
  AND table_type = 'BASE TABLE';
This retrieves all tables from the information_schema where the schema is MYDATABASE and the type is BASE TABLE. Collibra then gathers all column information for those tables.
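The column details are presumably gathered with a similar query against information_schema.columns; a sketch of what that could look like (not the literal Collibra query):

SELECT
  column_name,
  data_type,
  is_nullable,
  ordinal_position
FROM IDENTIFIER('glue_catalog.information_schema.columns')
WHERE
  table_schema = 'MYDATABASE'
  AND table_name = 'ACCOUNT'
ORDER BY ordinal_position;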
Collibra Protect and Iceberg Tables
Iceberg tables appear as standard tables in Collibra, so data stewards can catalog and classify their fields as usual. However, once access rules from Collibra Protect are applied, the following error occurs:
SQL Compilation error: The table ACCOUNT is an Iceberg table. Iceberg tables should use ALTER ICEBERG TABLE commands. Please ensure the appropriate command format is used.
This happens because Collibra issues the statement ALTER TABLE IDENTIFIER('GLUE_CATALOG.MYDATABASE.ACCOUNT') instead of ALTER ICEBERG TABLE IDENTIFIER('GLUE_CATALOG.MYDATABASE.ACCOUNT'), which Snowflake requires for Iceberg tables.
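Applied to the earlier masking example, the statement Snowflake expects would look roughly like this (same placeholder names as above):

ALTER ICEBERG TABLE IDENTIFIER('GLUE_CATALOG.MYDATABASE.ACCOUNT')
  MODIFY COLUMN "EMAIL"
  SET MASKING POLICY "GLUE_CATALOG"."MYDATABASE"."COLLIBRA/MASKING_POLICY/01932628-ebd2-7136-bfc4-19b62f760a88";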
Deep Dive
I wanted to uncover the root cause of this issue.
Snowflake Table Types
According to the Snowflake documentation, the TABLE_TYPE column in INFORMATION_SCHEMA.TABLES can have several values:
- BASE TABLE
- TEMPORARY TABLE
- EXTERNAL TABLE
- EVENT TABLE
- VIEW
- MATERIALIZED VIEW
These correspond to different SQL statements for creating and altering the various object types, such as ALTER MATERIALIZED VIEW and ALTER EXTERNAL TABLE. However, no dedicated type exists for Iceberg tables; they appear as BASE TABLE.
Two additional fields differentiate base table types:
- IS_ICEBERG: Indicates whether the table is an Iceberg table.
- IS_DYNAMIC: Indicates whether the table is a dynamic table.
I verified that my Iceberg tables have TABLE_TYPE='BASE TABLE' and IS_ICEBERG='YES'.
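The check itself is a simple query against the same view (catalog and schema names follow the earlier examples):

SELECT
  table_name,
  table_type,
  is_iceberg,
  is_dynamic
FROM IDENTIFIER('glue_catalog.information_schema.tables')
WHERE
  table_schema = 'MYDATABASE'
  AND table_name = 'ACCOUNT';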
It seems Snowflake is inconsistent here. While most special table/view types have their own TABLE_TYPE, Iceberg is the exception. It’s marked as a BASE TABLE but requires special commands like ALTER ICEBERG TABLE.
Collibra
Collibra's metadata ingestion ignores special flags from the information_schema. The is_iceberg flag is neither queried nor stored with the table information. Collibra Protect retrieves the database/schema/table but is unaware of the special type.
Fix/Future
This is an edge case I ran into during a PoC involving both products. It’s a tricky issue, and I hope this blog helps clarify it for both parties. If Collibra captured more context about tables during ingestion, such as the IS_ICEBERG flag, Collibra Protect could use that information to generate the correct type of statement, as sketched below. This could also improve support for other table types in Collibra, such as event tables and materialized views.
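As a sketch of what that could look like (my suggestion, not current behavior of either product), the ingestion query only needs to return the extra flag, and Protect could branch on it:

SELECT
  table_name,
  table_type,
  is_iceberg
FROM IDENTIFIER('glue_catalog.information_schema.tables')
WHERE
  table_schema = 'MYDATABASE'
  AND table_type = 'BASE TABLE';

-- Tables with IS_ICEBERG = 'YES' would then be altered with
--   ALTER ICEBERG TABLE ... SET MASKING POLICY ...
-- while regular tables keep the existing
--   ALTER TABLE ... SET MASKING POLICY ...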