Parsing SQL queries provides superpowers for monitoring data health. This post describes how to get started on parsing
SQL for data observability.
Query history of a data warehouse is a rich source of information to glean how data is used in your organization.
Many aspects of data observability can be tracked by analyzing query history. For example, query history analysis can
extract:
- Popular tables and columns
- Unused tables and columns
- Column-level lineage
- Freshness
These statistics also help to automate common data engineering tasks like:
- Backup and Disaster Recovery
- Triage Data Quality issues
- Track sensitive data and how they are used.
Challenges and Approaches
SQL language is an ISO/IEC standard and the latest version is SQL2016.
However, every database implements the standard
differently, uses different function names for the same operation, and has extensions to access specific custom features.
Therefore, there isn’t one SQL parser for dialects of all popular databases and data warehouses.
Regular expressions is a popular approach to extract information from SQL statements. However, regular expressions quickly
become too complex to handle common features like WITH, sub-queries, windows clauses, aliases and quotes.
sqlparse is a popular python package that uses regular expressions to parse
SQL.
An alternate approach is to implement the SQL grammar using parser generators like ANTLR. There
are similar open source parser generators in other popular languages.
There are multiple projects that maintain parsers for popular open source databases like MySQL and Postgres. For other
open source databases, the grammar can be extracted from the open-source project. For commercial databases, the only
option is to reverse engineer the complete grammar. There are SQL parser/optimizer platforms like Apache Calcite
that help to reduce the effort to implement the SQL dialect of your choice.
Open Source Parsers
Some popular open source databases and data warehouses are:
MySQL/MariaDB
- Pingcap parser is a MySQL parser in Go.
- SQL Parser in phpmyadmin is a validating SQL lexer and parser with a focus on MySQL dialect.
Postgres
libpg_query extracts the parser (written in C) from the postgres project and
packages it as a stand-alone library. This library is wrapped in other languages by other projects like:
- Python: pglast
- Ruby : pg_query
- Golang: pg_query_go
- JS: psql-parser in Node and pg-query-emscripten in the browser
- Rust: pg_query.rs
Multiple Engines
- queryparser implements Apache Hive, Presto/Trino and Vertica dialects.
- zetasql implements BigQuery, Spanner, and Dataflow dialects.
Generic Parsers
- Python: sqlparse
- Rust: sqlparser-rs
- Python: mo-sql-parseing
Platforms
Parser/Optimizer platforms implement the common SQL language features and allow customization as first-class feature
of the platform. Two popular open source projects are:
- Apache Calcite is a popular parser/optimizer that is used in popular databases and query engines like Apache Hive, BlazingSQL and many others.
- JSQLParser can parse multiple SQL dialects like MySQL, Postgres and Oracle. The grammar can be modified to support other SQL dialects.
Apache Calcite allows customizations at various points of the parsing process.
- Parser rules can be changed to support custom syntax.
- Conventions such as quotes vs double quotes, case sensitivity.
- Add optimizer rules.
Apache Calcite also provides visitors for traversing the SQL execution plan. Visitor pattern is an algorithm to traverse
a SQL plan.
Practical tips to Getting Started
There are many abandoned open source SQL parsers. The first filter is to use a project that will be supported in the
future. For popular databases such as Postgres and MySQL/MariaDB, there are parsers available in multiple programming
languages.
What if there is no parser for your database?
Most teams do not create a parser from scratch. A popular option is to use the Postgres parser and then add custom
SQL syntax. AWS Redshift, Vertica and DuckDB are examples. Use a Postgres SQL parser to parse query history of these
databases to parse the majority of the queries.
Many queries will fail to parse such as UNLOAD in AWS Redshift. If it is important to also parse the variants, consider
modifying the projects to accept the custom grammar OR use a platform like Apache Calcite.
Conclusion
There is a demand for SQL parsers to build reports on database or data warehouse usage. There are a number of good
open-source projects. However, there is a steep learning curve to use these projects and in many cases a project may not
fit your specific requirements.
Struggling with parsing query history? Get in touch
Top comments (0)