On January 14, 2025, dbt Labs announced the acquisition of SDF Labs.
For those unfamiliar, SDF is a multi-dialect SQL compiler, transformation framework, and analytical database engine packaged into one CLI. In the webinar Accelerating dbt with SDF, it was even called "the future of the dbt engine".
On the other hand, in Japan, although I've seen some mentions of the acquisition on X, not many people seem to be taking a deep dive into the upcoming tech.
So in this post, I want to shine a spotlight on SDF’s potential to enhance the dbt development experience, especially in tackling some of the challenges that dbt-osmosis faces in metadata management.
If you want to get hands-on, all the code from this article is available at myshmeh/osmosis_vs_sdf. Feel free to check it out and experiment in your own environment! 🚀
Table of Contents
- Before anything..
- dbt-osmosis: Antidote to Copy-Paste Hell?
- Crash Landing on YAML✈️💥: The Column Rename Dilemma
- Inside the Black Box — The Limits of dbt-osmosis
- SDF to the Rescue! 🚀
- Understanding SQL – More Than Just Strings
- Conclusion – The Future of dbt x SDF
1. Before anything..
First off, huge respect to dbt-osmosis—it’s an outstanding open-source project that has significantly boosted the efficiency of dbt development. At this point, it’s practically the go-to tool for managing model properties, and for good reason.
That said, the issues I’ll be discussing in this post aren’t really about dbt-osmosis itself but stem from dbt’s underlying architecture. My goal here isn’t to critique dbt-osmosis—far from it.
Instead, I’m using a comparison approach to highlight the potential of SDF as a way to navigate some of these challenges. This isn’t about throwing shade at dbt-osmosis or its community; it’s about exploring ways to improve the overall development experience.
2. dbt-osmosis: Antidote to Copy-Paste Hell?
When it comes to managing dbt model property column descriptions, dbt-osmosis has been a game-changer—arguably the de facto standard for data engineers looking to escape the tedious cycle of copy-pasting.
I still remember the first time I used it. Watching `dbt-osmosis yaml refactor` seamlessly propagate column descriptions and metadata felt like a revelation. “This is exactly the kind of automation I’ve been waiting for!”
That said, while using dbt-osmosis, I occasionally ran into situations where renaming columns broke the propagation process. It was frustrating at times, making me wonder: “Under what conditions does propagation work, and when does it fail?”
So, I decided to dig in and break it down. Here’s what I found.
3. Crash Landing on YAML✈️💥: The Column Rename Dilemma
Issue 1: Renaming to Completely Different Name
One of the simplest yet most problematic scenarios is when a column is renamed to something entirely different. In this case, dbt-osmosis fails to propagate the corresponding column description from the parent model.
Let’s take a look at an example. Suppose we have a parent model (`parent.sql`) and a child model (`child1.sql`) defined in our dbt project like this:
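The exact files live in the companion repo (myshmeh/osmosis_vs_sdf); a sketch consistent with the behavior described in this post might look like the following (the literal values in `parent.sql` are placeholders):

```sql
-- models/parent.sql (sketch)
select
    1 as columnA,
    2 as column_b

-- models/child1.sql (sketch): column_b is renamed to something entirely different
select
    columnA,
    column_b as column_name_alias
from {{ ref('parent') }}
```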
Now, when we run `dbt-osmosis yaml refactor`, only the description for `columnA` gets propagated from the `parent` model to `child1`. The renamed column? It’s left behind.
Issue 2: Wait.. Does Some Renaming Actually Work?
So, does renaming always break metadata propagation? Surprisingly, not necessarily.
To explore this, I set up a new child model (`child2.sql`), where columns are renamed by simply converting between camelCase and snake_case:
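Assuming the same two parent columns as before, a sketch of `child2.sql` might look like this—each column simply flips between camelCase and snake_case:

```sql
-- models/child2.sql (sketch): camelCase <-> snake_case renames only
select
    columnA  as column_a,
    column_b as columnB
from {{ ref('parent') }}
```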
Then, I ran `dbt-osmosis yaml refactor`, and here’s what happened:
Let's gooo — it worked for `columnA → column_a`! 🎉
But hold on — why didn’t `column_b → columnB` work? 🤔
Issue 3: What the Hell is Going On?!
Okay, let’s take things up a notch—what if we swap column names entirely? Sure, this isn’t something you’d normally do in a real-world scenario, but let’s see what happens.
Here’s our modified child model (`child3.sql`):
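A sketch of such a name swap (again assuming the same two parent columns):

```sql
-- models/child3.sql (sketch): the two column names are swapped
select
    columnA  as column_b,
    column_b as columnA
from {{ ref('parent') }}
```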
And now, for the moment of truth…
Wait, what?! The description for `parent.column_b` somehow got assigned to `child3.column_b`, and `parent.columnA`’s description landed on `child3.columnA`! 🤯
At this point, we’ve tested three different rename scenarios, and dbt-osmosis fails to propagate descriptions reliably whenever columns are renamed.
To be fair, when column names stay the same, dbt-osmosis works beautifully, seamlessly passing descriptions downstream. But as soon as renaming gets involved, things quickly fall apart.
So, what’s actually happening inside dbt-osmosis that causes this behavior? Let’s dig deeper. 🔍
4. Inside the Black Box — The Limits of dbt-osmosis
A String Comparison Approach
At its core, dbt-osmosis propagates column descriptions by performing string comparisons between parent and child models.
In the simplest case, if the column names match exactly, metadata is passed down correctly—this is what we saw with the `child1` model.
However, in real-world scenarios, we often rename columns for readability, like:
- Converting uppercase to lowercase (or vice versa)
- Switching between camelCase and snake_case
To handle these cases, dbt-osmosis offers fuzzy matching methods for column name mapping. According to the official documentation, the tool supports:
- `FuzzyCaseMatching` — Matches column names by converting them to uppercase, lowercase, `camelCase`, or `PascalCase`.
- `FuzzyPrefixMatching` — Allows child columns to include a prefix before matching with the parent column (for the sake of simplicity, we don’t cover this here).
This should explain what happened with our `child2` model: propagation succeeded from `parent.columnA` to `child2.column_a`, but failed from `parent.column_b` to `child2.columnB`.
At first, fuzzy matching sounds like a great solution, but as we’ve seen, it has unexpected hiccups—leading to unpredictable metadata propagation.
The Hidden Pitfall of FuzzyCaseMatching
Digging into the dbt-osmosis implementation, we can see how column descriptions are propagated from parent to child models.
Specifically, in `osmosis.py:_build_column_knowledge_graph`, dbt-osmosis attempts to match child model columns against parent model columns by generating different variations of the child column names.
In other words, as shown in the diagram below, `FuzzyCaseMatching` only generates alternative name variations for child columns, not for parent columns:
This means that `FuzzyCaseMatching` can handle cases where a camelCase column in the parent model is renamed to snake_case in the child model, but not the other way around.
As a result, in our `child2` model test case, no matter which case transformation is applied (uppercase, lowercase, camelCase, PascalCase), `columnB` never converts to `column_b`, which explains the inconsistency we observed.
Unknown Column Relationship
So far, we’ve seen how dbt-osmosis relies on string-based comparisons to determine relationships between parent and child model columns. But is this really the right approach?
A perfect example of why this method falls short is our `child3` model. While this setup isn’t something you’d normally see in production, it highlights a fundamental limitation: we swapped the names of `columnA` and `column_b`, but their semantics should be clear:
- `parent.column_b` → should map to `child3.columnA`
- `parent.columnA` → should map to `child3.column_b`
Yet, as expected, dbt-osmosis mistakenly assumes that `child3.columnA` is derived from `parent.columnA`—all because it relies purely on string matching. It completely ignores the fact that `columnA` in `child3` was actually renamed from `parent.column_b`.
This example makes one thing clear: a simple string comparison isn’t enough to reliably track column relationships.
Sure, one possible improvement could be expanding `FuzzyCaseMatching` to generate alternative names for parent columns as well, but that could open up performance issues and new edge cases.
More importantly, the problem with `child3` goes beyond case matching—it reveals a fundamental limitation in the current architecture of dbt-osmosis. Fixing this within the existing design would be extremely difficult, if not impossible.
Why String Matching?
So, why did dbt-osmosis take the string-matching route in the first place? The reason is likely because dbt Core treats SQL model code as plain text and doesn’t actually "understand" it.
dbt-osmosis relies on dbt Core’s parser to gather information about model dependencies. However, this parser doesn’t interpret SQL—it simply reads it as a string. Because of this limitation, dbt-osmosis has no way of accessing deeper column-level lineage information—it can only work at the model dependency level. In fact, this exact issue was highlighted in dbt’s announcement about acquiring SDF Labs as one of the fundamental problems with the current architecture.
Of course, an alternative approach would be for dbt-osmosis to implement its own SQL parser. But that’s a massive workload — SQL has many dialects, and building a parser that handles them all would require significant effort.
From a practical standpoint, dbt-osmosis’s existing `FuzzyCaseMatching` and `FuzzyPrefixMatching` methods already cover most renaming scenarios. Implementing a custom parser just to handle the edge cases we explored in this post probably isn’t worth the trade-off.
Given these constraints, it makes sense that dbt-osmosis chose string matching as a pragmatic solution within the limits of dbt’s current architecture.
With these string-matching challenges in mind, the next section shows why SDF’s approach is a game-changer!
5. SDF to the Rescue! 🚀
Unlike dbt, SDF has a built-in SQL compiler, meaning it doesn’t just treat SQL as plain text—it actually understands the structure of the queries.
This gives SDF a huge advantage: it can precisely track how parent model columns are referenced in child models, allowing it to tackle the column rename problem head-on.
So, let’s put this to the test! In this section, we’ll replicate our earlier dbt project inside an SDF workspace and compile our three test models using the SDF CLI. Let’s see how it handles things differently. 🔍
Setting Up the Workspace
📌 As of writing, the official SDF documentation states that "The engine used in these docs and guides is no longer available for installation." However, SDF is still available for installation here.
First, let’s create a minimal SDF workspace using:
```shell
sdf new --sample hello
```
Since we want to replicate the same models from our dbt setup, we’ll adjust the file structure like this:
The model definitions are nearly identical to their dbt counterparts. However, in SDF, you don’t need to use the `ref` function—you can directly reference the parent model.
Additionally, we define column descriptions for the parent model in `parent.sdf.yml`:
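For reference, an SDF table-properties file is roughly of the following shape. The field layout and descriptions here are my own sketch, not copied from the repo—consult the SDF documentation for the authoritative schema:

```yaml
# parent.sdf.yml (sketch; field names per my reading of the SDF docs)
table:
  name: parent
  columns:
    - name: columnA
      description: Description for columnA
    - name: column_b
      description: Description for column_b
```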
Looking at the file tree, you’ll notice that column descriptions exist only in the parent model. Now, let’s compile each model and see how well these descriptions propagate across the workspace.
See the Power of SDF!
Let's execute this magical command:
```shell
sdf compile --show all
```
The compiled results are in, showing detailed metadata for each model.
And now, let’s check the `description` fields...
Every single child model has correctly inherited its column descriptions from the parent model—exactly as expected!
- The snake_case → camelCase transformation? Works perfectly.
- The column name swap that confused dbt-osmosis? No problem here.
SDF nailed every edge case where dbt-osmosis struggled.
This confirms that, because SDF actually understands SQL structure during compilation, it can track column-level dependencies with precision—something dbt-osmosis simply can’t do with string matching alone. 🚀
6. Understanding SQL – More Than Just Strings
So, when we say SDF "understands" SQL, what does that really mean?
Unlike dbt, which treats SQL as raw text, SDF's compiler breaks SQL down into meaningful units, or tokens. These tokens are then structured into a data representation that captures relationships between different parts of the query. Ultimately, this structured data is what allows SDF to transform SQL into an executable form.
Because of this process, the resulting data structure is far richer than a simple string—it retains valuable metadata, such as:
- Which tables are being referenced
- Which columns are used and what data types they have
This deeper level of context enables the compiler to fully grasp the meaning of SQL statements, allowing for intelligent transformations. That’s what we mean by "understanding" in this article.
We won’t go into the full details of SQL compilation here, but dbt Labs has already published some excellent deep dives (Part 1 & Part 2) on the topic, which are worth checking out.
The key takeaway for this post is that during the compilation process, SQL is transformed from plain text into an Abstract Syntax Tree (AST)—a structured representation that captures the grammatical structure of the SQL query.
According to dbt Labs’ acquisition announcement, SDF uses ANTLR, a Java-based parser generator, to build these ASTs. ASTs are commonly used for static analysis—that is, analyzing SQL syntax and meaning without actually executing it.
The diagram below (adapted from the acquisition article) illustrates how SDF leverages ANTLR for this purpose.
Now, let’s take a closer look—what does the AST actually look like in action? Using the SQL models from this article, we’ll peek under the hood and explore the structure of an SDF-generated AST.
AST in Action
⚠️ The following discussion is not an exact breakdown of SDF’s internal implementation. Instead, it’s a general exploration of what ASTs can do. For clarity, I’ve also simplified some compiler front-end behavior, so the actual SDF implementation may differ.
Since SDF uses ANTLR, we can visualize the kind of AST it might generate.
Fortunately, ANTLR provides an open-source repository (`antlr/grammars-v4`) containing grammar definition files for various languages—including Snowflake SQL. Using this, we can simulate how an AST might be constructed from our `child1` model’s SQL.
To do this, we use the ANTLR v4 Plugin for JetBrains, which allows us to parse SQL and generate an AST. The result looks like this:
⚠️ It’s highly likely that SDF maintains its own proprietary Snowflake SQL grammar files, meaning its actual AST structure might differ. The AST shown here is simply a conceptual example to help illustrate the mental model of the SDF internals.
At first glance, this looks overwhelming—but if you step back, you can see that an AST is structured as a tree-like data hierarchy. This visualization helps us understand how SQL queries get decomposed and processed in a way that raw text matching simply can’t achieve.
Let’s walk through the AST structure.
At the root level, we have a node labeled `select_statement`. This branches into:

- `select_clause` → corresponding to `SELECT columnA, column_b`
- `select_optional_clause` → corresponding to `FROM parent`
This tree-based structure isn’t just raw text—it represents SQL in a hierarchical format, where each keyword and clause is treated as a distinct node.
By traversing this tree-like AST, we can analyze SQL queries in a way that simple string matching never could. Instead of searching for raw text patterns, we can query specific nodes that hold meaningful SQL components.
For example, let’s say we want to determine which parent model `child1` is based on.

- Instead of relying on a manually defined `ref()` function (as dbt does), we can simply follow the AST structure to find that `FROM parent` exists.
- This means SDF can "understand" that `child1` is derived from `parent`—without needing explicit references.
Now that we see how SDF can determine parent-child model relationships, let’s explore how it might use the AST to track column dependencies across models.
By understanding this, we can get a clearer picture of how SDF successfully propagates column descriptions—something dbt-osmosis struggles with.
An Example of Deriving Column Dependencies with AST
Let’s look at a simple approach to extracting column dependencies between parent and child models. As shown in the diagram below, this method involves traversing the AST three times to gather the necessary information.
Using `child1` as an example:
- First, we search for the table source and identify `parent` as the parent model.
- Next, we follow the `select_clause` to find the selected column—in this case, `column_b`.
- Finally, we check for aliasing and determine that the renamed column is `column_name_alias`.
Through this process, we can confirm that `column_name_alias` in `child1` originates from `parent.column_b`.
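The three traversal steps above can be sketched with a toy parser in Python. This is purely illustrative—SDF’s ANTLR-based compiler builds a far richer AST—but it shows how structural traversal yields lineage that string matching cannot:

```python
import re

def parse_select(sql: str) -> dict:
    # Toy AST builder for queries of the form `SELECT <cols> FROM <table>` only.
    m = re.match(r"\s*select\s+(.+?)\s+from\s+(\w+)\s*$", sql, re.I | re.S)
    select_clause = []
    for item in m.group(1).split(","):
        # Split `source AS alias`; an unaliased column is its own alias.
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.I)
        select_clause.append({"source": parts[0], "alias": parts[-1]})
    return {"select_clause": select_clause, "from_clause": {"table": m.group(2)}}

# Step 1: find the table source; steps 2-3: selected columns and their aliases.
ast = parse_select("SELECT columnA, column_b AS column_name_alias FROM parent")
table = ast["from_clause"]["table"]
lineage = {c["alias"]: f"{table}.{c['source']}" for c in ast["select_clause"]}
print(lineage)  # {'columnA': 'parent.columnA', 'column_name_alias': 'parent.column_b'}
```

Even this toy version recovers the right lineage for the swapped and aliased columns—because it reads structure, not strings.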
By applying the same method to other models, we should be able to map column relationships!
Additionally, by leveraging the SDF workspace context, it should also be possible to track how column information flows downstream across models, as illustrated in the diagram below.
With this AST-based approach, column descriptions can be propagated with precision. Unlike dbt-osmosis, which has limited support for renaming, this method ensures that column descriptions always align correctly, regardless of naming changes.
So far, we’ve explored one possible approach to analyzing column dependencies using ASTs. Hopefully, this gives you a clearer understanding of why SDF is capable of accurately propagating column descriptions!
7. Conclusion – The Future of dbt x SDF
In this article, we explored the column renaming challenges faced by dbt-osmosis, the limitations of string-based matching, and how SDF overcomes these issues by leveraging compiler-level SQL understanding. While LLMs can automate many tasks, when speed and accuracy are critical, deterministic approaches like SDF offer a great solution.
During dbt Labs’ webinar, they mentioned that SDF’s most "meaningful" features will be integrated into both dbt Core and dbt Cloud. This means that, in the near future, all dbt developers will likely benefit from SDF’s powerful column description propagation.
Beyond just metadata propagation, SDF already offers other game-changing column-related features. For dbt users struggling with metadata management, these capabilities could be a major breakthrough.
Needless to say, there’s much more to SDF than what we covered here, but we’ll leave that for another time.
One thing I'm sure of–the future of dbt x SDF is bright! 🚀