It's February, but every day is Halloween in the data warehouse. Do you have ghosts in your dbt project? If so, they could be costing you days of lost productivity and thousands of dollars in data warehouse compute. However, the problem is also very easy to fix!
The ghost model scenario
For the rest of the post, suppose we have the following scenario:
- Production job 1 runs
dbt build --select tag:tag1
on a regular schedule - Production job 2 runs
dbt build --select tag:tag2
on a regular schedule - There is a model
ghost
in the production Git branch that's not taggedtag1
ortag2
- so it never runs on either schedule and never gets materialized at all.
Spooky manifests
The file manifest.json
is regenerated every time you run a dbt command (with some exceptions, see the documentation). The full manifest is generated for all models, even if you restrict the build to select only certain nodes.
Even if you're only running some models or tests, all resources will appear in the manifest (unless they are disabled) with most of their properties.
On one hand, that's a good thing, because we can now grab the manifest from either production job without worrying that we're missing resources.
On the other hand, the model ghost
will appear in the manifests of both production jobs along with its supposed location in the database, even though neither job materializes it. The production manifest is now "spooky".
How deferral makes your CI jobs faster and less expensive
dbt Cloud makes it easy to set up CI jobs that allow your team to automatically test proposed changes, but those jobs can easily become very slow and very expensive if not managed properly.
The best way to manage those costs while still getting the full benefits of CI is by combining state selection and deferral into something called slim CI.
The documentation for deferral has this nice diagram. The diagram shows that if you just modified model_c
, you don't need to rebuild its unmodified ancestors model_a
and model_b
, and instead just point to the production version:
In other words, if the raw SQL of model_c
is select * from {{ ref("model_b") }}
, then without deferral it would get compiled to create view dev.model_c as select * from dev.model_b
, and you'd also need to create dev_model_b
and the rest of its ancestors. However, with deferral it would instead get compiled to create view dev.model_c as select * from prod.model_b
, allowing us to reuse production models and save a ton of money and time.
How ghost models break deferral and slim CI
Suppose you've created a model ghost_child
that depends on the model ghost
. What will happen if you try to run a pull request containing that change through the slim CI process?
We've established that all production manifests will contain information about the model ghost
as if it really exists, even though neither job created it. That means that the slim CI process will helpfully try to compile {{ ref("ghost") }}
as prod.ghost
. The object prod.ghost
doesn't exist, so your CI process will crash and burn.
While ghost models don't break state comparison, they completely break deferral. Even with ghost models, it's still completely possible to determine which models have changed. However, it's not possible to establish that an unchanged model will actually exist in production.
One workaround is to disable deferral completely. If you're interested in only building the modified models, you'd have to run dbt build --select +state:modified
to build your modified models and all their ancestors in DEV. If you're interested in also testing downstream changes, you'll have to use the "at" operator to build all ancestors of all descendants of the selected model: dbt build --select @state:modified
. What could be just running a handful of models with deferral could become hundreds of models without deferral.
Easy fix: Disable your ghost models
The fix to the problem is hinted at in the manifest documentation: Disabled models don't show up in the manifest. Track down your ghost models and disable them. It won't affect production since they don't get materialized anyway. Then they'll stop showing up in the manifest and slim CI will work again! When you need the ghost model in the future, you can enable it, and then it will run as part of your CI pipeline. Just remember to make it run on a schedule before merging to production!
Conclusion
Models that exist in your dbt project's primary branch but don't get materialized in the data warehouse are ghost models. Your production job manifests will include them as if they really exist, but slim CI pipelines will crash and burn when they try to defer to them. This will force you to move away from deferral, which could cause an exponential increase of your CI runtime and costs. To avoid this issue, be sure to disable all your ghost models.
Top comments (0)