Aram Panasenco

Posted on Feb 21

Ghost models and spooky manifests in dbt

#dbt #dataengineering #analytics

It's February, but every day is Halloween in the data warehouse. Do you have ghosts in your dbt project? If so, they could be costing you days of lost productivity and thousands of dollars in data warehouse compute. However, the problem is also very easy to fix!

The ghost model scenario

For the rest of the post, suppose we have the following scenario:

Production job 1 runs dbt build --select tag:tag1 on a regular schedule
Production job 2 runs dbt build --select tag:tag2 on a regular schedule
There is a model ghost in the production Git branch that's not tagged tag1 or tag2 - so it never runs on either schedule and never gets materialized at all.

Spooky manifests

The file manifest.json is regenerated every time you run a dbt command (with some exceptions, see the documentation). The full manifest is generated for all models, even if you restrict the build to select only certain nodes.

Even if you're only running some models or tests, all resources will appear in the manifest (unless they are disabled) with most of their properties.

On one hand, that's a good thing, because we can now grab the manifest from either production job without worrying that we're missing resources.

On the other hand, the model ghost will appear in the manifests of both production jobs along with its supposed location in the database, even though neither job materializes it. The production manifest is now "spooky".

How deferral makes your CI jobs faster and less expensive

dbt Cloud makes it easy to set up CI jobs that allow your team to automatically test proposed changes, but those jobs can easily become very slow and very expensive if not managed properly.

The best way to manage those costs while still getting the full benefits of CI is by combining state selection and deferral into something called slim CI.

The documentation for deferral has this nice diagram. The diagram shows that if you just modified model_c, you don't need to rebuild its unmodified ancestors model_a and model_b, and instead just point to the production version:

In other words, if the raw SQL of model_c is select * from {{ ref("model_b") }}, then without deferral it would get compiled to create view dev.model_c as select * from dev.model_b, and you'd also need to create dev_model_b and the rest of its ancestors. However, with deferral it would instead get compiled to create view dev.model_c as select * from prod.model_b, allowing us to reuse production models and save a ton of money and time.

How ghost models break deferral and slim CI

Suppose you've created a model ghost_child that depends on the model ghost. What will happen if you try to run a pull request containing that change through the slim CI process?

We've established that all production manifests will contain information about the model ghost as if it really exists, even though neither job created it. That means that the slim CI process will helpfully try to compile {{ ref("ghost") }} as prod.ghost. The object prod.ghost doesn't exist, so your CI process will crash and burn.

While ghost models don't break state comparison, they completely break deferral. Even with ghost models, it's still completely possible to determine which models have changed. However, it's not possible to establish that an unchanged model will actually exist in production.

One workaround is to disable deferral completely. If you're interested in only building the modified models, you'd have to run dbt build --select +state:modified to build your modified models and all their ancestors in DEV. If you're interested in also testing downstream changes, you'll have to use the "at" operator to build all ancestors of all descendants of the selected model: dbt build --select @state:modified. What could be just running a handful of models with deferral could become hundreds of models without deferral.

Easy fix: Disable your ghost models

The fix to the problem is hinted at in the manifest documentation: Disabled models don't show up in the manifest. Track down your ghost models and disable them. It won't affect production since they don't get materialized anyway. Then they'll stop showing up in the manifest and slim CI will work again! When you need the ghost model in the future, you can enable it, and then it will run as part of your CI pipeline. Just remember to make it run on a schedule before merging to production!

Conclusion

Models that exist in your dbt project's primary branch but don't get materialized in the data warehouse are ghost models. Your production job manifests will include them as if they really exist, but slim CI pipelines will crash and burn when they try to defer to them. This will force you to move away from deferral, which could cause an exponential increase of your CI runtime and costs. To avoid this issue, be sure to disable all your ghost models.

DEV Community

Ghost models and spooky manifests in dbt

The ghost model scenario

Spooky manifests

How deferral makes your CI jobs faster and less expensive

How ghost models break deferral and slim CI

Easy fix: Disable your ghost models

Conclusion

Top comments (0)

Read next

15 Best Sites to Download Free SVG Illustrations

[𝗖𝗮𝗹𝗹 𝗳𝗼𝗿 𝗣𝗮𝗽𝗲𝗿𝘀] 𝟭𝟮𝘁𝗵 𝗜𝗯𝗲𝗿𝗶𝗮𝗻 𝗖𝗼𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗼𝗻 𝗣𝗮𝘁𝘁𝗲𝗿𝗻 𝗥𝗲𝗰𝗼𝗴𝗻𝗶𝘁𝗶𝗼𝗻 𝗮𝗻𝗱 𝗜𝗺𝗮𝗴𝗲 𝗔𝗻𝗮𝗹𝘆𝘀𝗶𝘀

SQL Data Comparison: What It Is and Why It Matters

Firebase Authentication with Angular 19