Test-driven wide tables (TDWT) is the absolute simplest production-grade approach to analytics engineering. Removing anything from TDWT would make it unsuitable for production. Adding anything to TDWT is unnecessary.
The test-driven wide tables flow
- Get requirements from the data customer. What part of the final spreadsheet-like output needs to be changed? Document in a dbt models properties file if applicable.
- Turn the requirements into dbt data tests.
- Run dbt tests on the model - the new ones should fail.
- Implement the change necessary to make the test pass. Write your code as simply as possible.
- Run dbt tests on the model - they should all pass.
- Repeat from step 1.
What are test-driven wide tables? Why use them?
Test-driven wide tables (TDWT) combine test-driven development (TDD) and wide tables. To understand why we're advocating for TDWT, let's think about how a failing data warehouse can be made successful.
Undisciplined data warehouses are not trustworthy, not maintainable, and not reusable
In your career, you may have seen data warehouses built haphazardly without consistent discipline. At one company I worked at, the "legacy" data warehouse implementation was bloated with thousands of lines of copied-and-pasted code, hundreds of separate views and tables, and circular dependencies. The views and tables were a nightmare to maintain and weren't trusted by many data customers.
Just changing the shape of the data can't make it more trustworthy or maintainable or reusable
Next, experts are hired who usually propose following a structured data modeling approach - Kimball, Inmon, Data Vault, etc. All of these approaches primarily focus on shaping your data to follow a certain structure. Their pitches and focuses differ, but the basic selling point is the same: following the structure will improve the trustworthiness, maintainability, and reusability of your data warehouse. However, I don't believe that just changing the shape of the data can do any of that. In cases where it seems like it does, it's actually the processes implemented alongside the structure that drive the real change. Data that's modeled dimensionally can be just as untrustworthy, messy, and difficult to reuse as data that's not modeled at all.
The focus needs to be on process rather than on structure
It's not possible to perfectly define trust, maintainability, and reusability in a way that satisfies everyone. However, I believe most can agree that there is some element of conversation in all of these things. Trust requires conversation. Maintaining or extending a codebase can feel like having a conversation with the previous developers. All these things are dynamic, not static. On the other hand, focusing on the structure of the data is static. It's like trying to talk to a rock.
Instead of focusing on data structure, the focus should be on our processes. What processes can we follow to earn and grow data customer trust? What processes can make the codebase more maintainable and reusable?
I argue that there is one process that can achieve all of the above: Test-driven development.
- Tests show data customers that regressions will be prevented. Tests catch data issues before data customers do.
- Having to have a test for every feature prevents analytics engineers from writing thousands of lines of SQL bloated with irrelevant logic. The data models are slimmed down to the bare necessities and are therefore easier to maintain.
- If you want to reuse a piece of logic from a previous model, you can pull it out and refactor with confidence. If you broke something, the tests will let you know.
Since we've established that the shape of the data is irrelevant to the outcome, we can just adopt the simplest possible data structure, which is wide tables. The result: Test-driven wide tables!
Is using specifically wide tables important?
No, the approach holds that the shape of the data is irrelevant. The wide tables modeling approach is chosen because it's the simplest. If it makes more sense for your team to model dimensionally or any other way, go for it. For example, folks who want to take advantage of dbt Cloud's Semantic Layer should create normalized models instead of wide tables. You could have test-driven Kimball or test-driven normalized tables.
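To make "wide table" concrete, here's a minimal sketch of what a wide mart model might look like. All model and column names here are hypothetical:

-- models/marts/accounting/accounting_transactions.sql (hypothetical sketch)
-- A wide table: one row per transaction, with every attribute the data
-- customer asked for joined in and denormalized into a single flat model.
select
    t.transaction_id,
    t.action_date,
    t.amount,
    c.customer_name,
    c.customer_segment,
    a.account_name,
    a.account_region
from {{ ref("stg_transactions") }} as t
left join {{ ref("stg_customers") }} as c on t.customer_id = c.customer_id
left join {{ ref("stg_accounts") }} as a on t.account_id = a.account_id

The data customer gets one spreadsheet-like table with everything they need, with no joins or star schemas to reason about on their end.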
Setting up a test-driven wide tables project
Folder structure
Follow dbt's official guide How we structure our dbt projects. In fact, their guide explicitly calls for models inside the "marts" folder to be "wide and denormalized". Test-driven wide tables has its own take on the folders inside the "models" folder, differing from the official guide on a few points:
- staging: There should be a staging view model for each raw table. Any type casting and sanitizing should be done in the staging model, and all other models should use the staging view instead of accessing the raw table directly. This helps avoid polluting business logic with data cleaning logic (see the sketch after this list).
- intermediate: Any piece of business logic that's used in two different models should have its own intermediate model instead of being copied and pasted. Beyond that, creating or not creating intermediate models is up to the developer.
- marts: Models fit for end user consumption go here.
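Here's a minimal sketch of the staging pattern described above. The source and column names are hypothetical:

-- models/staging/stg_transactions.sql (hypothetical sketch)
-- One staging view per raw table: all casting and sanitizing happens here,
-- so downstream models never touch the raw table directly.
select
    cast(id as varchar) as transaction_id,
    cast(action_dt as date) as action_date,
    cast(amount as numeric(18, 2)) as amount,
    nullif(trim(status), '') as status
from {{ source("erp", "raw_transactions") }}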
Style
Follow the official dbt style guide where it makes sense for your team. Personally, I'm strongly against import CTEs because having to constantly scroll up and down to change the CTEs breaks my focus and flow. Use common sense here and don't reject pull requests for things that don't really affect anything.
I suggest setting up sqlfmt with pre-commit to enable your whole team's code to be automatically formatted and have the same style.
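As a minimal sketch, a .pre-commit-config.yaml for sqlfmt might look like the following; the rev shown is a placeholder, so check sqlfmt's documentation for the current release:

# .pre-commit-config.yaml - minimal sketch
repos:
  - repo: https://github.com/tconbeer/sqlfmt
    rev: v0.21.2  # assumption: pin to the latest sqlfmt release
    hooks:
      - id: sqlfmt

With this in place, every commit is formatted the same way regardless of who wrote it, which keeps style debates out of code review.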
Detailed test-driven wide tables flow
Let's expand on each step of the flow we defined in the beginning.
1. Get requirements from the data customer
It all starts by talking to the data consumer and understanding their needs. If something broke, what's an example? Turn that example into your test. If something new is needed, what does it look like? Ask them to mock up a few example cases in a spreadsheet-like format. The columns of that spreadsheet become your dbt model. The rows of that spreadsheet become your tests - we'll cover that in the next step.
Write the documentation as you're gathering requirements, not after the data model is written. dbt allows model properties (including documentation) to be defined before any SQL is written.
For example, if you're developing a transactions model with your accounting team, you can create the file models/marts/accounting/_accounting__models.yml:
models:
  - name: accounting_transactions
    description: Transactions table for accounting
    columns:
      - name: transaction_key
        description: Synthetic key for the transaction
        tests:
          - not_null
          - unique
      - name: action_date
        description: Date of the transaction
        tests:
          - not_null
You should be taking notes when gathering data customer requirements anyway. Instead of writing the notes down in something like a Google Doc or an email, take notes in this YAML format instead. That'll get you kick-started on your documentation and testing.
2. Turn the requirements into dbt tests
There are two approaches to writing tests in TDWT:
- Testing with production data. This approach is resilient to refactors but brittle against data changes: production data can change and cause the test to start failing, and certain edge cases that should ideally be tested might not exist in production data until after the data modeling is complete. On the plus side, each test only takes a couple of minutes to create.
- Writing unit tests. This approach is resilient against data changes but brittle against refactors: because unit tests require you to specify the exact names and values of all inputs going into the model, refactoring becomes very labor-intensive.
I recommend writing integration tests that use production data by default. The speed of this method lowers the barrier to entry and prevents reasonable analytics engineers from saying that they don't have time to write tests.
Testing with production data
Think about the columns of the mockup spreadsheet your data customer gave you. One or more of those columns can serve as an identifier for a particular example row, and there should be exactly one row with that identifier. The values in some of the remaining columns represent the business logic of the example. Therefore, we need to test two things: does the example row exist, and do the business logic values match?
A dbt data test returns failing records. In other words, the test has succeeded when no rows are returned. Here's an example implementation:
with row_count as (
    select count(*) as n
    from {{ ref("model_being_tested") }}
    where id1 = 'some_id'
        and id2 = 'other_id'
)

select 'Not exactly 1 row' as error_msg
from row_count
where n <> 1

union all

select 'Row test failed' as error_msg
from {{ ref("model_being_tested") }}
where id1 = 'some_id'
    and id2 = 'other_id'
    and not (
        value1 = 'some example value'
        and value2 is not null
        and value2 = 'some other value'
        and abs(value3) <= 0.01
    )
Let's dissect what's happening in this query. There are two select statements joined together with a union all. The first will return a failing record if the row identified by the identifier(s) doesn't exist in the data. This is important so we don't inadvertently pass a test when the data is not there in the first place. The second identifies that same row, and then looks for any discrepancies in the business logic values. That's easiest to achieve by wrapping the expected values in a not().
Do watch out for null values. Due to three-valued logic in SQL, the filter not(column = 'value') will not return rows where the column is null. I recommend testing for nulls separately using dbt's generic not_null test so that you don't have to remember this caveat each time.
This kind of test is very easy to copy and paste and adapt quickly. It's also easy to read and maintain. This will be all you need 90% of the time.
It's easy to accidentally write a SQL query that produces no rows. That's why it's also easy to write a dbt data test that accidentally passes. The test should be written and run first, before any development work is done. The test should fail. Then the change should be implemented, and the test should succeed.
Writing unit tests
Use dbt unit tests if you can't or don't want to test with production data. Using the unit test syntax, you can define synthetic source data in your model's YAML. This allows you to test complex edge cases while being confident that your tests will never break as long as the model itself doesn't change.
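For example, a unit test for the hypothetical accounting_transactions model might look like this in the model's YAML (this uses dbt's unit testing syntax, available since dbt 1.8; the fixture rows are made up, and this assumes stg_transactions is the model's only input):

unit_tests:
  - name: refunds_have_negative_amounts
    model: accounting_transactions
    given:
      - input: ref('stg_transactions')
        rows:
          - {transaction_id: "t1", action_date: 2024-01-15, amount: -25.00, status: "refund"}
    expect:
      rows:
        - {transaction_id: "t1", amount: -25.00}

Because the inputs are synthetic, you can encode edge cases that don't yet exist in production data, at the cost of having to update the fixtures whenever the model's inputs change.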
3. Run dbt tests on the model - the new ones should fail
Run the tests on the model you're developing:
dbt test --select model_being_tested
If you start writing tests regularly, you'll definitely write a few that always pass by accident. This step catches them.
4. Implement the change necessary to make the test pass
You've documented the columns and written your tests. Now it's finally time to write the logic! Don't follow any preconceived data structure beyond staging the raw data. Use intermediate models if you need to, but don't feel pressured to if you don't.
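For instance, a first pass at the accounting_transactions model from step 1 might be as simple as the sketch below. The generate_surrogate_key macro comes from the dbt_utils package; the column names are hypothetical:

-- models/marts/accounting/accounting_transactions.sql - minimal sketch
select
    -- synthetic key, tested as not_null and unique in step 1
    {{ dbt_utils.generate_surrogate_key(["transaction_id", "action_date"]) }}
        as transaction_key,
    action_date,
    amount
from {{ ref("stg_transactions") }}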
5. Run dbt tests on the model - they should all pass
If all tests pass, you're set! If not, keep developing. :)
6. Go back to step 1
Go back to the data customer with your new model. As long as your manager allows it, you can ask if they have any new edge cases or requirements for you to test and implement. :)
Enforcing test-driven development
It's a good idea to work towards enforcing test-driven development in your analytics engineering team. Rather than surprising folks with a new policy, I recommend setting a deadline by which test-driven development will be mandated, and ensuring the team gets familiar with it before the deadline.
Here's an example workflow that incorporates test-driven development:
- All dbt models are stored in a Git repo with a write-protected production branch. All changes to production have to come through pull requests with at least one peer approval.
- Analytics engineers create feature branches off the production branch and open pull requests when the features are ready.
- Every peer reviewer is expected to only approve the pull request if they see tests corresponding to every feature. If they don't see corresponding tests, that means TDD wasn't followed and the pull request shouldn't be approved. A CI job can back up this human check - see the sketch after this list.
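As a minimal sketch of that CI backstop, here's what a workflow might look like. GitHub Actions is an assumption, as are the adapter and the committed profiles.yml; adjust all of these to your own warehouse and secrets setup:

# .github/workflows/dbt-ci.yml - minimal sketch, assuming GitHub Actions
name: dbt CI
on: pull_request
jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-duckdb  # assumption: pick your adapter
      - run: dbt deps
      # dbt build runs models and tests together, so failing tests block the merge
      - run: dbt build --profiles-dir .  # assumption: profiles.yml lives in the repo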
Conclusion
If you're an analytics engineer, I hope this post has convinced you to give test-driven wide tables a try. If you're an analytics engineering team leader, I hope you consider making test-driven wide tables a requirement for your team.
Analytics engineering is uniquely well-suited to test-driven development. The cost of creating tests from end user requirements is low, and the cost of regressions from complex and untested business logic in your data models is high. Using the test-driven wide tables approach boosts trust in data throughout your organization, makes the codebase easy to maintain and refactor, and maximizes the development velocity of analytics engineers.