Cameron Archer for Tinybird. Originally published at tinybird.co.

Analyzing the Dub.co analytics playbook

When you're building something, it helps to have a model. Something or someone to look to for inspiration and best practices. If you are a modern small SaaS company, Dub may be the archetype for how to do it right: a canonical model to guide your product up and to the right.

Dub helps marketing teams manage and shorten links to track their performance and conversions. It is a great example of a modern SaaS: a small team solving a hard but common problem at scale, powered by modern dev tools.

Dub is open source, and you can learn a lot about building a successful SaaS by just exploring the repository (and following Steven Tey on Twitter). But you can learn even more by running Dub locally and hosting it.

Here, I want to dive into how Dub handles its user-facing analytics. With Dub Analytics, every Dub user gets real-time dashboards to analyze link performance and filter by many different dimensions. Given Dub’s scale, processing over 150M events last year for thousands of users, this isn’t a simple problem.

If you're interested in learning about Dub Analytics as a product - its features and capabilities - I recommend reading the official Dub Analytics help article (or exploring a sample dashboard).

But I want to dive into how Dub Analytics works technically, from data model to APIs, since Dub should be considered “canon” for how to build a modern SaaS, and its approach to analytics is no exception.

Data Model

The foundation for performant user-facing analytics is a solid data model based on data fundamentals.

How Dub does it

Dub stores both link metadata and link click events.

Link metadata is generated from the admin console of the Dub UI. Users can create links, add UTM parameters and tags, and edit various other metadata describing the links they build.

Dub.co uses Next.js, PlanetScale, and Prisma for its metadata backend, which handles CRUD operations for links.

But we're focused on analytics, not transactions. Link metadata is also stored in Tinybird, as it is used for enrichment and filtering in the analytics views.

Here's the schema for the link metadata table in Tinybird:

SCHEMA >
    `timestamp` DateTime `json:$.timestamp` DEFAULT now(),
    `link_id` String `json:$.link_id`,
    `domain` String `json:$.domain`,
    `key` String `json:$.key`,
    `url` String `json:$.url`,
    `tag_ids` Array(String) `json:$.tag_ids[:]`,
    `program_id` String `json:$.program_id`,
    `workspace_id` String `json:$.workspace_id`,
    `created_at` DateTime64(3) `json:$.created_at`,
    `deleted` UInt8 `json:$.deleted`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(timestamp)"
ENGINE_SORTING_KEY "timestamp, link_id, workspace_id"

You can see columns for all the things that describe a link: its id, domain, url, tags, etc. You'll also notice timestamp and deleted columns.

How Dub handles updates or deletions of link metadata in Tinybird

Dub follows an event-driven architecture when storing and processing link events, so each event (for metadata, that means each CRUD operation) is tracked and stored in a raw append log.

Unlike in the transactional backend built on PlanetScale, link updates and deletions in Tinybird are stored as new events for a given link_id. This is why Dub includes the timestamp and deleted columns in the raw schema.

Dub appends delete events to the log by setting the value for deleted to 1, and deduplicates event metadata by using a materialized view that writes to a ReplacingMergeTree (RMT) data source. RMT is a special table engine that selectively updates rows based on a version column, thereby deduplicating events describing the same metadata and keeping the latest state.
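
To make the mechanics concrete, here's a minimal ClickHouse-style sketch of the pattern (table name, link ID, and values are invented for illustration; this is not Dub's actual DDL):

-- Minimal sketch: the version column is the event timestamp
CREATE TABLE links_latest
(
    `timestamp` DateTime,
    `link_id` String,
    `url` String,
    `deleted` UInt8
)
ENGINE = ReplacingMergeTree(timestamp) -- keeps the row with the max timestamp per key
ORDER BY link_id;                      -- rows sharing the sorting key get deduplicated

-- Three events for the same link: create, update, then delete
INSERT INTO links_latest VALUES
    ('2024-05-01 10:00:00', 'link_1', 'https://a.com', 0),
    ('2024-05-02 11:00:00', 'link_1', 'https://b.com', 0),
    ('2024-05-03 12:00:00', 'link_1', 'https://b.com', 1);

-- FINAL deduplicates at query time; the surviving row is the delete event,
-- so filtering on deleted = 0 returns nothing for this link
SELECT * FROM links_latest FINAL WHERE deleted = 0;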

This is how the schema for that data source looks:

# Data Source created from Pipe 'dub_links_metadata_pipe'

SCHEMA >
    `timestamp` DateTime,
    `link_id` String,
    `domain` String,
    `key` String,
    `url` String,
    `tag_ids` Array(String),
    `workspace_id` String,
    `created_at` DateTime64(3),
    `deleted` UInt8,
    `program_id` String

ENGINE "ReplacingMergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "link_id"
ENGINE_VER "timestamp"

The full data lineage for a link metadata event looks like this:

  1. Ingested from the backend into dub_links_metadata, the raw data source where link metadata is initially stored
  2. Processed by a materialized view pipe that handles the deduplication logic
  3. Written to a materialized view data source, dub_links_metadata_latest, which uses the RMT engine to deduplicate on the fly as events stream in

The result is a final table that maintains the (nearly) final state of the links metadata.

A data lineage for Dub's links metadata events
Dub uses a materialized view to deduplicate link metadata events

Why do I say "nearly"? Deduplication is a background process that happens every few minutes, with the final results being merged into the data source asynchronously. For queries that need the fully merged results in real time, Dub leverages the FINAL keyword in its queries to get the latest version of the data (more on that below).
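
For example, a query against the deduplicated data source looks like this (a minimal sketch mirroring the pipe queries shown later in this post):

-- FINAL merges duplicates at read time, so only the latest version
-- of each link survives; delete events are then filtered out
SELECT link_id, url
FROM dub_links_metadata_latest FINAL
WHERE deleted = 0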

How Dub models click events

Dub's link click events contain a timestamp, a link_id, and a bunch of properties containing the dimensions (e.g. browser, device, country, etc) on which a user might want to filter their analytics views. This is the information Dub collects from the browser and the user agent.

Here's the schema for the dub_click_events data source in Tinybird, which contains columns for all of these properties:

SCHEMA >
    `timestamp` DateTime64(3) `json:$.timestamp`,
    `click_id` String `json:$.click_id`,
    `link_id` String `json:$.link_id`,
    `alias_link_id` String `json:$.alias_link_id`,
    `url` String `json:$.url`,
    `country` LowCardinality(String) `json:$.country`,
    `city` String `json:$.city`,
    `region` String `json:$.region`,
    `latitude` String `json:$.latitude`,
    `longitude` String `json:$.longitude`,
    `device` LowCardinality(String) `json:$.device`,
    `device_model` LowCardinality(String) `json:$.device_model`,
    `device_vendor` LowCardinality(String) `json:$.device_vendor`,
    `browser` LowCardinality(String) `json:$.browser`,
    `browser_version` String `json:$.browser_version`,
    `os` LowCardinality(String) `json:$.os`,
    `os_version` String `json:$.os_version`,
    `engine` LowCardinality(String) `json:$.engine`,
    `engine_version` String `json:$.engine_version`,
    `cpu_architecture` LowCardinality(String) `json:$.cpu_architecture`,
    `ua` String `json:$.ua`,
    `bot` UInt8 `json:$.bot`,
    `referer` String `json:$.referer`,
    `referer_url` String `json:$.referer_url`,
    `user_id` Nullable(Int64) `json:$.user_id`,
    `identity_hash` Nullable(String) `json:$.identity_hash`,
    `ip` String `json:$.ip`,
    `qr` UInt8 `json:$.qr`,
    `continent` LowCardinality(String) `json:$.continent`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, link_id, click_id"

Dub.co also stores events to track conversions, modeled as separate data sources for lead and sale events. These are likewise stored in Tinybird and include references to the link_id and other metadata.

What you can learn from Dub's data model

Dub's data model teaches us some important things about building analytics for many concurrent users while optimizing for low-latency queries. After all, Dub amassed a log of over 150M events in 2024, and yet they serve analytics for thousands of users with near-instantaneous chart refreshes.

When you design your analytical data model, consider what works about Dub's approach and how it leads to fast queries even with hundreds of millions of events.

Dub stores its event logs in Tinybird, a columnar database. Columnar databases are more efficient when you unroll (denormalize) your data into columns, choose the right data types, and sort the data in a way that can be efficiently accessed according to your query patterns. If you'd like to learn more about how columnar databases are optimized for analytics, you can read this article.

Data types

When it comes to data types, Dub uses the smallest data type that fits the data. This reduces your storage overhead. For example, the deleted column is a simple 0 or 1, so UInt8 suffices.

Some general rules for choosing data types:

  • Use LowCardinality for columns that have a small number of distinct values.
  • Choose UInt* types for integers that will never be negative.
  • Use the right precision for timestamps, floats, and integers. For instance, do not use UInt64 to store years, and do not use DateTime to store dates.
  • Use Nullable for columns that can be empty.
  • Use Array for columns that have a list of values.
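
Applied to a click-style schema, those rules look something like this (an illustrative sketch, not Dub's actual table):

CREATE TABLE clicks_example
(
    `timestamp` DateTime,             -- second precision is enough here
    `link_id` String,
    `country` LowCardinality(String), -- a few hundred distinct values at most
    `bot` UInt8,                      -- only ever 0 or 1
    `user_id` Nullable(Int64),        -- absent for anonymous clicks
    `tag_ids` Array(String)           -- a genuine list, not a comma-separated string
)
ENGINE = MergeTree
ORDER BY (timestamp, link_id)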

Partition keys

In addition, choose a partition key for your data so that:

  • Ingested data writes to the minimum number of partitions possible.
  • Partitioning produces tens or hundreds of partitions (not millions).

For example, dub_click_events is partitioned by month, so you’ll have a maximum of 12 partitions per year. Since timestamps increase monotonically, each ingestion writes to a single partition (or two, right at a month boundary).
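
You can evaluate the partition expression directly to see how rows map to partitions; every row whose timestamp falls in the same month lands in the same partition:

SELECT
    toYYYYMM(toDateTime('2024-05-23 10:15:00')) AS may_click,  -- 202405
    toYYYYMM(toDateTime('2024-06-01 00:00:00')) AS june_click  -- 202406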

Sorting keys

In columnar databases, sorting keys define how the data is stored together on disk. Dub chooses sorting keys that enable fast filtering on the most common columns, namely timestamp and link_id.

You want to choose a sorting key that matches the dimensions on which you'll filter your data, preferably ordered from low to high cardinality. In Dub's case, sorting by timestamp, link_id, and click_id makes it efficient to filter a subset of links for a given time range.

Some other things to take into account when it comes to sorting keys:

  • The sorting key can affect write throughput, and more than two or three columns in the sorting key generally doesn't improve query performance, so keep it to two or three dimensions.
  • In general, your landing (or raw) data source isn't what you'll read from at query time. If you expect unstructured or semi-structured data, keep it normalized in the landing data source, and use materialized views or a Lambda architecture to denormalize it as needed for querying. An added bonus: in your transformed data sources, you can redefine the sorting keys to optimize for various query patterns.
  • In that case, you may want to reduce data retention in your landing data sources by using a Time-To-Live (TTL), as sketched below.
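
In a Tinybird data source file, that retention policy is a single setting; a sketch assuming a 90-day window (the column names and TTL value here are illustrative, not Dub's):

SCHEMA >
    `timestamp` DateTime `json:$.timestamp`,
    `payload` String `json:$.payload`

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp"
ENGINE_TTL "timestamp + toIntervalDay(90)"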

Ingestion

The data model sets the stage for ingestion, the process of writing data into the tables with schemas you've previously defined.

How Dub creates a link click event

Dub uses a Next.js middleware to log and redirect click events.

When someone clicks a link (e.g., tbrd.co/example), the middleware gets the domain (tbrd.co) and the key (example) from the URL.

Dub does a few things before redirecting, such as handling bot traffic, checking to make sure the link exists, and checking for password protection or expiration.

To track the click, Dub gets or creates a unique click ID (stored in a cookie), and then records the click event to Tinybird for analytics. Dub's /track/click API route implements the recordClick logic. This is done asynchronously so it doesn't slow down the redirect. In addition, Dub caches the link metadata in Redis to avoid hitting the MySQL database for every click event.

fetch(
  `${process.env.TINYBIRD_API_URL}/v0/events?name=dub_click_events&wait=true`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.TINYBIRD_API_KEY}`,
    },
    body: JSON.stringify(clickData),
  },
).then((res) => res.json());

A couple notes:

  1. Using environment variables for the TINYBIRD_API_URL and TINYBIRD_API_KEY allows Dub to develop locally by pointing at localhost and a Tinybird Local container token.
  2. wait=true makes the client wait for acknowledgement that the data was written to the database. Dub can use this to trigger retries in the event of a network failure.

How Dub ingests click events at scale

Dub.co is a fast-growing company, serving thousands of teams that create millions of links and generate hundreds of millions (soon to be billions) of click events.

Generally speaking, streaming events at that magnitude requires robust infrastructure. Some companies will choose to use Apache Kafka or some other message queue.

Dub, however, sends JSON events directly from the app to the Tinybird Events API.

The Events API is a piece of managed infrastructure designed for high-throughput streaming ingestion of JSON events, abstracted behind a simple HTTP endpoint.

Dub relies on this managed service to avoid setting up streaming infrastructure. Instead, they just send an HTTP request to the Events API.

Because Tinybird data source schemas are annotated with JSONPaths, the Events API maps the corresponding JSON event attributes to columns in the target data source.

The Events API receives the events, balances ingestion load from the many concurrent client connections, and writes incoming events to the database, generally with write-to-read latency of a couple of seconds or less.

What you can learn from Dub's approach to ingestion

The takeaway here is pretty simple. Dub does not maintain sophisticated streaming infrastructure. Instead, they rely on a managed service that massively simplifies their backend architecture.

There are good reasons to choose infrastructure like Apache Kafka. Kafka provides certain guarantees (e.g. exactly-once delivery) that Tinybird's Events API does not.

These are design choices you must make, being conscious of the tradeoffs. You need to understand the consequences of failed delivery. In Dub's case, one missing or duplicated event in 100,000 likely doesn't have a significant impact on the overall analytics, so occasional failed delivery is tolerable.

Whether you ingest from the browser or the backend (as Dub does), a service that supports JSON events through an HTTP interface simplifies your architecture. Sometimes, simplicity outweighs perfection.

Preparing data for consumption

I mentioned above the difference between a landing data source and an intermediate data source. When you're using event-driven architectures to maintain an append-only events log, you often want to process that data to prepare it for consumption. Dub pre-processes data in real-time to significantly improve query performance without limiting write throughput.

How Dub does it

Dub uses materialized views to transform a subset of the data sorted by different columns. For instance, Dub uses a pipe (dub_click_events_pipe) to transform the contents of the dub_click_events landing data source into a format to support a different query pattern:

NODE mv
SQL >

    SELECT
        timestamp,
        click_id,
        link_id,
        url,
        continent,
        country,
        city,
        device,
        browser,
        os,
        referer,
        qr,
        ip
    FROM dub_click_events

TYPE materialized
DATASOURCE dub_click_events_mv

The result is stored in a materialized view data source, dub_click_events_mv, which is sorted by link_id and timestamp, so it can be used to fetch the clicks for a subset of links and time range more performantly:

# Data Source created from Pipe 'dub_click_events_pipe'

SCHEMA >
    `timestamp` DateTime64(3),
    `click_id` String,
    `link_id` String,
    `url` String,
    `continent` LowCardinality(String),
    `country` LowCardinality(String),
    `city` String,
    `region` String,
    `latitude` String,
    `longitude` String,
    `device` LowCardinality(String),
    `browser` LowCardinality(String),
    `os` LowCardinality(String),
    `engine` LowCardinality(String),
    `ua` String,
    `identity_hash` String,
    `referer` String,
    `referer_url` String,
    `qr` UInt8,
    `ip` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "link_id, timestamp"

By using materialized views, Dub is able to process this transformation on ingestion, so the data is ready to be consumed with multiple query patterns even as it is being ingested.

What you can learn from Dub's data preparation

A core principle in real-time systems is separating write patterns from read patterns. What is good for writing isn't always good for reading. In Dub's case, some transformations are required to create intermediate data sources optimized for different query patterns.

Use materialized views or Lambda architecture

In Tinybird, the most common pattern for real-time transformation is the materialized view. Unlike in OLTP databases, the materialization happens in real time, triggered by writes to the database rather than by manual refreshes.

Of course, real-time transformation via materialized views isn't the only way to prepare data for consumption. Tinybird also enables batch operations (copies), and you can use Lambda architecture to differentiate real-time versus batch transformation operations.

That said, when you're just starting a project with Tinybird, you often don't need to think about data preparation. Tinybird's fundamental performance characteristics will enable fast query patterns even if you just query the landing data source.

When you're getting started, just ingest your data and build pipes over the landing data source to quickly prototype your APIs.

The typical cases when you do need to prepare data for consumption include:

  • You want to enrich your data with other sources.
  • You want to denormalize, deduplicate, or otherwise pre-process your data.
  • You want to pre-aggregate to minimize data processing (and latency) during query time.
  • You want to reorder the data to enable different filtering patterns.

A common mistake in software engineering (and also in data engineering) is optimizing too soon. If you look at Dub's data preparation, it is not particularly complex. Even at their current scale, they are in some cases effectively querying raw data, just ensuring that it is properly sorted.

What is important here is observability, which clues you into increased query latency or network failures.

In Tinybird you can track both write and read performance by means of service data sources. If you manage a Tinybird organization you can monitor organization metrics to identify performance bottlenecks and potential improvements.
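
For example, the pipe_stats_rt service data source records per-request metrics for every API call; here's a sketch of the kind of query (columns as documented for Tinybird's service data sources) that surfaces slow endpoints:

-- Requests, average latency, and scanned bytes per pipe over the last hour
SELECT
    pipe_name,
    count() AS requests,
    avg(duration) AS avg_duration_s,
    sum(read_bytes) AS bytes_scanned
FROM tinybird.pipe_stats_rt
WHERE start_datetime > now() - INTERVAL 1 HOUR
GROUP BY pipe_name
ORDER BY avg_duration_s DESC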

For instance, Dub's analytics views provide metrics aggregated daily, monthly, and yearly. In the future, they might consider materialized rollups to speed up certain queries.
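
Such a rollup would follow the same materialized view pattern shown earlier; here's a hypothetical sketch (pipe and data source names invented), pre-aggregating clicks per link per day:

NODE daily_rollup
SQL >

    -- Emits partial counts per ingested block; a SummingMergeTree target
    -- merges them into one row per (day, link_id) in the background
    SELECT
        toDate(timestamp) AS day,
        link_id,
        count() AS clicks
    FROM dub_click_events
    GROUP BY day, link_id

TYPE materialized
DATASOURCE dub_click_events_daily_mv

Queries over the rollup would still sum(clicks) with a GROUP BY to merge any not-yet-compacted partial rows, but they'd scan one row per link per day instead of every raw click.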

Backwards compatibility

Data preparation isn't just about performance, however. Pre-processing data so that you don't query the landing data source directly gives you the added benefit of backwards compatibility.

In production event-driven architectures, you have a constant stream of events that you can't just… stop. Making a change to a landing data source schema can break ingestion. But, if you maintain a normalized data source, perhaps using a JSON column, you get some productivity and maintenance benefits:

  • You maintain backwards compatibility even when your data schema changes
  • It prevents regressions by keeping raw and new transformed data separate
  • You can still apply retention policies if needed
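
One way to get these properties, sketched below with hypothetical names: land events with a raw JSON payload column, and let a materialized view extract the typed fields you currently care about.

NODE parse_raw_events
SQL >

    -- The landing source keeps the raw JSON; new fields can be extracted
    -- later without touching the landing schema or halting ingestion
    SELECT
        toDateTime64(JSONExtractString(payload, 'timestamp'), 3) AS timestamp,
        JSONExtractString(payload, 'link_id') AS link_id,
        JSONExtractString(payload, 'country') AS country
    FROM raw_events

TYPE materialized
DATASOURCE click_events_parsed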

Publishing real-time APIs

With data prepared for querying, Dub must serve it to thousands of concurrent users. So how do you build query patterns that maintain sub-second latency for complex analytics with many active database connections?

How Dub does it

Dub generates many different metrics to analyze the performance of the links their users create, and, in my humble opinion, Dub Analytics is the most interesting part of the product.

Dub's Analytics APIs consist of:

  • Totals (counters, sums, etc) for each of their events: clicks, leads, and sales.
  • Aggregations and ranks by any of their metadata dimensions: link, url, countries, city, continent, browsers, devices, referrer, trigger, etc.
  • Aggregations filtered by selected metadata dimensions (e.g. show me the top referrers for clicks made on desktop)
  • Aggregations by time: last 24 hours, last 7 days, last 30 days, etc.

All Dub's APIs support filtering by any of the dimensions and selected time ranges.

You can see how this works in practice with two examples from the Dub codebase:

How Dub counts clicks by referrer (or any other dimension)

Dub builds their APIs using Tinybird pipes, which are made up of composable SQL nodes that define the transformation. Nodes help Dub organize their queries, not only making them more readable, but also improving performance.

This node, taken from the v2_referrers.pipe, returns only the link ids for a given Dub workspace, optionally applying additional defined filters for various metadata dimensions such as programs and tags:

NODE workspace_links
SQL >
    %
    SELECT link_id
    from dub_links_metadata_latest FINAL
    WHERE
        workspace_id
        = {{
            String(
                workspaceId,
                'ws_cl7pj5kq4006835rbjlt2ofka',
                description="The ID of the workspace",
                required=True,
            )
        }}
        AND deleted == 0
        {% if defined(programId) %} AND program_id = {{ programId }} {% end %}
        {% if defined(domain) %} AND domain IN {{ Array(domain, 'String') }} {% end %}
        {% if defined(tagIds) %} AND arrayIntersect(tag_ids, {{ Array(tagIds, 'String') }}) != [] {% end %}
        {% if defined(root) %}
            {% if Boolean(root) == 1 %} AND key = '_root' {% else %} AND key != '_root' {% end %}
        {% end %}

Remember the ReplacingMergeTree? This query selects from dub_links_metadata_latest, so it deduplicates the links by timestamp using the FINAL keyword and deleted == 0.

Note: Dub actually has an opportunity to improve performance here. Since dub_links_metadata_latest is filtered by workspaceId in this query, it would benefit from having workspace_id in the sorting key. One approach would be to create a data source that serves as an inverted index mapping workspace_id to link_id.
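
Such an index could be maintained with another materialized view; here's a hypothetical sketch (names invented), with workspace_id leading the sorting key so the workspace filter prunes data first:

SCHEMA >
    `timestamp` DateTime,
    `workspace_id` String,
    `link_id` String

ENGINE "ReplacingMergeTree"
ENGINE_SORTING_KEY "workspace_id, link_id"
ENGINE_VER "timestamp"

The workspace_links subquery would then read only the granules for the requested workspace, rather than scanning a table sorted by link_id alone.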

Dub counts clicks by selecting from the dub_click_events_mv data source and filtering to include only the links returned in the workspace_links node. Note the use of the PREWHERE clause, which limits query processing by first filtering the subset of links.

This node uses Tinybird templating functions and parameters to add optional filters, and finally groups by referrer. The result gives the count of clicks for all links belonging to a workspace, grouped by referrer and sorted by count in descending order:

NODE referers_clicks
SQL >

    %
    SELECT referer, COUNT(referer) as clicks
    FROM
        dub_click_events_mv
        {% if not defined(linkId) and defined(workspaceId) %}
            PREWHERE link_id in (SELECT link_id from workspace_links)
        {% end %}
    WHERE
        true
        {% if defined(linkId) %} AND link_id = {{ String(linkId) }} {% end %}
        {% if defined(qr) %} AND qr = {{ Boolean(qr) }} {% end %}
        {% if defined(continent) %} AND continent = {{ continent }} {% end %}
        {% if defined(country) %} AND country = {{ country }} {% end %}
        {% if defined(region) %} AND region = {{ region }} {% end %}
        {% if defined(city) %} AND city = {{ city }} {% end %}
        {% if defined(device) %} AND device = {{ device }} {% end %}
        {% if defined(browser) %} AND browser = {{ browser }} {% end %}
        {% if defined(os) %} AND os = {{ os }} {% end %}
        {% if defined(referer) %} AND referer = {{ referer }} {% end %}
        {% if defined(refererUrl) %} AND splitByString('?', referer_url)[1] = {{ refererUrl }} {% end %}
        {% if defined(url) %} AND url = {{ url }} {% end %}
        {% if defined(start) %} AND timestamp >= {{ DateTime64(start) }} {% end %}
        {% if defined(end) %} AND timestamp <= {{ DateTime64(end) }} {% end %}
    GROUP BY referer
    ORDER BY clicks DESC

Finally, the data is prepared for publication in two final nodes, reusing the results from previous nodes and allowing optional filtering by any of the event types.

NODE referers_composite
SQL >

    SELECT dce.referer AS referer, clicks, leads, sales, amount, saleAmount
    FROM (SELECT referer, clicks FROM referers_clicks) AS dce
    LEFT JOIN (SELECT * FROM referers_leads) AS dle ON dce.referer = dle.referer
    LEFT JOIN (SELECT * FROM referers_sales) AS dse ON dce.referer = dse.referer
    ORDER BY clicks DESC



NODE endpoint
SQL >

    %
    SELECT *
    FROM
        {% if eventType == 'clicks' %} referers_clicks
        {% elif eventType == 'leads' %} referers_leads
        {% elif eventType == 'sales' %} referers_sales
        {% else %} referers_composite
        {% end %}

When integrated into the frontend via a bar chart, the final result looks like this:

A screenshot of Dub link clicks grouped by referrer and ordered by count
The final UI component that fetches data from Tinybird APIs

How Dub visualizes time series data

Dub also visualizes click events as a time series, with support for:

  • Multiple time ranges: last 24 hours, last 7 days, last 30 days, etc.
  • Arbitrary time ranges: from 2024-01-01 to 2024-01-31, etc.
  • Different time granularities: hourly, daily, weekly, monthly, etc.

Understanding that users may not have any clicks during a particular time period, Dub generates a time series to fill the gaps with zeros for a cleaner chart.

Here's an example:

NODE month_intervals
SQL >

    %
    WITH
        toStartOfMonth(
            toDateTime64({{ DateTime64(start, '2024-02-24 00:00:00.000') }}, 3),
            {{ String(timezone, 'UTC') }}
        ) AS start,
        toStartOfMonth(
            toDateTime64({{ DateTime64(end, '2024-05-23 00:00:00.000') }}, 3),
            {{ String(timezone, 'UTC') }}
        ) AS end,
        dateDiff('month', start, end) + 1 AS months_diff
    SELECT
        arrayJoin(
            arrayMap(
                x -> toDateTime64(start + toIntervalMonth(x), 3, {{ String(timezone, 'UTC') }}),
                range(0, months_diff)
            )
        ) as interval


NODE day_intervals
SQL >

    %
    WITH
        toStartOfDay(
            toDateTime64({{ DateTime64(start, '2024-02-24 00:00:00.000') }}, 3),
            {{ String(timezone, 'UTC') }}
        ) AS start,
        toStartOfDay(
            toDateTime64({{ DateTime64(end, '2024-05-23 00:00:00.000') }}, 3),
            {{ String(timezone, 'UTC') }}
        ) AS end
    SELECT
        arrayJoin(
            arrayMap(
                x -> toDateTime64(toStartOfDay(toDateTime64(x, 3), {{ String(timezone, 'UTC') }}), 3),
                range(toUInt32(start + 86400), toUInt32(end + 86400), 86400)
            )
        ) as interval



NODE hour_intervals
SQL >

    %
    WITH
        toStartOfHour(
            toDateTime64({{ DateTime64(start, '2024-05-22 00:00:00.000') }}, 3),
            {{ String(timezone, 'UTC') }}
        ) AS start,
        toStartOfHour(
            toDateTime64({{ DateTime64(end, '2024-05-23 00:00:00.000') }}, 3),
            {{ String(timezone, 'UTC') }}
        ) AS end
    SELECT
        arrayJoin(
            arrayMap(
                x -> toDateTime64(x, 3),
                range(toUInt32(start + 3600), toUInt32(end + 3600), 3600)
            )
        ) as interval

Once the intervals are calculated, Dub retrieves the links as in the previous example:

NODE workspace_links
SQL >

    %
    SELECT link_id
    from dub_links_metadata_latest FINAL
    WHERE
        workspace_id
        = {{
            String(
                workspaceId,
                'ws_cl7pj5kq4006835rbjlt2ofka',
                description="The ID of the workspace",
                required=True,
            )
        }}
        AND deleted == 0
        {% if defined(programId) %} AND program_id = {{ programId }} {% end %}
        {% if defined(domain) %} AND domain IN {{ Array(domain, 'String') }} {% end %}
        {% if defined(tagIds) %} AND arrayIntersect(tag_ids, {{ Array(tagIds, 'String') }}) != [] {% end %}
        {% if defined(root) %}
            {% if Boolean(root) == 1 %} AND key = '_root' {% else %} AND key != '_root' {% end %}
        {% end %}

When filtering the events, Dub takes into account the time range, the calculated intervals, dimensions, and time granularity, using parameters not only to filter the data but also to build the SELECT part of the query, as in the example below:

NODE timeseries_clicks_data
SQL >

    %
    SELECT
        {% if granularity == "hour" %} toStartOfHour(timestamp, {{ String(timezone, 'UTC') }})
        {% elif granularity == "month" %}
            toDateTime64(
                toStartOfMonth(timestamp, {{ String(timezone, 'UTC') }}),
                3,
                {{ String(timezone, 'UTC') }}
            )
        {% else %} toDateTime64(toStartOfDay(timestamp, {{ String(timezone, 'UTC') }}), 3)
        {% end %} AS interval,
        uniq(*) as clicks
    FROM
        dub_click_events_mv
        {% if not defined(linkId) and defined(workspaceId) %}
            PREWHERE link_id in (SELECT link_id from workspace_links)
        {% end %}
    WHERE
        true
        {% if defined(linkId) %} AND link_id = {{ String(linkId) }} {% end %}
        {% if defined(qr) %} AND qr = {{ Boolean(qr) }} {% end %}
        {% if defined(continent) %} AND continent = {{ continent }} {% end %}
        {% if defined(country) %} AND country = {{ country }} {% end %}
        {% if defined(region) %} AND region = {{ region }} {% end %}
        {% if defined(city) %} AND city = {{ city }} {% end %}
        {% if defined(device) %} AND device = {{ device }} {% end %}
        {% if defined(browser) %} AND browser = {{ browser }} {% end %}
        {% if defined(os) %} AND os = {{ os }} {% end %}
        {% if defined(referer) %} AND referer = {{ referer }} {% end %}
        {% if defined(refererUrl) %} AND splitByString('?', referer_url)[1] = {{ refererUrl }} {% end %}
        {% if defined(url) %} AND url = {{ url }} {% end %}
        AND timestamp >= {{ DateTime(start, '2024-04-25 00:00:00') }}
        AND timestamp < {{ DateTime(end, '2024-05-23 00:00:00') }}
    GROUP BY interval
    ORDER BY interval



NODE timeseries_clicks
SQL >

    %
    SELECT formatDateTime(interval, '%FT%T.000%z') as start, clicks
    FROM
        {% if granularity == "minute" %} minute_intervals
        {% elif granularity == "hour" %} hour_intervals
        {% elif granularity == "month" %} month_intervals
        {% else %} day_intervals
        {% end %}
    LEFT JOIN timeseries_clicks_data USING interval

Finally, it’s published as an endpoint:

NODE endpoint
SQL >

    %
    SELECT *
    FROM
        {% if eventType == 'clicks' %} timeseries_clicks
        {% elif eventType == 'leads' %} timeseries_leads
        {% elif eventType == 'sales' %} timeseries_sales
        {% elif eventType == 'billing' %} timeseries_billing
        {% else %} timeseries_composite
        {% end %}

How Dub handles API requests through a proxy

Tinybird provides the Events API for ingestion and REST APIs for consumption, both authenticated and rate limited by tokens (static or JWT).

While Tinybird APIs can be requested directly from the browser, Dub has chosen to proxy API requests through its own backend to more completely handle things like authentication, rate limiting, caching, and authorization.

In addition, a proxy API allows Dub to pre- and post-process the requests to the Tinybird APIs.

For instance, in the case of the APIs Dub has built to serve time series charts, Dub needs to map the selected time range to the appropriate Tinybird API parameters. If the user selects a yearly time range, Dub provides monthly granularity.

A screenshot of the Dub Analytics time range selector
Dub allows you to select a time range, and it uses a backend service to prepare the time range selection for the Tinybird APIs.

So the query to Tinybird looks like this:

BASE_URL="https://api.us-east.aws.tinybird.co/v0/pipes/v2_timeseries.json"

# Parameters:
# start       - Start date (2024-04-25 00:00:00)
# end         - End date (2024-05-23 00:00:00)
# timezone    - Timezone for the analysis (UTC)
# token       - Your Tinybird API token
# granularity - Time bucket size (month)
# eventType   - Type of events to analyze (clicks)

curl "$BASE_URL\
?start=2024-04-25 00:00:00\
&end=2024-05-23 00:00:00\
&timezone=UTC\
&token=$TINYBIRD_API_TOKEN\
&granularity=month\
&eventType=clicks"

To integrate the analytics APIs into the user application, Dub uses zod-bird as a typed client and a small library that proxies requests from their API to the corresponding Tinybird Pipes API and ensures type safety.

What you can learn from Dub's analytics APIs

Anytime you talk about analytics, it's important to draw a distinction between internal analytics views (business intelligence) and external views (user-facing analytics). Business intelligence has few concurrent users, static query patterns, and a general acceptance of data that may be hours or days stale.

User-facing analytics, on the other hand, has many (thousands+) concurrent users, dynamic query patterns, and a demand for real-time (and historical) data.

Dub has chosen infrastructure that allows them to build real-time analytics APIs. Dub's event log data is ingested and processed as it arrives, and Dub has built a flexible API that allows its many users to analyze click performance using dynamic filters without any noticeable hit to the frontend user experience.

If you're unaccustomed to working on real-time analytics projects, you may fall into a common trap: overly pre-processing the data for consumption to try to improve performance on the query side. While it may be tempting to use traditional batch OLAP infrastructure (such as a data warehouse) and a caching layer (such as Redis) to serve analytics to users, you'll end up with added lag and a loss of dynamic filtering.

In fact, Dub originally implemented an analytics proof of concept using Redis Sorted Sets, but the lack of filtering and poor performance encouraged them to find new infrastructure.

With Tinybird, Dub implements several best practices at query time that allow them to serve real-time, dynamic analytics to users without performance degradation:

  1. Filtering first. The enemy of speed is processing too much data at query time. Dub filters by subqueries at the beginning of their pipes, so subsequent nodes that join and aggregate process only the data that is needed to return the final result.
FROM
        dub_click_events_mv
        {% if not defined(linkId) and defined(workspaceId) %}
            PREWHERE link_id in (SELECT link_id from workspace_links)
        {% end %}
  2. Only pre-aggregating when necessary. Dub makes a design choice to perform some aggregations at query time, instead of pre-aggregating with materialized views. While pre-aggregating can improve performance, it also adds complexity. With effective filtering, Dub is able to performantly aggregate even at query time.
{% if granularity == "hour" %} toStartOfHour(timestamp, {{ String(timezone, 'UTC') }})
        {% elif granularity == "month" %}
            toDateTime64(
                toStartOfMonth(timestamp, {{ String(timezone, 'UTC') }}),
                3,
                {{ String(timezone, 'UTC') }}
            )
        {% else %} toDateTime64(toStartOfDay(timestamp, {{ String(timezone, 'UTC') }}), 3)
        {% end %} AS interval,
        uniq(*) as clicks
  3. Deferring joins. Joins can be memory-intensive operations. So, it's generally good practice to perform joins last, after filtering and aggregating has been completed. This ensures that the tables on either side of the join are as small as possible.
SELECT dce.referer AS referer, clicks, leads, sales, amount, saleAmount
    FROM (SELECT referer, clicks FROM referers_clicks) AS dce
    LEFT JOIN (SELECT * FROM referers_leads) AS dle ON dce.referer = dle.referer
    LEFT JOIN (SELECT * FROM referers_sales) AS dse ON dce.referer = dse.referer
    ORDER BY clicks DESC

Learn on your own

As I mentioned, Dub is fully open source, and they have a great guide on setting up Dub for local development. If you really want to dive into how Dub built Analytics, I recommend working through the guide and seeing how all the data pieces work together. You can use the Tinybird Local container to implement the Tinybird APIs on your machine, which is great for local development and testing.

If you want to see how I did it, check out my video below:
