I used to think running large-scale analytics workloads on Postgres was like trying to use a screwdriver as a hammer - technically possible but probably not a great idea. But after years of running both OLTP and OLAP workloads on Postgres at scale, I've learned it's much more nuanced than that. Postgres can actually handle analytics workloads quite well, if you know its limits and how to work within them.
In my previous articles, we looked at handling growing data volumes and increasing user concurrency. Analytics workloads add another dimension of complexity - instead of simple lookups and CRUD operations, you're now dealing with complex aggregations across millions or billions of rows, joins across multiple large tables, and queries that need to scan huge chunks of your data.
The good news? With the right approach, Postgres actually can handle many analytics use cases, from basic reporting to complex business intelligence queries. I've seen it work well for datasets up to several terabytes, providing query response times from seconds to minutes. The bad news? You'll need to make some specific optimizations and architectural choices to get there, and it does require a lot of oversight, planning, and investment.
Understanding OLAP vs OLTP query patterns
Before we get into specific optimizations, let's understand what makes analytics queries fundamentally different from the transactional queries we've discussed in previous articles. Having run both types at scale, I can tell you the differences go way beyond just query complexity.
Here's what we're dealing with:
| Dimension | OLTP Workloads | OLAP Workloads |
| --- | --- | --- |
| Query Type | Single-record lookups and updates (e.g., fetching a user's profile) | Large dataset analysis (e.g., monthly revenue across all customers) |
| Query Frequency | High (100s-1000s/second) | Lower, unless powering user-facing dashboards |
| Query Complexity | Simple, few joins | Complex, multiple joins and aggregations |
| Performance Needs | Milliseconds or bust | Seconds OK for batch; must match OLTP speed for user-facing |
| Resource Pattern | CPU-bound, heavy index usage | I/O-bound, large sequential scans |
| Data Access | < 1% of rows per query | Often > 10% of rows per query |
| Index Usage | Relies heavily on index seeks | Often bypasses indexes for full scans |
I think this table neatly summarizes why we need different strategies for OLAP workloads in Postgres. When your analytics queries start competing with your OLTP workloads for resources, you'll need to get creative with your approach - which is exactly what we'll cover next.
Postgres OLAP optimization strategies
Separate your workloads
The first rule of running analytics on Postgres: don't run analytics queries on your primary OLTP database. This isn't just best practice - it's survival. One complex analytics query scanning millions of rows can bring your entire application to a halt. I've seen it happen, and the incident reviews aren't fun.
Instead, set up a dedicated analytics replica. But not just any replica - one specifically configured for analytical workloads. Here's what this gets you:
- Complete isolation from production OLTP traffic
- Freedom to tune and optimize specifically for analytics
- Ability to scale your analytics infrastructure independently
- Protection against runaway queries impacting production
- Option to add more read replicas as analytics needs grow
Here's a basic setup using streaming replication:
- Configure the primary and replicas by updating the postgresql.conf:
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10 # This should generally be double the number of replicas
- Update your application to point to the analytics replicas where appropriate.
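Between those two steps sits the actual provisioning of the replica. If you're doing that yourself rather than through a managed platform, a minimal sketch looks like this (the slot name, user, and paths are illustrative):
-- On the primary: reserve a physical replication slot so WAL isn't
-- recycled before the analytics replica has consumed it
SELECT pg_create_physical_replication_slot('analytics_replica_1');
-- Then, on the replica host, clone the primary with something like:
--   pg_basebackup -h <primary-host> -U replicator -D /var/lib/postgresql/data \
--     -R -X stream -S analytics_replica_1
-- The -R flag writes primary_conninfo and creates standby.signal, so the
-- replica starts streaming as soon as Postgres comes up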
Notice the differences from the settings suggested in the “Outgrowing Postgres: Handling increased user concurrency” article: here you’re tuning these replicas specifically for analytics workloads. Skip ahead to the Performance tuning for analytics section below for more fine-tuned settings for your replicas.
Key things you’ll want to watch out for:
- Replication lag: Analytics queries can cause the replica to fall behind
- Resource contention: Even on a replica, poorly optimized queries can cause problems
- Data freshness: Your analytics will be slightly behind production due to replication
A common question I get is "How much replication lag is acceptable?" For most analytics workloads, a few seconds or even minutes of lag is fine. But if you need real-time analytics, you'll need to either accept higher costs from a more powerful replica setup or start looking at specialized analytics solutions.
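To know where you stand, check lag directly on the replica (or look at pg_stat_replication on the primary):
-- On the analytics replica: how far behind the primary is WAL replay?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;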
Design your schema for analytics
Your schema design can make or break analytics performance. Let's look at several techniques that can dramatically improve query speed, starting with the basics and moving to more advanced approaches.
Materialized views for pre-aggregation
If you've read my first article in this series on handling growing data volumes, you're already familiar with materialized views. They're particularly valuable for analytics workloads because they let you pre-calculate expensive aggregations:
CREATE MATERIALIZED VIEW hourly_stats AS
SELECT
date_trunc('hour', event_time) as hour,
count(*) as events,
count(distinct user_id) as users
FROM events GROUP BY 1;
CREATE UNIQUE INDEX ON hourly_stats(hour);
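That unique index isn't just for lookups - it's what lets you refresh the view without locking out readers:
-- Rebuilds hourly_stats in the background; readers keep the old data until the swap
REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_stats;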
For a more thorough treatment of materialized views, including concurrent refreshes and maintenance strategies, check out the materialized views section of that article.
Table partitioning
Partitioning is another fundamental technique we covered in the data volumes article. For analytics workloads, time-based partitioning is particularly effective:
CREATE TABLE events (
event_time timestamptz,
user_id int,
event_type text
) PARTITION BY RANGE (event_time);
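Each partition then holds one time slice, so a query with a time filter only touches the partitions it needs. The names and ranges below are just an illustration:
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');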
Take a look at the table partitioning section of that article for detailed partitioning strategies and maintenance approaches.
Vertical partitioning for analytics
Here's where things get interesting for analytics workloads. Vertical partitioning - splitting tables by columns rather than rows - can significantly improve analytics performance. While we touched on this in that same data volumes article, it deserves special attention for analytics.
Consider a wide events table:
CREATE TABLE events (
-- Frequently queried columns
event_time timestamptz,
user_id int,
event_type text,
-- Rarely queried metadata
user_agent text,
ip_address inet,
-- Large payload rarely used in aggregations
event_data jsonb
);
You can split this into separate tables based on query patterns:
CREATE TABLE events_core (
event_id bigserial PRIMARY KEY,
event_time timestamptz,
user_id int,
event_type text);
CREATE TABLE events_metadata (
event_id bigint PRIMARY KEY REFERENCES events_core,
user_agent text,
ip_address inet
);
CREATE TABLE events_payload (
event_id bigint PRIMARY KEY REFERENCES events_core,
event_data jsonb
);
This approach has several benefits for analytics:
- Queries only need to scan the columns they need
- Better cache utilization for frequently accessed columns
- More efficient compression for similar data types
- Ability to use different storage strategies per table
But - and this is a big but - vertical partitioning isn't free. Here's what you're signing up for:
Write complexity
- Every insert now requires multiple queries
- Transactions need to span multiple tables
- More complex error handling and rollback scenarios
- Higher chance of constraint violations
- Need for careful transaction management to maintain data consistency
-- Instead of one simple insert
INSERT INTO events VALUES (...);
-- You now need transaction-wrapped multi-table inserts
BEGIN;
INSERT INTO events_core (...) RETURNING event_id;
-- capture the returned event_id in your application
-- and reuse it in the two child-table inserts below
INSERT INTO events_metadata (...);
INSERT INTO events_payload (...);
COMMIT;
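If round-tripping the generated id through your application bothers you, data-modifying CTEs let Postgres do the whole thing in one atomic statement (the column lists and values here are illustrative):
WITH new_event AS (
    INSERT INTO events_core (event_time, user_id, event_type)
    VALUES (now(), 42, 'page_view')
    RETURNING event_id
),
meta AS (
    INSERT INTO events_metadata (event_id, user_agent, ip_address)
    SELECT event_id, 'Mozilla/5.0', '203.0.113.7'::inet FROM new_event
)
INSERT INTO events_payload (event_id, event_data)
SELECT event_id, '{"page": "/pricing"}'::jsonb FROM new_event;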
Read complexity
- If you need to read columns across the partitions, simple queries become joins
- Need to maintain indexes across multiple tables
- More complex query planning
- Potential for performance cliffs if joins aren't optimized
-- Simple query becomes a three-way join
SELECT e.event_time, e.event_type, m.user_agent, p.event_data
FROM events_core e
LEFT JOIN events_metadata m
ON e.event_id = m.event_id
LEFT JOIN events_payload p
ON e.event_id = p.event_id
WHERE e.event_time >= now() - interval '1 day';
Maintenance headaches
- More tables to manage and monitor
- Complex cleanup for orphaned records
- Harder to modify schema (changes need to be coordinated)
- More complex backup and restore procedures
- Additional foreign key constraints to maintain
When should you actually use vertical partitioning? In my experience, it works best when:
- Your analytical queries have clear patterns (some columns always used together)
- You have very wide tables where most queries use a small subset of columns
- You have different access patterns for different column groups
- Your write volume is moderate (and you can handle the extra complexity)
- You have columns with very different storage needs (like text vs numeric data)
If you're just starting out with analytics on Postgres, I'd recommend exhausting simpler options first. Materialized views and regular partitioning can take you quite far with less complexity. But when you're hitting scan performance issues and you see clear patterns in how columns are accessed, vertical partitioning might be worth the trade-offs.
Columnar storage with Postgres plugins
Postgres has a powerful extension system that lets you add new capabilities to your database. Think of extensions like browser plugins - they extend functionality but run with the same privileges as Postgres itself. That's both powerful and potentially risky.
Understanding Postgres extensions
Extensions are shared libraries that can:
- Add new data types and functions
- Create new table access methods
- Implement new index types
- Add foreign data wrappers
They're typically written in C and need to be compiled against your specific Postgres version. While the PostgreSQL Extension Network (PGXN) applies some quality controls, you should still evaluate extensions carefully - they run with full database privileges.
Columnar storage options
For analytics workloads, the following columnar storage extensions stand out:
cstore_fdw
The battle-tested option, developed by Citus Data (now part of Microsoft):
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
CREATE FOREIGN TABLE metrics_columnar (
day date,
metric_name text,
value numeric) SERVER cstore_server
OPTIONS(compression 'pglz');
Pros:
- Mature and stable
- Used in production by many companies
- Good documentation and community support
- Solid compression ratios
- Supports both read and write operations
- Works with standard Postgres backup tools
Cons:
- No support for indexes
- Limited to PostgreSQL 12 and older in the open-source version
- Writes are slower than regular tables
- No concurrent write support
- Limited query optimization capabilities
pg_analytics (formerly pg_lakehouse)
A DuckDB-powered option by ParadeDB focused on data lake integration:
CREATE EXTENSION pg_analytics;
CREATE FOREIGN TABLE metrics (
day date,
metric_name text,
value numeric
) SERVER parquet_server
OPTIONS (files 's3://bucket/metrics/*.parquet');
Pros:
- Native integration with data lakes and object stores
- Supports Delta Lake and Apache Iceberg formats
- Query pushdown to DuckDB for performance
- Good compatibility with Postgres types
Cons:
- Read-only access currently
- Requires shared_preload_libraries configuration
- Limited write capabilities
- Early in development cycle
pg_duckdb
A relative newcomer by Hydra and MotherDuck that also embeds DuckDB in Postgres:
CREATE EXTENSION pg_duckdb;
CREATE TABLE metrics (
day date,
metric_name text,
value numeric
) USING duckdb;
Pros:
- Leverages DuckDB's high-performance analytics engine
- Can directly query data in S3/object storage
- Supports Parquet, CSV, JSON file formats
- Integrates with MotherDuck cloud service
- Active development from DuckDB team
Cons:
- Currently read-only for object storage
- Some Postgres data types not yet supported
- Relatively new (v0.2.0 as of early 2024)
- Limited production deployments
pg_mooncake
The newest kid on the block, by Mooncake Labs, combines DuckDB execution with Iceberg/Delta Lake storage and is available on Neon:
CREATE EXTENSION pg_mooncake;
CREATE TABLE metrics (
day date,
metric_name text,
value numeric
) USING columnstore;
Pros:
- Significantly faster analytics queries due to DuckDB's vectorized execution
- Better compression ratios
- Native integration with data lakes (can query Parquet files)
- Modern columnar storage format
- Active development
Cons:
- Relatively new and less battle-tested
- Limited write capabilities (no UPDATE/DELETE)
- Some Postgres features not supported yet
- More complex setup if using cloud storage
- Smaller community and fewer production deployments
When to use a columnar storage plugin
There are definitely some risks you need to consider before going down this route:
- Version Lock-in: Extensions need to match your Postgres version exactly
- Operational Complexity: Need to maintain extension updates separately
- Backup Considerations: Some backup tools may not handle extension data properly
- Performance Overhead: Extensions can add CPU and memory overhead
- Support Challenges: Limited commercial support options
Despite these challenges, columnar storage can be worth it when:
- You have large analytics tables (>100GB)
- Queries typically access a subset of columns
- You need better compression ratios
- Read performance is more important than write performance
- You can manage your own Postgres infrastructure
If you're on a managed platform that doesn't support these extensions, you might need to:
- Use regular Postgres tables with good indexing strategies
- Consider a separate analytics database
- Look at managed analytics solutions
- Evaluate moving to self-hosted Postgres
Remember: adding extensions is a significant architectural decision. Start with standard Postgres features first, and only add extensions when you have clear evidence they'll solve specific problems in your workload.
While these advanced techniques can give you a huge boost in analytics performance, they also add complexity to your system. Start with the basics (materialized views, partitioning, and analytics replicas) and only move to columnar storage when you have clear evidence it will help your specific workload.
Analytics query optimization techniques
Up to this point I’ve focused on infrastructure and schema changes. In many situations, however, smart query optimization beats brute force.
Understanding analytics query patterns
Analytics queries are fundamentally different from OLTP queries. They often:
- Scan large portions of tables
- Perform complex aggregations
- Join multiple large tables
- Use window functions for trend analysis
- Need to handle significant intermediate results
Now I’ll walk you through how to optimize each of these patterns.
Window functions
Window functions are incredibly powerful for analytics. They let you perform calculations across sets of rows without the overhead of self-joins or complex subqueries:
-- Instead of a correlated subquery like this for a 3-month rolling average
SELECT
month,
revenue,
(
SELECT AVG(revenue)
FROM monthly_revenues m2
WHERE m2.month <= m1.month
AND m2.month > m1.month - interval '3 months'
) as rolling_avg
FROM monthly_revenues m1;
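The window-function version is both clearer and cheaper, since it doesn't re-scan monthly_revenues for every output row (this assumes one row per month):
SELECT
    month,
    revenue,
    AVG(revenue) OVER (
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as rolling_avg
FROM monthly_revenues;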
Pro tip: Window functions are processed after aggregations. This means you can combine them with GROUP BY:
SELECT
date_trunc('month', timestamp) as month,
customer_segment,
SUM(revenue) as revenue,
SUM(SUM(revenue)) OVER (
PARTITION BY customer_segment
ORDER BY date_trunc('month', timestamp)
) as running_total
FROM transactions
GROUP BY 1, 2;
Common table expressions (CTEs)
Common Table Expressions (CTEs) are often seen as just a way to make queries more readable. But the way Postgres executes them also makes them a powerful optimization tool - and a double-edged sword:
-- Use CTEs to break down complex analytics
WITH monthly_stats AS (
SELECT
date_trunc('month', timestamp) as month,
COUNT(DISTINCT user_id) as users,
SUM(revenue) as revenue
FROM events
WHERE timestamp >= NOW() - interval '12 months'
GROUP BY 1
),
user_segments AS (
SELECT
user_id,
CASE
WHEN lifetime_value > 1000 THEN 'high'
WHEN lifetime_value > 100 THEN 'medium'
ELSE 'low'
END as segment
FROM users
)
SELECT
month,
segment,
COUNT(DISTINCT e.user_id) as users,
SUM(revenue) as revenue
FROM events e
JOIN user_segments s
ON e.user_id = s.user_id
GROUP BY 1, 2;
The other side of the sword: in Postgres 11 and earlier, CTEs were always materialized - calculated once and reused. Since Postgres 12, the planner inlines a CTE that's referenced only once (unless you mark it MATERIALIZED) and materializes one that's referenced multiple times. Materialization is usually great for analytics, but it can backfire if your CTE returns a large result set that the outer query only needs a small slice of, because filters can't be pushed down into a materialized CTE.
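If the planner's default isn't what you want, say so explicitly:
-- NOT MATERIALIZED lets the outer filters and aggregation push down into the CTE;
-- swap in MATERIALIZED to force compute-once-and-reuse (Postgres 12+)
WITH recent_events AS NOT MATERIALIZED (
    SELECT user_id, revenue, timestamp
    FROM events
    WHERE timestamp >= NOW() - interval '12 months'
)
SELECT date_trunc('month', timestamp) as month, SUM(revenue) as revenue
FROM recent_events
GROUP BY 1;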
Parallel query execution
Postgres can parallelize many analytics operations, but you need to tune it right:
EXPLAIN ANALYZE
SELECT
date_trunc('month', timestamp) as month,
COUNT(*) as events,
COUNT(DISTINCT user_id) as users
FROM events
GROUP BY 1;
-- Key settings for parallel queries
SET max_parallel_workers_per_gather = 4; -- Workers per query
SET parallel_setup_cost = 10; -- Lower = more parallelism
SET parallel_tuple_cost = 0.001; -- Lower = more parallelism
SET min_parallel_table_scan_size = '8MB'; -- Table size threshold
Some operations that benefit most from parallelism:
- Sequential scans on large tables
- Parallel joins (with appropriate settings)
- Large aggregations
- Window functions
Keep in mind: While parallel queries use more total CPU, they can finish much faster. Monitor your system resources carefully when enabling parallelism.
Essential EXPLAIN analysis
When optimizing analytics queries, EXPLAIN ANALYZE is your best friend. These are some of the specific things to watch for:
-- Get detailed buffering info
EXPLAIN (ANALYZE, BUFFERS)
SELECT /* your query */;
-- Look for:
-- "Parallel Seq Scan" - Is parallelism being used?
-- "Hash Join" vs "Nested Loop" - Right join strategy?
-- "Sort Method: quicksort" vs "external sort" - Enough work_mem?
-- "Rows Removed by Filter" - Could an index help?
Common analytics query pitfalls
- Unneeded ORDER BY in subqueries
-- Sorts intermediate results unnecessarily
SELECT user_id, COUNT(*)
FROM (
SELECT DISTINCT user_id
FROM events
ORDER BY timestamp
) t
GROUP BY 1;
- Missing DISTINCT ON for latest values
-- Gets all rows then filters
SELECT user_id, status
FROM user_status
WHERE (user_id, timestamp) IN (
SELECT user_id, MAX(timestamp)
FROM user_status
GROUP BY user_id
);
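The DISTINCT ON version fetches the latest row per user in one pass, and an index on (user_id, timestamp DESC) spares it the sort:
-- Gets the latest status per user directly
SELECT DISTINCT ON (user_id) user_id, status
FROM user_status
ORDER BY user_id, timestamp DESC;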
- Not using date truncation indexes
CREATE INDEX idx_events_month ON events (
date_trunc('month', timestamp)
);
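One caveat: that index works as written only on timestamp (without time zone) columns. On timestamptz, date_trunc() isn't IMMUTABLE, so Postgres rejects the expression index; a common workaround is a small immutable wrapper pinned to a zone (the function name and the choice of UTC are illustrative):
-- Immutable because the time zone is fixed inside the function
CREATE FUNCTION month_utc(ts timestamptz) RETURNS timestamp
LANGUAGE sql IMMUTABLE AS
$$ SELECT date_trunc('month', ts AT TIME ZONE 'UTC') $$;
CREATE INDEX idx_events_month_utc ON events (month_utc(timestamp));
-- Queries must then group and filter on month_utc(timestamp) to use the index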
Query optimization is iterative. Start with the basics, measure, then optimize further based on real usage patterns. And always test with realistic data volumes - what works for 1M rows might fall apart at 100M. Lastly, it’s amazing how far you can get by simply following “The 5 rules of writing faster SQL queries.”
Indexing for analytics
BRIN indexes are Postgres’ secret weapon for analytical workloads. Unlike B-tree indexes that track every single row, BRIN (Block Range INdex) creates a tiny index that stores metadata about blocks of data. Think of it like chapter summaries in a book instead of a detailed index of every word.
For time-series data or any naturally ordered columns (like IDs that only go up), BRIN shines. It's perfect when physical storage order and the indexed column are highly correlated for your data.
In Crunchy Data's testing, on a 42MB table, the B-tree indexes were 21MB while the BRIN indexes were just 24KB! And for large result sets (100K+ rows), BRIN consistently outperformed B-tree indexes.
Here's how to create one:
CREATE INDEX event_logs_ts_brin ON event_logs
USING brin(timestamp) WITH (pages_per_range = 128);
The pages_per_range parameter is key - it defaults to 128, but tuning it can really boost performance. For narrow queries (returning ~100 rows), smaller values like 8 or 16 work better. For broader analytical queries, stick with larger values.
But BRIN isn't magic. Skip it when:
- Your data is randomly distributed
- You need exact row lookups
- Your queries are highly selective (returning <1% of rows)
- Your table gets lots of UPDATEs in the middle
The sweet spot? Analytical queries that scan large chunks of naturally ordered data. Bonus points if you're tight on disk space - BRIN indexes are tiny.
Want to check if BRIN makes sense? Look at the correlation between storage order and your column (pg_stats is populated by ANALYZE):
SELECT correlation
FROM pg_stats
WHERE tablename = 'event_logs' AND attname = 'timestamp';
The closer the correlation is to 1.0 (or -1.0 for reverse-ordered data), the better BRIN will perform.
Performance tuning for analytics
No matter how well you write your queries, your Postgres instance needs proper tuning to handle analytics workloads efficiently. While I covered basic configuration in previous articles, analytics workloads have specific needs that require different trade-offs.
Memory settings
Memory configuration is where analytics workloads differ most from OLTP. You'll need larger memory allocations for complex operations, but you'll also need to be careful not to overallocate:
# Base memory settings
shared_buffers = '8GB' # More RAM for dedicated analytics instance
work_mem = '1GB' # Larger for complex sorts/joins
maintenance_work_mem = '2GB' # Helps with vacuum on large tables
effective_cache_size = '24GB' # ~75% of RAM for big queries
# Temporary file settings
temp_buffers = '512MB' # More temp memory for large sorts
temp_file_limit = '5GB' # Limit disk usage for temp files
# Don't raise this above your OS stack limit (ulimit -s) minus a safety margin
max_stack_depth = '6MB' # For deep recursive queries
I/O configuration
Analytics queries often read large amounts of data. These settings help optimize I/O:
# Sequential scan settings
effective_io_concurrency = 200 # Higher for SSDs/NVMe
random_page_cost = 1.1 # Lower for SSDs/NVMe
# Checkpoint settings
checkpoint_timeout = '30min' # Longer for analytics
max_wal_size = '4GB' # Larger for busy systems
checkpoint_completion_target = 0.9 # Spread checkpoints out
Query planning
The query planner needs different settings for analytics:
# Parallel query settings
max_parallel_workers = 8 # Total parallel workers
max_parallel_workers_per_gather = 4 # Workers per gather node
parallel_setup_cost = 10 # Lower = more parallelism
parallel_tuple_cost = 0.001 # Lower = more parallelism
# Cost settings
default_statistics_target = 500 # More statistics for better plans
from_collapse_limit = 12 # Higher for complex analytics
join_collapse_limit = 12 # Higher for complex analytics
Vacuum settings
Large analytics tables need different vacuum settings:
# Vacuum settings for large tables
autovacuum_vacuum_scale_factor = 0.1 # More aggressive vacuum
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_limit = 1000 # Higher for faster vacuum
autovacuum_vacuum_cost_delay = 10ms # Lower for faster vacuum
Monitoring configuration
Monitoring becomes even more critical with analytics workloads:
# Statement tracking
track_io_timing = on # Track I/O statistics
track_functions = all # Track function calls
track_activity_query_size = 4096 # Longer for complex queries
# Logging settings
log_min_duration_statement = '10s' # Log slow queries
log_temp_files = 0 # Log all temp file usage
log_autovacuum_min_duration = 0 # Log vacuum activity
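Pair these settings with the pg_stat_statements extension (it has to be listed in shared_preload_libraries) so you can see which analytics queries are actually worth tuning. A quick starting point, assuming Postgres 13+ column names:
-- Top 10 queries by total execution time
SELECT query, calls, total_exec_time, mean_exec_time, shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;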
Real-world example
Here's what this might look like in practice. Let's say you have:
- 32 CPU cores
- 128GB RAM
- NVMe storage
- Mixed analytics workload
Your configuration might look like:
# Memory
shared_buffers = '32GB' # 25% of RAM
work_mem = '512MB' # Conservative for multiple concurrent queries
maintenance_work_mem = '4GB' # Large for vacuum operations
effective_cache_size = '96GB' # 75% of RAM
# Parallelism
max_parallel_workers = 16 # Half the cores
max_parallel_workers_per_gather = 8 # Quarter the cores
# I/O and WAL
effective_io_concurrency = 200 # High for NVMe
random_page_cost = 1.1 # Low for NVMe
checkpoint_timeout = '30min'
max_wal_size = '16GB' # Large for busy system
Don't just copy these numbers and YOLO it. Monitor your actual memory usage using tools like pg_stat_statements and adjust based on your workload. Below I’ll call out some gotchas and guidance for testing your settings.
Configuration gotchas
Watch out for these common issues:
- Memory overallocation: too high a work_mem can cause OOM kills - remember it's per operation, not per connection
- Excessive parallelism: too many parallel workers can cause thrashing - monitor CPU usage and adjust
- Aggressive vacuum: overly aggressive settings can impact query performance - monitor vacuum timing and adjust cost limits
- Checkpoint tuning: too large a max_wal_size can cause long recovery times, while too-frequent checkpoints can impact performance
Testing your configuration
Always test configuration changes:
- Start with a test environment
- Benchmark before and after
- Monitor system resources
- Test with realistic data volumes
- Test failure scenarios
Analytics workloads often reveal configuration problems that OLTP workloads miss. What works for millions of small transactions might fail spectacularly for one large analytical query.
Know your limits
Even with all these optimizations, you'll eventually hit limits running analytics on Postgres. Watch for:
- Queries taking minutes or longer
- High I/O wait times
- Memory pressure from large sorts/joins
- Vacuum struggling to keep up
- Replication lag on analytics replicas
When you see these signs, it's time to start thinking about dedicated analytics solutions. But that's a topic for an upcoming article!
Takeaways
Postgres can handle many analytics workloads if you:
- Isolate analytics traffic to dedicated replicas
- Optimize your schema and configuration
- Use materialized views strategically
- Use techniques for optimal analytics querying
- Leverage BRIN indexes where they make sense
- Monitor and maintain carefully
- Know when to quit and move to specialized tools
It's not a perfect analytics database. But with the right approach, it can take you surprisingly far. Just don't forget to keep an eye on those limits - they'll sneak up on you faster than an angry DBA during a production incident.
In the next article, I’ll walk you through when and how to move your analytics workloads off Postgres. Because sometimes, you really do need a hammer instead of that screwdriver.
Need to move your analytics off Postgres?
Tinybird is data infrastructure for software teams. You can query your Postgres tables with SQL and publish queries as dynamic, scalable REST API Endpoints to power in-product reporting, real-time dashboards, and more. You can try Tinybird for free with no time limit. Sign up here.