I used to think running large-scale analytics workloads on Postgres was like trying to use a screwdriver as a hammer - technically possible but probably not a great idea. But after years of running both OLTP and OLAP workloads on Postgres at scale, I've learned it's much more nuanced than that. Postgres can actually handle analytics workloads quite well, if you know its limits and how to work within them.
In my previous articles, we looked at handling growing data volumes and increasing user concurrency. Analytics workloads add another dimension of complexity - instead of simple lookups and CRUD operations, you're now dealing with complex aggregations across millions or billions of rows, joins across multiple large tables, and queries that need to scan huge chunks of your data.
The good news? With the right approach, Postgres actually can handle many analytics use cases, from basic reporting to complex business intelligence queries. I've seen it work well for datasets up to several terabytes, providing query response times from seconds to minutes. The bad news? You'll need to make some specific optimizations and architectural choices to get there, and it does require a lot of oversight, planning, and investment.
Understanding OLAP vs OLTP query patterns
Before we get into specific optimizations, let's understand what makes analytics queries fundamentally different from the transactional queries we've discussed in previous articles. Having run both types at scale, I can tell you the differences go way beyond just query complexity.
Here's what we're dealing with:
| Dimension | OLTP Workloads | OLAP Workloads |
| --- | --- | --- |
| Query Type | Single-record lookups and updates (e.g., fetching a user's profile) | Large dataset analysis (e.g., monthly revenue across all customers) |
| Query Frequency | High (100s-1000s/second) | Lower, unless powering user-facing dashboards |
| Query Complexity | Simple, few joins | Complex, multiple joins and aggregations |
| Performance Needs | Milliseconds or bust | Seconds OK for batch; must match OLTP speed for user-facing |
| Resource Pattern | CPU-bound, heavy index usage | I/O-bound, large sequential scans |
| Data Access | < 1% of rows per query | Often > 10% of rows per query |
| Index Usage | Relies heavily on index seeks | Often bypasses indexes for full scans |
I think this table neatly summarizes why we need different strategies for OLAP workloads in Postgres. When your analytics queries start competing with your OLTP workloads for resources, you'll need to get creative with your approach - which is exactly what we'll cover next.
Postgres OLAP optimization strategies
Separate your workloads
The first rule of running analytics on Postgres: don't run analytics queries on your primary OLTP database. This isn't just best practice - it's survival. One complex analytics query scanning millions of rows can bring your entire application to a halt. I've seen it happen, and the incident reviews aren't fun.
Instead, set up a dedicated analytics replica. But not just any replica - one specifically configured for analytical workloads. Here's what this gets you:
- Complete isolation from production OLTP traffic
- Freedom to tune and optimize specifically for analytics
- Ability to scale your analytics infrastructure independently
- Protection against runaway queries impacting production
- Option to add more read replicas as analytics needs grow
Here's a basic setup using streaming replication:
- Configure the primary and replicas by updating the postgresql.conf:
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10 # This should generally be double the number of replicas
- Update your application to point to the analytics replicas where appropriate.
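Between those two steps sits the actual provisioning of the replica. If you're doing that yourself rather than through a managed platform, a minimal sketch looks like this (the slot name, user, and paths are illustrative):
-- On the primary: reserve a physical replication slot so WAL isn't
-- recycled before the analytics replica has consumed it
SELECT pg_create_physical_replication_slot('analytics_replica_1');
-- Then, on the replica host, clone the primary with something like:
--   pg_basebackup -h <primary-host> -U replicator -D /var/lib/postgresql/data \
--     -R -X stream -S analytics_replica_1
-- The -R flag writes primary_conninfo and creates standby.signal, so the
-- replica starts streaming as soon as Postgres comes up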
Notice the differences from the settings suggested in the “Outgrowing Postgres: Handling increased user concurrency” article: here you’re tuning these replicas specifically for analytics workloads. Skip ahead to the Performance tuning for analytics section below for more fine-tuned settings for your replicas.
Key things you’ll want to watch out for:
- Replication lag: Analytics queries can cause the replica to fall behind
- Resource contention: Even on a replica, poorly optimized queries can cause problems
- Data freshness: Your analytics will be slightly behind production due to replication
A common question I get is "How much replication lag is acceptable?" For most analytics workloads, a few seconds or even minutes of lag is fine. But if you need real-time analytics, you'll need to either accept higher costs from a more powerful replica setup or start looking at specialized analytics solutions.
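To know where you stand, check lag directly on the replica (or look at pg_stat_replication on the primary):
-- On the analytics replica: how far behind the primary is WAL replay?
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;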
Design your schema for analytics
Your schema design can make or break analytics performance. Let's look at several techniques that can dramatically improve query speed, starting with the basics and moving to more advanced approaches.
Materialized views for pre-aggregation
If you've read my first article in this series on handling growing data volumes, you're already familiar with materialized views. They're particularly valuable for analytics workloads because they let you pre-calculate expensive aggregations:
CREATE MATERIALIZED VIEW hourly_stats AS
SELECT
date_trunc('hour', event_time) as hour,
count(*) as events,
count(distinct user_id) as users
FROM events GROUP BY 1;
CREATE UNIQUE INDEX ON hourly_stats(hour);
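That unique index isn't just for lookups - it's what lets you refresh the view without locking out readers:
-- Rebuilds hourly_stats in the background; readers keep the old data until the swap
REFRESH MATERIALIZED VIEW CONCURRENTLY hourly_stats;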
For a more thorough treatment of materialized views, including concurrent refreshes and maintenance strategies, check out the materialized views section of that article.
Table partitioning
Partitioning is another fundamental technique we covered in the data volumes article. For analytics workloads, time-based partitioning is particularly effective:
CREATE TABLE events (
event_time timestamptz,
user_id int,
event_type text
) PARTITION BY RANGE (event_time);
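Each partition then holds one time slice, so a query with a time filter only touches the partitions it needs. The names and ranges below are just an illustration:
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');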
Take a look at the table partitioning section of that article for detailed partitioning strategies and maintenance approaches.
Vertical partitioning for analytics
Here's where things get interesting for analytics workloads. Vertical partitioning - splitting tables by columns rather than rows - can significantly improve analytics performance. While we touched on this in that same data volumes article, it deserves special attention for analytics.
Consider a wide events table:
CREATE TABLE events (
-- Frequently queried columns
event_time timestamptz,
user_id int,
event_type text,
-- Rarely queried metadata
user_agent text,
ip_address inet,
-- Large payload rarely used in aggregations
event_data jsonb
);
You can split this into separate tables based on query patterns:
CREATE TABLE events_core (
event_id bigserial PRIMARY KEY,
event_time timestamptz,
user_id int,
event_type text);
CREATE TABLE events_metadata (
event_id bigint PRIMARY KEY REFERENCES events_core,
user_agent text,
ip_address inet
);
CREATE TABLE events_payload (
event_id bigint PRIMARY KEY REFERENCES events_core,
event_data jsonb
);
This approach has several benefits for analytics:
- Queries only need to scan the columns they need
- Better cache utilization for frequently accessed columns
- More efficient compression for similar data types
- Ability to use different storage strategies per table
But - and this is a big but - vertical partitioning isn't free. Here's what you're signing up for:
Write complexity
- Every insert now requires multiple queries
- Transactions need to span multiple tables
- More complex error handling and rollback scenarios
- Higher chance of constraint violations
- Need for careful transaction management to maintain data consistency
-- Instead of one simple insert
INSERT INTO events VALUES (...);
-- You now need transaction-wrapped multi-table inserts
BEGIN;
INSERT INTO events_core (...) RETURNING event_id;
-- capture the returned event_id in your application
-- and reuse it in the two child-table inserts below
INSERT INTO events_metadata (...);
INSERT INTO events_payload (...);
COMMIT;
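If round-tripping the generated id through your application bothers you, data-modifying CTEs let Postgres do the whole thing in one atomic statement (the column lists and values here are illustrative):
WITH new_event AS (
    INSERT INTO events_core (event_time, user_id, event_type)
    VALUES (now(), 42, 'page_view')
    RETURNING event_id
),
meta AS (
    INSERT INTO events_metadata (event_id, user_agent, ip_address)
    SELECT event_id, 'Mozilla/5.0', '203.0.113.7'::inet FROM new_event
)
INSERT INTO events_payload (event_id, event_data)
SELECT event_id, '{"page": "/pricing"}'::jsonb FROM new_event;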
Read complexity
- If you need to read columns across the partitions, simple queries become joins
- Need to maintain indexes across multiple tables
- More complex query planning
- Potential for performance cliffs if joins aren't optimized
-- Simple query becomes a three-way join
SELECT e.event_time, e.event_type, m.user_agent, p.event_data
FROM events_core e
LEFT JOIN events_metadata m
ON e.event_id = m.event_id
LEFT JOIN events_payload p
ON e.event_id = p.event_id
WHERE e.event_time >= now() - interval '1 day';
Maintenance headaches
- More tables to manage and monitor
- Complex cleanup for orphaned records
- Harder to modify schema (changes need to be coordinated)
- More complex backup and restore procedures
- Additional foreign key constraints to maintain
When should you actually use vertical partitioning? In my experience, it works best when:
- Your analytical queries have clear patterns (some columns always used together)
- You have very wide tables where most queries use a small subset of columns
- You have different access patterns for different column groups
- Your write volume is moderate (and you can handle the extra complexity)
- You have columns with very different storage needs (like text vs numeric data)
If you're just starting out with analytics on Postgres, I'd recommend exhausting simpler options first. Materialized views and regular partitioning can take you quite far with less complexity. But when you're hitting scan performance issues and you see clear patterns in how columns are accessed, vertical partitioning might be worth the trade-offs.
Columnar storage with Postgres plugins
Postgres has a powerful extension system that lets you add new capabilities to your database. Think of extensions like browser plugins - they extend functionality but run with the same privileges as Postgres itself. That's both powerful and potentially risky.
Understanding Postgres extensions
Extensions are shared libraries that can:
- Add new data types and functions
- Create new table access methods
- Implement new index types
- Add foreign data wrappers
They're typically written in C and need to be compiled against your specific Postgres version. While the PostgreSQL Extension Network (PGXN) applies some quality controls, you should still evaluate extensions carefully - they run with full database privileges.
Columnar storage options
For analytics workloads, the following columnar storage extensions stand out:
cstore_fdw
The battle-tested option, developed by Citus Data (now part of Microsoft):
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
CREATE FOREIGN TABLE metrics_columnar (
day date,
metric_name text,
value numeric) SERVER cstore_server
OPTIONS(compression 'pglz');
Pros:
- Mature and stable
- Used in production by many companies
- Good documentation and community support
- Solid compression ratios
- Supports both read and write operations
- Works with standard Postgres backup tools
Cons:
- No support for indexes
- Limited to PostgreSQL 12 and older in the open-source version
- Writes are slower than regular tables
- No concurrent write support
- Limited query optimization capabilities
pg_analytics (formerly pg_lakehouse)
A DuckDB-powered option by ParadeDB focused on data lake integration:
CREATE EXTENSION pg_analytics;
CREATE FOREIGN TABLE metrics (
day date,
metric_name text,
value numeric
) SERVER parquet_server
OPTIONS (files 's3://bucket/metrics/*.parquet');
Pros:
- Native integration with data lakes and object stores
- Supports Delta Lake and Apache Iceberg formats
- Query pushdown to DuckDB for performance
- Good compatibility with Postgres types
Cons:
- Read-only access currently
- Requires shared_preload_libraries configuration
- Limited write capabilities
- Early in development cycle
pg_duckdb
A relative newcomer by Hydra and MotherDuck that also embeds DuckDB in Postgres:
CREATE EXTENSION pg_duckdb;
CREATE TABLE metrics (
day date,
metric_name text,
value numeric
) USING duckdb;
Pros:
- Leverages DuckDB's high-performance analytics engine
- Can directly query data in S3/object storage
- Supports Parquet, CSV, JSON file formats
- Integrates with MotherDuck cloud service
- Active development from DuckDB team
Cons:
- Currently read-only for object storage
- Some Postgres data types not yet supported
- Relatively new (v0.2.0 as of early 2024)
- Limited production deployments
pg_mooncake
The newest kid on the block, by Mooncake Labs, combines DuckDB execution with Iceberg/Delta Lake storage and is available on Neon:
CREATE EXTENSION pg_mooncake;
CREATE TABLE metrics (
day date,
metric_name text,
value numeric
) USING columnstore;
Pros:
- Significantly faster analytics queries due to DuckDB's vectorized execution
- Better compression ratios
- Native integration with data lakes (can query Parquet files)
- Modern columnar storage format
- Active development
Cons:
- Relatively new and less battle-tested
- Limited write capabilities (no UPDATE/DELETE)
- Some Postgres features not supported yet
- More complex setup if using cloud storage
- Smaller community and fewer production deployments
When to use a columnar storage plugin
There are definitely some risks you need to consider before going down this route:
- Version Lock-in: Extensions need to match your Postgres version exactly
- Operational Complexity: Need to maintain extension updates separately
- Backup Considerations: Some backup tools may not handle extension data properly
- Performance Overhead: Extensions can add CPU and memory overhead
- Support Challenges: Limited commercial support options
Despite these challenges, columnar storage can be worth it when:
- You have large analytics tables (>100GB)
- Queries typically access a subset of columns
- You need better compression ratios
- Read performance is more important than write performance
- You can manage your own Postgres infrastructure
If you're on a managed platform that doesn't support these extensions, you might need to:
- Use regular Postgres tables with good indexing strategies
- Consider a separate analytics database
- Look at managed analytics solutions
- Evaluate moving to self-hosted Postgres
Remember: adding extensions is a significant architectural decision. Start with standard Postgres features first, and only add extensions when you have clear evidence they'll solve specific problems in your workload.
While these advanced techniques can give you a huge boost in analytics performance, they also add complexity to your system. Start with the basics (materialized views, partitioning, and analytics replicas) and only move to columnar storage when you have clear evidence it will help your specific workload.
Analytics query optimization techniques
Up to this point I’ve focused on infrastructure and schema changes. In many situations, however, smart query optimization beats brute force.
Understanding analytics query patterns
Analytics queries are fundamentally different from OLTP queries. They often:
- Scan large portions of tables
- Perform complex aggregations
- Join multiple large tables
- Use window functions for trend analysis
- Need to handle significant intermediate results
Now I’ll walk you through how to optimize each of these patterns.
Window functions
Window functions are incredibly powerful for analytics. They let you perform calculations across sets of rows without the overhead of self-joins or complex subqueries:
-- Instead of a correlated subquery like this for a 3-month rolling average
SELECT
month,
revenue,
(
SELECT AVG(revenue)
FROM monthly_revenues m2
WHERE m2.month <= m1.month
AND m2.month > m1.month - interval '3 months'
) as rolling_avg
FROM monthly_revenues m1;
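The window-function version is both clearer and cheaper, since it doesn't re-scan monthly_revenues for every output row (this assumes one row per month):
SELECT
    month,
    revenue,
    AVG(revenue) OVER (
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as rolling_avg
FROM monthly_revenues;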
Pro tip: Window functions are processed after aggregations. This means you can combine them with GROUP BY:
SELECT
date_trunc('month', timestamp) as month,
customer_segment,
SUM(revenue) as revenue,
SUM(SUM(revenue)) OVER (
PARTITION BY customer_segment
ORDER BY date_trunc('month', timestamp)
) as running_total
FROM transactions
GROUP BY 1, 2;
Common table expressions (CTEs)
Common Table Expressions (CTEs) are often seen as just a way to make queries more readable. But the way Postgres executes them also makes them a powerful optimization tool - and a double-edged sword:
-- Use CTEs to break down complex analytics
WITH monthly_stats AS (
SELECT
date_trunc('month', timestamp) as month,
COUNT(DISTINCT user_id) as users,
SUM(revenue) as revenue
FROM events
WHERE timestamp >= NOW() - interval '12 months'
GROUP BY 1
),
user_segments AS (
SELECT
user_id,
CASE
WHEN lifetime_value > 1000 THEN 'high'
WHEN lifetime_value > 100 THEN 'medium'
ELSE 'low'
END as segment
FROM users
)
SELECT
month,
segment,
COUNT(DISTINCT e.user_id) as users,
SUM(revenue) as revenue
FROM events e
JOIN user_segments s
ON e.user_id = s.user_id
GROUP BY 1, 2;
The other side of the sword: in Postgres 11 and earlier, CTEs were always materialized - calculated once and reused. Since Postgres 12, the planner inlines a CTE that's referenced only once (unless you mark it MATERIALIZED) and materializes one that's referenced multiple times. Materialization is usually great for analytics, but it can backfire if your CTE returns a large result set that the outer query only needs a small slice of, because filters can't be pushed down into a materialized CTE.
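If the planner's default isn't what you want, say so explicitly:
-- NOT MATERIALIZED lets the outer filters and aggregation push down into the CTE;
-- swap in MATERIALIZED to force compute-once-and-reuse (Postgres 12+)
WITH recent_events AS NOT MATERIALIZED (
    SELECT user_id, revenue, timestamp
    FROM events
    WHERE timestamp >= NOW() - interval '12 months'
)
SELECT date_trunc('month', timestamp) as month, SUM(revenue) as revenue
FROM recent_events
GROUP BY 1;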
Parallel query execution
Postgres can parallelize many analytics operations, but you need to tune it right:
EXPLAIN ANALYZE
SELECT
date_trunc('month', timestamp) as month,
COUNT(*) as events,
COUNT(DISTINCT user_id) as users
FROM events
GROUP BY 1;
-- Key settings for parallel queries
SET max_parallel_workers_per_gather = 4; -- Workers per query
SET parallel_setup_cost = 10; -- Lower = more parallelism
SET parallel_tuple_cost = 0.001; -- Lower = more parallelism
SET min_parallel_table_scan_size = '8MB'; -- Table size threshold
Some operations that benefit most from parallelism:
- Sequential scans on large tables
- Parallel joins (with appropriate settings)
- Large aggregations
- Window functions
Keep in mind: While parallel queries use more total CPU, they can finish much faster. Monitor your system resources carefully when enabling parallelism.
Essential EXPLAIN analysis
When optimizing analytics queries, EXPLAIN ANALYZE is your best friend. These are some of the specific things to watch for:
-- Get detailed buffering info
EXPLAIN (ANALYZE, BUFFERS)
SELECT /* your query */;
-- Look for:
-- "Parallel Seq Scan" - Is parallelism being used?
-- "Hash Join" vs "Nested Loop" - Right join strategy?
-- "Sort Method: quicksort" vs "external sort" - Enough work_mem?
-- "Rows Removed by Filter" - Could an index help?
Common analytics query pitfalls
- Unneeded ORDER BY in subqueries
-- Sorts intermediate results unnecessarily
SELECT user_id, COUNT(*)
FROM (
SELECT DISTINCT user_id
FROM events
ORDER BY timestamp
) t
GROUP BY 1;
- Missing DISTINCT ON for latest values
-- Gets all rows then filters
SELECT user_id, status
FROM user_status
WHERE (user_id, timestamp) IN (
SELECT user_id, MAX(timestamp)
FROM user_status
GROUP BY user_id
);
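The DISTINCT ON version fetches the latest row per user in one pass, and an index on (user_id, timestamp DESC) spares it the sort:
-- Gets the latest status per user directly
SELECT DISTINCT ON (user_id) user_id, status
FROM user_status
ORDER BY user_id, timestamp DESC;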
- Not using date truncation indexes
CREATE INDEX idx_events_month ON events (
date_trunc('month', timestamp)
);
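One caveat: that index works as written only on timestamp (without time zone) columns. On timestamptz, date_trunc() isn't IMMUTABLE, so Postgres rejects the expression index; a common workaround is a small immutable wrapper pinned to a zone (the function name and the choice of UTC are illustrative):
-- Immutable because the time zone is fixed inside the function
CREATE FUNCTION month_utc(ts timestamptz) RETURNS timestamp
LANGUAGE sql IMMUTABLE AS
$$ SELECT date_trunc('month', ts AT TIME ZONE 'UTC') $$;
CREATE INDEX idx_events_month_utc ON events (month_utc(timestamp));
-- Queries must then group and filter on month_utc(timestamp) to use the index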
Query optimization is iterative. Start with the basics, measure, then optimize further based on real usage patterns. And always test with realistic data volumes - what works for 1M rows might fall apart at 100M. Lastly, it’s amazing how far you can get by simply following “The 5 rules of writing faster SQL queries.”
Indexing for analytics
BRIN indexes are Postgres’ secret weapon for analytical workloads. Unlike B-tree indexes that track every single row, BRIN (Block Range INdex) creates a tiny index that stores metadata about blocks of data. Think of it like chapter summaries in a book instead of a detailed index of every word.
For time-series data or any naturally ordered columns (like IDs that only go up), BRIN shines. It's perfect when physical storage order and the indexed column are highly correlated for your data.
In Crunchy Data's testing, on a 42MB table, the B-tree indexes were 21MB while the BRIN indexes were just 24KB! And for large result sets (100K+ rows), BRIN consistently outperformed B-tree indexes.
Here's how to create one:
CREATE INDEX event_logs_ts_brin ON event_logs
USING brin(timestamp) WITH (pages_per_range = 128);
The pages_per_range parameter is key - it defaults to 128, but tuning it can really boost performance. For narrow queries (returning ~100 rows), smaller values like 8 or 16 work better. For broader analytical queries, stick with larger values.
But BRIN isn't magic. Skip it when:
- Your data is randomly distributed
- You need exact row lookups
- Your queries are highly selective (returning <1% of rows)
- Your table gets lots of UPDATEs in the middle
The sweet spot? Analytical queries that scan large chunks of naturally ordered data. Bonus points if you're tight on disk space - BRIN indexes are tiny.
Want to check if BRIN makes sense? Look at the correlation between storage order and your column (pg_stats is populated by ANALYZE):
SELECT correlation
FROM pg_stats
WHERE tablename = 'event_logs' AND attname = 'timestamp';
The closer the correlation is to 1.0 (or -1.0 for reverse-ordered data), the better BRIN will perform.
Performance tuning for analytics
No matter how well you write your queries, your Postgres instance needs proper tuning to handle analytics workloads efficiently. While I covered basic configuration in previous articles, analytics workloads have specific needs that require different trade-offs.
Memory settings
Memory configuration is where analytics workloads differ most from OLTP. You'll need larger memory allocations for complex operations, but you'll also need to be careful not to overallocate:
# Base memory settings
shared_buffers = '8GB' # More RAM for dedicated analytics instance
work_mem = '1GB' # Larger for complex sorts/joins
maintenance_work_mem = '2GB' # Helps with vacuum on large tables
effective_cache_size = '24GB' # ~75% of RAM for big queries
# Temporary file settings
temp_buffers = '512MB' # More temp memory for large sorts
temp_file_limit = '5GB' # Limit disk usage for temp files
# Don't raise this above your OS stack limit (ulimit -s) minus a safety margin
max_stack_depth = '6MB' # For deep recursive queries
I/O configuration
Analytics queries often read large amounts of data. These settings help optimize I/O:
# Sequential scan settings
effective_io_concurrency = 200 # Higher for SSDs/NVMe
random_page_cost = 1.1 # Lower for SSDs/NVMe
# Checkpoint settings
checkpoint_timeout = '30min' # Longer for analytics
max_wal_size = '4GB' # Larger for busy systems
checkpoint_completion_target = 0.9 # Spread checkpoints out
Query planning
The query planner needs different settings for analytics:
# Parallel query settings
max_parallel_workers = 8 # Total parallel workers
max_parallel_workers_per_gather = 4 # Workers per gather node
parallel_setup_cost = 10 # Lower = more parallelism
parallel_tuple_cost = 0.001 # Lower = more parallelism
# Cost settings
default_statistics_target = 500 # More statistics for better plans
from_collapse_limit = 12 # Higher for complex analytics
join_collapse_limit = 12 # Higher for complex analytics
Vacuum settings
Large analytics tables need different vacuum settings:
# Vacuum settings for large tables
autovacuum_vacuum_scale_factor = 0.1 # More aggressive vacuum
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_cost_limit = 1000 # Higher for faster vacuum
autovacuum_vacuum_cost_delay = 10ms # Lower for faster vacuum
Monitoring configuration
Monitoring becomes even more critical with analytics workloads:
# Statement tracking
track_io_timing = on # Track I/O statistics
track_functions = all # Track function calls
track_activity_query_size = 4096 # Longer for complex queries
# Logging settings
log_min_duration_statement = '10s' # Log slow queries
log_temp_files = 0 # Log all temp file usage
log_autovacuum_min_duration = 0 # Log vacuum activity
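Pair these settings with the pg_stat_statements extension (it has to be listed in shared_preload_libraries) so you can see which analytics queries are actually worth tuning. A quick starting point, assuming Postgres 13+ column names:
-- Top 10 queries by total execution time
SELECT query, calls, total_exec_time, mean_exec_time, shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;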
Real-world example
Here's what this might look like in practice. Let's say you have:
- 32 CPU cores
- 128GB RAM
- NVMe storage
- Mixed analytics workload
Your configuration might look like:
# Memory
shared_buffers = '32GB' # 25% of RAM
work_mem = '512MB' # Conservative for multiple concurrent queries
maintenance_work_mem = '4GB' # Large for vacuum operations
effective_cache_size = '96GB' # 75% of RAM
# Parallelism
max_parallel_workers = 16 # Half the cores
max_parallel_workers_per_gather = 8 # Quarter the cores
# I/O and WAL
effective_io_concurrency = 200 # High for NVMe
random_page_cost = 1.1 # Low for NVMe
checkpoint_timeout = '30min'
max_wal_size = '16GB' # Large for busy system
Don't just copy these numbers and YOLO it. Monitor your actual memory usage using tools like pg_stat_statements and adjust based on your workload. Below I’ll call out some gotchas and guidance for testing your settings.
Configuration gotchas
Watch out for these common issues:
- Memory overallocation: too high a work_mem can cause OOM kills - remember it's per operation, not per connection
- Excessive parallelism: too many parallel workers can cause thrashing - monitor CPU usage and adjust
- Aggressive vacuum: overly aggressive settings can impact query performance - monitor vacuum timing and adjust cost limits
- Checkpoint tuning: too large a max_wal_size can cause long recovery times, while too-frequent checkpoints can impact performance
Testing your configuration
Always test configuration changes:
- Start with a test environment
- Benchmark before and after
- Monitor system resources
- Test with realistic data volumes
- Test failure scenarios
Analytics workloads often reveal configuration problems that OLTP workloads miss. What works for millions of small transactions might fail spectacularly for one large analytical query.
Know your limits
Even with all these optimizations, you'll eventually hit limits running analytics on Postgres. Watch for:
- Queries taking minutes or longer
- High I/O wait times
- Memory pressure from large sorts/joins
- Vacuum struggling to keep up
- Replication lag on analytics replicas
When you see these signs, it's time to start thinking about dedicated analytics solutions. But that's a topic for an upcoming article!
Takeaways
Postgres can handle many analytics workloads if you:
- Isolate analytics traffic to dedicated replicas
- Optimize your schema and configuration
- Use materialized views strategically
- Use techniques for optimal analytics querying
- Leverage BRIN indexes where they make sense
- Monitor and maintain carefully
- Know when to quit and move to specialized tools
It's not a perfect analytics database. But with the right approach, it can take you surprisingly far. Just don't forget to keep an eye on those limits - they'll sneak up on you faster than an angry DBA during a production incident.
In the next article, I’ll walk you through when and how to move your analytics workloads off Postgres. Because sometimes, you really do need a hammer instead of that screwdriver.
Need to move your analytics off Postgres?
Tinybird is data infrastructure for software teams. You can query your Postgres tables with SQL and publish queries as dynamic, scalable REST API Endpoints to power in-product reporting, real-time dashboards, and more. You can try Tinybird for free with no time limit. Sign up here.