TL;DR
Hopthru uses TimescaleDB to cut queries from minutes to seconds, enabling real-time transit analytics on a 1 TB hypertable. Their three-person engineering team leverages continuous aggregates and tiered storage to build powerful visualizations that help transit agencies optimize public transportation networks without breaking the bank.
🔗 Jump to the full article
Public transit is the backbone of urban mobility, but how can agencies know if their routes are truly efficient? Enter Hopthru, a Seattle-based startup using time-series data to revolutionize how transit agencies analyze and optimize their networks.
We spoke with Brock Haywood, CTO at Hopthru, about how his team tackles the challenge of analyzing massive datasets from transit systems across North America. Here's how they're doing it with a tiny team and some smart database choices.
The Challenge: Making Sense of Automated Passenger Counting (APC) Data
Most buses in North America have little sensors above the doors that count passengers getting on and off. This data, combined with vehicle location information, creates a goldmine of insights—if you can process it efficiently.
"With this data, we can paint a picture for the agency: 'This is what people were doing on your network at any given time," explains Haywood. "The planners at that agency can use that to understand which routes are underserved because they have particularly high bus loads."
But there's a problem: this data accumulates quickly, and performing analytics across temporal ranges becomes painfully slow with traditional databases.
How They Built It: A Time-Series Data Pipeline for Transit
Hopthru's data ingestion process is surprisingly straightforward. Since most transit agencies don't have large IT teams, they've built a flexible process:
1. Vehicles return to depots at night and connect to WiFi.
2. Onboard computers upload their data to agency databases.
3. Hopthru's software queries these databases and uploads the data to their platform.
4. The data lands in their primary hypertable (now ~1 TB in size).
Before implementing TimescaleDB, Hopthru struggled with a standard PostgreSQL setup on Heroku that couldn't handle their analytics workload.
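In TimescaleDB, that boarding table becomes a hypertable: a regular PostgreSQL table automatically partitioned into time-based chunks. Here's a minimal sketch, assuming an invented schema (Hopthru's actual table will differ):

```sql
-- Hypothetical schema for illustration only; Hopthru's actual table will differ.
CREATE TABLE boarding_events (
    agency_id                     uuid        NOT NULL,
    route_master_id               uuid        NOT NULL,
    trip_master_id                uuid,
    service_date                  date        NOT NULL,  -- the transit "service day"
    stop_time                     timestamptz NOT NULL,  -- when the vehicle served the stop
    stop_hour                     time        NOT NULL,  -- time of day, precomputed for filtering
    ons                           integer     NOT NULL,  -- passengers boarding
    offs                          integer     NOT NULL,  -- passengers alighting
    load                          integer     NOT NULL,  -- passengers on board after the stop
    passenger_distance_travelled  double precision
);

-- Convert the plain table into a hypertable, automatically chunked by time.
SELECT create_hypertable('boarding_events', 'stop_time');
```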
The Chosen Tech Stack
Python + Django + PostgreSQL + TimescaleDB + Redis
Database Solution for Real-Time Analytics
"Timescale immediately offered a performance boost for these queries that are run over quite large ranges," Haywood explains.
For their small team, the decisive factor was the ability to run standard PostgreSQL queries on time-series data while supporting real-time analytics dashboards.
Their database implementation centers on:

- Two hypertables: the primary one for boarding data and a secondary one for a separate pipeline
- Five continuous aggregates: each focused on a query pattern their customers typically use (a hypothetical definition is sketched below)
- Compression and tiered storage: on their second database instance, to manage costs
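Hopthru's aggregate definitions aren't public, but judging from the query that follows, one of them might look roughly like this sketch over the hypothetical boarding_events table from earlier (the bucket width and grouping columns are assumptions):

```sql
-- Hypothetical continuous aggregate; column names mirror the query below,
-- but the bucketing and grouping are assumptions, not Hopthru's actual DDL.
CREATE MATERIALIZED VIEW hopthru_ca_boarding_routes_trips_time_ranges
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', stop_time)  AS bucket,
    service_date,
    stop_hour,
    agency_id,
    route_master_id,
    trip_master_id,
    sum(ons)                          AS ons,
    sum(offs)                         AS offs,
    max(load)                         AS maxload,
    sum(load)                         AS load,
    count(load)                       AS load_count,
    sum(passenger_distance_travelled) AS passenger_distance_travelled
FROM boarding_events
GROUP BY time_bucket('1 hour', stop_time), service_date, stop_hour,
         agency_id, route_master_id, trip_master_id;
```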
Here's an example of one of their critical queries against a continuous aggregate:
```sql
SELECT
    sum(ca.ons),
    sum(ca.offs),
    max(ca.maxload),
    sum(ca.load) / sum(ca.load_count) AS avg_load,
    sum(ca.passenger_distance_travelled),
    count(DISTINCT ca.service_date) AS total_days
FROM hopthru_ca_boarding_routes_trips_time_ranges AS ca
JOIN hopthru_routemaster AS rm ON ca.route_master_id = rm.id
WHERE (
        (NOT rm.hidden AND ca.trip_master_id IS NOT NULL)
        OR rm.non_gtfs
    )
    AND ca.route_master_id IN ($1::uuid)
    AND ca.service_date >= $2::date
    AND ca.service_date <= $3::date
    AND ca.agency_id = $4::uuid
    AND ca.stop_hour >= $5::time
    AND ca.stop_hour < $6::time;
```
💥 The impact?
"Implementing continuous aggregates reduced many mission-critical queries from minutes to seconds."
Managed Services for Small Engineering Teams
With just three engineers, Hopthru prioritizes application development over infrastructure management.
"In the 12 years of my startup career, I have almost exclusively worked with PostgreSQL. Having something like Timescale in the cloud allows us to focus on building the application rather than building the infrastructure," Haywood emphasizes.
For performance optimization, they combine database monitoring metrics (execution time and P95 latency) with New Relic data to pinpoint bottlenecks and reproduce issues in their user flows for debugging.
Cost Optimization with Smart Storage Strategies
For their secondary database that ingests data from sensors communicating directly via MQTT, they've implemented a clever storage strategy:
- ~2 GB of uncompressed data (recent, frequently accessed)
- ~2 GB of compressed data (less frequently accessed)
- ~1 GB in low-cost tiered storage (historical data)
"Tiered storage is incredibly effective," notes Haywood. Since they primarily query the last month of data but need to maintain the full history, this approach significantly reduces costs.
Next Challenge: Raw Sensor Data Processing
The team is now developing "Hopthru Cleanse," which bypasses pre-processed data in favor of directly ingesting and processing raw sensor data.
"We're building a pipeline that takes raw data almost directly from the sensors," explains Haywood. The technical challenge involves data validation to meet the 95% accuracy threshold required for National Transit Database certification.
This shift requires rethinking their data pipeline to handle uncorrelated sensor readings and implement statistical validation—expanding their time-series infrastructure to solve data quality problems.
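As one hedged sketch of what such validation could look like (the table, columns, and threshold here are invented, not Hopthru's actual checks), a classic APC sanity test is that boardings and alightings should roughly balance over a completed trip:

```sql
-- Hypothetical sanity check: flag trips whose total boardings and alightings
-- disagree by more than 5%. All names and the threshold are invented.
SELECT trip_id,
       sum(ons)  AS total_ons,
       sum(offs) AS total_offs
FROM raw_apc_counts
GROUP BY trip_id
HAVING abs(sum(ons) - sum(offs)) > 0.05 * greatest(sum(ons), sum(offs), 1);
```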
Lessons for Dev Teams
- Choose tools that fit your team size and skills: for a small team without database experts, the ability to use familiar PostgreSQL queries on time-series data was crucial.
- Leverage managed services: focus engineering resources on your core product, not infrastructure management.
- Design for your access patterns: understanding that most queries target recent data allowed them to implement an effective tiered storage strategy.
- Use continuous aggregates for common query patterns: identifying and pre-aggregating the most common query patterns dramatically improved performance.
- Monitor and optimize progressively: tools like Timescale Insights help identify the queries most in need of optimization.
Would you like to share your project or success story? What challenges are you facing with time-series data or real-time analytics? Let's discuss in the comments!
*This article is part of our "Community Member Spotlight" series, where we invite practitioners to share their work and inspire others with new ways to use technology to solve problems.*