TL;DR
Hopthru uses TimescaleDB to cut queries from minutes to seconds, enabling real-time transit analytics on a 1 TB hypertable. Their three-person engineering team leverages continuous aggregates and tiered storage to build powerful visualizations that help transit agencies optimize public transportation networks without breaking the bank.
🔗 Jump to the full article
Public transit is the backbone of urban mobility, but how can agencies know if their routes are truly efficient? Enter Hopthru, a Seattle-based startup using time-series data to revolutionize how transit agencies analyze and optimize their networks.
We spoke with Brock Haywood, CTO at Hopthru, about how his team tackles the challenge of analyzing massive datasets from transit systems across North America. Here's how they're doing it with a tiny team and some smart database choices.
The Challenge: Making Sense of Automated Passenger Counting (APC) Data
Most buses in North America have little sensors above the doors that count passengers getting on and off. This data, combined with vehicle location information, creates a goldmine of insights—if you can process it efficiently.
"With this data, we can paint a picture for the agency: 'This is what people were doing on your network at any given time," explains Haywood. "The planners at that agency can use that to understand which routes are underserved because they have particularly high bus loads."
But there's a problem: this data accumulates quickly, and performing analytics across temporal ranges becomes painfully slow with traditional databases.
How They Built It: A Time-Series Data Pipeline for Transit
Hopthru's data ingestion process is surprisingly straightforward. Since most transit agencies don't have large IT teams, they've built a flexible process:
1. Vehicles return to depots at night and connect to WiFi.
2. Onboard computers upload their data to agency databases.
3. Hopthru's software queries these databases and uploads the data to their platform.
4. The data lands in their primary hypertable (now ~1 TB in size).
Before implementing TimescaleDB, Hopthru struggled with a standard PostgreSQL setup on Heroku that couldn't handle their analytics workload.
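In TimescaleDB, that boarding table becomes a hypertable: a regular PostgreSQL table automatically partitioned into time-based chunks. Here's a minimal sketch, assuming an invented schema (Hopthru's actual table will differ):

```sql
-- Hypothetical schema for illustration only; Hopthru's actual table will differ.
CREATE TABLE boarding_events (
    agency_id                     uuid        NOT NULL,
    route_master_id               uuid        NOT NULL,
    trip_master_id                uuid,
    service_date                  date        NOT NULL,  -- the transit "service day"
    stop_time                     timestamptz NOT NULL,  -- when the vehicle served the stop
    stop_hour                     time        NOT NULL,  -- time of day, precomputed for filtering
    ons                           integer     NOT NULL,  -- passengers boarding
    offs                          integer     NOT NULL,  -- passengers alighting
    load                          integer     NOT NULL,  -- passengers on board after the stop
    passenger_distance_travelled  double precision
);

-- Convert the plain table into a hypertable, automatically chunked by time.
SELECT create_hypertable('boarding_events', 'stop_time');
```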
The Chosen Tech Stack
Python + Django + PostgreSQL + TimescaleDB + Redis
Database Solution for Real-Time Analytics
"Timescale immediately offered a performance boost for these queries that are run over quite large ranges," Haywood explains.
For their small team, the decisive factor was the ability to run standard PostgreSQL queries on time-series data while supporting real-time analytics dashboards.
Their database implementation centers on:

- Two hypertables: the primary one for boarding data and a secondary one for a separate pipeline
- Five continuous aggregates: each focused on a query pattern their customers typically use (a hypothetical definition is sketched below)
- Compression and tiered storage: on their second database instance, to manage costs
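Hopthru's aggregate definitions aren't public, but judging from the query that follows, one of them might look roughly like this sketch over the hypothetical boarding_events table from earlier (the bucket width and grouping columns are assumptions):

```sql
-- Hypothetical continuous aggregate; column names mirror the query below,
-- but the bucketing and grouping are assumptions, not Hopthru's actual DDL.
CREATE MATERIALIZED VIEW hopthru_ca_boarding_routes_trips_time_ranges
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', stop_time)  AS bucket,
    service_date,
    stop_hour,
    agency_id,
    route_master_id,
    trip_master_id,
    sum(ons)                          AS ons,
    sum(offs)                         AS offs,
    max(load)                         AS maxload,
    sum(load)                         AS load,
    count(load)                       AS load_count,
    sum(passenger_distance_travelled) AS passenger_distance_travelled
FROM boarding_events
GROUP BY time_bucket('1 hour', stop_time), service_date, stop_hour,
         agency_id, route_master_id, trip_master_id;
```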
Here's an example of one of their critical queries against a continuous aggregate:
```sql
SELECT
    sum(ca.ons),
    sum(ca.offs),
    max(ca.maxload),
    sum(ca.load) / sum(ca.load_count) AS avg_load,
    sum(ca.passenger_distance_travelled),
    count(DISTINCT ca.service_date) AS total_days
FROM hopthru_ca_boarding_routes_trips_time_ranges AS ca
JOIN hopthru_routemaster AS rm ON ca.route_master_id = rm.id
WHERE (
        (NOT rm.hidden AND ca.trip_master_id IS NOT NULL)
        OR rm.non_gtfs
    )
    AND ca.route_master_id IN ($1::uuid)
    AND ca.service_date >= $2::date
    AND ca.service_date <= $3::date
    AND ca.agency_id = $4::uuid
    AND ca.stop_hour >= $5::time
    AND ca.stop_hour < $6::time;
```
💥 The impact?
"Implementing continuous aggregates reduced many mission-critical queries from minutes to seconds."
Managed Services for Small Engineering Teams
With just three engineers, Hopthru prioritizes application development over infrastructure management.
"In the 12 years of my startup career, I have almost exclusively worked with PostgreSQL. Having something like Timescale in the cloud allows us to focus on building the application rather than building the infrastructure," Haywood emphasizes.
For performance optimization, they combine database monitoring metrics (execution time and P95 latency) with New Relic data to pinpoint bottlenecks and reproduce issues in their user flows for debugging.
Cost Optimization with Smart Storage Strategies
For their secondary database that ingests data from sensors communicating directly via MQTT, they've implemented a clever storage strategy:
- ~2 GB of uncompressed data (recent, frequently accessed)
- ~2 GB of compressed data (less frequently accessed)
- ~1 GB in low-cost tiered storage (historical data)
"Tiered storage is incredibly effective," notes Haywood. Since they primarily query the last month of data but need to maintain the full history, this approach significantly reduces costs.
Next Challenge: Raw Sensor Data Processing
The team is now developing "Hopthru Cleanse," which bypasses pre-processed data in favor of directly ingesting and processing raw sensor data.
"We're building a pipeline that takes raw data almost directly from the sensors," explains Haywood. The technical challenge involves data validation to meet the 95% accuracy threshold required for National Transit Database certification.
This shift requires rethinking their data pipeline to handle uncorrelated sensor readings and implement statistical validation—expanding their time-series infrastructure to solve data quality problems.
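As one hedged sketch of what such validation could look like (the table, columns, and threshold here are invented, not Hopthru's actual checks), a classic APC sanity test is that boardings and alightings should roughly balance over a completed trip:

```sql
-- Hypothetical sanity check: flag trips whose total boardings and alightings
-- disagree by more than 5%. All names and the threshold are invented.
SELECT trip_id,
       sum(ons)  AS total_ons,
       sum(offs) AS total_offs
FROM raw_apc_counts
GROUP BY trip_id
HAVING abs(sum(ons) - sum(offs)) > 0.05 * greatest(sum(ons), sum(offs), 1);
```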
Lessons for Dev Teams
- Choose tools that fit your team size and skills: for a small team without database experts, the ability to use familiar PostgreSQL queries on time-series data was crucial.
- Leverage managed services: focus engineering resources on your core product, not infrastructure management.
- Design for your access patterns: understanding that most queries target recent data allowed them to implement an effective tiered storage strategy.
- Use continuous aggregates for common query patterns: identifying and pre-aggregating the most common query patterns dramatically improved performance.
- Monitor and optimize progressively: tools like Timescale Insights help identify the queries most in need of optimization.
Would you like to share your project or success story? What challenges are you facing with time-series data or real-time analytics? Let's discuss in the comments!
*This article is part of our "Community Member Spotlight" series, where we invite practitioners to share their work and inspire others with new ways to use technology to solve problems.*