This week, I took a deep dive into workflow orchestration with Kestra as part of the Data Engineering Zoomcamp by DataTalks.Club. It’s been an insightful journey, even though it started off a bit rough with me battling malaria. Despite that, I pushed through, and I’m really glad I did!
What is Kestra?
Kestra is an open-source, event-driven orchestration platform designed to simplify building scheduled and event-driven workflows. It applies Infrastructure as Code (IaC) practices, so a reliable workflow can be declared in a few lines of YAML. Think of it as an alternative to Airflow in which flows are declared in YAML rather than written as Python code. I’m used to Airflow, but I wanted to follow along with Kestra to add another orchestration tool to my repertoire.
Hands-On with Kestra
Setting up Kestra was straightforward thanks to Docker Compose. With just a few commands, I had a Kestra server and a Postgres database up and running locally.
Once everything was running, Kestra’s UI at http://localhost:8080 made managing workflows a breeze. A huge shoutout to Will Russell and his videos, which were instrumental in helping me navigate Kestra’s features.
Building ETL Pipelines
One of the main tasks this week was building ETL pipelines for NYC’s Yellow and Green Taxi data. Here’s what we did:
- Extracted data from CSV files.
- Loaded it into Postgres and later into Google Cloud Storage (GCS) and BigQuery.
- Explored scheduling and backfilling workflows.
It was fascinating to see how easily Kestra could schedule tasks and backfill historical data. The YAML configurations were easy to read, which made each flow’s logic straightforward to follow.
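To make the backfilling idea concrete, here is a minimal Python sketch of what a monthly backfill does conceptually. It is not Kestra’s API or the course’s flow code: SQLite stands in for Postgres, and the `trips` table, its columns, and the `fake_fetch` helper are all hypothetical names I made up for illustration.

```python
import csv
import io
import sqlite3
from datetime import date


def month_starts(start: date, end: date):
    """Yield the first day of every month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def extract(csv_text: str):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def load(conn, month: str, rows):
    """Replace one month's rows, so re-running a month does not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trips (month TEXT, vendor_id TEXT, fare REAL)"
    )
    conn.execute("DELETE FROM trips WHERE month = ?", (month,))
    conn.executemany(
        "INSERT INTO trips (month, vendor_id, fare) VALUES (?, ?, ?)",
        [(month, r["vendor_id"], float(r["fare"])) for r in rows],
    )
    conn.commit()


def backfill(conn, start: date, end: date, fetch):
    """Run extract + load once per month; fetch(month) must return CSV text."""
    done = []
    for first in month_starts(start, end):
        month = first.strftime("%Y-%m")
        load(conn, month, extract(fetch(month)))
        done.append(month)
    return done


def fake_fetch(month: str) -> str:
    # Hypothetical stand-in for downloading that month's taxi CSV file.
    return "vendor_id,fare\n1,10.5\n2,7.25\n"


conn = sqlite3.connect(":memory:")
months = backfill(conn, date(2019, 11, 1), date(2020, 2, 1), fake_fetch)
# Running the same backfill again should leave the row count unchanged.
backfill(conn, date(2019, 11, 1), date(2020, 2, 1), fake_fetch)
row_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(months, row_count)
```

The delete-then-insert step in `load` is the design choice worth noticing: a backfill often gets re-run, so each monthly run should be safe to repeat. In an orchestrator, the schedule trigger supplies the date for each run, so you do not write the loop yourself.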
Using dbt for Transformation
We also touched on dbt for transforming data in Postgres and BigQuery. Although it was optional, I gave it a shot to see how Kestra handles dbt models. It’s pretty cool that Kestra can sync dbt models from Git and execute them, streamlining the transformation process.
Taking it to the Cloud
Moving the ETL pipelines from a local Postgres database to Google Cloud Platform (GCP) was another highlight. Using GCS as a data lake and BigQuery as a data warehouse felt like a natural progression from local development to scalable cloud infrastructure.
Wrapping Up
I’m thrilled to add Kestra to my orchestration toolkit. It’s been a rewarding week, and I’m happy to see how versatile this tool is.
Looking forward to Week 3 and sharing more of my journey with you all. If you’re also taking the DE Zoomcamp, let’s connect and learn together!
Check out my code for Week 2 on GitHub and feel free to drop your thoughts in the comments!