Introduction
- The session covers automation of data pipelines using schedules and backfills in Kestra.
- Previously, a Postgres pipeline was created to:
- Load taxi data from a CSV file into a staging table.
- Merge it into a main table, ensuring no duplicate records.
- The focus now is on automating this process with scheduled triggers and handling historical data gaps with backfills.
Scheduling in Kestra
- A schedule is implemented using triggers.
- Instead of manually specifying the month and year, the trigger automatically determines these values.
- The workflow includes:
- Two triggers set up at different times (9 AM and 10 AM on the 1st of each month).
- Configuration using a cron expression that determines when each trigger fires (see the sketch after this list).
- The schedule ensures data is retrieved automatically each month as new New York taxi data becomes available.
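A minimal sketch of what such triggers could look like in the flow YAML; the trigger ids, the `taxi` input, and the plugin type path (which varies between Kestra versions) are assumptions, while the cron times follow the 9 AM / 10 AM schedule described above. Inside the flow, `{{ trigger.date }}` then supplies the scheduled month and year, so they no longer need to be passed manually.

```yaml
triggers:
  # Fires at 09:00 on the 1st of every month (cron fields: minute hour day-of-month month day-of-week)
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  # Fires at 10:00 on the 1st of every month for the second dataset
  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow
```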
Understanding Backfills
- Problem: The available data is from 2019 and 2020, but the pipeline is set up to run from 2025 onwards.
- Solution: Use backfills to execute past-due schedules and fill in missing data.
- Backfills allow running the pipeline for past dates by selecting a start date (Jan 1, 2019) and an end date (Dec 31, 2019); the sketch after this list shows how each backfilled run resolves its month.
- Important considerations:
- The scheduled execution time (9 AM on the 1st of each month) must be adhered to.
- Setting a backfill label helps differentiate historical runs from real-time executions.
- Execution logs show which dataset (e.g., Green Taxi Data) is being backfilled.
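To illustrate how a backfilled execution picks up the right month, here is a sketch of flow variables built from `trigger.date`; the file-name pattern, the table names, and the `taxi` input are illustrative assumptions rather than the exact flow from the lesson.

```yaml
variables:
  # trigger.date is the *scheduled* date, so a backfill run for 2019-01-01
  # resolves to the January 2019 file even though it actually executes today.
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"
  staging_table: "public.{{ inputs.taxi }}_trip_data_staging"
  table: "public.{{ inputs.taxi }}_trip_data"
```

When the backfill is started from the UI, a label (for example `backfill: true`) can be attached so historical executions are easy to tell apart from the regular monthly runs.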
Managing Concurrency and Staging Tables
- By default, a single staging table is used, limiting execution to one month at a time.
- To run multiple backfills simultaneously, consider:
- Creating separate staging tables for each month.
- Using date triggers to generate dynamic table names.
- Alternatively, setting a flow concurrency limit of one so executions queue instead of conflicting (see the sketch after this list).
- Potential issues:
- Truncation conflicts: if multiple backfill runs truncate and load the same staging table at the same time, data can be lost or corrupted.
- Large number of tables: Long-running backfills may create excessive tables, which may need to be dropped afterward.
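A sketch of the two options mentioned above, with illustrative values: a flow-level concurrency limit that queues overlapping runs, or a per-month staging table name derived from the trigger date.

```yaml
# Option 1: only one execution of this flow runs at a time; extra backfill
# runs queue up instead of truncating the shared staging table mid-load.
concurrency:
  limit: 1

# Option 2: a separate staging table per month (e.g. green_trip_data_staging_2019_01)
# so runs can overlap; these tables need to be dropped once the backfill is done.
variables:
  staging_table: "public.{{ inputs.taxi }}_trip_data_staging_{{ trigger.date | date('yyyy_MM') }}"
```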
Verifying Backfill Execution
- After execution, verify data integrity by:
- Checking pgAdmin for new tables (e.g., green_trip_data and green_trip_data_staging).
- Ensuring all 12 months of data have been processed correctly.
- Validating the row count (e.g., ~6 million rows for 2019); a query sketch follows this list.
- Confirming unique row IDs exist despite some missing vendor IDs.
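As a quick integrity check after the backfill, a query along these lines could be run in pgAdmin or as a Postgres query task inside the flow. The connection details, the pickup-datetime column, the `unique_row_id` column, and the task properties (which vary by plugin version) are assumptions based on the checks listed above.

```yaml
tasks:
  - id: verify_backfill
    type: io.kestra.plugin.jdbc.postgresql.Query
    url: jdbc:postgresql://localhost:5432/postgres-db   # assumed connection details
    username: root
    password: root
    fetchType: FETCH
    sql: |
      -- Expect 12 rows (one per month of 2019), roughly 6 million rows in total,
      -- and unique_ids equal to row_count if every row id is distinct.
      SELECT date_trunc('month', lpep_pickup_datetime) AS month,
             COUNT(*)                                  AS row_count,
             COUNT(DISTINCT unique_row_id)             AS unique_ids
      FROM public.green_trip_data
      WHERE lpep_pickup_datetime >= '2019-01-01'
        AND lpep_pickup_datetime <  '2020-01-01'
      GROUP BY 1
      ORDER BY 1;
```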
Future Enhancements
- Automating pipelines further:
- Using schedules to process new data as it arrives.
- Running backfills dynamically to fill in historical data.
- Integrating DBT (Data Build Tool) with Postgres to improve workflow automation.
- Upcoming lessons will focus on enhancing the efficiency and scalability of these automated workflows.
Conclusion
- Kestra scheduling and backfills help automate data pipelines effectively.
- Triggers allow data ingestion without manual intervention.
- Backfills help populate missing historical data.
- Proper concurrency management prevents conflicts in data processing.
- Future improvements will involve DBT integration and more robust automation techniques.