Study Notes 2.2.4: Managing Scheduling and Backfills with Postgres in Kestra

Introduction

  • The session covers automation of data pipelines using schedules and backfills in Kestra.
  • Previously, a Postgres pipeline was created to:
    • Load taxi data from a CSV file into a staging table.
    • Merge the staging data into a main table without creating duplicate records (recapped in the sketch after this list).
  • The focus now is on automating this process with scheduled triggers and handling historical data gaps with backfills.
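
As a refresher, that merge step deduplicates by inserting only rows whose hash-based unique row ID is not already in the main table. A minimal sketch as a Kestra Postgres task, assuming illustrative connection details and column names (MERGE requires Postgres 15+):

```yaml
- id: merge_data
  type: io.kestra.plugin.jdbc.postgresql.Queries
  url: jdbc:postgresql://localhost:5432/postgres  # adjust to your setup
  username: root       # hypothetical credentials
  password: root
  sql: |
    -- Insert only rows whose unique_row_id is absent from the main table
    MERGE INTO public.green_trip_data AS t
    USING public.green_trip_data_staging AS s
      ON t.unique_row_id = s.unique_row_id
    WHEN NOT MATCHED THEN
      INSERT (unique_row_id, filename, vendor_id, pickup_datetime)
      VALUES (s.unique_row_id, s.filename, s.vendor_id, s.pickup_datetime);
```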

Scheduling in Kestra

  • A schedule is implemented using triggers.
  • Instead of manually specifying the month and year, the trigger automatically determines these values.
  • The workflow includes:
    • Two triggers set to fire at different times (9 AM and 10 AM on the 1st of each month).
    • Configuration using a cron expression (see the sketch after this list).
    • The schedule ensures new taxi data is retrieved automatically as it becomes available each month from New York taxis.
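
A minimal sketch of what those triggers might look like in a flow definition, with an illustrative flow id, namespace, and input name:

```yaml
id: postgres_taxi_scheduled
namespace: zoomcamp

inputs:
  - id: taxi
    type: SELECT
    values: [green, yellow]
    defaults: green

variables:
  # trigger.date supplies the schedule date, so month and year
  # no longer need to be entered manually
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"

triggers:
  # 9 AM on the 1st of each month for green taxi data
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  # 10 AM on the 1st of each month for yellow taxi data
  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow
```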

Understanding Backfills

  • Problem: The available data is from 2019 and 2020, but the pipeline is set up to run from 2025 onwards.
  • Solution: Use backfills to execute past-due schedules and fill in missing data.
  • Backfills allow running the pipeline for past dates by selecting a start date (Jan 1, 2019) and an end date (Dec 31, 2019).
  • Important considerations:
    • Backfilled runs still follow the cron schedule, so each execution corresponds to 9 AM on the 1st of a month within the selected range (illustrated below).
    • Setting a backfill label helps differentiate historical runs from real-time executions.
    • Execution logs show which dataset (e.g., Green Taxi Data) is being backfilled.
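
Concretely, backfilling Jan 1 through Dec 31, 2019 against the 9 AM cron above generates twelve executions, and in each one trigger.date resolves to the historical schedule date, so the same variable logic selects the right month's file (file names are illustrative):

```yaml
# With cron "0 9 1 * *" and a backfill range of 2019-01-01 to 2019-12-31,
# Kestra creates one execution per matching schedule date:
#   2019-01-01T09:00 -> green_tripdata_2019-01.csv
#   2019-02-01T09:00 -> green_tripdata_2019-02.csv
#   ...
#   2019-12-01T09:00 -> green_tripdata_2019-12.csv
variables:
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"
```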

Managing Concurrency and Staging Tables

  • By default, a single staging table is used, limiting execution to one month at a time.
  • To handle multiple backfill executions safely, consider:
    • Creating a separate staging table for each month.
    • Using the trigger date to generate dynamic staging table names.
    • Alternatively, setting a concurrency limit of one so executions queue instead of conflicting (see the sketch after this list).
  • Potential issues:
    • Truncation conflicts: if multiple executions truncate and reload the same staging table at once, data can be lost or corrupted.
    • Large number of tables: long-running backfills with per-month staging tables can create many tables, which may need to be dropped afterward.
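
A sketch of both options at the flow level; the concurrency block is standard Kestra syntax, while the per-month staging name is a hypothetical variation on the course's single-table setup:

```yaml
concurrency:
  # Allow only one execution of this flow at a time, so backfilled
  # months queue up instead of truncating the shared staging table
  # out from under each other
  limit: 1

variables:
  # Hypothetical alternative: derive a per-month staging table from the
  # trigger date so several months can load in parallel; these tables
  # need to be dropped once the backfill finishes
  staging_table: "green_trip_data_staging_{{ trigger.date | date('yyyy_MM') }}"
```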

Verifying Backfill Execution

  • After execution, verify data integrity by:
    • Checking pgAdmin for the new tables (e.g., green_trip_data and green_trip_data_staging).
    • Ensuring all 12 months of data were processed correctly.
    • Validating the row count (e.g., ~6 million rows for 2019).
    • Confirming that every row has a unique row ID, even where vendor IDs are missing (example queries below).
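
Queries along these lines can be run directly in pgAdmin, or wrapped in a Kestra Postgres task as sketched here (connection details and column names are assumptions):

```yaml
- id: verify_backfill
  type: io.kestra.plugin.jdbc.postgresql.Queries
  url: jdbc:postgresql://localhost:5432/postgres  # adjust to your setup
  username: root       # hypothetical credentials
  password: root
  fetchType: FETCH
  sql: |
    -- Expect roughly 6 million rows for green taxi 2019
    SELECT COUNT(*) FROM public.green_trip_data;
    -- Every row should carry a distinct unique_row_id,
    -- even when vendor_id is missing
    SELECT COUNT(*) AS total_rows,
           COUNT(DISTINCT unique_row_id) AS distinct_ids
    FROM public.green_trip_data;
```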

Future Enhancements

  • Automating pipelines further:
    • Using schedules to process new data as it arrives.
    • Running backfills dynamically to fill in historical data.
  • Integrating dbt (data build tool) with Postgres to improve workflow automation.
  • Upcoming lessons will focus on enhancing the efficiency and scalability of these automated workflows.

Conclusion

  • Kestra scheduling and backfills help automate data pipelines effectively.
  • Triggers allow data ingestion without manual intervention.
  • Backfills help populate missing historical data.
  • Proper concurrency management prevents conflicts in data processing.
  • Future improvements will involve dbt integration and more robust automation techniques.
