Introduction
Welcome back to my blog! 😁 This is where I talk to myself—and hopefully, to you—about the engineering problems I solve at work. I do this mainly because finding solutions excites me. My journey of identifying inefficiencies, bottlenecks, and challenges has led me to tackle a common yet critical problem in software engineering.
That problem is the need to execute actions asynchronously—often with precise timing and sometimes on a recurring basis. Following my core approach to problem-solving (across space and time), I decided to build a solution that wasn’t just tailored to a single action but was extendable to various use cases. Whether it's sending notifications, processing transactions, or triggering system workflows, many tasks require scheduled execution. Without a robust scheduling mechanism, handling these jobs efficiently can quickly become complex, unreliable, and costly.
To address this, I set out to build a scalable, reliable, and cost-effective event scheduler—one that could manage delayed, immediate and recurring actions seamlessly.
In this article, I’ll walk you through:
- The problem that led to the need for an event scheduler
- The functional and non-functional requirements for an ideal solution
- The system design and architecture decisions behind the implementation
By the end, you’ll have a clear understanding of how to build a serverless scheduled actions system that ensures accuracy, durability, and scalability while keeping costs in check. Let’s dive in!
The Problem: Managing Subscription Changes Across a Calendar Cycle
Subscription management comes with unique challenges, especially when handling cancellations or downgrades 😭. Users can request these changes at any time during their billing cycle, but due to the prepaid nature of subscriptions, such modifications can only take effect at the end of the cycle. This delay introduces a need for asynchronous execution—a system that can record these requests immediately but defer their execution until the appropriate time.
The Solution: A Proper Scheduling Mechanism
Without a proper scheduling mechanism, managing these deferred actions efficiently becomes complex. The system must ensure that every request is executed at the right time while preventing missed or duplicate actions. Furthermore, frequent executions—such as batch processing of multiple scheduled changes—must be handled without overwhelming the system. To address this, we needed a reliable, scalable, and cost-effective scheduler capable of handling delayed and recurring execution seamlessly.
Functional Requirements: Defining the Core Capabilities
A robust and scalable scheduled actions system must be able to efficiently schedule, execute, update, monitor, and retry actions while ensuring reliability and flexibility.
1. Scheduling and Creating Actions
The system must allow users to schedule actions with:
- Action type, execution time, execution data, and metadata as required fields.
- Optional fields like repeat, frequency, and executionRemainder (the number of remaining executions).
- Early validation to ensure actions conform to their fulfillment requirements.
2. Updating or Deleting an Action
Users can update or delete an action before it is locked (2 minutes before execution). Once locked, no external changes are allowed.
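As a sketch, the lock check reduces to a simple time comparison, assuming the same two-minute constant the stream-processing code uses later (the guard function name here is hypothetical):

```typescript
// Hypothetical guard for the update/delete handlers: an action is locked
// once its execution time falls within the two-minute buffer window.
const TWO_MINUTES_IN_MS = 2 * 60 * 1000;

const isLocked = (executionTime: number): boolean =>
  executionTime - Date.now() <= TWO_MINUTES_IN_MS;

// Example: an update handler would reject the request when isLocked(...) is true.
```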
3. Action Status Management
Each action must have an internally managed status that reflects its execution progress. Status transitions and results must be logged in metadata for tracking.
4. Action Fulfillment Mapping
Every action must map to a specific fulfillment service responsible for its execution. Actions without a matching fulfillment service must be flagged to prevent execution errors.
5. Retrying Failed Actions
Failed actions must retry using exponential backoff to handle temporary failures. Actions that exceed the maximum retry limit must be flagged for manual intervention.
6. Handling Immediate vs. Delayed Actions vs. Repeated Actions
The system must distinguish between immediate and delayed actions to ensure timely execution:
- Immediate actions (execution within 2 minutes) must be processed in real-time without scheduling delays.
- Delayed actions (execution after 2 minutes) must be scheduled and processed at the correct time.
- Repeated actions must be processed the required number of times, at the required frequency.
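Pulling requirements 1 and 6 together, here is a minimal validation sketch for the create path. The field names follow the schema described later; the exact checks and error messages are assumptions, and the project's real validation also verifies fulfillment requirements per action type:

```typescript
// Hypothetical shape of a create-action request.
interface CreateActionRequest {
  action: string;                     // required: action type
  executionTime: number;              // required: epoch milliseconds
  data: Record<string, unknown>;      // required: execution-specific details
  metadata: Record<string, unknown>;  // required: tracking details
  repeat?: boolean;                   // optional: recurring flag
  frequency?: string;                 // optional: e.g. "DAILY"
  executionRemainder?: number;        // optional: remaining executions
}

// Early validation sketch: reject malformed requests before they are stored.
const validateCreateAction = (req: CreateActionRequest): void => {
  if (!req.action || !req.executionTime || !req.data) {
    throw new Error("action, executionTime, and data are required.");
  }
  if (req.executionTime < Date.now()) {
    throw new Error("executionTime must be in the future.");
  }
  // A repeating action must also carry a frequency and a positive remainder.
  if (req.repeat && (!req.frequency || !req.executionRemainder || req.executionRemainder <= 0)) {
    throw new Error("Repeating actions need a frequency and a positive executionRemainder.");
  }
};
```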
Non-Functional Requirements (NFR): Ensuring a Reliable and Scalable System
A scheduled actions system must meet key NFRs to guarantee reliability, scalability, security, and maintainability.
1. Reliability and Durability
- Actions must execute correctly and on time (±2 minutes).
- Repeating actions must execute exactly as scheduled with the correct frequency.
- Failed actions must retry with exponential backoff, and non-repeating actions must execute only once.
2. Scalability
- The system must scale dynamically to handle high request loads.
- A serverless architecture ensures cost-efficiency and flexibility.
- A queue-based approach (e.g., AWS SQS) must regulate execution frequency to prevent overloading downstream services.
3. Availability
- The system must always be available, with no cold starts, ensuring immediate execution when needed. A serverless architecture supports this at a reasonable cost.
4. Security
- Signature-based validation must secure requests and prevent unauthorized execution.
5. Maintainability
- The system must be modular, encapsulated, and organized within a single repository.
- Infrastructure and database indexing rules must be codified.
- A typed language must be used for better reliability.
- Local testing must be enabled with encrypted environment variables.
- A startup script can automate package installation and environment setup.
- Comprehensive tests must ensure safe changes and integration.
6. Observability
- API endpoints must expose:
  - All scheduled actions.
  - Actions filtered by status.
  - Retry functionality for failed actions.
- A centralized logging system must track execution issues consistently.
Tools: Powering the Scheduled Actions System
AWS Lambda: Serverless Compute for Execution
- Enables event-driven execution without managing servers.
- Handles action scheduling and validation.
- Processes immediate actions using real-time event streams.
- Executes delayed actions at the scheduled time.
- Manages fulfillment tasks based on the action type.
Amazon EventBridge: Managing Scheduled Execution
- Acts as a scheduler for delayed actions.
- Polls for due pending actions every 5 minutes and enqueues them for processing.
- Ensures execution happens within ±2 minutes of the scheduled time.
Amazon SQS: Queueing Actions for Scalability
- Decouples execution workloads by handling scheduled actions asynchronously.
- Controls fulfillment request frequency to prevent system overload.
- Uses FIFO (First-In-First-Out) processing to maintain execution order and prevent duplicate executions.
Amazon DynamoDB: Storing Scheduled Actions
- Serves as the primary database for storing scheduled actions.
- Provides fast read/write operations for handling high workloads.
- Stores metadata for tracking execution status, retries, and results.
- Uses DynamoDB Streams to trigger immediate executions.
Amazon API Gateway: Exposing Endpoints for Management
- Provides HTTP endpoints for creating, updating, and deleting scheduled actions.
- Exposes monitoring endpoints to retrieve actions by status and retry failed actions.
- Ensures secure access with authentication and authorization mechanisms.
System Design: Database Schema for Scheduled Actions
| Field | Description |
| --- | --- |
| id | Unique identifier for each scheduled action. |
| data | Stores execution-specific details. |
| action | Defines the type of action to execute. |
| executionTime | Specifies when the action should run. |
| repeat | Indicates if the action should repeat. |
| frequency | Defines the interval for recurring actions. |
| executionRemainder | Tracks the remaining number of executions. |
| status | Execution state ("PENDING", "IN_PROGRESS", "COMPLETED", "FAILED"). |
| createdAt | Timestamp when the action was created. |
| updatedAt | Last modified timestamp. |
| retryCount | Counts failed execution retries. |
| metadata | Stores logs and additional execution details. |
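In TypeScript, this schema maps naturally to an interface. A sketch: the status union mirrors the states above, plus the NO_ACTION and ttl fields that appear later in the processing logic:

```typescript
type ActionStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED" | "NO_ACTION";

interface ScheduledAction {
  id: string;                          // unique identifier
  data: Record<string, unknown>;       // execution-specific details
  action: string;                      // action type to execute
  executionTime: number;               // epoch milliseconds
  repeat?: boolean;                    // whether the action recurs
  frequency?: string;                  // e.g. "DAILY"
  executionRemainder?: number;         // remaining number of executions
  status: ActionStatus;
  createdAt: number;                   // creation timestamp
  updatedAt: number;                   // last-modified timestamp
  retryCount: number;                  // failed execution retries
  ttl?: number;                        // DynamoDB TTL, set on completion
  metadata?: Record<string, unknown>;  // logs and additional details
}
```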
Example: Scheduled Notification Action
{
"data": {
"mobile": "60123456789",
"subject": "Test",
"name": "Joojo",
"templateType": "USER_LATE_PAYMENT_NOTIFICATION",
"notificationType": "SMS"
},
"repeat": true,
"frequency": "DAILY",
"executionRemainder": 5,
"action": "SEND_NOTIFICATION",
"executionTime": 1736930117120
}
Project structure
I used a layered-modular approach for maintainability, scalability, and ease of change. Different teams often need to extend a service without introducing unintended side effects, and I tried to achieve this by organizing components into distinct modules. Let's dive deeper below.
1. Single Application with a Modular Design
The entire system is built as a single application, but with a modular structure that separates concerns. Each module is responsible for a specific aspect of the system, making the codebase easier to navigate and modify.
./src
├── app.ts
├── clients
├── config
├── controllers
├── handlers
├── helpers
├── middleware
├── models
├── routes
├── service
├── types
└── utils
2. Serverless Handlers for Distributed Execution
The project is designed around AWS Lambda, with different handlers exported and structured to allow seamless execution of scheduled actions. These handlers ensure that various tasks are processed independently, improving fault tolerance and scalability.
- Action Handlers: Manage creating, scheduling, retrieving, updating, deleting, and processing scheduled actions. This keeps all action-related logic centralized, making it easy to modify without affecting other parts of the system.
- Delayed Action Handlers: Specifically handle actions that need to be initiated at a later time. This separation ensures that delayed actions are efficiently scheduled and processed without interfering with real-time execution.
- Immediate Action Handlers: Trigger execution for actions that must start within 2 minutes, using DynamoDB Streams to detect changes and initiate execution instantly. This ensures timely processing of urgent tasks.
- Fulfillment Handlers: Ensure that scheduled actions are executed properly by interacting with the appropriate fulfillment services. This design allows fulfillment logic to evolve independently of action scheduling.
├── handlers
│ ├── fulfillment.ts
│ ├── initiate-scheduled-actions.ts
│ ├── initiate-stream-actions.ts
│ └── process.ts
├── http-apis.ts
3. Maintainability through Separation of Concerns
Each module in the project is self-contained, meaning changes to one component do not directly impact others. This reduces the risk of breaking existing functionality and simplifies debugging.
- Controllers handle request routing and execution logic.
- Services manage business logic and data interactions.
- Clients interact with external services like databases, queues, and APIs.
- Models define the data structures used across the system.
- Middleware ensures that requests pass through validation and authentication layers.
- Utilities provide reusable helper functions for logging, error handling, and retries.
./src
├── clients
├── controllers
├── middleware
├── models
├── service
└── utils
4. Ease of Extensibility
With a modular design, new features can be added without modifying core components. For example:
- A new type of scheduled action can be introduced by adding a new action in the fulfillment service without modifying the existing scheduling or queuing logic.
- A new external service integration can be implemented by extending the clients module, ensuring seamless communication with third-party systems.
Delayed Execution: Ensuring Timely Execution
The system efficiently processes scheduled actions through periodic execution, ensuring that all pending actions are executed at the right time without delays.
Periodic Execution for Scheduled Actions
- A Lambda function periodically scans the database for actions with PENDING status and an executionTime that is due.
- Amazon EventBridge acts as a scheduler, triggering this Lambda function every 5 minutes to ensure that actions are picked up on time.
- The function enqueues these pending actions into Amazon SQS, ensuring a reliable and scalable execution pipeline.
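A sketch of that handler (corresponding to initiate-scheduled-actions.ts in the project structure): queryDueActions is a hypothetical data-access helper, and sendMessage is the SQS wrapper shown later in this article.

```typescript
// Sketch: EventBridge-triggered handler that finds due PENDING actions
// and enqueues them for fulfillment.
export const initiateScheduledActions = async (): Promise<void> => {
  const now = Date.now();

  // e.g. a query against a status index, filtered on executionTime <= now
  const dueActions = await queryDueActions({ status: "PENDING", before: now });
  if (dueActions.length === 0) {
    console.log("No due actions to enqueue.");
    return;
  }

  // Enqueue every due action; SQS decouples and rate-limits fulfillment.
  await Promise.all(dueActions.map((action) => sendMessage(action)));
  console.log(`${dueActions.length} due actions enqueued for processing.`);
};
```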
Why This Approach Works
- Efficient batch processing ensures that multiple actions can be picked up at once.
- Scalability is maintained by decoupling execution with SQS, preventing system overload. Queues are critical for controlling the load placed on downstream systems.
- State Management: Actions follow a lifecycle (PENDING → IN_PROGRESS → COMPLETED/FAILED/NO_ACTION), with each state persisted in the database for tracking and recovery.
- Execution Handling: Successful executions are marked COMPLETED, failures are marked FAILED, and recurring actions update their execution remainder before resetting to PENDING.
- Automatic Retries: Failed actions use exponential backoff for retries. If retries exceed the limit, the action remains FAILED until manually reset.
- Idempotency & Data Integrity: Execution remainders prevent duplicate executions, and invalid operations (e.g., negative remainders) are blocked.
- Observability: Metadata stores logs, execution timestamps, API responses, and failure reasons for easy debugging.
Immediate Execution: Handling Time-Sensitive Actions Using DynamoDB Streams
Some scheduled actions require immediate execution if their execution time is within 2 minutes of creation. To handle these efficiently, the system leverages DynamoDB Streams and AWS Lambda for real-time processing.
1. How Immediate Execution Works
- DynamoDB Streams detect changes in the database when a new action is inserted or modified.
- A Lambda function listens to these changes, processes new actions, and determines whether they require immediate execution.
- If an action is scheduled to execute within 2 minutes, the Lambda function enqueues it into Amazon SQS for execution.
2. Breakdown of the Processing Logic
Listening to DynamoDB Stream Events
The function initiateProcessFromDynamoStream is triggered whenever a new record is INSERTED or MODIFIED in DynamoDB.
export const initiateProcessFromDynamoStream = async (
event: DynamoDBStreamEvent,
): Promise<void> => {
try {
const { Records } = event;
if (!Records || Records.length === 0) {
console.log("No records to process in DynamoDB stream event.");
return;
}
console.log(`${Records.length} records received.`);
- The function checks if there are new records in the event.
- If no records exist, the function exits early.
Processing Each Record
The function loops through each record, extracts its details, and determines whether it needs to be processed.
const processingPromises = Records.map(async (record: DynamoDBRecord) => {
const { eventName, dynamodb } = record;
if (!dynamodb?.NewImage) {
console.log("Skipping record: Missing NewImage.");
return Promise.resolve();
}
const cleanedImage = unmarshall(dynamodb.NewImage as Record<string, any>);
console.log(
"Cleaned NewImage object:",
JSON.stringify(cleanedImage, null, 2),
);
if (!["INSERT", "MODIFY"].includes(eventName || "")) {
console.log(`Skipping record with eventName ${eventName}.`);
return Promise.resolve();
}
console.log(`Processing record with eventName: ${eventName}`);
- The function extracts newly inserted or modified data from the DynamoDB stream.
- It filters out irrelevant records (i.e., records that don’t have a NewImage or are not newly inserted/modified).
Checking for Immediate Execution
The function then checks whether the action needs immediate execution by calculating the time difference.
const { status, retryCount, id, executionTime } = cleanedImage;
// Check buffer time logic
const currentTime = Date.now();
const timeUntilExecution = executionTime - currentTime;
if (timeUntilExecution > TWO_MINUTES_IN_MS) {
console.log(
`Skipping record with id ${id}: Execution time is outside the 2-minute buffer window.`,
);
return Promise.resolve();
}
- The current time is compared with the action’s executionTime.
- If the action is more than 2 minutes away, it is skipped (it will be picked up later by the periodic execution).
- If the action needs immediate execution, it continues processing.
Ensuring Valid Status and Retry Limits
if (!status || ![STATUSES.PENDING].includes(status)) {
console.log("Skipping record: Missing or invalid status.");
return Promise.resolve();
}
if (retryCount !== undefined && retryCount > CONSTANTS.MAX_RETRY) {
console.log(
`Skipping record with retryCount exceeding limit: ${retryCount}`,
);
return Promise.resolve();
}
if (!id) {
console.log("Skipping record: Missing id.");
return Promise.resolve();
}
- Ensures the action has a valid PENDING status before processing.
- Checks whether the retry limit has been exceeded to prevent infinite retries.
- Ensures the action has a valid ID before sending it to the queue.
Sending the Action to SQS for Execution
try {
await sendMessage(cleanedImage);
} catch (error: any) {
await fail(id, `Failed to add action to the queue: ${error.message}`);
}
- If the action qualifies for immediate execution, it is sent to SQS, where it will be processed by the fulfillment service.
- If SQS fails, the action is marked as FAILED and logged for debugging.
3. Why This Approach is Reliable
- Real-Time Execution: Actions scheduled within 2 minutes execute immediately instead of waiting for periodic polling.
- Automatic Filtering: Actions scheduled for later execution are skipped and processed by EventBridge at the right time.
- Error Handling: If an action fails to enqueue in SQS, it is marked as FAILED instead of being lost.
- Scalability: The Lambda function can process multiple events concurrently, ensuring no action is delayed.
Repeated Execution: Managing Recurring Actions
Some scheduled actions need to execute multiple times at fixed intervals. The system handles repeated execution using three key fields:
- repeat – Indicates whether the action should run multiple times.
- executionRemainder – Tracks how many more times the action should execute.
- frequency – Defines the time interval between executions.
1. Handling Repeated Execution
The function complete(id, notes) is responsible for managing the completion of actions. If an action is set to repeat, it updates the execution time and tracks how many executions remain.
Deducting Execution Remainder
const newExecutionRemainder = repeat && executionRemainder > 0 ? executionRemainder - 1 : 0;
if (newExecutionRemainder < 0) {
throw new Error("Execution remainder cannot be negative.");
}
- If the action is repeating, the execution remainder decreases by 1.
- If the remainder is less than 0, an error is thrown to prevent unintended behavior.
2. Completing the Final Execution
if (repeat && newExecutionRemainder === 0) {
await update(id, {
status: STATUSES.COMPLETED,
ttl: calculateTTL(),
executionRemainder: 0,
metadata: {
...metadata,
executionResponses: [...(metadata?.executionResponses || []), notes],
},
});
console.log("Final execution completed successfully:", id);
return;
}
- If no executions remain, the action is marked as COMPLETED.
- The TTL (Time-To-Live) is set to delete the record after two weeks.
- Execution metadata is updated for tracking and observability.
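calculateTTL itself is not shown in the excerpts; a plausible version, assuming DynamoDB's TTL attribute (which expects an epoch timestamp in seconds) and the two-week retention mentioned above:

```typescript
// Hypothetical TTL helper: completed records expire two weeks out.
const TWO_WEEKS_IN_SECONDS = 14 * 24 * 60 * 60;

const calculateTTL = (): number =>
  Math.floor(Date.now() / 1000) + TWO_WEEKS_IN_SECONDS;
```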
3. Scheduling the Next Execution
If the action still has remaining executions, the function schedules the next execution.
if (repeat && newExecutionRemainder > 0 && frequency) {
const frequencyInMs = getFrequencyInMilliseconds(frequency);
if (!frequencyInMs) {
throw new Error(`Invalid frequency: ${frequency}`);
}
await update(id, {
status: STATUSES.PENDING,
executionTime: executionTime + frequencyInMs,
executionRemainder: newExecutionRemainder,
metadata: {
...metadata,
executionResponses: [...(metadata?.executionResponses || []), notes],
},
});
console.log("Recurring action updated successfully:", id);
return;
}
- The execution time is updated by adding the interval from the frequency field.
- The action status is set to PENDING so it can be picked up again.
- The metadata is updated to log execution history.
4. Frequency Conversion
The system converts predefined frequencies into milliseconds to update the execution time.
const frequencyDurations: Record<string, number> = {
TEN_MINS: 10 * 60 * 1000,
HOURLY: 60 * 60 * 1000,
DAILY: 24 * 60 * 60 * 1000,
WEEKLY: 7 * 24 * 60 * 60 * 1000,
MONTHLY: 30 * 24 * 60 * 60 * 1000,
};
This allows flexible scheduling based on predefined intervals.
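The getFrequencyInMilliseconds helper used earlier then reduces to a lookup on this map (a sketch; unknown values return undefined so callers can reject them, as the scheduling code above does):

```typescript
const getFrequencyInMilliseconds = (frequency: string): number | undefined =>
  frequencyDurations[frequency];
```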
5. Ensuring Idempotency and Data Integrity
For non-repeating actions, the system ensures that execution happens only once.
await update(id, {
status: STATUSES.COMPLETED,
ttl: calculateTTL(),
metadata: { ...metadata, notes },
});
console.log("Action completed successfully with TTL:", id);
- Actions that do not repeat are marked COMPLETED immediately.
- The TTL ensures that data is retained for a limited time before deletion.
Why This Approach Works
- Automated Rescheduling – The system automatically sets the next execution time.
- Preventing Overexecution – Execution stops when the remainder reaches zero.
- Efficient Tracking – Each execution updates metadata for debugging and observability.
- Data Integrity – Ensures that frequency values are valid and that the execution remainder is correctly decremented.
Action Processing: Ensuring Reliable Execution with Deduplication
The system processes scheduled actions using Amazon SQS FIFO Queues or Redis-based deduplication to ensure each action is executed only once, preventing duplicate processing.
1. Handling Action Processing with SQS FIFO
- Messages are sent to an Amazon SQS FIFO queue, ensuring actions are processed in first-in-first-out order.
- FIFO queues guarantee deduplication, preventing the same message from being processed multiple times.
- This approach is ideal for strict ordering and exactly-once processing.
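With the AWS SDK v3, a FIFO send adds two parameters to the SendMessageCommand shown later: a message group for ordering and a deduplication ID. A sketch, assuming the same id/status/retryCount deduplication key used in the Redis variant below:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Hypothetical FIFO send: MessageGroupId preserves per-action ordering, and
// MessageDeduplicationId lets SQS drop duplicates within its 5-minute window.
const sendToFifoQueue = async (
  queueUrl: string, // must end in ".fifo"
  action: { id: string; status: string; retryCount?: number },
): Promise<void> => {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify(action),
      MessageGroupId: action.id,
      MessageDeduplicationId: `${action.id}-${action.status}-${action.retryCount || 0}`,
    }),
  );
};
```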
2. Alternative Deduplication Using Redis
If a FIFO queue is not used, the system leverages Redis to manage deduplication before sending messages to a standard SQS queue.
How Redis Deduplication Works
- Each message is assigned a deduplication ID based on:
  - The action’s unique ID
  - The status of the action
  - The retry count (if applicable)
const deduplicationId = `${messageBody.id}-${messageBody.status}-${messageBody.retryCount || 0}`;
const redisKey = `sqs-deduplication:${deduplicationId}`;
- Before sending a message to SQS, Redis checks if the deduplication ID exists.
const redisCheck = await RedisClient.get(redisKey);
if (redisCheck.success && redisCheck.data) {
console.log(
`Duplicate message detected. Skipping send for ID: ${deduplicationId}`,
);
return;
}
- If a duplicate is detected, the message is not sent, avoiding redundant processing.
3. Sending Messages to SQS
If the action is not a duplicate, it is sent to the SQS queue for processing.
const command = new SendMessageCommand({
QueueUrl: queueUrl,
MessageBody: JSON.stringify(messageBody),
});
await executeSQSCommand(command);
console.log(`Message sent successfully to ${queueUrl}`);
- The system ensures messages are delivered without unnecessary duplicates.
- Actions proceed to the fulfillment stage after entering the queue.
4. Storing Deduplication Data in Redis
After sending a message, the deduplication ID is stored in Redis with a TTL of 5 minutes to ensure temporary deduplication.
await RedisClient.set(redisKey, true, 300); // 300 seconds = 5 minutes
- The short expiration time ensures that retried actions are still processed if needed.
- Redis helps manage temporary deduplication without affecting long-term action execution.
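Putting the pieces together, the send path reads as one function. This is a sketch assembling the snippets above; RedisClient and executeSQSCommand are the project's own wrappers:

```typescript
// Sketch: deduplicated send to a standard SQS queue.
const sendMessageWithDeduplication = async (
  queueUrl: string,
  messageBody: { id: string; status: string; retryCount?: number },
): Promise<void> => {
  const deduplicationId = `${messageBody.id}-${messageBody.status}-${messageBody.retryCount || 0}`;
  const redisKey = `sqs-deduplication:${deduplicationId}`;

  // 1. Skip if this exact message was already sent in the last 5 minutes.
  const redisCheck = await RedisClient.get(redisKey);
  if (redisCheck.success && redisCheck.data) {
    console.log(`Duplicate message detected. Skipping send for ID: ${deduplicationId}`);
    return;
  }

  // 2. Send the message to the queue.
  const command = new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(messageBody),
  });
  await executeSQSCommand(command);

  // 3. Record the deduplication ID with a short TTL.
  await RedisClient.set(redisKey, true, 300); // 300 seconds = 5 minutes
};
```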
5. Handling Errors Gracefully
If sending a message to SQS fails, errors are handled based on the failure type:
if (error.name === "TimeoutError") {
throw new AppError({
...CommonErrors.REQUEST_TIMEOUT,
message: "Timeout occurred while sending the message to SQS.",
metadata: { queueUrl, messageBody, error: error.message },
});
}
throw new AppError({
...CommonErrors.INTERNAL_SERVER_ERROR,
message: "Failed to send message to SQS.",
metadata: { queueUrl, messageBody, error: error.message },
});
- Timeouts trigger a specific retry strategy.
- Other failures log metadata to help diagnose issues.
Why This Approach Works
- FIFO queues ensure strict ordering and deduplication for time-sensitive tasks.
- Redis deduplication prevents unnecessary duplicate processing when a FIFO queue is not available.
- Error handling mechanisms ensure messages are retried when necessary.
Action Fulfillment: Processing Scheduled Actions from SQS
Once scheduled actions reach their execution time, they are processed by a Lambda function that reads messages from Amazon SQS. The function ensures that actions are executed correctly, updates their status accordingly, and handles errors or retries when needed.
1. Processing Actions from SQS
The fulfill function listens for SQS events, where each record represents a scheduled action that needs execution.
export const fulfill = async (event: { Records: SQSRecord[] }): Promise<void> => {
const { Records } = event;
if (!Records || Records.length === 0) {
console.log("No records to process in SQS event.");
return;
}
- The function checks if there are new records to process.
- If no records exist, it exits early.
2. Ensuring Actions are Executed Correctly
For each action in the queue:
- If the action has exceeded the maximum retry attempts, it is marked as FAILED.
if (retryCount >= CONSTANTS.MAX_RETRY) {
await handleFailure(
id,
receiptHandle,
metadata?.retryReason || "Exceeded maximum retry attempts",
);
continue;
}
- If the action is being executed for the first time, its status is set to IN_PROGRESS.
if (retryCount === 0) {
await start(id);
}
3. Handling Different Types of Actions
Even though the scheduler does not allow an action to be scheduled without a valid fulfillment service, there is a defensive mechanism (NO_ACTION) to handle cases where an action is manually altered or corrupted in the database.
- If the action type is not recognized, it is marked as NO_ACTION and removed from the queue.
if (!Object.values(ACTIONS).includes(scheduledAction?.action as Actions)) {
await noAction(id);
await deleteMessage(receiptHandle);
continue;
}
- If the action is valid, it is processed based on its type.
Processing a General Task
case ACTIONS.EXECUTE_TASK:
result = await taskExecutionService.performTask(scheduledAction.data);
await complete(id, result);
break;
- Calls an external service to execute a generic task (e.g., processing a user request).
- Marks the action as COMPLETED once finished.
Processing a Notification
case ACTIONS.SEND_ALERT: {
const { recipient, messageType, ...messageData } = scheduledAction.data;
const processedMessageData = processMessage(messageData);
result = await notificationService.send(
recipient,
messageType,
processedMessageData,
);
await complete(id, result);
break;
}
- Sends an alert or notification using a notification service.
- Marks the action as COMPLETED after execution.
4. Handling Failures and Retries
If an action fails, the system applies exponential backoff and retries the execution before marking it as permanently failed.
Marking an Action as Failed
const handleFailure = async (id: string, receiptHandle: string, reason: string): Promise<void> => {
console.error(`Action with id: ${id} failed after maximum retries. Reason: ${reason}`);
await fail(id, reason);
await deleteMessage(receiptHandle);
};
- If an action exceeds retry limits, it is marked as FAILED.
- The message is removed from the queue to prevent further processing.
Retrying an Action with Backoff
const handleProcessingError = async (
id: string,
receiptHandle: string,
retryCount: number,
error: any,
): Promise<void> => {
console.error(`Error processing message with id: ${id}. Error: ${error.message || error}`);
await applyExponentialBackoff(retryCount, id);
const actionMarkedForRetry = await retry(
id,
error instanceof AppError
? JSON.stringify(error?.metadata) || error?.message || error?.code || error?.name
: `Processing error : ${error}`,
);
await sendMessage(actionMarkedForRetry);
await deleteMessage(receiptHandle);
};
- If an error occurs, the system applies exponential backoff and increments the retry count.
- The failed action is re-enqueued into SQS for a retry.
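applyExponentialBackoff is not shown in the excerpts; a simple version might delay in proportion to 2^retryCount, capped so the Lambda does not run into its own execution timeout. An assumption, not the project's exact implementation:

```typescript
// Hypothetical backoff helper: wait 2^retryCount seconds, capped at 30 seconds.
const applyExponentialBackoff = async (retryCount: number, id: string): Promise<void> => {
  const delayMs = Math.min(2 ** retryCount * 1000, 30_000);
  console.log(`Backing off ${delayMs}ms before retrying action ${id}.`);
  await new Promise((resolve) => setTimeout(resolve, delayMs));
};
```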
5. Finalizing Execution
- Once an action completes successfully, it is removed from SQS.
await deleteMessage(receiptHandle);
console.log(`Message with id: ${id} processed successfully.`);
- If all records from the SQS event are processed, a summary log is printed.
console.log(`${Records.length} records from the SQS event have been processed.`);
Why This Approach Works
- Ensures Actions are Always Executed – Each action is retried with backoff before failing permanently.
- Handles Different Action Types – Supports notifications, tasks, subscription updates, and other scheduled jobs.
- Prevents Duplicate Execution – Uses SQS FIFO or Redis deduplication to avoid duplicate processing.
- Reliable State Management – Updates the database with IN_PROGRESS, COMPLETED, FAILED, or NO_ACTION statuses.
- Defensive Handling – NO_ACTION is a safeguard in case an action is altered manually in the database.
Presentation Layer: Exposing Endpoints for Observability
The presentation layer consists of a single Lambda function that serves as an API, exposing HTTP endpoints through Amazon API Gateway. These endpoints allow users to observe, manage, and interact with scheduled actions, ensuring real-time monitoring and control.
1. Exposing HTTP Endpoints via API Gateway
A serverless API is built using AWS Lambda and API Gateway, providing access to key functionalities related to scheduled actions.
- **Get Actions by Status and Counts**
  - Fetches actions grouped by their current status (e.g., pending, completed, failed).
  - Provides count summaries to track execution trends (a sketch of this endpoint follows the list).
- **Initiate Failed Actions**
  - Allows users to retry failed actions manually.
  - Ensures failed jobs can be reprocessed without waiting for an automated retry cycle.
- **Delete Actions**
  - Provides an endpoint to remove old or unnecessary actions.
  - Helps maintain a clean database by managing expired records.
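As an example, the status endpoint might look like this. A sketch assuming an Express-style router inside the Lambda and a queryActionsByStatus service helper, neither of which appears in the repository excerpts:

```typescript
import { Router, Request, Response } from "express";

const router = Router();

// GET /actions?status=FAILED — list actions in a given state, with a count.
router.get("/actions", async (req: Request, res: Response) => {
  const status = (req.query.status as string) || "PENDING";
  const actions = await queryActionsByStatus(status); // hypothetical service call
  res.json({ status, count: actions.length, actions });
});

export default router;
```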
2. Integrating with a Monitoring Dashboard
The data exposed by these endpoints can be visualized on a dashboard for real-time observability.
- Displays action counts by status to track performance.
- Allows users to manually retry or delete actions via an interface.
- Provides insights into system health and execution reliability.
Challenges in Building the Scheduled Actions System
Developing a reliable and scalable scheduled actions system comes with several challenges that need to be carefully addressed.
- **Race Conditions**
  - When multiple processes attempt to update or execute the same action simultaneously, inconsistencies can occur.
  - Proper locking mechanisms, deduplication, and FIFO queues help prevent duplicate execution (see the conditional-update sketch after this list).
- **Load Testing at Every Point of the Cycle**
  - The system must be tested under high load to ensure that scheduling, execution, retries, and fulfillment scale properly.
  - Testing covers database performance, SQS message handling, Lambda execution limits, and API response times.
- **Jobs Being Picked Up by Both Streams and the Scheduler**
  - Actions scheduled within 2 minutes of execution are processed by DynamoDB Streams, while others rely on EventBridge.
  - Without proper coordination, duplicate executions may occur. Ensuring actions transition correctly between pending, in-progress, and completed states prevents this issue.
- **Understanding Limitations of Tools**
  - Concurrency Handling: AWS Lambda scales automatically, but high concurrency can lead to throttling and delays in processing.
  - Lambda Runtime Limits: Since Lambda has a maximum execution time, long-running tasks must be broken into smaller executions or offloaded to a worker service.
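One concrete guard against the stream and the scheduler claiming the same action: make the PENDING → IN_PROGRESS transition a conditional write, so only one worker wins. A sketch with the AWS SDK v3 DynamoDB document client; the table name and key shape are assumptions:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Only one caller can move an action out of PENDING; the loser receives a
// ConditionalCheckFailedException and simply skips the action.
const claimAction = async (id: string): Promise<boolean> => {
  try {
    await docClient.send(
      new UpdateCommand({
        TableName: "scheduled-actions", // assumed table name
        Key: { id },
        UpdateExpression: "SET #status = :inProgress, updatedAt = :now",
        ConditionExpression: "#status = :pending",
        ExpressionAttributeNames: { "#status": "status" },
        ExpressionAttributeValues: {
          ":inProgress": "IN_PROGRESS",
          ":pending": "PENDING",
          ":now": Date.now(),
        },
      }),
    );
    return true;
  } catch (error: any) {
    if (error.name === "ConditionalCheckFailedException") return false;
    throw error;
  }
};
```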
Future Improvements
- **Webhooks for Real-Time Notifications**
  - Implementing webhooks would allow external services that schedule actions to receive real-time updates when an action executes, fails, or retries.
  - This reduces the need for polling and improves system responsiveness.
- **Handler for Missed Actions**
  - A dedicated handler to detect and process actions that are still in a PENDING state but have an execution time in the past.
  - This ensures no scheduled action is permanently missed due to system failures, delays, or scaling issues.
These improvements would enhance reliability, observability, and integration with external systems, making the scheduling system even more robust.
Conclusion: Scheduling, Scaling, and Sanity 🚀
Building a reliable, scalable, and fault-tolerant scheduled actions system isn’t just about setting up a cron job and hoping for the best—it’s about engineering resilience into every step of the process. From scheduling and execution to retries and observability, every component must work together to ensure that no action is lost, no notification is forgotten, and no subscription goes unmanaged.
Through this journey, we've tackled:
✅ Dynamic scheduling with DynamoDB Streams, EventBridge, and SQS.
✅ Precise execution using a mix of immediate and delayed actions.
✅ Reliable processing with deduplication, retries, and exponential backoff.
✅ Scalability by leveraging serverless architecture to handle high loads.
✅ Observability with APIs to monitor, retry, and delete scheduled actions.
Of course, every system has its quirks and challenges—race conditions, tool limitations, and unexpected failures—but with the right design patterns, defensive coding, and future improvements (like webhooks and missed action recovery), this system can evolve into an even more powerful, intelligent, and autonomous scheduler.
At the end of the day, automation is about making life easier—whether it’s managing user subscriptions, sending notifications, or processing time-sensitive transactions. And while computers never sleep, we certainly need to, which is why designing a system that can handle its own problems before waking us up at 3 AM is always worth the effort.
So here’s to building systems that work while we don’t! 🎉