Introduction
Welcome back to my blog! 😁 This is where I talk to myself—and hopefully, to you—about the engineering problems I solve at work. I do this mainly because finding solutions excites me. My journey of identifying inefficiencies, bottlenecks, and challenges has led me to tackle a common yet critical problem in software engineering.
That problem is the need to execute actions asynchronously—often with precise timing and sometimes on a recurring basis. Following my core approach to problem-solving (across space and time), I decided to build a solution that wasn’t just tailored to a single action but was extendable to various use cases. Whether it's sending notifications, processing transactions, or triggering system workflows, many tasks require scheduled execution. Without a robust scheduling mechanism, handling these jobs efficiently can quickly become complex, unreliable, and costly.
To address this, I set out to build a scalable, reliable, and cost-effective event scheduler—one that could manage delayed, immediate and recurring actions seamlessly.
In this article, I’ll walk you through:
- The problem that led to the need for an event scheduler
- The functional and non-functional requirements for an ideal solution
- The system design and architecture decisions behind the implementation
By the end, you’ll have a clear understanding of how to build a serverless scheduled actions system that ensures accuracy, durability, and scalability while keeping costs in check. Let’s dive in!
The Problem: Managing Subscription Changes Across a Calendar Cycle
Subscription management comes with unique challenges, especially when handling cancellations or downgrades 😭. Users can request these changes at any time during their billing cycle, but due to the prepaid nature of subscriptions, such modifications can only take effect at the end of the cycle. This delay introduces a need for asynchronous execution—a system that can record these requests immediately but defer their execution until the appropriate time.
The Solution: A Proper Scheduling Mechanism
Without a proper scheduling mechanism, managing these deferred actions efficiently becomes complex. The system must ensure that every request is executed at the right time while preventing missed or duplicate actions. Furthermore, frequent executions—such as batch processing of multiple scheduled changes—must be handled without overwhelming the system. To address this, we needed a reliable, scalable, and cost-effective scheduler capable of handling delayed and recurring execution seamlessly.
Functional Requirements: Defining the Core Capabilities
A robust and scalable scheduled actions system must be able to efficiently schedule, execute, update, monitor, and retry actions while ensuring reliability and flexibility.
1. Scheduling and Creating Actions
The system must allow users to schedule actions with:
- Action type, execution time, execution data, and metadata as required fields.
- Optional fields like repeat, frequency, and executionRemainder (the number of remaining executions).
- Early validation to ensure actions conform to their fulfillment requirements.
2. Updating or Deleting an Action
Users can update or delete an action before it is locked (2 minutes before execution). Once locked, no external changes are allowed.
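As a sketch, the lock check reduces to a simple time comparison, assuming the same two-minute constant the stream-processing code uses later (the guard function name here is hypothetical):

```typescript
// Hypothetical guard for the update/delete handlers: an action is locked
// once its execution time falls within the two-minute buffer window.
const TWO_MINUTES_IN_MS = 2 * 60 * 1000;

const isLocked = (executionTime: number): boolean =>
  executionTime - Date.now() <= TWO_MINUTES_IN_MS;

// Example: an update handler would reject the request when isLocked(...) is true.
```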
3. Action Status Management
Each action must have an internally managed status that reflects its execution progress. Status transitions and results must be logged in metadata for tracking.
4. Action Fulfillment Mapping
Every action must map to a specific fulfillment service responsible for its execution. Actions without a matching fulfillment service must be flagged to prevent execution errors.
5. Retrying Failed Actions
Failed actions must retry using exponential backoff to handle temporary failures. Actions that exceed the maximum retry limit must be flagged for manual intervention.
6. Handling Immediate vs. Delayed Actions vs. Repeated Actions
The system must distinguish between immediate and delayed actions to ensure timely execution:
- Immediate actions (execution within 2 minutes) must be processed in real-time without scheduling delays.
- Delayed actions (execution after 2 minutes) must be scheduled and processed at the correct time.
- Repeated actions must be processed the required number of times, at the required frequency.
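Pulling requirements 1 and 6 together, here is a minimal validation sketch for the create path. The field names follow the schema described later; the exact checks and error messages are assumptions, and the project's real validation also verifies fulfillment requirements per action type:

```typescript
// Hypothetical shape of a create-action request.
interface CreateActionRequest {
  action: string;                     // required: action type
  executionTime: number;              // required: epoch milliseconds
  data: Record<string, unknown>;      // required: execution-specific details
  metadata: Record<string, unknown>;  // required: tracking details
  repeat?: boolean;                   // optional: recurring flag
  frequency?: string;                 // optional: e.g. "DAILY"
  executionRemainder?: number;        // optional: remaining executions
}

// Early validation sketch: reject malformed requests before they are stored.
const validateCreateAction = (req: CreateActionRequest): void => {
  if (!req.action || !req.executionTime || !req.data) {
    throw new Error("action, executionTime, and data are required.");
  }
  if (req.executionTime < Date.now()) {
    throw new Error("executionTime must be in the future.");
  }
  // A repeating action must also carry a frequency and a positive remainder.
  if (req.repeat && (!req.frequency || !req.executionRemainder || req.executionRemainder <= 0)) {
    throw new Error("Repeating actions need a frequency and a positive executionRemainder.");
  }
};
```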
Non-Functional Requirements (NFR): Ensuring a Reliable and Scalable System
A scheduled actions system must meet key NFRs to guarantee reliability, scalability, security, and maintainability.
1. Reliability and Durability
- Actions must execute correctly and on time (±2 minutes).
- Repeating actions must execute exactly as scheduled with the correct frequency.
- Failed actions must retry with exponential backoff, and non-repeating actions must execute only once.
2. Scalability
- The system must scale dynamically to handle high request loads.
- A serverless architecture ensures cost-efficiency and flexibility.
- A queue-based approach (e.g., AWS SQS) must regulate execution frequency to prevent overloading downstream services.
3. Availability
- The system must always be available, with no cold starts, ensuring immediate execution when needed. A serverless architecture supports this at a reasonable cost.
4. Security
- Signature-based validation must secure requests and prevent unauthorized execution.
5. Maintainability
- The system must be modular, encapsulated, and organized within a single repository.
- Infrastructure and database indexing rules must be codified.
- A typed language must be used for better reliability.
- Local testing must be enabled with encrypted environment variables.
- A startup script can automate package installation and environment setup.
- Comprehensive tests must ensure safe changes and integration.
6. Observability
- API endpoints must expose:
  - All scheduled actions.
  - Actions filtered by status.
  - Retry functionality for failed actions.
- A centralized logging system must track execution issues consistently.
Tools: Powering the Scheduled Actions System
AWS Lambda: Serverless Compute for Execution
- Enables event-driven execution without managing servers.
- Handles action scheduling and validation.
- Processes immediate actions using real-time event streams.
- Executes delayed actions at the scheduled time.
- Manages fulfillment tasks based on the action type.
Amazon EventBridge: Managing Scheduled Execution
- Acts as a scheduler for delayed actions.
- Polls for due pending actions every 5 minutes and enqueues them for processing.
- Ensures execution happens within ±2 minutes of the scheduled time.
Amazon SQS: Queueing Actions for Scalability
- Decouples execution workloads by handling scheduled actions asynchronously.
- Controls fulfillment request frequency to prevent system overload.
- Uses FIFO (First-In-First-Out) processing to maintain execution order and prevent duplicate executions.
Amazon DynamoDB: Storing Scheduled Actions
- Serves as the primary database for storing scheduled actions.
- Provides fast read/write operations for handling high workloads.
- Stores metadata for tracking execution status, retries, and results.
- Uses DynamoDB Streams to trigger immediate executions.
Amazon API Gateway: Exposing Endpoints for Management
- Provides HTTP endpoints for creating, updating, and deleting scheduled actions.
- Exposes monitoring endpoints to retrieve actions by status and retry failed actions.
- Ensures secure access with authentication and authorization mechanisms.
System Design: Database Schema for Scheduled Actions
| Field | Description |
| --- | --- |
| id | Unique identifier for each scheduled action. |
| data | Stores execution-specific details. |
| action | Defines the type of action to execute. |
| executionTime | Specifies when the action should run. |
| repeat | Indicates if the action should repeat. |
| frequency | Defines the interval for recurring actions. |
| executionRemainder | Tracks the remaining number of executions. |
| status | Execution state ("PENDING", "IN_PROGRESS", "COMPLETED", "FAILED"). |
| createdAt | Timestamp when the action was created. |
| updatedAt | Last modified timestamp. |
| retryCount | Counts failed execution retries. |
| metadata | Stores logs and additional execution details. |
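In TypeScript, this schema maps naturally to an interface. A sketch: the status union mirrors the states above, plus the NO_ACTION and ttl fields that appear later in the processing logic:

```typescript
type ActionStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED" | "NO_ACTION";

interface ScheduledAction {
  id: string;                          // unique identifier
  data: Record<string, unknown>;       // execution-specific details
  action: string;                      // action type to execute
  executionTime: number;               // epoch milliseconds
  repeat?: boolean;                    // whether the action recurs
  frequency?: string;                  // e.g. "DAILY"
  executionRemainder?: number;         // remaining number of executions
  status: ActionStatus;
  createdAt: number;                   // creation timestamp
  updatedAt: number;                   // last-modified timestamp
  retryCount: number;                  // failed execution retries
  ttl?: number;                        // DynamoDB TTL, set on completion
  metadata?: Record<string, unknown>;  // logs and additional details
}
```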
Example: Scheduled Notification Action
{
"data": {
"mobile": "60123456789",
"subject": "Test",
"name": "Joojo",
"templateType": "USER_LATE_PAYMENT_NOTIFICATION",
"notificationType": "SMS"
},
"repeat": true,
"frequency": "DAILY",
"executionRemainder": 5,
"action": "SEND_NOTIFICATION",
"executionTime": 1736930117120
}
Project structure
I used a layered-modular approach for maintainability, scalability, and ease of change. Different teams often need to extend a service without introducing unintended side effects, and I tried to achieve this by organizing components into distinct modules. Let's dive deeper below.
1. Single Application with a Modular Design
The entire system is built as a single application, but with a modular structure that separates concerns. Each module is responsible for a specific aspect of the system, making the codebase easier to navigate and modify.
./src
├── app.ts
├── clients
├── config
├── controllers
├── handlers
├── helpers
├── middleware
├── models
├── routes
├── service
├── types
└── utils
2. Serverless Handlers for Distributed Execution
The project is designed around AWS Lambda, with different handlers exported and structured to allow seamless execution of scheduled actions. These handlers ensure that various tasks are processed independently, improving fault tolerance and scalability.
- Action Handlers: Manage creating, scheduling, retrieving, updating, deleting, and processing scheduled actions. This keeps all action-related logic centralized, making it easy to modify without affecting other parts of the system.
- Delayed Action Handlers: Specifically handle actions that need to be initiated at a later time. This separation ensures that delayed actions are efficiently scheduled and processed without interfering with real-time execution.
- Immediate Action Handlers: Trigger execution for actions that must start within 2 minutes, using DynamoDB Streams to detect changes and initiate execution instantly. This ensures timely processing of urgent tasks.
- Fulfillment Handlers: Ensure that scheduled actions are executed properly by interacting with the appropriate fulfillment services. This design allows fulfillment logic to evolve independently of action scheduling.
├── handlers
│ ├── fulfillment.ts
│ ├── initiate-scheduled-actions.ts
│ ├── initiate-stream-actions.ts
│ └── process.ts
├── http-apis.ts
3. Maintainability through Separation of Concerns
Each module in the project is self-contained, meaning changes to one component do not directly impact others. This reduces the risk of breaking existing functionality and simplifies debugging.
- Controllers handle request routing and execution logic.
- Services manage business logic and data interactions.
- Clients interact with external services like databases, queues, and APIs.
- Models define the data structures used across the system.
- Middleware ensures that requests pass through validation and authentication layers.
- Utilities provide reusable helper functions for logging, error handling, and retries.
./src
├── clients
├── controllers
├── middleware
├── models
├── service
└── utils
4. Ease of Extensibility
With a modular design, new features can be added without modifying core components. For example:
- A new type of scheduled action can be introduced by adding a new action in the fulfillment service without modifying the existing scheduling or queuing logic.
- A new external service integration can be implemented by extending the clients module, ensuring seamless communication with third-party systems.
Delayed Execution: Ensuring Timely Execution
The system efficiently processes scheduled actions through periodic execution, ensuring that all pending actions are executed at the right time without delays.
Periodic Execution for Scheduled Actions
- A Lambda function periodically scans the database for actions with PENDING status and an executionTime that is due.
- Amazon EventBridge acts as a scheduler, triggering this Lambda function every 5 minutes to ensure that actions are picked up on time.
- The function enqueues these pending actions into Amazon SQS, ensuring a reliable and scalable execution pipeline.
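A sketch of that handler (corresponding to initiate-scheduled-actions.ts in the project structure): queryDueActions is a hypothetical data-access helper, and sendMessage is the SQS wrapper shown later in this article.

```typescript
// Sketch: EventBridge-triggered handler that finds due PENDING actions
// and enqueues them for fulfillment.
export const initiateScheduledActions = async (): Promise<void> => {
  const now = Date.now();

  // e.g. a query against a status index, filtered on executionTime <= now
  const dueActions = await queryDueActions({ status: "PENDING", before: now });
  if (dueActions.length === 0) {
    console.log("No due actions to enqueue.");
    return;
  }

  // Enqueue every due action; SQS decouples and rate-limits fulfillment.
  await Promise.all(dueActions.map((action) => sendMessage(action)));
  console.log(`${dueActions.length} due actions enqueued for processing.`);
};
```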
Why This Approach Works
- Efficient batch processing ensures that multiple actions can be picked up at once.
- Scalability is maintained by decoupling execution with SQS, preventing system overload. Queues are critical for controlling the load placed on downstream systems.
- State Management: Actions follow a lifecycle (PENDING → IN_PROGRESS → COMPLETED/FAILED/NO_ACTION), with each state persisted in the database for tracking and recovery.
- Execution Handling: Successful executions are marked COMPLETED, failures are marked FAILED, and recurring actions update their execution remainder before resetting to PENDING.
- Automatic Retries: Failed actions use exponential backoff for retries. If retries exceed the limit, the action remains FAILED until manually reset.
- Idempotency & Data Integrity: Execution remainders prevent duplicate executions, and invalid operations (e.g., negative remainders) are blocked.
- Observability: Metadata stores logs, execution timestamps, API responses, and failure reasons for easy debugging.
Immediate Execution: Handling Time-Sensitive Actions Using DynamoDB Streams
Some scheduled actions require immediate execution if their execution time is within 2 minutes of creation. To handle these efficiently, the system leverages DynamoDB Streams and AWS Lambda for real-time processing.
1. How Immediate Execution Works
- DynamoDB Streams detect changes in the database when a new action is inserted or modified.
- A Lambda function listens to these changes, processes new actions, and determines whether they require immediate execution.
- If an action is scheduled to execute within 2 minutes, the Lambda function enqueues it into Amazon SQS for execution.
2. Breakdown of the Processing Logic
Listening to DynamoDB Stream Events
The function initiateProcessFromDynamoStream is triggered whenever a new record is INSERTED or MODIFIED in DynamoDB.
export const initiateProcessFromDynamoStream = async (
event: DynamoDBStreamEvent,
): Promise<void> => {
try {
const { Records } = event;
if (!Records || Records.length === 0) {
console.log("No records to process in DynamoDB stream event.");
return;
}
console.log(`${Records.length} records received.`);
- The function checks if there are new records in the event.
- If no records exist, the function exits early.
Processing Each Record
The function loops through each record, extracts its details, and determines whether it needs to be processed.
const processingPromises = Records.map(async (record: DynamoDBRecord) => {
const { eventName, dynamodb } = record;
if (!dynamodb?.NewImage) {
console.log("Skipping record: Missing NewImage.");
return Promise.resolve();
}
const cleanedImage = unmarshall(dynamodb.NewImage as Record<string, any>);
console.log(
"Cleaned NewImage object:",
JSON.stringify(cleanedImage, null, 2),
);
if (!["INSERT", "MODIFY"].includes(eventName || "")) {
console.log(`Skipping record with eventName ${eventName}.`);
return Promise.resolve();
}
console.log(`Processing record with eventName: ${eventName}`);
- The function extracts newly inserted or modified data from the DynamoDB stream.
- It filters out irrelevant records (i.e., records that don’t have a NewImage or are not newly inserted/modified).
Checking for Immediate Execution
The function then checks whether the action needs immediate execution by calculating the time difference.
const { status, retryCount, id, executionTime } = cleanedImage;
// Check buffer time logic
const currentTime = Date.now();
const timeUntilExecution = executionTime - currentTime;
if (timeUntilExecution > TWO_MINUTES_IN_MS) {
console.log(
`Skipping record with id ${id}: Execution time is outside the 2-minute buffer window.`,
);
return Promise.resolve();
}
- The current time is compared with the action’s executionTime.
- If the action is more than 2 minutes away, it is skipped (it will be picked up later by the periodic execution).
- If the action needs immediate execution, it continues processing.
Ensuring Valid Status and Retry Limits
if (!status || ![STATUSES.PENDING].includes(status)) {
console.log("Skipping record: Missing or invalid status.");
return Promise.resolve();
}
if (retryCount !== undefined && retryCount > CONSTANTS.MAX_RETRY) {
console.log(
`Skipping record with retryCount exceeding limit: ${retryCount}`,
);
return Promise.resolve();
}
if (!id) {
console.log("Skipping record: Missing id.");
return Promise.resolve();
}
- Ensures the action has a valid PENDING status before processing.
- Checks whether the retry limit has been exceeded to prevent infinite retries.
- Ensures the action has a valid ID before sending it to the queue.
Sending the Action to SQS for Execution
try {
await sendMessage(cleanedImage);
} catch (error: any) {
await fail(id, `Failed to add action to the queue: ${error.message}`);
}
- If the action qualifies for immediate execution, it is sent to SQS, where it will be processed by the fulfillment service.
- If SQS fails, the action is marked as FAILED and logged for debugging.
3. Why This Approach is Reliable
- Real-Time Execution: Actions scheduled within 2 minutes execute immediately instead of waiting for periodic polling.
- Automatic Filtering: Actions scheduled for later execution are skipped and processed by EventBridge at the right time.
- Error Handling: If an action fails to enqueue in SQS, it is marked as FAILED instead of being lost.
- Scalability: The Lambda function can process multiple events concurrently, ensuring no action is delayed.
Repeated Execution: Managing Recurring Actions
Some scheduled actions need to execute multiple times at fixed intervals. The system handles repeated execution using three key fields:
- repeat – Indicates whether the action should run multiple times.
- executionRemainder – Tracks how many more times the action should execute.
- frequency – Defines the time interval between executions.
1. Handling Repeated Execution
The function complete(id, notes) is responsible for managing the completion of actions. If an action is set to repeat, it updates the execution time and tracks how many executions remain.
Deducting Execution Remainder
const newExecutionRemainder = repeat && executionRemainder > 0 ? executionRemainder - 1 : 0;
if (newExecutionRemainder < 0) {
throw new Error("Execution remainder cannot be negative.");
}
- If the action is repeating, the execution remainder decreases by 1.
- If the remainder is less than 0, an error is thrown to prevent unintended behavior.
2. Completing the Final Execution
if (repeat && newExecutionRemainder === 0) {
await update(id, {
status: STATUSES.COMPLETED,
ttl: calculateTTL(),
executionRemainder: 0,
metadata: {
...metadata,
executionResponses: [...(metadata?.executionResponses || []), notes],
},
});
console.log("Final execution completed successfully:", id);
return;
}
- If no executions remain, the action is marked as COMPLETED.
- The TTL (Time-To-Live) is set to delete the record after two weeks.
- Execution metadata is updated for tracking and observability.
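calculateTTL itself is not shown in the excerpts; a plausible version, assuming DynamoDB's TTL attribute (which expects an epoch timestamp in seconds) and the two-week retention mentioned above:

```typescript
// Hypothetical TTL helper: completed records expire two weeks out.
const TWO_WEEKS_IN_SECONDS = 14 * 24 * 60 * 60;

const calculateTTL = (): number =>
  Math.floor(Date.now() / 1000) + TWO_WEEKS_IN_SECONDS;
```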
3. Scheduling the Next Execution
If the action still has remaining executions, the function schedules the next execution.
if (repeat && newExecutionRemainder > 0 && frequency) {
const frequencyInMs = getFrequencyInMilliseconds(frequency);
if (!frequencyInMs) {
throw new Error(`Invalid frequency: ${frequency}`);
}
await update(id, {
status: STATUSES.PENDING,
executionTime: executionTime + frequencyInMs,
executionRemainder: newExecutionRemainder,
metadata: {
...metadata,
executionResponses: [...(metadata?.executionResponses || []), notes],
},
});
console.log("Recurring action updated successfully:", id);
return;
}
- The execution time is updated by adding the interval from the frequency field.
- The action status is set to PENDING so it can be picked up again.
- The metadata is updated to log execution history.
4. Frequency Conversion
The system converts predefined frequencies into milliseconds to update the execution time.
const frequencyDurations: Record<string, number> = {
TEN_MINS: 10 * 60 * 1000,
HOURLY: 60 * 60 * 1000,
DAILY: 24 * 60 * 60 * 1000,
WEEKLY: 7 * 24 * 60 * 60 * 1000,
MONTHLY: 30 * 24 * 60 * 60 * 1000,
};
This allows flexible scheduling based on predefined intervals.
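The getFrequencyInMilliseconds helper used earlier then reduces to a lookup on this map (a sketch; unknown values return undefined so callers can reject them, as the scheduling code above does):

```typescript
const getFrequencyInMilliseconds = (frequency: string): number | undefined =>
  frequencyDurations[frequency];
```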
5. Ensuring Idempotency and Data Integrity
For non-repeating actions, the system ensures that execution happens only once.
await update(id, {
status: STATUSES.COMPLETED,
ttl: calculateTTL(),
metadata: { ...metadata, notes },
});
console.log("Action completed successfully with TTL:", id);
- Actions that do not repeat are marked COMPLETED immediately.
- The TTL ensures that data is retained for a limited time before deletion.
Why This Approach Works
- Automated Rescheduling – The system automatically sets the next execution time.
- Preventing Overexecution – Execution stops when the remainder reaches zero.
- Efficient Tracking – Each execution updates metadata for debugging and observability.
- Data Integrity – Ensures that frequency values are valid and that the execution remainder is correctly decremented.
Action Processing: Ensuring Reliable Execution with Deduplication
The system processes scheduled actions using Amazon SQS FIFO Queues or Redis-based deduplication to ensure each action is executed only once, preventing duplicate processing.
1. Handling Action Processing with SQS FIFO
- Messages are sent to an Amazon SQS FIFO queue, ensuring actions are processed in first-in-first-out order.
- FIFO queues guarantee deduplication, preventing the same message from being processed multiple times.
- This approach is ideal for strict ordering and exactly-once processing.
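With the AWS SDK v3, a FIFO send adds two parameters to the SendMessageCommand shown later: a message group for ordering and a deduplication ID. A sketch, assuming the same id/status/retryCount deduplication key used in the Redis variant below:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Hypothetical FIFO send: MessageGroupId preserves per-action ordering, and
// MessageDeduplicationId lets SQS drop duplicates within its 5-minute window.
const sendToFifoQueue = async (
  queueUrl: string, // must end in ".fifo"
  action: { id: string; status: string; retryCount?: number },
): Promise<void> => {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify(action),
      MessageGroupId: action.id,
      MessageDeduplicationId: `${action.id}-${action.status}-${action.retryCount || 0}`,
    }),
  );
};
```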
2. Alternative Deduplication Using Redis
If a FIFO queue is not used, the system leverages Redis to manage deduplication before sending messages to a standard SQS queue.
How Redis Deduplication Works
- Each message is assigned a deduplication ID based on:
  - The action’s unique ID
  - The status of the action
  - The retry count (if applicable)
const deduplicationId = `${messageBody.id}-${messageBody.status}-${messageBody.retryCount || 0}`;
const redisKey = `sqs-deduplication:${deduplicationId}`;
- Before sending a message to SQS, Redis checks if the deduplication ID exists.
const redisCheck = await RedisClient.get(redisKey);
if (redisCheck.success && redisCheck.data) {
console.log(
`Duplicate message detected. Skipping send for ID: ${deduplicationId}`,
);
return;
}
- If a duplicate is detected, the message is not sent, avoiding redundant processing.
3. Sending Messages to SQS
If the action is not a duplicate, it is sent to the SQS queue for processing.
const command = new SendMessageCommand({
QueueUrl: queueUrl,
MessageBody: JSON.stringify(messageBody),
});
await executeSQSCommand(command);
console.log(`Message sent successfully to ${queueUrl}`);
- The system ensures messages are delivered without unnecessary duplicates.
- Actions proceed to the fulfillment stage after entering the queue.
4. Storing Deduplication Data in Redis
After sending a message, the deduplication ID is stored in Redis with a TTL of 5 minutes to ensure temporary deduplication.
await RedisClient.set(redisKey, true, 300); // 300 seconds = 5 minutes
- The short expiration time ensures that retried actions are still processed if needed.
- Redis helps manage temporary deduplication without affecting long-term action execution.
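Putting the pieces together, the send path reads as one function. This is a sketch assembling the snippets above; RedisClient and executeSQSCommand are the project's own wrappers:

```typescript
// Sketch: deduplicated send to a standard SQS queue.
const sendMessageWithDeduplication = async (
  queueUrl: string,
  messageBody: { id: string; status: string; retryCount?: number },
): Promise<void> => {
  const deduplicationId = `${messageBody.id}-${messageBody.status}-${messageBody.retryCount || 0}`;
  const redisKey = `sqs-deduplication:${deduplicationId}`;

  // 1. Skip if this exact message was already sent in the last 5 minutes.
  const redisCheck = await RedisClient.get(redisKey);
  if (redisCheck.success && redisCheck.data) {
    console.log(`Duplicate message detected. Skipping send for ID: ${deduplicationId}`);
    return;
  }

  // 2. Send the message to the queue.
  const command = new SendMessageCommand({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify(messageBody),
  });
  await executeSQSCommand(command);

  // 3. Record the deduplication ID with a short TTL.
  await RedisClient.set(redisKey, true, 300); // 300 seconds = 5 minutes
};
```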
5. Handling Errors Gracefully
If sending a message to SQS fails, errors are handled based on the failure type:
if (error.name === "TimeoutError") {
throw new AppError({
...CommonErrors.REQUEST_TIMEOUT,
message: "Timeout occurred while sending the message to SQS.",
metadata: { queueUrl, messageBody, error: error.message },
});
}
throw new AppError({
...CommonErrors.INTERNAL_SERVER_ERROR,
message: "Failed to send message to SQS.",
metadata: { queueUrl, messageBody, error: error.message },
});
- Timeouts trigger a specific retry strategy.
- Other failures log metadata to help diagnose issues.
Why This Approach Works
- FIFO queues ensure strict ordering and deduplication for time-sensitive tasks.
- Redis deduplication prevents unnecessary duplicate processing when a FIFO queue is not available.
- Error handling mechanisms ensure messages are retried when necessary.
Action Fulfillment: Processing Scheduled Actions from SQS
Once scheduled actions reach their execution time, they are processed by a Lambda function that reads messages from Amazon SQS. The function ensures that actions are executed correctly, updates their status accordingly, and handles errors or retries when needed.
1. Processing Actions from SQS
The fulfill function listens for SQS events, where each record represents a scheduled action that needs execution.
export const fulfill = async (event: { Records: SQSRecord[] }): Promise<void> => {
const { Records } = event;
if (!Records || Records.length === 0) {
console.log("No records to process in SQS event.");
return;
}
- The function checks if there are new records to process.
- If no records exist, it exits early.
2. Ensuring Actions are Executed Correctly
For each action in the queue:
- If the action has exceeded the maximum retry attempts, it is marked as FAILED.
if (retryCount >= CONSTANTS.MAX_RETRY) {
await handleFailure(
id,
receiptHandle,
metadata?.retryReason || "Exceeded maximum retry attempts",
);
continue;
}
- If the action is being executed for the first time, its status is set to IN_PROGRESS.
if (retryCount === 0) {
await start(id);
}
3. Handling Different Types of Actions
Even though the scheduler does not allow an action to be scheduled without a valid fulfillment service, there is a defensive mechanism (NO_ACTION) to handle cases where an action is manually altered or corrupted in the database.
- If the action type is not recognized, it is marked as NO_ACTION and removed from the queue.
if (!Object.values(ACTIONS).includes(scheduledAction?.action as Actions)) {
await noAction(id);
await deleteMessage(receiptHandle);
continue;
}
- If the action is valid, it is processed based on its type.
Processing a General Task
case ACTIONS.EXECUTE_TASK:
result = await taskExecutionService.performTask(scheduledAction.data);
await complete(id, result);
break;
- Calls an external service to execute a generic task (e.g., processing a user request).
- Marks the action as COMPLETED once finished.
Processing a Notification
case ACTIONS.SEND_ALERT: {
const { recipient, messageType, ...messageData } = scheduledAction.data;
const processedMessageData = processMessage(messageData);
result = await notificationService.send(
recipient,
messageType,
processedMessageData,
);
await complete(id, result);
break;
}
- Sends an alert or notification using a notification service.
- Marks the action as COMPLETED after execution.
4. Handling Failures and Retries
If an action fails, the system applies exponential backoff and retries the execution before marking it as permanently failed.
Marking an Action as Failed
const handleFailure = async (id: string, receiptHandle: string, reason: string): Promise<void> => {
console.error(`Action with id: ${id} failed after maximum retries. Reason: ${reason}`);
await fail(id, reason);
await deleteMessage(receiptHandle);
};
- If an action exceeds retry limits, it is marked as FAILED.
- The message is removed from the queue to prevent further processing.
Retrying an Action with Backoff
const handleProcessingError = async (
id: string,
receiptHandle: string,
retryCount: number,
error: any,
): Promise<void> => {
console.error(`Error processing message with id: ${id}. Error: ${error.message || error}`);
await applyExponentialBackoff(retryCount, id);
const actionMarkedForRetry = await retry(
id,
error instanceof AppError
? JSON.stringify(error?.metadata) || error?.message || error?.code || error?.name
: `Processing error : ${error}`,
);
await sendMessage(actionMarkedForRetry);
await deleteMessage(receiptHandle);
};
- If an error occurs, the system applies exponential backoff and increments the retry count.
- The failed action is re-enqueued into SQS for a retry.
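applyExponentialBackoff is not shown in the excerpts; a simple version might delay in proportion to 2^retryCount, capped so the Lambda does not run into its own execution timeout. An assumption, not the project's exact implementation:

```typescript
// Hypothetical backoff helper: wait 2^retryCount seconds, capped at 30 seconds.
const applyExponentialBackoff = async (retryCount: number, id: string): Promise<void> => {
  const delayMs = Math.min(2 ** retryCount * 1000, 30_000);
  console.log(`Backing off ${delayMs}ms before retrying action ${id}.`);
  await new Promise((resolve) => setTimeout(resolve, delayMs));
};
```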
5. Finalizing Execution
- Once an action completes successfully, it is removed from SQS.
await deleteMessage(receiptHandle);
console.log(`Message with id: ${id} processed successfully.`);
- If all records from the SQS event are processed, a summary log is printed.
console.log(`${Records.length} records from the SQS event have been processed.`);
Why This Approach Works
- Ensures Actions are Always Executed – Each action is retried with backoff before failing permanently.
- Handles Different Action Types – Supports notifications, tasks, subscription updates, and other scheduled jobs.
- Prevents Duplicate Execution – Uses SQS FIFO or Redis deduplication to avoid duplicate processing.
- Reliable State Management – Updates the database with IN_PROGRESS, COMPLETED, FAILED, or NO_ACTION statuses.
- Defensive Handling – NO_ACTION is a safeguard in case an action is altered manually in the database.
Presentation Layer: Exposing Endpoints for Observability
The presentation layer consists of a single Lambda function that serves as an API, exposing HTTP endpoints through Amazon API Gateway. These endpoints allow users to observe, manage, and interact with scheduled actions, ensuring real-time monitoring and control.
1. Exposing HTTP Endpoints via API Gateway
A serverless API is built using AWS Lambda and API Gateway, providing access to key functionalities related to scheduled actions.
- **Get Actions by Status and Counts**
  - Fetches actions grouped by their current status (e.g., pending, completed, failed).
  - Provides count summaries to track execution trends (a sketch of this endpoint follows the list).
- **Initiate Failed Actions**
  - Allows users to retry failed actions manually.
  - Ensures failed jobs can be reprocessed without waiting for an automated retry cycle.
- **Delete Actions**
  - Provides an endpoint to remove old or unnecessary actions.
  - Helps maintain a clean database by managing expired records.
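As an example, the status endpoint might look like this. A sketch assuming an Express-style router inside the Lambda and a queryActionsByStatus service helper, neither of which appears in the repository excerpts:

```typescript
import { Router, Request, Response } from "express";

const router = Router();

// GET /actions?status=FAILED — list actions in a given state, with a count.
router.get("/actions", async (req: Request, res: Response) => {
  const status = (req.query.status as string) || "PENDING";
  const actions = await queryActionsByStatus(status); // hypothetical service call
  res.json({ status, count: actions.length, actions });
});

export default router;
```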
2. Integrating with a Monitoring Dashboard
The data exposed by these endpoints can be visualized on a dashboard for real-time observability.
- Displays action counts by status to track performance.
- Allows users to manually retry or delete actions via an interface.
- Provides insights into system health and execution reliability.
Challenges in Building the Scheduled Actions System
Developing a reliable and scalable scheduled actions system comes with several challenges that need to be carefully addressed.
- **Race Conditions**
  - When multiple processes attempt to update or execute the same action simultaneously, inconsistencies can occur.
  - Proper locking mechanisms, deduplication, and FIFO queues help prevent duplicate execution (see the conditional-update sketch after this list).
- **Load Testing at Every Point of the Cycle**
  - The system must be tested under high load to ensure that scheduling, execution, retries, and fulfillment scale properly.
  - Testing covers database performance, SQS message handling, Lambda execution limits, and API response times.
- **Jobs Being Picked Up by Both Streams and the Scheduler**
  - Actions scheduled within 2 minutes of execution are processed by DynamoDB Streams, while others rely on EventBridge.
  - Without proper coordination, duplicate executions may occur. Ensuring actions transition correctly between pending, in-progress, and completed states prevents this issue.
- **Understanding Limitations of Tools**
  - Concurrency Handling: AWS Lambda scales automatically, but high concurrency can lead to throttling and delays in processing.
  - Lambda Runtime Limits: Since Lambda has a maximum execution time, long-running tasks must be broken into smaller executions or offloaded to a worker service.
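One concrete guard against the stream and the scheduler claiming the same action: make the PENDING → IN_PROGRESS transition a conditional write, so only one worker wins. A sketch with the AWS SDK v3 DynamoDB document client; the table name and key shape are assumptions:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Only one caller can move an action out of PENDING; the loser receives a
// ConditionalCheckFailedException and simply skips the action.
const claimAction = async (id: string): Promise<boolean> => {
  try {
    await docClient.send(
      new UpdateCommand({
        TableName: "scheduled-actions", // assumed table name
        Key: { id },
        UpdateExpression: "SET #status = :inProgress, updatedAt = :now",
        ConditionExpression: "#status = :pending",
        ExpressionAttributeNames: { "#status": "status" },
        ExpressionAttributeValues: {
          ":inProgress": "IN_PROGRESS",
          ":pending": "PENDING",
          ":now": Date.now(),
        },
      }),
    );
    return true;
  } catch (error: any) {
    if (error.name === "ConditionalCheckFailedException") return false;
    throw error;
  }
};
```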
Future Improvements
- **Webhooks for Real-Time Notifications**
  - Implementing webhooks would allow external services that schedule actions to receive real-time updates when an action executes, fails, or retries.
  - This reduces the need for polling and improves system responsiveness.
- **Handler for Missed Actions**
  - A dedicated handler to detect and process actions that are still in a PENDING state but have an execution time in the past.
  - This ensures no scheduled action is permanently missed due to system failures, delays, or scaling issues.
These improvements would enhance reliability, observability, and integration with external systems, making the scheduling system even more robust.
Conclusion: Scheduling, Scaling, and Sanity 🚀
Building a reliable, scalable, and fault-tolerant scheduled actions system isn’t just about setting up a cron job and hoping for the best—it’s about engineering resilience into every step of the process. From scheduling and execution to retries and observability, every component must work together to ensure that no action is lost, no notification is forgotten, and no subscription goes unmanaged.
Through this journey, we've tackled:
✅ Dynamic scheduling with DynamoDB Streams, EventBridge, and SQS.
✅ Precise execution using a mix of immediate and delayed actions.
✅ Reliable processing with deduplication, retries, and exponential backoff.
✅ Scalability by leveraging serverless architecture to handle high loads.
✅ Observability with APIs to monitor, retry, and delete scheduled actions.
Of course, every system has its quirks and challenges—race conditions, tool limitations, and unexpected failures—but with the right design patterns, defensive coding, and future improvements (like webhooks and missed action recovery), this system can evolve into an even more powerful, intelligent, and autonomous scheduler.
At the end of the day, automation is about making life easier—whether it’s managing user subscriptions, sending notifications, or processing time-sensitive transactions. And while computers never sleep, we certainly need to, which is why designing a system that can handle its own problems before waking us up at 3 AM is always worth the effort.
So here’s to building systems that work while we don’t! 🎉