Slack Rate Limiting with SQS FIFO and Lambda
Overview

This document explains how to manage the Slack API rate limit (1 message per second per channel) in a system that sends messages to a small number of Slack channels (2~5) using AWS SQS FIFO and Lambda. About 5000 tasks publish the messages, and the approach handles them efficiently without DynamoDB.
Architecture

Components

- SQS FIFO Queue (`slack-queue.fifo`)
  - The 5000 tasks publish messages directly to the queue.
  - `MessageGroupId` is set to the Slack channel ID to guarantee per-channel ordering.
- Lambda
  - Dequeues messages from the SQS FIFO queue.
  - Sends each message to the Slack API and re-queues it on a 429 error.
Code

1. SQS FIFO Publishing (Fargate Task)
```python
import boto3
import json
import os
import uuid

sqs = boto3.client('sqs')

def publish_messages(channels, title, message):
    queue_url = os.environ['SQS_QUEUE_URL']  # slack-queue.fifo
    for channel in channels:
        sqs.send_message(
            QueueUrl=queue_url,
            # The body fields match what the Lambda's SlackPayload.from_body expects.
            MessageBody=json.dumps({"channel": channel, "title": title, "body": message}),
            MessageGroupId=channel,  # serializes delivery per channel
            # FIFO queues need a deduplication ID unless ContentBasedDeduplication is enabled.
            MessageDeduplicationId=str(uuid.uuid4()),
        )

if __name__ == "__main__":
    channels = ['channel1', 'channel2', 'channel3']
    publish_messages(channels, "Fargate Test", "Test message from Fargate")
```
2. Lambda (SQS FIFO → Slack with Rate Limit Handling)
```python
from dataclasses import dataclass
import logging
import json
import os

from slack_sdk.webhook.client import WebhookClient
from slack_sdk.http_retry.builtin_handlers import (
    ConnectionErrorRetryHandler,
    RateLimitErrorRetryHandler,
    ServerErrorRetryHandler,
)
import boto3

logger = logging.getLogger()
logger.setLevel("INFO")

# Not referenced by the handler below; the SQS trigger delivers and deletes messages.
sqs = boto3.client("sqs")
queue_url = "slack-queue.fifo"  # batch 1

WORKSPACE_HOOK_URL = os.environ.get("WORKSPACE_HOOK_URL")
MAX_RETRY_COUNT = 3

client = WebhookClient(
    url=WORKSPACE_HOOK_URL,
    retry_handlers=[
        ConnectionErrorRetryHandler(max_retry_count=MAX_RETRY_COUNT),
        RateLimitErrorRetryHandler(max_retry_count=MAX_RETRY_COUNT),  # 429
        ServerErrorRetryHandler(max_retry_count=MAX_RETRY_COUNT),  # 500, 503
    ],
    logger=logger,
)


@dataclass
class SlackPayload:
    channel: str
    title: str
    body: str
    username: str
    icon_emoji: str
    extra_blocks: list

    @classmethod
    def from_body(cls, body: dict):
        return cls(
            channel=body["channel"],
            title=body["title"],
            body=body["body"],
            username=body.get("username", "NotificationBot"),
            icon_emoji=body.get("icon_emoji", ":satellite:"),
            extra_blocks=body.get("extra_blocks", []),
        )

    def _build_blocks(self):
        blocks = [
            {"type": "divider"},
            {"type": "section", "text": {"type": "mrkdwn", "text": f"*{self.title}*"}},
            {"type": "section", "text": {"type": "mrkdwn", "text": self.body}},
        ]
        if self.extra_blocks:
            blocks.extend(self.extra_blocks)
        blocks.append({"type": "divider"})
        return blocks

    def to_dto(self):
        return {
            "channel": self.channel,
            "username": self.username,
            "icon_emoji": self.icon_emoji,
            "blocks": self._build_blocks(),
        }


def handle_event(record):
    body = json.loads(record["body"])
    if body is None:
        logger.error("Empty body received")
        # Dequeue event when None
        return
    payload = SlackPayload.from_body(body)
    logger.info(payload)
    response = client.send_dict(
        headers={"Content-Type": "application/json;charset=utf-8"},
        body=payload.to_dto(),
    )
    if response.status_code != 200:
        logger.error(f"Webhook failed: {response.status_code} - {response.body}")
        raise Exception(f"Webhook failed with status {response.status_code}")
    logger.info(f"Message sent to {payload.channel}")


def lambda_handler(event, context):
    """
    Lambda handler for processing SQS FIFO queue events and sending Slack messages.

    Intent:
    - This function processes records from an SQS FIFO queue with BatchSize=1. Each invocation
      handles a single message, which is acknowledged (ACKed) and deleted from the queue only
      when the function returns {"statusCode": 200}. If an exception occurs, the function fails,
      and the single message remains in the queue (NACKed) for retry after VisibilityTimeout.

    Behavior:
    - Success: "When your function successfully processes a batch, Lambda deletes
      its messages from the queue."
    - Failure: "If your function fails to process a batch (by throwing an exception or timing out),
      Lambda retries the entire batch until it succeeds or the messages expire
      and are sent to a dead-letter queue (DLQ), if configured."

    Relevant AWS Documentation:
    - [SQS-Lambda Integration](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html)
    - [FIFO Queue Behavior](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/FIFO-queues.html)
      - Ensures message order within the same MessageGroupId and supports parallel processing.

    Notes:
    - BatchSize=1 ensures failures affect only the individual message, not a group of messages.
    """
    logger.info(f"Received event: {json.dumps(event)}")
    records = event.get("Records", [])
    if not records:
        logger.warning("No records found in event")
        return {"statusCode": 200, "body": "No records to process"}
    for record in records:  # With BatchSize=1, this loop processes exactly one record
        handle_event(record)
    return {"statusCode": 200, "body": "Processed all records"}
```
Configuration

SQS FIFO Queue

- Name: `slack-queue.fifo`
- VisibilityTimeout: 5 seconds (enough time to process a single message)
- Delivery Delay: 0 seconds (no queue-level delay needed)
- WaitTimeSeconds: 20 seconds (long polling to minimize empty receives)
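A minimal boto3 sketch that creates the queue with exactly these attributes. Enabling `ContentBasedDeduplication` here is an assumption, added so publishers may omit an explicit `MessageDeduplicationId`.

```python
import boto3

sqs = boto3.client("sqs")

# Create the FIFO queue with the attributes listed above.
sqs.create_queue(
    QueueName="slack-queue.fifo",
    Attributes={
        "FifoQueue": "true",
        "VisibilityTimeout": "5",
        "DelaySeconds": "0",
        "ReceiveMessageWaitTimeSeconds": "20",
        # Assumption: lets senders skip MessageDeduplicationId if they want to.
        "ContentBasedDeduplication": "true",
    },
)
```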
Lambda

- Trigger: `slack-queue.fifo`
- Batch Size: 1 (one message per invocation)
- Environment Variables:
  - `WORKSPACE_HOOK_URL`: Slack webhook URL used by the Lambda's WebhookClient
  - `SQS_QUEUE_URL`: URL of `slack-queue.fifo`, used by the Fargate publisher
- IAM Permissions:
  - `sqs:SendMessage`, `sqs:ReceiveMessage`, `sqs:DeleteMessage`
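Wiring the queue to the function with batch size 1 is an event source mapping. A boto3 sketch, in which the function name and queue ARN are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Connect the FIFO queue to the Lambda function, one message per invocation.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:slack-queue.fifo",  # placeholder ARN
    FunctionName="slack-sender",  # placeholder function name
    BatchSize=1,
)
```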
Rate Limit Management

How It Works

- SQS FIFO serialization:
  - `MessageGroupId` is set to the channel ID, so messages for the same channel are processed sequentially.
  - Different channels are processed in parallel, so 5 channels allow up to 5 messages per second.
- Lambda processing:
  - Batch size 1 means one message is dequeued per invocation.
  - Slack API calls return well within 1 second (200~500 ms), so 2~3 calls per second per channel can be attempted.
- 429 handling:
  - When Slack returns 429 (Too Many Requests), the message is retried after the `Retry-After` header value (typically 1 second); a manual handling sketch follows this list.
  - The 1 message/second limit is respected without losing any messages.
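The Lambda above delegates this to slack_sdk's `RateLimitErrorRetryHandler`. For illustration only, a minimal manual sketch of the same idea, assuming the `WebhookResponse` object exposes the HTTP response headers; raising makes the invocation fail, so SQS redelivers the message after VisibilityTimeout while FIFO ordering for that channel's `MessageGroupId` is preserved:

```python
def send_with_manual_429_handling(payload: SlackPayload):
    # Send without relying on the built-in retry handler.
    response = client.send_dict(body=payload.to_dto())
    if response.status_code == 429:
        # Slack signals the required backoff (in seconds) via the Retry-After header.
        retry_after = int(response.headers.get("Retry-After", "1"))
        logger.warning(f"Rate limited; Slack asked for a {retry_after}s backoff")
        # Raising NACKs the message, so SQS redelivers it after VisibilityTimeout.
        raise Exception("Rate limited by Slack; the message will be retried")
    return response
```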
Performance

- Throughput: 5 channels → 5 messages per second.
- Total processing time: 25,000 messages → about 5,000 seconds (~83 minutes); the arithmetic is sketched below.
- Minimal 429s: only the excess calls (1~2 of the 2~3 attempted per second) trigger a retry, which is handled efficiently.
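The time estimate follows from simple arithmetic, assuming each of the 5000 tasks posts to all 5 channels (which is what the 25,000-message figure implies):

```python
tasks = 5000
channels = 5
messages = tasks * channels         # 25,000 messages in total (assumed fan-out)
throughput = channels * 1           # 1 msg/sec per channel -> 5 msg/sec overall
total_seconds = messages / throughput
print(total_seconds, total_seconds / 60)  # 5000.0 seconds, ~83 minutes
```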
Considerations

- Slack policy: frequent 429s should be avoided, but an occasional one is not a problem.
- Efficiency: relying on `Retry-After` keeps delays minimal, balancing stability and performance.
- Scalability: more channels may increase the 429 frequency, so monitoring is needed (one possible approach is sketched below).
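One way to make the 429 frequency visible is a custom CloudWatch metric emitted whenever Slack rate-limits a call. This is only a sketch; the namespace, metric name, and the idea of calling it from `handle_event` are assumptions, not part of the original setup:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_rate_limit(channel: str):
    # Emit a custom metric so 429 frequency can be graphed and alarmed on.
    # Namespace and metric name are example values.
    cloudwatch.put_metric_data(
        Namespace="SlackNotifier",
        MetricData=[
            {
                "MetricName": "SlackRateLimited",
                "Dimensions": [{"Name": "Channel", "Value": channel}],
                "Value": 1,
                "Unit": "Count",
            }
        ],
    )
```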
Conclusion

With SQS FIFO and Lambda, Slack rate limits can be managed effectively without DynamoDB. `MessageGroupId` guarantees per-channel serialization, and 429 errors are handled with `Retry-After`-based retries, so the 1 message/second limit is respected. This setup is stable and efficient for 5000 tasks and 2~5 channels.
Lambda Layer (slack_sdk)

- Generate the layer:

```bash
mkdir python
cd python
pip install slack_sdk -t .
cd ..
zip -r slack_sdk_layer.zip python
```

- Upload it to Lambda: Lambda → "Layers" → "Create layer".