TLDR: I explored three serverless deployment strategies for GenAI applications on AWS. Check out this GitHub repository for the comparison and the deployment strategies: https://github.com/subeshb1/aws-gen-ai-lambda-deployment-strategies
It’s the start of 2025, GenAI hype is at its peak, and I wanted to jump on the train. Having built with serverless for a while, I was exploring ways to deploy a simple GenAI application using serverless in AWS.
So, I decided to embark on an experiment. I pitted three deployment strategies against each other to see which would triumph: WebSocket streaming, Server-Sent Events (SSE) using Function URLs, and good ol’ REST APIs. Think of it as a reality show for serverless architectures: “Lambda vs Lambda vs Lambda”. Who will take the crown? Let’s find out.
The Three Contestants
Contestant #1: WebSocket
WebSocket is the real-time communication champion. Built for bidirectional, persistent connections, it’s perfect for token-by-token streaming with minimal latency.
How It Works for LLMs:
Persistent Connection: Establishes a WebSocket connection via API Gateway, maintaining a live channel between the client and server.
Data Flow: Tokens are streamed in real-time as they are generated by the LLM, enabling a seamless, interactive user experience.
Serverless Integration: API Gateway routes messages to AWS Lambda, which processes requests and streams LLM responses back over the connection (see the handler sketch after the challenges below).
Advantages:
Low Latency: Real-time token updates ensure instant user feedback.
Interactive Applications: Ideal for live chat interfaces or collaborative tools requiring back-and-forth communication.
Scalable: API Gateway dynamically handles thousands of concurrent WebSocket connections.
Challenges:
State Management: Requires tracking connection IDs using a stateful solution like DynamoDB.
Cost Considerations: Persistent connections incur higher charges based on connection duration and message volume.
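To make the flow concrete, here is a minimal sketch of what the WebSocket-side Lambda could look like. It reuses the GenAIService.streamResponse generator shown later in this post; the import path, environment variable, and message shape are illustrative rather than the exact ones in the repo:

import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
} from '@aws-sdk/client-apigatewaymanagementapi';
import { APIGatewayProxyWebsocketHandlerV2 } from 'aws-lambda';
import { GenAIService } from './gen-ai-service'; // illustrative path to the service shown below

const genAIService = new GenAIService();

export const handler: APIGatewayProxyWebsocketHandlerV2 = async (event) => {
  const connectionId = event.requestContext.connectionId;
  const { prompt } = JSON.parse(event.body ?? '{}');

  // The management API endpoint points at the WebSocket stage,
  // e.g. https://{apiId}.execute-api.{region}.amazonaws.com/{stage}.
  const client = new ApiGatewayManagementApiClient({
    endpoint: process.env.WEBSOCKET_API_ENDPOINT, // assumed to be injected at deploy time
  });

  // Push each token back over the open connection as soon as it is generated.
  for await (const token of genAIService.streamResponse({ prompt })) {
    await client.send(
      new PostToConnectionCommand({
        ConnectionId: connectionId,
        Data: Buffer.from(JSON.stringify({ token })),
      })
    );
  }

  return { statusCode: 200, body: 'Done' };
};

A complete setup would also handle the $connect and $disconnect routes, typically persisting connection IDs in DynamoDB, which is where the state-management overhead mentioned above comes from.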
Contestant #2: Server-Sent Events using Function URLs
SSE offers a lightweight, unidirectional streaming solution that’s simple to implement.
With the introduction of AWS Lambda Function URLs, you can create efficient endpoints for sequential data delivery. Notably, in the serverless ecosystem, Lambda response streaming for LLM output is only supported via Function URLs.
How It Works for LLMs:
Simple Streaming: A client makes a GET request to a Lambda Function URL, and tokens are streamed incrementally as they’re generated.
Unidirectional Flow: Tokens flow from server to client without the need for bidirectional communication.
Lambda Function URLs: These provide a built-in HTTPS endpoint for Lambda functions, removing the need for API Gateway. With support for response streaming, they enable efficient delivery of tokenized LLM outputs in real time (see the handler sketch after the challenges below).
Advantages:
Ease of Use: It requires minimal setup; no state management is needed.
Cost-Effective: Only incurs Lambda execution charges; no persistent connection fees.
Browser Compatibility: Built-in browser support makes it easy to integrate.
Streaming Support: Lambda Function URLs uniquely enable response streaming, making them indispensable for SSE-based LLM streaming.
Challenges:
Limited to Node.js: Streaming is currently supported only on Node.js runtimes.
Limited AWS Integration: Function URLs don't integrate with other AWS services as richly as API Gateway does; for example, fronting them with CloudFront and origin access control is recommended for better security.
Unidirectional Only: Not suitable for use cases requiring client-to-server communication.
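For a rough idea of the Function URL side, here is a sketch of a streaming handler built on the awslambda.streamifyResponse wrapper that the Node.js Lambda runtime provides when response streaming is enabled. It reuses the GenAIService.streamResponse generator shown later in this post; the import path and event handling are illustrative:

import { GenAIService } from './gen-ai-service'; // illustrative path to the service shown below

// `awslambda` is a global that the Node.js Lambda runtime provides when
// response streaming is enabled; declare it so TypeScript compiles.
declare const awslambda: {
  streamifyResponse: (
    handler: (event: any, responseStream: NodeJS.WritableStream) => Promise<void>
  ) => unknown;
};

const genAIService = new GenAIService();

export const handler = awslambda.streamifyResponse(async (event, responseStream) => {
  const prompt = event.queryStringParameters?.prompt ?? '';

  // In a real handler you would also set Content-Type: text/event-stream
  // (for example via awslambda.HttpResponseStream.from) before writing.
  // Emit each token as a Server-Sent Events frame as soon as Bedrock returns it.
  for await (const token of genAIService.streamResponse({ prompt })) {
    responseStream.write(`data: ${JSON.stringify({ token })}\n\n`);
  }

  responseStream.write('data: [DONE]\n\n');
  responseStream.end();
});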
Contestant #3: REST API
REST API is the veteran of the serverless ecosystem. While it lacks streaming capabilities, it excels at one-shot LLM queries.
How It Works for LLMs:
Request/Response Model: The client sends a request, and the server processes it and returns the complete response.
Serverless Integration: API Gateway invokes AWS Lambda to handle the business logic and return the result (see the handler sketch after the challenges below).
Advantages:
Simplicity: Easy to implement and widely supported.
Stateless Design: Each request is independent, simplifying scaling.
Cost-Efficiency: Ideal for single queries without maintaining persistent connections.
Challenges:
No Streaming: Responses are delivered only after the entire request is processed.
Latency: Users must wait for the full LLM response, which can be slow for large outputs.
Timeouts: API Gateway's integration timeout is 29 seconds by default, making it unsuitable for long-running requests.
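For comparison, a REST-style handler behind API Gateway simply drains the stream and returns everything at once. A minimal sketch, again reusing the GenAIService shown later in this post (import path illustrative):

import { APIGatewayProxyHandler } from 'aws-lambda';
import { GenAIService } from './gen-ai-service'; // illustrative path to the service shown below

const genAIService = new GenAIService();

export const handler: APIGatewayProxyHandler = async (event) => {
  const { prompt } = JSON.parse(event.body ?? '{}');

  // Drain the generator: the client sees nothing until the whole response is ready.
  let text = '';
  for await (const token of genAIService.streamResponse({ prompt })) {
    text += token;
  }

  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  };
};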
The Experiment Setup
Here’s where things got spicy. I designed an architecture to test all three approaches side-by-side, using Amazon Bedrock as the LLM backend. My setup included:
AWS Lambda: Core compute service for handling requests.
Amazon API Gateway: For WebSocket and REST communication.
Lambda Function URL: For SSE endpoints.
Amazon S3 and CloudFront: For serving static assets and proxying endpoints.
Amazon Bedrock: Backend for LLM operations using Claude 3.5 Haiku.
AWS CDK: Infrastructure-as-Code for deployment (a small sketch follows this list).
React: Frontend framework.
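To give a feel for the CDK wiring, here is a sketch of the SSE piece: a Node.js function with Bedrock streaming permissions and a Function URL configured for response streaming. Construct IDs, file paths, and the wide-open auth and permissions are illustrative, not the exact setup in the repo:

import * as path from 'path';
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';

export class GenAiSseStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Node.js handler that streams tokens back as Server-Sent Events.
    const sseFunction = new NodejsFunction(this, 'SseStreamingFunction', {
      entry: path.join(__dirname, '../lambda/sse-handler.ts'), // illustrative path
      runtime: lambda.Runtime.NODEJS_20_X,
      timeout: Duration.minutes(5),
    });

    // The Converse streaming API needs this Bedrock permission.
    sseFunction.addToRolePolicy(
      new iam.PolicyStatement({
        actions: ['bedrock:InvokeModelWithResponseStream'],
        resources: ['*'], // scope this down to the model ARN in a real deployment
      })
    );

    // RESPONSE_STREAM is what enables token-by-token delivery over the Function URL.
    sseFunction.addFunctionUrl({
      authType: lambda.FunctionUrlAuthType.NONE, // demo only; add auth or CloudFront OAC in production
      invokeMode: lambda.InvokeMode.RESPONSE_STREAM,
    });
  }
}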
The Architecture Diagram
Imagine a flowchart so glorious that even an AWS Solutions Architect would weep tears of joy. (Okay, maybe not, but it’s neat.)
Here’s the GitHub repository you can deploy and play around with:
GitHub repo: https://github.com/subeshb1/aws-gen-ai-lambda-deployment-strategies
Invoking the LLM
Here’s a snippet of how the Lambda invokes the LLM using Bedrock (imports and type definitions included so it stands on its own):
import {
  BedrockRuntimeClient,
  ConverseStreamCommand,
} from '@aws-sdk/client-bedrock-runtime';

export interface GenAIRequest {
  prompt: string;
  maxTokens?: number;
  temperature?: number;
}

// Response and error shapes used by streamResponse below.
export interface GenAIResponse {
  text: string;
  usage: {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
  };
}

export interface GenAIError {
  message: string;
  code: string;
  statusCode: number;
}

export class GenAIService {
  private readonly client: BedrockRuntimeClient;

  constructor() {
    this.client = new BedrockRuntimeClient({ region: 'us-west-2' });
  }

  // Streams the model's answer token by token and returns the full text
  // (plus rough usage numbers) once the stream is exhausted.
  async *streamResponse(
    request: GenAIRequest
  ): AsyncGenerator<string, GenAIResponse, unknown> {
    try {
      const command = new ConverseStreamCommand({
        modelId: 'anthropic.claude-3-5-haiku-20241022-v1:0',
        messages: [
          {
            role: 'user',
            content: [
              {
                text: request.prompt,
              },
            ],
          },
        ],
      });

      const response = await this.client.send(command);

      if (!response.stream) {
        throw new Error('No response stream received');
      }

      let totalText = '';
      for await (const event of response.stream) {
        if (event.contentBlockDelta?.delta?.text) {
          const textDelta = event.contentBlockDelta.delta.text;
          totalText += textDelta;
          yield textDelta;
        }
      }

      return {
        text: totalText.trim(),
        usage: {
          // Character counts used as a rough stand-in for token counts.
          promptTokens: request.prompt.length,
          completionTokens: totalText.length,
          totalTokens: request.prompt.length + totalText.length,
        },
      };
    } catch (error) {
      const apiError: GenAIError = {
        message: (error as Error).message || 'Failed to generate AI response',
        code: 'GENAI_ERROR',
        statusCode: 500,
      };
      throw apiError;
    }
  }
}
Usage:
const genAIService = new GenAIService();

for await (const token of genAIService.streamResponse({ prompt: request.prompt })) {
  // Forward each token to the client (WebSocket message, SSE frame, or buffer it for REST).
}
Battle Results
Performance Comparison
First Chunk
🥇 WebSocket: Tokens started flowing quickly, making it the fastest option for initial responses.
🥈 SSE: Slightly slower than WebSocket for the first chunk but offered consistent performance overall.
🥉 REST: Delivered the first response only after processing was complete, resulting in the longest wait time.
Average Latency
🥇 SSE: Consistently offered the lowest latency due to its efficient incremental delivery.
🥈 WebSocket: While rapid overall, its average latency increased over time as more chunks were streamed.
🥉 REST: The slowest option, as it delivers results only after full processing.
Total Duration
🥇 SSE: Fastest total duration, while still delivering tokens incrementally along the way.
🥈 REST: Although it can't stream, its total duration was on par with SSE's.
🥉 WebSocket: Took slightly longer than SSE because of its chunked, message-by-message delivery.
Complexity
🥇 REST: The most straightforward, making it ideal for quick prototypes.
🥈 SSE: Simple to set up, with no state management and native browser support.
🥉 WebSocket: The most involved of the three, requiring additional effort for connection tracking and state management.
The Winner and Why
🎉 Drumroll, please… 🥇 The crown goes to WebSocket (with a caveat).
For real-time applications and streaming LLM responses, WebSocket stands out as the top choice. However, if simplicity and browser compatibility are priorities, SSE (Server-Sent Events) emerges as a strong alternative.
Key Takeaways:
WebSocket: Ideal for performance-intensive, real-time use cases.
SSE: Perfect for incremental LLM responses when simplicity is preferred over the complexity of WebSocket. As runtime support for response streaming expands and AWS services better integrate this feature, SSE with Function URLs might become my go-to solution.
REST: Best for proof-of-concept stages. It’s simple, flexible, and covers most needs. I started with REST and transitioned to WebSocket once the concept was validated.
Practical Tips:
Authentication and Security: The examples were built without any authentication. Make sure your application is properly authenticated and secured before exposing it.
Optimize for Cost: Keep your Lambda function duration in check, especially for the streaming approaches, and consider batching tokens before sending them (a small sketch follows this list).
Monitor Everything: Use CloudWatch to track metrics like latency, error rates, and invocation counts.
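As one way to apply the batching idea, here is a small hypothetical helper that buffers streamed tokens and flushes them in groups, trading a little responsiveness for fewer WebSocket messages or stream writes:

// Hypothetical helper: groups streamed tokens into batches before sending,
// reducing the number of WebSocket messages or stream writes.
export async function forEachBatch(
  tokens: AsyncIterable<string>,
  batchSize: number,
  send: (chunk: string) => Promise<void>
): Promise<void> {
  let buffer: string[] = [];
  for await (const token of tokens) {
    buffer.push(token);
    if (buffer.length >= batchSize) {
      await send(buffer.join(''));
      buffer = [];
    }
  }
  if (buffer.length > 0) {
    await send(buffer.join('')); // flush whatever is left at the end
  }
}

Calling it with a batch size of 10, for example, cuts the number of PostToConnection calls roughly tenfold while the user still sees steady progress.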
Future Possibilities:
I’m diving deeper into building GenAI applications. Topics I’m exploring include:
1. Building Autonomous Agents
- Implementing complex conversation flows
- Managing state across multiple Lambda functions
- Optimizing token usage
2. RAG Deployments
- Vector database integration
- Document processing pipelines
- Hybrid search approaches
3. Advanced Features
- Multi-model orchestration
- Fallback strategies
- Cost optimization techniques
Join the Discussion
What’s your experience with serverless GenAI deployments? I’m particularly interested in:
Your preferred deployment strategy and why
Challenges you’ve faced with LLM response streaming
Cost optimization techniques you’ve discovered
Connect with me:
GitHub: subeshb1
LinkedIn: Subesh Bhandari
Leave a ⭐ on the repository if you found this helpful!