Looking at the AWS ecosystem of serverless services, AWS Step Functions is one of my personal favorites. I recently had a chat with some colleagues about a potential use case for Step Functions instead of AWS Lambda. While we discussed the general concept of AWS Step Functions, one of my beloved colleagues argued for AWS Lambda along these lines:
Let us use AWS Lambda because a workflow described as a state machine sounds like it is much slower.
I could neither substantiate nor contradict this statement. So I started to examine the original assumption "Step Functions is slower than Lambda" with facts. Time for a benchmark!
For me the results were crystal clear 😆
Just kidding! Let us first get a common understanding of what AWS Step Functions and AWS Lambda are. If you are already familiar with these services, you can jump right into the section about the test setup and results.
By the way: the source code is also available for you on GitHub.
🤹 What is AWS Step Functions?
AWS Step Functions was launched in 2016 as a serverless orchestration service. I think the following definition explains very well what kind of problems AWS Step Functions solves:
Step Functions is a serverless orchestration service that lets you combine […] AWS services to build business-critical applications. Through Step Functions’ graphical console, you see your application’s workflow as a series of event-driven steps.
Step Functions is based on state machines and tasks. A state machine is a workflow. A task is a state in a workflow that represents a single unit of work that another AWS service performs. Each step in a workflow is a state.
Source: What is AWS Step Functions? - AWS Step Functions
State machines can be invoked both asynchronously and synchronously. Step Functions itself offers several ways to invoke your state machine, for example:
- via an explicit StartExecution call using your favourite AWS SDK,
- on each HTTP request hitting your Amazon API Gateway,
- as a destination in your Amazon EventBridge event bus
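The first option, an explicit StartExecution call, can be sketched as follows. This is a minimal sketch that assumes the AWS SDK for JavaScript v2, the same SDK the Lambda function later in this post uses. The state machine ARN and the helper names `buildStartExecutionParams` and `startExecution` are made up for illustration. The client is passed in as an argument so that the snippet itself does not depend on an installed SDK.

```javascript
// Build the request for a StartExecution call.
// The API expects `input` as a JSON string, so the payload is serialized here.
function buildStartExecutionParams(stateMachineArn, payload, name) {
  const params = {
    stateMachineArn,
    input: JSON.stringify(payload),
  };
  if (name) {
    params.name = name; // optional execution name
  }
  return params;
}

// `client` is assumed to be an AWS.StepFunctions instance (SDK v2),
// for example: new (require("aws-sdk").StepFunctions)()
async function startExecution(client, stateMachineArn, payload, name) {
  const params = buildStartExecutionParams(stateMachineArn, payload, name);
  const result = await client.startExecution(params).promise();
  return result.executionArn;
}

// Hypothetical ARN and payload, for illustration only.
const exampleParams = buildStartExecutionParams(
  "arn:aws:states:eu-central-1:123456789012:stateMachine:ExampleStateMachine",
  { orderId: 42 }
);
console.log(exampleParams);
```

StartExecution returns as soon as the execution has been started, so it covers the asynchronous case. It does not wait for the workflow to finish.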
Typical use cases for AWS Step Functions cover data processing, machine learning, microservices orchestration, or governance and security automation. Since the launch of the AWS SDK service integrations, you can use out-of-the-box integrations with every service that is supported by the AWS SDK. This opens up a huge number of new opportunities to integrate with AWS services without writing a single line of code.
While creating a new state machine, you can decide between two workflow types named “Standard” and “Express”. Each type has its own characteristics and strengths. While Standard workflows are a good fit for long-running workflows, Express workflows are a good fit for high-traffic workloads, data streaming, or mobile application backends.
⚡️ What is AWS Lambda?
Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring and logging. With Lambda, you can run code for virtually any type of application or backend service.
Source: https://docs.aws.amazon.com/lambda/latest/dg/welcome.html
Don’t get me wrong, I am also a big fan of AWS Lambda. But since AWS announced the game-changing SDK service integrations for Step Functions, I have started to think more about what the typical use cases for AWS Lambda are, so that in the future I use AWS Lambda more for the things it is amazing at.
Or to quote Eric Johnson at the serverless office hours:
Use Lambda to transform not to transport
⏰ Benchmarking latencies
The goal of this benchmark is not to say that service A is better/worse than service B. Each service has its strengths and weaknesses.
What we want to achieve is a better understanding of the latencies we can measure for AWS Step Functions and how they compare to a similar integration based on AWS Lambda.
General setup
We want to measure the time it takes to write data to and read data from Amazon S3, both from a state machine and from an AWS Lambda function.
We test the behavior in two different versions. Version 1 simply writes to S3. Version 2 extends this by executing a GetObject operation afterwards. The code of the Lambda function is written in JavaScript; the listing below shows version 2.
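// Instrument the AWS SDK with X-Ray so that all S3 calls appear in the trace.
const AWSXRay = require("aws-xray-sdk-core");
const AWS = AWSXRay.captureAWS(require("aws-sdk"));

// Create the client once, outside the handler, so warm invocations reuse it.
const S3 = new AWS.S3();
const bucketName = process.env.DestinationBucketName;

exports.lambdaHandler = async (event, context) => {
  try {
    console.log("EVENT: " + JSON.stringify(event));
    // Use the API Gateway request ID as a unique object key.
    const key = "lambda/" + event.requestContext.requestId;

    // Version 1: write the object.
    await S3.putObject({
      Bucket: bucketName,
      Key: key,
      Body: new Date().toISOString(),
    }).promise();

    // Version 2 only: read the object back.
    await S3.getObject({
      Bucket: bucketName,
      Key: key,
    }).promise();

    return {
      statusCode: 200,
      isBase64Encoded: false,
    };
  } catch (err) {
    console.error(err);
    // Return a well-formed response instead of the raw error object.
    return {
      statusCode: 500,
      isBase64Encoded: false,
    };
  }
};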
The state machine workflow is similarly straightforward and chains the same Amazon S3 calls as the AWS Lambda function.
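To make this more concrete, here is a sketch of what such a definition could look like in Amazon States Language, written as a JavaScript object so that it can be serialized with `JSON.stringify`. This is not the definition from the repository: the state names, the `stepfunctions/` key prefix, and the `buildDefinition` helper are assumptions. It relies on the AWS SDK service integrations (`arn:aws:states:::aws-sdk:s3:...`) and on the context object (`$$`) to derive a unique key per execution.

```javascript
// Build an Amazon States Language definition that writes an object to S3
// and reads it back, using the AWS SDK service integrations.
function buildDefinition(bucketName) {
  // Derive the object key from the execution name (hypothetical prefix).
  const keyExpression = "States.Format('stepfunctions/{}', $$.Execution.Name)";

  return {
    Comment: "Sketch: write to S3, then read the object back",
    StartAt: "PutObject",
    States: {
      PutObject: {
        Type: "Task",
        Resource: "arn:aws:states:::aws-sdk:s3:putObject",
        Parameters: {
          Bucket: bucketName,
          "Key.$": keyExpression,
          // Use the time the state was entered as the object body.
          "Body.$": "$$.State.EnteredTime",
        },
        ResultPath: null, // discard the task result, keep the state input
        Next: "GetObject",
      },
      GetObject: {
        Type: "Task",
        Resource: "arn:aws:states:::aws-sdk:s3:getObject",
        Parameters: {
          Bucket: bucketName,
          "Key.$": keyExpression,
        },
        End: true,
      },
    },
  };
}

// Hypothetical bucket name, for illustration only.
const definitionJson = JSON.stringify(buildDefinition("example-bucket"), null, 2);
console.log(definitionJson);
```

For version 1 of the experiment, the `PutObject` state would simply end the workflow (`End: true`) instead of pointing to `GetObject`.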
Both the AWS Lambda function and the state machine can be invoked via Amazon API Gateway. All experiments are triggered using Apache Bench with the following parameters:
ab -n 15000 -c 1 https://hash.execute-api.eu-central-1.amazonaws.com/Prod/invoke-lambda/
-n configures the total number of requests that are triggered, in our case 15,000.
-c is the number of concurrent requests, in our setup 1.
I decided to use this setting because I want to generate a moderate stream of load for both integrations.
X-Ray is activated on all integration layers so that we get a complete trace from API Gateway down to S3.
Experiment 1 - Writing to S3
The first experiment focuses only on the execution of a PutObject without reading the files afterwards. The automatic Amazon CloudWatch dashboards for AWS Lambda, Amazon API Gateway and AWS Step Functions are a good starting point and give us valuable insights.
Let us start by analyzing the Apache Bench reports. The complete reports are available on GitHub. Here are some highlights:
- The state machine processed all requests 539 seconds faster than the Lambda function.
- The state machine processed 2.07 more requests per second.
- The mean time per request for the state machine is 35.92 ms lower than for the Lambda-based integration.
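These numbers are related. With a concurrency of 1, requests run strictly one after another, so the total duration is roughly the number of requests multiplied by the mean time per request. The following sketch, with a made-up helper name `meanDifferenceMs`, uses only the figures reported above to show that the total difference and the per-request difference are consistent with each other.

```javascript
// With `-c 1`, total time is approximately requests × mean time per request.
// Dividing the total time difference by the number of requests therefore
// gives the approximate difference in mean time per request.
function meanDifferenceMs(totalDifferenceSeconds, requestCount) {
  return (totalDifferenceSeconds / requestCount) * 1000;
}

const requestCount = 15000;

// 539 seconds is the total time difference reported for experiment 1.
const experiment1 = meanDifferenceMs(539, requestCount);
console.log(experiment1.toFixed(2) + " ms"); // about 35.93 ms
```

The result matches the reported 35.92 ms up to rounding of the total duration.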
API Gateway latencies
A closer look at the Amazon CloudWatch dashboard underlines what Apache Bench tells us. Over the complete length of the benchmark, the average latency of Step Functions is consistently below that of AWS Lambda.
Both integration types show a drop in latencies, which indicates some kind of cold start behavior. On average, the drop is more significant for Step Functions than for AWS Lambda.
When we take a closer look into the 99th percentile, we see some more spikes but in general a similar result over time.
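Why look at the 99th percentile in addition to the average? A few slow requests barely move the mean but dominate the tail. The following sketch uses the nearest-rank method on made-up latency samples. CloudWatch computes its percentile statistics itself, so this is only meant to illustrate the idea, and the sample values are not benchmark results.

```javascript
function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical latencies in ms: 98 fast requests and 2 slow outliers.
const samples = [...Array(98).fill(100), 900, 1000];

console.log("mean:", mean(samples)); // 117
console.log("p50:", percentile(samples, 50)); // 100
console.log("p99:", percentile(samples, 99)); // 900
```

In this made-up data set, two outliers raise the mean only from 100 ms to 117 ms, while the 99th percentile jumps to 900 ms.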
State machine and AWS Lambda function execution
Let us now jump into the next integration layer and take a look at the duration of the AWS Lambda function and the state machine itself. Not very surprisingly, the state machine is much faster: in the end, the difference is roughly 60% compared to the duration of the Lambda function.
The AWS Lambda function runs with the default memory setting of 128 MB and the default timeout of 3 seconds. Depending on the concrete use case, fine-tuning your memory settings might have a significant impact on the Lambda metrics.
Downstream service latencies
I was very surprised to see that the connection between Step Functions and S3 seems to be much more efficient. Looking at our X-Ray service map and traces, the average latency between Lambda and S3 is 63 ms, compared to 28 ms for the integration with Step Functions. It may be a coincidence that the relative difference is also almost 60%. Or it might reveal that Step Functions does some optimization in how it handles the AWS client SDK under the hood.
Experiment 2 - Write and read from S3
I was interested to know whether the amount of work a state machine has to cover impacts latencies and execution times compared to my AWS Lambda function. Hence we extended our experiment to also read data from S3 after writing it.
Again, let us first check the report from Apache Bench:
- The state machine processed all requests 1287 seconds faster than the Lambda function.
- The state machine processed 3.01 more requests per second.
- The mean time per request for the state machine is 85.83 ms lower than for the Lambda-based integration.
API Gateway latencies and execution duration
Long story short, the results are comparable to the ones from the first experiment. But it is interesting to see that the gap between the state machine and the Lambda function is getting bigger. Some factors will influence this, like the chosen implementation and runtime of the AWS Lambda function.
💡 Please check out the awesome article by my fellow AWS Community Builder Alexandr Filichkin about a performance comparison of the different Lambda runtimes.
The AWS Lambda function is not able to get closer to the latency behavior of the state machine implementation.
The AWS Lambda function needs almost double the amount of time to write data to and read data from S3.
It is also interesting to see that the average latency between my AWS Lambda function and Amazon S3 seems to increase slightly compared to the first experiment. AWS Step Functions keeps on optimizing the connection to Amazon S3 🤩.
Conclusions
Based on the things I learned, what would I answer now if someone states
Let us use AWS Lambda because a workflow described as a state machine sounds like it is very much slower.
My general answer would be: measure first. My specific answer on the comparison of AWS Step Functions and an AWS Lambda function is that this is not true in all cases. Our little experiment revealed some interesting insights:
- AWS Step Functions scales and is much faster in our setup compared to my AWS Lambda function.
- In this experiment, the state machine shows a more efficient communication with S3 compared to my custom code implementation.
- When we compare the Step Functions implementation with AWS Lambda, it is obvious that we do not have to write custom code to achieve the same results.
- The new capabilities of the Step Functions Workflow Studio and the SDK service integrations lower the barrier to achieve the same result in this use case while reducing time-to-market.
But be cautious about generalizing the test results. There is a lot you can do to tune your AWS Lambda functions for performance efficiency. Your results might also differ in other use cases. These results should not discourage you from creating additional benchmarks for your specific use cases to measure what is important to you.
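One commonly cited example of such an optimization is reusing TCP connections to S3 instead of opening a new one for each request. The sketch below shows how that could look with the AWS SDK for JavaScript v2 used in this post: a keep-alive `https.Agent` is handed to the client via `httpOptions`. This was not part of the benchmark, so whether and how much it would close the gap is untested. The helper name `buildS3ClientOptions` is made up.

```javascript
const https = require("https");

// Create client options that let the SDK reuse TCP/TLS connections
// across requests instead of opening a new connection for every call.
function buildS3ClientOptions() {
  const agent = new https.Agent({ keepAlive: true });
  return { httpOptions: { agent } };
}

const s3Options = buildS3ClientOptions();

// Usage inside the Lambda function (requires the aws-sdk package):
//   const S3 = new AWS.S3(s3Options);
console.log("keep-alive enabled:", s3Options.httpOptions.agent.keepAlive);
```

Alternatively, SDK v2 honors the environment variable `AWS_NODEJS_CONNECTION_REUSE_ENABLED=1`, which enables connection reuse without code changes.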
Please also ask yourself whether you really have to optimize for performance, and consider whether your use case could also be implemented asynchronously.
About the author:
👋 Hi, my name is Christian. I am working as an AWS Solution Architect at DFL Digital Sports GmbH. I am based in Cologne with my beloved wife and two kids. I am interested in all things around ☁️ (cloud), 👨‍💻 (tech) and 🧠 (AI/ML).
With 10+ years of experience in several roles, I have a lot to talk about and love to share my experiences. I worked as a software developer in several companies in the media and entertainment business, as well as a solution engineer in a consulting company.
I love the challenge of providing highly scalable systems for millions of users. And I love to collaborate with lots of people to design systems in front of a whiteboard.
You can find me on LinkedIn or Twitter
Cover Image by Mateusz Wacławek on Unsplash
Top comments (2)
Hi Christian,
Really interesting article.
I have a question: can you tell me about the cost differences?
Thanks!
I am glad you asked. This is planned for an upcoming test. I will keep you updated here on dev.to.