Thoughts on Lambda Memory Allocation
How do you allocate memory for your Lambda functions?
Lambda charges only for the time your function actually runs. Increasing the memory also boosts CPU performance, but it raises the price per unit of time. Higher performance, however, shortens the execution time, so in some cases a larger memory setting ends up costing the same overall.
Many of you have probably adjusted the memory allocation while monitoring execution times with these trade-offs in mind.
For example, if you adjust the memory and get the following execution times:
Memory | Execution Time |
---|---|
128MB | 400ms |
256MB | 200ms |
512MB | 100ms |
1024MB | 50ms |
2048MB | 50ms |
Since billing was done in 100ms increments before the recent update, the cost from 128MB up to 512MB works out the same. Additionally, the performance hits a ceiling at 1024MB.
This kind of relationship between memory and execution time is common, so many of you may have chosen one of these two options:
Choosing 512MB considering the cost and performance since the billing unit is 100ms
Choosing 1024MB for better performance, knowing that the speed hits its limit at this point
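To make the cost math above concrete, here is a quick sketch of the GB-second arithmetic behind that table under the old 100ms rounding (relative cost only; the actual per-GB-second price is omitted):

```python
import math

# Under the old model, duration was rounded up to the next 100 ms increment
def billed_gb_seconds(memory_mb, duration_ms, increment_ms=100):
    billed_ms = math.ceil(duration_ms / increment_ms) * increment_ms
    return (memory_mb / 1024) * (billed_ms / 1000)

for mb, ms in [(128, 400), (256, 200), (512, 100), (1024, 50), (2048, 50)]:
    print(f"{mb}MB: {billed_gb_seconds(mb, ms):.2f} GB-s")
# 128MB-512MB all bill 0.05 GB-s; 1024MB doubles to 0.10 and 2048MB to 0.20,
# because anything under 100 ms was still billed as a full 100 ms
```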
The Relationship Between Lambda Memory Size, Billing Units, and Performance
AWS Lambda is now billed in 1 millisecond increments, and memory can be configured up to 10GB (10,240MB), which comes with up to 6 vCPUs.
As a result:
Allocating more memory (and thus more CPU) to push response times under 100ms no longer increases the total cost, since billing follows the actual milliseconds instead of rounding up to 100ms (see the quick check after this list).
Once CPU performance reaches a certain level, additional vCPUs are assigned.
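To see the first point concretely, here is the same comparison under 1ms billing (again just relative GB-seconds, with pricing constants omitted):

```python
# With 1 ms billing, GB-seconds track the actual duration
for mb, ms in [(512, 100), (1024, 50)]:
    print(f"{mb}MB x {ms}ms = {(mb / 1024) * (ms / 1000):.3f} GB-s")
# Both cost 0.050 GB-s: doubling memory to halve the duration
# no longer costs more, even below 100 ms
```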
This changes how we should allocate memory when balancing performance tuning and cost. Specifically, tuning execution times below 100ms now has cost benefits, and since Lambda now supports multiple cores, we also need to think about leveraging multi-core processing.
In this article, I will examine these aspects.
Relationship Between Lambda Memory Allocation and vCPUs
In the update, it was mentioned that up to 6 vCPUs are available, but how much memory corresponds to how many vCPUs? I checked the official documentation, and the following was all I could find:
[Official] Configuring Lambda Function Memory
> At 1,769MB, the equivalent of 1 vCPU is allocated.
Since the official documentation didn't give much detail, I decided to use Python's `multiprocessing.cpu_count()` to output the number of vCPUs while changing the Lambda memory allocation.
This isn't official information, so the results may change if the specifications are updated, but I hope it can serve as a reference. (The results are based on testing in the Oregon region as of December 12, 2020.)
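The handler I used was essentially just this (a minimal sketch):

```python
import multiprocessing

def lambda_handler(event, context):
    # Report how many vCPUs the runtime can see at this memory setting
    return {"vcpus": multiprocessing.cpu_count()}
```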
Memory Allocation | vCPU |
---|---|
128MB–1769MB | 1 vCPU* |
1770MB–3008MB | 2 vCPUs |
3009MB–5307MB | 3 vCPUs |
5308MB–7076MB | 4 vCPUs |
7077MB–8845MB | 5 vCPUs |
8846MB–10240MB | 6 vCPUs |
*Even though `cpu_count()` reports 2 vCPUs, based on the documentation and the performance test results, it seems that only 1 vCPU is actually utilized internally.
Performance Test 1 (Single Task, Multi-Thread, Multi-Process)
From 3,009MB onward, CPU performance increases by adding more vCPUs, so for single-task workloads performance likely hits a ceiling around that point. To improve performance further, the process needs to be restructured to utilize multiple cores effectively.
With that in mind, I conducted some tests.
Test Overview: Calculating the 30th Fibonacci number four times.
I ran the test under the following conditions (the actual code is provided at the end of this article):
- Single task
- Multi-thread processing
- Multi-process processing
Performance Test 1 Results
Memory Size | vCPUs | Single Task (ms) | Multi-Thread (ms) | Multi-Process (ms) |
---|---|---|---|---|
128MB | 1* | 22,357.25 | 24,526.17 | 22,601.75 |
256MB | 1* | 11,103.38 | 12,096.86 | 11,613.07 |
512MB | 1* | 5,554.15 | 5,783.73 | 5,675.32 |
1024MB | 1* | 2,737.14 | 2,913.90 | 2,792.90 |
1536MB | 1* | 1,859.68 | 1,909.14 | 1,880.79 |
1769MB | 1* | 1,576.30 | 1,691.91 | 1,597.14 |
2048MB | 2 | 1,574.19 | 1,626.24 | 1,370.34 |
3008MB | 2 | 1,590.36 | 1,643.64 | 950.26 |
3009MB | 3 | 1,621.39 | 1,639.41 | 940.40 |
4096MB | 3 | 1,574.45 | 1,590.55 | 722.13 |
5120MB | 3 | 1,578.06 | 1,633.17 | 637.16 |
6144MB | 4 | 1,547.85 | 1,656.60 | 484.49 |
7076MB | 4 | 1,578.67 | 1,653.11 | 403.14 |
7168MB | 5 | 1,606.02 | 1,627.12 | 402.30 |
8192MB | 5 | 1,602.95 | 1,654.36 | 402.57 |
9216MB | 6 | 1,577.55 | 1,633.96 | 420.52 |
10240MB | 6 | 1,591.83 | 1,640.31 | 407.27 |
Note where the times flatten out in each column. As expected, single-task and multi-thread execution only utilize a single core internally, so their performance stops improving once more than one vCPU is available; for CPU-bound Python code, the GIL prevents threads from running in parallel, which is why multi-threading behaves the same as a single task here.
While `cpu_count()` reports 2 vCPUs from 128MB to 3008MB, the results suggest that single-threaded performance actually tops out at 1,769MB. This matches the official documentation's statement that 1,769MB corresponds to 1 vCPU. It therefore seems that 1,769MB and below is effectively 1 vCPU, while anything above that is effectively 2 vCPUs.
Multi-process execution, on the other hand, keeps improving as vCPUs are added, but plateaus once there is one vCPU per process; with four processes, nothing beyond 4 vCPUs helps.
Performance Test 2 (Varying Number of Processes)
In Test 1, we compared single-task, multi-thread, and multi-process operations. Now, let’s test how performance changes with different numbers of processes in multi-process operations.
Test Overview: Calculate the 30th Fibonacci number in each process.
Ideally the workload would have been balanced so the total computation stayed constant, but the test was conducted as described above, which means the total amount of work grows with the number of processes. A sketch of the N-process variant follows.
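The code at the end of this article shows the 4-process version; the other process counts were straightforward generalizations of it. A sketch along these lines (not the verbatim test code):

```python
import multiprocessing

# Calculation process (naive recursive Fibonacci)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 2) + fibonacci(n - 1)

def run_processes(num_processes, fibonacci_num=30):
    # One worker process per slot, each computing fib(30) independently,
    # so the total amount of work grows with num_processes
    processes = [
        multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
        for _ in range(num_processes)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```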
Performance Test 2 Results
Memory Size | vCPUs | 3 Processes (ms) | 4 Processes (ms) | 6 Processes (ms) | 8 Processes (ms) | 12 Processes (ms) |
---|---|---|---|---|---|---|
128MB | 1* | 17,170.78 | 22,601.75 | 34,307.67 | 45,027.37 | 67,933.81 |
256MB | 1* | 8,469.28 | 11,613.07 | 17,009.97 | 22,894.79 | 34,513.40 |
512MB | 1* | 4,237.92 | 5,675.32 | 8,498.68 | 11,360.69 | 17,194.66 |
1024MB | 1* | 2,138.52 | 2,792.90 | 4,218.83 | 5,620.41 | 8,468.93 |
2048MB | 2 | 1,088.32 | 1,370.34 | 2,037.55 | 2,817.35 | 4,222.83 |
4096MB | 3 | 964.51 | 722.13 | 1,064.67 | 1,423.73 | 2,099.09 |
5120MB | 3 | 440.11 | 637.16 | 853.15 | 1,132.36 | 1,685.33 |
5307MB | 3 | 412.64 | 607.87 | - | - | - |
6144MB | 4 | 401.42 | 484.49 | 707.66 | 954.88 | 1,402.62 |
7076MB | 4 | - | 403.14 | - | - | - |
7168MB | 5 | 411.62 | 402.30 | 714.30 | 846.54 | 1,220.98 |
8192MB | 5 | 398.72 | 402.57 | 649.03 | 767.90 | 1,089.49 |
9216MB | 6 | 402.85 | 420.52 | 470.74 | 673.93 | 947.82 |
10240MB | 6 | 400.56 | 407.27 | 424.13 | 642.19 | 870.46 |
From these results, we can see that increasing the number of processes beyond the number of vCPUs does not lead to further performance improvements. Since Lambda’s current limit is 6 vCPUs, there isn’t much benefit in parallelizing beyond that.
Conclusion
Some of you may previously have given up on tuning execution times below 100ms because, with 100ms billing increments, there was no cost benefit. With this update, why not push the performance limits further?
For example, in a project I worked on, we had a Lambda function that read Excel files (with multiple sheets) uploaded to S3 and wrote their contents to DynamoDB. Its performance was difficult to improve, but splitting the work by sheet and handling each sheet in its own process might speed things up.
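As a rough illustration of that idea, here is a hypothetical sketch of fanning out one process per sheet. It assumes openpyxl (which would need to be bundled with the function) and boto3; the table name, file path, and parse_row() mapping are placeholders, not the real project code:

```python
import multiprocessing

import boto3
import openpyxl

def parse_row(row):
    # Placeholder mapping; the real item shape depends on the sheet layout
    return {"id": str(row[0]), "payload": str(row[1:])}

def import_sheet(path, sheet_name):
    # Each worker process writes one sheet's rows to DynamoDB
    table = boto3.resource("dynamodb").Table("my-table")  # placeholder name
    sheet = openpyxl.load_workbook(path, read_only=True)[sheet_name]
    with table.batch_writer() as batch:
        for row in sheet.iter_rows(min_row=2, values_only=True):
            batch.put_item(Item=parse_row(row))

def lambda_handler(event, context):
    path = "/tmp/uploaded.xlsx"  # assume the S3 object was already downloaded
    sheet_names = openpyxl.load_workbook(path, read_only=True).sheetnames
    processes = [
        multiprocessing.Process(target=import_sheet, args=(path, name))
        for name in sheet_names
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return 0
```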
Python Code Used for Testing
Single Task Code
```python
def lambda_handler(event, context):
    fibonacci_num = 30
    s0 = fibonacci(fibonacci_num)
    s1 = fibonacci(fibonacci_num)
    s2 = fibonacci(fibonacci_num)
    s3 = fibonacci(fibonacci_num)
    return 0

# Calculation process (Fibonacci)
def fibonacci(n):
    if n < 2:
        return n
    else:
        return fibonacci(n - 2) + fibonacci(n - 1)
```
Multi-Thread Code
```python
import threading

def lambda_handler(event, context):
    fibonacci_num = 30
    # Create threads
    th0 = threading.Thread(target=fibonacci, args=(fibonacci_num,))
    th1 = threading.Thread(target=fibonacci, args=(fibonacci_num,))
    th2 = threading.Thread(target=fibonacci, args=(fibonacci_num,))
    th3 = threading.Thread(target=fibonacci, args=(fibonacci_num,))
    # Start threads
    th0.start()
    th1.start()
    th2.start()
    th3.start()
    # Wait for all threads to finish
    th0.join()
    th1.join()
    th2.join()
    th3.join()
    return 0

# Calculation process (Fibonacci)
def fibonacci(n):
    if n < 2:
        return n
    else:
        return fibonacci(n - 2) + fibonacci(n - 1)
```
Multi-Process Code
```python
import multiprocessing

def lambda_handler(event, context):
    fibonacci_num = 30
    # Create processes
    p0 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
    p1 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
    p2 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
    p3 = multiprocessing.Process(target=fibonacci, args=(fibonacci_num,))
    # Start processes
    p0.start()
    p1.start()
    p2.start()
    p3.start()
    # Wait for all processes to terminate
    p0.join()
    p1.join()
    p2.join()
    p3.join()
    return 0

# Calculation process (Fibonacci)
def fibonacci(n):
    if n < 2:
        return n
    else:
        return fibonacci(n - 2) + fibonacci(n - 1)
```