Rust, with its strong ownership and borrowing system, is well known for preventing many common programming errors and for making memory leaks rare. However, even Rust isn't immune to leaks under specific circumstances. This blog post serves as a reminder to my past self, who had to identify and resolve a memory leak in a Rust application, and as a cautionary tale for my future self about the importance of proactive profiling.
The Unlikely Culprit: A Rust Memory Leak
Imagine a Rust application deployed to Google Cloud Run. It has been running smoothly for weeks. Over time, however, its memory usage gradually increases until the service crashes for lack of memory. In this chart we can see how each day the memory would hit its peak and then reset due to a crash:
While Rust's ownership system prevents many common memory errors, certain scenarios can still lead to leaks:
- Reference Cycles: Circular references between objects can create a situation where objects hold onto each other, preventing them from being deallocated. This is similar to how memory leaks occur in reference-counted runtimes that have no cycle collector.
- Unintentional Rc or Arc Cycles: Using Rc (reference counting) or Arc (atomic reference counting) can introduce cycles if not managed carefully. If objects have strong references to each other through these types, they can keep each other alive indefinitely.
- Global Variables with Interior Mutability: Global variables with interior mutability (a static Mutex, a thread-local RefCell, etc.) live for the whole lifetime of the program, so anything stored in them stays in memory until it is explicitly removed. A global collection that only ever grows behaves like a leak.
- Forgotten Drop Implementations: If a type owns resources that need explicit deallocation (e.g., raw file descriptors, handles obtained through FFI), forgetting to implement the Drop trait can lead to resource leaks, which can manifest as memory leaks.
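The reference-cycle scenarios above can be sketched with the standard library alone. This is a minimal illustration, not code from the application described in this post: `StrongNode`, `WeakNode`, and the two helper functions are hypothetical names. It builds two nodes that point at each other, drops both handles, and checks whether the first node is still alive.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

/// Hypothetical node that points at its peer with a strong reference.
struct StrongNode {
    peer: RefCell<Option<Rc<StrongNode>>>,
}

/// Same shape, but the peer pointer is weak, so it does not keep the peer alive.
struct WeakNode {
    peer: RefCell<Option<Weak<WeakNode>>>,
}

/// Builds a two-node strong cycle, drops both handles, and reports whether
/// the first node is still alive afterwards (true means it leaked).
fn strong_cycle_leaks() -> bool {
    let a = Rc::new(StrongNode { peer: RefCell::new(None) });
    let b = Rc::new(StrongNode { peer: RefCell::new(None) });
    *a.peer.borrow_mut() = Some(Rc::clone(&b));
    *b.peer.borrow_mut() = Some(Rc::clone(&a));

    // A weak probe observes `a` without contributing to its strong count.
    let probe = Rc::downgrade(&a);
    drop(a);
    drop(b);
    probe.strong_count() > 0
}

/// Same experiment with weak peer pointers: nothing keeps the nodes alive.
fn weak_cycle_leaks() -> bool {
    let a = Rc::new(WeakNode { peer: RefCell::new(None) });
    let b = Rc::new(WeakNode { peer: RefCell::new(None) });
    *a.peer.borrow_mut() = Some(Rc::downgrade(&b));
    *b.peer.borrow_mut() = Some(Rc::downgrade(&a));

    let probe = Rc::downgrade(&a);
    drop(a);
    drop(b);
    probe.strong_count() > 0
}

fn main() {
    println!("strong cycle leaked: {}", strong_cycle_leaks());
    println!("weak cycle leaked: {}", weak_cycle_leaks());
}
```

With strong pointers each node keeps the other's count above zero, so neither destructor ever runs. Replacing one direction of the cycle with `Weak` is the usual way out.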
The Challenge of Troubleshooting Memory Leaks
Pinpointing the root cause of a memory leak can be challenging, even for experienced developers. Many programmers avoid diving deep into memory profiling because narrowing down the problem takes so long. It's a process of elimination, akin to diagnosing a rare medical condition. You formulate hypotheses, test them, and discard them one by one until the culprit has nowhere left to hide.
In my case, given the critical nature of our service, we needed to act quickly. Within minutes of identifying the memory leak, we implemented a temporary workaround. A GitHub Workflow was set up to automatically restart our Cloud Run service every two hours.
Basically, we forced a redeploy pointing to the latest image using GitHub Actions' cron functionality. Here is a sample:
name: Redeploy every 2 hours
on:
  schedule:
    - cron: '0 */2 * * *' # Runs every 2 hours
env:
  ...
jobs:
  init:
    ...
  tenant-deploys:
    needs: [ init ]
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [ tentant-1, tentant-2, tentant-3 ]
    steps:
      ...
      - name: Deploy on Cloud Run
        uses: google-github-actions/deploy-cloudrun@v1
        with:
          service: ${{ matrix.service }}
          image: ${{needs.init.outputs.image_name}}:latest
          region: ${{ env.REGION }}
          gcloud_component: beta
          env_vars: |
            ENV=${{ needs.init.outputs.env }}
            COMMIT_ID=${{ env.COMMIT_ID }}
            RUST_BACKTRACE=full
That was enough to prevent any downtime. Now, back to the drawing board:
Inspiration from the Rust Community
I came across this great reference from the community: The Rust Performance Book. I started testing the options from its list of profilers until I got to Instruments:
Then that led me to these 2 videos:
I had used different memory profiling tools for other languages in the past. Given those recommendations, I decided to explore Instruments' capabilities for profiling my Rust application.
The Unexpected Source of the Leak
After looking at the Instruments profiling report:
I was able to narrow it down to the allocation of a few PyO3 objects. The leak was triggered by a complex interaction between Rust and Python that was specific to our application code: the Rust code calling Python was holding onto PyO3 objects that were needed during execution but were never released afterwards. Circling back to the beginning of the post, it was a bit like the "Forgotten drop Implementations" scenario.
A quick tip: if you build your Rust binary and try to use it in Instruments, you may get this error:
You need to build the binary with debugging symbols:
[profile.release]
debug = true
and sign the binary as described here:
https://forums.developer.apple.com/forums/thread/681687?answerId=734339022#734339022
The Fix
Again, this is a bit specific to our custom implementation, since we were loading some custom objects into memory. We added a cleanup method that we call after running the Python code. A simple one-liner did it:
py03_module.call_method0("cleanup")
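One way to make sure a cleanup call like this is never skipped, for example on an early return, is to tie it to `Drop`. The sketch below uses only the standard library and is not the PyO3 API or our actual code: `Runtime`, `load_objects`, `cleanup`, `CleanupGuard`, and the two `run_job_*` functions are hypothetical stand-ins for the Python-side state and the call shown above.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for an embedded runtime holding objects that
/// Rust's ownership system does not free on its own.
struct Runtime {
    loaded: RefCell<Vec<Vec<u8>>>,
}

impl Runtime {
    fn new() -> Self {
        Runtime { loaded: RefCell::new(Vec::new()) }
    }

    /// Simulates loading `count` custom objects into memory.
    fn load_objects(&self, count: usize) {
        let mut loaded = self.loaded.borrow_mut();
        for _ in 0..count {
            loaded.push(vec![0u8; 1024]);
        }
    }

    /// Simulates the explicit cleanup method.
    fn cleanup(&self) {
        self.loaded.borrow_mut().clear();
    }

    fn loaded_count(&self) -> usize {
        self.loaded.borrow().len()
    }
}

/// Calls `cleanup` when it goes out of scope, on every exit path.
struct CleanupGuard<'a> {
    runtime: &'a Runtime,
}

impl Drop for CleanupGuard<'_> {
    fn drop(&mut self) {
        self.runtime.cleanup();
    }
}

/// Cleans up only on the happy path; the early return skips it.
fn run_job_leaky(runtime: &Runtime, fail: bool) -> Result<(), String> {
    runtime.load_objects(3);
    if fail {
        return Err("job failed".to_string());
    }
    runtime.cleanup();
    Ok(())
}

/// The guard runs `cleanup` whether the job succeeds or fails.
fn run_job_guarded(runtime: &Runtime, fail: bool) -> Result<(), String> {
    let _guard = CleanupGuard { runtime };
    runtime.load_objects(3);
    if fail {
        return Err("job failed".to_string());
    }
    Ok(())
}

fn main() {
    let runtime = Runtime::new();
    let _ = run_job_leaky(&runtime, true);
    println!("objects left after leaky job: {}", runtime.loaded_count());

    let runtime = Runtime::new();
    let _ = run_job_guarded(&runtime, true);
    println!("objects left after guarded job: {}", runtime.loaded_count());
}
```

Whether a guard is worth it depends on how many exit paths the calling code has. For a single call site, the explicit one-liner is just as good.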
After rerunning Instruments with the fix in place, the memory stayed well behaved:
Leveraging Instruments
Instruments proved to be an invaluable tool in identifying the memory leak, and I'd certainly recommend it and use it again! By analyzing the memory allocation patterns, I was able to pinpoint the exact line of Rust code responsible for the issue. Once the culprit was identified, fixing the memory leak was relatively straightforward.
Key Takeaways
- Pragmatism over perfection: Sometimes, a temporary workaround is the most practical approach. In our case, implementing a quick fix freed us to thoroughly investigate the memory leak without impacting users. This allowed us to dedicate the necessary time to find a permanent solution.
- Tool up: Familiarize yourself with a great memory profiler. When you encounter a memory leak, having the right tools can significantly speed up the debugging process.
- Embrace the challenge: While frustrating at times, hunting down memory leaks can teach you a lot about how the language works. The satisfaction of identifying and resolving the issue is a reward in itself.
By sharing this experience, I hope to encourage other Rust developers to embrace profiling as a best practice and to be on the watch for unexpected memory leaks. Happy profiling!