The Problem - Inaccurate Metrics
In a streaming database like RisingWave, accurate metrics are not just nice to have, they are crucial for monitoring performance, identifying bottlenecks, and ensuring the system runs smoothly in real time. They're like the dashboard gauges in your car, giving you real-time feedback on what's happening under the hood. One key thing we monitor is the speed of data retrieval from our storage layer, which is critical for a real-time system.
Recently, we encountered a puzzling situation: one of our storage metrics was reporting zeros, even when we knew data was flowing through the system. This was like driving a car with a broken speedometer – we were moving, but we didn't know how fast. This sparked an investigation into the depths of our Rust code, leading to an interesting discovery about how Rust 2021 handles closures.
What is RAII and Why It Matters
In Rust, there's a concept called RAII, which stands for Resource Acquisition Is Initialization. Think of it like a cleanup crew that automatically tidies up after a party.
When a piece of data (a resource) is no longer needed, RAII ensures it's properly disposed of. This is important for preventing memory leaks and, in our case, ensuring metrics are reported.
Imagine you have a counter that needs to be incremented every time a certain event occurs. With RAII, you can set up the counter so that it automatically updates and in the end reports its value when it's done being used, without you having to manually write code to do so every time. This ensures that you never miss an update, even if your code gets complex.
However, even with this powerful tool, we were still seeing zero values in our metrics, which didn't make sense.
Our Setup: MonitoredStateStoreIterStats
and try_stream
To ensure we capture every data point, we designed a special tool, MonitoredStateStoreIterStats
. It tracks metrics like the number of data items processed. It's like a meticulous counter that keeps tabs on everything that happens in our data stream.
Every time a piece of data passes through our system, this counter diligently notes it down. Here's a simplified version of what this struct looks like:
struct MonitoredStateStoreIterStats {
total_items: usize,
total_size: usize,
storage_metrics: Arc<MonitoredStorageMetrics>,
}
We also use a handy Rust feature called try_stream
to simplify the process of working with streams of data. It's like a conveyor belt that efficiently moves data along. Our MonitoredStateStoreIterStats
sits beside this belt, counting each item that passes by.
Here's how we set up the stream:
pub struct MonitoredStateStoreIter<S> {
inner: S,
stats: MonitoredStateStoreIterStats,
}
impl<S: StateStoreIterItemStream> MonitoredStateStoreIter<S> {
#[try_stream(ok = StateStoreIterItem, error = StorageError)]
async fn into_stream_inner(mut self) {
let inner = self.inner;
futures::pin_mut!(inner);
while let Some((key, value)) = inner.try_next().await? {
self.stats.total_items += 1;
self.stats.total_size += key.encoded_len() + value.len();
yield (key, value);
}
}
}
When the MonitoredStateStoreIterStats
is done it automatically reports the collected metrics to our monitoring system. This is done in the Drop
implementation:
impl Drop for MonitoredStateStoreIterStats {
fn drop(&mut self) {
self.storage_metrics.iter_item.observe(self.total_items as f64);
self.storage_metrics.iter_size.observe(self.total_size as f64);
}
}
The Investigation: Digging into the Code
To figure out what was going wrong, we created a simplified version of our code. Think of it like a mini-experiment in a lab.
We used the Rust Playground to run this test. Here's the code we used to reproduce the issue:
struct Stat {
count: usize,
vec: Vec<u8>,
}
impl Drop for Stat {
fn drop(&mut self) {
println!("count: {}", self.count);
}
}
fn main() {
let mut stat = Stat {
count: 0,
vec: Vec::new(),
};
let mut f = move || {
stat.count += 1;
1
};
println!("num: {}", f());
}
In this code, we have a Stat struct that's similar to our MonitoredStateStoreIterStats
. It has a count field to track the number of times something happens and a vec field that we'll use later. The Drop
implementation prints the count when the Stat is dropped.
The main function creates a Stat
instance and a closure f
. This closure is supposed to increment the count
by 1 each time it's called.
However, when we run this code, we get the following output:
num: 1
count: 0
num: 1
shows that the closure was called and returned 1. But count: 0
indicates that the count
field of our Stat
struct was not incremented as expected. This was surprising and mirrored the problem we were seeing in RisingWave: our metrics were not being updated.
The Culprit: Rust's Closure Behavior
We discovered that the problem was related to how Rust handles closures, which are like mini-functions within your code. Normally, when you use a variable inside a closure, the closure "captures" it. This means it takes ownership of the variable or borrows it, depending on how it's used.
Things get interesting with the move
keyword. When you use move
before a closure, it forces the closure to take ownership of all the variables it uses. You might think this would solve our problem, but it actually made it more confusing.
Analogy: The Selective Shopper
Imagine you have a shopping list (your data) and you send someone (the closure) to the store. In older versions of Rust, the shopper would take the entire list, even if they only needed one item.
But in Rust 2021, the shopper is more efficient. They only take the part of the list that's relevant to their task.
This is usually a good thing because it saves memory and makes things faster. But in our case, it meant our 'counter' wasn't getting the information it needed to update the metrics. It's like the shopper only taking the "apples" part of the list and ignoring the "count" part, which tells them how many apples to buy.
Understanding the Details
To understand why this was happening, let's look at a few scenarios.
Moving the Entire Struct
First, we tried modifying the closure to explicitly move the entire stat
into it:
let mut f = move || {
let mut stat = stat; // Explicitly move stat into the closure
stat.count += 1;
1
};
With this change, the output becomes:
num: 1
count: 1
Now the count
is correctly incremented to 1. This shows that when we explicitly move stat
into the closure, the closure takes full ownership of it. Any modifications inside the closure affect the original stat
.
Using Other Fields
Next, we tried using another field vec in stat within the closure:
let mut f = move || {
let _ = stat.vec.len(); // Use the vec field
stat.count += 1;
1
};
Surprisingly, the output is still correct:
num: 1
count: 1
This demonstrates that if we use a field like vec
, which doesn't implement Copy
, the closure is forced to take ownership of the entire stat
. It won't just copy the count
field.
The Problem with Copy
Types
The issue arises when the closure only uses fields that implement Copy
, like our count
field, which is a usize
. In this case, even with move
, the closure doesn't take ownership of the whole struct. Instead, it just copies the Copy fields.
We hypothesized that even though stat implements Drop
and thus cannot be partially moved, if only Copy-type fields in stat
are used within the move
closure, those fields are copied into the closure. Any modifications made to these fields in the closure affect only the copied values, leaving the original fields unchanged.
Confirming Our Theory
To confirm our theory, we looked under the hood at the code generated by the Rust compiler. We used a tool called MIR (Mid-level Intermediate Representation) to see what the compiler was doing behind the scenes.
For the problematic code, the MIR showed that the closure was only capturing the usize
field (stat.count
), not the entire stat
struct:
_3 = {closure@src/main.rs:19:17: 19:24} { stat: (_1.0: usize) };
However, for the modified code where we explicitly moved stat, the MIR showed that the entire stat was being moved into the closure:
_3 = {closure@src/main.rs:19:17: 19:24} { stat: move _1 };
This confirmed our hypothesis. In the problematic code, the closure does not capture stat
's ownership but instead copies the Copy
fields.
The Solution
The fix was surprisingly simple. We just needed to explicitly tell the closure to take ownership of the entire stats
object in our into_stream_inner
function.
Here's the original problematic code:
#[try_stream(ok = StateStoreIterItem, error = StorageError)]
async fn into_stream_inner(mut self) {
let inner = self.inner;
...
self.stats.total_items += 1;
self.stats.total_size += key.encoded_len() + value.len();
...
}
And here's the corrected code:
#[try_stream(ok = StateStoreIterItem, error = StorageError)]
async fn into_stream_inner(self) {
let inner = self.inner;
let mut stats = self.stats; // Take ownership of self.stats
...
stats.total_items += 1;
stats.total_size += key.encoded_len() + value.len();
...
}
By adding let mut stats = self.stats;
, we force the closure to take ownership of the stats
object. Now, when we increment stats.total_items
and stats.total_size
, we're modifying the actual stats
object, not a copy.
The Bigger Picture: Implications for Rust Developers
Our investigation highlights a subtle aspect of Rust 2021's move
closure behavior: they capture the minimal part of a struct necessary. While efficient, this can lead to unexpected results if you're not aware of the specifics, especially when dealing with Copy
types within structs that implement Drop
.
In our case, this behavior resulted in metrics loss. The try_stream
macro's encapsulation, which uses closures internally, further complicated the issue. It created a situation where, despite our intention to move ownership of parameters into the stream, the actual ownership wasn't transferred as expected.
We've identified two key areas for improvement:
-
Macro Encapsulation: The
try_stream
macro should be more explicit about how it handles ownership. This would prevent similar issues where developers might assume amove
into a stream is happening when it's not. - Rust's Closure Behavior: While Rust's behavior is technically correct, it can be confusing. We reported this issue to the Rust community to discuss potential improvements to clarity or documentation.
This discovery has broader implications for RisingWave's codebase. We need to carefully review other areas where we rely on similar patterns to prevent potential data loss or unexpected behavior.
Conclusion: Lessons Learned
This experience has reinforced the importance of understanding the nuances of Rust's ownership and closure behavior, particularly with the changes introduced in Rust 2021. It also underscored the need for careful design of macro APIs to avoid ambiguity.
For RisingWave, this was more than just a bug fix; it was a valuable learning experience that will lead to a more robust system. We've not only corrected our metrics collection but also gained a deeper understanding of Rust that will help us write more reliable code in the future.
By sharing our experience, we aim to contribute to the collective knowledge of the Rust community. We hope that others can avoid similar pitfalls when working with closures, move semantics, and macro encapsulation. If you're interested in learning more about RisingWave and how we're using Rust to build a next-generation streaming database, check out our GitHub repository or join our community!
Top comments (0)