DEV Community

Cover image for The Case of the Missing Metrics: A Rust Closure Mystery
RisingWave Labs
RisingWave Labs

Posted on • Originally published at risingwave.com

The Case of the Missing Metrics: A Rust Closure Mystery

The Problem - Inaccurate Metrics

In a streaming database like RisingWave, accurate metrics are not just nice to have, they are crucial for monitoring performance, identifying bottlenecks, and ensuring the system runs smoothly in real time. They're like the dashboard gauges in your car, giving you real-time feedback on what's happening under the hood. One key thing we monitor is the speed of data retrieval from our storage layer, which is critical for a real-time system.

Recently, we encountered a puzzling situation: one of our storage metrics was reporting zeros, even when we knew data was flowing through the system. This was like driving a car with a broken speedometer – we were moving, but we didn't know how fast. This sparked an investigation into the depths of our Rust code, leading to an interesting discovery about how Rust 2021 handles closures.

What is RAII and Why It Matters

In Rust, there's a concept called RAII, which stands for Resource Acquisition Is Initialization. Think of it like a cleanup crew that automatically tidies up after a party.

When a piece of data (a resource) is no longer needed, RAII ensures it's properly disposed of. This is important for preventing memory leaks and, in our case, ensuring metrics are reported.

Imagine you have a counter that needs to be incremented every time a certain event occurs. With RAII, you can set up the counter so that it automatically updates and in the end reports its value when it's done being used, without you having to manually write code to do so every time. This ensures that you never miss an update, even if your code gets complex.

However, even with this powerful tool, we were still seeing zero values in our metrics, which didn't make sense.

Our Setup: MonitoredStateStoreIterStats and try_stream

To ensure we capture every data point, we designed a special tool, MonitoredStateStoreIterStats. It tracks metrics like the number of data items processed. It's like a meticulous counter that keeps tabs on everything that happens in our data stream.

Every time a piece of data passes through our system, this counter diligently notes it down. Here's a simplified version of what this struct looks like:

struct MonitoredStateStoreIterStats {
    total_items: usize,
    total_size: usize,
    storage_metrics: Arc<MonitoredStorageMetrics>,
}
Enter fullscreen mode Exit fullscreen mode

We also use a handy Rust feature called try_stream to simplify the process of working with streams of data. It's like a conveyor belt that efficiently moves data along. Our MonitoredStateStoreIterStats sits beside this belt, counting each item that passes by.

Here's how we set up the stream:

pub struct MonitoredStateStoreIter<S> {
    inner: S,
    stats: MonitoredStateStoreIterStats,
}

impl<S: StateStoreIterItemStream> MonitoredStateStoreIter<S> {
    #[try_stream(ok = StateStoreIterItem, error = StorageError)]
    async fn into_stream_inner(mut self) {
        let inner = self.inner;
        futures::pin_mut!(inner);
        while let Some((key, value)) = inner.try_next().await? {
            self.stats.total_items += 1;
            self.stats.total_size += key.encoded_len() + value.len();
            yield (key, value);
        }
    }
}
Enter fullscreen mode Exit fullscreen mode

When the MonitoredStateStoreIterStats is done it automatically reports the collected metrics to our monitoring system. This is done in the Drop implementation:

impl Drop for MonitoredStateStoreIterStats {
    fn drop(&mut self) {
        self.storage_metrics.iter_item.observe(self.total_items as f64);
        self.storage_metrics.iter_size.observe(self.total_size as f64);
    }
}
Enter fullscreen mode Exit fullscreen mode

The Investigation: Digging into the Code

To figure out what was going wrong, we created a simplified version of our code. Think of it like a mini-experiment in a lab.

We used the Rust Playground to run this test. Here's the code we used to reproduce the issue:

struct Stat {
    count: usize,
    vec: Vec<u8>,
}

impl Drop for Stat {
    fn drop(&mut self) {
        println!("count: {}", self.count);
    }
}

fn main() {
    let mut stat = Stat {
        count: 0,
        vec: Vec::new(),
    };

    let mut f = move || {
        stat.count += 1;
        1
    };

    println!("num: {}", f());
}
Enter fullscreen mode Exit fullscreen mode

In this code, we have a Stat struct that's similar to our MonitoredStateStoreIterStats. It has a count field to track the number of times something happens and a vec field that we'll use later. The Drop implementation prints the count when the Stat is dropped.

The main function creates a Stat instance and a closure f. This closure is supposed to increment the count by 1 each time it's called.

However, when we run this code, we get the following output:

num: 1
count: 0
Enter fullscreen mode Exit fullscreen mode

num: 1 shows that the closure was called and returned 1. But count: 0 indicates that the count field of our Stat struct was not incremented as expected. This was surprising and mirrored the problem we were seeing in RisingWave: our metrics were not being updated.

The Culprit: Rust's Closure Behavior

We discovered that the problem was related to how Rust handles closures, which are like mini-functions within your code. Normally, when you use a variable inside a closure, the closure "captures" it. This means it takes ownership of the variable or borrows it, depending on how it's used.

Things get interesting with the move keyword. When you use move before a closure, it forces the closure to take ownership of all the variables it uses. You might think this would solve our problem, but it actually made it more confusing.

Analogy: The Selective Shopper

Imagine you have a shopping list (your data) and you send someone (the closure) to the store. In older versions of Rust, the shopper would take the entire list, even if they only needed one item.

But in Rust 2021, the shopper is more efficient. They only take the part of the list that's relevant to their task.

This is usually a good thing because it saves memory and makes things faster. But in our case, it meant our 'counter' wasn't getting the information it needed to update the metrics. It's like the shopper only taking the "apples" part of the list and ignoring the "count" part, which tells them how many apples to buy.

Understanding the Details

To understand why this was happening, let's look at a few scenarios.

Moving the Entire Struct

First, we tried modifying the closure to explicitly move the entire stat into it:

let mut f = move || {
    let mut stat = stat; // Explicitly move stat into the closure
    stat.count += 1;
    1
};
Enter fullscreen mode Exit fullscreen mode

With this change, the output becomes:

num: 1
count: 1
Enter fullscreen mode Exit fullscreen mode

Now the count is correctly incremented to 1. This shows that when we explicitly move stat into the closure, the closure takes full ownership of it. Any modifications inside the closure affect the original stat.

Using Other Fields

Next, we tried using another field vec in stat within the closure:

let mut f = move || {
    let _ = stat.vec.len(); // Use the vec field
    stat.count += 1;
    1
};
Enter fullscreen mode Exit fullscreen mode

Surprisingly, the output is still correct:

num: 1
count: 1
Enter fullscreen mode Exit fullscreen mode

This demonstrates that if we use a field like vec, which doesn't implement Copy, the closure is forced to take ownership of the entire stat. It won't just copy the count field.

The Problem with Copy Types

The issue arises when the closure only uses fields that implement Copy, like our count field, which is a usize. In this case, even with move, the closure doesn't take ownership of the whole struct. Instead, it just copies the Copy fields.

We hypothesized that even though stat implements Drop and thus cannot be partially moved, if only Copy-type fields in stat are used within the move closure, those fields are copied into the closure. Any modifications made to these fields in the closure affect only the copied values, leaving the original fields unchanged.

Confirming Our Theory

To confirm our theory, we looked under the hood at the code generated by the Rust compiler. We used a tool called MIR (Mid-level Intermediate Representation) to see what the compiler was doing behind the scenes.

For the problematic code, the MIR showed that the closure was only capturing the usize field (stat.count), not the entire stat struct:

      _3 = {closure@src/main.rs:19:17: 19:24} { stat: (_1.0: usize) };
Enter fullscreen mode Exit fullscreen mode

However, for the modified code where we explicitly moved stat, the MIR showed that the entire stat was being moved into the closure:

      _3 = {closure@src/main.rs:19:17: 19:24} { stat: move _1 };
Enter fullscreen mode Exit fullscreen mode

This confirmed our hypothesis. In the problematic code, the closure does not capture stat's ownership but instead copies the Copy fields.

The Solution

The fix was surprisingly simple. We just needed to explicitly tell the closure to take ownership of the entire stats object in our into_stream_inner function.

Here's the original problematic code:

#[try_stream(ok = StateStoreIterItem, error = StorageError)]
async fn into_stream_inner(mut self) {
    let inner = self.inner;
    ...
    self.stats.total_items += 1;
    self.stats.total_size += key.encoded_len() + value.len();
    ...
}
Enter fullscreen mode Exit fullscreen mode

And here's the corrected code:

#[try_stream(ok = StateStoreIterItem, error = StorageError)]
async fn into_stream_inner(self) {
    let inner = self.inner;
    let mut stats = self.stats; // Take ownership of self.stats
    ...
    stats.total_items += 1;
    stats.total_size += key.encoded_len() + value.len();
    ...
}
Enter fullscreen mode Exit fullscreen mode

By adding let mut stats = self.stats;, we force the closure to take ownership of the stats object. Now, when we increment stats.total_items and stats.total_size, we're modifying the actual stats object, not a copy.

The Bigger Picture: Implications for Rust Developers

Our investigation highlights a subtle aspect of Rust 2021's move closure behavior: they capture the minimal part of a struct necessary. While efficient, this can lead to unexpected results if you're not aware of the specifics, especially when dealing with Copy types within structs that implement Drop.

In our case, this behavior resulted in metrics loss. The try_stream macro's encapsulation, which uses closures internally, further complicated the issue. It created a situation where, despite our intention to move ownership of parameters into the stream, the actual ownership wasn't transferred as expected.

We've identified two key areas for improvement:

  1. Macro Encapsulation: The try_stream macro should be more explicit about how it handles ownership. This would prevent similar issues where developers might assume a move into a stream is happening when it's not.
  2. Rust's Closure Behavior: While Rust's behavior is technically correct, it can be confusing. We reported this issue to the Rust community to discuss potential improvements to clarity or documentation.

This discovery has broader implications for RisingWave's codebase. We need to carefully review other areas where we rely on similar patterns to prevent potential data loss or unexpected behavior.

Conclusion: Lessons Learned

This experience has reinforced the importance of understanding the nuances of Rust's ownership and closure behavior, particularly with the changes introduced in Rust 2021. It also underscored the need for careful design of macro APIs to avoid ambiguity.

For RisingWave, this was more than just a bug fix; it was a valuable learning experience that will lead to a more robust system. We've not only corrected our metrics collection but also gained a deeper understanding of Rust that will help us write more reliable code in the future.

By sharing our experience, we aim to contribute to the collective knowledge of the Rust community. We hope that others can avoid similar pitfalls when working with closures, move semantics, and macro encapsulation. If you're interested in learning more about RisingWave and how we're using Rust to build a next-generation streaming database, check out our GitHub repository or join our community!

Top comments (0)