1BRC in PHP FFI + Rust

#php #ffi #rust #performance

We have tried multi-threading in PHP to speed up execution time; the results are good, but far from perfect. Is there another way we can improve PHP's performance?

In the previous post, we gave an overview of 1BRC, tried to push the limits of PHP when discussing performance optimization, and ran our best PHP script on an EC2 instance.

The results were not bad, but not noteworthy either: 17.0636 seconds (the fastest Java code took 1.535 seconds).

So what are we supposed to do? Call it a day and get on with our lives? No, obviously not!
We could "cheat" our way to a better score by borrowing one of Python's winning strategies: letting external libraries do the heavy lifting!

Foreign Function Interface

One of the ways to optimize an interpreted language is to move the slow operations into an external module, usually written in a low-level language.
In PHP you can write system-wide modules and enable them in php.ini; this is useful for generic functions or for code that is not specific to one application.
Version 7.4 of PHP introduced a new feature: the Foreign Function Interface (FFI).
FFI is a way of calling external libraries from your PHP code without changing the global PHP configuration.
This method is more flexible than dealing with modules, but configuring it can be a bit daunting at first.

Let's try to wrap a Rust solution of 1BRC in a PHP script (yes, ok, we are definitely cheating).

The Rust solution

To keep things simple we need a Rust solution that:

1. is fast
2. is written in a clear way
3. consists of only a few files

There's no need to explain point 1; points 2 and 3 are needed because we are going to modify the code to make it work as a module.
I love Rust, but I'm not a Rust programmer, so the simpler the code the better.

I chose the solution written by Flavio Bizzarri: https://github.com/newfla/1brc_rust

Compiling Rust module

First of all, we clone the repository, then we edit the Cargo.toml file to add some options:

[profile.release]
lto = true
strip = true
panic = "abort"
debug = false
opt-level = 3
codegen-units = 1

[lib]
crate-type = ["cdylib", "lib"]
bench = false

In the [profile.release] section we tune the release build for speed (options such as lto, opt-level and codegen-units); in the new [lib] section we specify that we want to compile the source as a cdylib, a shared library that can be loaded by programs written in other languages.

The main.rs file is used only to call adv::process; we remove it and add a run() function to lib.rs:

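use libc::c_char;
use std::ffi::CStr;

#[no_mangle]
pub extern "C" fn run(filename: *const c_char) -> *mut c_char {
    // Wrap the raw C pointer; it must be non-null and NUL-terminated.
    let c_str = unsafe {
        assert!(!filename.is_null());
        CStr::from_ptr(filename)
    };
    // Validate UTF-8 and copy the path into an owned Rust String.
    let path = c_str.to_str().unwrap().to_string();

    // adv::process returns a pointer to a NUL-terminated JSON string.
    adv::process(path)
}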

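#[no_mangle] disables name mangling: the function keeps the name run in the compiled library, so it can be looked up from outside. Together with pub extern "C", which selects the C calling convention, this makes the function callable from other languages.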

We are cheating, but in a responsible way 😅: from PHP code, we pass the weather data filename to the Rust module. Then the Rust module returns the station's aggregated data to be displayed.
PHP is a loosely-typed language, while Rust is a strongly-typed language, so moving data between the two can be a bit of a challenge within the challenge. We need the libc crate and ffi::CStr from std.

The code needed to convert a PHP string into a Rust string slice is taken from "The Rust FFI Omnibus"; in its words:

Getting a Rust string slice (&str) requires a few steps:

We need to make sure that the C pointer is not NULL as Rust references are not allowed to be NULL.

Use std::ffi::CStr to wrap the pointer. CStr will compute the string's length based on the terminating NULL. This requires an unsafe block as we will be dereferencing a raw pointer, which the Rust compiler cannot verify meets all the safety guarantees so the programmer must do it instead.

Ensure the C string is valid UTF-8 and convert it to a Rust string slice.

Use the string slice.
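
The four steps quoted above can be sketched as a small helper. This is only an illustration built on the standard library: the name path_from_c is hypothetical and not part of the original repository. Instead of asserting and unwrapping, it returns None on a null pointer or invalid UTF-8, because with panic = "abort" in Cargo.toml a panic would terminate the whole PHP process.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Hypothetical helper: converts a C string received over FFI into an owned String.
/// The caller must pass either NULL or a pointer to a valid NUL-terminated string.
pub fn path_from_c(ptr: *const c_char) -> Option<String> {
    // Step 1: Rust references can never be NULL, so check the pointer first.
    if ptr.is_null() {
        return None;
    }
    // Step 2: wrap the pointer; CStr finds the length from the terminating NUL.
    let c_str = unsafe { CStr::from_ptr(ptr) };
    // Step 3: validate UTF-8. Step 4: use the slice (here, copy it into a String).
    c_str.to_str().ok().map(str::to_owned)
}
```

An exported function such as run() could call this helper and return a null pointer to PHP when it yields None, rather than aborting.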

In adv.rs we use this code:

let json_string = CString::new(serde_json::to_string(&cities).unwrap()).unwrap();

json_string.into_raw()

to return a JSON string to the PHP script.
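
One detail this snippet leaves open is ownership: CString::into_raw hands the allocation over to the caller, and the memory is only released if the pointer is later passed back to CString::from_raw. The sketch below shows one way to pair the two calls. The names json_to_raw and free_result are hypothetical and do not appear in the original repository.

```rust
use std::ffi::CString;
use std::os::raw::c_char;

/// Hypothetical stand-in for the last lines of adv::process:
/// turns a JSON string into a raw pointer owned by the caller.
pub fn json_to_raw(json: String) -> *mut c_char {
    match CString::new(json) {
        Ok(s) => s.into_raw(),
        // An interior NUL byte cannot be represented in a C string.
        Err(_) => std::ptr::null_mut(),
    }
}

/// Hypothetical counterpart that takes the pointer back and frees it.
/// To export it, add #[no_mangle] as done for run().
pub extern "C" fn free_result(ptr: *mut c_char) {
    if ptr.is_null() {
        return;
    }
    // Rebuild the CString so that Rust's allocator releases the memory.
    unsafe { drop(CString::from_raw(ptr)) };
}
```

If such a function were exported, the PHP side would presumably also have to declare it in FFI::cdef() and call it once FFI::string() has copied the result.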

That's it for Rust; we can compile the library with:

cargo build --release

The PHP script

On the PHP side, we first need a class to manage the input and output of the Rust module. Let's create a file called libonebrc.php:

<?php
final class LibOneBrc {
    private static $ffi = null;

    public function __construct() {
        if (is_null(self::$ffi)) {
            self::$ffi = FFI::cdef("char* run(const char* str);", "rust/libonebrc.so");
        }
    }

    public function run($filename) {
        $resultPtr = self::$ffi->run($filename);
        return FFI::string($resultPtr);
    }
}

The constructor uses FFI::cdef() to import the Rust function from the rust/libonebrc.so file.
Here we have to declare the extern function's signature in C, so the Rust *const c_char parameter becomes const char* and the *mut c_char return type becomes char*.

NOTE: it's also possible to use a .h header file to specify the function(s) that PHP needs to know about; since we only need one simple function, it is easier to declare it inline in PHP code.

The run() method invokes the run function of the Rust module (self::$ffi->run($filename)). We gave this wrapper method and the Rust function the same name (run()); this is only a coincidence (...or a lack of imagination) and is not mandatory.
FFI::string() copies the NUL-terminated C string behind the pointer into a regular PHP string.

We also need an index.php file to instantiate this LibOneBrc class and to print the results:

<?php

require_once "libonebrc.php";
$libOneBrc = new LibOneBrc();

$filename = "./rust/measurements.txt";

$result = json_decode($libOneBrc->run($filename));

echo "{" . PHP_EOL;
$isFirstRow = true;
foreach ($result as $key => $value) {
    if (!$isFirstRow) {
        echo "," . PHP_EOL;
    } else {
        $isFirstRow = false;
    }

    echo "\t" . $value->city . '=' . $value->max / 10 . '/' . $value->min / 10 . '/' . round($value->sum / $value->count / 10, 1);
}
echo PHP_EOL . "}" . PHP_EOL;

Nothing interesting here: we call our run() method, passing it the measurements filename.
The JSON string from Rust contains temperatures as integers, so we need to divide them by 10 and calculate the average temperature for each station.
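
For comparison, here is the same arithmetic as a minimal Rust sketch. It assumes, as the division by 10 suggests, that the values are integers in tenths of a degree; the function names are hypothetical.

```rust
/// Converts a value stored in tenths of a degree into degrees.
pub fn tenths_to_degrees(value: i64) -> f64 {
    value as f64 / 10.0
}

/// Mean temperature in degrees, rounded to one decimal place.
/// The mean is rounded while still in tenths, so a single rounding step is enough.
/// A count of zero yields NaN; callers are expected to pass at least one sample.
pub fn mean_degrees(sum: i64, count: u64) -> f64 {
    (sum as f64 / count as f64).round() / 10.0
}
```

For instance, two readings of 12.3 and 45.6 degrees arrive as sum = 579 and count = 2; the mean in tenths is 289.5, which rounds to 290, that is 29.0 degrees.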

The benchmark

Let's run this code on the EC2 instance. The configuration is the same as last time: an m6a.8xlarge with 32 vCPUs and 128GB of memory. For the hard disk, I opted for a 200GB io1 volume (to reach 10,000 IOPS).

We run it with:

perf stat -o 1B-ffi.log -r 10 -d php app/index.php

and these are the results:

 Performance counter stats for 'php app/index.php' (10 runs):

          58802.93 msec task-clock                       #   29.718 CPUs utilized            ( +-  0.26% )
              4736      context-switches                 #   80.191 /sec                     ( +-  3.80% )
                57      cpu-migrations                   #    0.965 /sec                     ( +- 13.37% )
             52703      page-faults                      #  892.378 /sec                     ( +-  1.33% )
   <not supported>      cycles                                                      
   <not supported>      instructions                                                
   <not supported>      branches                                                    
   <not supported>      branch-misses                                               
   <not supported>      L1-dcache-loads                                             
   <not supported>      L1-dcache-load-misses                                       
   <not supported>      LLC-loads                                                   
   <not supported>      LLC-load-misses                                             

            1.9787 +- 0.0197 seconds time elapsed  ( +-  1.00% )

1.9787 seconds! 🥳 🎉

This is a surprising result, considering the overhead of calling an external module and the fact that we are still doing some calculations on the PHP side of the app.

Conclusions

After this two-part journey we can say that:

PHP is slow, but the performance improves significantly when using threads
Performance tuning is a game of trade-offs: you can improve the speed of a task by saturating all the CPU cores, but your system will become unresponsive. In PHP this is a problem if your application needs to accept more than one connection at a time
For heavy tasks, you can delegate to optimized external libraries

The full code is available on GitHub:
https://github.com/gfabrizi/1brc-php-ffi

I hope you enjoyed the post!