You may have heard of "The One Billion Row Challenge" (1brc), and in case you haven't, go check out Gunnar Morling's 1brc repo.
I got sucked in ...
I love the strategy: write the simple code first, refine performance, and then introduce multi-threading. This also emphasizes the point that it's sometimes necessary to know your compiler and the inner workings underneath a framework or library to determine how to gain that performance boost.
A C# example of this would be knowing that if a request is IO bound, and if the function you called uses the CreateIoCompletionPort Win32 function, then it will use an I/O completion port: the thread is released while the work is in flight, and the completion port signals when it has finished, as opposed to tying up the thread until the work is done.
There's a lot more going on under the hood to make that work, but even with just a little knowledge of the inner workings, we can save resources and improve performance.
I tried a standard PHP script execution on a WAMP environment with 8 GB RAM,
and it safely processed approximately 10 million records.
First, it's really beautiful how you improved the code step by step and eventually got a 60x boost: from half an hour to half a minute. WOW!!
Second, you note that the final best result you got is very different from what is seen in the leaderboard because of very different hardware. It would be interesting to run this test on similar hardware. That way we would get an indication of whether, and by how much, PHP is slower than Java.
I got you. I have the exact same AX161 from Hetzner. The final version of his code executes in 15.4 seconds (with JIT & parallel).
May I ask you if you could re-run the script with the changes that @xjakub suggested? I created a file with those changes at github.com/realFlowControl/1brc
I am very curious about the time you get on that machine
With pleasure.
Ran it five times as my server is a bit busier today: 13.4/12.8/12.9/12.6/12.5
And re-ran the old version five times, for completeness: 16.3/16.0/15.9/15.9/15.9
Interestingly, it doesn't seem to really scale past 24 threads. Between 16 and 24 it gets faster the more threads I chuck at it. Over 24 and it starts to slow. nice/ionice make no noticeable difference.
Whoop whoop, that's awesome
I tried doing this in PHP too, and the end result was very similar to yours but with some changes which should amount to ~5% better performance (which just shows how hard it is to further optimise it):

Instead of `isset($stations[$city])`, you can check whether `$station = &$stations[$city];` assigns null! This way you only need to access the hashmap once, and then change `$station` as necessary.

Both the `fgets` result and `$chunk_start > $chunk_end` are used to decide whether to end the loop, but it is possible to skip the first check. Although I'm not sure if this will bring a noticeable difference.

As for the ZTS+Mac issue, you can try to either run the code in a Docker container (`php:zts` plus `pecl install parallel` was able to run your solution), or use NTS+multiprocessing by spawning separate processes with either `amphp/parallel` or just `proc_open`! (I used the latter in my case)

Edit: I also see that I skip a char at `strpos($line, ';', 1);`. I don't think it has any performance impact though 😄

Removing the `isset()` shaved off 2.5% of wall time, that is an awesome idea and was easy to check. Making `$chunk_start < $chunk_end` the expression of the `while` loop and moving the `fgets()` into the loop body shaved off another 2.7% of wall time on my machine. I'll update the blog post once I am back from DutchPHP. That's awesome! Thank you!
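For anyone wanting to try the reference trick from this thread, here is a minimal, self-contained sketch; the `$stations` array and the `[min, max, sum, count]` aggregate layout are just assumptions for illustration, not the exact code from the blog post:

```php
<?php
// Hypothetical aggregate layout per station: [min, max, sum, count].
$stations = [];

function addMeasurement(array &$stations, string $city, float $temp): void
{
    // Taking a reference to a missing key creates it with a null value,
    // so a single hashmap access covers both the "new" and "update" cases
    // instead of isset() followed by a second lookup.
    $station = &$stations[$city];
    if ($station === null) {
        $station = [$temp, $temp, $temp, 1];
        return;
    }
    if ($temp < $station[0]) { $station[0] = $temp; }
    if ($temp > $station[1]) { $station[1] = $temp; }
    $station[2] += $temp;
    $station[3]++;
}

addMeasurement($stations, 'Hamburg', 12.0);
addMeasurement($stations, 'Hamburg', -3.4);
print_r($stations);
```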
Oh, I'm glad both changes helped! Compared to the fread ones they were much simpler to implement, but I just realized we can simply use `stream_get_line` to fetch the station name, which is faster and simpler than the other approaches!

Replied on twitter already, but for visibility: switching to `stream_get_line()` shaved off another 15%. This is awesome!
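In case it helps readers follow along, here is a rough sketch of what parsing with `stream_get_line()` can look like; the buffer sizes and the bare-bones loop are assumptions, not the post's final code:

```php
<?php
// Sketch: read "station;temperature" lines using stream_get_line(),
// which stops at the given delimiter and skips manual strpos()/substr() work.
$fp = fopen('measurements.txt', 'rb');
if ($fp === false) {
    exit(1);
}
// 128 and 32 are assumed upper bounds for station name and temperature length.
while (($city = stream_get_line($fp, 128, ';')) !== false) {
    $temp = (float) stream_get_line($fp, 32, "\n");
    // ... aggregate $temp for $city here ...
}
fclose($fp);
```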
I just tried reading the file with `fread` instead of `fgets` and I got a 15% performance improvement! The approach makes the code uglier, but it is noticeably faster at least.

I updated the blog post with your suggestions, thank you!
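For context, a chunked `fread()` loop of the kind described above might look roughly like this; the 1 MiB chunk size and the carry-over handling are assumptions for illustration:

```php
<?php
// Sketch: read large chunks with fread() and split lines manually,
// carrying any trailing partial line over into the next chunk.
// (Assumes the file ends with a newline, as the challenge input does.)
$fp = fopen('measurements.txt', 'rb');
if ($fp === false) {
    exit(1);
}
$carry = '';
while (!feof($fp)) {
    $chunk = $carry . fread($fp, 1 << 20); // 1 MiB per read (arbitrary choice)
    $lastNewline = strrpos($chunk, "\n");
    if ($lastNewline === false) {
        $carry = $chunk; // no complete line in this chunk yet
        continue;
    }
    $carry = substr($chunk, $lastNewline + 1);
    foreach (explode("\n", substr($chunk, 0, $lastNewline)) as $line) {
        [$city, $temp] = explode(';', $line);
        // ... aggregate (float) $temp for $city here ...
    }
}
fclose($fp);
```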
Incredible! So many cool insights learned from this!
Good stuff!
Type casting and a huge change, neat!!
Have you checked how generators and async would fare?
Also, about the sorting: the end result is much faster indeed, but you haven't kept the sorting, so it isn't exactly the same output, is it?
In any case; thanks for sharing and inciting people to experiment with performance tracing!
My God. Thank you for the article. Finally I can optimize my code!
What a write-up, I learned a lot from your post. Huge thanks!
Bravo. I really appreciated the effort. PHP deserves some love and caring; it can pay back 10 fold :-)
Is there any test file to download? As far as I understood the source data can be created from a script, but for a first test a simple file would be easier.
Hey there,
I created a repo with the source code for PHP at github.com/realFlowControl/1brc.
This also has a PHP script to generate the `measurements.txt`.
Hope this helps!
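If you just need something quick to test with, a throwaway generator along these lines will do; note this is not the script from the repo above, and the station names, row count, and temperature range are made up:

```php
<?php
// Throwaway generator: writes "station;temperature" lines to measurements.txt.
$stations = ['Hamburg', 'Istanbul', 'Abha', 'Accra', 'Adelaide']; // tiny sample set
$rows = 1_000_000; // raise towards 1_000_000_000 for the real challenge
$fp = fopen('measurements.txt', 'wb');
for ($i = 0; $i < $rows; $i++) {
    $city = $stations[array_rand($stations)];
    $temp = random_int(-999, 999) / 10; // one decimal place, roughly -99.9..99.9
    fwrite($fp, $city . ';' . number_format($temp, 1, '.', '') . "\n");
}
fclose($fp);
```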
Hi Florian,
it works great! The timing is reported in ms, but it seems to be seconds.
It took about 160s to write a 132 MB file on my local drive, but the time to read the file will greatly depend on the file system, caching and the reading strategy. Depending on your system, reading will be much faster the second time you open the file.
Notepad++ opens the file without any delay, as it only reads the part that is displayed. The standard editor takes about one minute to do the same, as it reads the whole file at once. We can do the same by reading the file in chunks; since file operations are handled in the background by the drive, they usually do not put much strain on the processor.
So, if we do that, we get a nice contest of file operations, but as you don't have much control over the details in most high-level languages, what does the contest really measure?
Wow you gave me a great idea for my own application lol. Use threads!
Could you post the code to a github repo? Would be nice to get a link to this on 1brc.dev - however, the convention seems to be to link to a github repo, not a blog post.
Excellent work!
Hey there,
thanks for the hint, I created a GitHub repository and added the link to the issue you opened already! Thanks for that!
@realflowcontrol The repo is now linked from 1brc.dev. I was about to submit a PR but I think you beat me to it.
Once again, great work!
Great insights. This approach to optimization will definitely come in handy, particularly in data-driven PHP applications that implement big data processing and ETL operations.
A classical example is in the case of receiving seller data from Amazon Seller central, knowing that one seller/user could have millions of records which Amazon will forward to the PHP application.
It becomes a wild goose chase when the server has to process these millions of records for ~25 minutes. The result is often server time-outs or crashes during execution.
With this approach, I see that there's refinement in performance to serve resources at improved speed.
Did you consider using spl or even ext-ds instead of arrays? Maybe it would help a little bit as well.
That is a good idea, thanks for letting me know. I'll see if I find time. If you'd like to give it a shot, you can find a GitHub repo holding the source at github.com/realFlowControl/1brc
Profiling shows me, though, that most of the time (>80% of wall time) is spent reading the data from the file via `fgets()` and then converting it (splitting on `;` and type casting). Nevertheless, shaving off some time with a better data structure would be nice, and it would also be a nice opportunity to dive into `ext-ds`.
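To illustrate the idea, here is a minimal sketch of what swapping the plain array for `Ds\Map` from `ext-ds` could look like, assuming the same hypothetical `[min, max, sum, count]` aggregate layout; this is not benchmarked, just the shape of the change:

```php
<?php
// Requires ext-ds. Same aggregation as before, but backed by Ds\Map.
use Ds\Map;

$stations = new Map();

function recordTemperature(Map $stations, string $city, float $temp): void
{
    $agg = $stations->get($city, null);
    if ($agg === null) {
        $stations->put($city, [$temp, $temp, $temp, 1]);
        return;
    }
    if ($temp < $agg[0]) { $agg[0] = $temp; }
    if ($temp > $agg[1]) { $agg[1] = $temp; }
    $agg[2] += $temp;
    $agg[3]++;
    $stations->put($city, $agg); // arrays are values in PHP, so write the update back
}

recordTemperature($stations, 'Hamburg', 12.0);
recordTemperature($stations, 'Hamburg', -3.4);
var_dump($stations->get('Hamburg'));
```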
Beautiful!! Your method refined my thought-process significantly... thanks for this!!
Great article! It's always good to learn new performance techniques.
Great post. Thanks for your insight and approach.
@realflowcontrol Thanks for the fantastic article! If I may, there is a typo: CFLAGS and CLFAGS.
Fixed it, thank you and awesome you liked it!
Thank you so much for this post, it was really great.
This was incredible. Your work has given some credibility to my favorite language. Hats off to you.
Which tool are you using for profiling and visualization of profiling output?
In my day job I am working on the PHP Profiler at Datadog. The screenshots you can see in the blog post are from the Datadog backend.
Interesting task. It's just that the dataset is not very realistic. Usually you would have a timestamp for each reading, so you would need to process the timestamps too.
In a real-life case, you would try to avoid that kind of "brute force" reading:
This is the right answer, and even simple business metrics get segmented like this. The actual "solutions" you end up needing in the real world come down to efficient queries and table structures working together with caching. Intelligent database design is what you need to process billions of rows: lowly SQL. So it also confuses me when people compare how fast a language is at performing tasks which, in real life, are going to be handled by the database. I appreciate the thought exercise, but we end up arguing about which orange soda tastes best at the apple pie fair.
Hey! It's a great attempt from you.
If, after the PHP optimizations, the speed is capped mostly by the line reading, I do wonder: is it allowed to install any extension for this test? The idea is to "hack" a customized version of stream_get_line just for this scenario.
I mean, I had to install `ext-parallel` in order to run threads in user land. I guess something that is specifically targeted at the 1brc challenge might count as cheating, but extensions that already exist, like `ext-ds` for data structures, why not? I am currently unaware of an extension that bypasses the PHP stream layer, but something like this could help.

Great step-by-step optimization. Two things come to mind: if you use explode or a simple for loop to iterate through characters, that might cut off some cycles. Also, I'm wondering about the ksort(): since it sorts by cities, maybe there can be an array and then just iterate through the sub-arrays?
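As a rough illustration of those two suggestions, here is a tiny sketch that parses lines with `explode()` and applies `ksort()` only once at the end; the sample lines and aggregate layout are made up:

```php
<?php
// Parse "station;temperature" with explode() and sort by city once at the end.
$stations = [];
$lines = ['Hamburg;12.0', 'Abha;-3.4', 'Hamburg;8.9']; // stand-in for the real file
foreach ($lines as $line) {
    [$city, $temp] = explode(';', $line);
    $temp = (float) $temp;
    if (!isset($stations[$city])) {
        $stations[$city] = [$temp, $temp, $temp, 1]; // min, max, sum, count
        continue;
    }
    $s = &$stations[$city];
    if ($temp < $s[0]) { $s[0] = $temp; }
    if ($temp > $s[1]) { $s[1] = $temp; }
    $s[2] += $temp;
    $s[3]++;
    unset($s);
}
// Sorting a few hundred city keys once is cheap compared to parsing a billion lines.
ksort($stations);
foreach ($stations as $city => [$min, $max, $sum, $count]) {
    printf("%s=%.1f/%.1f/%.1f\n", $city, $min, $sum / $count, $max);
}
```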
Hi, thank you so much for this test. I wanted to try it, but creating a dataset with 1,000,000,000 rows takes so much time, maybe a week, and the final size would be about 20 GB. I have created a bash script that generates the data; is there any faster way to do it?
You are welcome!
The resulting file should be around 13 GB, and there is a faster version of the generator available in the original repo at github.com/gunnarmorling/1brc in `create_measurements_fast.sh`.
@realflowcontrol
Thanks for this great article! I tried to run the krakjoe\parallel code locally on PHP 8.3 and ran into a zend_mm_heap corrupted error, but only when the thread count is greater than 1.
Any idea how to troubleshoot it?
Cheers!
Whoops, can you open an issue here: github.com/krakjoe/parallel/issues...
I would need the exact PHP and parallel version (including patch) and whatever else you got.
One more thing: are you using the latest released version or the version from the `develop` branch? I am asking because the `develop` branch has a known zend_mm_heap corrupted error that needs fixing before I do a new minor release.

This was a pretty cool write-up, giving people an idea of how to approach optimizing their code. Really enjoyed that.
A couple of questions:
Hey Ravavyr,
glad that you liked it!
I ran the code on M1 MacBook Pro with 64 GB of Ram.
The code just outputs to `stdout`, which on invocation I redirect to a file on disk.
Hope this helps 🖖
This was interesting in spite of being on the topic of PHP 😄
Hi Florian, do you recommend some online courses/tutorials for DataDog?
Hey @savgoran ,
the official docs should have you covered; for individual questions you can sign up to Datadog's public Slack. In case it is about profiling and/or tracing, you'll find #apm and #profiling channels in that Slack to ask folks; I am there as well.
Thank you
Hi!
maybe you could use Swoole for co-routines and async::read for asynchronous reads?
Thank you, which PHP profiler did you use?
In my day job I am working on the PHP Profiler at Datadog. The screenshots you can see in the blog post are from the Datadog backend.
I’ve never used a profiler in PHP before, so maybe could we get a link to the profiler?
Hey there,
the profiler I used is the one I am working on in my day job. You can find it at github.com/DataDog/dd-trace-php/tr...
This profiler sends the gathered samples to the Datadog backend where I took the screenshots from.
Could you explain why it's better to use the 8.4 beta version instead of PHP 8.3? Are there any performance improvements, and if so, where exactly?
Hey Ahmed,
PHP 8.4 comes with a new JIT and frameless function calls and such. Generally speaking: newer versions of PHP should always be faster, not always by a lot ;-)
Impressive! Which profiler did you use to gain these insights?
In my day job I am working on the PHP Profiler at Datadog. The screenshots you can see in the blog post are from the Datadog backend. The profiler is open source and can be found at: github.com/DataDog/dd-trace-php
I wonder, would it be possible to use fibers here (which is core since 8.1) instead of parallel?
Fibers are cooperatively scheduled, meaning they run in your current thread and you have to schedule them manually. We can't use fibers to do parallel processing like it is possible with threads.
There is a nice writeup about it at php.watch/versions/8.1/fibers
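A tiny illustration of that point; this isn't from the blog post, it just shows that a `Fiber` only makes progress when the caller resumes it, all within a single thread:

```php
<?php
// A Fiber runs in the current thread and only advances when we resume it,
// so it enables cooperative scheduling, not parallel execution.
$fiber = new Fiber(function (): void {
    echo "fiber: started\n";
    Fiber::suspend();          // hand control back to the caller
    echo "fiber: resumed\n";
});

$fiber->start();               // runs until the suspend() call above
echo "main: fiber is paused while the main code keeps running\n";
$fiber->resume();              // we have to schedule it ourselves
```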
Which tool did you use to create the timeline view of the performance of the PHP scripts?
In my day job I'm building a PHP profiler for Datadog, where this timeline visualisation is relatively new.