
Originally published at deepu.tech.
This is a multi-part series where I'll discuss concurrency in modern programming languages. I will be building...
Interesting results! I didn't expect everyone to perform about the same. This is great!
I would hypothesize that this is due to the fact that network requests are mostly I/O-bound. That is to say, the CPU remains idle most of the time as it waits for the network to respond.
Therefore, the underlying runtime—namely Node.js for JavaScript, Tokio for Rust and Deno, etc.—is experimentally irrelevant for this use case, as you have shown in your data. It seems that under the hood, all runtimes manage to process the requests faster than the network/hardware can provide the bytes, hence the insignificant differences in the various time-based metrics. TL;DR: the network may be the bottleneck, not the languages.
With that said, I would be very interested in a follow-up post where you go beyond time-based metrics since they don't paint the full picture. Namely, I would like to cite Discord's case study on why they switched from Go to Rust.
In their article, the major performance gains mostly came from the absence of garbage collection, which you also briefly mentioned in your conclusion. Go's garbage-collected runtime caused large spikes in latency and CPU usage every two minutes or so, which ultimately proved to be unacceptable at Discord's scale. I highly recommend reading their thoughts on it. 👌
Anyway, what I'm trying to say is that I look forward to an investigation into other metrics beyond "requests-per-second". As Discord's engineering team has shown, this does not always paint the full picture. Data on CPU and memory usage would definitely make your series more comprehensive.
Nevertheless, this is an excellent write-up!
Aye, the findings about Go match my experience as well. It's very useful for static caches (e.g. ZIP code and address data), but horrible at LRU caches and the like. If you have an upper bound on your memory usage and know you can keep that in memory on your instance, it's great and super quick. If you need to free up memory and dynamically replace cache entries, it falls apart.
Yes, I fully agree, and that's why I added a disclaimer. This is a very simple benchmark; for a real-world use case there are considerations beyond this, and Rust has way more benefits than concurrency to win over Go. I would choose Rust over Go any day. And thanks for the Discord article, I hadn't seen that before, it's very interesting.
There are two things that look fishy to me in those results:
This suggests there was a common bottleneck outside of your server implementations, and you've measured the performance of that bottleneck, not the servers. Which also means the results are probably inconclusive and you can't interpret them as "Rust has won".
I looked quickly at your code and it seems you're opening a new connection for each request. This typically adds a large amount of latency and system load to each request and might become a problem, particularly at low concurrency levels like 100.
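To make the connection-per-request cost concrete, here is a minimal sketch (not the benchmark's actual code) that counts how many TCP connections a tiny local server sees when a client reuses one keep-alive connection versus opening a new one per request. The handler and the helper name `count_connections` are made up for illustration; only Python's standard library is used.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 is required for the stdlib server to keep connections alive.
    protocol_version = "HTTP/1.1"
    peers = set()  # distinct (client_ip, client_port) pairs seen so far

    def do_GET(self):
        CountingHandler.peers.add(self.client_address)
        body = b"ok"
        self.send_response(200)
        # Content-Length lets the client know where the response ends
        # without the server having to close the connection.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def count_connections(n_requests, reuse):
    """Send n_requests GETs and return how many TCP connections the server saw."""
    CountingHandler.peers = set()
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    conn = None
    try:
        for _ in range(n_requests):
            if conn is None or not reuse:
                conn = http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()
            if not reuse:
                conn.close()
        if conn is not None:
            conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return len(CountingHandler.peers)


print("keep-alive:", count_connections(5, reuse=True), "connection(s)")
print("new connection per request:", count_connections(5, reuse=False), "connection(s)")
```

Each extra connection means another TCP handshake and teardown, which is overhead the load generator and the kernel pay, not the server code under test.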
A few suggestions for better benchmarking:
For a throughput comparison you need to verify if the servers are really working at their full speed, so you should capture CPU load. It is also good to capture other system metrics like system CPU time, cache misses, context switches and syscalls, which are often a good indicator of how efficiently the server app interacts with the system.
Cache connections and leverage HTTP keep-alive. That makes a tremendous difference in throughput.
Play with different concurrency levels. If concurrency is too low and latency is too high, you won't get the max throughput. The server would simply wait for requests, handle them quickly and go idle waiting for more. Also switching between idle and active is costly (context switch).
In latency tests, the median latency is not as interesting as a full histogram. I'd expect large differences in P99 between GCed and non-GCed servers. So even if medians are very close, it doesn't mean the servers would work equally well in production. Obviously you should do latency tests at lower throughput than max, so those should be separate experiments.
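As a rough illustration of the median-versus-P99 point, here is a small sketch using the nearest-rank percentile definition. The latency samples are invented for the example, not measured from any of the servers discussed here.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: both servers answer most requests
# in 5 ms, but the second one stalls on 2 requests out of 100.
steady = [5.0] * 100
spiky = [5.0] * 98 + [250.0] * 2

for name, data in (("steady", steady), ("spiky", spiky)):
    print(f"{name}: median={percentile(data, 50)} ms, p99={percentile(data, 99)} ms")
```

Both made-up servers report the same median, yet their P99 values differ by a factor of 50, which is the kind of gap a median-only table hides.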
Anyway I'd love to see updated results, because you seem to have put a lot of work into multiple implementations and it would be a pity if you stopped now ;)
Never run client/server benchmarks on the same computer.
The process to generate loads will inevitably impact the process to serve the requests.
The best infra for benchmarking is two independent physical machines. Not even VMs, as they also compete for CPU resources.
Depends on how efficient the load generation tool is vs. how much work on the server side is required to handle the request. You can also pin those two processes to different CPU core sets. This way one computer is enough to get meaningful results. Obviously, if you don't know what you're doing, it is better to use two separate machines.
Yeah, in this case the server is quite simple and doesn't need many resources, which might explain why I got similar results from both. I would be interested in learning more about pinning processes to cores. Do you have any resources you can recommend?
man taskset
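For anyone who would rather do this from code than from the shell, here is a minimal sketch of the same idea in Python. It assumes Linux, since `os.sched_setaffinity` is not available on every platform, and the helper names `split_cores` and `pin_current_process` are made up for this example. From the shell, `taskset -c` with a CPU list does the equivalent for a whole command.

```python
import os


def split_cores(cpus):
    """Split a set of CPU ids into two disjoint halves,
    e.g. one for the load generator and one for the server."""
    ordered = sorted(cpus)
    half = len(ordered) // 2
    return set(ordered[:half]), set(ordered[half:])


def pin_current_process(cpus):
    """Restrict the calling process to the given CPU ids (Linux only)
    and return the affinity mask that is now in effect."""
    os.sched_setaffinity(0, cpus)  # pid 0 means "this process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    available = os.sched_getaffinity(0)
    client_cpus, server_cpus = split_cores(available)
    print("load generator could use:", sorted(client_cpus))
    print("server could use:", sorted(server_cpus))
else:
    print("CPU affinity is not exposed by this platform's os module")
```

The load generator would call `pin_current_process(client_cpus)` at startup and the server would do the same with `server_cpus`, so the two never compete for the same cores.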
Honestly, I didn't expect people to take this so seriously or even for the post to do well. I was just wrapping up a series that was taking a lot of effort and not getting much interest in terms of views. But man, this blew up. Now I think I have to rework this into something better 😂
That's what happens when you publish benchmarks! 😂
lesson learned :P
When I started the series, I did want to capture more metrics, but that kept getting pushed back, and it's been months, so I decided to do something simple at least. The source is on GitHub, so if you are interested, feel free to use it and publish a follow-up. I might not have time anytime in the near future due to other commitments. The metrics you are suggesting will take a lot of effort and time to do properly. The bottleneck is the sleep I introduced, so theoretically 25 seconds is the best possible for this code. If I remove the sleep this is the result for same 10k req with 100 concurrent
Concurrency level 100 is way too small. Try something in the range 500-5000. Beware that ab is not a good tool for testing high concurrency.
Yeah, I wasn't expecting people to take this simple experiment of mine so seriously. I'll try to update the tests to something better.
I have updated the benchmarks with more data. WDYT now?
When you add a 2 second delay every 10 requests, you make the comparison totally meaningless. You are measuring delays, not the code.
Reading the file in every loop measures reading from disk, not actual program performance.
Also, ab is not a good tool for measuring. Usually you are measuring the performance of ab, not of the system, which can be 20 times faster than what ab can measure. Use github.com/wg/wrk instead.
When I remove the delay, the Go program crashes when testing with wrk: Error reading:EOF.
I was about to comment on the topic, thank you for pointing this out and not blindly trusting the internet.
The massive hint is the extremely similar performance between all languages. I'm sure the intent of this article was to help the community, but I hope the author will understand their mistake and update the results accordingly.
I have updated the benchmarks with more data. WDYT now?
I did add a disclaimer that this is a simple concurrency benchmark. I don't agree that it's meaningless, as I'm comparing the exact same implementation across languages to see if the language/platform makes any difference; the sleep was added to introduce a concurrency bottleneck. This is not an HTTP performance comparison, it's a concurrency comparison, and for that I think ab is as good as any other tool. I'll try wrk and post the results.
Instead of sleeping 2 sec or even 200 ms, you should sleep 3-8 ms to simulate access to a fast SQL server. Then you would have a meaningful request rate.
I have updated the benchmarks with more data. WDYT now?
All you demonstrated is that thread sleep works in all languages, and that overhead is significantly smaller than 200ms. You did not really benchmark the various language / server combos.
If it makes you all happy, I'll update the post with numbers from the code without sleep. Oh, and there are cases where a request takes more than 2 seconds; I have performance-tuned many such systems when I was working for enterprise companies. Also, do suggest a better approach to simulate a thread-blocking request.
And I reiterate, in case it wasn't clear from the title, intro, or all the previous posts: I was trying to benchmark the performance of concurrency, not web server performance. They are related but not the same.
System sleep is not exactly 2 seconds, it can vary wildly.
Basically you are comparing nonsense. This is like comparing people who jump on top of a 2000 m mountain, saying that one jumped 2000.3 meters and another jumped 2000.26 meters, and concluding that the "results are almost same". And actually, in every jump the mountain height varies a lot.
If you want to test concurrency, then take the web code away. Give every program the same number of loops to run. And let the tests run 5 seconds or preferably more; anything less is not statistically valid.
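The claim that a sleep is never exactly the requested duration is easy to check. Here is a minimal sketch that requests a 10 ms sleep repeatedly and reports how far each one overshoots. The 10 ms value and the repeat count are arbitrary choices for the demo, and the numbers will vary by OS and system load.

```python
import time


def measure_sleep_overshoot(requested_s, repeats):
    """Sleep `repeats` times for `requested_s` seconds and return, for each
    run, how much longer than requested the sleep actually took (seconds)."""
    overshoots = []
    for _ in range(repeats):
        start = time.perf_counter()
        time.sleep(requested_s)
        elapsed = time.perf_counter() - start
        overshoots.append(elapsed - requested_s)
    return overshoots


overshoots_ms = [o * 1000 for o in measure_sleep_overshoot(0.010, 20)]
print(f"overshoot min={min(overshoots_ms):.3f} ms, max={max(overshoots_ms):.3f} ms")
```

If the spread between min and max is of the same order as the differences between the servers being compared, those differences cannot be attributed to the servers.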
I have updated the benchmarks with more data. WDYT now?
Good. Could you update the code in repo too - they contain sleep().
It's a different branch now (nosleep)
It's remarkable too that none of the other commenters noted this. As the adage goes: just because you read it on the internet doesn't make it true! Good catch.
I have updated the benchmarks with more data. WDYT now?
I think it's a lot better! I suspect that some of the differences now are due to technical trivia of superficially irrelevant details you happened to choose when implementing these programs, but that's the real world for you. The current data highlights much more clearly just how many req/sec any of these options can handle - because I think that's the real takeaway here; the web frameworks themselves are unlikely to be a significant bottleneck in any real-world use case; and if the language and/or framework matters for heavier, real workloads - well, that's the kind of thing you can't microbenchmark well; you need a real use case.
What the current data also highlights more clearly is just how finicky perf at this level is; e.g. the way the program using the go http stack apparently is much more efficient than the program you labelled TCP; or how wrk and drill results are quite different. And that's important to understand; microbenchmarks are notoriously flaky and sensitive to all kinds of details you don't actually care about. Taking a microbenchmark to mean that task X takes time Y is usually the wrong way to think about it - it takes time Y only in the hyper-specific circumstances of the test - but generalizing a specific result is quite error-prone.
I think this is an excellent takeaway
I'm sorry, but this is once again a biased argument for Rust. Talking about threading control is OK, but what do we see in reality? Rust has no native HTTP library, you have more lines of code to type, and it's not as simple to read as the Go TCP version.
Moreover, you say that you don't have as much control over threading with Go... But goroutines are made for using concurrency or threading without having to develop the switching yourself. And if you want threading and control, you can use C inside Go, or there are packages for this, so you can avoid goroutines.
Rust is cool for memory management and, why not, low-level development like in the Linux kernel. But it's way more complicated than Go for developing such an HTTP service. Instead of switching to Rust, I'd prefer to ask the Go creators to help fix garbage collection control,
and leave Go to manage concurrency and threading with ease and efficiency.
Sorry for my answer, but I see too many pro-Rust articles with too much criticism of Go.
The biggest selling point of Rust is IMHO fearless concurrency with a guarantee of no data races. So while Go (and JS and Java) programs may initially appear simpler to write, because they give a bit more freedom to the programmer, at the end of the day they are often not as easy to reason about. It is trivial to guarantee that a piece of code won't be called concurrently in Rust; I can see explicitly what is allowed to run concurrently and what is not, and if I accidentally try to invoke non-thread-safe code in a multithreaded context, it simply won't compile. Fixing a compile-time error vs. fixing code that fails once a week in production, only under heavy load - the choice is pretty obvious to me.
Exactly, and that's why I said Rust is better for multi-threading. Performance is just an added bonus.
There are tools in Go to check for race conditions.
The fact is that there are two paradigms, not a better one over the other.
Rust is not suitable for developing REST APIs, at least not as easily as with Go or even Python. Rust is very cool for developing low-level tools with increased control of memory management, in the case of low CPU cost applications.
But when we develop an HTTP application with a very complex management of coroutines to manage SSEs, with messages coming from different routines, Rust becomes purely and simply infernal.
Rust has its advantages, but you have to keep in mind that other languages are not outdone. I don't see myself developing machine learning in Go or Rust - I don't develop kernel modules in JS, and I definitely don't do REST APIs in Rust or C.
And for so many reasons, I'd need a multi-step article to demonstrate that Rust doesn't fit so well in many areas.
So, having made a few REST services with Rust, I can say that it's fine for it. I just generated the stub code with the OpenAPI CLI generator from a spec file, and then implemented the business logic as I would do in most languages.
The main disadvantage of Rust is that it's more difficult to learn. Of course, that is just my opinion. But being more difficult, it is also more expensive to hire competent devs to maintain your application once you, the master programmer, have finished it.
I think I could have made the same services with Node or Python in 25% of the time, with no fear of data races, due to the nature of the services. Also, I/O to the cloud provider would be the bottleneck in most of the applications, not time spent in logic.
So my take away after an enjoyable 16 months of exclusively programming Rust is that it is not the tool for everything. If you are writing an MPEG encoder, or a scientific calculation library, it would be great, but if you are writing wrappers for other services, there are better languages with cheaper development costs.
I agree. It's definitely not suitable for everything
I'm a polyglot developer with roots in Java and JavaScript, who later did Go, C#, PHP, Python and Rust, so I'm not trying to be biased. I have done most of my coding in Java, JS/TS, Go and Rust, and what I wrote is based on what I experienced. For me the only selling points of Go over Rust are that it's simple to read (not to write; Go is way more boilerplate for bigger projects due to the lack of generics) and that concurrency is easier to write. When I said control over concurrency, I just stated the fact that Rust offers more control in that area than Go. I did say Go was easier for async than Rust. So for me this is not from a fanboy standpoint but more from a user who finds one product better than the other. I'm not married to Rust or Go; if there is a new language that is better than Rust, I'll sing its praises in a heartbeat.
See this PR for example, in Rust you don't even have to think about data races
Also this post from another comment talks in detail about why people who try both Go and Rust end up preferring Rust for such use cases discord.com/blog/why-discord-is-sw...
That's not the case for everyone. There are plenty of examples of developers who switched from Rust to Go, from JS to Python, from Python to Go, from Java to Rust...
95% of users are on Windows, that doesn't make Windows the best OS. That's the same for languages and technologies.
Go is simpler to use, like Python is simpler to use. That helps to develop faster and with a certain level of needed control. Discord needed to avoid long LRU cache cleaning: OK. Now, is this very important for 95% of the websites in the world ?
You know what, I have never cleaned the cache of any website I've built in Go. Their memory footprint is low... so... why use Rust here? (And one of the APIs I develop has 100k requests/sec to manage.)
It might be interesting to also measure the resource use while the test is running. I did something like that earlier between Java and Rust. Where for cpu it was pretty comparable, but memory use with Rust was much lower.
Rust uses far fewer resources. Actually, that would be an interesting metric to look at. I know from experience that Rust uses way less memory than all the others for the same stuff. I wrote KDash in Rust, which is way more graphically intensive than kubectl, but it still uses 6-7 times less memory than kubectl. For memory usage my bet would be Rust < Go < Deno < Node.js < Java
Not mentioned, but Elixir's parallelism is something else, since it is based on the Erlang VM called BEAM.
Referenced from this article
The concurrency system has been tested with a single, really buff machine handling 2 million concurrent websockets.
The Erlang VM also comes with the OTP system: you can run cron jobs internally, have caching without Redis, and have processes restarted when their parent process notices that they crashed.
What was the actual process doing? It seems that every request had 200 ms baseline delay and for example Rust took 0.7 ms over that vs Node.js taking 4-7 ms. If you get rid of that 200 latency, Rust should be 5-10x faster than Node.js in this test.
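The arithmetic behind that estimate, spelled out as a small sketch. The 200 ms baseline and the 0.7 ms and 4-7 ms overheads are the figures quoted in the comment above; the function name `ratios` is made up for the example.

```python
def ratios(baseline_ms, fast_overhead_ms, slow_overhead_ms):
    """Return (ratio with the fixed baseline included, ratio of overheads alone)."""
    with_baseline = (baseline_ms + slow_overhead_ms) / (baseline_ms + fast_overhead_ms)
    overhead_only = slow_overhead_ms / fast_overhead_ms
    return with_baseline, overhead_only


BASELINE_MS = 200.0       # fixed delay every request pays
RUST_OVERHEAD_MS = 0.7    # time spent above the baseline
NODE_OVERHEAD_MS = (4.0, 7.0)

for node_ms in NODE_OVERHEAD_MS:
    with_baseline, overhead_only = ratios(BASELINE_MS, RUST_OVERHEAD_MS, node_ms)
    print(
        f"Node.js at +{node_ms} ms: {with_baseline:.3f}x slower with the delay, "
        f"{overhead_only:.1f}x slower without it"
    )
```

With the 200 ms delay included the two look within about 3% of each other; with it removed, the same overheads work out to roughly 5.7x to 10x.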
I'm gonna look into that
How did you compile each program?
Each was compiled using its native compiler, in production mode if available.
When publishing benchmarks you have to give exact steps to repeat, so exact command line parameters and compiler flags used to compile should be given.
Also this:
Suggests that something is way off either with your setup or your code. Async Rust and async Go are capable of running hundreds of thousands of concurrent connections.
All the commands can be found in the code repository mentioned. And for breakdown at 2000 concurrency, yes it's possible that the code is a problem. Do you see anything obvious?
At first I thought it could also be the tool itself failing at those rates, but then Node.js with multiple workers seems to work better, so I'm not sure anymore.
Updates to benchmark testing were great!
I would also recommend testing servers using one (or more) computers as the client and another as the server. For example, you were testing localhost connections only, which doesn't represent real-world performance with real sockets that well. In addition, wrk was running on 8 CPU cores, so unless you had reserved an additional, identical number of physical cores for all the test servers, asynchronous servers would get an extra boost compared to multi-threaded servers due to not over-booking the CPU that badly.
With real sockets I'd expect the server with lowest latency (Rust with async + multi-threaded) to get the best results.
If your benchmark software supports it, usually a better way to test servers is to decide timeout for a request (say 50 ms) and then test how many request/s you can execute until you start to get timeouts. Some server software is really unfair and fails to serve older request first to keep worst case latency sensible. This kind of testing would preferably ramp the request rate slowly until the timeout is triggered for a request. Best output for this kind of test would be a graph with request/s on horizontal axis and worst case latency on vertical axis.
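A rough sketch of that ramp-until-timeout procedure. The real measurement would come from a load generator; here `toy_server` is a purely hypothetical latency model (latency grows sharply as the rate approaches a made-up capacity of 10,000 req/s) so the loop can run on its own.

```python
def find_max_rate(worst_latency_ms, rates, timeout_ms):
    """Step through increasing request rates and stop at the first one whose
    worst-case latency exceeds the timeout.

    Returns (highest rate that stayed within the timeout, or None,
             list of (rate, worst_latency_ms) points for plotting)."""
    curve = []
    best = None
    for rate in rates:
        latency = worst_latency_ms(rate)
        curve.append((rate, latency))
        if latency > timeout_ms:
            break
        best = rate
    return best, curve


def toy_server(rate):
    """Hypothetical stand-in for a real measurement at `rate` requests/s."""
    capacity = 10_000  # assumed saturation point, not a measured value
    if rate >= capacity:
        return float("inf")
    return 2.0 / (1 - rate / capacity)


best, curve = find_max_rate(toy_server, range(1000, 10_001, 500), timeout_ms=50)
print("highest rate within the 50 ms timeout:", best, "req/s")
for rate, latency in curve[-3:]:
    print(f"  {rate} req/s -> worst case {latency:.1f} ms")
```

The `curve` list is the data for the suggested graph: request rate on the horizontal axis, worst-case latency on the vertical axis.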
If you end the test on the first timeout, I'd expect Java servers to fail early, because those often stall during GC, and if your timeout is pretty small, a single stop-the-world GC may be enough to ruin the run. It's possible to create Java servers that do not exhibit stalls, but the simplest implementation often fails on that.
I hope next time you will also include V-lang in this comparison challenge. Perhaps the results will be similar, but this language is really worth a look.
V-lang is indeed interesting, but last time I checked, the automatic memory freeing was really buggy (check the issues on GitHub for details), and if you don't free memory, RAM usage is obviously going to explode pretty fast if you handle e.g. 100k requests.
What about the memory footprint? It seems like it would be an important feature to consider
I have updated the benchmarks with more data. WDYT now?
Great post! I really like how easy it is to write concurrent code in Go!
Well, many of you missed a concurrent async extension for PHP called Swoole, or Workerman. It also shows impressive results, close to Golang's.
These results are very interesting! I wonder what they will look like when the full version of Zig comes out. I'm pretty sure that it'll knock out Rust.
We'll be publishing a post soon about comparing Rust and Go in 2024, stay tuned here - packagemain.tech/
Rust for Gophers with John Arundel packagemain.tech/p/rust-for-gophers