Numerous copy routine implementations are readily available in .NET. If I were to simply list them alongside a few benchmark numbers and charts, it wouldn't make for a very interesting article.
⚠️ What if I told you upfront that none of these routines is designed to be the absolute fastest?
If you're interested in a basic comparison, I recommend checking out this article here on dev.to: What is the best way to copy an array? or a more detailed, older comparison: High performance memcpy gotchas in C#.
Here, I'll outline a list of options, but this is far from the whole story:
- A simple `for` loop (hint: `foreach` is usually a bit faster)
- `Array.Copy`
- `Span.CopyTo`
- `Buffer.BlockCopy`
- `Buffer.MemoryCopy`
- `Marshal.Copy`
- `Unsafe.CopyBlock`
- Imported `memcpy`
If you're currently struggling with slow array or memory copy operations, try one of the functions on this list.
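To make the list concrete, here is a minimal sketch (not benchmark code) showing what a few of these calls look like; the buffer size is arbitrary, and the pointer-based variants require compiling with unsafe code enabled:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

byte[] src = new byte[1024];
byte[] dst = new byte[1024];

// Managed, bounds-checked copies:
Array.Copy(src, dst, src.Length);
src.AsSpan().CopyTo(dst);
Buffer.BlockCopy(src, 0, dst, 0, src.Length);

// Pointer-based copies; managed arrays must be pinned first:
unsafe
{
    fixed (byte* pSrc = src, pDst = dst)
    {
        Buffer.MemoryCopy(pSrc, pDst, dst.Length, src.Length);
        Unsafe.CopyBlock(pDst, pSrc, (uint)src.Length);
    }
}

// Marshal.Copy moves data between managed arrays and unmanaged memory:
IntPtr native = Marshal.AllocHGlobal(src.Length);
Marshal.Copy(src, 0, native, src.Length);
Marshal.FreeHGlobal(native);
```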
However, there are some elephants in the room, and I plan to uncover a few of them.
Elephant No.1 - Cache Pollution
What you might read in many places is that framework functions are already highly optimized. This is true, but they are not necessarily optimized for the highest possible speed.
Built-in functions provided by the .NET framework are optimized in various ways. One significant consideration is preventing cache pollution. Wait - what? Yes, x86 CPUs achieve their speed primarily thanks to their cache. If the cache is disabled or not utilized properly, code execution speed drops dramatically.
Cache Pollution Explained
Cache pollution when copying large data blocks occurs when frequently used data gets evicted from the CPU cache by the data being copied. Chances are that once you are done copying, you won't touch the same data ever again, so placing it in the CPU cache is unnecessary.
For example, imagine a network stack - once data is sent, it's unlikely to be touched again. Similarly, when loading textures with the CPU for GPU usage, caching may be unnecessary. Standard .NET functions mitigate this issue when copying larger blocks (usually above 1MB) by using non-temporal access, which bypasses the cache.
This might not suit your use case. For instance, if you are waiting for the copy to complete before doing anything else, some cache pollution would be an acceptable tradeoff, especially when the data is likely to be cached anyway, as in simulations or CPU rendering.
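To make "non-temporal access" concrete, here is a minimal sketch of such a copy loop using the AVX intrinsics from `System.Runtime.Intrinsics.X86`. This is my own illustration, not the framework's internal code; it assumes AVX hardware, a 32-byte-aligned destination, and a length that is a multiple of 32:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe void CopyNonTemporal(byte* src, byte* dst, nuint length)
{
    for (nuint i = 0; i < length; i += 32)
    {
        Vector256<byte> v = Avx.LoadVector256(src + i); // normal (cached) load
        Avx.StoreAlignedNonTemporal(dst + i, v);        // store bypassing the cache (vmovntdq)
    }
    Sse.StoreFence(); // make the non-temporal stores globally visible
}
```

The AVX variants benchmarked below batch several loads before storing; this sketch keeps one vector per iteration for clarity.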
▶️ How can we observe this? And what can we do about it?
The chart above compares various buffer sizes (ranging from 1MB to 100MB) and copy methods. The leftmost data points, for the 1MB block, show great performance for `Buffer.MemoryCopy` and `Unsafe.CopyBlock`, the two best methods available in .NET for memory copying. However, performance falls sharply past 1MB.
To illustrate the real hardware limitation, notice the comparison between the AVX-based copy routines using 256-bit (orange) and 512-bit (red) vectors, and the difference between normal loads and the non-temporal 512-bit variant (light blue). These tests were run on a Ryzen 9 9950X, with 48kB of L1 data cache, 1024kB of L2, and 32MB of L3 available to any single CPU core. Total cache sizes are 1280kB of L1 (16x 32kB instruction + 16x 48kB data), 16MB of L2 (16x 1024kB), and 64MB of L3 (2x 32MB).
Using cached AVX loads, high copy speed is sustained even for 8MB (and larger) blocks. However, when copying blocks that are large relative to the cache size (e.g. 100MB), the built-in functions regain their advantage.
Elephant No.2 - Overhead
My first benchmark shows that splitting larger buffers into smaller chunks can improve performance at the expense of increased CPU cache utilization.
There are other factors causing slowdowns, for example:
- Managed memory introduces framework checks for each access.
- Unaligned memory access can decrease CPU cache efficiency.
- And finally, even with unmanaged, 64-byte-aligned memory, there is still a final question:
▶️ What is the optimal block size?
This depends on the actual CPU, but we are trying to strike a balance between call overhead and efficient CPU resource utilization.
The chart above shows the throughput for transferring a 32MB buffer using various block sizes and methods. The key takeaway is that `Buffer.MemoryCopy` (blue) and `Unsafe.CopyBlock` (yellow) perform best with block sizes between 8kB and 1MB. Notably, the chunked calls outperform the same methods invoked once for the whole 32MB buffer (orange).
Methods represented by horizontal lines do not use variable block sizes. It is worth mentioning that the AVX variants always load 8 vectors (of 256 or 512 bits) before storing them, thus effectively working with 256- and 512-byte blocks regardless of the total buffer size.
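As a concrete illustration of the chunked approach, here is a minimal sketch (`CopyChunked` is a hypothetical helper, not a framework API) that allocates 64-byte-aligned unmanaged buffers and copies 32MB in 128kB blocks, a size inside the measured 8kB-1MB sweet spot; the exact optimum is CPU-specific:

```csharp
using System;
using System.Runtime.InteropServices;

static unsafe void CopyChunked(byte* src, byte* dst, nuint length, nuint blockSize)
{
    for (nuint offset = 0; offset < length; offset += blockSize)
    {
        nuint bytes = length - offset < blockSize ? length - offset : blockSize;
        Buffer.MemoryCopy(src + offset, dst + offset, bytes, bytes);
    }
}

unsafe
{
    nuint size = 32 * 1024 * 1024; // 32MB, as in the benchmark above
    byte* src = (byte*)NativeMemory.AlignedAlloc(size, 64); // 64-byte-aligned, unmanaged
    byte* dst = (byte*)NativeMemory.AlignedAlloc(size, 64);

    CopyChunked(src, dst, size, 128 * 1024); // 128kB blocks

    NativeMemory.AlignedFree(src);
    NativeMemory.AlignedFree(dst);
}
```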
Elephant No.3 - Multi-Threading
So far, we have tested only single-threaded performance, as memory speed and cache capacity are often the limiting factors. However, we are not utilizing all CPU resources, and that might be costing us some performance!
▶️ Can we combine the previous techniques with multiple threads?
Modern CPUs, even desktop ones, are becoming increasingly heterogeneous. By splitting the workload across multiple threads, we might take advantage of more CPU resources:
The chart above shows the chunked, multi-threaded approach. For reference, the test system uses dual-channel DDR5 6000MT/s memory; since a copy both reads and writes every byte, the theoretical peak copy rate is about half of the total memory throughput (50% of 90GB/s).
What we observe here is a synergistic effect between smaller blocks and multiple threads!
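Here is a minimal sketch of that combination (`CopyParallel` is a hypothetical helper; the actual benchmark's partitioning may differ). The buffers are passed as `IntPtr` to keep raw pointers out of the lambda's captured state:

```csharp
using System;
using System.Threading.Tasks;

static unsafe void CopyParallel(IntPtr src, IntPtr dst, long length)
{
    const long blockSize = 128 * 1024; // assumed block size, tune per CPU
    long blockCount = (length + blockSize - 1) / blockSize;

    // Every iteration copies one independent block, so no extra
    // synchronization is needed beyond Parallel.For's completion barrier.
    Parallel.For(0, blockCount, i =>
    {
        long offset = i * blockSize;
        long bytes = Math.Min(blockSize, length - offset);
        Buffer.MemoryCopy((byte*)src + offset, (byte*)dst + offset, bytes, bytes);
    });
}
```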
Final Thoughts
Standard framework functions are well-optimized, safe and sufficient for most use cases. If extreme performance is required, there are additional techniques and tradeoffs available.
As an example, a single call to `Buffer.MemoryCopy` on an 8MB buffer reaches ~30GB/s. With multiple transfers of 128kB blocks across multiple threads on the same CPU, it is possible to reach 220GB/s. That is more than a 7x improvement.