07-24-2023, 12:53 PM
I recently benchmarked `std::atomic::fetch_add` vs `std::atomic::compare_exchange_strong` on a 32-core Intel Skylake processor. Unsurprisingly (given the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program, `std::atomic::fetch_add` is implemented with a `lock add` and `std::atomic::compare_exchange_strong` is implemented with a `lock cmpxchg`.

What makes `lock add` so much faster on an Intel multi-core processor? From my understanding, the slowness in both instructions comes from contention on the cache line: to execute either instruction with sequential consistency, the executing CPU has to pull the line into its own core in exclusive or modified state (per the MESI protocol). How then does the processor optimize fetch_add internally?
A simplified version of the benchmarking code was linked in the post. There was no load+CAS loop for the compare_exchange_strong benchmark, just a single compare_exchange_strong on the atomic per iteration, with an input value that kept getting varied by thread and iteration. So it was purely a comparison of instruction throughput under contention from multiple CPUs.