How "lock add" is implemented on x86 processors

#1
I recently benchmarked `std::atomic::fetch_add` vs `std::atomic::compare_exchange_strong` on a 32-core Skylake Intel processor. Unsurprisingly (given the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program, `std::atomic::fetch_add` is compiled to a `lock add` and `std::atomic::compare_exchange_strong` to a `lock cmpxchg`.

What makes `lock add` so much faster on an Intel multi-core processor? From my understanding, the slowness of both instructions comes from contention on the cache line: to execute either instruction with sequential consistency, the executing CPU has to pull the line into its own core in Exclusive or Modified state (from the MESI protocol). How then does the processor optimize fetch_add internally?

-------

A simplified version of the benchmarking code was linked here. There was no load+CAS loop for the compare_exchange_strong benchmark, just a compare_exchange_strong on the atomic with an input variable that was varied per thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.
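
For concreteness, here is a minimal sketch of the setup described above (all names, thread counts, and iteration counts are my own assumptions, not the original code):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<unsigned long> counter{0};

// Hammer the counter with fetch_add; on x86 this compiles to `lock add`
// when the result is unused.
void fetch_add_worker(int iters) {
    for (int i = 0; i < iters; ++i)
        counter.fetch_add(1);
}

// One CAS per iteration, no retry loop: `expected` is just a value that
// varies per thread and iteration, so most attempts fail under contention.
void cas_worker(int id, int iters) {
    for (int i = 0; i < iters; ++i) {
        unsigned long expected = static_cast<unsigned long>(id) + i;
        counter.compare_exchange_strong(expected, expected + 1); // `lock cmpxchg`
    }
}

int main() {
    const int kThreads = 32, kIters = 1'000'000;
    std::vector<std::thread> pool;
    for (int t = 0; t < kThreads; ++t)
        pool.emplace_back(fetch_add_worker, kIters); // or cas_worker, t, kIters
    for (auto& th : pool) th.join();
    std::printf("%lu\n", counter.load());
}
```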

#2
`lock add` and `lock cmpxchg` both work essentially the same way, by holding onto that cache line in Modified state for the duration of the microcoded instruction. According to Agner Fog's instruction tables, `lock cmpxchg` and `lock add` decode to very similar numbers of uops from microcode (although `lock add` is slightly simpler). Agner's throughput numbers are for the uncontended case, where the variable stays hot in the L1d cache of one core. Cache misses can cause uop replays, but I don't see any reason to expect a significant difference there.
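
For reference, this is roughly the code-gen involved; a sketch, with typical GCC/Clang x86-64 output in comments (exact registers and addressing vary by compiler):

```cpp
#include <atomic>

std::atomic<int> x{0};

// Result discarded: typically
//   lock add dword ptr [rip + x], 1
void add_ignore_result() { x.fetch_add(1); }

// Result used: typically
//   mov  eax, 1
//   lock xadd dword ptr [rip + x], eax
int add_use_result() { return x.fetch_add(1); }

// Typically
//   mov  eax, edi                            ; expected
//   lock cmpxchg dword ptr [rip + x], esi    ; desired
//   sete al
bool cas(int expected, int desired) {
    return x.compare_exchange_strong(expected, desired);
}
```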

You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to `lock add`. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for `expected`. This is not the usual use-case for CAS retry loops).
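
For contrast, the usual CAS retry-loop idiom, where a failed compare-exchange refreshes `expected` with the current value and tries again (a sketch, not the questioner's code):

```cpp
#include <atomic>

// On failure, compare_exchange_weak writes the current value of the
// atomic into `expected`, so the next attempt uses fresh data.
void increment_with_cas(std::atomic<unsigned long>& counter) {
    unsigned long expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(expected, expected + 1)) {
        // retry: `expected` now holds what another core just wrote
    }
}
```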

Or does your CAS version actually do a pure load from the atomic variable to get an `expected` value? That might be leading to memory-order mis-speculation.

You don't have complete code in the question so I have to guess, and couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like `mem_inst_retired.lock_loads` that could record the number of `lock`ed instructions executed.

With `lock add`, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.

----

It's plausible that HW arbitration could treat `lock add` and `lock cmpxchg` differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple `lock add` instructions.

Is that what you mean?

---

Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?
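
A common mitigation for both pitfalls is a shared start flag: launch (and optionally warm up) all the threads first, then release them at once so they contend for the entire timed region. A sketch, with names of my own choosing:

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> go{false};
std::atomic<unsigned long> counter{0};

void worker(int iters) {
    while (!go.load(std::memory_order_acquire)) {} // spin until released
    for (int i = 0; i < iters; ++i)
        counter.fetch_add(1);                      // timed work, same count per thread
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 32; ++t)
        pool.emplace_back(worker, 1'000'000);
    go.store(true, std::memory_order_release);     // start all threads together
    for (auto& th : pool) th.join();
}
```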


Reply

#3
> to execute both instructions **with sequential consistency**, the
> executing CPU has to pull the line into its own core in Exclusive or
> Modified state (from MESI).

No, to execute either instruction with any consistent, defined semantics that guarantee that concurrent executions on multiple CPUs do not lose increments, you would need that. That holds even if you were willing to drop "sequential consistency" (on these instructions), or even drop the usual acquire and release guarantees of reads and writes.

Any locked instruction effectively *enforces mutual exclusion on the part of memory sufficient to guarantee atomicity*. (Like a regular mutex but at the memory level.) Because no other core can access that memory range for the duration of the operation, the atomicity is trivially guaranteed.
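
To illustrate the point: even requesting the weakest memory order doesn't change the atomicity of the read-modify-write itself; on x86 it still compiles to a `lock`ed instruction (a sketch of my own, assuming GCC or Clang targeting x86-64):

```cpp
#include <atomic>

std::atomic<long> counter{0};

// Even with relaxed ordering no increments are lost: the `lock` prefix
// makes the whole read-modify-write atomic, so this still compiles to
// `lock add` on x86-64.
void relaxed_increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}
```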

> What makes lock add so much faster on an intel multi-core processor?

I would expect any tiny difference in timing to be critical in these cases, and doing the load plus compare (or compare-load plus compare-load ...) might change the timing enough to lose the chance, much like how code using mutexes can have wildly different efficiency under heavy contention, when a small change in access pattern changes the way the mutex is granted.
Reply


