In x86-64 asm: is there a way of optimising two adjacent 32-bit stores / writes to memory if the source operands are two immediate values?

#1
Is there a good way of optimising this code (x86-64) ?

mov dword ptr[rsp], 0;
mov dword ptr[rsp+4], 0

where the immediate values could be any values, not necessarily zero, but in this instance always immediate constants.

Is the original pair of stores even slow? Write-combining in the hardware and parallel operation of the μops might just make everything ridiculously fast anyway? I’m wondering if there is no problem to fix.

I’m thinking of something like (don’t know if the following instructions even exist)

mov qword ptr[rsp], 0

or

mov eax, 0;
mov qword ptr[rsp], rax ; assuming we can spare a register, a bad idea to corrupt one though


#2
**Yes, compile-time / asm-source-level write-coalescing is often a good idea, especially if both halves (or at least the high half) are zero or `-1`, so you can do the whole qword with 1 instruction**; modern x86 CPUs have efficient unaligned stores, especially when the store doesn't cross a cache-line boundary.

You generally want to minimize total fused-domain uops (to get your code through the front-end as efficiently as possible), total code-size in bytes, and total unfused-domain uops (back-end space in the scheduler / RS), in roughly that order of priority. Also, there are uop-cache considerations for Sandybridge-family: 64-bit immediates, or a 32-bit immediate plus a disp8/disp32, can need to borrow extra space from an adjacent entry in the uop cache line. (See Agner Fog's microarch pdf [To see links please register here], the Sandybridge chapter. That still applies to later uarches like Skylake.)

Also minimizing pressure on certain back-end execution ports that surrounding code uses heavily is good. Ice Lake has 2 store-data and store-address ports so can run both stores in parallel, but before that all x86 CPUs are limited to 1 store per clock (having only a single store-data port to write data into the store buffer). Commit to L1d cache is also limited to 1 store per clock from the store buffer. Out-of-order exec does smooth that out so 2 stores back to back isn't a big problem, but 2x 4-byte immediate stores takes a lot of instruction size.

x86 unfortunately doesn't have a `mov r/m32, sign_extended_imm8`, only imm32 ([To see links please register here]). x86-64 *does* have `mov r/m64, sign_extended_imm32`, though, which is what you should use:

mov qword [rsp], 0 ; 8 bytes, 1 fused-domain uop on modern Intel and AMD CPUs

vs. 7 bytes + 8 bytes and 2 uops for `mov dword [rsp],0` / `mov dword [rsp+4], 0`. xor-zeroing EAX / storing RAX would be smaller (code size), but cost 2 uops instead of 1.

> assuming we can spare a register, a bad idea to corrupt one though

Hardly; you often have a use for a zeroed register, and [xor-zeroing is literally as cheap as a NOP on Sandybridge-family][1]. (And also cheap on AMD.) If you can do this store somewhere that makes it useful to have a zeroed register, this is very cheap:

xor eax, eax
mov [rsp], rax ; better if you have a use for RAX later

Or for non-zero 64-bit values where you want `mov r64, imm64`, it's typical that you have a spare register you could use as a scratch destination. If you would have to spill a register or save/restore an extra reg around the whole function, then it's probably better to just do 2 separate dword-immediate stores if you can't do a single sign-extended-imm32.

---


For non-zero constants, if the whole qword constant can be represented as a sign-extended 32-bit immediate, use `mov qword [rsp], imm32`. (Or `push imm32` and optimize away an earlier `sub rsp, 8`.)
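
For example, a minimal sketch with made-up values (low dword `-2`, high dword `-1`, so the combined qword `0xFFFFFFFFFFFFFFFE` fits a sign-extended imm32):

mov qword [rsp], -2 ; 8 bytes, 1 fused-domain uop: low dword = 0xFFFFFFFE, high dword = 0xFFFFFFFF

or, if the store can take over the job of an earlier `sub rsp, 8`:

push -2 ; 2 bytes (push imm8): reserves the 8 bytes and stores the same qword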


If you know your qword memory location is 8-byte aligned, then it's worth combining even for an arbitrary 8-byte constant that doesn't fit in a 32-bit immediate:

mov rax, 0x123456789abcdef0 ; 10 bytes, 1 uop
mov [rsp], rax ; 4 bytes, 1 micro-fused uop, for port 4 + port 2,3, or 7

It's only somewhat better than doing 2 separate dword stores, and could be slower in the (probably?) rare case where it crosses a 64-byte cache-line boundary:

mov dword [rsp], 0x9abcdef0 ; 7 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7
mov dword [rsp+4], 0x12345678 ; 8 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7

Or if your constant happens to fit in a 32-bit value *zero-extended* to 64-bit, but not sign-extended, you can `mov eax, 0x87654321` (5 bytes, very efficient) / `mov [rsp], rax`.
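
A minimal sketch of that zero-extended case, using the example constant from the sentence above:

mov eax, 0x87654321 ; 5 bytes; writing EAX implicitly zero-extends into RAX
mov [rsp], rax ; 4 bytes, 1 micro-fused uop: stores the qword 0x0000000087654321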

----

**If you want to do a qword reload later, definitely do a single qword store so store-forwarding can work efficiently**.
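
A sketch of the reload case; it's the later qword load that benefits:

mov qword [rsp], 0 ; single qword store
; ... other code ...
mov rax, [rsp] ; qword reload: forwards cheaply from the one matching store

If the qword had instead been written as two separate dword stores, a later qword reload would overlap two store-buffer entries and take the slow store-forwarding path.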

----

> Write-combining in the hardware

That's not the major factor. More important is OoO exec and the store buffer decoupling store execution from the surrounding code.

If you *were* actually hoping to get more than 1 store (of any width) per clock executed, you're definitely out of luck on uarches before Ice Lake. On any uarch (even non-x86), hardware store-coalescing happens after stores execute.

You're also out of luck if you're hoping it will coalesce and take fewer entries in the store buffer so it has more time / room to absorb two cache-miss stores. We don't have any real evidence of any x86 doing that to save store-buffer-drain bandwidth, or to free up store-buffer entries sooner. See [To see links please register here] for my current understanding of the (lack of) store-buffer coalescing. There's some evidence that Intel at least can commit cache-miss stores to LFBs to free up space in the store buffer, but only within the limits of program order, and no evidence of committing multiple stores per clock.


[1]: [To see links please register here]


#3
Yes, you can combine your two 32-bit writes into a single 64-bit write, like so:

~~~
mov QWORD PTR [rsp], 0
~~~

The immediate value is a 32-bit sign-extended immediate, so it's not this simple if your second write is non-zero<sup>1</sup>, or if the MSB of the first write is 1. In that case, you can load a 64-bit constant using `movabs` and write that. E.g., to write 1 and 2,

~~~
movabs rax, 0x200000001
mov QWORD PTR [rsp], rax
~~~

The constant `0x200000001` results in the right values being written into each 32-bit half.
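
As a sketch of the general recipe with hypothetical values (first write 7, second write 3): the first write lands in the low dword on little-endian x86, so the combined constant is `(second << 32) | first`, i.e. `(3 << 32) | 7 = 0x300000007`:

~~~
movabs rax, 0x300000007
mov QWORD PTR [rsp], rax
~~~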

This trick is definitely worth it for the zero case, and maybe worth it for the non-zero case; [Peter's answer][1] goes into much more detail on the tradeoffs in the latter case.

Compilers can also make this optimization (they call it "store combining" or something like that), meaning you can play with this on godbolt ([To see links please register here]).

---

<sup>1</sup> Except in the special case that the sign extension gives you exactly what you want. I.e., the second value is exactly `0xFFFFFFFF` and the high bit of the first value is set.


[1]: [To see links please register here]



