**Yes, compile-time / asm-source-level write-coalescing is often a good idea, especially if both halves are zero, or the high half is zero (or `-1`), so you can do the whole qword with 1 instruction**; modern x86 CPUs have efficient unaligned stores, especially when the store doesn't cross a cache-line boundary.
You generally want to minimize total fused-domain uops (to get your code through the front-end as efficiently as possible), total code size in bytes, and total unfused-domain uops (back-end space in the scheduler / RS), roughly in that order of priority. There are also uop-cache considerations for Sandybridge-family: 64-bit immediates, or a 32-bit immediate plus a disp8/disp32, can need to borrow extra space from an adjacent entry in the uop-cache line. (See the Sandybridge chapter of Agner Fog's microarch PDF; that still applies to later uarches like Skylake.)
Also, minimizing pressure on back-end execution ports that surrounding code uses heavily is good. Ice Lake has 2 store-data and store-address ports so it can run both stores in parallel, but before that all x86 CPUs were limited to 1 store per clock (having only a single store-data port to write data into the store buffer). Commit from the store buffer to L1d cache is also limited to 1 store per clock. Out-of-order exec does smooth that out, so 2 stores back to back isn't a big problem, but 2x 4-byte immediate stores takes a lot of instruction bytes.
x86 unfortunately doesn't have a `mov r/m32, sign_extended_imm8`, only imm32. x86-64 *does* have `mov r/m64, sign_extended_imm32`, though, which is what you should use:
mov qword [rsp], 0 ; 8 bytes, 1 fused-domain uop on modern Intel and AMD CPUs
vs. 7 bytes + 8 bytes and 2 uops for `mov dword [rsp],0` / `mov dword [rsp+4], 0`. xor-zeroing EAX / storing RAX would be smaller (code size), but cost 2 uops instead of 1.
> assuming we can spare a register, a bad idea to corrupt one though
Hardly; you often have a use for a zeroed register, and xor-zeroing is literally as cheap as a NOP on Sandybridge-family. (And also cheap on AMD.) If you can do this store somewhere that makes it useful to have a zeroed register, this is very cheap:
xor eax, eax
mov [rsp], rax ; better if you have a use for RAX later
Or for non-zero 64-bit values where you want `mov r64, imm64`, it's typical that you have a spare register you could use as a scratch destination. If you would have to spill a register or save/restore an extra reg around the whole function, then it's probably better to just do 2 separate dword-immediate stores if you can't do a single sign-extended-imm32.
---
For non-zero constants, if the whole qword constant can be represented as a sign-extended 32-bit immediate, use `mov qword [rsp], imm32`. (Or `push imm32` and optimize away an earlier `sub rsp, 8`.)
If you know your qword memory location is 8-byte aligned, then it's worth combining even for an arbitrary 8-byte constant that doesn't fit in a 32-bit immediate:
mov rax, 0x123456789abcdef0 ; 10 bytes, 1 uop
mov [rsp], rax ; 4 bytes, 1 micro-fused uop, for port 4 + port 2,3, or 7
It's only somewhat better than doing 2 separate dword stores, and could be slower in the (probably?) rare case where it crosses a 64-byte cache-line boundary:
mov dword [rsp], 0x9abcdef0 ; 7 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7
mov dword [rsp+4], 0x12345678 ; 8 bytes, 1 micro-fused uop for port 4 + port 2,3, or 7
Or if your constant happens to fit in a 32-bit value *zero-extended* to 64-bit, but not sign-extended, you can `mov eax, 0x87654321` (5 bytes, very efficient) / `mov [rsp], rax`.
----
**If you want to do a qword reload later, definitely do a single qword store so store-forwarding can work efficiently**.
----
> Write-combining in the hardware
That's not the major factor. More important are OoO exec and the store buffer decoupling store execution from the surrounding code.
If you *were* actually hoping to get more than 1 store (of any width) per clock executed, you're definitely out of luck on uarches before Ice Lake. On any uarch (even non-x86), hardware store-coalescing happens after stores execute.
You're also out of luck if you're hoping stores will coalesce and take fewer entries in the store buffer, giving it more time / room to absorb two cache-miss stores. We don't have any real evidence of any x86 doing that to save store-buffer-drain bandwidth, or to free up store-buffer entries sooner; that's my current understanding of the (lack of) store-buffer coalescing. There's some evidence that Intel at least can commit cache-miss stores into LFBs to free up space in the store buffer, but only within the limits of program order, and no evidence of committing multiple stores per clock.