128-bit values - From XMM registers to General Purpose

#1
I have a couple of questions related to moving XMM values to general purpose registers. All the questions I found on SO focus on the opposite, namely transferring values in GP registers to XMM.

1. How can I move an XMM register value (128-bit) to two 64-bit general purpose registers?

movq RAX, XMM1 ; bits 0-63
mov? RCX, XMM1 ; bits 64-127

2. Similarly, how can I move an XMM register value (128-bit) to four 32-bit general purpose registers?

movd EAX, XMM1 ; bits 0-31
mov? ECX, XMM1 ; bits 32-63
mov? EDX, XMM1 ; bits 64-95
mov? ESI, XMM1 ; bits 96-127


#2
The following handles both get and set and works; it uses GCC inline asm, which takes AT&T syntax:

#include <cstdint>
#include <iostream>

int main() {
    uint64_t lo1(111111111111L);
    uint64_t hi1(222222222222L);
    uint64_t lo2, hi2;

    asm volatile (
        "movq %3, %%xmm0 ; "        // set high 64 bits
        "pslldq $8, %%xmm0 ; "      // shift left 64 bits
        "movsd %2, %%xmm0 ; "       // set low 64 bits
        // operate on 128 bit register
        "movq %%xmm0, %0 ; "        // get low 64 bits
        "movhlps %%xmm0, %%xmm0 ; " // move high to low
        "movq %%xmm0, %1 ; "        // get high 64 bits
        : "=x"(lo2), "=x"(hi2)
        : "x"(lo1), "x"(hi1)
        : "%xmm0"
    );

    std::cout << "lo1: [" << lo1 << "]" << std::endl;
    std::cout << "hi1: [" << hi1 << "]" << std::endl;
    std::cout << "lo2: [" << lo2 << "]" << std::endl;
    std::cout << "hi2: [" << hi2 << "]" << std::endl;

    return 0;
}

#3
You cannot move the upper bits of an XMM register into a general purpose register directly.
You'll have to follow a two-step process, which may or may not involve a roundtrip to memory or the destruction of a register.

**in registers (SSE2)**

movq rax,xmm0 ;lower 64 bits
movhlps xmm0,xmm0 ;move high 64 bits to low 64 bits.
movq rbx,xmm0 ;high 64 bits.

`punpckhqdq xmm0,xmm0` is the SSE2 integer equivalent of `movhlps xmm0,xmm0`. Some CPUs may avoid a cycle or two of bypass latency if xmm0 was last written by an integer instruction, not FP.

**via memory (SSE2)**

movdqu [mem],xmm0
mov rax,[mem]
mov rbx,[mem+8]

**slow, but does not destroy xmm register (SSE4.1)**

movq rax,xmm0
pextrq rbx,xmm0,1 ;3 cycle latency on Ryzen! (and 2 uops)

A hybrid strategy is possible, e.g. store to memory, `movd/q e/rax,xmm0` so it's ready quickly, then reload the higher elements. (Store-forwarding latency is not much worse than ALU, though.) That gives you a balance of uops for different back-end execution units. Store/reload is especially good when you want lots of small elements. (`mov` / `movzx` loads into 32-bit registers are cheap and have 2/clock throughput.)

-----

For 32 bits, the code is similar:

**in registers**

movd eax,xmm0
psrldq xmm0,4 ;shift 4 bytes to the right
movd ebx,xmm0
psrldq xmm0,4 ; pshufd could copy-and-shuffle the original reg
movd ecx,xmm0 ; not destroying the XMM and maybe creating some ILP
psrldq xmm0,4
movd edx,xmm0

**via memory**

movdqu [mem],xmm0
mov eax,[mem]
mov ebx,[mem+4]
mov ecx,[mem+8]
mov edx,[mem+12]


**Not destroying xmm register (SSE4.1)** (slow like the `psrldq` / `pshufd` version)

movd eax,xmm0
pextrd ebx,xmm0,1 ;3 cycle latency on Skylake!
pextrd ecx,xmm0,2 ;also 2 uops: like a shuffle(port5) + movd(port0)
pextrd edx,xmm0,3

-----

The 64-bit shift variant can run in 2 cycles; the `pextrq` version takes at least 4. For 32-bit, the numbers are 4 and 10, respectively.




#4
On Intel SnB-family (including Skylake), shuffle+`movq` or `movd` has the same performance as a `pextrq`/`d`. It decodes to a shuffle uop and a `movd` uop, so this is not surprising.

On AMD Ryzen, `pextrq` apparently has 1 cycle lower latency than shuffle + `movq`: `pextrd/q` is 3c latency, and so is `movd/q`, according to Agner Fog's instruction tables. This is a neat trick (if it's accurate), since `pextrd/q` decodes to 2 uops (vs. 1 for `movq`).

Since shuffles have non-zero latency, shuffle+`movq` is always strictly worse than `pextrq` on Ryzen (except for possible front-end decode / uop-cache effects).

The major downside to a pure ALU strategy for extracting all elements is throughput: it takes a lot of ALU uops, and most CPUs only have one execution unit / port that can move data from XMM to integer. Store/reload has higher latency for the first element, but better throughput (because modern CPUs can do 2 loads per cycle). If the surrounding code is bottlenecked by ALU throughput, a store/reload strategy could be good. Maybe do the low element with a `movd` or `movq` so out-of-order execution can get started on whatever uses it while the rest of the vector data is going through store forwarding.

---

Another option worth considering (besides what Johan mentioned) for extracting 32-bit elements to integer registers is to do some of the "shuffling" with integer shifts:

movq rax,xmm0
# use eax now, before destroying it
shr rax,32

pextrq rcx,xmm0,1
# use ecx now, before destroying it
shr rcx, 32

`shr` can run on p0 or p6 in Intel Haswell/Skylake. p6 has no vector ALUs, so this sequence is quite good if you want low latency but also low pressure on vector ALUs.

----

Or if you want to keep them around:

movq rax,xmm0
rorx rbx, rax, 32 # BMI2
# shld rbx, rax, 32 # alternative that has a false dep on rbx
# eax=xmm0[0], ebx=xmm0[1]

pextrq rdx,xmm0,1
mov ecx, edx # the "normal" way, if you don't want rorx or shld
shr rdx, 32
# ecx=xmm0[2], edx=xmm0[3]



