Note that only `rep movs` and `rep stos` are fast. `repe/repne cmps` and `scas` on current CPUs only loop 1 element at a time. (There are published perf numbers, like 2 cycles per RCX count for `repe cmpsb`.) They still have some microcode startup overhead, though.
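For the record, here's what one of those slow cases looks like in use: an equality check built on `repe cmpsb`, where every byte costs a full loop iteration. A minimal NASM-syntax sketch (hypothetical function name, SysV args, assuming length >= 1):

```
; bytes_equal(rdi = buf1, rsi = buf2, rdx = len) -> rax = 1 if equal, else 0.
; Assumes len >= 1: with RCX = 0, repe cmpsb runs zero iterations and
; would leave ZF stale from some earlier instruction.
bytes_equal:
    mov     rcx, rdx
    cld                     ; DF = 0: compare forward
    repe cmpsb              ; 1 byte per iteration (~2 cycles per count)
    sete    al              ; ZF set iff all bytes matched
    movzx   eax, al
    ret
```

A 16- or 32-byte SIMD compare loop beats this easily for anything beyond tiny sizes.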
---
The `rep movs` microcode has several strategies to choose from. *If* the src and dest don't overlap closely, the microcoded loop can transfer in 64-bit or wider chunks. (This is the so-called "fast strings" feature introduced with P6, and occasionally re-tuned for later CPUs that support wider loads/stores.) But if dest is only one byte from src, `rep movs` has to produce the exact same result you'd get from that many separate `movs` instructions.
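To see why that matters: with dest = src + 1, each byte store feeds the very next byte load, so `rep movsb` replicates the first byte across the buffer, something a blind wide-chunk copy would get wrong. A NASM-syntax sketch (hypothetical function, SysV args, assuming len >= 1):

```
; fill_forward(rdi = buf, rsi = len): replicate buf[0] over buf[1..len-1],
; relying on rep movsb's exact byte-at-a-time overlapping-copy semantics.
fill_forward:
    lea     rcx, [rsi - 1]  ; len-1 bytes to move (assumes len >= 1)
    mov     rsi, rdi        ; src  = buf
    lea     rdi, [rdi + 1]  ; dest = buf + 1: overlaps src by one byte
    cld                     ; DF = 0: copy forward
    rep movsb               ; each store is read back by the next load
    ret
```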
So the microcode has to check for overlap, and probably for alignment (of src and dest separately, or relative alignment). It probably also chooses something based on small/medium/large counter values.
According to [Andy Glew's comments][1] on an answer to another question, **conditional branches in microcode aren't subject to branch-prediction**. So there's a significant penalty in startup cycles if the default not-taken path isn't the one actually taken, even for a loop that uses the same `rep movs` with the same alignment and size.
He supervised the initial `rep` string implementation in P6, so he should know. :)
> REP MOVS uses a cache protocol feature that is not available to
> regular code. Basically like SSE streaming stores, but in a manner
> that is compatible with normal memory ordering rules, etc. // The
> "large overhead for choosing and setting up the right method" is
> mainly due to the lack of microcode branch prediction. I have long
> wished that I had implemented REP MOVS using a hardware state machine
> rather than microcode, which could have completely eliminated the
> overhead.
>
> By the way, I have long said that one of the things that hardware can do
> better/faster than software is complex multiway branches.
>
> Intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996,
> which I supervised. The P6 fast strings took REP MOVSB and larger, and
> implemented them with 64 bit microcode loads and stores and a no-RFO
> cache protocol. They did not violate memory ordering, unlike ERMSB in
> iVB.
>
> The big weakness of doing fast strings in microcode was (a) microcode
> branch mispredictions, and (b) the microcode fell out of tune with
> every generation, getting slower and slower until somebody got around
> to fixing it. Just like a library memcpy falls out of tune. I
> suppose that it is possible that one of the missed opportunities was
> to use 128-bit loads and stores when they became available, and so on
>
>
> In retrospect, I should have written a self-tuning infrastructure, to
> get reasonably good microcode on every generation. But that would not
> have helped use new, wider, loads and stores, when they became
> available. // The Linux kernel seems to have such an autotuning
> infrastructure, that is run on boot. // Overall, however, I advocate
> hardware state machines that can smoothly transition between modes,
> without incurring branch mispredictions. // It is debatable whether
> good microcode branch prediction would obviate this.
Based on this, my best guess at a specific answer is: the fast path through the microcode (when as many branches as possible actually take the default not-taken path) is the 15-cycle startup case, for intermediate lengths.
Since Intel doesn't publish the full details, black-box measurements of cycle counts for various sizes and alignments are the best we can do. **Fortunately, that's all we need to make good choices.** Intel's optimization manual, and other guides, have good info on how to use `rep movs`.
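A minimal sketch of such a black-box measurement in NASM syntax (hypothetical function; one-shot timing in reference cycles, so in practice you'd take the minimum over many runs, across many sizes and alignments):

```
; time_rep_movsb(rdi = dest, rsi = src, rdx = size) -> rax = elapsed TSC ticks.
; Rough: rdtsc counts reference cycles, not core clocks. Assumes DF = 0
; (guaranteed by the SysV ABI).
time_rep_movsb:
    mov     rcx, rdx        ; byte count for rep movsb
    lfence                  ; don't start timing until earlier work is done
    rdtsc                   ; start timestamp in edx:eax
    shl     rdx, 32
    or      rax, rdx
    mov     r8, rax         ; save start timestamp
    rep movsb               ; the copy being measured
    rdtscp                  ; waits for preceding instructions to execute
    shl     rdx, 32
    or      rax, rdx
    lfence                  ; keep later instructions from overlapping
    sub     rax, r8         ; elapsed reference cycles
    ret
```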
----
Fun fact: without ERMSB (new in IvB), `rep movsb` is optimized for small-ish copies. It takes longer to start up than `rep movsd` or `rep movsq` for large copies (more than a couple hundred bytes, I think), and even after that may not achieve the same throughput.
The optimal sequence for large aligned copies without ERMSB and without SSE/AVX (e.g. in kernel code) may be `rep movsq` and then clean-up with something like an unaligned `mov` that copies the last 8 bytes of the buffer, possibly overlapping with the last aligned chunk of what `rep movsq` did. (basically use [glibc's small-copy `memcpy` strategy][2]). But if the size might be smaller than 8 bytes, you need to branch unless it's safe to copy more bytes than needed. Or `rep movsb` is an option for cleanup if small code-size matters more than performance. (`rep` will copy 0 bytes if RCX = 0).
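A minimal NASM-syntax sketch of that strategy (hypothetical function name; assumes size >= 8, non-overlapping buffers, and DF = 0 as the SysV ABI guarantees):

```
; kmemcpy(rdi = dest, rsi = src, rdx = size): size >= 8, buffers don't overlap.
kmemcpy:
    mov     rax, [rsi + rdx - 8]    ; grab the last 8 bytes up front
    lea     r8,  [rdi + rdx - 8]    ; remember where they belong
    mov     rcx, rdx
    shr     rcx, 3                  ; qword count = size / 8
    rep movsq                       ; copies size & ~7 bytes, advancing rsi/rdi
    mov     [r8], rax               ; unaligned, possibly-overlapping tail store
    ret
```

The tail store may rewrite up to 7 bytes that `rep movsq` already copied; since the buffers don't overlap, rewriting the same data is harmless, and it avoids a branch on `size % 8`.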
A SIMD vector loop is often at least slightly faster than `rep movsb`, even on CPUs with Enhanced REP MOVSB/STOSB, especially if alignment isn't guaranteed. (See Intel's optimization manual, and the links [in the x86 tag wiki][3].)
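For comparison, a bare-bones AVX copy loop of the kind that can beat `rep movsb` (hypothetical function; assumes AVX support and that size is a non-zero multiple of 32):

```
; copy_avx(rdi = dest, rsi = src, rdx = size): size a non-zero multiple of 32.
copy_avx:
    xor     eax, eax
.loop:
    vmovdqu ymm0, [rsi + rax]       ; unaligned 32-byte load
    vmovdqu [rdi + rax], ymm0       ; unaligned 32-byte store
    add     rax, 32
    cmp     rax, rdx
    jb      .loop
    vzeroupper                      ; avoid AVX/SSE transition penalties
    ret
```

Unlike `rep movsb`, there's no microcode startup cost, and `vmovdqu` handles misaligned pointers gracefully.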
----
**Further details:** I think there's some discussion somewhere on SO about testing how `rep movsb` affects out-of-order exec of surrounding instructions, i.e. how soon uops from later instructions can get into the pipeline. I think we found some info in an Intel patent that shed some light on the mechanism.
Microcode can use a kind of predicated load and store uop that lets it issue a bunch of uops without initially knowing the value of RCX. If it turns out RCX was a small value, some of those uops choose not to do anything.
I've done some testing of `rep movsb` on Skylake. It seems consistent with that initial-burst mechanism: below a certain size threshold, something like 96 bytes IIRC, performance was nearly constant for any size (with small aligned buffers hot in L1d cache). I had `rep movs` in a loop with an independent `imul` dependency chain, testing that it can overlap execution.
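A rough reconstruction of that kind of test (hypothetical; the exact constants are from memory): time the loop with and without the copy, and see whether the `imul` chain's latency hides it.

```
section .bss
src_buf:    resb 64
dst_buf:    resb 64

section .text
; Time this loop (e.g. with perf) with and without the rep movsb:
; if out-of-order exec overlaps the copy with the independent imul
; dependency chain, the cycle count barely changes.
test_overlap:
    cld                             ; DF = 0 for forward copies
    mov     r11, 100000000          ; iteration count
.loop:
    mov     rcx, 64                 ; small copy, hot in L1d cache
    lea     rsi, [rel src_buf]
    lea     rdi, [rel dst_buf]
    rep movsb
    imul    r9, r9                  ; independent ~3-cycle latency chain
    imul    r9, r9
    imul    r9, r9
    imul    r9, r9
    dec     r11
    jnz     .loop
    ret
```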
But then there was a significant dropoff beyond that size, presumably when the microcode sequencer finds out that it needs to emit more copy uops. So I think when the `rep movsb` microcoded uop reaches the front of the IDQ, it gets the microcode sequencer to emit enough load + store uops for some fixed size, and a check to see if that was sufficient or if more are needed.
This is all from memory; I didn't re-test while updating this answer. If this doesn't match reality for anyone else, let me know and I'll check again.
[1]: [To see links please register here]
[2]: [To see links please register here]
[3]: [To see links please register here]