Bubble sort slower with -O3 than -O2 with GCC

#1
I made a [bubble sort][1] implementation in C, and was testing its performance when I noticed that the `-O3` flag made it run even slower than no flags at all! Meanwhile `-O2` was making it run a lot faster as expected.
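
The exact command lines aren't shown below, but each binary was built from the same source with only the optimisation flag changed, presumably along these lines (file/binary names assumed):

```none
gcc sort.c -o sort        # no optimisation flags
gcc -O2 sort.c -o sort
gcc -O3 sort.c -o sort
```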

Without optimisations:

```none
time ./sort 30000

./sort 30000 1.82s user 0.00s system 99% cpu 1.816 total
```

`-O2`:

```none
time ./sort 30000

./sort 30000 1.00s user 0.00s system 99% cpu 1.005 total
```

`-O3`:

```none
time ./sort 30000

./sort 30000 2.01s user 0.00s system 99% cpu 2.007 total
```

The code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

int n;

void bubblesort(int *buf)
{
    bool changed = true;
    /* will always move at least one element to its rightful place at the end,
       so can shorten the search by 1 each iteration */
    for (int i = n; changed == true; i--) {
        changed = false;

        for (int x = 0; x < i-1; x++) {
            if (buf[x] > buf[x+1]) {
                /* swap */
                int tmp = buf[x+1];
                buf[x+1] = buf[x];
                buf[x] = tmp;

                changed = true;
            }
        }
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <arraysize>\n", argv[0]);
        return EXIT_FAILURE;
    }

    n = atoi(argv[1]);
    if (n < 1) {
        fprintf(stderr, "Invalid array size.\n");
        return EXIT_FAILURE;
    }

    int *buf = malloc(sizeof(int) * n);

    /* init buffer with random values */
    srand(time(NULL));
    for (int i = 0; i < n; i++)
        buf[i] = rand() % n + 1;

    bubblesort(buf);

    return EXIT_SUCCESS;
}
```

The assembly language generated for `-O2` (from [godbolt.org][2]):

```none
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rax, [rdi+rax*4]
.L4:
mov esi, DWORD PTR [rax]
mov ecx, DWORD PTR [rax+4]
add edx, 1
cmp esi, ecx
jle .L2
mov DWORD PTR [rax+4], esi
mov r10d, 1
add rax, 4
mov DWORD PTR [rax-4], ecx
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
```

And the same for `-O3`:

```none
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rcx, [rdi+rax*4]
.L4:
movq xmm0, QWORD PTR [rcx]
add edx, 1
pshufd xmm2, xmm0, 0xe5
movd esi, xmm0
movd eax, xmm2
pshufd xmm1, xmm0, 225
cmp esi, eax
jle .L2
movq QWORD PTR [rcx], xmm1
mov r10d, 1
add rcx, 4
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
```

It seems to me that the only significant difference is the apparent attempt to use [SIMD][3], which *seems* like it should be a large improvement, but I can't tell what on earth it's attempting with those `pshufd` instructions... is this just a failed attempt at SIMD? Or are the couple of extra instructions just enough to edge out my instruction cache?

Timings were done on an AMD [Ryzen][4] 5 3600.


#2
This is a regression in GCC 11/12.
GCC 10 and earlier did separate dword loads, even if they merged them into a qword store.

It looks like GCC's naïveté about [store-forwarding][1] stalls is hurting its auto-vectorization strategy here. See *[Store forwarding by example][2]* for some practical benchmarks on Intel with hardware performance counters, *[What are the costs of failed store-to-load forwarding on x86?](https://stackoverflow.com/questions/46135369/what-are-the-costs-of-failed-store-to-load-forwarding-on-x86)*, and [Agner Fog's x86 optimization guides][3].

(`gcc -O3` enables `-ftree-vectorize` and a few other options not included by `-O2`, e.g. `if`-conversion to branchless `cmov`, which is [another way `-O3` can hurt][4] with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at `-O2`, although some of its optimizations are still only on at `-O3`.)
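
(If you want to see exactly which optimizer options differ between the two levels for your GCC version, one way is to diff the optimizer reports; the exact set varies by version:)

```none
diff <(gcc -Q --help=optimizers -O2) <(gcc -Q --help=optimizers -O3)
```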

It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means that if we swapped in the previous iteration, this load comes half from that store and half from fresh memory, so **we get a store-forwarding stall after every swap**. And bubble sort often has long chains of swapping on every iteration as an element bubbles a long way, so this is really bad.
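
In C terms, the `-O3` inner loop is doing roughly the following (an illustrative sketch, not GCC's actual code; the `memcpy` calls stand in for the `movq` load and store):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the qword-at-a-time strategy.  Each step loads buf[x] and
 * buf[x+1] as one 64-bit chunk; if the previous step swapped and stored its
 * pair, the low half of this load overlaps that just-stored qword while the
 * high half comes from cache -- the partial-overlap case that causes a
 * store-forwarding stall. */
static void pairwise_pass(int *buf, int len, int *changed)
{
    for (int x = 0; x < len - 1; x++) {
        uint64_t pair;
        memcpy(&pair, &buf[x], sizeof pair);           /* 64-bit load of the pair (movq)   */
        int lo = (int)(uint32_t)pair;                  /* buf[x]   (little-endian)         */
        int hi = (int)(uint32_t)(pair >> 32);          /* buf[x+1]                         */
        if (lo > hi) {
            uint64_t swapped = ((uint64_t)(uint32_t)lo << 32) | (uint32_t)hi;
            memcpy(&buf[x], &swapped, sizeof swapped); /* 64-bit store of the swapped pair */
            *changed = 1;
        }
    }
}
```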

([Bubble sort is bad in general][5], especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so it's fair enough to want to try.)

Anyway, this is pretty clearly an anti-optimization you should **report on *[GCC Bugzilla][6]* with the "missed-optimization" keyword**. Scalar loads are cheap, and store-forwarding stalls are costly. (*[Can modern x86 implementations store-forward from more than one prior store?](https://stackoverflow.com/questions/46135766/can-modern-x86-implementations-store-forward-from-more-than-one-prior-store)*: no, nor can [microarchitectures][7] other than in-order [Atom][8] efficiently load when the load partially overlaps with one previous store and partially needs data that has to come from the L1d cache.)

Even better would be to keep `buf[x+1]` in a register and use it as `buf[x]` in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
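
A sketch of that idea in C (not from the original code; it does one load and one unconditional store per step, never re-loading the element it's carrying):

```c
#include <stdbool.h>

void bubblesort_reg(int *buf, int n)
{
    if (n < 2)
        return;

    bool changed = true;
    for (int i = n; changed && i > 1; i--) {
        changed = false;
        int cur = buf[0];                  /* stays in a register across the pass */
        for (int x = 0; x < i - 1; x++) {
            int next = buf[x + 1];         /* the only load in the loop body */
            if (cur > next) {
                buf[x] = next;             /* smaller element sinks; cur keeps bubbling */
                changed = true;
            } else {
                buf[x] = cur;              /* drop what we were carrying...             */
                cur = next;                /* ...and start carrying the new maximum     */
            }
        }
        buf[i - 1] = cur;                  /* largest element of this pass lands here */
    }
}
```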

If it weren't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. [SSE][9] 4.1 for a branchless `pminsd` / `pmaxsd` comparator might be interesting, but that would mean always storing, and the C source doesn't do that.
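
For reference, a branchless SSE4.1 compare-and-swap of one adjacent pair could look roughly like this with intrinsics (illustrative only; as noted, it stores unconditionally, unlike the C source above):

```c
#include <smmintrin.h>  /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

static inline void sort_pair_sse41(int *p)
{
    __m128i v  = _mm_loadl_epi64((const __m128i *)p);           /* [a, b, -, -] in 32-bit lanes */
    __m128i sw = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 2, 0, 1)); /* [b, a, -, -]                 */
    __m128i mn = _mm_min_epi32(v, sw);                          /* lane 0 = min(a, b)           */
    __m128i mx = _mm_max_epi32(v, sw);                          /* lane 0 = max(a, b)           */
    _mm_storel_epi64((__m128i *)p, _mm_unpacklo_epi32(mn, mx)); /* store [min, max]             */
}
```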

-----

**If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine** like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,

```lang-asm
## What GCC should have done,
## if it was going to use this 64-bit load strategy at all

movsx rax, edx # apparently it wasn't able to optimize away your half-width signed loop counter into pointer math
lea rcx, [rdi+rax*4] # Usually not worth an extra instruction just to avoid an indexed load and indexed store, but let's keep it for easy comparison.
.L4:
mov rax, [rcx] # into RAX instead of XMM0
add edx, 1
# pshufd xmm2, xmm0, 0xe5
# movd esi, xmm0
# movd eax, xmm2
# pshufd xmm1, xmm0, 225
mov rsi, rax
rol rax, 32 # swap halves, just like the pshufd
cmp esi, eax # or eax, esi? I didn't check which is which
jle .L2
movq QWORD PTR [rcx], rax # conditionally store the swapped qword
```

(Or with BMI2 available from `-march=native`, `rorx rsi, rax, 32` can copy-and-swap in one uop. Without BMI2, `mov` and swapping the original instead of the copy saves latency if running on a CPU without mov-elimination, such as [Ice Lake with updated microcode][10].)

So total latency from load to compare is just integer load + one ALU operation (rotate), vs. XMM load -> `movd`. And it's fewer ALU uops.
**This does *nothing* to help with the store-forwarding stall problem, though, which is still a showstopper.** This is just an integer SWAR implementation of the same strategy, replacing 2x pshufd and 2x `movd r32, xmm` with just `mov` + `rol`.

Actually, there's no reason to use 2x `pshufd` here. Even when using XMM registers, GCC could have done one shuffle that swapped the low two elements, setting up both the store and the `movd`. So even with XMM regs, this was sub-optimal. But clearly two different parts of GCC emitted those two `pshufd` instructions; one even printed the shuffle constant in hex while the other used decimal! I assume one is doing the swap and the other is just trying to extract `vec[1]`, the high element of the qword.

---

> slower than no flags at all

The default is `-O0`, a consistent-debugging mode that [spills all variables to memory after every C statement][11], so it's pretty horrible and creates big store-forwarding latency bottlenecks. (Somewhat as if every variable were `volatile`.) But it's *successful* store forwarding, not stalls, so "only" ~5 cycles, which is still much worse than 0 for registers. (A few modern microarchitectures, including [Zen 2][12], have some [special cases that are lower latency][13].) The extra store and load instructions that have to go through the pipeline don't help either.

It's generally not interesting to benchmark `-O0`. `-O1` or `-Og` should be your go-to baseline for having the compiler do the basic amount of optimization a normal person would expect, without anything fancy, but also without intentionally gimping the asm by skipping register allocation.

---

Semi-related: optimizing bubble sort for *size* instead of speed can involve memory-destination rotate (creating store-forwarding stalls for back-to-back swaps), or a memory-destination `xchg` (implicit `lock` prefix -> very slow). See [this *Code Golf* answer][14].

