Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

#1
I was looking for the fastest way to `popcount` large arrays of data. I encountered a *very weird* effect: changing the loop variable from `unsigned` to `uint64_t` made performance drop by 50% on my PC.
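For context, the straightforward baseline I am comparing against looks roughly like this (a sketch using the GCC/Clang builtin instead of the intrinsic; with `-march=native` on a popcnt-capable CPU it should compile to the same `popcnt` instruction):

    #include <cstdint>
    #include <cstddef>

    // Baseline: one __builtin_popcountll per 64-bit word.
    uint64_t popcount_buffer(const uint64_t* buf, size_t words) {
        uint64_t count = 0;
        for (size_t i = 0; i < words; ++i)
            count += __builtin_popcountll(buf[i]);
        return count;
    }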

The Benchmark
-------------

    #include <cstdint>
    #include <cstdlib>
    #include <iostream>
    #include <chrono>
    #include <x86intrin.h>

    int main(int argc, char* argv[]) {

        using namespace std;
        if (argc != 2) {
            cerr << "usage: array_size in MB" << endl;
            return -1;
        }

        uint64_t size = atol(argv[1]) << 20;
        uint64_t* buffer = new uint64_t[size/8];
        char* charbuffer = reinterpret_cast<char*>(buffer);
        for (unsigned i = 0; i < size; ++i)
            charbuffer[i] = rand() % 256;

        uint64_t count, duration;
        chrono::time_point<chrono::system_clock> startP, endP;
        {
            startP = chrono::system_clock::now();
            count = 0;
            for (unsigned k = 0; k < 10000; k++) {
                // Tight unrolled loop with unsigned
                for (unsigned i = 0; i < size/8; i += 4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i+1]);
                    count += _mm_popcnt_u64(buffer[i+2]);
                    count += _mm_popcnt_u64(buffer[i+3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
            cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
                 << (10000.0*size)/(duration) << " GB/s" << endl;
        }
        {
            startP = chrono::system_clock::now();
            count = 0;
            for (unsigned k = 0; k < 10000; k++) {
                // Tight unrolled loop with uint64_t
                for (uint64_t i = 0; i < size/8; i += 4) {
                    count += _mm_popcnt_u64(buffer[i]);
                    count += _mm_popcnt_u64(buffer[i+1]);
                    count += _mm_popcnt_u64(buffer[i+2]);
                    count += _mm_popcnt_u64(buffer[i+3]);
                }
            }
            endP = chrono::system_clock::now();
            duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
            cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
                 << (10000.0*size)/(duration) << " GB/s" << endl;
        }

        delete[] buffer;   // was free(charbuffer); new[] memory must be delete[]'d
    }


As you can see, we create a buffer of random data, with the size being `x` megabytes, where `x` is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 `popcount` intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the total time. In the upper case, the inner loop variable is `unsigned`; in the lower case, it is `uint64_t`. I thought that this should make no difference, but the opposite is the case.

The (absolutely crazy) results
------------------------------

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

    g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running `test 1` (so 1 MB of random data):

* unsigned 41959360000 0.401554 sec **26.113 GB/s**
* uint64_t 41959360000 0.759822 sec **13.8003 GB/s**

As you can see, the throughput of the `uint64_t` version is **only half** that of the `unsigned` version! The problem seems to be that different assembly gets generated, but why? First, I suspected a compiler bug, so I tried `clang++` (Ubuntu Clang version 3.4-1ubuntu3):

    clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result: `test 1`

* unsigned 41959360000 0.398293 sec **26.3267 GB/s**
* uint64_t 41959360000 0.680954 sec **15.3986 GB/s**

So it is almost the same result, and it is still strange. *But now it gets super strange.* I replace the buffer size that was read from input with the constant `1`, so I change:

    uint64_t size = atol(argv[1]) << 20;

to

    uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for `g++`:

* unsigned 41959360000 0.509156 sec **20.5944 GB/s**
* uint64_t 41959360000 0.508673 sec **20.6139 GB/s**

Now, both versions are equally fast. However, the `unsigned` version **got even slower**! It dropped from `26` to `20 GB/s`, so replacing a non-constant by a constant value led to a **deoptimization**. Seriously, I have no clue what is going on here! But now to `clang++` with the new version:

* unsigned 41959360000 0.677009 sec **15.4884 GB/s**
* uint64_t 41959360000 0.676909 sec **15.4906 GB/s**

*Wait, what?* Now both versions dropped to the **slow** 15 GB/s figure. So replacing a non-constant by a constant value led to slow code in **both** cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be Haswell-specific. And because two compilers produce strange results here, it does not seem to be a compiler bug either. We do not have an AMD CPU here, so we could only test on Intel.

More madness, please!
--------------------

Take the first example (the one with `atol(argv[1])`) and put a `static` before the variable, i.e.:

    static uint64_t size = atol(argv[1]) << 20;

Here are my results in g++:

* unsigned 41959360000 0.396728 sec **26.4306 GB/s**
* uint64_t 41959360000 0.509484 sec **20.5811 GB/s**

*Yay, yet another alternative!* We still have the fast 26 GB/s with `u32`, but we managed to get `u64` at least from the 13 GB/s to the 20 GB/s version! **On my colleague's PC, the `u64` version became even faster than the `u32` version, yielding the fastest result of all.** Sadly, this only works for `g++`; `clang++` does not seem to care about `static`.

My question
-----------

Can you explain these results? Especially:

* How can there be such a difference between `u32` and `u64`?
* How can replacing a non-constant by a constant buffer size trigger *less optimal code*?
* How can the insertion of the `static` keyword make the `u64` loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a **100% difference** in execution time, or that small factors like a constant buffer size could mix up the results again completely. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the best performance?
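For what it's worth, here is a minimal sketch of what I mean by the inline-assembly route, using GCC extended asm (`popcnt64_asm` is just an illustrative helper name; this pins only the instruction choice, not the surrounding scheduling):

    #include <cstdint>

    // Force the popcnt instruction itself via GCC extended inline asm.
    // The compiler still allocates registers and schedules the code
    // around it, so this does not freeze the whole loop body.
    static inline uint64_t popcnt64_asm(uint64_t x) {
        uint64_t result;
        __asm__("popcnt %1, %0" : "=r"(result) : "r"(x));
        return result;
    }

To really pin down the 26 GB/s loop, the whole inner loop would have to be written in assembly, not just the single instruction.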

The Disassembly
---------------

Here is the disassembly for the various results:

26 GB/s version from **g++ / u32 / non-const bufsize**:

    0x400af8:
        lea     0x1(%rdx),%eax
        popcnt  (%rbx,%rax,8),%r9
        lea     0x2(%rdx),%edi
        popcnt  (%rbx,%rcx,8),%rax
        lea     0x3(%rdx),%esi
        add     %r9,%rax
        popcnt  (%rbx,%rdi,8),%rcx
        add     $0x4,%edx
        add     %rcx,%rax
        popcnt  (%rbx,%rsi,8),%rcx
        add     %rcx,%rax
        mov     %edx,%ecx
        add     %rax,%r14
        cmp     %rbp,%rcx
        jb      0x400af8

13 GB/s version from **g++ / u64 / non-const bufsize**:

    0x400c00:
        popcnt  0x8(%rbx,%rdx,8),%rcx
        popcnt  (%rbx,%rdx,8),%rax
        add     %rcx,%rax
        popcnt  0x10(%rbx,%rdx,8),%rcx
        add     %rcx,%rax
        popcnt  0x18(%rbx,%rdx,8),%rcx
        add     $0x4,%rdx
        add     %rcx,%rax
        add     %rax,%r12
        cmp     %rbp,%rdx
        jb      0x400c00

15 GB/s version from **clang++ / u64 / non-const bufsize**:

    0x400e50:
        popcnt  (%r15,%rcx,8),%rdx
        add     %rbx,%rdx
        popcnt  0x8(%r15,%rcx,8),%rsi
        add     %rdx,%rsi
        popcnt  0x10(%r15,%rcx,8),%rdx
        add     %rsi,%rdx
        popcnt  0x18(%r15,%rcx,8),%rbx
        add     %rdx,%rbx
        add     $0x4,%rcx
        cmp     %rbp,%rcx
        jb      0x400e50

20 GB/s version from **g++ / u32&u64 / const bufsize**:

    0x400a68:
        popcnt  (%rbx,%rdx,1),%rax
        popcnt  0x8(%rbx,%rdx,1),%rcx
        add     %rax,%rcx
        popcnt  0x10(%rbx,%rdx,1),%rax
        add     %rax,%rcx
        popcnt  0x18(%rbx,%rdx,1),%rsi
        add     $0x20,%rdx
        add     %rsi,%rcx
        add     %rcx,%rbp
        cmp     $0x100000,%rdx
        jne     0x400a68

15 GB/s version from **clang++ / u32&u64 / const bufsize**:

    0x400dd0:
        popcnt  (%r14,%rcx,8),%rdx
        add     %rbx,%rdx
        popcnt  0x8(%r14,%rcx,8),%rsi
        add     %rdx,%rsi
        popcnt  0x10(%r14,%rcx,8),%rdx
        add     %rsi,%rdx
        popcnt  0x18(%r14,%rcx,8),%rbx
        add     %rdx,%rbx
        add     $0x4,%rcx
        cmp     $0x20000,%rcx
        jb      0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only one that uses `lea`. Some versions use `jb` to jump, others use `jne`. But apart from that, all versions seem comparable. I don't see where a 100% performance gap could originate, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and clean. Can anyone explain this?

Lessons learned
---------------

No matter what the answer to this question will be, I have learned that in really hot loops *every* detail can matter, *even details that do not seem to have any association with the hot code*. I had never thought about what type to use for a loop variable, but as you can see, such a minor change can make a *100%* difference! Even the storage class of the size variable can make a huge difference, as we saw with the insertion of the `static` keyword! In the future, I will always test various alternatives on various compilers when writing really tight, hot loops that are crucial for system performance.

It is also interesting that the performance difference is still so high even though I have already unrolled the loop four times. So even if you unroll, you can still be hit by major performance deviations. Quite interesting.


#2
Have you tried moving the reduction step outside the loop? Right now you have a data dependency that really isn't needed.

Try:

    uint64_t subset_counts[4] = {};
    for (unsigned k = 0; k < 10000; k++) {
        // Four independent accumulators break the loop-carried
        // dependency on a single count register.
        unsigned i = 0;
        while (i < size/8) {
            subset_counts[0] += _mm_popcnt_u64(buffer[i]);
            subset_counts[1] += _mm_popcnt_u64(buffer[i+1]);
            subset_counts[2] += _mm_popcnt_u64(buffer[i+2]);
            subset_counts[3] += _mm_popcnt_u64(buffer[i+3]);
            i += 4;
        }
    }
    count = subset_counts[0] + subset_counts[1] + subset_counts[2] + subset_counts[3];

You also have some weird aliasing going on that I'm not sure conforms to the strict aliasing rules.
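As for the aliasing: a `char*` is allowed to alias anything, so the byte-wise fill is arguably fine, but here is a sketch that sidesteps the `reinterpret_cast` entirely (assuming a C++11 `<random>` engine in place of `rand()`):

    #include <cstdint>
    #include <random>

    // Fill the buffer word-by-word instead of byte-by-byte through a
    // reinterpret_cast'ed char pointer; mt19937_64 already produces
    // full 64-bit random words.
    void fill_random(uint64_t* buffer, uint64_t words) {
        std::mt19937_64 rng(42);  // fixed seed for reproducible runs
        for (uint64_t i = 0; i < words; ++i)
            buffer[i] = rng();
    }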

#3
I coded up an equivalent C program to experiment, and I can confirm this strange behaviour. What's more, `gcc` believes the 64-bit integer (which should probably be a `size_t` anyway...) to be better, as using `uint_fast32_t` causes gcc to use a 64-bit uint.

I did a bit of mucking around with the assembly: I simply took the 32-bit version and replaced all 32-bit instructions/registers with their 64-bit versions in the inner popcount loop of the program. Observation: the code is **just as fast as the 32-bit version!**

This is obviously a hack, as the size of the variable isn't really 64-bit and other parts of the program still use the 32-bit version, but as long as the inner popcount loop dominates performance, this is a good start.

I then copied the inner-loop code from the 32-bit version of the program, hacked it up to be 64-bit, and fiddled with the registers to make it a replacement for the inner loop of the 64-bit version. **This code also runs as fast as the 32-bit version.**

My conclusion is that this is bad instruction scheduling by the compiler, not an actual speed/latency advantage of 32-bit instructions.

(Caveat: I hacked up assembly and could have broken something without noticing. I don't think so.)
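As an aside, the `uint_fast32_t` remark is easy to check; the "fast" typedefs are platform-defined (x86-64 Linux/glibc assumed below, where this prints 8):

    #include <cstdint>
    #include <iostream>

    int main() {
        // The "fast" typedefs are whatever the platform deems fastest;
        // on x86-64 Linux/glibc, uint_fast32_t is typically 8 bytes.
        std::cout << sizeof(std::uint_fast32_t) << '\n';
    }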

#4
This is not an answer, but the results are hard to read if I put them in a comment.

I get these results with a Mac Pro (6-core Westmere Xeon @ 3.33 GHz). I compiled it with `clang -O3 -msse4 -lstdc++ a.cpp -o a` (`-O2` gets the same result).

### clang with `uint64_t size=atol(argv[1])<<20;`

    unsigned    41950110000    0.811198 sec    12.9263 GB/s
    uint64_t    41950110000    0.622884 sec    16.8342 GB/s

### clang with `uint64_t size=1<<20;`

    unsigned    41950110000    0.623406 sec    16.8201 GB/s
    uint64_t    41950110000    0.623685 sec    16.8126 GB/s

I also tried the following:

1. Reversing the test order; the result is the same, so it rules out the cache factor.
2. Running the `for` statement in reverse: `for (uint64_t i=size/8; i>0; i-=4)`. This gives the same result and proves the compiler is smart enough not to divide `size` by 8 on every iteration (as expected).

Here is my wild guess:

The speed factor comes in three parts:

- Code cache: the `uint64_t` version has a larger code size, which by itself would make the 64-bit version slower, but it has no effect on my Xeon CPU.

- Instructions used. Note not only the loop count, but also that the buffer is accessed with a 32-bit index in one version and a 64-bit index in the other. Accessing a pointer with a 64-bit offset requires a dedicated 64-bit register and addressing, while an immediate can be used for a 32-bit offset. This may make the 32-bit version faster.

- Some instructions (prefetch, for instance) are emitted only in the 64-bit compile. This makes the 64-bit version faster.

These three factors together match the observed, seemingly conflicting results.


#5
I can't give an authoritative answer, but I can provide an overview of a likely cause. The instruction-timing reference I consulted shows pretty clearly that for the instructions in the body of your loop there is a 3:1 ratio between latency and throughput. It also shows the effects of multiple dispatch. Since there are (give or take) three integer units in modern x86 processors, it is generally possible to dispatch three instructions per cycle.

So between peak pipelined, multiple-dispatch performance and the failure of these mechanisms, we have a factor of six in performance. It's pretty well known that the complexity of the x86 instruction set makes it quite easy for quirky breakage to occur. That document has a great example:

> The Pentium 4 performance for 64-bit right shifts is really poor. 64-bit left shift as well as all 32-bit shifts have acceptable performance. It appears that the data path from the upper 32 bits to the lower 32 bit of the ALU is not well designed.

I personally ran into a strange case where a hot loop ran considerably slower on a specific core of a four-core chip (AMD if I recall). We actually got better performance on a map-reduce calculation by turning that core off.

My guess here is contention for integer units: the `popcnt`, loop-counter, and address calculations can all just barely run at full speed with the 32-bit-wide counter, but the 64-bit counter causes contention and pipeline stalls. Since there are only about 12 cycles total, potentially 4 cycles with multiple dispatch, per loop-body execution, a single stall could reasonably affect run time by a factor of 2.

The change induced by using a static variable, which I'm guessing just causes a minor reordering of instructions, is another clue that the 32-bit code is at some tipping point for contention.

I know this is not a rigorous analysis, but it _is_ a plausible explanation.


#6
Have you tried passing `-funroll-loops -fprefetch-loop-arrays` to GCC?

I get the following results with these additional optimizations:

    [1829] /tmp/so_25078285 $ cat /proc/cpuinfo |grep CPU|head -n1
    model name : Intel® Core™ i3-3225 CPU @ 3.30GHz
    [1829] /tmp/so_25078285 $ g++ --version|head -n1
    g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3

    [1829] /tmp/so_25078285 $ g++ -O3 -march=native -std=c++11 test.cpp -o test_o3
    [1829] /tmp/so_25078285 $ g++ -O3 -march=native -funroll-loops -fprefetch-loop-arrays -std=c++11 test.cpp -o test_o3_unroll_loops__and__prefetch_loop_arrays

    [1829] /tmp/so_25078285 $ ./test_o3 1
    unsigned        41959360000     0.595 sec       17.6231 GB/s
    uint64_t        41959360000     0.898626 sec    11.6687 GB/s

    [1829] /tmp/so_25078285 $ ./test_o3_unroll_loops__and__prefetch_loop_arrays 1
    unsigned        41959360000     0.618222 sec    16.9612 GB/s
    uint64_t        41959360000     0.407304 sec    25.7443 GB/s
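If you want the prefetching effect without the flag, `__builtin_prefetch` approximates what `-fprefetch-loop-arrays` does by hand; a sketch (the lookahead of 64 words, i.e. 512 bytes, is an untuned guess, and `words` is assumed to be a multiple of 4 as in the question's benchmark):

    #include <cstdint>
    #include <x86intrin.h>

    // Hand-rolled analogue of -fprefetch-loop-arrays: hint the cache
    // lines we will need soon while popcounting the current ones.
    // Prefetching past the end of the buffer is harmless; it is only
    // a hint and never faults.
    uint64_t popcount_prefetch(const uint64_t* buffer, uint64_t words) {
        uint64_t count = 0;
        for (uint64_t i = 0; i < words; i += 4) {
            __builtin_prefetch(buffer + i + 64, 0 /*read*/, 0 /*low locality*/);
            count += _mm_popcnt_u64(buffer[i]);
            count += _mm_popcnt_u64(buffer[i + 1]);
            count += _mm_popcnt_u64(buffer[i + 2]);
            count += _mm_popcnt_u64(buffer[i + 3]);
        }
        return count;
    }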