How much does function alignment actually matter on modern processors?

#1
When I compile C code with a recent compiler on an amd64 or x86 system, functions are aligned to a multiple of 16 bytes. How much does this alignment actually matter on modern processors? Is there a huge performance penalty associated with calling an unaligned function?

## Benchmark

I ran the following microbenchmark (`call.S`):

    // benchmarking performance penalty of function alignment.
    #include <sys/syscall.h>

    #ifndef SKIP
    # error "SKIP undefined"
    #endif

    #define COUNT 1073741824

        .globl _start
        .type _start,@function
    _start: mov $COUNT,%rcx
    0:      call test
            dec %rcx
            jnz 0b
            mov $SYS_exit,%rax
            xor %edi,%edi
            syscall
        .size _start,.-_start

        .align 16
        .space SKIP
    test:   nop
            rep
            ret
        .size test,.-test

with the following shell script:

    #!/bin/sh

    for i in $(seq 0 15) ; do
        echo SKIP=$i
        cc -c -DSKIP=$i call.S
        ld -o call call.o
        time -p ./call
    done

On a CPU that identifies itself as *Intel® Core™ i7-2760QM CPU @ 2.40GHz* according to `/proc/cpuinfo`, the offset made no difference: the benchmark took a constant 1.9 seconds to run.

On the other hand, on another system with a CPU that reports itself as an *Intel® Core™ i7 CPU L 640 @ 2.13GHz*, the benchmark takes 6.3 seconds, except with an offset of 14 or 15, where it takes 7.2 seconds. I think that's because the function starts to span multiple cache lines.
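
The 14/15 threshold is consistent with simple arithmetic on the benchmark's layout. `test` assembles to 3 bytes (`nop` is 1 byte, `rep ret` is 2), and `.align 16` followed by `.space SKIP` puts its first byte at offset `SKIP` within a 16-byte block. A minimal sketch in C, where the helper name `crosses_boundary` is hypothetical, checks which offsets make those 3 bytes straddle a 16-byte boundary:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the byte range [start, start + size)
 * spans more than one aligned block of `block` bytes. */
static bool crosses_boundary(size_t start, size_t size, size_t block)
{
    assert(size > 0 && block > 0);
    /* Compare the block index of the first and the last byte. */
    return start / block != (start + size - 1) / block;
}
```

For `size = 3` and `block = 16`, this returns true only for offsets 14 and 15, matching the slow cases. Whether that 16-byte boundary is also a cache line boundary depends on where the linker places `test`, so the sketch demonstrates a 16-byte boundary crossing, not necessarily a cache line split.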

#2
**TL;DR**: Cache alignment matters. You don't want to fetch bytes that you won't execute.

At a minimum, you want to avoid fetching instructions that precede the first one you will execute. In a microbenchmark like this one you will most likely see no difference, but consider a full program: if a function's first byte is not aligned to a cache line, the line holding its start also carries bytes before the function that are cached but never used, and the last N bytes of the function may spill into an additional cache line (where N <= the number of unused bytes cached before the function). Across many functions, those extra cache misses add up.
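
To make that arithmetic concrete, here is a minimal sketch in C. The helper name `lines_touched` is hypothetical, and the 64-byte line size used in the note after the code is an assumption (a common size on x86, not something stated in this thread):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of cache lines of `line` bytes occupied by a
 * function of `size` bytes whose first byte sits `offset` bytes into a line. */
static size_t lines_touched(size_t offset, size_t size, size_t line)
{
    assert(size > 0 && line > 0 && offset < line);
    /* Round (offset + size) up to a whole number of lines. */
    return (offset + size + line - 1) / line;
}
```

With `line = 64`, a 40-byte function occupies one line at offset 0 but two lines at offset 48: the 48 leading bytes are fetched without being executed, and the last 24 bytes of the function spill into a second line.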

[Intel's optimization manual][1] says this:

> **3.4.1.5 Code Alignment**
>
> Careful arrangement of code can enhance cache and memory locality. Likely sequences of basic blocks should be laid out contiguously in memory. This may involve removing unlikely code, such as code to handle error conditions, from the sequence. See Section 3.7, “Prefetching,” on optimizing the instruction prefetcher.
>
> **Assembly/Compiler Coding Rule 12. (M impact, H generality)** All branch targets should be 16-byte aligned.
>
> **Assembly/Compiler Coding Rule 13. (M impact, H generality)** If the body of a conditional is not likely to be executed, it should be placed in another part of the program. If it is highly unlikely to be executed and code locality is an issue, it should be placed on a different code page.

This also helps explain why you don't notice any difference in your benchmark: all of its code gets cached once and never leaves the cache (modulo context switches, of course).
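
As a rough illustration of Rules 12 and 13 quoted above, here is a minimal sketch in C using GCC/Clang function attributes. The function names `checked_double` and `report_negative` are hypothetical, and the attributes are requests to the compiler, so the actual placement depends on the compiler, target, and optimization level:

```c
#include <stdio.h>

/* Rarely taken error path. `cold` tells GCC/Clang the function is unlikely
 * to run; on many targets it is then grouped with other cold functions,
 * away from hot code. `noinline` keeps its body out of the caller. */
__attribute__((cold, noinline))
static void report_negative(int value)
{
    fprintf(stderr, "negative input: %d\n", value);
}

/* Hot function: request at least 16-byte alignment for its first instruction. */
__attribute__((aligned(16)))
int checked_double(int value)
{
    /* __builtin_expect marks the error branch as unlikely. */
    if (__builtin_expect(value < 0, 0)) {
        report_negative(value);
        return 0;
    }
    return value * 2;
}
```

Compilers typically align functions on their own when optimizing (GCC exposes this as `-falign-functions`), so explicit attributes like these are mostly of interest for selected hot or cold paths.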

[1]: