What is the meaning of the data32 data32 nopw %cs:0x0(%rax,%rax,1) instruction in disassembly of gcc's output?

#1
While running some tests with gcc's -O2 optimization, I observed the following instruction in the disassembled code for a function:

data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

What does this instruction do?

To be more specific, I was trying to understand how the compiler optimizes useless recursion like the code below with -O2:

int foo(void)
{
    return foo();
}

int main(void)
{
    return foo();
}

The above code causes a stack overflow when compiled without optimization, but works when compiled with -O2.
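For a rough sense of scale, here is a back-of-envelope sketch of why the unoptimized build runs out of stack; the numbers are assumptions for illustration, not measurements:

/* Sketch only: assumes a typical 8 MiB stack and roughly 16 bytes
 * consumed per unoptimized call (return address + saved frame
 * pointer).  Under those assumptions the recursion only reaches
 * about half a million levels before the stack overflows. */
#include <stdio.h>

int main(void)
{
    long stack_bytes    = 8L * 1024 * 1024;  /* assumed stack size        */
    long bytes_per_call = 16;                /* assumed per-call overhead */
    printf("approx. max depth: %ld calls\n", stack_bytes / bytes_per_call);
    return 0;
}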

I think with -O2 the compiler completely removed the stack pushing for the call to foo, but why is the `data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)` needed?

0000000000400480 <foo>:
foo():
  400480:   eb fe                   jmp    400480 <foo>
  400482:   66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400489:   1f 84 00 00 00 00 00

0000000000400490 <main>:
main():
  400490:   eb fe                   jmp    400490 <main>

#2
You are seeing an [operand forwarding][1] optimization of the CPU pipeline.

Although it is an empty loop, gcc tries to optimize this as well :-).

The CPU you are running on has a [superscalar][2] architecture. That means it has a pipeline, and different phases of execution of consecutive instructions happen in parallel. For example, if there is a

mov eax, ebx ;(#1)
mov ecx, edx ;(#2)

then the loading and decoding of instruction #2 can already happen while #1 is being executed.

Pipelining has major problems to solve in the case of branches, even unconditional ones.

For example, while the `jmp` is being decoded, the next instruction has already been prefetched into the pipeline. But the `jmp` changes the location of the next instruction. In such cases the pipeline needs to be emptied and refilled, and a lot of valuable CPU cycles are lost.

It looks like this empty loop will run faster if the pipeline is filled with a no-op in this case, even though the no-op will never be executed. It is actually an optimization for some uncommon feature of the x86 pipeline.

Earlier DEC Alphas could even segfault from such things, and empty loops had to contain a lot of no-ops. On x86 it would only be slower, because x86 CPUs must remain compatible with the Intel 8086.

[Here][3] you can read a lot about the handling of branch instructions in pipelines.


[1]: [To see links please register here]
[2]: [To see links please register here]
[3]: [To see links please register here]


#3
The function foo() is an infinite recursion without termination. Without optimization, gcc generates normal subroutine calls, which at the very least push the return address onto the stack. As the stack is limited, this will cause a stack overflow, which is _undefined behaviour_.

When optimizing, gcc detects that foo() does not require a stack frame at all (there are no arguments or local variables). It also detects that foo() immediately returns to its caller (which would also be foo()). This is called tail-chaining (tail-call optimization): a function call right at the end of a function (i.e. immediately before the explicit or implicit return) is converted into a jump to that function, so there is no need for a stack frame.
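To make the tail-chaining concrete, here is a minimal sketch; the helper `bar` is hypothetical and only added for illustration:

#include <stdio.h>

/* Hypothetical helper; the dummy noinline body is only here so the
 * sketch compiles and runs on its own. */
__attribute__((noinline)) int bar(int x) { return x * 2; }

int wrapper(int x)
{
    /* The call is the last thing wrapper() does and its result is
     * returned unchanged, so gcc -O2 can compile this body down to a
     * single "jmp bar" instead of "call bar" followed by "ret".
     * That is the same tail-chaining that turns foo() into "jmp foo"
     * in the question's disassembly. */
    return bar(x + 1);
}

int main(void)
{
    printf("%d\n", wrapper(20));  /* prints 42 */
    return 0;
}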

The infinite recursion in foo() is still undefined behaviour, but this time nothing "bad" is observed.

Just remember: undefined behaviour includes fatal behaviour as well as the expected behaviour (the latter merely by chance). Code which behaves differently at different optimization levels should always be regarded as erroneous.
There is one exception: timing. Timing is not covered by the C language standard (nor by the standards of most other languages).

As others stated, the data32 ... is almost certainly padding to achieve 16-byte alignment, which might match the width of the internal instruction bus and/or the cache lines.


#4
To answer the question in the title, the instruction

data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

is a 14-byte NOP (no operation) instruction that is used to pad the gap between the `foo` function and the `main` function to maintain 16-byte alignment.
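Working through the addresses in the question's listing confirms the length; the sketch below simply redoes that arithmetic (the addresses are copied from the disassembly above):

#include <stdio.h>
#include <stdint.h>

/* Padding arithmetic for the listing in the question:
 * "jmp foo" ends at 0x400482, and main must start at the next 16-byte
 * boundary, 0x400490, so 14 bytes of padding are needed.  Those 14
 * bytes are: five 66 operand-size prefixes (four shown as "data32",
 * the fifth turns the nop into "nopw"), 2e (the %cs prefix), 0f 1f
 * (the multi-byte NOP opcode), 84 00 (ModRM + SIB for (%rax,%rax,1)),
 * and 00 00 00 00 (the 0x0 displacement). */
int main(void)
{
    uint64_t end_of_foo = 0x400482;  /* address right after "jmp foo" */
    uint64_t alignment  = 16;
    uint64_t padding    = (alignment - end_of_foo % alignment) % alignment;
    printf("padding needed: %llu bytes\n",
           (unsigned long long)padding);  /* prints 14 */
    return 0;
}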

The x86 architecture has a large number of different NOP instructions of different sizes that can be used to insert padding into an executable segment in such a way that it has no effect if the CPU ends up executing over it. The Intel optimization manual ([To see links please register here]) contains recommended NOP encodings for the various lengths that can be used as padding.

In this specific case it is completely irrelevant, as the NOP will never be executed (or even decoded, since it sits after an unconditional jump), so the compiler could have padded with any random garbage it wanted.


