What is the meaning of the data32 data32 nopw %cs:0x0(%rax,%rax,1) instruction in disassembly of gcc's output?

#1
While running some tests with gcc's -O2 optimization, I observed the following instruction in the disassembled code for a function:

data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

What does this instruction do?

To be more specific, I was trying to understand how the compiler optimizes useless recursion like the code below with -O2:

int foo(void)
{
    return foo();
}

int main(void)
{
    return foo();
}

The above code causes a stack overflow when compiled without optimization, but works when compiled with -O2.
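For a rough sense of scale, here is a back-of-envelope sketch of why the unoptimized build runs out of stack; the numbers are assumptions for illustration, not measurements:

/* Sketch only: assumes a typical 8 MiB stack and roughly 16 bytes
 * consumed per unoptimized call (return address + saved frame
 * pointer).  Under those assumptions the recursion only reaches
 * about half a million levels before the stack overflows. */
#include <stdio.h>

int main(void)
{
    long stack_bytes    = 8L * 1024 * 1024;  /* assumed stack size        */
    long bytes_per_call = 16;                /* assumed per-call overhead */
    printf("approx. max depth: %ld calls\n", stack_bytes / bytes_per_call);
    return 0;
}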

I think with -O2 the compiler completely removed the stack pushing for the call to foo, but why is the `data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)` needed?

0000000000400480 <foo>:
foo():
  400480:   eb fe                   jmp    400480 <foo>
  400482:   66 66 66 66 66 2e 0f    data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
  400489:   1f 84 00 00 00 00 00

0000000000400490 <main>:
main():
  400490:   eb fe                   jmp    400490 <main>

#2
You are seeing an [operand forwarding][1] optimization of the CPU pipeline.

Although it is an empty loop, gcc tries to optimize this as well :-).

The CPU you are running on has a [superscalar][2] architecture. That means it has a pipeline, and different phases of execution of consecutive instructions happen in parallel. For example, if there is a

mov eax, ebx ;(#1)
mov ecx, edx ;(#2)

then the loading and decoding of instruction #2 can already happen while #1 is being executed.

Pipelining has major problems to solve in the case of branches, even unconditional ones.

For example, while the `jmp` is being decoded, the next instruction has already been prefetched into the pipeline. But the `jmp` changes the location of the next instruction. In such cases the pipeline needs to be emptied and refilled, and a lot of valuable CPU cycles are lost.

It looks like this empty loop will run faster if the pipeline is filled with a no-op in this case, even though the no-op will never be executed. It is actually an optimization for some uncommon feature of the x86 pipeline.

Earlier DEC Alphas could even segfault from such things, and empty loops had to contain a lot of no-ops. On x86 it would only be slower, because x86 CPUs must remain compatible with the Intel 8086.

[Here][3] you can read a lot about the handling of branch instructions in pipelines.


[1]: [To see links please register here]
[2]: [To see links please register here]
[3]: [To see links please register here]


#3
The function foo() is an infinite recursion without termination. Without optimization, gcc generates normal subroutine calls, which at the very least push the return address onto the stack. As the stack is limited, this will cause a stack overflow, which is _undefined behaviour_.

When optimizing, gcc detects that foo() does not require a stack frame at all (there are no arguments or local variables). It also detects that foo() immediately returns to its caller (which would also be foo()). This is called tail-chaining (tail-call optimization): a function call right at the end of a function (i.e. immediately before the explicit or implicit return) is converted into a jump to that function, so there is no need for a stack frame.
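To make the tail-chaining concrete, here is a minimal sketch; the helper `bar` is hypothetical and only added for illustration:

#include <stdio.h>

/* Hypothetical helper; the dummy noinline body is only here so the
 * sketch compiles and runs on its own. */
__attribute__((noinline)) int bar(int x) { return x * 2; }

int wrapper(int x)
{
    /* The call is the last thing wrapper() does and its result is
     * returned unchanged, so gcc -O2 can compile this body down to a
     * single "jmp bar" instead of "call bar" followed by "ret".
     * That is the same tail-chaining that turns foo() into "jmp foo"
     * in the question's disassembly. */
    return bar(x + 1);
}

int main(void)
{
    printf("%d\n", wrapper(20));  /* prints 42 */
    return 0;
}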

The infinite recursion in foo() is still undefined behaviour, but this time nothing "bad" is observed.

Just remember: undefined behaviour includes fatal behaviour as well as the expected behaviour (the latter merely by chance). Code which behaves differently at different optimization levels should always be regarded as erroneous.
There is one exception: timing. Timing is not covered by the C language standard (nor by the standards of most other languages).

As others stated, the data32 ... is almost certainly padding to achieve 16-byte alignment, which might match the width of the internal instruction bus and/or the cache lines.


#4
To answer the question in the title, the instruction

data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)

is a 14-byte NOP (no operation) instruction that is used to pad the gap between the `foo` function and the `main` function to maintain 16-byte alignment.
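Working through the addresses in the question's listing confirms the length; the sketch below simply redoes that arithmetic (the addresses are copied from the disassembly above):

#include <stdio.h>
#include <stdint.h>

/* Padding arithmetic for the listing in the question:
 * "jmp foo" ends at 0x400482, and main must start at the next 16-byte
 * boundary, 0x400490, so 14 bytes of padding are needed.  Those 14
 * bytes are: five 66 operand-size prefixes (four shown as "data32",
 * the fifth turns the nop into "nopw"), 2e (the %cs prefix), 0f 1f
 * (the multi-byte NOP opcode), 84 00 (ModRM + SIB for (%rax,%rax,1)),
 * and 00 00 00 00 (the 0x0 displacement). */
int main(void)
{
    uint64_t end_of_foo = 0x400482;  /* address right after "jmp foo" */
    uint64_t alignment  = 16;
    uint64_t padding    = (alignment - end_of_foo % alignment) % alignment;
    printf("padding needed: %llu bytes\n",
           (unsigned long long)padding);  /* prints 14 */
    return 0;
}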

The x86 architecture has a large number of different NOP instructions of different sizes that can be used to insert padding into an executable segment in such a way that it has no effect if the CPU ends up executing over it. The Intel optimization manual ([To see links please register here]) contains recommended NOP encodings for the various lengths that can be used as padding.

In this specific case it is completely irrelevant, as the NOP will never be executed (or even decoded, since it sits after an unconditional jump), so the compiler could have padded with any random garbage it wanted.


