"enter" vs "push ebp; mov ebp, esp; sub esp, imm" and "leave" vs "mov esp, ebp; pop ebp"

#1
What is the difference between the `enter` and

    push ebp
    mov ebp, esp
    sub esp, imm

instructions? Is there a performance difference? If so, which is faster and why do compilers always use the latter?

Similarly with the `leave` and

    mov esp, ebp
    pop ebp

instructions.

#2
Neither has a real speed advantage, though the longer sequence will probably run better because CPUs these days are more "optimized" for the shorter, simpler instructions that are more generic in use (plus it allows saturation of the execution ports if you're lucky).

The advantage of `LEAVE` (which is still used; just look at the Windows DLLs) is that it's smaller than manually tearing down a stack frame, which helps a lot when space is limited.

The Intel instruction manuals (volume 2A, to be precise) have more of the nitty-gritty details on these instructions, as do Dr. Agner Fog's optimization manuals.

#3
There is a performance difference, especially for `enter`. On modern processors it decodes to some 10 to 20 µops, while the three-instruction sequence is about 4 to 6, depending on the architecture. For details, consult Agner Fog's instruction tables.

Additionally, the `enter` instruction usually has quite a high latency, for example 8 clocks on a Core 2, compared to the 3-clock dependency chain of the three-instruction sequence.

Furthermore, the compiler may spread the three-instruction sequence out for scheduling purposes, depending on the surrounding code, to allow more instructions to execute in parallel.



#4
**`enter` is unusably slow on all CPUs**; nobody uses it except maybe for code-size optimization at the expense of speed. (And then only if a frame pointer is needed at all, or wanted to allow more compact addressing modes for stack space.)

**`leave` *is* fast enough to be worth using**, and GCC *does* use it (if ESP / RSP isn't already pointing at a saved EBP/RBP; otherwise it just uses `pop ebp`).
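To make the parenthetical above concrete, here is a minimal Python sketch of a toy stack model. The `Cpu` class and all function names are hypothetical, made up for illustration. It assumes the architectural definition of `leave` (copy EBP into ESP, then pop EBP) and suggests why a bare `pop ebp` is enough only when ESP already points at the saved EBP.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Toy 32-bit model: only ESP, EBP and a sparse stack memory."""
    esp: int = 0x1000
    ebp: int = 0x2000
    mem: dict = field(default_factory=dict)

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value

    def pop(self):
        value = self.mem.get(self.esp, 0)  # unwritten slots read as 0
        self.esp += 4
        return value


def prologue(cpu, locals_size):
    cpu.push(cpu.ebp)        # push ebp
    cpu.ebp = cpu.esp        # mov  ebp, esp
    cpu.esp -= locals_size   # sub  esp, imm


def leave(cpu):
    cpu.esp = cpu.ebp        # ESP <- EBP
    cpu.ebp = cpu.pop()      # EBP <- value popped from the stack


def pop_only(cpu):
    cpu.ebp = cpu.pop()      # pop ebp


def mov_pop(cpu):
    cpu.esp = cpu.ebp        # mov esp, ebp
    pop_only(cpu)            # pop ebp


def run(locals_size, epilogue):
    """Enter a frame with the given locals, tear it down, return (esp, ebp)."""
    cpu = Cpu()
    prologue(cpu, locals_size)
    epilogue(cpu)
    return cpu.esp, cpu.ebp


caller = (Cpu().esp, Cpu().ebp)
print(run(16, leave) == caller)     # leave restores the caller's registers
print(run(16, mov_pop) == caller)   # so does mov/pop
print(run(0, pop_only) == caller)   # pop alone works if ESP == EBP already
print(run(16, pop_only) == caller)  # but not when locals were allocated
```

In this model the last case fails because `pop ebp` reads a local-variable slot instead of the saved EBP, which is the situation where a compiler needs `leave` or `mov`/`pop`.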


`leave` is only 3 uops on modern Intel CPUs (and 2 on some AMD).

`mov` / `pop` is only 2 uops total (on modern x86, where a "stack engine" tracks updates to ESP/RSP). So `leave` is just one more uop than doing things separately. I've tested this on Skylake, comparing a call/ret in a loop with the function setting up a traditional frame pointer and tearing down its stack frame using `mov`/`pop` or `leave`. The `perf` counter for `uops_issued.any` shows one more front-end uop when you use `leave` than for `mov`/`pop`. (I ran my own test in case other measurement methods had been counting a stack-sync uop in their `leave` measurements; using it in a real function controls for that.)

Possible reasons why older CPUs might have benefited more from keeping `mov` / `pop` split up:

* In most CPUs without a uop cache (i.e. Intel before Sandybridge, AMD before Zen), multi-uop instructions can be a decode bottleneck. They can only decode in the first ("complex") decoder, which can mean the decode cycle before that produces fewer uops than normal.
* Some Windows calling conventions are callee-pops for stack args, using `ret n` (e.g. `ret 8` to do ESP/RSP += 8 after popping the return address). This is a multi-uop instruction, unlike plain near `ret` on modern x86. So the above reason goes double: `leave` and `ret 12` couldn't decode in the same cycle.
* Those reasons also apply to legacy decode when building uop-cache entries.
* P5 Pentium also preferred a RISC-like subset of x86, being unable to break up complex instructions into separate uops *at all*.

**For modern CPUs**, `leave` takes up 1 extra uop in the uop cache. And all 3 have to be in the same line of the uop cache, which could lead to only partial filling of the previous line. So larger x86 code size *could* actually improve packing into the uop cache. Or not, depending on how things line up.


Saving 2 bytes (or 3 in 64-bit mode) may or may not be worth 1 extra uop per function.
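The byte counts above can be checked with a little arithmetic. The following Python sketch simply measures the standard machine-code encodings of each instruction; the `ENCODINGS` table and `bytes_saved` helper are hypothetical names for illustration, and only one common encoding of each `mov` is listed.

```python
# Machine-code bytes (hex) for the epilogue instructions in each mode.
ENCODINGS = {
    32: {"leave": "C9", "mov esp, ebp": "89 EC", "pop ebp": "5D"},
    64: {"leave": "C9", "mov rsp, rbp": "48 89 EC", "pop rbp": "5D"},
}


def size(hex_bytes):
    """Length in bytes of a space-separated hex string."""
    return len(bytes.fromhex(hex_bytes))


def bytes_saved(mode):
    """Bytes saved by using leave instead of mov + pop in the given mode."""
    enc = ENCODINGS[mode]
    long_form = sum(size(code) for name, code in enc.items() if name != "leave")
    return long_form - size(enc["leave"])


for mode in (32, 64):
    print(f"{mode}-bit mode: leave saves {bytes_saved(mode)} byte(s)")
```

The extra byte in 64-bit mode comes from the REX.W prefix (`48`) that `mov rsp, rbp` needs.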

GCC favours `leave`; clang and MSVC favour `mov`/`pop` (even with `clang -Oz`, which optimizes for code size even at the expense of speed, e.g. doing stuff like `push 1 / pop rax` (3 bytes) instead of 5-byte `mov eax,1`).

ICC favours `mov`/`pop`, but will use `leave` with `-Os`.


#5
When designing the 80286, Intel's CPU designers decided to add two instructions to help maintain displays.

Here is, in effect, what the microcode inside the CPU does:

    ; ENTER Locals, LexLevel

            push    bp              ;Save dynamic link.
            mov     tempreg, sp     ;Save for later.
            cmp     LexLevel, 0     ;Done if this is lex level zero.
            je      Lex0

    lp:
            dec     LexLevel
            jz      Done            ;Quit if at last lex level.
            sub     bp, 2           ;Index into display in prev act rec
            push    [bp]            ; and push each element there.
            jmp     lp              ;Repeat for each entry.

    Done:
            push    tempreg         ;Add entry for current lex level.

    Lex0:
            mov     bp, tempreg     ;Ptr to current act rec.
            sub     sp, Locals      ;Allocate local storage
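To see what that listing does to the stack, here is a minimal Python sketch that follows it step by step on a simulated 16-bit stack. The `enter` function, the `mem` dictionary, and the example addresses are hypothetical and only meant to mirror the listing above, not real hardware behaviour in every detail.

```python
def enter(sp, bp, mem, locals_size, lex_level):
    """Follow the ENTER listing on a toy stack; returns the new (sp, bp)."""

    def push(value):
        nonlocal sp
        sp -= 2
        mem[sp] = value

    push(bp)                      # push bp
    tempreg = sp                  # mov tempreg, sp
    if lex_level != 0:            # cmp LexLevel, 0 / je Lex0
        while True:               # lp:
            lex_level -= 1        # dec LexLevel
            if lex_level == 0:    # jz Done
                break
            bp -= 2               # sub bp, 2
            push(mem[bp])         # push [bp]
        push(tempreg)             # Done: push tempreg
    bp = tempreg                  # Lex0: mov bp, tempreg
    sp -= locals_size             # sub sp, Locals
    return sp, bp


def long_form_level0(sp, bp, mem, locals_size):
    """push bp / mov bp, sp / sub sp, n on the same toy stack."""
    sp -= 2
    mem[sp] = bp
    bp = sp
    sp -= locals_size
    return sp, bp


# Lex level 0: ENTER behaves like the plain three-instruction prologue.
print(enter(0x100, 0x200, {}, 8, 0) == long_form_level0(0x100, 0x200, {}, 8))

# Lex level 3: two display entries are copied from the caller's frame,
# then a pointer to the new frame is pushed.
caller_display = {0x1FE: 0xAAAA, 0x1FC: 0xBBBB}
sp, bp = enter(0x100, 0x200, caller_display, 8, 3)
print(hex(sp), hex(bp))
```

The copying loop is the part that maintains the display, and it is also what makes the instruction expensive as the lex level grows.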


The alternatives to `ENTER` would be:

    ; enter n, 0        ;14 cycles on the 486

    push bp             ;1 cycle on the 486
    mov bp, sp          ;1 cycle on the 486
    sub sp, n           ;1 cycle on the 486


    ; enter n, 1        ;17 cycles on the 486

    push bp             ;1 cycle on the 486
    push [bp-2]         ;4 cycles on the 486
    mov bp, sp          ;1 cycle on the 486
    add bp, 2           ;1 cycle on the 486
    sub sp, n           ;1 cycle on the 486

    ; enter n, 3        ;23 cycles on the 486

    push bp             ;1 cycle on the 486
    push [bp-2]         ;4 cycles on the 486
    push [bp-4]         ;4 cycles on the 486
    push [bp-6]         ;4 cycles on the 486
    mov bp, sp          ;1 cycle on the 486
    add bp, 6           ;1 cycle on the 486
    sub sp, n           ;1 cycle on the 486

Etc. The long way might increase your file size, but is way quicker.
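Adding up the 486 cycle counts quoted above makes the gap explicit. This Python sketch only sums the per-instruction figures from this post; the table names are hypothetical, and it assumes the `enter n, 0` replacement also includes the `mov bp, sp` needed to set up the frame pointer.

```python
# Cycle counts on the 486, as quoted in this post.
ENTER_CYCLES = {0: 14, 1: 17, 3: 23}

LONG_FORM = {
    # Assumption: mov bp, sp is part of the level-0 replacement.
    0: [("push bp", 1), ("mov bp, sp", 1), ("sub sp, n", 1)],
    1: [("push bp", 1), ("push [bp-2]", 4), ("mov bp, sp", 1),
        ("add bp, 2", 1), ("sub sp, n", 1)],
    3: [("push bp", 1), ("push [bp-2]", 4), ("push [bp-4]", 4),
        ("push [bp-6]", 4), ("mov bp, sp", 1), ("add bp, 6", 1),
        ("sub sp, n", 1)],
}


def long_form_cycles(level):
    """Total cycles of the hand-written sequence for a given lex level."""
    return sum(cycles for _, cycles in LONG_FORM[level])


for level, enter_cycles in ENTER_CYCLES.items():
    total = long_form_cycles(level)
    print(f"enter n, {level}: {enter_cycles} cycles vs {total} for the long form")
```

By these figures the long form wins at every lex level shown, with the largest relative gap at level 0, which is the only case ordinary compiled code needs.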

One last note: programmers don't really use displays anymore, since they were a very slow workaround, which makes `ENTER` pretty useless now.
