Link between instruction pipelining and cycles per instruction

#1
I understand the basic principle of instruction pipelining.

I also get that some instructions may take longer to execute (cycles per instruction).

But I don't get the link between the two.

All the pipeline diagrams I see seem to show "perfect" instructions: they all take the same number of cycles.

[4-stage pipeline diagram - image not available]

But what if the first instruction takes 5 cycles and the second one takes 3 cycles? Does the CPU stall for 2 cycles?

Would this stall be called a bubble? Or is this different from hazards and data dependencies?

Also, does the length of an instruction, in bytes, matter in any way ?


#2
It's actually a bit more complex than you picture.

For one, the CPU does not execute instructions directly; it executes uops (micro-operations) instead. Secondly, it can execute those uops out of order.

**uops**
A simple instruction translates to a single uop; a complex instruction is split into multiple uops. The CPU has a uop cache that keeps the most recently used uops (e.g. 1024 of them). The uops are more similar to each other than the full instructions are, and thus pair better in the pipeline.
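To make the idea concrete, here is a toy sketch of how a decoder might split instructions into uops. The instruction strings, uop names, and splits are simplified assumptions for illustration, not real Intel or AMD encodings.

```python
# Illustrative uop decomposition (assumed splits, not a real CPU's tables).
UOP_SPLIT = {
    "mov eax, ebx":   ["mov_reg"],                   # simple: 1 uop
    "add eax, [rsi]": ["load", "add_reg"],           # load-op: 2 uops
    "add [rsi], eax": ["load", "add_reg", "store"],  # read-modify-write: 3 uops
}

def decode(instruction):
    # Anything not in the table stands in for a microcoded instruction.
    return UOP_SPLIT.get(instruction, ["complex_microcode"])

print(decode("add [rsi], eax"))  # the uops, not the instruction, enter the pipeline
```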

**Out of order execution**
If the CPU needs to wait for the result of a calculation, it looks for uops that do not depend on the previous instruction and executes those instead.
To allow OoO execution, the CPU has a register file with many more registers than are available to the programmer (e.g. 256 general-purpose registers). It can use this as a scratch pad to store intermediate results.
All executed instructions go into a retirement buffer, from which the results are committed in the original program order.

**Buffers**
In addition to all this, stalls are mitigated by buffers.
Instructions are fetched speculatively and sit in a buffer waiting to be decoded.

**Constant time decoding**
x86/x64 is notorious for its complex decoding. Both AMD and Intel have solved this problem by devoting a lot of silicon to it, so that their CPUs can decode a constant number of bytes per cycle, independent of instruction complexity. The length of an instruction does not really matter, because time-critical code (tight loops) is executed from the uop cache, which does not need to be decoded again. In addition, the decoders are commonly over-dimensioned, so they are near certain not to be a bottleneck.


**More stages**
A modern CPU has 14 or more pipeline stages, not the 4 that you seem to envisage.

See for instance published expositions of AMD's Zen architecture.
So in addition to the pipeline there are quite a few other mechanisms, all of which are put in place to prevent stalls and fill up the bubbles.

In practice, modern processors do not suffer when pairing instructions with different latencies. The use of low-latency uops has eliminated this issue to a large extent.

**Hazards**
The Wikipedia article you link to explains it pretty well. Modern CPUs use Tomasulo's algorithm with register renaming to prevent the bubbles.
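A minimal sketch of the renaming idea, assuming a free list of physical registers and a map from architectural to physical names (an illustration only, not a faithful model of any real renamer):

```python
# Toy register renamer: each write gets a fresh physical register, so two
# in-flight writes to the same architectural register no longer conflict.
class Renamer:
    def __init__(self, num_physical=256):
        self.free = list(range(num_physical))  # free physical registers
        self.map = {}                          # architectural -> physical

    def rename(self, dest, sources):
        # Sources read the current mapping, preserving true (RAW) dependencies.
        # Unmapped architectural names stand for themselves here (a simplification).
        src_phys = [self.map.get(s, s) for s in sources]
        # The destination gets a fresh physical register, removing the WAW conflict.
        phys = self.free.pop(0)
        self.map[dest] = phys
        return phys, src_phys

r = Renamer()
p1, _ = r.rename("R1", ["R10", "R11"])  # first write to R1
p2, _ = r.rename("R1", ["R5", "R6"])    # second write to R1
assert p1 != p2  # the two writes land in different physical registers
```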



#3
You touched on quite a few things in your question, so I'll put in my 2 cents to try and make it all a bit clearer. Let's look at an in-order MIPS architecture as an example - it features all of the things you mention except the variable-length instructions.

Many MIPS CPUs have a 5-stage pipeline: `IF -> ID -> EX -> MEM -> WB`. Let's first look at instructions where each of these stages generally takes a single clock cycle (this might not be the case on cache misses, for example): SW (store word to memory), BNEZ (branch on not zero) and ADD (add two registers and store to register). Not all of these instructions have useful work in all pipe stages. For example, SW has no work to do in the WB stage, BNEZ can be finished as early as the ID stage (the earliest the target address can be computed), and ADD has no work in the MEM stage.

Regardless of that, each of these instructions will go through each and every stage of the pipeline, even if they have no work in some of them. The instruction will occupy a given stage, but no actual work will be done (i.e. no result is written to a register in the WB stage for the SW instruction). In other words, there will be no stalls in this case.
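That ideal case can be sketched in a few lines: every stage takes one cycle and there are no hazards, so each instruction simply starts one cycle after the previous one (a textbook-style illustration, not a real MIPS model):

```python
# Ideal 5-stage in-order pipeline: instruction i enters IF at cycle i,
# then advances exactly one stage per cycle with no stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    rows = []
    for i, instr in enumerate(instructions):
        row = ["  "] * i + STAGES  # i cycles of offset, then the 5 stages
        rows.append((instr, row))
    return rows

for instr, row in pipeline_diagram(["SW", "BNEZ", "ADD"]):
    print(f"{instr:5}", " ".join(row))
```

Running this prints the familiar staircase diagram: with 3 instructions and 5 stages, everything finishes in 7 cycles instead of 15.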

Moving over to more complex instructions, such as MUL or DIV, whose EX stage can take tens of cycles, things get much trickier. Now instructions can complete out of order even though they are always fetched in order (meaning WAW hazards are now possible). Take the following example:

```
MUL R1, R10, R11
ADD R2, R5, R6
```

MUL is fetched first and reaches the EX stage before ADD, yet ADD completes way before it, as MUL's EX stage runs for more than 10 clock cycles. However, the pipeline won't be stalled at any point, as there is no possibility of hazards in this sequence - neither RAW nor WAW hazards are possible. Take another example:

```
MUL R1, R10, R11
ADD R1, R5, R6
```

Now both MUL and ADD write the same register. As ADD completes way earlier than MUL, it will reach WB and write its result first. At a later point, MUL will do the same, and R1 would end up with the wrong (old) value. This is where a pipeline ***stall*** is needed. One way to solve this is to prevent ADD from issuing (moving from the ID to the EX stage) until MUL enters the MEM stage. That's done by freezing or stalling the pipeline. Introducing floating-point operations leads to similar problems in the pipeline.
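The timing of that hazard can be sketched with a simplified single-issue model. The latencies are assumptions for illustration (MUL spends 10 cycles in EX, ADD just 1), and the "fix" below delays the whole instruction rather than just its issue, which is cruder than real hardware:

```python
# Simplified timing model: IF at fetch_cycle, then ID (1 cycle),
# EX (variable latency), MEM (1 cycle), WB (1 cycle).
EX_LATENCY = {"MUL": 10, "ADD": 1}  # assumed latencies, not real MIPS numbers

def wb_cycle(fetch_cycle, op):
    return fetch_cycle + 1 + EX_LATENCY[op] + 1 + 1

mul_wb = wb_cycle(0, "MUL")  # MUL R1, R10, R11 - fetched first
add_wb = wb_cycle(1, "ADD")  # ADD R1, R5, R6  - fetched one cycle later
assert add_wb < mul_wb       # ADD would write R1 first: the WAW hazard

# Fix: stall ADD long enough that its WB lands after MUL's,
# keeping the writes to R1 in program order.
stall = mul_wb + 1 - add_wb
assert wb_cycle(1 + stall, "ADD") == mul_wb + 1
```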

I'll complete my answer by touching on the topic of fixed-length vs. variable-length instruction formats (even though you didn't explicitly ask about it). MIPS (and most RISC) CPUs have a fixed-length encoding. This tremendously simplifies the implementation of a CPU pipeline, as instructions can be decoded and input registers read within a single cycle (assuming that register locations are fixed in a given instruction format, which is true for MIPS). Additionally, fetching is simplified, as instructions are always the same length, so there's no need to start decoding an instruction to find its length.
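To see why fixed-length decoding is so cheap, here is a sketch of extracting the fields of a MIPS R-type instruction: every R-type word is 32 bits with the register fields at fixed bit positions, so decoding is just shifts and masks:

```python
# MIPS R-type layout: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)
def decode_rtype(word):
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,
        "rt":     (word >> 16) & 0x1F,
        "rd":     (word >> 11) & 0x1F,
        "shamt":  (word >> 6)  & 0x1F,
        "funct":  word & 0x3F,
    }

# ADD R1, R10, R11 encodes as opcode=0, rs=10, rt=11, rd=1, funct=0x20.
word = (0 << 26) | (10 << 21) | (11 << 16) | (1 << 11) | 0x20
fields = decode_rtype(word)
assert (fields["rs"], fields["rt"], fields["rd"]) == (10, 11, 1)
```

With a variable-length ISA like x86, the decoder must instead parse prefixes and opcode bytes just to find where the next instruction begins.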

There are of course disadvantages: the possibility of generating compact binaries is reduced, which leads to larger programs, which in turn leads to poorer cache performance. Additionally, memory traffic increases, as more bytes of data are read from/written to memory, which might be important for energy-efficient platforms.

These disadvantages have led some RISC architectures to define a 16-bit instruction-length mode (MIPS16 or ARM Thumb), or even a variable-length instruction set (ARM Thumb2 has 16-bit and 32-bit instructions). Unlike x86, Thumb2 was designed to make it easy to determine instruction length quickly, so it's still easy for CPUs to decode.

These compacted ISAs often require more instructions to implement the same program, but take less total space and run faster if code-fetch is more of a bottleneck than instruction throughput in the pipeline. (Small / nonexistent instruction cache, and/or reading from a ROM in an embedded CPU).





© 0Day 2016 - 2023 | All Rights Reserved.