Giáo trình

Computer Architecture

Science and Technology

Instruction Pipelining

Tác giả: Hoang Lan Nguyen


Basic concepts

An instruction has a number of stages. The various stages can be worked on simultanously through various blocks of production. This is a pipeline. This process is also referred as instruction pipeling. Figure 8.1 shown the pipeline of two independent stages: fetch instruction and execusion instruction. The first stage fetches an instruction and buffers it. While the second stage is executing the instruction, the first stage takes advantage of any unused memory cycles to fetch and buffer the next instruction. This process will speed up instruction execution

Two stages Instruction Pipeline

Pipeline principle

The decomposition of the instruction processing by 6 stages is the following.

- Fetch Instruction (FI): Read the next expected introduction into a buffer

- Decode Instruction (DI): Determine the opcode and the operand specifiers

- Calculate Operands (CO): Calculate the effective address of each source operand. This may involve displacement, register indirect, indirect or other forms of address calculations.

- Fetch Operands (FO): Fetch each operand from memory. Operands in register need not be fetched.

- Execute Instruction (EI): Perform the indicated operation and store the result, if any, in the specified destination operand location.

- Write Operand (WO): Store result in memory.

Using the assumption of the equal duration for various stages, the figure 8.2 shown that a six stage pipeline can reduce the execution time for 9 instructions from 54 time units to 14 time units.

Timing diagram for instruction pipeline operation.

Also the diagram assumes that all of the stages can be performed in parallel, in particular, it is assumed that there are no memory conflicts. The processor make use of instruction pipelining to speed up executions, pipeling invokes breaking up the instruction cycle into a number of separate stages in a sequence. However the occurrence of branches and independencies between instruction complates the design and use of pipeline.

Pipeline Performance and Limitations

With the pipeling approach, as a form of parallelism, a “good” design goal of any system is to have all of its components performing useful work all of the time, we can obtain a high efficiency. The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction.

This strategy can give the following:

- Perform all tasks concurrently, but on different sequential instructions

– The result is temporal parallelism.

– Result is the instruction pipeline.

Pipeline performance

In this subsection, we can show some measures of pipeline performance based on the book “Computer Organization and Architecture: Designing for Performance”, 6th Edition by William Stalling.

The cycle time  of an instruction pipeline can be determined as:

T=max[Ti]+d=Tm+d size 12{T="max" \[ T rSub { size 8{i} } \] +d=T rSub { size 8{m} } +d} {} with 1 size 12{ <= {}} {} i size 12{ <= {}} {}k


Tm size 12{T rSub { size 8{m} } } {} = Maximun stage delay through stage

k = number of stages in instruction pipeline

d = time delay of a latch.

In general, the time delay d is equivalent to a clock pulse and Tm size 12{T rSub { size 8{m} } } {} >> d. Suppose that n instruction are processed with no branched.

  • The total time required Tk size 12{T rSub { size 8{k} } } {} to execute all n instruction is:

Tk size 12{T rSub { size 8{k} } } {}= [k + (n-1)]

  • The speedup factor for the instruction pipeline compared to execution without the pipeline is defined as:

S K = T 1 T K = nk τ k + ( n 1 ) τ = nk k + ( n 1 ) size 12{ { size 24{S} } rSub { size 8{K} } = { { { size 24{T} } rSub { size 8{1} } } over { { size 24{T} } rSub { size 8{K} } } } = { { ital "nk"τ} over { left [k+ \( n - 1 \) right ]τ} } = { { ital "nk"} over {k+ \( n - 1 \) } } } {}

  • An ideal pipeline divides a task into k independent sequential subtasks

– Each subtask requires 1 time unit to complete

– The task itself then requires k time units tocomplete. For n iterations of the task, the execution times will be:

– With no pipelining: nk time units

– With pipelining: k + (n-1) time units

Speedup of a k-stage pipeline is thus

S = nk / [k+(n-1)] ==> k (for large n)

Pipeline Limitations

Several factors serve to limit the pipeline performance. If the six stage are not of equal duration, there will be some waiting involved at various pipeline stage. Another difficulty is the condition branch instruction or the unpredictable event is an interrupt. Other problem arise that the memory conflicts could occur. So the system must contain logic to account for the type of conflict.

  • Pipeline depth

- Data dependencies also factor into the effective length of pipelines

- Logic to handle memory and register use and to control the overall pipeline increases significantly with increasing pipeline depth

– If the speedup is based on the number of stages, why not build lots of stages?

– Each stage uses latches at its input (output) to buffer the next set of inputs

+ If the stage granularity is reduced too much, the latches and their control become a significant hardware overhead

+ Also suffer a time overhead in the propagation time through the latches

- Limits the rate at which data can be clocked through the pipeline

  • Data dependencies

– Pipelining must insure that computed results are the same as if computation was performed in strict sequential order

– With multiple stages, two instructions “in execution” in the pipeline may have data dependencies. So we must design the pipeline to prevent this.

– Data dependency examples:

A = B + C

D = E + A

C = G x H

A = D / H

Data dependencies limit when an instruction can be input to the pipeline.

  • Branching

One of the major problems in designing an instruction pipeline is assuring a steady flow of instructions to initial stages of the pipeline. However, 15-20% of instructions in an assembly-level stream are (conditional) branches. Of these, 60-70% take the branch to a target address. Until the instruction is actually executed, it is impossible to determin whether the branch will be taken or not.

- Impact of the branch is that pipeline never really operates at its full capacity.

– The average time to complete a pipelined instruction becomes

Tave =(1-pb)1 + pb[pt(1+b) + (1-pt)1]

– A number of techniques can be used to minimize the impact of the branch instruction (the branch penalty).

- A several approaches have been taken for dealing with conditional branches:

+ Multiple streams

+ Prefetch branch target

+ Loop buffer

+ Branch prediction

+Delayed branch.

  • Multiple streams

- Replicate the initial portions of the pipeline and fetch both possible next instructions

- Increases chance of memory contention

- Must support multiple streams for each instruction in the pipeline

  • Prefetch branch target

- When the branch instruction is decoded, begin to fetch the branch target instruction and place in a second prefetch buffer

- If the branch is not taken, the sequential instructions are already in the pipe, so there is not loss of performance

- If the branch is taken, the next instruction has been prefetched and results in minimal branch penalty (don’t have to incur a memory read operation at the end of the branch to fetch the instruction)

  • Loop buffer: Look ahead, look behind buffer

- Many conditional branches operations are used for loop control

- Expand prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed

- If buffer is big enough, entire loop can be held in it, this can reduce the branch penalty.

  • Branch prediction

- Make a good guess as to which instruction will be executed next and start that one down the pipeline.

- Static guesses: make the guess without considering the runtime history of the program

Branch never taken

Branch always taken

Predict based on the opcode

- Dynamic guesses: track the history of conditional branches in the program.

Taken / not taken switch History table

Branch prediction using 2 history bits
  • Delayed branch

- Minimize the branch penalty by finding valid instructions to execute in the pipeline while the branch address is being resolved.

- It is possible to improve performance by automatically rearranging instruction within a program, so that branch instruction occur later than actually desired

- Compiler is tasked with reordering the instruction sequence to find enough independent instructions (wrt to the conditional branch) to feed into the pipeline after the branch that the branch penalty is reduced to zero

Superscalar and Superpipelined Processors

Superpipeline designs

– Observation: a large number of operations do not require the full clock cycle to complete

– High performance can be obtained by subdividing the clock cycle into a number of sub intervals » Higher clock frequency!

– Subdivide the “macro” pipeline H/W stages into smaller (thus faster) substages and clock data through at the higher clock rate

– Time to complete individual instructions does not change

» Degree of parallelism goes up

» Perceived speedup goes up


– Implement the CPU such that more than one instruction can be performed (completed) at a time

– Involves replication of some or all parts of the CPU/ALU

– Examples:

» Fetch multiple instructions at the same time

» Decode multiple instructions at the same time

» Perform add and multiply at the same time

» Perform load/stores while performing ALU operation

– Degree of parallelism and hence the speedup of the machine goes up as more instructions are executed in parallel

  • Data dependencies in superscalar

– It must insure computed results are the same as would be computed on a strictly sequential machine

– Two instructions can not be executed in parallel if the (data) output of one is the input of the other or if they both write to the same output location

– Consider:

S1: A = B + C

S2: D = A + 1

S3: B = E + F

S4: A = E + 3

Resource dependencies:

– In the above sequence of instructions, the adder unit gets a real workout!

– Parallelism is limited by the number of adders in the ALU

Instruction issue policy

Problem: In what order are instructions issued to the execution unit and in what order do they finish?

There is 3 types of ordering.

- The order in which instructions are fetched

- The order in which instructions are executed

- The order in which instructions update the contents of registre or memory location.

  • In-order issue, in-order completion

» Simplest method, but severely limits performance

» Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions

» Delay execution of some instructions delay all subsequent instructions

  • In-order issue, out-of-order completion

» Any number of instructions can be executed at a time

» Instruction issue is still limited by resource conflicts or data and procedural dependencies

» Output dependencies resulting from out-of order completion must be resolved

» “Instruction” interrupts can be tricky

  • Out-of-order issue, out-of-order completion

» Decode and execute stages are decoupled via an instruction buffer “window”

» Decoded instructions are “stored” in the window awaiting execution

» Functional units will take instructions from the window in an attempt to stay busy

This can result in out-of-order execution

S1: A = B + C

S2: D = E + 1

S3: G = E + F

S4: H = E * 3

“Antidependence” class of data dependencies must be dealt with it.