
# **Suggested Readings**

University of Notre Dame

- Readings
  - H&P: Chapter 4.5-4.7
    - (Over the next 3-4 lectures)

## Lecture 12 Introduction to Pipelining

CSE 30321 – Lecture 12 – Introduction to Pipelining

- Goal: Describe the fundamental components required in a single core of a modern microprocessor as well as how they interact with each other, with main memory, and with external storage media.

[Figure: course overview linking processor components, processor comparison, HLL code translation (`a = (a*b) + c;` becomes `MULT r1,r2,r3  # r1 ← r2*r3` and `ADD r2,r1,r4  # r2 ← r1+r4`), the right HW for the right application, writing more efficient code, and multicore processors and programming.]

# Example: We have to build x cars

- Each car takes 6 steps to build:
  - Build the frame (~1 hour)
  - Build the body (~1.25 hours)
  - Install interior (~1.25 hours)
  - Put on axles, wheels (~1 hour)
  - Paint (~1.5 hours)
  - Roll out (~1 hour)



# **Pipelining Lessons (laundry example)**



- <u>Multiple</u> tasks operating simultaneously
- Pipelining doesn't help <u>latency</u> of single task, it helps <u>throughput</u> of entire workload
- Pipeline rate limited by slowest pipeline stage
- Potential speedup = <u>Number pipe stages</u>
- Unbalanced lengths of pipe stages reduce speedup
- Also, need time to "<u>fill</u>" and "<u>drain</u>" the pipeline.
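The lessons above can be checked against the car-building example from this lecture. The following is a minimal sketch, assuming the assembly line advances in lock step at the pace of its slowest step; the function names are illustrative and not part of the lecture.

```python
# Step times in hours, taken from the car-building example in this lecture.
STEP_HOURS = {
    "frame": 1.0,
    "body": 1.25,
    "interior": 1.25,
    "axles_wheels": 1.0,
    "paint": 1.5,
    "roll_out": 1.0,
}


def unpipelined_hours(num_cars: int) -> float:
    """One car at a time: each car takes the sum of all step times."""
    return num_cars * sum(STEP_HOURS.values())


def pipelined_hours(num_cars: int) -> float:
    """Assembly line clocked by its slowest step (assumption: lock-step stages)."""
    slot = max(STEP_HOURS.values())
    stages = len(STEP_HOURS)
    # Fill the line once, then one car rolls out per slot.
    return (stages + num_cars - 1) * slot


def speedup(num_cars: int) -> float:
    """Unpipelined time divided by pipelined time for the same workload."""
    return unpipelined_hours(num_cars) / pipelined_hours(num_cars)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, round(speedup(n), 2))
```

Under these assumptions a single car is no faster (latency does not improve), and for many cars the speedup approaches 7 / 1.5, which is below the 6 stages because the steps are unbalanced.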


### **Pipelining: Some terms**

- If you're doing laundry or implementing a μP, each stage where something is done is called a pipe stage
  - In laundry example, washer, dryer, and folding table are pipe stages; clothes enter at one end, exit at the other
  - In a μP, instructions enter at one end and have been executed when they leave
- <u>Throughput</u> is how often stuff comes out of a pipeline


- On the board...
- The "math" behind pipelining...



# **Recap:** Pipeline Math

- If times for all S stages are equal to T:
  - Time for one initiation to complete still ST
  - Time between 2 initiations = T not ST
  - Initiations per second = 1/T
- Time for N initiations to complete: NT + (S-1)T
- Throughput: Time per initiation = T + (S-1)T/N $\rightarrow$ T!
- Pipelining: Overlap multiple executions of same sequence
  - Improves THROUGHPUT, not the time to perform a single operation

# More technical detail

- Book's approach to draw pipeline timing diagrams...
  - Time runs left-to-right, in units of stage time
  - Each "row" below corresponds to distinct initiation
  - Boundary b/t 2 column entries: pipeline register (i.e. hamper)
  - Look at columns to see what stage is doing what

| Time         | 0      | 1      | 2      | 3      | 4      | 5      | 6      |
|--------------|--------|--------|--------|--------|--------|--------|--------|
| Initiation 1 | Wash 1 | Dry 1  | Fold 1 | Pack 1 |        |        |        |
| Initiation 2 |        | Wash 2 | Dry 2  | Fold 2 | Pack 2 |        |        |
| Initiation 3 |        |        | Wash 3 | Dry 3  | Fold 3 | Pack 3 |        |
| Initiation 4 |        |        |        | Wash 4 | Dry 4  | Fold 4 | Pack 4 |
| Initiation 5 |        |        |        |        | Wash 5 | Dry 5  | Fold 5 |
| Initiation 6 |        |        |        |        |        | Wash 6 | Dry 6  |
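The pipeline math and the book-style timing diagram can be reproduced with a few lines of code. This is a sketch under the recap's assumption that all S stages take the same time T; the function names are illustrative.

```python
def total_time(n: int, s: int, t: float) -> float:
    """Time for N initiations through S equal stages of length T: NT + (S-1)T."""
    return n * t + (s - 1) * t


def time_per_initiation(n: int, s: int, t: float) -> float:
    """Average time per initiation: T + (S-1)T/N, which approaches T as N grows."""
    return total_time(n, s, t) / n


def timing_rows(n: int, stages: list[str]) -> list[list[str]]:
    """One row per initiation, one column per time step."""
    width = n + len(stages) - 1
    rows = []
    for i in range(n):
        row = [""] * width
        for k, stage in enumerate(stages):
            # Initiation i enters stage k at time i + k.
            row[i + k] = f"{stage} {i + 1}"
        rows.append(row)
    return rows


if __name__ == "__main__":
    laundry = ["Wash", "Dry", "Fold", "Pack"]
    for row in timing_rows(4, laundry):
        print(" | ".join(cell.ljust(6) for cell in row))
    print(total_time(4, len(laundry), 1.0))
    print(time_per_initiation(1000, len(laundry), 1.0))
```

Reading down a column of the printed rows shows what every stage is doing at that time step.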

# **Recap:** How much (ideal) speedup?
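One way to arrive at the ideal speedup, assuming S balanced stages of time T, N initiations, and no pipeline overhead (the symbols follow the pipeline math recap; the derivation itself is a sketch rather than a copy of the slide):

```latex
\begin{align*}
\text{Time}_{\text{unpipelined}} &= N \cdot S \cdot T \\
\text{Time}_{\text{pipelined}}   &= N T + (S-1) T \\
\text{Speedup} &= \frac{N S T}{N T + (S-1) T}
                = \frac{N S}{N + S - 1}
                \;\longrightarrow\; S \quad \text{as } N \to \infty
\end{align*}
```

This agrees with the laundry lesson that potential speedup equals the number of pipe stages.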




# The "new look" dataflow



- Note: Some extra HW needed.
- Note: purple in a latch indicates data from that instruction stored there.
- Data must be stored from one stage to the next in pipeline registers/latches.
  - These hold temporary values between clocks and needed info. for execution.

[Figure: pipelined datapath with pipeline registers labeled ID/EX, EX/MEM, and MEM/WB.]

# Another way to look at it...




# Limits, limits, limits...

- So, now that the ideal stuff is out of the way, let's look at how a pipeline REALLY works...
- Pipelines are slowed b/c of:
  - Pipeline latency
  - Imbalance of pipeline stages
    - (Think: A chain is only as strong as its weakest link)
    - Well, a pipeline is only as fast as its slowest stage
  - Pipeline overhead (from where?)
    - Register delay from pipe stage latches
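The effect of stage imbalance and latch overhead on the clock can be sketched numerically. The stage delays and latch delay below are hypothetical values chosen for illustration, and the model assumes the clock period is the slowest stage plus the register delay.

```python
def cycle_time(stage_ns: list[float], latch_ns: float) -> float:
    """Clock period = slowest stage + pipeline register (latch) delay."""
    return max(stage_ns) + latch_ns


def pipelined_speedup(stage_ns: list[float], latch_ns: float) -> float:
    """Steady-state speedup over an unpipelined datapath that needs
    sum(stage_ns) per instruction and has no pipeline latches."""
    return sum(stage_ns) / cycle_time(stage_ns, latch_ns)


if __name__ == "__main__":
    balanced = [2.0, 2.0, 2.0, 2.0, 2.0]    # hypothetical stage delays in ns
    unbalanced = [1.0, 2.0, 3.0, 3.0, 1.0]  # same total work, uneven split
    print(pipelined_speedup(balanced, 0.0))    # ideal case: the stage count
    print(pipelined_speedup(balanced, 0.5))    # latch overhead only
    print(pipelined_speedup(unbalanced, 0.5))  # overhead plus imbalance
```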

# So, what about the details?

- In each cycle, new instruction fetched and begins 5 cycle execution
- In perfect world (pipeline) performance improved 5 times over!
- Now, let's talk about overhead...
  - (i.e. what else do we have to worry about?)
    - Must know what's going on in every cycle of machine
    - What if 2 instructions need same resource at same time?
      - (LOTS more on this later)
      - Separate instruction/data memories, multiple register ports, etc. help avoid this


# Let's look at some examples:

- Specifically:
  - (1 instruction sequence with a problem)
  - (2 instruction sequence)

### **Executing Instructions in Pipelined Datapath**

- Following charts describe 3 scenarios:
  - Processing of load word (lw) instruction
    - Bug included in design (make SURE you understand the bug)
  - Processing of lw
    - Bug corrected (make SURE you understand the fix)
  - Processing of lw followed in pipeline by sub
    - (Sets the stage for discussion of HAZARDS and interinstruction dependencies)

# Load Word: Cycles 1-5

[Figures: datapath snapshots of lw moving through the pipeline, one stage per cycle, from instruction fetch (cycle 1) and instruction decode (cycle 2) through cycle 5. Note: purple in a latch indicates data from that instruction stored there.]

- Where's the bug?

# Load Word: Fixed Bug

- Bug: source for Write Reg is invalid
- Solution: Need to preserve register number for write-back
  - Additional pipeline bits for write register address

[Figure: corrected datapath with pipeline registers labeled ID/EX, EX/MEM, and MEM/WB.]
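The write-register bug and its fix can be illustrated with a toy latch model. This is a sketch, assuming the buggy design reads the write register number from the IF/ID latch at write-back time (where a later instruction now sits), while the fixed design carries the number along with its own instruction. The `add` and `or` instructions are hypothetical fillers added so that the pipeline stays full.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    text: str
    dest: int  # destination register number


def simulate(program: list[Instr], carry_dest: bool) -> list[tuple[str, Optional[int]]]:
    """Return (instruction, register written) for each write-back, in order."""
    latches: dict[str, Optional[Instr]] = {
        "IF/ID": None, "ID/EX": None, "EX/MEM": None, "MEM/WB": None,
    }
    writes = []
    fetch = iter(program)
    for _ in range(len(program) + 4):
        wb = latches["MEM/WB"]  # instruction in write-back this cycle
        if wb is not None:
            if carry_dest:
                dest = wb.dest  # fix: number travelled with the instruction
            else:
                decoding = latches["IF/ID"]  # bug: a later instruction
                dest = decoding.dest if decoding is not None else None
            writes.append((wb.text, dest))
        # Clock edge: every latch takes the value from the stage before it.
        latches["MEM/WB"] = latches["EX/MEM"]
        latches["EX/MEM"] = latches["ID/EX"]
        latches["ID/EX"] = latches["IF/ID"]
        latches["IF/ID"] = next(fetch, None)
    return writes


PROGRAM = [
    Instr("lw $10, 9($1)", 10),
    Instr("sub $11, $2, $3", 11),
    Instr("add $12, $4, $5", 12),  # hypothetical filler
    Instr("or $13, $6, $7", 13),   # hypothetical filler
]

if __name__ == "__main__":
    print(simulate(PROGRAM, carry_dest=False)[0])
    print(simulate(PROGRAM, carry_dest=True)[0])
```

In this model the buggy pipeline writes the loaded value into the destination of the instruction three slots behind the lw, which is why the register number has to be preserved through the pipeline registers.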


### A 2 instruction sequence

- Examine multiple-cycle & single-cycle diagrams for a sequence of 2 independent instructions
  - (i.e. no common registers b/t them)
    - lw \$10, 9(\$1)
    - sub \$11, \$2, \$3
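A single-cycle diagram is a snapshot of which instruction occupies each stage in one clock cycle. The sketch below computes such snapshots for the two-instruction sequence, assuming the usual five stages (IF, ID, EX, MEM, WB), one fetch per cycle, and no stalls; the function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def snapshot(program: list[str], cycle: int) -> dict[str, str]:
    """Which instruction occupies each stage in `cycle` (1-based)."""
    view = {}
    for k, stage in enumerate(STAGES):
        i = cycle - 1 - k  # index of the instruction currently in this stage
        view[stage] = program[i] if 0 <= i < len(program) else "-"
    return view


if __name__ == "__main__":
    program = ["lw $10, 9($1)", "sub $11, $2, $3"]
    for cycle in range(1, 7):
        print(cycle, snapshot(program, cycle))
```

Stacking the six snapshots side by side gives the multiple-cycle diagram for the same sequence.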



# Single-cycle diagrams: cycles 1-6

[Figures: one datapath snapshot per clock cycle, cycles 1 through 6, for the lw/sub sequence.]

# What about control signals?

### **Questions about control signals**

- Following discussion relevant to a single instruction
- Q: Are all control signals active at the same time?
- A: ?
- Q: Can we generate all these signals at the same time?
- A: ?

### Passing control w/pipe registers

- Analogy: send instruction with car on assembly line
  - "Install Corinthian leather interior on car 6 @ stage 3"




# **Pipelined datapath w/control signals**




### On the board...

- Let's look at hazards...
  - ...and how they (generally) impact performance.






- Pipeline hazards prevent next instruction from executing during designated clock cycle
- There are 3 classes of hazards:
  - Structural Hazards:
    - Arise from resource conflicts
    - HW cannot support all possible combinations of instructions
  - Data Hazards:
    - Occur when given instruction depends on data from an instruction ahead of it in pipeline
  - Control Hazards:
    - Result from branch, other instructions that change flow of program (i.e. change PC)
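As a concrete illustration of the data hazard class, the sketch below checks whether a later instruction reads the register that an earlier one writes. It assumes MIPS-style text in which the first register listed is the destination, which holds for the lw/sub forms used in this lecture but not for stores or branches; the helper names are illustrative. Whether such a dependence actually forces a stall depends on the pipeline design.

```python
import re


def parse(instr: str) -> tuple[str, list[str]]:
    """Split an instruction into (destination register, source registers).
    Assumption: the first register listed is the destination."""
    regs = re.findall(r"\$\d+", instr)
    return regs[0], regs[1:]


def has_data_dependence(earlier: str, later: str) -> bool:
    """True if `later` reads the register that `earlier` writes."""
    dest, _ = parse(earlier)
    _, sources = parse(later)
    return dest in sources


if __name__ == "__main__":
    # Independent pair: no common registers.
    print(has_data_dependence("lw $10, 9($1)", "sub $11, $2, $3"))
    # Hypothetical dependent variant: sub reads $10 loaded by lw.
    print(has_data_dependence("lw $10, 9($1)", "sub $11, $10, $3"))
```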

### How do we deal with hazards?

- Often, pipeline must be stalled
- Stalling pipeline usually lets some instruction(s) in pipeline proceed, another/others wait for data, resource, etc.
- A note on terminology:
  - If we say an instruction was "issued <u>later</u> than instruction x", we mean that <u>it was issued after instruction x</u> and is not as far along in the pipeline
  - If we say an instruction was "issued <u>earlier</u> than instruction x", we mean that it <u>was issued before</u> <u>instruction x</u> and is further along in the pipeline


# **Structural hazards**

- 1 way to avoid structural hazards is to duplicate resources
  - e.g.: An ALU to perform an arithmetic operation and an adder to increment PC
- If not all possible combinations of instructions can be executed, structural hazards occur
- Most common instances of structural hazards:
  - When a functional unit is not fully pipelined
  - When some resource is not duplicated enough
- Pipeline stalls result from hazards; CPI increases from the usual "1"


# Stalls and performance

- Stalls impede progress of a pipeline and result in deviation from 1 instruction executing/clock cycle
- Pipelining can be viewed as a way to:
  - Decrease CPI or clock cycle time for instruction
  - Let's see what effect stalls have on CPI...
- CPI pipelined =
  - Ideal CPI + Pipeline stall cycles per instruction
  - 1 + Pipeline stall cycles per instruction
- Ignoring overhead and assuming stages are balanced:

$\text{Speedup} = \frac{\text{CPI unpipelined}}{1 + \text{pipeline stall cycles per instruction}}$

- If no stalls, speedup equal to # of pipeline stages in ideal case
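The speedup formula is easy to evaluate for a few stall rates. This is a sketch, assuming (as in the ideal case) that the unpipelined CPI equals the number of pipeline stages; the 5-stage depth and the stall rates are hypothetical inputs.

```python
def pipelined_cpi(stall_cycles_per_instr: float, ideal_cpi: float = 1.0) -> float:
    """CPI pipelined = ideal CPI + pipeline stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instr


def speedup(unpipelined_cpi: float, stall_cycles_per_instr: float) -> float:
    """Speedup = CPI unpipelined / (1 + pipeline stall cycles per instruction)."""
    return unpipelined_cpi / pipelined_cpi(stall_cycles_per_instr)


if __name__ == "__main__":
    depth = 5  # hypothetical pipeline depth, used as the unpipelined CPI
    for stalls in (0.0, 0.25, 1.0):
        print(stalls, speedup(depth, stalls))
```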

### An example of a structural hazard

[Figure: pipeline diagram with rows for a Load, Instruction 1, Instruction 2, a Stall row of bubbles, and Instruction 3; each instruction uses Mem and Reg.]

- How is it resolved?
  - Pipeline is generally stalled by inserting a "bubble" or NOP

# A simple example

- The facts:
  - Data references constitute 40% of an instruction mix
  - Ideal CPI of the pipelined machine is 1
  - A machine with a structural hazard has a clock rate that's 1.05 times higher than a machine without the hazard.
- How much does this LOAD problem hurt us?
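- Recall: Avg. Inst. Time = CPI x Clock Cycle Time
  - $= (1 + 0.4 \times 1) \times (\text{Clock cycle time}_{\text{ideal}}/1.05)$
  - $\approx 1.3 \times \text{Clock cycle time}_{\text{ideal}}$
  - The machine without the hazard has Avg. Inst. Time $= 1 \times \text{Clock cycle time}_{\text{ideal}}$, therefore it is better (about 1.3 times faster)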
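The arithmetic above can be reproduced directly. In this sketch the clock cycle time of the machine without the hazard is normalised to 1, and the one-cycle stall per data reference is taken from the (1 + 0.4 x 1) term; the function names are illustrative.

```python
def avg_instruction_time(cpi: float, clock_cycle_time: float) -> float:
    """Avg. Inst. Time = CPI x Clock Cycle Time."""
    return cpi * clock_cycle_time


def hazard_machine_time(data_ref_fraction: float,
                        stall_per_data_ref: float,
                        clock_rate_advantage: float) -> float:
    """Average instruction time of the machine with the structural hazard,
    in units of the ideal machine's clock cycle time."""
    cpi = 1.0 + data_ref_fraction * stall_per_data_ref
    return avg_instruction_time(cpi, 1.0 / clock_rate_advantage)


if __name__ == "__main__":
    ideal = avg_instruction_time(1.0, 1.0)
    with_hazard = hazard_machine_time(0.40, 1.0, 1.05)
    print(round(with_hazard / ideal, 2))
```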

# Or alternatively...


| Inst. # \ Clock number | 1  | 2  | 3  | 4     | 5   | 6   | 7  | 8   | 9   | 10  |
|------------------------|----|----|----|-------|-----|-----|----|-----|-----|-----|
| LOAD                   | IF | ID | EX | MEM   | WB  |     |    |     |     |     |
| Inst. i+1              |    | IF | ID | EX    | MEM | WB  |    |     |     |     |
| Inst. i+2              |    |    | IF | ID    | EX  | MEM | WB |     |     |     |
| Inst. i+3              |    |    |    | stall | IF  | ID  | EX | MEM | WB  |     |
| Inst. i+4              |    |    |    |       |     | IF  | ID | EX  | MEM | WB  |
| Inst. i+5              |    |    |    |       |     |     | IF | ID  | EX  | MEM |
| Inst. i+6              |    |    |    |       |     |     |    | IF  | ID  | EX  |
- LOAD instruction "steals" an instruction fetch cycle which will cause the pipeline to stall.
- Thus, no instruction completes on clock cycle 8
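The stall in the table can be reproduced with a small model. This is a sketch, assuming a single memory port shared by instruction fetch and data access, a five-stage pipeline in which a load fetched in cycle c uses memory again in cycle c + 3, and fetch being the access that waits; the function names are illustrative.

```python
def fetch_cycles(num_instrs: int, load_positions: set[int]) -> list[int]:
    """Cycle (1-based) in which each instruction is fetched when fetches and
    data accesses share one memory port."""
    busy = set()  # cycles in which a load's MEM stage owns the memory port
    cycles = []
    next_free = 1
    for i in range(num_instrs):
        c = next_free
        while c in busy:  # stall: the fetch waits for the memory port
            c += 1
        cycles.append(c)
        if i in load_positions:
            busy.add(c + 3)  # MEM is three stages after IF
        next_free = c + 1
    return cycles


def completion_cycles(num_instrs: int, load_positions: set[int]) -> list[int]:
    """Cycle in which each instruction reaches WB (four stages after IF)."""
    return [c + 4 for c in fetch_cycles(num_instrs, load_positions)]


if __name__ == "__main__":
    # Seven instructions, only the first one is a LOAD (as in the table).
    print(fetch_cycles(7, {0}))
    print(completion_cycles(7, {0}))
```

With only the first instruction being a load, the fourth instruction is fetched one cycle late and the list of completion cycles skips cycle 8, matching the table.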


### **Remember the common case!**

- All things being equal, a machine without structural hazards will always have a lower CPI.
- But, in some cases it may be better to allow them than to eliminate them.
- These are situations a computer architect might have to consider:
  - Is pipelining functional units or duplicating them costly in terms of HW?
  - Does structural hazard occur often?
  - What's the common case???


- Answer: Add more hardware.
  - As we'll see, CPI degrades quickly from our ideal '1' for even the simplest of cases...
