### Suggested Readings

- Readings
  - H&P: Chapter 7
    - (Over next 2 weeks)

### Lecture 24 Parallel Processing on Multi-Core Chips



University of Notre Dame

### Transistors used to manipulate/store 1s & 0s

Switch-level representation

### Cross-sectional view






Using above diagrams as context, note that if we (i) apply a suitable voltage to the gate & (ii) then apply a suitable voltage between source and drain, current will flow.

CSE 30321 – Lecture 24 – Parallel Processing on Multi-Core Chips

**Previous Industry Projections** 

Projection years and technology nodes: 2004 (90 nm), 2007 (65 nm), 2010 (45 nm), 2013 (32 nm), 2016 (22 nm).

### Moore's Law

"Cramming more components onto integrated circuits."

- G.E. Moore, Electronics 1965

- Observation: DRAM transistor density doubles annually
  - Became known as "Moore's Law"
  - Actually, a bit off:
    - Density doubles every 18 months (now more like 24)
    - (in 1965 they only had 4 data points!)

- Corollaries:
  - Cost per transistor halves annually (18 months)
  - Power per transistor decreases with scaling
  - Speed increases with scaling
    - Of course, it depends on how small you try to make things
      - (i.e., no exponential lasts forever)
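To get a feel for how much the doubling period matters, here is a minimal sketch that compounds density over a decade for the three periods mentioned above (12, 18, and 24 months). The function name and the 10-year horizon are illustrative choices, not part of the lecture.

```python
def projected_density(initial, years, doubling_period_years):
    """Density after `years` if it doubles every `doubling_period_years`."""
    return initial * 2 ** (years / doubling_period_years)


# Growth factor over one decade for each doubling period on the slide.
for months in (12, 18, 24):
    factor = projected_density(1.0, years=10, doubling_period_years=months / 12)
    print(f"doubling every {months} months -> about {factor:,.0f}x in 10 years")
```

Doubling annually gives 1024x in ten years, doubling every 24 months gives 32x, and the 18-month figure lands at roughly 100x, so the "bit off" correction changes the decade-scale picture by more than an order of magnitude.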

#### **Remember these!**


### A funny thing happened on the way to 45 nm



The 2005 projection was for 5.2 GHz, and we didn't make it in production. Further, we're still stuck at 3+ GHz in production.

| YEAR                            | 2004                | 2007      | 2010      | 2013        | 2016        |
|---------------------------------|---------------------|-----------|-----------|-------------|-------------|
| TECHNOLOGY                      | 90 nm               | 65 nm     | 45 nm     | 32 nm       | 22 nm       |
| CHIP SIZE                       | 550 mm<sup>2</sup>  |           |           |             |             |
| NUMBER OF TRANSISTORS (LOGIC)   | 553 M               | 1 Billion | 2 Billion | 4.5 Billion | 8.5 Billion |
| DRAM CAPACITY                   | 1.0 Gbits           | 2.0 Gbits | 4.3 Gbits | 8.5 Gbits   | 35 Gbits    |
| MAXIMUM CLOCK FREQUENCY         | 4.1 GHz             | 9.3 GHz   | 15 GHz    | 23 GHz      | 40 GHz      |
| MINIMUM SUPPLY VOLTAGE          | 0.9 V               | 0.8 V     | 0.7 V     | 0.6 V       | 0.5 V       |
| MAXIMUM POWER                   | 150 W               | 190 W     | 200 W     | 200 W       | 200 W       |

### A funny thing happened on the way to 45 nm

Power decreases with scaling...

(Figure: normalized power, on a log scale, versus year from 1990 to 2020 for technology nodes from 500 nm down to 22 nm, with one curve for dynamic power and one for static (leakage) power.)


### **Summary of relationships**

- (+) If V increases, speed (performance) increases
- (-) If V increases, power (heat) increases
- (+) If L decreases, speed (performance) increases
- (?) If L decreases, power (heat) does what?
  - P could improve because of lower C
  - P could increase because many more devices switch
  - P could increase because many more devices switch faster!

#### **Need to carefully consider tradeoffs between speed and heat**

### A bit on device performance...

- One way to think about switching time:
  - Charge is carried by electrons
  - Time for charge to cross channel = length/speed
    - $= L^2/(\mu V_{ds})$
- Thus, to make a device faster, we want to either increase  $V_{ds}$  or decrease feature sizes (i.e. L)
- What about power (i.e. heat)?
  - <u>Dynamic</u> power is:  $P_{dyn} = C_L V_{dd}^2 f_{0-1}$
    - $C_L = (\varepsilon_{ox} W L)/d$
      - $\varepsilon_{ox}$ = dielectric (oxide permittivity), $WL$ = parallel plate area, $d$ = distance between gate and substrate
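The relationships on this slide can be tried out numerically. The sketch below is only an illustration: it uses unitless placeholder values rather than real device parameters, and the function and argument names are made up for this example. It encodes the channel transit time (length squared over mobility times drain-source voltage), the parallel-plate load capacitance, and the dynamic power formula.

```python
def transit_time(length, mobility, v_ds):
    """Time for charge to cross the channel: L^2 / (mobility * V_ds)."""
    return length ** 2 / (mobility * v_ds)


def gate_capacitance(eps_ox, width, length, d):
    """Parallel-plate capacitance: (eps_ox * W * L) / d."""
    return eps_ox * width * length / d


def dynamic_power(c_load, v_dd, f_switch):
    """Dynamic power: C_L * V_dd^2 * f."""
    return c_load * v_dd ** 2 * f_switch


# Placeholder, unitless baseline; only the ratios are meaningful here.
base = transit_time(length=1.0, mobility=1.0, v_ds=1.0)
print("halve L      ->", transit_time(0.5, 1.0, 1.0) / base, "x switching time")
print("double V_ds  ->", transit_time(1.0, 1.0, 2.0) / base, "x switching time")
print("double V_dd  ->", dynamic_power(1.0, 2.0, 1.0) / dynamic_power(1.0, 1.0, 1.0), "x dynamic power")
```

Halving L cuts the transit time to a quarter, while doubling the voltage only halves it and at the same time quadruples dynamic power, which is the speed versus heat tension the lecture keeps returning to.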



### A funny thing happened on the way to 45 nm

- Speed increases with scaling...
- Power decreases with scaling...

### Why the clock flattening? POWER!!!!




### (Short term?) Solution

(Figure: Intel Centrino processor technology advertisement titled "High art meets high-tech.", with a list of "Top 5 Must-Haves" that includes a powerful processor, fast data transfer, high-speed wireless, and enhanced video.)



- Processor complexity is good enough
- Transistor sizes can still scale
- Slow processors down to manage power
- Get performance from...

### **Parallelism**

**Top 5 Must-Haves** 

POWERFUL PROCESSOR

A portrait of performance. "My generative portraits are demanding on the processors in my laptop, as they continuously manipulate video," says Lincoln. Thankfully, the dual-core performance of Intel Centrino processor technology can handle intensive tasks with flying colors.

(i.e., 1 processor with a 1 ns clock cycle vs. 2 processors, each with a 2 ns clock cycle)
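A rough way to see why the two-processor option is attractive is to plug both configurations into the dynamic power formula from the device-performance slide, $P_{dyn} = C_L V_{dd}^2 f$. The sketch below rests on assumptions that are not in the lecture: the workload is perfectly parallel, the slower clock allows a lower supply voltage (the 1.0 V and 0.7 V values are hypothetical), and each core has the same switched capacitance.

```python
def dynamic_power(c_load, v_dd, freq_hz):
    """Dynamic power: C_L * V_dd^2 * f."""
    return c_load * v_dd ** 2 * freq_hz


def chip(cores, cycle_ns, v_dd, c_load=1e-9):
    """Aggregate throughput and dynamic power for identical cores.

    c_load is a placeholder switched capacitance per core (assumption).
    """
    freq_hz = 1e9 / cycle_ns
    return {
        "throughput": cores * freq_hz,  # cycles of work per second, ideal scaling
        "power": cores * dynamic_power(c_load, v_dd, freq_hz),
    }


single = chip(cores=1, cycle_ns=1.0, v_dd=1.0)
dual = chip(cores=2, cycle_ns=2.0, v_dd=0.7)  # hypothetical lower voltage

print("throughput ratio (dual/single):", dual["throughput"] / single["throughput"])
print("power ratio (dual/single):     ", round(dual["power"] / single["power"], 2))
```

Under these assumptions the two configurations deliver the same aggregate throughput, but the dual-core version uses about half the dynamic power. If the voltage cannot be lowered, the power advantage disappears, and real programs rarely parallelize perfectly.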

### Are there design problems and issues unique to parallel processing on multi-core chips?


### Issues


- Not that much different from those listed earlier:
  - Cache Coherency
  - Contention
  - Latency
  - Reliability
  - Languages
  - Algorithms
- In order of priority...
  - Algorithms / Languages
  - Contention / Latency
  - Cache coherency

Are there parallel processing models more suitable to chip-level systems?

**Board examples** 

## Multithreading

- Idea:
  - Performing multiple threads of execution in parallel
    - Replicate registers, PC, etc.
  - Fast switching between threads
- Flavors:
  - Fine-grain multithreading
    - Switch threads after each cycle
    - Interleave instruction execution
    - If one thread stalls, others are executed
  - Coarse-grain multithreading
    - Only switch on long stall (e.g., L2-cache miss)
    - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
  - SMT (Simultaneous Multi-Threading)
    - Especially relevant for superscalar
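The difference between the fine-grain and coarse-grain flavors can be illustrated with a toy single-issue simulator. This is a sketch under simplifying assumptions, not a model of any real processor: each thread is a list of numbers giving the stall (in cycles) that follows each instruction, thread switches are free, and the `long_stall` threshold of 4 cycles is arbitrary. SMT is not modeled.

```python
def simulate(threads, policy, long_stall=4):
    """Return the number of cycles needed to issue every instruction.

    threads: list of lists; each entry is the stall (in cycles) that
             follows that instruction before its thread can issue again.
    policy:  "fine" rotates to a different thread every cycle;
             "coarse" leaves the running thread only on a long stall.
    """
    n = len(threads)
    pc = [0] * n          # replicated per-thread program counters
    ready_at = [0] * n    # first cycle in which each thread may issue again
    last_stall = [0] * n  # length of each thread's most recent stall
    current = 0
    cycle = 0

    def done(t):
        return pc[t] >= len(threads[t])

    def runnable(t):
        return not done(t) and ready_at[t] <= cycle

    while not all(done(t) for t in range(n)):
        # By default, consider every thread, starting from `current`.
        candidates = [(current + i) % n for i in range(n)]
        if policy == "coarse":
            short_wait = last_stall[current] < long_stall
            if not done(current) and (runnable(current) or short_wait):
                candidates = [current]  # stay put; short stalls waste cycles
        chosen = next((t for t in candidates if runnable(t)), None)
        if chosen is not None:
            stall = threads[chosen][pc[chosen]]
            pc[chosen] += 1
            ready_at[chosen] = cycle + 1 + stall
            last_stall[chosen] = stall
            current = (chosen + 1) % n if policy == "fine" else chosen
        cycle += 1
    return cycle


short_stalls = [[1, 1, 1, 1], [1, 1, 1, 1]]    # e.g., data hazards
long_stalls = [[0, 0, 10, 0], [0, 0, 10, 0]]   # e.g., an L2-cache miss

for name, workload in (("short", short_stalls), ("long", long_stalls)):
    print(name, "fine:", simulate(workload, "fine"),
          "coarse:", simulate(workload, "coarse"))
```

With the short-stall workload, fine-grain interleaving fills every cycle (8 cycles) while coarse-grain idles through each one-cycle stall (14 cycles). With the long-stall workload both policies finish in 17 cycles, compared with 14 cycles for one such thread running alone, so most of the second thread's work is overlapped with the first thread's miss.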

#### Refer to this picture:






### Impact of modern processing principles (Lots of "state")

- User:
  - state used for application execution
- Supervisor:
  - state used to manage user state
- Machine:
  - state that configures the system
- Transient:
  - state used during instruction execution
- Access-Enhancing:
  - state used to simplify translation of other state names
- Latency-Enhancing:
  - state used to reduce latency to other state values


### Impact of modern processing principles (Total State vs. Time)



## Comparison: multi-core vs SMT

- Multi-core:
  - Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
  - However, great with thread-level parallelism
- SMT
  - Can have one large and fast superscalar core
  - Great performance on a single thread
  - Mostly still only exploits instruction-level parallelism

## The memory hierarchy

- If simultaneous multithreading only:
  - all caches shared
- Multi-core chips:
  - L1 caches private
  - L2 caches private in some architectures (examples: AMD Opteron, AMD Athlon, Intel Pentium D) and shared in others
  - Memory is always shared
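The sharing rules above can be written down as a small model that groups hardware threads by the cache instance they use. This is an illustrative sketch only: the function name, the `(core, thread)` tuples, and the `l2_shared` flag are inventions for this example, and real chips have more detail (for instance, separate instruction and data L1 caches).

```python
from collections import defaultdict


def cache_groups(num_cores, threads_per_core, l2_shared):
    """Group hardware threads by the L1, L2, and memory instance they use."""
    groups = {
        "L1": defaultdict(list),
        "L2": defaultdict(list),
        "memory": defaultdict(list),
    }
    for core in range(num_cores):
        for thread in range(threads_per_core):
            hw_thread = (core, thread)
            groups["L1"][core].append(hw_thread)  # one L1 per core
            groups["L2"][0 if l2_shared else core].append(hw_thread)
            groups["memory"][0].append(hw_thread)  # memory is always shared
    return {level: list(g.values()) for level, g in groups.items()}


# Simultaneous multithreading only: one core, two threads, all caches shared.
print(cache_groups(num_cores=1, threads_per_core=2, l2_shared=True))

# Multi-core with private L2: each core has its own L1 and L2.
print(cache_groups(num_cores=2, threads_per_core=1, l2_shared=False))
```

Each inner list is a set of hardware threads that share one cache instance; any level whose list holds threads from different cores is where coherency and contention questions arise.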


### Or can do both...

- Dual-core Intel Xeon processors
- Each core is hyper-threaded
- Private L1 caches
- Shared L2 caches




### **Real life examples...** Designs with private L2 caches

(Figure: two dual-core block diagrams showing the cores, their L1 caches, L2 caches, L3 caches, and memory.)

Both L1 and L2 are private

A design with L3 caches

Example: Intel Itanium 2