#### General context: Multiprocessors

Lecture 28 Introduction to Parallel Processing and some Architectural Ramifications


A multiprocessor is any computer with several processors

- SIMD
  - Single instruction, multiple data
  - Modern graphics cards
- MIMD
  - Multiple instructions, multiple data

#### University of Notre Dame


#### Multiprocessing

- Flynn's Taxonomy of Parallel Machines
  - How many Instruction streams?
  - How many Data streams?

(note: we'll spend most time talking about just 1 class...)

- SISD: Single I Stream, Single D Stream
  - A uniprocessor
- SIMD: Single I, Multiple D Streams
  - Each "processor" works on its own data
  - But all execute the same instrs in lockstep
  - E.g. a vector processor or MMX


# Flynn's Taxonomy

- MISD: Multiple I, Single D Stream
  - Not used much
- MIMD: Multiple I, Multiple D Streams
  - Each processor executes its own instructions and operates on its own data
  - This is your typical off-the-shelf multiprocessor (made using a bunch of "normal" processors)
    - Not the same as superscalar: each node is itself superscalar
    - Lessons will apply to multi-core too!
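
The four classes above follow mechanically from the two questions Flynn's taxonomy asks. As a minimal sketch (the function name `flynn_class` is made up for illustration), the classification can be written as:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Name the Flynn class for a machine with the given stream counts."""
    i = "S" if instruction_streams == 1 else "M"  # Single or Multiple I streams
    d = "S" if data_streams == 1 else "M"         # Single or Multiple D streams
    return f"{i}I{d}D"


print(flynn_class(1, 1))  # a uniprocessor
print(flynn_class(1, 8))  # e.g. a vector unit: one instruction, many data items
print(flynn_class(4, 4))  # e.g. a 4-processor multiprocessor
```

Only the distinction between "one" and "more than one" stream matters for the class name; the exact counts used here are arbitrary.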


# Or...in pictures!


#### **Multiprocessors**

- Why do we need multiprocessors?
  - Uniprocessor speed improving fast
  - But there are things that need even more speed
    - Wait for a few years for Moore's law to catch up?
    - Or use multiple processors and do it now?
    - (Is Moore's Law still catching up? M/C?)

#### Multiprocessor software problem

- Most code is sequential (for uniprocessors)
  - MUCH easier to write and debug
- Correct parallel code very, very difficult to write
  - Efficient and correct is much more difficult
  - Debugging even more difficult

Let's look at a few MIMD example configurations...




Figure: Lemieux cluster, Pittsburgh Supercomputing Center

#### MIMD Multiprocessors

Multiprocessor memory types

- Shared memory: In this model, there is one (large) common shared memory for all processors
- Distributed memory: In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else

Figure (centralized shared memory): processors, each with one or more levels of cache, a single shared memory, and the I/O system. Note: just 1 memory. © 2003 Elsevier Science (USA). All rights reserved.


MIMD Multiprocessors



Multiple, distributed memories here.

© 2003 Elsevier Science (USA). All rights reserved


# Before, we did parallel processing by chaining together separate processors.


Now we can do it on the same chip.


A multi-core processor is a special kind of multiprocessor: all processors are on the same chip

- Multi-core processors are MIMD: Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
- Multi-core is a shared memory multiprocessor: All cores share the same memory
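
As a rough illustration of the MIMD, shared-memory model described above, the sketch below starts two threads that run different functions (multiple instructions) on different halves of one shared list (multiple data). The names `sum_part` and `max_part` are made up for this example. Note that in standard CPython the global interpreter lock may serialize these threads, so this shows the programming model, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))  # one shared memory region visible to every thread


def sum_part(lo: int, hi: int) -> int:
    """First instruction stream: add up one slice of the shared data."""
    return sum(data[lo:hi])


def max_part(lo: int, hi: int) -> int:
    """Second instruction stream: find the largest value in another slice."""
    return max(data[lo:hi])


with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(sum_part, 0, 4)  # thread 1: different code, lower half
    f2 = pool.submit(max_part, 4, 8)  # thread 2: different code, upper half
    total, largest = f1.result(), f2.result()

print(total, largest)
```

On a multi-core chip, an operating system is free to schedule the two threads on different cores; both still read the same `data` because memory is shared.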


# Single-core computer





# Multi-core CPU chip


- The cores fit on a single processor socket
- Also called CMP (Chip Multi-Processor)

| core 1 | core 2 | core 3 | core 4 |
|--------|--------|--------|--------|


# What applications benefit from multi-core?

- Database servers
- Web servers (Web commerce)
- Compilers
- Multimedia applications
- Scientific applications, CAD/CAM
- In general, applications with *thread-level parallelism* (as opposed to instruction-level parallelism)






# More examples

- Editing a photo while recording a TV show through a digital video recorder
- Downloading software while running an anti-virus program
- "Anything that can be threaded today will map efficiently to multi-core"
- BUT: some applications difficult to parallelize


# The cores run in parallel


# Amdahl's Law

- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?

- Speedup equation:

$$\text{Speedup} = \frac{1}{(1 - F_{parallelizable}) + F_{parallelizable}/100} = 90$$

- Solving: F<sub>parallelizable</sub> = 0.999
- Need sequential part to be 0.1% of original time
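
The 0.999 figure can be checked numerically. The sketch below (function names are illustrative, not from the lecture) evaluates Amdahl's Law and inverts it to find the parallelizable fraction needed for a target speedup:

```python
def amdahl_speedup(f_parallel: float, processors: int) -> float:
    """Speedup when a fraction f_parallel of the work is spread over the processors."""
    return 1.0 / ((1.0 - f_parallel) + f_parallel / processors)


def required_fraction(target_speedup: float, processors: int) -> float:
    """Solve 1 / ((1 - F) + F / n) = target for F."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


f = required_fraction(90, 100)
print(round(f, 3))                        # parallelizable fraction needed
print(round(amdahl_speedup(f, 100), 1))   # plugging it back in recovers the target
```

The closed form follows from rearranging the equation above: $(1 - F) + F/n = 1/\text{target}$, so $F\,(1 - 1/n) = 1 - 1/\text{target}$.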


#### Scaling Example

- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × $t_{add}$
- 10 processors
  - Time = 10 × $t_{add}$ + 100/10 × $t_{add}$ = 20 × $t_{add}$
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × $t_{add}$ + 100/100 × $t_{add}$ = 11 × $t_{add}$
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × $t_{add}$
- 10 processors
  - Time = 10 × $t_{add}$ + 10000/10 × $t_{add}$ = 1010 × $t_{add}$
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × $t_{add}$ + 10000/100 × $t_{add}$ = 110 × $t_{add}$
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load balanced
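
The arithmetic in the scaling example can be reproduced with a few lines. This is a sketch under the example's own assumptions: the scalar additions stay sequential, the matrix additions divide evenly across processors, and every addition costs one $t_{add}$. The function name `scaling` is made up for illustration.

```python
def scaling(n_scalars: int, matrix_dim: int, processors: int):
    """Return (speedup, fraction of potential) in units where t_add = 1."""
    sequential = n_scalars                # scalar sum is not parallelized
    parallel = matrix_dim * matrix_dim    # one addition per matrix element
    time_one = sequential + parallel
    time_p = sequential + parallel / processors  # assumes perfect load balance
    speedup = time_one / time_p
    return speedup, speedup / processors


for dim in (10, 100):
    for p in (10, 100):
        s, frac = scaling(10, dim, p)
        print(f"{dim}x{dim} matrix, {p} processors: "
              f"speedup {s:.1f} ({frac:.0%} of potential)")
```

Growing the matrix while keeping the sequential part fixed is what moves the 100-processor case from 10% to 91% of potential.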

# Speedup Challenge

- To get full benefit of parallelism need to be able to parallelize the entire program!
- Amdahl's Law
  - Time<sub>after</sub> = (Time<sub>affected</sub>/Improvement)+Time<sub>unaffected</sub>
  - Example: We want 100 times speedup with 100 processors
    - This requires Time<sub>unaffected</sub> = 0!
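
To see why, write the speedup as the ratio of the time before to the time after, using the form of Amdahl's Law above with Improvement = 100 (the symbol $\text{Time}_{before}$ is introduced here only for this derivation):

$$
\text{Speedup} = \frac{\text{Time}_{before}}{\text{Time}_{after}}
= \frac{\text{Time}_{affected} + \text{Time}_{unaffected}}{\text{Time}_{affected}/100 + \text{Time}_{unaffected}} = 100
$$

$$
\text{Time}_{affected} + \text{Time}_{unaffected} = \text{Time}_{affected} + 100 \times \text{Time}_{unaffected}
\;\Rightarrow\; 99 \times \text{Time}_{unaffected} = 0
$$

So any nonzero sequential time makes a full 100 times speedup on 100 processors impossible.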


## Cache Coherence Problem

- Shared memory easy with no caches
  - P1 writes, P2 can read
  - Only one copy of data exists (in memory)
- Caches store their own copies of the data
  - Those copies can easily get inconsistent
  - Classical example: adding to a sum
    - P1 loads allSum, adds its mySum, stores new allSum
    - P1's cache now has dirty data, but memory not updated
    - P2 loads allSum from memory, adds its mySum, stores allSum
    - P2's cache also has dirty data
    - Eventually P1 and P2's cached data will go to memory
    - Regardless of write-back order, the final value ends up wrong
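
The allSum example can be replayed with a toy model of two private write-back caches and no coherence protocol. Everything here is a simplification for illustration: the caches are plain dictionaries, and the mySum values 5 and 7 are arbitrary.

```python
def simulate(write_back_order):
    """Replay the allSum example and return the final value left in memory."""
    memory = {"allSum": 0}
    caches = {"P1": {}, "P2": {}}   # private caches, no coherence between them
    my_sum = {"P1": 5, "P2": 7}     # arbitrary partial sums

    for p in ("P1", "P2"):
        # The load misses in the private cache, so the value comes from memory.
        # P2 therefore never sees P1's update, which is still sitting in P1's cache.
        value = memory["allSum"]
        # The store updates only the local cache (a dirty line); memory is unchanged.
        caches[p]["allSum"] = value + my_sum[p]

    # Eventually the dirty lines are written back, in some order.
    for p in write_back_order:
        memory["allSum"] = caches[p]["allSum"]
    return memory["allSum"]


print(simulate(("P1", "P2")))  # P2's write-back lands last
print(simulate(("P2", "P1")))  # P1's write-back lands last
```

The correct total would be 5 + 7 = 12, but whichever cache writes back last overwrites the other, so memory ends up holding only one processor's contribution.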


# Why multi-core?

- Difficult to make single-core clock frequencies even higher
- Deeply pipelined circuits:
  - heat problems
  - speed of light problems
  - difficult design and verification
  - large design teams necessary
  - server farms need expensive air-conditioning
- Many new applications are multithreaded
- General trend in computer architecture (shift towards more parallelism)



Let's look back to Lecture 01

Seems like lots of trouble. Why do it? Because we sort of have to...




# SMT not a "true" parallel processor

- Enables better threading (e.g. up to 30%)
- OS and applications perceive each simultaneous thread as a separate "virtual processor"
- The chip has only a single copy of each resource
- Compare to multi-core: each core has its own copy of resources

Multi-core: threads can run on separate cores




# Combining Multi-core and SMT

- Cores can be SMT-enabled (or not)
- The different combinations:
  - Single-core, non-SMT: standard uniprocessor
  - Single-core, with SMT
  - Multi-core, non-SMT
  - Multi-core, with SMT
- The number of SMT threads:
  - 2, 4, or sometimes 8 simultaneous threads
- Intel calls them "hyper-threads"
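
The four combinations differ only in how many hardware threads the OS sees, which is the number of cores times the SMT threads per core. A minimal sketch (the helper `hardware_threads` is made up for illustration):

```python
import os


def hardware_threads(cores: int, smt_threads_per_core: int = 1) -> int:
    """Number of threads that can run at once; the OS sees each as a 'virtual processor'."""
    return cores * smt_threads_per_core


print(hardware_threads(1))      # single-core, non-SMT: standard uniprocessor
print(hardware_threads(1, 2))   # single-core, with 2-way SMT
print(hardware_threads(2))      # dual-core, non-SMT
print(hardware_threads(2, 2))   # dual-core, with 2-way SMT

# For comparison: the number of logical CPUs reported on the machine running this.
print(os.cpu_count())
```

The value printed by `os.cpu_count()` depends on the machine, and it counts logical CPUs, so SMT threads are included.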


# SMT Dual-core: all four threads can run concurrently




# Comparison: multi-core vs SMT

Advantages/disadvantages?


- Multi-core:
  - Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture)
  - However, great with thread-level parallelism
- SMT:
  - Can have one large and fast superscalar core
  - Great performance on a single thread
  - Mostly still only exploits instruction-level parallelism

# The memory hierarchy

- If simultaneous multithreading only: all caches shared
- Multi-core chips:
  - L1 caches private
  - L2 caches private in some architectures and shared in others
- Memory is always shared


- Dual-core Intel Xeon processors
- Each core is hyper-threaded
- Private L1 caches
- Shared L2 caches



# Designs with private L2 caches


| CORE 1   | CORE 0   |
|----------|----------|
| L1 cache | L1 cache |
| L2 cache | L2 cache |

Both L1 and L2 are private; the cores share memory.

Examples: AMD Opteron, AMD Athlon, Intel Pentium D



A design with L3 caches

Example: Intel Itanium 2


# Multithreading

- Replicate registers, PC, etc.
- Fast switching between threads
- Fine-grain multithreading
  - Switch threads after each cycle
  - Interleave instruction execution
  - If one thread stalls, others are executed
- Coarse-grain multithreading
  - Only switch on long stall (e.g., L2-cache miss)
  - Simplifies hardware, but doesn't hide short stalls (e.g., data hazards)
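
The difference between the two switching policies can be seen in a toy simulation of one issue slot shared by several threads. This is a sketch under a deliberately simple, made-up stall model: each instruction is a number `s`, meaning the thread cannot issue for `s` cycles after that instruction. The `long_stall` threshold is also an assumption.

```python
def simulate(threads, policy, long_stall=10):
    """Return (total_cycles, idle_cycles) for the given policy ("fine" or "coarse")."""
    n = len(threads)
    pc = [0] * n                 # replicated per-thread program counters
    ready_at = [0] * n           # first cycle at which each thread may issue again
    in_long_stall = [False] * n  # was the thread's latest stall a long one?
    cur = 0                      # fine: next thread to try; coarse: thread owning the pipeline
    cycle = idle = 0

    def can_issue(t):
        return pc[t] < len(threads[t]) and ready_at[t] <= cycle

    while any(pc[t] < len(threads[t]) for t in range(n)):
        if policy == "fine":
            # Rotate through the threads every cycle, skipping stalled ones.
            order = [(cur + k) % n for k in range(n)]
            choice = next((t for t in order if can_issue(t)), None)
        else:
            finished = pc[cur] >= len(threads[cur])
            if can_issue(cur):
                choice = cur  # stay on the current thread
            elif finished or in_long_stall[cur]:
                # Switch only when the current thread is done or in a long stall.
                order = [(cur + k) % n for k in range(1, n)]
                choice = next((t for t in order if can_issue(t)), None)
            else:
                choice = None  # short stall: the pipeline just waits
        if choice is None:
            idle += 1
        else:
            stall = threads[choice][pc[choice]]
            pc[choice] += 1
            ready_at[choice] = cycle + 1 + stall
            in_long_stall[choice] = stall >= long_stall
            cur = (choice + 1) % n if policy == "fine" else choice
        cycle += 1
    return cycle, idle


short_stalls = [[1, 1, 1, 1], [1, 1, 1, 1]]  # two threads, a 1-cycle stall after each instruction
print(simulate(short_stalls, "fine"))    # (8, 0): stalls hidden by interleaving
print(simulate(short_stalls, "coarse"))  # (14, 6): short stalls are not hidden
```

With only short stalls, the fine-grain policy fills every cycle by alternating threads, while the coarse-grain policy idles through each stall because none is long enough to trigger a switch.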