Lecture 26

GPU Wrap Up

## Suggested Readings

- H&P: Chapter 7 especially 7.1-7.8
  - (Over next 2 weeks)
- Introduction to Parallel Computing
  - <u>https://computing.llnl.gov/tutorials/parallel\_comp/</u>
- POSIX Threads Programming
  - <u>https://computing.llnl.gov/tutorials/pthreads/</u>
- How GPUs Work
  - <u>www.cs.virginia.edu/~gfx/papers/pdfs/59_HowThingsWork.pdf</u>

University of Notre Dame

CSE 30321 – Lecture 26 – GPU Wrap Up

- Necessary processing
- Example problem:
  - Generic CPU pipeline
  - GPU-based vs. uni-processor Z-buffer problem
- What does a GPU architecture look like?
  - Explain in context of SIMD
- Applicability to other computing problems

## **Recap: How is a frame rendered?**

Helpful to consider how the 2 standard graphics APIs – OpenGL and Direct3D – work.

- These APIs define a logical graphics pipeline that is mapped onto GPU hardware and processors – along with programming models and languages for the programmable stages
- In other words, API takes primitives like points, lines and polygons, and converts them into pixels
- How does the graphics pipeline do this?
  - First, important to note that "pipeline" here does not mean the 5-stage processor pipeline we talked about earlier
  - Here, "pipeline" describes the sequence of steps that prepares an image/scene for rendering

# Recap: How is a frame rendered? (Direct3D pipeline)


## **Example: Z-buffer**



http://blog.yoz.sk/examples/pixelBenderDisplacement/zbuffer1Map.jpg
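
The slide itself only shows a depth-map image, so the following is a minimal sketch of the general Z-buffer idea rather than the lecture's exact example. The function name `zbuffer_render` and the `(x, y, depth, color)` fragment format are assumptions made for illustration; a smaller depth value is assumed to mean "closer to the viewer".

```python
# Minimal depth-buffer (Z-buffer) sketch; the fragment format is hypothetical.
def zbuffer_render(width, height, fragments, background=0):
    """Resolve visibility per pixel by keeping the closest fragment.

    fragments: iterable of (x, y, depth, color); smaller depth = closer.
    """
    depth = [[float("inf")] * width for _ in range(height)]
    color = [[background] * width for _ in range(height)]
    for x, y, z, c in fragments:
        # Each pixel is tested independently of every other pixel, which is
        # why this work can be spread across many parallel processors.
        if z < depth[y][x]:
            depth[y][x] = z
            color[y][x] = c
    return color


# Two fragments compete for pixel (0, 0); the closer one (depth 2.0) wins.
frame = zbuffer_render(
    2, 1,
    [(0, 0, 5.0, "red"), (0, 0, 2.0, "blue"), (1, 0, 3.0, "green")],
)
print(frame)
```

A uni-processor walks the fragments one at a time as in the loop above; a GPU can apply the same compare-and-keep step to many pixels at once.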


## **GPUs**

- GPU = <u>Graphics Processing Unit</u>
  - Efficient at manipulating computer graphics
  - Graphics accelerator uses custom HW that makes the mathematical operations behind graphics fast/efficient
    - Why SIMD? Do same thing to each pixel
- Often part of a *heterogeneous* system
  - GPUs don't do all things CPU does
  - Good at some specific things
    - e.g., matrix-vector operations
- GPU HW:
  - No multi-level caches
  - Hide memory latency with threads
    - To process all pixel data
  - GPU main memory oriented toward bandwidth




**Peak Performance:**

- 8800 has floating point multiply-add instruction
  - (see your project for utility of this instruction)
- Peak performance obtained if all SP run in parallel:

 $16 \text{ MP} \times \frac{8 \text{ SP}}{\text{MP}} \times \frac{2 \text{ FLOPs / instruction}}{\text{SP}} \times \frac{1 \text{ instruction}}{\text{clock}} \times \frac{1.35 \times 10^9 \text{ clocks}}{\text{s}} = \frac{345.6 \text{ GFLOPS}}{\text{s}}$

- What about 500?
  - More limited knowledge about HW, but assume same multiply-add instruction:

 $16 \text{ MP} \times \frac{32 \text{ SP}}{\text{MP}} \times \frac{2 \text{ FLOPs / instruction}}{\text{SP}} \times \frac{1 \text{ instruction}}{\text{clock}} \times \frac{0.75 \times 10^9 \text{ clocks}}{\text{s}} = \frac{768 \text{ GFLOPS}}{\text{s}}$

- Speedup ~ 2.22
  - Consistent with Moore's Law
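
As a quick check of the peak-performance arithmetic, the sketch below multiplies out the same factors used on this slide. The helper name `peak_flops` and its parameter names are assumptions for illustration; the numeric inputs are the ones given for the 8800 and the 500.

```python
def peak_flops(mps, sps_per_mp, flops_per_instr, instrs_per_clock, clock_hz):
    """Peak FLOPs per second, assuming every SP issues every clock."""
    return mps * sps_per_mp * flops_per_instr * instrs_per_clock * clock_hz


# Inputs taken from the slide; multiply-add counts as 2 FLOPs per instruction.
peak_8800 = peak_flops(16, 8, 2, 1, 1.35e9)
peak_500 = peak_flops(16, 32, 2, 1, 0.75e9)
speedup = peak_500 / peak_8800

print(f"8800: {peak_8800 / 1e9:.1f} GFLOPS")
print(f"500:  {peak_500 / 1e9:.1f} GFLOPS")
print(f"speedup: {speedup:.2f}")
```

Note that the 500 reaches the higher peak despite a slower clock, because it has 4X as many SPs per MP.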

# **Example NVIDIA architecture**




FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.


# **Cache and Memory in NVIDIA GPU**

- On 8800, each MP has just 16 Kbytes of cache
  - Shared by 8 SPs
- Also, each MP = 8192 registers
  - 1024 per SP (8192 registers / 8 SPs)
- Memory:
  - 768 Mbytes DRAM @ 900 MHz
  - DRAM is wide
    - 8 Bytes per partition
    - Want >> bandwidth given parallel nature of GPU
    - Often DRAM components specific to GPU
- How much data can memory interface deliver to GPU?

 $6 \text{ Partitions} \times \frac{8 \text{ Bytes}}{\text{Transaction}} \times \frac{2 \text{ Transactions}}{\text{clock}} \times \frac{0.9 \times 10^9 \text{ clocks}}{\text{s}} = \frac{86.4 \text{ GBytes}}{\text{s}}$ 
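
The memory-interface product above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `memory_bandwidth` and its parameter names are assumptions, while the inputs (6 partitions, 8 bytes per transaction, 2 transactions per clock, 900 MHz) come from the slide.

```python
def memory_bandwidth(partitions, bytes_per_transaction,
                     transactions_per_clock, clock_hz):
    """Peak bytes per second the DRAM interface can deliver."""
    return (partitions * bytes_per_transaction
            * transactions_per_clock * clock_hz)


bandwidth_8800 = memory_bandwidth(6, 8, 2, 0.9e9)
print(f"{bandwidth_8800 / 1e9:.1f} GBytes/s")
```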

- Moore's Law: should see 2X improvement per technology generation
  - 500 = 40 nm technology, 8800 = 65 nm technology

### Data consumption vs. Data delivery

- What if each FLOP required 1 or 2 data words?
  - (Assume data word = 32 bits OR 4 bytes)
- What data delivery rate is required?

 $\frac{[4,8] \text{ Bytes}}{\text{FLOP}} \times \frac{345.6 \text{ GFLOPS}}{\text{s}} = \left[ \frac{1382.4 \text{ GBytes}}{\text{s}}, \frac{2764.8 \text{ GBytes}}{\text{s}} \right]$

- But, our memory interface can only deliver 1/16<sup>th</sup> or 1/32<sup>nd</sup> of this...
  - How do we mask memory latency?
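
To make the gap between data consumption and data delivery concrete, the sketch below compares the two rates. The constant names are assumptions for illustration; the values are the 8800 peak rate (345.6 GFLOPS), the 86.4 GBytes/s memory interface figure, and the 4-byte data word used in this lecture.

```python
PEAK_FLOPS = 345.6e9          # 8800 peak FLOPs per second (from the lecture)
DELIVERED_BYTES = 86.4e9      # memory interface bytes per second (from the lecture)
BYTES_PER_WORD = 4            # one 32-bit data word

shortfall = {}
for words_per_flop in (1, 2):
    required = PEAK_FLOPS * words_per_flop * BYTES_PER_WORD
    # How many times more data the SPs could consume than memory can supply.
    shortfall[words_per_flop] = required / DELIVERED_BYTES
    print(f"{words_per_flop} word(s)/FLOP: need {required / 1e9:.1f} GBytes/s, "
          f"{shortfall[words_per_flop]:.0f}x what memory delivers")
```

The 16x and 32x ratios are the same 1/16<sup>th</sup> and 1/32<sup>nd</sup> figures quoted above, seen from the demand side.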


## Threads mask memory latencies

- Each SP supports HW-based threads
- In 8800, MP manages group of 32 threads called a warp
  - Ideally:
    - All 32 threads execute the same instruction on different data
    - True SIMD!
    - All 8 SPs busy each CC...
- More realistically:
  - Don't always get true SIMD
  - Why?
    - Problem not 100% parallelizable
    - Still need to deal with conditions, etc. (e.g., BEQ...)
  - Interesting note:
    - No branch prediction HW
    - Just assign each execution path to a thread and pick 1!
- 8800 supports 24 warps
  - Can switch warps every 2 or 4 CCs

(Figure: Tesla Multiprocessor with SPs 1-8, showing the IF and ELSE execution paths for Warp0.)
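
The effect of branches on a warp can be illustrated with a deliberately simplified model. The sketch below assumes that each distinct execution path is issued to the whole warp in turn, with threads on the other paths idle; real hardware scheduling is more involved. The function name `warp_cycles` and the path labels are hypothetical, while the warp size (32) and SP count (8) come from the slide.

```python
import math

WARP_SIZE = 32    # threads per warp on the 8800 (from the slide)
SPS_PER_MP = 8    # SPs per multiprocessor (from the slide)


def warp_cycles(paths):
    """Estimate cycles and SP utilization for one instruction slot of a warp.

    paths: list of WARP_SIZE labels naming the branch path each thread takes.
    Simplifying assumption: one full pass over the warp per distinct path.
    """
    if len(paths) != WARP_SIZE:
        raise ValueError("expected one path label per thread in the warp")
    passes = len(set(paths))
    cycles_per_pass = math.ceil(WARP_SIZE / SPS_PER_MP)   # 32 threads on 8 SPs
    cycles = passes * cycles_per_pass
    # Useful thread-instructions divided by available SP slots.
    utilization = WARP_SIZE / (cycles * SPS_PER_MP)
    return cycles, utilization


uniform = warp_cycles(["if"] * 32)                    # all threads agree
diverged = warp_cycles(["if"] * 20 + ["else"] * 12)   # IF/ELSE split
print(uniform, diverged)
```

Under this model the uniform warp keeps all 8 SPs busy for 4 CCs, while the IF/ELSE split takes twice as long with half of the SP slots doing useful work.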

# **Programming GPUs**

- API language compilers target industry standard intermediate languages instead of machine instructions
  - GPU driver software generates optimized GPU-specific machine instructions
- Also, SW support for non-graphics programming:
  - NVIDIA has graphics cards that support API extension to C – CUDA ("Compute Unified Device Architecture")
    - Allows specialized functions from a normal C program to run on GPU's stream processors
    - Allows C programs that can benefit from integrated GPU(s) to use where appropriate, but also leverage conventional CPU

## A bit more on CUDA...

- Developed by NVIDIA to program GPU processors
- Unified C/C++ programming for heterogeneous CPU-GPU system
  - Runs on CPU, sends work to GPU
    - Involves data transfer from main memory & "thread dispatch"
      - "Thread dispatch" is piece of program for GPU
      - Programmer can specify # of threads/block + number of blocks to run on GPU
      - Can make threads within block share same local memory
        - Therefore communicate with loads and stores
      - CUDA compiler allocates registers to threads
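
The blocks-and-threads idea can be sketched without a GPU. The code below is a plain Python emulation, not real CUDA: `launch` and `saxpy_kernel` are hypothetical names, and the loops run sequentially where a GPU would run the threads in parallel. It only illustrates how a thread can derive the data element it owns from its block number, the block size, and its thread number within the block.

```python
def launch(kernel, num_blocks, threads_per_block, *args):
    """Emulate a CUDA-style launch: call kernel once per (block, thread)."""
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            kernel(block_idx, threads_per_block, thread_idx, *args)


def saxpy_kernel(block_idx, block_dim, thread_idx, n, a, x, y):
    """Each emulated thread performs one multiply-add: y[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx   # global element index
    if i < n:                                # extra threads in the last block do nothing
        y[i] = a * x[i] + y[i]


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
threads_per_block = 4
# Round up so that every element is covered by some thread.
num_blocks = (n + threads_per_block - 1) // threads_per_block
launch(saxpy_kernel, num_blocks, threads_per_block, n, 2.0, x, y)
print(y)
```

Here the programmer's choices are `threads_per_block` and `num_blocks`, matching the two launch parameters mentioned above.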


# **GPUs for other problems**

- More recently:
  - GPUs found to be more efficient than general purpose CPUs for many complex algorithms
    - Often things with massive amounts of vector ops
  - Example:
    - ATI, NVIDIA team with Stanford to do GPU-based computation for protein folding
    - Found to offer up to 40X improvement over more conventional approach


# **GPU Example**

