#### Lecture 01 Introduction to CSE 30321

#### Suggested reading: None.

#### My contact information...

- Michael Niemier
  - (Mike is fine.)
- Contact information:
  - 380 Fitzpatrick Hall
  - mniemier@nd.edu
  - (574) 631-3858
- Office hours:
  - Mon: noon-1:30
  - Thurs: 2:30-4:00
- About me...

#### Fundamental lesson(s) for this lecture:

- 1. Why *Computer Architecture* is important for your major – (CS, CPEG, EE, MATH, etc...)
- 2. What should you be able to do when you finish class?

#### Why should I care?

- If you're interested in SW...
  - Understanding how processor HW operates, how it's organized, and the interface between HW and a HLL will make you a better programmer.
    - See "the matrix multiply" & "the 4- to 8-core" examples...
- If you're interested in HW...
  - Technology is rapidly changing how microprocessor architectures are designed and organized, and has introduced new metrics like "performance per Watt"
    - The material covered in this class is the baseline required to both understand and to help develop the "state of the art"
    - See "the technology drive to multi-core" material

#### A little history... programs

First Draft of a Report on the EDVAC, by John von Neumann

Contract No. W-670-ORD-4926 between the United States Army Ordnance Department and the University of Pennsylvania

Moore School of Electrical Engineering, University of Pennsylvania, June 30, 1945

Stored program model has been around for a long time...

[Figure: program counter (index), program memory, data memory, processing logic]

How we process information hasn't changed much since the 1930s and 1940s.





#### A little history... Zuse's paradigm

- Konrad Zuse (1938) Z3 machine
  - Use binary numbers to encode information
  - Represent binary digits as on/off state of a current switch




#### Transistors used to manipulate/store 1s & 0s

Switch-level representation





Cross-sectional view

If we (i) apply a suitable voltage to the gate & (ii) then apply a suitable voltage between source and drain, current will flow.

#### Moore's Law

"Cramming more components onto integrated circuits."

- G.E. Moore, Electronics 1965

- Observation: DRAM transistor density doubles annually
  - Became known as "Moore's Law"
  - Actually, a bit off:
    - Density doubles every 18 months (now more like 24)
    - (in 1965 they only had 4 data points!)
- Corollaries:
  - Cost per transistor halves annually (18 months)
  - Power per transistor decreases with scaling
  - Speed increases with scaling
    - Of course, it depends on how small you try to make things
      - (i.e. no exponential lasts forever)

Remember these!




#### Feature sizes...



Figure 2 2005 Definition of Pitches

#### Moore's Law

- Moore's Curve is a self-fulfilling prophecy
  - 2X every 2 years means ~3% per month
    - i.e. $((1 \times 1.03) \times 1.03) \times 1.03 \ldots$ 24 times $\approx 2$
  - Can use 3% per month to judge performance features
  - If a feature adds 9 months to the schedule, it should add at least 30% to performance
    - ($1.03^9 = 1.30 \Rightarrow 30\%$)
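The 3%-per-month rule of thumb is easy to check numerically. The sketch below repeats the slide's multiplication in a loop; the function names `compound_growth` and `break_even_gain` are illustrative choices, not something from the lecture.

```c
#include <assert.h>

/* Growth factor after `months` months at a fixed monthly rate
   (0.03 for the slide's ~3% per month). Mirrors the slide's
   ((1 x 1.03) x 1.03) x 1.03 ... chain of multiplications. */
double compound_growth(double monthly_rate, int months)
{
    assert(months >= 0);
    double factor = 1.0;
    for (int m = 0; m < months; m++) {
        factor *= 1.0 + monthly_rate;
    }
    return factor;
}

/* Performance gain (as a fraction, e.g. 0.30 for 30%) a feature must
   deliver to keep pace with the curve if it delays the schedule by
   `delay_months`. */
double break_even_gain(double monthly_rate, int delay_months)
{
    return compound_growth(monthly_rate, delay_months) - 1.0;
}
```

Calling `compound_growth(0.03, 24)` gives roughly 2.03 (the "2X every 2 years"), and `break_even_gain(0.03, 9)` gives roughly 0.30, matching the 30% figure above.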

#### A bit on device performance...

- One way to think about switching time:
  - Charge is carried by electrons
  - Time for charge to cross channel = length/speed
    - $= L^2/(\mu V_{ds})$
  - Thus, to make a device faster, we want to either increase $V_{ds}$ or decrease feature sizes (i.e. $L$)
- What about power (i.e. heat)?
  - Dynamic power is: $P_{dyn} = C_L V_{dd}^2 f_{0-1}$
    - $C_L = (\epsilon_{ox} W L)/d$
    - $\epsilon_{ox}$ = dielectric, $WL$ = parallel plate area, $d$ = distance between gate and substrate
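The expression $L^2/(\mu V_{ds})$ for the channel-crossing time can be recovered in a few steps. This is a sketch that assumes carriers drift at a velocity proportional to the electric field along the channel (a low-field approximation); $v$ (carrier velocity), $E$ (channel field), and $\mu$ (carrier mobility) are symbols introduced here for the derivation:

$$
t = \frac{L}{v}, \qquad v = \mu E, \qquad E = \frac{V_{ds}}{L}
\quad\Longrightarrow\quad
t = \frac{L}{\mu\,(V_{ds}/L)} = \frac{L^2}{\mu V_{ds}}
$$

In this model, halving $L$ cuts the crossing time by a factor of four, while doubling $V_{ds}$ only halves it.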


#### **Summary of relationships**

- (+) If V increases, speed (performance) increases
- (-) If V increases, power (heat) increases
- (+) If L decreases, speed (performance) increases
- (?) If L decreases, power (heat) does what?
  - P could improve because of lower C
  - P could increase because >> # of devices switch
  - P could increase because >> # of devices switch faster!

Need to carefully consider tradeoffs between speed and heat.

#### So, what's happened?



2005 projection was for 5.2 GHz - and we didn't make it in production. Further, we're still stuck at 3+ GHz in production.



#### So, what's happened?

- Speed increases with scaling...
- Power decreases with scaling...

#### Why the clock flattening? POWER!!!!



#### So, what's happened?

- What about scaling...



Materials innovations were - and still are - needed

#### (Short term?) solution






- Processor complexity is good enough
- Transistor sizes can still scale
- Slow processors down to manage power
- Get performance from...
  - **Parallelism**

Ad excerpt (Top 5 Must-Haves: Powerful Processor):

A portrait of performance. "My generative portraits are demanding on the processors in my laptop, as they continuously manipulate video," says Lincoln. Thankfully, the dual-core performance of Intel Centrino processor technology can handle intensive tasks with flying colors.

#### Consider:

- A chip with 1 processor core that takes 1 ns to perform an operation
- A chip with 2 processor cores each of which takes 2 ns to perform an operation
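One way to read the comparison above: both chips complete one operation per nanosecond in aggregate, but two slower cores may be able to run at a lower supply voltage, which the dynamic power relation $P_{dyn} = C_L V_{dd}^2 f$ rewards quadratically. The sketch below works through that arithmetic. It assumes the work splits perfectly across cores, and the function names and any voltage scale passed in are illustrative assumptions, not figures from the lecture.

```c
#include <assert.h>

/* Aggregate throughput in operations per nanosecond, assuming the work
   divides perfectly across cores (an idealization). */
double throughput_ops_per_ns(int cores, double ns_per_op)
{
    assert(cores > 0 && ns_per_op > 0.0);
    return cores / ns_per_op;
}

/* Total dynamic power relative to one baseline core, using
   P = C * V^2 * f per core. `v_scale` and `f_scale` are the supply
   voltage and clock frequency relative to the baseline (1.0 = unchanged). */
double relative_dynamic_power(int cores, double v_scale, double f_scale)
{
    assert(cores > 0);
    return cores * v_scale * v_scale * f_scale;
}
```

With these definitions, `throughput_ops_per_ns(1, 1.0)` and `throughput_ops_per_ns(2, 2.0)` are both 1.0. Two half-speed cores at the original voltage, `relative_dynamic_power(2, 1.0, 0.5)`, draw the same power as the single fast core; if the slower clock allowed, say, a hypothetical 20% voltage reduction, `relative_dynamic_power(2, 0.8, 0.5)` drops to about 0.64.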



#### Oh ... transistors used for memory too

#### "Scaling the memory wall..."

Create a memory hierarchy...





- Why? Put faster memory closer to processing logic...
  - SRAM: avg. improvement: density +25%, speed +20%
  - DRAM: avg. improvement: density +60%, speed +4%
  - DISK: avg. improvement: density +25%, speed +4%

Leads to a "memory wall" - can't get data to processor fast enough

#### The fallout...

- Just a few examples that highlight the importance of understanding a computer's architecture ... whether you're interested in HW or SW
  - (many more throughout the semester)



#### Example 1: the "memory wall"

Changing how you write your code can – dramatically – improve execution time...

#### Consider:





#### Assume "standard" matrix multiply code...

Ideally, (i) 1 row of 10,000 elements AND (ii) a 10,000 x 10,000 matrix are in fast memory

```
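/* x = y * z (square matrices assumed, i.e. XSIZE == YSIZE) */
for (i = 0; i < XSIZE; i = i + 1) {
    for (j = 0; j < YSIZE; j = j + 1) {
        r = 0;
        /* inner product of row i of y and column j of z */
        for (k = 0; k < XSIZE; k = k + 1) {
            r = r + y[i][k] * z[k][j];   /* z is read down a column */
        }
        x[i][j] = r;
    }
}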
```

(can apply more efficient storage techniques, but does not eliminate the problem)

#### **Example 1: the memory wall**

- This semester, you'll learn why a minor code re-write could lead to 2010% performance improvements
  - (and this is just for 1000 x 1000 matrices!)
- Foreshadowing...
  - An understanding of the underlying HW is essential

#### Example 2: multi-core extensions...









**Practical problems** must be addressed!




#### Other issues...

- You may write nice parallel code, but implementation issues must also be considered
  - Example:
    - Assume N cores solve a problem in parallel
    - At times, N cores need to communicate with one another to continue computation
    - Overhead is NOT parallelizable...and can be significant!
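A common way to quantify the cost of non-parallelizable overhead is Amdahl's law. The slide does not name it, so treat the model below as one assumed way to put numbers on the point: it supposes a fixed fraction of the original run time (here called `serial_fraction`, a hypothetical parameter) cannot be spread across cores.

```c
#include <assert.h>

/* Speedup on n_cores when a fraction `serial_fraction` of the original
   run time (e.g. communication overhead) cannot be parallelized.
   This is Amdahl's law. */
double speedup(double serial_fraction, int n_cores)
{
    assert(n_cores > 0);
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores);
}
```

With no overhead, `speedup(0.0, 8)` is 8. With just 10% overhead, `speedup(0.1, 8)` is only about 4.7, so a little more than half of the ideal gain survives.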



"...nanometer process a signal can reach only 5% of the die's length in a clock cycle" [D. Matzke (Texas Inst...

Shift from function-centric to communication-centric design

#### Multi-core only as good as its algorithms



#### Other issues...

- What if you write well-optimized, debugged code for a 4-core processor ... which is then replaced with an 8-core processor?
  - Does code need to be re-written, re-optimized?
  - How do you adapt?

#### Example #3: technology's changing...



Figure 6: Energy-Delay performance comparison for (A) a CMOS and a HTFET ring-oscillator and (B) a CMOS 32-bit and a TFET 32-bit Adder.

#### You need to know what HW you have!

#### New transistors can affect trends discussed earlier.

HW that adds 2 numbers together can be more energy efficient with *lower* V for the same delay

But if high performance is *really* needed, older devices are better

- Future outlook?
- Heterogeneous cores?
- If task critical, use CMOS, if not, use TFET
- What about things like barrier synchronization?

#### Let's discuss some course goals (i.e. what you're going to learn)


#### Goal #1

- At the end of the semester, you should be able to ...
  - ...describe the fundamental components required in a single core of a modern microprocessor
    - (Also, explain how they interact with each other, with main memory, and with external storage media...)



#### Goal #2: (example: which plane is best?)

#### Which is best?

| Plane   |     |       |     |       |
|---------|-----|-------|-----|-------|
| 737-800 | 162 | 3,060 | 530 | 63.5  |
| 747-8I  | 467 | 8000  | 633 | 257.5 |
| 777-300 | 368 | 5995  | 622 | 222   |
| 787-8   | 230 | 8000  | 630 | 153   |


#### Goal #2

- At the end of the semester, you should be able to...
  - ...compare and contrast different computer architectures to determine which one performs better...

Example: AMD Athlon™ vs. Intel® Pentium® dual-core

| Processor             | AMD Athlon™   |
|-----------------------|---------------|
| Model                 | 3200+         |
| OPN Tray              | ADA3200AEP5AR |
| OPN PIB               | ADA3200BOX    |
| Operating Mode 32 Bit | Yes           |
| Operating Mode 64 Bit | Yes           |
| Revision              | CG            |
| Core Speed (MHz)      | 2000          |
| Voltages              | 1.50V         |

| Processor Number | Architecture | Cache  | Clock Speed | Front Side Bus | Dual-core |
|------------------|--------------|--------|-------------|----------------|-----------|
| E2220            | 65 nm        | 1MB L2 | 2.40 GHz    | 800 MHz        | 1         |
| E2200            | 65 nm        | 1MB L2 | 2.20 GHz    | 800 MHz        | 1         |
| E2180            | 65 nm        | 1MB L2 | 2.00 GHz    | 800 MHz        | 1         |
| E2160            | 65 nm        | 1MB L2 | 1.80 GHz    | 800 MHz        | 1         |
| E2140            | 65 nm        | 1MB L2 | 1.60 GHz    | 800 MHz        | 1         |

If you want to do X, which processor is best?


#### Goal #3

- At the end of the semester, you should be able to...
  - ... apply knowledge about a processor's datapath, different memory hierarchies, performance metrics, etc. to design a microprocessor that (a) meets a target set of performance goals and (b) is realistically implementable

Example:

- **Climate Agency Awards \$350 Million For Supercomputers** (J. Nicholas Hoover, InformationWeek)
  - The National Oceanic and Atmospheric Administration will pay CSC and Cray to plan, build, and operate high-performance computers for climate prediction research.
  - "... development computer, NOAA found it requires one the power of which will be ultimately measured in petaflops, which would make the future machine one of the world's most powerful supercomputers"
- **<6 MHz MP3 Decode**
  - "[...] optimized the MP3 decoder for its HiFi DSPs. This MP3 decoder now runs at the lowest power and is the most efficient in the industry, requiring just 5.7 MHz when running at 128 Kbps, 44.1 KHz and dissipating 0.45 mW in TSMC's 65nm LP process (including memories). This makes Tensilica's Xtensa HiFi 2 Audio Engine ideal for adding MP3 playback to cellular phones, where current carrier requirements are for 100 hours of playback time on a battery charge, [...] to 200 hours in the near future."

#### Goal #3: (motivation)

Which plane would you rather fly to South Bend?



Photo Credits http://www.airliners.net/aircraft-data/stats.main?id=196 ges.businessweek.com/ss/07/06/0615 boe 787/image/787interior.ipg

#### Goal #4

- At the end of the semester, you should be able to...
  - ... understand how code written in a high-level language (e.g. C) is eventually executed on-chip...

#### Example

#### In C:

```
void insertionSort(int numbers[], int array_size)
{
    int i, j, index;

    for (i = 1; i < array_size; i++) {
        index = numbers[i];   /* value to insert into the sorted prefix */
        j = i;
        /* shift larger elements one slot to the right */
        while ((j > 0) && (numbers[j-1] > index)) {
            numbers[j] = numbers[j-1];
            j = j - 1;
        }
        numbers[j] = index;
    }
}
```

#### In Java:

```
public static void insertionSort(int[] list, int length) {
    int firstOutOfOrder, location, temp;

    for (firstOutOfOrder = 1; firstOutOfOrder < length; firstOutOfOrder++) {
        if (list[firstOutOfOrder] < list[firstOutOfOrder - 1]) {
            temp = list[firstOutOfOrder];
            location = firstOutOfOrder;

            // shift larger elements one slot to the right
            do {
                list[location] = list[location - 1];
                location--;
            } while (location > 0 && list[location - 1] > temp);

            list[location] = temp;
        }
    }
}
```

Both programs could be run on the same processor... How does this happen?


#### Goal #5: (motivation, part 1)



For this image, and this perspective, which pixels should be rendered?

- Solution provided by z-buffering algorithm
- Depth of each object in 3D scene used to paint 2D image
- Algorithm steps through list of polygons
  - # of polygons tends to be >> (for more detailed scene)
  - # of pixels/polygon tends to be small

Image source: www.cs.unc.edu/~pmerrell/comp575/Z-Buffering.ppt

#### Goal #5: (motivation, part 2)



A simple three-dimensional scene

Z-buffer algorithm (the polygon list is often a dynamic data structure, e.g. a linked list):

```
Given:  a list of polygons {P1, P2, ..., Pn}
Output: a COLOR array, which displays the intensity of the visible polygon surfaces

Initialize (note: z-depth and z-buffer(x,y) are positive):
    z-buffer(x,y) = max depth
    COLOR(x,y)    = background color

Begin:
    for (each polygon P in the polygon list) do {
        for (each pixel (x,y) that intersects P) do {
            calculate z-depth of P at (x,y)
            if (z-depth < z-buffer(x,y)) then {
                z-buffer(x,y) = z-depth
                COLOR(x,y)    = intensity of P at (x,y)
            }
        }
    }
    display COLOR array
```
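The z-buffer pseudocode above can be sketched in C to show where the memory accesses come from. This is a deliberately simplified illustration, not the lecture's code: each "polygon" is assumed to be an axis-aligned rectangle at a single constant depth, the frame is a hypothetical 4 x 4 pixels, and `struct polygon`, `zbuffer_render`, and the constants are names made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH      4
#define HEIGHT     4
#define MAX_DEPTH  1.0e30   /* stands in for "max depth" */
#define BACKGROUND 0

/* Hypothetical stand-in for a polygon: an axis-aligned rectangle at one
   constant depth, linked into a list as the slide suggests. */
struct polygon {
    int x0, y0, x1, y1;     /* covers x0 <= x < x1 and y0 <= y < y1 */
    double depth;           /* positive; smaller means closer to the viewer */
    int intensity;          /* value written to COLOR where visible */
    struct polygon *next;
};

void zbuffer_render(const struct polygon *list,
                    double zbuf[HEIGHT][WIDTH], int color[HEIGHT][WIDTH])
{
    /* Initialize: every pixel starts at max depth with the background color. */
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            zbuf[y][x] = MAX_DEPTH;
            color[y][x] = BACKGROUND;
        }
    }

    /* Walk the polygon list; keep the closest surface seen at each pixel. */
    for (const struct polygon *p = list; p != NULL; p = p->next) {
        assert(p->x0 >= 0 && p->x1 <= WIDTH && p->y0 >= 0 && p->y1 <= HEIGHT);
        for (int y = p->y0; y < p->y1; y++) {
            for (int x = p->x0; x < p->x1; x++) {
                if (p->depth < zbuf[y][x]) {
                    zbuf[y][x] = p->depth;
                    color[y][x] = p->intensity;
                }
            }
        }
    }
}
```

Note the access pattern: the outer loop chases `p->next` pointers to nodes that may be scattered through memory, and each polygon touches only a few pixels before the next pointer is followed.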

Z-buffer representation

#### Goal #5: (motivation, part 3)



#### Goal #5

- At the end of the semester, you should be able to...
  - ...use knowledge about a microprocessor's underlying hardware (or "architecture") to write more efficient software...



#### Goal #6

- At the end of the semester, you should be able to...
  - ...explain and articulate why modern microprocessors now have more than 1 core and how SW must adapt to accommodate the now-prevalent multi-core approach to computing
- Why?
  - For 8- and 16-core chips to be practical, we have to be able to use them



