# **Rescheduling?**

## Could we meet 2 days a week instead of 3?



# Lecture 01 Introduction to CSE 40547 / 60547

University of Notre Dame


**George Boole** 

**Konrad Zuse** 

CSE 30321 - Lecture 01 - Introduction to CSE 40547

# A history of computing

- Today:
  - I'd like to start by explaining "how we got to where we are" when we look at modern information processing systems
- My Focus:
  - Devices
  - How devices are organized into an architecture
  - How a system-level architecture might address an <u>application</u> level task

With emerging technologies with sub-100 nm feature sizes, it's important to consider devices, architectures, and applications simultaneously – which we will do too!

(Things didn't use to be this way)


## University of Notre Dame, Department of Computer Science & Engineering


# Four People

#### Claude Shannon



Proves that circuits based on electromechanical relays could be used to solve Boolean algebra problems

#### John Atanasoff



Suggests correct mode of computation is to use electronic binary digits

#### George Boole

Proposes a complete system for algebraic logical operations

Math via Boolean logic (e.g. A, B, and A∩B)

#### Konrad Zuse

Use binary numbers to encode information

Binary numbers can be represented with the on/off state of a current switch

# **Binary Math**

- How do computers add numbers?
  - (binary arithmetic ... i.e. all 1s and 0s)
  - What number (in decimal) is 110010 in binary?
    - $1 \times 2^5 + 1 \times 2^4 + 0 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 0 \times 2^0$
    - $32 + 16 + 0 + 0 + 2 + 0 = 50$
  - What is 110010 + 000011? It is 110101:
    - $1 \times 2^5 + 1 \times 2^4 + 0 \times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0$
    - $32 + 16 + 0 + 4 + 0 + 1 = 53$
  - (multiplication works just like decimal multiplication)
    - e.g. $0 \times 0 = 0$, $0 \times 1 = 0$, $1 \times 0 = 0$, $1 \times 1 = 1$
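The conversion and the addition on this slide can be checked with a short Python sketch. The helper names `to_decimal` and `add_binary` are illustrative assumptions, not part of the slides; the second helper adds column by column with a carry, roughly the way a hardware adder would.

```python
def to_decimal(bits: str) -> int:
    """Sum bit * 2^position, mirroring the expansion on the slide."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        total += int(bit) * 2 ** position
    return total


def add_binary(a: str, b: str) -> str:
    """Add two equal-length binary strings column by column with a carry."""
    result = []
    carry = 0
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        column = int(bit_a) + int(bit_b) + carry
        result.append(str(column % 2))  # sum bit for this column
        carry = column // 2             # carry into the next column
    if carry:
        result.append("1")
    return "".join(reversed(result))


print(to_decimal("110010"))                        # 50
print(add_binary("110010", "000011"))              # 110101
print(to_decimal(add_binary("110010", "000011")))  # 53
```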


Can perform AND, OR ops with switches



With AND, OR, NOT, can implement any function.
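As a small illustration of that claim, the sketch below builds XOR and a one-bit full adder out of nothing but AND, OR, and NOT. The function names are assumptions made for this example; the only point is that binary addition reduces to these three operations.

```python
def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def NOT(a):
    return 1 - a


def XOR(a, b):
    # (a OR b) AND NOT (a AND b)
    return AND(OR(a, b), NOT(AND(a, b)))


def full_adder(a, b, carry_in):
    """Return (sum_bit, carry_out) for three input bits."""
    partial = XOR(a, b)
    sum_bit = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return sum_bit, carry_out


# Print the full truth table of the adder.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, full_adder(a, b, c))
```

Chaining one `full_adder` per bit position, with each carry feeding the next position, gives a multi-bit adder.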



From Computer Desktop Encyclopedia


# Switches are now transistors on an IC



[Figure: switch-level representations of AND and OR, with truth tables for inputs A, B and output OUT]


# Historically, this idea seems to have worked out rather well...

- Long since predominant mode of information processing
  - Represent binary digits as on/off state of a current switch






# **Acknowledgements**

- These slides contain material developed and copyrighted by:
  - Arvind (MIT)
  - Krste Asanovic (MIT/UCB)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
- MIT material derived from course 6.823
- UCB material derived from course CS252

Let's start with relay and vacuum tube machines ... ideas that evolved from this work still persist and influence work today

# **Linear Equation Solver**





## 1930's:

- Atanasoff built the Linear Equation Solver.
- It had 300 tubes!
- Special-purpose binary digital calculator
- Dynamic RAM (stored values on refreshed capacitors)

## Application:

Linear and Integral differential equations

## Background:

 Vannevar Bush's Differential Analyzer --- an analog computer

## Technology:

Tubes and Electromechanical relays

Atanasoff decided that the correct mode of computation was using electronic binary digits.

1/19/2010

CS152, Spring 2010




# **Electronic Numerical Integrator and Computer (ENIAC)**

- Inspired by Atanasoff and Berry, Eckert and Mauchly designed and built ENIAC (1943-45) at the University of Pennsylvania
- The first, completely electronic, operational, general-purpose analytical calculator!
  - 30 tons, 72 square meters, 200 kW
- Performance
  - Read in 120 cards per minute
  - Addition took 200 µs, Division 6 ms
  - 1000 times faster than Mark I

Application: Ballistic calculations

angle = f (location, tail wind, cross wind, air density, temperature, weight of shell, propellant charge, ... )

Not very reliable!

WW-2 Effort

# **Electronic Discrete Variable Automatic Computer (EDVAC)**

- ENIAC's programming system was external
  - Sequences of instructions were executed independently of the results of the calculation
  - Human intervention required to take instruction
- Eckert, Mauchly, and John von Neumann designed EDVAC (1944) to solve this problem
  - Solution was the stored program computer
    - $\Rightarrow$  "program can be manipulated as

von Neumann's stored program model warrants a slightly longer discussion







# The IBM 650 (1953-4)



# This idea has staying power!

How we process information hasn't changed much since the 1930s and 1940s

# Programmer's view of the IBM 650



A drum machine with 44 instructions

Instruction: 60 1234 1009

 "Load the contents of location 1234 into the distribution; put it also into the upper accumulator; set lower accumulator to zero; and then go to location 1009 for the next instruction."

Good programmers optimized the placement of instructions on the drum to reduce latency!
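The instruction word on this slide can be read as three fields. The sketch below splits it accordingly; the `decode_650` helper and the field names are assumptions for illustration, inferred only from the description above (an operation, a data location, and the location of the next instruction).

```python
from typing import NamedTuple


class Instruction650(NamedTuple):
    opcode: int        # which operation to perform (hypothetical field name)
    data_address: int  # drum location of the operand
    next_address: int  # drum location of the next instruction


def decode_650(word: str) -> Instruction650:
    """Split a space-separated instruction word into its three fields."""
    opcode, data_address, next_address = word.split()
    return Instruction650(int(opcode), int(data_address), int(next_address))


inst = decode_650("60 1234 1009")
print(inst.opcode, inst.data_address, inst.next_address)  # 60 1234 1009
```

Because every instruction names its own successor, the programmer is free to choose where on the drum the next instruction lives, which is what makes the placement optimization mentioned above possible.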


# Stored program model persists with device miniaturization

# Computers in mid 50's

- Hardware was expensive
- Stores were small (1000 words)
  - ⇒ No resident system software!
- Memory access time was 10 to 50 times slower than the processor cycle
  - ⇒ Instruction execution time was totally dominated by the *memory reference time*.
- The ability to design complex control circuits to execute an instruction was the central design concern as opposed to the speed of decoding or an ALU operation
- Programmer's view of the machine was inseparable from the actual hardware implementation



# IBM 360 (1964): A General-Purpose Register (GPR) Machine



- Processor State
  - 16 General-Purpose 32-bit Registers
    - may be used as index and base register
    - Register 0 has some special properties
  - 4 Floating Point 64-bit Registers
  - A Program Status Word (PSW)
    - PC, Condition codes, Control flags
- A 32-bit machine with 24-bit addresses
  - But no instruction contains a 24-bit address!
- Data Formats
  - 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words

The IBM 360 is why bytes are 8 bits long today!



## **IBM 360: Initial Implementations**

|               | Model 30        | Model 70              |
|---------------|-----------------|-----------------------|
| Storage       | 8K - 64 KB      | 256K - 512 KB         |
| Datapath      | 8-bit           | 64-bit                |
| Circuit Delay | 30 nsec/level   | 5 nsec/level          |
| Local Store   | Main Store      | Transistor Registers  |
| Control Store | Read only 1µsec | Conventional circuits |

*IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models.* 

*Milestone: The first true ISA designed as portable hardware-software interface!* 

## With minor modifications it still survives today!

# One caveat to stored program model: Field Programmable Gate Arrays






### Remember this: We'll revisit this idea later in the semester!


# Stored program model – *in face of transistor scaling on integrated circuits* – not without challenges



CMOS IC

Solid-state transistors

Result: exponential transistor density increase...


# Processor-DRAM Memory Gap (latency)

# Challenge #1: Memory is still (relatively) slow!

# Remember Computers in mid 50's

- Memory access time was 10 to 50 times slower than the processor cycle
  - ⇒ Instruction execution time was totally dominated by the memory reference time.



## **Solution: Memory Hierarchies** (The principle of locality...)

- ...says that most programs don't access all code or data uniformly
  - i.e. in a loop, small subset of instructions might be executed over and over again...
  - ...& a block of memory addresses might be accessed sequentially...
- This has led to "memory hierarchies"
- Some important things to note:
  - Fast memory is expensive
  - Levels of memory usually smaller/faster than previous


- Levels of memory usually "subset" one another
  - All the stuff in a higher level is in some level below it
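To make the principle of locality concrete, here is a minimal sketch that simulates a tiny direct-mapped cache and measures the hit rate for three access patterns. The cache organization (8 lines, 4-word blocks), the `hit_rate` helper, and the three address traces are all assumptions chosen for illustration; they are not taken from the slides.

```python
def hit_rate(addresses, num_lines=8, block_size=4):
    """Simulate a tiny direct-mapped cache and return the fraction of hits."""
    lines = [None] * num_lines  # which block currently sits in each line
    hits = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_lines
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block  # miss: fetch the block from the level below
    return hits / len(addresses)


sequential = list(range(64))           # spatial locality: neighbors share a block
loop = list(range(16)) * 4             # temporal locality: same 16 words reused
strided = [i * 32 for i in range(64)]  # poor locality: always a new block

print(hit_rate(sequential))  # 0.75
print(hit_rate(loop))        # 0.9375
print(hit_rate(strided))     # 0.0
```

The small, fast level only pays off when accesses cluster in time or space; the strided trace defeats it entirely.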

# An example memory hierarchy




# **DRAM vs. SRAM: Different Technology Processes**






# Challenge #2: Chips have gotten hot!

# Moore's Law

"Cramming more components onto integrated circuits."

(G.E. Moore, Electronics, 1965)

- Observation: DRAM transistor density doubles annually
  - Became known as "Moore's Law"
  - Actually, a bit off:
    - Density doubles every 18 months (now more like 24)
    - (in 1965 they only had 4 data points!)
- Corollaries:
  - Cost per transistor halves annually (18 months)
  - Power per transistor decreases with scaling
  - Speed increases with scaling
  - Of course, it depends on how small you try to make things
    - (i.e. no exponential lasts forever)
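The difference between the three doubling periods mentioned above compounds quickly. The sketch below compares them over an arbitrarily chosen six-year window; the `density_growth` helper and the six-year horizon are assumptions for illustration only.

```python
def density_growth(years, doubling_period_months):
    """Growth factor in transistor density after `years` of steady doubling."""
    return 2 ** (years * 12 / doubling_period_months)


for months in (12, 18, 24):
    print(f"doubling every {months} months -> x{density_growth(6, months):.0f} in 6 years")
```

Over six years that is 64x, 16x, and 8x respectively, so a seemingly small change in the doubling period matters a great deal.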




# **Summary of relationships**

- (+) If V increases, speed (performance) increases
- (-) If V increases, power (heat) increases
- (+) If L decreases, speed (performance) increases
- (?) If L decreases, power (heat) does what?
  - P could improve because of lower C
  - P could increase because many more devices switch
  - P could increase because many more devices switch faster!

# Need to carefully consider tradeoffs between speed and heat
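One way to see why the last question has no single answer is the standard CMOS switching-power approximation, P ≈ α · N · C · V² · f. This formula is not stated on the slide; the sketch assumes it, and assumes V is the supply voltage, C the switched capacitance per device, N the number of devices, f the clock frequency, and α an activity factor. All numeric values are made up for illustration.

```python
def dynamic_power(c_farads, v_volts, f_hertz, n_devices=1, activity=1.0):
    """Approximate switching power: activity * N * C * V^2 * f (assumed model)."""
    return activity * n_devices * c_farads * v_volts ** 2 * f_hertz


base = dynamic_power(1e-15, 1.0, 1e9, n_devices=1_000_000)

# Raising V by 10% raises power by about 21%, since power grows with V squared.
higher_v = dynamic_power(1e-15, 1.1, 1e9, n_devices=1_000_000)

# Shrinking: half the capacitance, but 4x the devices switching 2x as fast.
scaled = dynamic_power(0.5e-15, 1.0, 2e9, n_devices=4_000_000)

print(f"higher V: x{higher_v / base:.2f}")
print(f"scaled:   x{scaled / base:.2f}")
```

In this made-up scenario the lower C is outweighed by more devices switching faster, and total power goes up fourfold.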


## A funny thing happened on the way to 45 nm




2005 projection was for 5.2 GHz - and we didn't make it in production. Further, we're still stuck at 3+ GHz in production.

# A funny thing happened on the way to 45 nm

- Speed increases with scaling...
- Power decreases with scaling...

## Why the clock flattening? POWER!!!!



[Figure: normalized dynamic power and static (leakage) power versus technology node (350 nm to 22 nm) and year (1990 to 2020)]

## A funny thing happened on the way to 45 nm

#### What about scaling...



### Materials innovations were - and still are - needed


# Another solution: parallelism

[Figure: Intel Centrino advertisement ("Top 5 Must-Haves") promoting dual-core processor performance]







- Processor complexity is good enough
- Transistor sizes can still scale
- Slow processors down to manage power
- Get performance from...

## **Parallelism**

**Top 5 Must-Haves** 

#### POWERFUL PROCESSOR

A portrait of performance. "My generative portraits are demanding on the processors in my laptop, as they continuously manipulate video," says Lincoln. Thankfully, the dual-core performance of Intel Centrino processor technology can handle intensive tasks with flying colors.

(i.e. 1 processor, 1 ns clock cycle VS. 2 processors, 2 ns clock cycle)
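The comparison in parentheses can be sketched numerically. Two slower processors match one fast processor in peak throughput only if the work divides perfectly, and the slower clock may allow a lower supply voltage. The helper names, the P ∝ N · V² · f model, and the 0.8 relative voltage are all assumptions for illustration, not figures from the slides.

```python
def peak_throughput(num_processors, cycle_ns, ops_per_cycle=1):
    """Peak operations per nanosecond, assuming perfectly parallel work."""
    return num_processors * ops_per_cycle / cycle_ns


def relative_dynamic_power(num_processors, cycle_ns, voltage):
    """Power relative to 1 processor at a 1 ns cycle and voltage 1.0 (P ~ N * V^2 * f)."""
    return num_processors * voltage ** 2 / cycle_ns


print(peak_throughput(1, 1.0))  # 1.0
print(peak_throughput(2, 2.0))  # 1.0

# Assumed: the 2 ns design can run at 80% of the original supply voltage.
print(f"{relative_dynamic_power(2, 2.0, 0.8):.2f}")  # 0.64
```

Under these assumptions the dual-processor design delivers the same peak throughput for roughly two thirds of the dynamic power.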



#### Small and Efficient

As microprocessor transistors become smaller, stopping undesired current leakage becomes more difficult. This leakage leads to shortened battery life. Intel's coming chips use a new insulation material to prevent this, reducing power consumption.

Current transistors use extremely thin silicon dioxide insulators, which lead to current leakage. Thickening them decreases this leakage but reduces the electric charge passing through, impeding performance.

New transistors use a hafnium-based insulator and a metal gate electrode. Hafnium provides stronger electrical coupling, so the insulator can be made thicker to reduce leakage without degrading the performance of the transistor.




# Even solutions have limitations

More caching?

(What about "state bloat"?)

# Impact of modern processing principles (Lots of "state")

- User:
  - state used for application execution
- Supervisor:
  - state used to manage user state
- Machine:
  - state that configures the system
- Transient:
  - state used during instruction execution
- Access-Enhancing:
  - state used to simplify translation of other state names
- Latency-Enhancing:
  - state used to reduce latency to other state values


## Impact of modern processing principles (Why so much latency enhancing state?)



#### Lots of "state" – but how much is *directly* associated with a computation?

What if you want to add two 32-bit numbers together?


## Impact of modern processing principles (Total State vs. Time)




# This idea has been extended...

Quad core chips...


7, 8, and 9 core chips...





Practical problems must be addressed!

Advances in parallel programming are necessary!

More cores?

# Impediments to Parallel Performance

- Contention for access to shared resources
  - i.e. multiple accesses to limited # of memory banks may dominate system scalability
- Programming languages, environments, & methods
  - Need simple semantics that can expose computational properties to be exploited by large-scale architectures
- Algorithms
  - What if you write good code for a 4-core chip, and then get an 8-core chip?
- Cache coherency
  - P1 writes, P2 can read
  - Protocols can enable cache (\$) coherency but add overhead

# Impediments to Parallel Performance

- Latency
  - Is already a major source of performance degradation
  - Architecture charged with hiding local latency
    - (that's why we talked about registers & caches)
  - Hiding global latency is also task of programmer
    - (i.e. manual resource allocation)
  - Today:
    - access to DRAM in 100s of clock cycles (CCs)
    - round trip remote access in 1000s of CCs
    - multiple clock cycles to cross chip or to communicate from core-to-core
      - Not "free"

Overhead where no actual processing is done.


# **Pentium III Die Photo**

Deterministic connections as needed.



1st Pentium III, Katmai: 9.5 M transistors, 12.3 × 10.4 mm in 0.25-micron process with 5 layers of aluminum

- EBL/BBL Bus logic, Front, Back
- MOB Memory Order Buffer
- Packed FPU MMX Fl. Pt. (SSE)
- IEU Integer Execution Unit
- FAU Fl. Pt. Arithmetic Unit
- MIU Memory Interface Unit
- DCU Data Cache Unit
- PMH Page Miss Handler
- DTLB Data TLB
- BAC Branch Address Calculator
- RAT Register Alias Table
- SIMD Packed Fl. Pt.
- RS Reservation Station
- BTB Branch Target Buffer
- IFU Instruction Fetch Unit (+I\$)
- ID Instruction Decode
- ROB Reorder Buffer
- MS Micro-instruction Sequencer

(Computer Architecture, TAMU, Prof. Lawrence)

# Some Perspective...



Shift from function-centric to communication-centric design





http://dx.doi.org/10.1109/ASSCC.2009.5357230

http://dx.doi.org/10.1109/ISSCC.2010.5434077


# **Impediments to Parallel Performance**

- All **†**'ed items also affect Fraction<sub>parallelizable</sub>
  - (and hence speedup)
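Fraction<sub>parallelizable</sub> is the term that appears in Amdahl's Law. The slide does not spell the formula out, so the sketch below assumes the standard form, speedup = 1 / ((1 - Fraction) + Fraction / N) for N cores; the function name and the 0.9 fraction are illustrative choices.

```python
def amdahl_speedup(fraction_parallelizable, num_cores):
    """Amdahl's Law: overall speedup when only part of the work is parallel."""
    serial = 1.0 - fraction_parallelizable
    return 1.0 / (serial + fraction_parallelizable / num_cores)


# With 90% parallelizable work, adding cores gives diminishing returns.
for cores in (2, 4, 8, 1024):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

However many cores are added, the speedup in this example stays below 10, because the serial 10% never shrinks. Anything that lowers Fraction<sub>parallelizable</sub> lowers that ceiling.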



# Multi-core only as good as algorithms that use it




# Summary

- For many years, could double performance just by making devices smaller
  - Clock rates increased with manageable power impact
- Now:
  - Devices get smaller, but also run slower!
  - Performance comes from parallelism
  - But to parallelize, need new algorithms, software support
  - Also must overcome non-parallelizable overheads that degrade performance
- Low hanging fruit very much gone
  - New logic (and memory!) devices that (a) don't have same inherent problems as switch-based logic and/or (b) enable new system architectures are sought...



# Motivating example 1

- Brain-inspired computation:
  - New, memristive devices may enable neuromorphic computer architectures...




## Most complex information-management system in the universe...



|        | Dell 8250 (Pentium® 4)                   | Brain                         |
|--------|------------------------------------------|-------------------------------|
| Mass   | ~25 kg                                   | 1.4 kg                        |
| Volume | 34200 cm<sup>3</sup>                     | 1350 cm<sup>3</sup>           |
| MIPS   | ~10<sup>3</sup> MIPS                     | 10<sup>8</sup> MIPS           |
| BIT    | <10<sup>16</sup> bit/s                   | 10<sup>19</sup> bit/s         |
| Power  | 200 W                                    | 30 W (max)                    |
|        | ~5 MIPS/W                                | 3x10<sup>6</sup> MIPS/W       |
|        | 5x10<sup>6</sup> k<sub>B</sub>T / bit    | 700 k<sub>B</sub>T / bit      |

The brain leads by a factor of about 10<sup>5</sup> in both MIPS and MIPS/W.

When will computer hardware match the human brain?

A CMOS machine at the limits of scaling would use prodigious amounts of power: at ~5 MIPS/W, 10<sup>8</sup> MIPS = 20,000,000 W.

# **Example 3: Universal memories**




# Example 2: Back to relays???



Possible advantages: no leakage, programmable, ....


## Where We're At

1. CMOS is a hard act to follow! None of the proposed post-CMOS switch candidates appear to be "drop-in" replacements.
2. Electron/charge state variables so far superior to alternatives. Not necessarily the best, if better architectures are developed.

## Where We're Headed

Analog for example...

[Figure: roadmap timeline from 2009 to 2020 covering Device Proposals, New Switch Research, and New Switch-Industry Deployment, with a workshop marked in 2009]

- MIND Workshop on Architectures for New Devices 08/2009

1 December, 2008

- Monthly Center Chief Operating Officer coordination
- NRI / MIND Read-outs for member companies

Now, onto the syllabus...