### Lecture 04 Interconnect Overhead

Specific topics include a short review of logic scaling, the impact of technology scaling on interconnect, how interconnect scaling affects multi-core architectures (the current solution to the problems associated with logic scaling), and information processing "tokens".

### **Background Slides**



University of Notre Dame




- Shift from function-centric to communication-centric design


CSE 30321 - Lecture 04 - Interconnect Overhead

# **NW topologies**



Figure 9: Placement of Routers used to Estimate Area (Lower Left Quadrant)

From Balfour, Dally, Supercomputing

## **Dally Paper Slides**


## **Preferred NW configurations**

#### Table 3: Preferred Network Configurations

|         | $H$            | $t_r$ | $B_C$ | $w$ | $B_B$ | $T_c$ | $T_s$ | $T_0$ |
|---------|----------------|-------|-------|-----|-------|-------|-------|-------|
| Mesh    | $6\frac{1}{4}$ | 2     | 16    | 192 | 3,072 | 5.3   | 3     | 17.8  |
| MeshX2  | $6\frac{1}{4}$ | 2     | 32    | 192 | 6,144 | 5.3   | 3     | 17.8  |
| Torus   | $5$            | 2     | 32    | 288 | 9,216 | 4.0   | 2     | 14.0  |
| CMesh   | $3\frac{1}{8}$ | 3     | 16    | 288 | 4,608 | 2.1   | 2     | 11.5  |
| CMeshX2 | $3\frac{1}{8}$ | 3     | 32    | 288 | 9,216 | 2.1   | 2     | 11.5  |
| FTree   | $4\frac{3}{8}$ | 2     | 64    | 144 | 9,216 | 4.4   | 4     | 13.1  |
| FClos   | $4\frac{3}{8}$ | 2     | 32    | 144 | 4,608 | 3.5   | 4     | 12.2  |
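The numbers in Table 3 appear to follow two simple relationships: $B_B = B_C \cdot w$, and $T_0 \approx H \cdot t_r + T_c$ (to within rounding). The slide does not state these formulas; they are inferred from the table values, so treat the sketch below as a consistency check rather than the paper's definitions. The Torus hop count is read as 5, which is an assumption based on the $T_0$ column.

```python
from fractions import Fraction

# Rows copied from Table 3: (name, H, t_r, B_C, w, B_B, T_c, T_s, T_0).
# The Torus H value is assumed to be 5.
ROWS = [
    ("Mesh",    Fraction(25, 4), 2, 16, 192, 3072, 5.3, 3, 17.8),
    ("MeshX2",  Fraction(25, 4), 2, 32, 192, 6144, 5.3, 3, 17.8),
    ("Torus",   Fraction(5),     2, 32, 288, 9216, 4.0, 2, 14.0),
    ("CMesh",   Fraction(25, 8), 3, 16, 288, 4608, 2.1, 2, 11.5),
    ("CMeshX2", Fraction(25, 8), 3, 32, 288, 9216, 2.1, 2, 11.5),
    ("FTree",   Fraction(35, 8), 2, 64, 144, 9216, 4.4, 4, 13.1),
    ("FClos",   Fraction(35, 8), 2, 32, 144, 4608, 3.5, 4, 12.2),
]

def check_table(rows):
    """Return (name, bandwidth_matches, estimated_T0, listed_T0) per row."""
    results = []
    for name, hops, t_r, b_c, width, b_b, t_c, _t_s, t_0 in rows:
        bandwidth_matches = (b_c * width == b_b)   # inferred: B_B = B_C * w
        estimated_t0 = float(hops) * t_r + t_c     # inferred: T_0 ~ H * t_r + T_c
        results.append((name, bandwidth_matches, estimated_t0, t_0))
    return results

results = check_table(ROWS)
for name, bandwidth_matches, estimated_t0, listed_t0 in results:
    print(f"{name:8s} B_B ok: {bandwidth_matches}  "
          f"T_0 estimate: {estimated_t0:.2f}  listed: {listed_t0}")
```

Read this way, the concentrated mesh has the lowest listed $T_0$ mainly because its hop count is roughly half that of the plain mesh, even though each router traversal costs 3 cycles instead of 2.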



(a) Completion Time by Pattern






(c) Network Power Dissipation




From Balfour, Dally, Supercomputing





Figure 11: Workload Packet Latency Distribution for Uniform Random Traffic Pattern



Figure 12: Offered Latency for CMeshX2 Network

- **Contention for access to shared resources**
  - e.g., multiple accesses to a limited number of memory banks may limit system scalability
- **Programming languages, environments, & methods:**
  - Need simple semantics that can expose computational properties to be exploited by large-scale architectures

#### Algorithms


- What if you write good code for a 4-core chip, and then get an 8-core chip?
- **Cache coherency**
  - When P1 writes, P2 must be able to read the updated value
    - Protocols can enable cache coherency but add overhead
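To make the overhead concrete, here is a toy model of one shared address with a write-through, write-invalidate policy. It is a minimal sketch, not a protocol from the lecture: the class name `ToyCoherentBus`, the single-address simplification, and the message counting are all illustrative assumptions.

```python
class ToyCoherentBus:
    """Toy model: one memory word, one cached copy per core, write-invalidate."""

    def __init__(self, n_cores):
        self.memory = 0
        self.cache = [None] * n_cores  # None means the core holds no valid copy
        self.messages = 0              # counts coherence traffic (the overhead)

    def read(self, core):
        if self.cache[core] is None:   # miss: fetch the current value
            self.messages += 1
            self.cache[core] = self.memory
        return self.cache[core]

    def write(self, core, value):
        # Invalidate every other valid copy so no core can read stale data.
        for other in range(len(self.cache)):
            if other != core and self.cache[other] is not None:
                self.cache[other] = None
                self.messages += 1
        self.cache[core] = value
        self.memory = value            # write-through keeps memory current


bus = ToyCoherentBus(n_cores=2)
first = bus.read(1)     # P2 caches the initial value (one fetch message)
bus.write(0, 42)        # P1 writes: P2's copy is invalidated (one message)
second = bus.read(1)    # P2 misses and re-fetches the new value (one message)
print(first, second, bus.messages)
```

P2 sees the new value, but only because the write triggered an invalidation and the later read paid for a re-fetch. Those extra messages are the overhead the slide refers to.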

# Impediments to Parallel Performance


#### Latency

- Is already a major source of performance degradation
- The architecture is charged with hiding local latency
  - (that's why we talked about registers & caches)
- Hiding global latency is also a task for the programmer
  - (i.e., manual resource allocation)
- Today:
  - access to DRAM takes 100s of clock cycles (CCs)
  - a round-trip remote access takes 1000s of CCs
  - multiple clock cycles to cross the chip or to communicate core-to-core
    - Not "free"
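For a rough feel of why these latencies matter, the sketch below computes the average cost per memory access for a hypothetical access mix. The cycle counts (1 for a cache hit, 200 for DRAM, 2000 for a remote access) and the access fractions are illustrative assumptions picked to match the "100s" and "1000s" of clock cycles above; they are not measurements.

```python
def average_access_cycles(mix):
    """mix: list of (fraction_of_accesses, latency_in_cycles) pairs."""
    total_fraction = sum(fraction for fraction, _ in mix)
    assert abs(total_fraction - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(fraction * cycles for fraction, cycles in mix)


# Hypothetical mix: 97% cache hits, 2% DRAM accesses, 1% remote accesses.
mix = [(0.97, 1), (0.02, 200), (0.01, 2000)]
avg = average_access_cycles(mix)
print(f"average cycles per access: {avg:.2f}")
```

Even with 97% of accesses hitting in one cycle, the rare long-latency accesses dominate the average in this made-up mix, which is why both the architecture and the programmer are asked to hide latency.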

Overhead where no actual processing is done.


- All **†**'ed items also affect Fraction<sub>parallelizable</sub>
  - (and hence speedup)
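The link between Fraction<sub>parallelizable</sub> and speedup can be written out explicitly. Assuming the usual Amdahl's-law form with $N$ cores (the slide names the fraction but does not show the formula, so this form is an assumption):

$$
\text{Speedup} = \frac{1}{\left(1 - \text{Fraction}_{\text{parallelizable}}\right) + \dfrac{\text{Fraction}_{\text{parallelizable}}}{N}}
$$

As an illustrative example with made-up numbers: with $N = 8$ and $\text{Fraction}_{\text{parallelizable}} = 0.9$, the speedup is $1 / (0.1 + 0.1125) \approx 4.7$. If contention or coherency overhead effectively lowers the fraction to $0.8$, the speedup drops to $1 / (0.2 + 0.1) \approx 3.3$.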






Likely to see HW support for parallel processor configurations: Coherency

**On-chip IC NWs** 

... 567 mm<sup>2</sup> processor on 45 nm CMOS integrates 48 IA-32 cores and 4 DDR3 channels in a 2D-mesh network. Cores communicate through message passing using 384 KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. As performance scales the processor dissipates between 25 W and 125 W.




# Multi-core only as good as algorithms that use it

Speedup vs. (# of Cores, % Parallel)
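The shape of a speedup plot like this one can be reproduced with a few lines of code. The sketch below assumes the standard Amdahl's-law model, speedup = 1 / ((1 - f) + f / N), and uses core counts and parallel fractions chosen for illustration; they are not taken from the slide's figure.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


core_counts = [1, 2, 4, 8, 16, 64]   # illustrative values
fractions = [0.50, 0.90, 0.99]       # illustrative values

table = {f: [amdahl_speedup(f, n) for n in core_counts] for f in fractions}

print("parallel  " + "".join(f"{n:>8d}" for n in core_counts))
for f in fractions:
    print(f"{f:>8.0%}  " + "".join(f"{s:>8.2f}" for s in table[f]))
```

Moving from 4 to 8 cores only pays off if the code is highly parallel: under this model, the speedup can never exceed 1 / (1 - f), no matter how many cores are added.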


