# **Speculative Multithreaded Processors**

#### Guri Sohi and Amir Roth

Computer Sciences Department University of Wisconsin-Madison

### **Outline**

- Trends and their implications
- Workloads for future processors
- Program parallelization and speculative threads
  - speculative control-driven threads
  - speculative data-driven threads
- Sample applications and research issues
- Summary



### **Driving Factors**

- •Ma tch upcoming technology trends
- •Ma tch upcoming software trends
- •Ma tch upcoming technology constraints
- •Ma tch upcoming design constraints
- •Lear n, and exploit, new program behaviors



Speculative Multithreaded Processors HiPC 2000, December 18-20, 2000 Slide

# Hardware/Design Trends

- Incr easing wire delays
- Incr easing memory latencies
- •Deeper pipelines
- •Design comple xity
- •V erification complexity
- •P ower issues



### **Implications of Trends**

- •Distr ibuted microarchitectures
- •Cluster ed superscalar, with multithreading
- •Chip m ultiprocessor

Question: what to run on underlying microarchitecture?



Speculative Multithreaded Processors HiPC 2000, December 18-20, 2000 Slide

# Work for Distributed/Multithreaded Processor

- Independent programs
  - increase overall processing throughput
  - works well in server environment
- Independent thr eads of multithreaded application
  - increase overall throughput
  - compatible with software trends?

#### •Rela ted threads

- e.g., for reliability
- •But what about speeding up single program execution?
  - $^{\circ}$  single program speed will continue to be important
  - how to "parallelize" or "multithread" single program?



### **Program Parallelization**

•What does it mean to parallelize?

how to divide program into multiple portions

- •What constrains parallelization?
  - dependences (especially ambiguous)
- •Ho w to overcome constraints?
  - use speculation



Speculative Multithreaded Processors HiPC 2000, December 18-20, 2000

# **Program Parallelization -- Theme I**

### •T raditional view: control-driven threads

- divide work into multiple groups of instructions
  - conservative assumptions about dependences constrain parallelization
- each group is specified using traditional control-driven (von Neumann) semantics

#### •A ne wer view: multiscalar

- use dependence speculation to overcome constraints
- commercial example: Sun MAJC



Slide

# **Multiscalar: Speculative Control-Driven Threads**





HiPC 2000, December 18-20, 2000

# **Program Parallelization -- Theme II**

#### Traditional view: data-driven threads

- divide work into (dependent) computations
- each computation is represented in a data-driven manner

#### A ne wer view: speculative data-driven threads

- use speculation to facilitate thread creation
- create threads only for important events



### Motivation for Data-driven Threads

- •Pr ogram execution: processing of low-latency instructions, with pauses for high-latency events
- •P arallelizing low-latency instructions isn't crucial
- •Ov erlapping high-latency events is what matters!
- "Thr eads" should create high-latency events early



Speculative Multithreaded Processors HiPC 2000, December 18-20, 2000 Slide 11

# **Speculative Data-Driven Threads**

- •Use dependence r elationships to isolate thread(s) of code from main program thread
  - use speculation to facilitate creation
- •Ex ecute threads (speculatively) in parallel with "main program"
  - "assist" main thread via side-effects
  - o don't impact architectural correctness



# Application: Cache Misses and Branch Mispredicts

|                       |        | Memory             |        | Control |                  |       |  |
|-----------------------|--------|--------------------|--------|---------|------------------|-------|--|
| Spec2000<br>Benchmark | # inst | % dyn.<br>memops   | % miss | # inst  | % dyn.<br>branch | %misp |  |
| bzip2                 | 24     | 3                  | 63     | 62      | 26               | 77    |  |
| crafty                | 35     | 2                  | 54     | 51      | 7                | 30    |  |
| eon                   |        | Insufficient misse | es     | 24      | 13               | 71    |  |
| gap                   | 66     | 1                  | 28     | 123     | 10               | 65    |  |
| gcc                   | 122    | 4                  | 5      | 122     | 6                | 34    |  |
| gzip                  | 15     | 21                 | 75     | 46      | 14               | 55    |  |
| mcf                   | 42     | 35                 | 69     | 32      | 20               | 71    |  |
| parser                | 70     | 4                  | 42     | 80      | 10               | 38    |  |
| perl                  | 74     | 1                  | 26     | 43      | 7                | 61    |  |
| twolf                 | 116    | 7                  | 60     | 87      | 39               | 73    |  |
| vortex                | 71     | 1                  | 22     | 83      | 1                | 41    |  |
| vpr (route)           | 55     | 13                 | 67     | 72      | 16               | 75    |  |

WISCONSIN M A D I S O N Speculative Multithreaded Processors HiPC 2000, December 18-20, 2000 Slide 13



#### Perfecting a small set of instructions provides significant performance much of that of a perfect branch predictor and data cache





# Sample Performance Results

#### VPR (ROUTE)

- •200M instruction sample (starting at 14.1B on 20B run)
- •100M instruction warm-up for caches/predictors

#### 32% SPEEDUP: 16% FROM PRE-FETCHING, 16% FROM BRANCHES

|          | Cache M   | isses (p | rimary L1) | Branch Mispredictions |      |            |
|----------|-----------|----------|------------|-----------------------|------|------------|
|          | number    | rate     | /1000 inst | number                | rate | /1000 inst |
| base     | 2,850,000 | 3.3%     | 14.3       | 1,400,000             | 7.3% | 7.0        |
| w/slices | 1,340,000 | 1.6%     | 6.7        | 420,000               | 2.2% | 2.1        |



### **Sample Applications**

- •Cache pr efetching/management
- •Computing branch outcomes
- •TLB pr efetching/management
- •I/O pr efetching
- Multipr ocessor communication management
- •Other a pplications where high-latency events need to be "created" early



Speculative Multithreaded Processors HiPC 2000, December 18-20, 2000 Slide

### Some Research Issues

- •Ho w to divide control-driven program into datadriven threads?
- •When to divide pr ogram?
- •Ho w to represent data-driven threads?
- Manag ing mixed thread workloads



#### Summary

- •Har dware and design trends will lead to distributed/multithreaded processors
- •Man y options for running different thread types on underlying microarchitecture
- •Ov ercome constraints to "parallelizdion" techniques with speculation
  - speculative control-driven threads
  - speculative data-driven threads
- •Most of the r esearch still needs to be done



Speculative Multithreaded Processors HiPC 2000, December 18-20, 2000 Slide 19