

# Multiprocessor Kernel Performance Profiling

#### Alex Mirgorodskii

mirg@cs.wisc.edu

Computer Sciences Department
University of Wisconsin
1210 W. Dayton Street
Madison, WI 53706-1685
USA

Kperfmon-MP March 12, 2001



## Kperfmon: Overview

- Specify a resource
  - Almost any function or basic block in the kernel
- Apply a *metric* to the resource:
  - Number of entries to a function or basic block
  - Wall clock time, CPU time (virtual time)
  - All SPARC hardware counters: cache misses, branch mispredictions, instructions per cycle, ...
- Visualize the metric data in real time



## Kperfmon-MP: Goals

#### Modify uniprocessor Kperfmon to provide:

- Safe operation on SMP machines
  - Thread safety
  - Migration safety
- New feature: Per-CPU performance data
  - More detailed performance data
  - Reduce cache coherence traffic caused by the tool



## Kperfmon: Technology

- Use the *KernInst* framework to:
  - Insert measurement code in the kernel at run time
  - Sample accumulated metric values from the user space periodically
- No need for kernel recompilation
  - Works with stock SPARC Solaris 7 kernels
  - Supports both 32-bit and 64-bit kernels
- No need for rebooting
  - Important for 24 x 7 systems



## Kperfmon System





## Kperfmon instrumentation

- Counter primitive
  - Number of entries to a function or a basic block
- Wall clock timer primitive
  - Real time spent in a function
- CPU timer primitive
  - Excludes time while the thread is switched out
  - Can count more than just timer ticks
    - All HW-counter metrics use this mechanism



#### Non-MP Counter primitive

[Figure: Code Patch Area and Data Area holding the counter `cnt`]

- Atomic, thread-safe update
- Lightweight
- No register save/restore required
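
The properties listed above can be illustrated with a minimal C11 sketch. It assumes the counter is updated with a compare-and-swap retry loop, which is one common way to get an atomic increment. The names `kp_counter`, `kp_counter_inc`, and `kp_counter_read` are hypothetical and not part of Kperfmon; the real primitive is SPARC code placed in the code patch area.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the "cnt" word in the data area. */
typedef struct {
    _Atomic uint64_t cnt;
} kp_counter;

/* Increment with a compare-and-swap retry loop. If another thread
 * changed cnt between our load and our CAS, the CAS fails, 'old' is
 * refreshed with the current value, and we try again. */
static void kp_counter_inc(kp_counter *c)
{
    assert(c != NULL);
    uint64_t old = atomic_load(&c->cnt);
    while (!atomic_compare_exchange_weak(&c->cnt, &old, old + 1)) {
        /* 'old' now holds the latest value; retry with it. */
    }
}

/* Read the accumulated count, as a user-space sampler might. */
static uint64_t kp_counter_read(kp_counter *c)
{
    return atomic_load(&c->cnt);
}
```

No lock is taken, so the update stays lightweight and never blocks inside the instrumented kernel code.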

#### Non-MP Wall clock timer primitive



- Inclusive (includes time in callees)
- Keeps accumulating while the thread is switched out





#### Non-MP CPU timer primitive

- Exclude the time spent while switched out
  - Instrument context switch routines
- HW counter metrics are based on this mechanism
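
The pause-and-resume idea behind the CPU timer can be sketched in C as follows. This is a simplified, single-threaded illustration under stated assumptions: the type `kp_timer` and the four functions are hypothetical names, and the `now` argument stands for whatever monotonically increasing value is being measured (timer ticks or a hardware counter reading). The two `kp_on_switch_*` functions represent the code that would be inserted into the context switch routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t total;    /* units accumulated while the thread was running */
    uint64_t start;    /* value of 'now' when the timer last (re)started */
    bool     running;  /* currently accumulating */
    bool     paused;   /* stopped by a switch-out, not by function exit */
} kp_timer;

/* Function entry: start measuring. */
static void kp_timer_start(kp_timer *t, uint64_t now)
{
    t->start = now;
    t->running = true;
    t->paused = false;
}

/* Function exit: add the last running interval to the total. */
static void kp_timer_stop(kp_timer *t, uint64_t now)
{
    assert(t->running);
    t->total += now - t->start;
    t->running = false;
}

/* Switch-out hook: bank the elapsed interval and pause the timer. */
static void kp_on_switch_out(kp_timer *t, uint64_t now)
{
    if (t->running) {
        t->total += now - t->start;
        t->running = false;
        t->paused = true;
    }
}

/* Switch-in hook: resume a timer that was paused at switch-out. */
static void kp_on_switch_in(kp_timer *t, uint64_t now)
{
    if (t->paused) {
        t->start = now;
        t->running = true;
        t->paused = false;
    }
}
```

A wall clock timer would simply skip the two switch hooks, so the interval spent switched out would be included in its total.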











#### Thread Safety



Non-MP timer allocation routine

ld [head], R1
add R1, 4, R1
st R1, [head]

- Used on switch-out to save the paused timers
- Context switch is serial on uniprocessors
  - No thread safety problems there
- Context switches may be concurrent on SMPs!
  - Multiple threads are being scheduled simultaneously
  - The allocation code is no longer safe



#### Thread Safety



#### MP timer allocation routine

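```
alloc:
    ld  [head], R1       ! R1 = current head
    add R1, 4, R2        ! R2 = proposed new head
    cas [head], R1, R2   ! if [head] == R1, store R2; R2 = old [head]
    cmp R1, R2           ! did head still hold the value we read?
    bne alloc            ! no: another CPU allocated first, retry
```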

- Context switches may be concurrent on SMPs
- Use the atomic cas instruction to ensure safety
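
For readers less familiar with SPARC assembly, the same retry loop can be written with C11 atomics. This is only a sketch: `kp_head`, `kp_alloc_slot`, and `KP_SLOT_SIZE` are hypothetical names, and the 4-byte step simply mirrors the `add R1, 4` in the routine above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define KP_SLOT_SIZE 4u   /* bytes per slot, matching "add R1, 4" */

/* Offset of the next free slot; shared by all CPUs. */
static _Atomic uint32_t kp_head;

/* Reserve one slot and return its offset. The compare-and-swap only
 * succeeds if kp_head still holds the value we read; otherwise 'old'
 * is refreshed and the loop retries, so two concurrent callers can
 * never receive the same slot. */
static uint32_t kp_alloc_slot(void)
{
    uint32_t old = atomic_load(&kp_head);
    while (!atomic_compare_exchange_weak(&kp_head, &old,
                                         old + KP_SLOT_SIZE)) {
        /* lost the race; 'old' was updated, try again */
    }
    assert(old % KP_SLOT_SIZE == 0);  /* sanity check on the offset */
    return old;
}
```

The plain load/add/store version would let two CPUs read the same `head` and hand out the same slot twice; the CAS detects that interleaving and forces a retry.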



#### Per-CPU performance data



Code Patch Area

rd cpu#, r0
ldx cnt[r0], r1
add r1, 1, r2
casx r2, cnt[r0]



- Instrumentation code is shared by all CPUs
- Per-CPU copies of the primitive's data
  - Two copies are never placed in the same cache line
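
A possible data layout for per-CPU counters is sketched below in C11. The assumptions are labeled in the code: a 64-byte cache line, four CPUs, and a CPU index passed in as a parameter in place of the `rd cpu#` step. All identifiers (`kp_percpu_slot`, `kp_percpu_inc`, `kp_percpu_total`) are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define KP_NCPU       4    /* assumed number of CPUs */
#define KP_CACHE_LINE 64   /* assumed cache line size in bytes */

/* One copy of the counter per CPU. The alignment pads each copy to a
 * full cache line, so two copies never share a line. */
typedef struct {
    alignas(KP_CACHE_LINE) _Atomic uint64_t cnt;
} kp_percpu_slot;

typedef struct {
    kp_percpu_slot slot[KP_NCPU];
} kp_percpu_counter;

/* Shared instrumentation code: pick this CPU's copy, then update it
 * with a CAS loop. The CAS still matters because a thread can be
 * preempted between the load and the store. */
static void kp_percpu_inc(kp_percpu_counter *c, unsigned cpu)
{
    assert(cpu < KP_NCPU);
    _Atomic uint64_t *p = &c->slot[cpu].cnt;
    uint64_t old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + 1)) {
        /* 'old' refreshed by the failed CAS; retry */
    }
}

/* Sampler side: sum the per-CPU copies to get the global count. */
static uint64_t kp_percpu_total(kp_percpu_counter *c)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < KP_NCPU; i++)
        sum += atomic_load(&c->slot[i].cnt);
    return sum;
}
```

Because each CPU writes only its own line, increments on different CPUs do not invalidate each other's cached copies, which is the coherence-traffic saving the slide refers to.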





- A thread can migrate between CPUs: a wall timer started on CPU0 may be stopped on CPU1
- Counters and CPU timers are not affected










































#### Conclusion

- Techniques for correct MP profiling:
  - Atomic memory updates to ensure thread safety
  - Virtualized timers to handle thread migration
- Per-CPU data collection is important
  - Provides detailed performance information
  - Introduces fewer coherence cache misses



#### Future Work

- New metrics
  - Locality of CPU assignments
  - Per-thread performance data
- Formal verification of instrumentation code for migration/preemption problems
- Ports to other architectures and OSes



## The Big Picture



http://www.cs.wisc.edu/paradyn






- Demo: Wednesday, Room 6372
- Available for download on request
  - mailto: mirg@cs.wisc.edu
  - Public release in April