Using Dynamic Binary Instrumentation to Create Faster, Validated, Multi-Core Simulations

A PhD Thesis by

Vincent M. Weaver

May 2010

Abstract

The Memory Wall continues to be a problem in modern systems design. While the steady increase in processor speeds has abated somewhat, Moore's Law continues to provide more transistors to chip designers. This leads to an increase in the number of processors and threads per chip, which increases the demands on memory systems. Current simulation technology is not able to keep up, leading to sacrifices in methodology and accuracy in order to get results in reasonable time.

Because cycle-accurate simulators are so slow, various methods are used to reduce execution time. Unfortunately, these methods can introduce variations in results of 10-50% when compared to full reference input sets. Limitations of academic simulators also constrain the architectures under study, with results generated for obsolete or uninteresting systems.

We analyze the performance and accuracy of various limited-execution methodologies. We investigate how deterministic execution affects the measurement of error. We then evaluate using Dynamic Binary Instrumentation (DBI) as an alternative to cycle-accurate simulation. We compare our results to actual systems using hardware performance counters. We look first at a simple 32-bit RISC system, and then at more complex 64-bit x86-based systems. Finally, we investigate the feasibility of using the same methodology for modern multi-processor simulations.

Biographical Sketch
Dedication (in Tengwar)
Acknowledgments
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Related Work
- 2.1 Reduced Execution Validations
- 2.2 SimPoint Validation
- 2.3 Performance Counter Validation
- 2.4 Single-core DBI-Based Simulation
  - 2.4.1 Valgrind
  - 2.4.2 Pin
  - 2.4.3 Qemu
  - 2.4.4 TAXI
- 2.5 Multi-core Simulation
  - 2.5.1 CMP$im
  - 2.5.2 Other
- 2.6 Cycle-Accurate x86 Simulators
- 2.7 Simulator Validations
- 2.8 Multi-processor Phase Detection
- 2.9 Deterministic Execution
- 2.10 Performance Counter based CPI Prediction
3 Methods of Reducing Simulation Time
- 3.1 Running a Small Portion from the Beginning
- 3.2 Un-guided Fast-forwarding
- 3.3 Reduced Input Sets
- 3.4 Statistics-based Sampling
- 3.5 SimPoint
  - 3.5.1 BBV Generation
  - 3.5.2 x86 Evaluation
  - 3.5.3 x86_64 Results
  - 3.5.4 Cross-Platform MIPS Results
  - 3.5.5 Summary
- 3.6 SimPoint Limitations
4 Single-Core Validation Concerns
- 4.1 Hardware Performance Counters
  - 4.1.1 Performance Counter Evaluation
  - 4.1.2 Sources of Hardware Counter Variation
  - 4.1.3 Counter Variation Findings
  - 4.1.4 Intra-machine Results
  - 4.1.5 Inter-machine Results
- 4.2 Deterministic Execution
  - 4.2.1 Virtual Memory Layout
  - 4.2.2 System Effects
  - 4.2.3 Sources of DBI Tool Variation
- 4.3 Summary
5 32-Bit RISC Results
- 5.1 SESC Cycle-accurate Simulator
- 5.2 Reference Hardware
- 5.3 DBI-based Simulator
- 5.4 Benchmarks
- 5.5 Results
  - 5.5.1 Absolute Results
  - 5.5.2 Relative Results
  - 5.5.3 Summary
6 64-Bit CISC Results
- 6.1 RISC/CISC differences
- 6.2 Modern CPU Features
- 6.3 uop Concerns
- 6.4 Evaluation Methodology
  - 6.4.1 Valgrind DBI-based Simulator
  - 6.4.2 m5 Cycle-accurate Simulator
  - 6.4.3 Reference Hardware
  - 6.4.4 Benchmarks
- 6.5 Absolute Results
  - 6.5.1 Phase Behavior Results
  - 6.5.2 L1 Instruction Cache
  - 6.5.3 Data Accesses per Thousand Instructions
  - 6.5.4 L1 Data Cache
- 6.6 L2 Cache
- 6.7 Branch Predictor
- 6.8 CPI
- 6.9 Relative Results
  - 6.9.1 L1 Instruction Cache
  - 6.9.2 L1 Data Cache
  - 6.9.3 L2 Cache
  - 6.9.4 Branch Predictor
  - 6.9.5 CPI
- 6.10 Summary
7 Multi-Core Validation Concerns
- 7.1 Performance Counters
- 7.2 Deterministic Execution
8 Multi-Core Results
- 8.1 Methodology
  - 8.1.1 Performance Counters
  - 8.1.2 DBI Simulation
  - 8.1.3 Cycle-accurate Simulation
- 8.2 Results
- 8.3 Summary
9 Conclusion and Future Work
- 9.1 Results Summary
- 9.2 Future Work
- 9.3 Conclusion
A The Lost Art of Assembly Language Programming
- A.1 Benefits of Code Density
- A.2 Methodology
- A.3 Architectural Notes
- A.4 Code Density Findings
- A.5 Density of Compiler-Generated Binaries
- A.6 Related Work
- A.7 Conclusions and Future Work
B Cache Latencies
C Instruction Counts
D Simulation Timings
E CPI Phase Plots
- E.1 32-bit x86
- E.2 64-bit x86_64
F Multi-architecture Phase Plots
G L1 Data Cache Accesses per Instruction Phase Plots
H L1 Data Cache Accesses per uop Phase Plots
I Valgrind exp-bbv Tool Code Listing
J Qemu BBV Patch Code Listing
K R12000 Branch Predictor Kernel Module
L SESC R12000 Configuration File
Bibliography

Download: vmw_thesis.pdf (14MB)
(It is a large file due to the large number of graphs in the Appendices)
