Using Dynamic Binary Instrumentation to Create Faster, Validated,
Multi-Core Simulations
A PhD Thesis by
Vincent M. Weaver
May 2010
Abstract
The Memory Wall continues to be a problem
with modern systems design. While the steady increase in processor
speeds has abated somewhat, Moore's Law continues
to provide more transistors to chip designers. This leads
to an increase in the number of processors and threads per chip,
which increases the demands on memory systems. Current simulation
technology is not able to keep up, leading to sacrifices in
methodology and accuracy in order to get results in reasonable time.
Because cycle-accurate simulators are so slow, various methods
for reducing execution time can be used. Unfortunately these
methods can introduce variations in results of between 10% and 50%
when compared to full reference input sets.
Limitations of academic simulators also constrain the architectures
under study, with results generated for obsolete or uninteresting
systems.
We analyze the performance and accuracy of various limited-execution methodologies.
We investigate how deterministic execution affects the measurement
of error. We then evaluate using Dynamic Binary Instrumentation (DBI) as
an alternative to cycle-accurate simulation.
We compare our results to actual systems using hardware performance counters.
We look first at a simple 32-bit RISC system, and then
look at more complex 64-bit x86 based systems.
Finally we investigate the feasibility of using the same methodology
for modern multi-processor simulations.
Table of Contents
- Biographical Sketch
- Dedication (in Tengwar)
- Acknowledgments
- Table of Contents
- List of Tables
- List of Figures
- 1 Introduction
- 2 Related Work
- 2.1 Reduced Execution Validations
- 2.2 SimPoint Validation
- 2.3 Performance Counter Validation
- 2.4 Single-core DBI-Based Simulation
- 2.4.1 Valgrind
- 2.4.2 Pin
- 2.4.3 Qemu
- 2.4.4 TAXI
- 2.5 Multi-core Simulation
- 2.6 Cycle-Accurate x86 Simulators
- 2.7 Simulator Validations
- 2.8 Multi-processor Phase Detection
- 2.9 Deterministic Execution
- 2.10 Performance Counter based CPI Prediction
- 3 Methods of Reducing Simulation Time
- 3.1 Running a Small Portion from the Beginning
- 3.2 Un-guided Fast-forwarding
- 3.3 Reduced Input Sets
- 3.4 Statistics-based Sampling
- 3.5 SimPoint
- 3.5.1 BBV Generation
- 3.5.2 x86 Evaluation
- 3.5.3 x86_64 Results
- 3.5.4 Cross-Platform MIPS Results
- 3.5.5 Summary
- 3.6 SimPoint Limitations
- 4 Single-Core Validation Concerns
- 4.1 Hardware Performance Counters
- 4.1.1 Performance Counter Evaluation
- 4.1.2 Sources of Hardware Counter Variation
- 4.1.3 Counter Variation Findings
- 4.1.4 Intra-machine results
- 4.1.5 Inter-machine Results
- 4.2 Deterministic Execution
- 4.2.1 Virtual Memory Layout
- 4.2.2 System Effects
- 4.2.3 Sources of DBI Tool Variation
- 4.3 Summary
- 5 32-Bit RISC Results
- 5.1 SESC Cycle-accurate Simulator
- 5.2 Reference Hardware
- 5.3 DBI-based Simulator
- 5.4 Benchmarks
- 5.5 Results
- 5.5.1 Absolute Results
- 5.5.2 Relative Results
- 5.5.3 Summary
- 6 64-Bit CISC Results
- 6.1 RISC/CISC differences
- 6.2 Modern CPU Features
- 6.3 uop Concerns
- 6.4 Evaluation Methodology
- 6.4.1 Valgrind DBI-based Simulator
- 6.4.2 m5 Cycle-accurate Simulator
- 6.4.3 Reference Hardware
- 6.4.4 Benchmarks
- 6.5 Absolute Results
- 6.5.1 Phase Behavior Results
- 6.5.2 L1 Instruction Cache
- 6.5.3 Data Accesses per Thousand Instructions
- 6.5.4 L1 Data Cache
- 6.6 L2 Cache
- 6.7 Branch Predictor
- 6.8 CPI
- 6.9 Relative Results
- 6.9.1 L1 Instruction Cache
- 6.9.2 L1 Data Cache
- 6.9.3 L2 Cache
- 6.9.4 Branch Predictor
- 6.9.5 CPI
- 6.10 Summary
- 7 Multi-Core Validation Concerns
- 7.1 Performance Counters
- 7.2 Deterministic Execution
- 8 Multi-Core Results
- 8.1 Methodology
- 8.1.1 Performance Counters
- 8.1.2 DBI Simulation
- 8.1.3 Cycle-accurate Simulation
- 8.2 Results
- 8.3 Summary
- 9 Conclusion and Future Work
- 9.1 Results Summary
- 9.2 Future Work
- 9.3 Conclusion
- A The Lost Art of Assembly Language Programming
- A.1 Benefits of Code Density
- A.2 Methodology
- A.3 Architectural Notes
- A.4 Code Density Findings
- A.5 Density of Compiler-Generated Binaries
- A.6 Related Work
- A.7 Conclusions and Future Work
- B Cache Latencies
- C Instruction Counts
- D Simulation Timings
- E CPI Phase Plots
- E.1 32-bit x86
- E.2 64-bit x86_64
- F Multi-architecture Phase Plots
- G L1 Data Cache Accesses per Instruction Phase Plots
- H L1 Data Cache Accesses per uop Phase Plots
- I Valgrind exp-bbv Tool Code Listing
- J Qemu BBV Patch Code Listing
- K R12000 Branch Predictor Kernel Module
- L SESC R12000 Configuration File
- Bibliography
Download: vmw_thesis.pdf (14MB)
(The file is large because of the many graphs in the appendices.)