Deterministic Performance Counter Work

I have been researching the inherent determinism (and overcount) of hardware performance counters.

Currently this work has mostly been about finding deterministic events on x86_64 machines, as well as finding sources of non-determinism and overcount.

TL;DR Summary

On x86, most performance counters are not deterministic. This includes things like "userspace retired instructions" that you'd expect to be the same from run to run. This is even true if you write assembly language programs with exact numbers of instructions to avoid OS and application variations.

Why? It turns out that interrupts (both software and hardware) increment most retired-instruction counters, and interrupts usually arrive unpredictably.

Why does this happen, on both AMD and Intel?

The current theory is that the iret (interrupt return) instruction is broken into four uops, some of which finish in kernelspace and some in userspace. The uops that finish in userspace are counted as a userspace instruction, which makes the count non-deterministic. The unaffected counters (often stores or conditional branches) are ones that do not count iret instructions.
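As a rough illustration of the theory above, here is a toy model in Python. It assumes that each interrupt adds exactly one extra count to any event that counts iret; that one-per-interrupt figure is an assumption made for the sketch, not a measured result. The function names and the numbers in the usage lines are hypothetical.

```python
def modeled_count(true_count, interrupts, counts_iret=True):
    """Toy model of a userspace event count.

    Assumption (for illustration only): every interrupt return adds
    exactly one extra count to an event that counts iret, and adds
    nothing to an event that does not.
    """
    if true_count < 0 or interrupts < 0:
        raise ValueError("counts must be non-negative")
    extra = interrupts if counts_iret else 0
    return true_count + extra


def corrected_count(measured, interrupts):
    """Undo the modeled overcount, given the number of interrupts seen."""
    return measured - interrupts


# Two hypothetical runs of the same 1,000,000-instruction program
# that happen to take 3 and 7 interrupts respectively.
run_a = modeled_count(1_000_000, 3)
run_b = modeled_count(1_000_000, 7)
print(run_a, run_b)                                   # the two runs differ
print(corrected_count(run_a, 3), corrected_count(run_b, 7))

# An event that does not count iret is unaffected in this model.
print(modeled_count(250_000, 7, counts_iret=False))
```

Under this model the two runs disagree only because the interrupt counts differ, and subtracting the interrupt count recovers the same value for both. Real hardware may not behave this simply.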

Here is an ongoing paper with the most recent results:

deterministic_counters.pdf (18 March 2021)

A summary of my findings as of April 2013 can be found in the following ISPASS 2013 paper:

V.M. Weaver, D. Terpstra, S. Moore. "Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2013), Austin, Texas, April 2013.
The IEEE would prefer you obtain this paper through their IEEE Xplore interface here.
You can also view my personal copy of the paper. Warning! IEEE Copyright rules apply! ispass2013_deterministic.pdf
Here are the slides from the talk I gave at ISPASS: ispass2013_deterministic_slides.pdf

I presented an earlier version of this work at the FHPM 2010 Workshop:

A somewhat older and more disorganized list of my results, broken out individually, is presented here.

Source Code

This project has two phases. The first involved generating a large hand-coded assembly benchmark that was used to find non-determinisms in the hardware performance counters on a variety of x86_64 systems.
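Because a hand-coded assembly benchmark has a known instruction count, checking a counter comes down to comparing repeated measurements against that expected value. The Python sketch below shows one way such a comparison might look. It is not the code from the deterministic tool; the names RunSummary and summarize_runs and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    expected: int        # count the benchmark should produce
    minimum: int         # smallest count seen across runs
    maximum: int         # largest count seen across runs
    deterministic: bool  # True if every run gave the same count
    overcount: int       # smallest observed count minus expected


def summarize_runs(expected, measured):
    """Compare counts from repeated runs against the expected count."""
    if not measured:
        raise ValueError("need at least one measurement")
    lo, hi = min(measured), max(measured)
    return RunSummary(
        expected=expected,
        minimum=lo,
        maximum=hi,
        deterministic=(lo == hi),
        overcount=lo - expected,
    )


# Made-up counts from three runs of a benchmark expected to retire
# 1000 instructions.
summary = summarize_runs(1000, [1002, 1002, 1005])
print(summary)
```

A counter can be deterministic and still overcount (every run reports the same too-large value), which is why the sketch reports the two properties separately.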

The second phase is automating the code generation and adding some form of automatic search for such problems. This work is underway.

The most recent version of the tool used to gather this information can be obtained via:
git clone https://github.com/deater/deterministic

An older stable release can be downloaded: deterministic-0.23.tar.bz2 (2M) 15 February 2013

Uses of this research

The most high-profile use of this work is by the RR deterministic replay team at Mozilla.

You can read an article that references the work here.
