PAPI_read() overhead reduction with rdpmc

Quick Summary

We found a roughly 3x to 10x reduction in latency by using the perf_event rdpmc interface in PAPI_read() rather than the traditional read() system call. The improvement is even larger on systems with the KPTI Meltdown mitigation patch installed.
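
To make the difference between the two read paths concrete, here is a minimal sketch of both, assuming Linux on x86 and following the self-monitoring pattern described in the perf_event_open(2) manual page. It is not PAPI's actual code: the helper names (open_counter, map_counter, read_syscall, read_rdpmc) are hypothetical, and the choice of an instructions-retired event is only for illustration.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a self-monitoring, user-space-only instructions event (hypothetical
 * helper). Returns a file descriptor, or -1 if the kernel refuses. */
int open_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    /* pid = 0 (this thread), cpu = -1 (any), no group leader, no flags */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Map the event's metadata page; returns MAP_FAILED on error. */
void *map_counter(int fd)
{
    return mmap(NULL, (size_t)sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
}

/* Slow path: one system call per read. Returns 0 on success. */
int read_syscall(int fd, uint64_t *value)
{
    return read(fd, value, sizeof(*value)) == (ssize_t)sizeof(*value) ? 0 : -1;
}

#if defined(__x86_64__) || defined(__i386__)
static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* Fast path: no kernel entry. Returns 0 on success, or -1 when the kernel
 * does not currently allow user-space rdpmc for this event. */
int read_rdpmc(volatile struct perf_event_mmap_page *pc, uint64_t *value)
{
    uint32_t seq, idx;
    int64_t count;

    do {
        seq = pc->lock;              /* sequence lock: retry if it changes */
        __sync_synchronize();
        idx = pc->index;             /* hardware counter number plus one */
        count = pc->offset;          /* kernel-maintained base value */
        if (!pc->cap_user_rdpmc || idx == 0)
            return -1;
        /* Sign-extend the raw counter from pmc_width bits (relies on the
         * arithmetic right shift that gcc and clang provide). */
        uint64_t raw = rdpmc(idx - 1) << (64 - pc->pmc_width);
        count += (int64_t)raw >> (64 - pc->pmc_width);
        __sync_synchronize();
    } while (pc->lock != seq);

    *value = (uint64_t)count;
    return 0;
}
#else
/* rdpmc is an x86 instruction; other architectures always take the slow path. */
int read_rdpmc(volatile struct perf_event_mmap_page *pc, uint64_t *value)
{
    (void)pc;
    (void)value;
    return -1;
}
#endif
```

A caller would open the event once, map the page once, and then try read_rdpmc() on every read, falling back to read_syscall() whenever it returns -1. The fast path touches only the mapped page and one unprivileged instruction, so it never enters the kernel.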

Brief Results

These are before/after measurements of PAPI with and without rdpmc support enabled, taken just before the PAPI 5.6 release (which enabled rdpmc by default).

Median latency in rdtsc cycles (across a million runs), and the resulting speedup:

Vendor  Machine           read() cycles  rdpmc cycles  Speedup
Intel   Pentium II        2533           384           6.6x
Intel   Pentium 4         3728           704           5.3x
Intel   Core 2            1634           199           8.2x
Intel   Atom              3906           392           10.0x
Intel   Ivybridge         885            149           5.9x
Intel   Haswell           913            142           6.4x
Intel   Haswell-EP        820            125           6.6x
Intel   Broadwell         1030           145           7.1x
Intel   Broadwell-EP      750            118           6.4x
Intel   Skylake           942            144           6.5x
AMD     fam10h Phenom II  1252           205           6.1x
AMD     fam15h A10        2457           951           2.6x
AMD     fam15h Opteron    2186           644           3.4x
AMD     fam16h A8         1632           205           8.0x
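
As an illustration of how a median over many timed runs can be collected, here is a minimal sketch. It is not the harness used for the numbers above: the names cycles_now, median_u64 and median_cost are hypothetical, the timestamp is read with the __rdtsc() compiler intrinsic without any serializing instruction, and the non-x86 fallback reports nanoseconds rather than cycles.

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t cycles_now(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: nanoseconds, not cycles. */
static inline uint64_t cycles_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and returns the median
 * (the lower of the two middle values when n is even). Requires n > 0. */
uint64_t median_u64(uint64_t *samples, size_t n)
{
    qsort(samples, n, sizeof(samples[0]), cmp_u64);
    return samples[(n - 1) / 2];
}

/* Times `runs` calls of op(arg) and returns the median cost in timestamp
 * ticks. Returns 0 if runs is 0 or the sample buffer cannot be allocated. */
uint64_t median_cost(void (*op)(void *), void *arg, size_t runs)
{
    if (runs == 0)
        return 0;
    uint64_t *samples = malloc(runs * sizeof(*samples));
    if (samples == NULL)
        return 0;
    for (size_t i = 0; i < runs; i++) {
        uint64_t start = cycles_now();
        op(arg);
        samples[i] = cycles_now() - start;
    }
    uint64_t med = median_u64(samples, runs);
    free(samples);
    return med;
}
```

The median is less sensitive than the mean to the occasional run that is inflated by an interrupt or a cache miss, which matters when individual samples are only a few hundred cycles long.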

Historical Comparison of Interfaces on a Core2 Machine
Interface               Kernel    Read results (cycles)  Slowdown vs perf_event rdpmc
perf_event rdpmc        3.16      199                    ---
perfctr rdpmc           2.6.32    200                    1x
perfmon2 read()         2.6.30    1216                   6.1x
perf_event read()       3.16      1587                   8.0x
perf_event read()       4.8       1634                   8.2x
perf_event read() KPTI  4.15-rc7  3173                   15.9x

Overhead Introduced by Meltdown KPTI workaround
Processor  rdpmc results  read() PTI=off results  read() PTI=on results
Core2      199            1634 (8.2x)             3173 (15.9x)
Haswell    139            958 (6.9x)              1411 (10.2x)
Skylake    142            978 (6.9x)              1522 (10.7x)
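
The parenthesized factors above are the read() latency divided by the rdpmc latency on the same machine. The same division applied to the two read() columns gives the cost added by KPTI alone. A minimal sketch of that arithmetic follows; the function names latency_ratio and print_kpti_row are hypothetical and the output format is arbitrary.

```c
#include <stdio.h>

/* Ratio of two median latencies, e.g. read() cycles over rdpmc cycles. */
double latency_ratio(double slower_cycles, double faster_cycles)
{
    return slower_cycles / faster_cycles;
}

/* Prints the slowdown of read() relative to rdpmc with PTI off and on,
 * followed by the slowdown caused by turning PTI on. */
void print_kpti_row(const char *cpu, double rdpmc, double pti_off, double pti_on)
{
    printf("%-8s off: %.1fx  on: %.1fx  on/off: %.2fx\n",
           cpu,
           latency_ratio(pti_off, rdpmc),
           latency_ratio(pti_on, rdpmc),
           latency_ratio(pti_on, pti_off));
}
```

For example, print_kpti_row("Core2", 199, 1634, 3173) divides 1634 by 199, 3173 by 199, and 3173 by 1634, so the last figure shows how much of the read() cost on that machine comes from the KPTI workaround itself.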

Full Results, Papers/Slides

Raw Data

The raw data can be found in this git repository:

Older Results

Some much older results can be found here