PAPI_read() overhead reduction with rdpmc
Quick Summary
We measured a 3x to 10x reduction in PAPI_read() latency by using the perf_event
rdpmc interface rather than the traditional read() system call. The improvement
is even larger on systems running with the KPTI Meltdown mitigation enabled.
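For readers not familiar with the mechanism, below is a minimal sketch in plain C
(not PAPI's actual internals) of the two read paths being compared: the read()
system call on the perf_event file descriptor, and a pure user-space read that
combines the rdpmc instruction with the mmap'ed perf_event page, following the
seqlock protocol described in the perf_event_open(2) man page. It assumes x86
Linux and omits most error handling.

/* Sketch only: contrasts a read() syscall with an rdpmc-based user-space
 * read of the same perf_event counter.  Assumes x86 Linux; error handling
 * is abbreviated. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Read hardware performance counter 'counter' directly from user space. */
static inline uint64_t rdpmc(unsigned int counter)
{
    unsigned int low, high;
    __asm__ volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
    return (uint64_t)low | ((uint64_t)high << 32);
}

/* User-space counter read using the mmap'ed perf_event page, following the
 * seqlock protocol from the perf_event_open(2) man page. */
static uint64_t mmap_read_self(volatile struct perf_event_mmap_page *pc)
{
    uint64_t count;
    uint32_t seq, index, width;

    do {
        seq = pc->lock;
        __asm__ volatile("" ::: "memory");   /* compiler barrier */

        index = pc->index;    /* 0 means rdpmc is not usable right now */
        count = pc->offset;   /* counts accumulated while descheduled  */
        if (pc->cap_user_rdpmc && index) {
            width = pc->pmc_width;
            uint64_t pmc = rdpmc(index - 1);
            /* sign-extend from the hardware counter width */
            pmc <<= 64 - width;
            pmc = (uint64_t)((int64_t)pmc >> (64 - width));
            count += pmc;
        }

        __asm__ volatile("" ::: "memory");
    } while (pc->lock != seq);   /* retry if the kernel updated the page */

    return count;
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* The first page of the mapping holds struct perf_event_mmap_page. */
    struct perf_event_mmap_page *pc =
        mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
    if (pc == MAP_FAILED) { perror("mmap"); return 1; }

    uint64_t slow, fast;
    read(fd, &slow, sizeof(slow));   /* slow path: one syscall per read */
    fast = mmap_read_self(pc);       /* fast path: stays in user space  */

    printf("read(): %llu  rdpmc: %llu\n",
           (unsigned long long)slow, (unsigned long long)fast);
    return 0;
}

The rdpmc path avoids the user/kernel transition entirely, which is where the
savings (and the extra KPTI penalty on the read() path) come from.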
Brief Results
These are before/after measurements of PAPI with rdpmc support disabled and
enabled, taken just before the PAPI 5.6 release (which enables rdpmc by default).
Median PAPI_read() latency in rdtsc cycles (median of one million reads) and the
resulting speedup; a sketch of the timing loop appears after the table:
Vendor | Machine | read() cycles | rdpmc cycles | Speedup |
Intel | Pentium II | 2533 | 384 | 6.6x |
Intel | Pentium 4 | 3728 | 704 | 5.3x |
Intel | Core 2 | 1634 | 199 | 8.2x |
Intel | Atom | 3906 | 392 | 10.0x |
Intel | Ivybridge | 885 | 149 | 5.9x |
Intel | Haswell | 913 | 142 | 6.4x |
Intel | Haswell-EP | 820 | 125 | 6.6x |
Intel | Broadwell | 1030 | 145 | 7.1x |
Intel | Broadwell-EP | 750 | 118 | 6.4x |
Intel | Skylake | 942 | 144 | 6.5x |
AMD | fam10h Phenom II | 1252 | 205 | 6.1x |
AMD | fam15h A10 | 2457 | 951 | 2.6x |
AMD | fam15h Opteron | 2186 | 644 | 3.4x |
AMD | fam16h A8 | 1632 | 205 | 8.0x |
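The numbers above are per-call latencies; a rough sketch of the kind of timing
loop that produces them (not the exact benchmark code used for these results,
and with error checking omitted) is: wrap a single PAPI_read() in rdtsc
timestamps, repeat a million times, and report the median.

/* Rough timing-loop sketch: median rdtsc-cycle cost of one PAPI_read().
 * Error checking omitted; assumes x86 and a working PAPI installation. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <papi.h>

#define RUNS (1024*1024)

static inline uint64_t rdtsc(void)
{
    unsigned int low, high;
    __asm__ volatile("rdtsc" : "=a" (low), "=d" (high));
    return (uint64_t)low | ((uint64_t)high << 32);
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static uint64_t samples[RUNS];
    long long value;
    int eventset = PAPI_NULL;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_INS);
    PAPI_start(eventset);

    for (int i = 0; i < RUNS; i++) {
        uint64_t before = rdtsc();
        PAPI_read(eventset, &value);
        samples[i] = rdtsc() - before;
    }

    PAPI_stop(eventset, &value);

    qsort(samples, RUNS, sizeof(samples[0]), cmp_u64);
    printf("median PAPI_read(): %llu cycles\n",
           (unsigned long long)samples[RUNS/2]);
    return 0;
}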
Historical Comparison of Interfaces on a Core2 Machine
Interface | Kernel | Read latency (rdtsc cycles) | Slowdown vs perf_event rdpmc |
perf_event rdpmc | 3.16 | 199 | --- |
perfctr rdpmc | 2.6.32 | 200 | 1x |
perfmon2 read() | 2.6.30 | 1216 | 6.1x |
perf_event read() | 3.16 | 1587 | 8.0x |
perf_event read() | 4.8 | 1634 | 8.2x |
perf_event read() KPTI | 4.15-rc7 | 3173 | 15.9x |
Overhead Introduced by the Meltdown KPTI Workaround
Processor | rdpmc cycles | read() cycles, PTI=off (slowdown) | read() cycles, PTI=on (slowdown) |
Core2 | 199 | 1634 (8.2x) | 3173 (15.9x) |
Haswell | 139 | 958 (6.9x) | 1411 (10.2x) |
Skylake | 142 | 978 (6.9x) | 1522 (10.7x) |
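For reference, a quick (sketch-level) way to tell whether a machine is running
with KPTI, and therefore likely to show the higher read() numbers: kernels 4.15
and later expose the Meltdown mitigation status in sysfs, and a PTI=off
configuration can typically be obtained by booting with the pti=off or nopti
kernel parameters.

/* Sketch: report whether the kernel says KPTI is active (kernel 4.15+). */
#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/meltdown", "r");

    if (f && fgets(line, sizeof(line), f))
        printf("meltdown status: %s", line);   /* e.g. "Mitigation: PTI" */
    else
        printf("no meltdown entry in sysfs (kernel older than 4.15?)\n");

    if (f)
        fclose(f);
    return 0;
}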
Full Results, Papers/Slides
Raw Data
The raw data can be found in this git repository:
Older Results
Some much older results can be found here.