PAPI_read() overhead reduction with rdpmc
Quick Summary
We found a roughly 3x to 10x reduction in latency when PAPI_read() uses the
perf_event rdpmc interface rather than the traditional read() system call.
The improvement is even larger on systems that have the KPTI
Meltdown-mitigation patches enabled.
Brief Results
These are before/after measurements of PAPI with and without rdpmc support
enabled, taken just before the PAPI 5.6 release (which enabled rdpmc by default).
Median read latency in rdtsc cycles (across a million runs), with the resulting speedup:
| Vendor | Machine | read() cycles | rdpmc cycles | Speedup |
|---|---|---|---|---|
| Intel | Pentium II | 2533 | 384 | 6.6x |
| Intel | Pentium 4 | 3728 | 704 | 5.3x |
| Intel | Core 2 | 1634 | 199 | 8.2x |
| Intel | Atom | 3906 | 392 | 10.0x |
| Intel | Ivybridge | 885 | 149 | 5.9x |
| Intel | Haswell | 913 | 142 | 6.4x |
| Intel | Haswell-EP | 820 | 125 | 6.6x |
| Intel | Broadwell | 1030 | 145 | 7.1x |
| Intel | Broadwell-EP | 750 | 118 | 6.4x |
| Intel | Skylake | 942 | 144 | 6.5x |
| AMD | fam10h Phenom II | 1252 | 205 | 6.1x |
| AMD | fam15h A10 | 2457 | 951 | 2.6x |
| AMD | fam15h Opteron | 2186 | 644 | 3.4x |
| AMD | fam16h A8 | 1632 | 205 | 8.0x |
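The Speedup column is the median read() latency divided by the median rdpmc latency. A minimal sketch of that calculation follows, assuming the per-call cycle counts are available as plain lists. The function name `median_speedup` and the sample values are hypothetical and are not taken from the raw data.

```python
from statistics import median

def median_speedup(read_cycles, rdpmc_cycles):
    """Return (median read, median rdpmc, speedup) for two lists of per-call cycle counts."""
    read_med = median(read_cycles)
    rdpmc_med = median(rdpmc_cycles)
    return read_med, rdpmc_med, read_med / rdpmc_med

# Hypothetical samples; a real run would hold about a million values per list.
read_samples = [1630, 1634, 1634, 1702, 9050]   # 9050 stands in for an outlier
rdpmc_samples = [198, 199, 199, 201, 240]

read_med, rdpmc_med, speedup = median_speedup(read_samples, rdpmc_samples)
print(f"read(): {read_med}, rdpmc: {rdpmc_med}, speedup: {speedup:.1f}x")
# prints "read(): 1634, rdpmc: 199, speedup: 8.2x"
```

A median is less sensitive than a mean to the occasional very slow sample, which is presumably why it is the statistic reported here.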
Historical Comparison of Interfaces on a Core2 Machine
| Interface | Kernel | Read latency (cycles) | Slowdown vs perf_event rdpmc |
|---|---|---|---|
| perf_event rdpmc | 3.16 | 199 | --- | 
| perfctr rdpmc | 2.6.32 | 200 | 1x | 
| perfmon2 read() | 2.6.30 | 1216 | 6.1x | 
| perf_event read() | 3.16 | 1587 | 8.0x | 
| perf_event read() | 4.8 | 1634 | 8.2x | 
| perf_event read() KPTI | 4.15-rc7 | 3173 | 15.9x | 
Overhead Introduced by Meltdown KPTI workaround
| Processor | rdpmc cycles | read() cycles, PTI=off (vs rdpmc) | read() cycles, PTI=on (vs rdpmc) |
|---|---|---|---|
| Core2 | 199 | 1634 (8.2x) | 3173 (15.9x) | 
| Haswell | 139 | 958 (6.9x) | 1411 (10.2x) | 
| Skylake | 142 | 978 (6.9x) | 1522 (10.7x) | 
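The cost that KPTI adds to the read() path can be isolated by comparing the two read() columns directly. The sketch below does that arithmetic with the values from the table above; the helper name `pti_overhead` is hypothetical.

```python
# Median read() latencies in cycles from the table above: (PTI off, PTI on).
read_cycles = {
    "Core2": (1634, 3173),
    "Haswell": (958, 1411),
    "Skylake": (978, 1522),
}

def pti_overhead(pti_off, pti_on):
    """Return (extra cycles per read, slowdown factor) from enabling PTI."""
    return pti_on - pti_off, pti_on / pti_off

for cpu, (off, on) in read_cycles.items():
    extra, factor = pti_overhead(off, on)
    print(f"{cpu}: +{extra} cycles per read(), {factor:.2f}x slower with PTI on")
# prints:
# Core2: +1539 cycles per read(), 1.94x slower with PTI on
# Haswell: +453 cycles per read(), 1.47x slower with PTI on
# Skylake: +544 cycles per read(), 1.56x slower with PTI on
```

The table lists a single rdpmc figure per processor. The rdpmc instruction executes in user space, so it should not incur the kernel entry and exit cost that KPTI increases.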
Full Results, Papers/Slides
Raw Data
The raw data can be found in this git repository:
Older Results
Some much older results can be found here