PAPI_read() overhead reduction with rdpmc

Quick Summary

We found that using the perf_event rdpmc interface in PAPI_read(), rather than the traditional read() system call, reduces latency by roughly 3x to 10x. The improvement is even larger on systems that have the KPTI Meltdown-mitigation patch installed.

Brief Results

These are before/after results for PAPI with rdpmc support disabled and enabled, measured just before the PAPI 5.6 release (which enabled rdpmc by default).

Median latency in rdtsc cycles (across a million runs), with the resulting speedup:

Vendor	Machine	read() cycles	rdpmc cycles	Speedup
Intel	Pentium II	2533	384	6.6x
Intel	Pentium 4	3728	704	5.3x
Intel	Core 2	1634	199	8.2x
Intel	Atom	3906	392	10.0x
Intel	Ivybridge	885	149	5.9x
Intel	Haswell	913	142	6.4x
Intel	Haswell-EP	820	125	6.6x
Intel	Broadwell	1030	145	7.1x
Intel	Broadwell-EP	750	118	6.4x
Intel	Skylake	942	144	6.5x
AMD	fam10h Phenom II	1252	205	6.1x
AMD	fam15h A10	2457	951	2.6x
AMD	fam15h Opteron	2186	644	3.4x
AMD	fam16h A8	1632	205	8.0x
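
The Speedup column is the ratio of the median read() latency to the median rdpmc latency. As a minimal sketch (this is not the actual measurement harness; the `median_speedup` helper and the sample values are made up for illustration), reducing raw per-call timings to one table row could look like this:

```python
from statistics import median

def median_speedup(read_cycles, rdpmc_cycles):
    """Return (median read() cycles, median rdpmc cycles, speedup ratio)."""
    read_med = median(read_cycles)
    rdpmc_med = median(rdpmc_cycles)
    return read_med, rdpmc_med, read_med / rdpmc_med

# Hypothetical per-call timings; a real run would have a million samples each.
# The large outliers stand in for interrupts or cache misses, which shift the
# mean considerably but leave the median unchanged.
read_samples = [1630, 1634, 1634, 1640, 9000]
rdpmc_samples = [198, 199, 199, 201, 2500]

read_med, rdpmc_med, speedup = median_speedup(read_samples, rdpmc_samples)
print(f"{read_med} {rdpmc_med} {speedup:.1f}x")  # prints "1634 199 8.2x"
```

The sample values were chosen so that the medians match the Core 2 row; only the ratio calculation itself is taken from the table.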

Historical Comparison of Interfaces on a Core2 Machine

Interface	Kernel	Read latency (cycles)	Slowdown vs perf_event rdpmc
perf_event rdpmc	3.16	199	---
perfctr rdpmc	2.6.32	200	1x
perfmon2 read()	2.6.30	1216	6.1x
perf_event read()	3.16	1587	8.0x
perf_event read()	4.8	1634	8.2x
perf_event read() KPTI	4.15-rc7	3173	15.9x

Overhead Introduced by the Meltdown KPTI Workaround

Processor	rdpmc cycles	read() PTI=off cycles (slowdown vs rdpmc)	read() PTI=on cycles (slowdown vs rdpmc)
Core2	199	1634 (8.2x)	3173 (15.9x)
Haswell	139	958 (6.9x)	1411 (10.2x)
Skylake	142	978 (6.9x)	1522 (10.7x)
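
The parenthesized figures in the table are slowdowns relative to rdpmc. The cost that KPTI itself adds to read() can be derived from the same numbers by comparing the PTI=on and PTI=off columns. A small sketch of that arithmetic, using only values from the table (the `kpti_overhead` helper name is illustrative):

```python
# (rdpmc, read() with PTI=off, read() with PTI=on) in cycles, from the table
RESULTS = {
    "Core2": (199, 1634, 3173),
    "Haswell": (139, 958, 1411),
    "Skylake": (142, 978, 1522),
}

def kpti_overhead(pti_off, pti_on):
    """Return (extra cycles, slowdown factor) that PTI=on adds to read()."""
    return pti_on - pti_off, pti_on / pti_off

for cpu, (_rdpmc, pti_off, pti_on) in RESULTS.items():
    extra, factor = kpti_overhead(pti_off, pti_on)
    print(f"{cpu}: +{extra} cycles per read(), {factor:.2f}x vs PTI=off")
```

On these three machines, enabling PTI adds between 453 and 1539 cycles to each read(), which is about 1.5x to 1.9x the PTI=off cost.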

Full Results, Papers/Slides

Slides from ESPT'17 Workshop presentation:
espt2017_weaver_slides.pdf
Paper from ESPT'17 Workshop:
2017_espt_draft.pdf (22 Jan 2018, 324k)
Yan Liu's Masters Thesis:
Optimizing PAPI for Low-Overhead Counter Measurement (December 2017, 3MB)

Raw Data

The raw data can be found in this git repository:

git clone https://github.com/deater/papi_performance

Older Results

Some much older results can be found here