Floating Point Operations
PAPI_FP_OPS
Nehalem
I made a trivial matrix-matrix multiply example and ran
some tests on Nehalem, in both 32-bit and 64-bit builds. The
results are below (the expected count is roughly 270M,
so the measured values are close).
Event                                      64 bit       32 bit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~  ~~~~~~~~~~~
FP_COMP_OPS_EXE:MMX                             0            0
FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION  270,212,912            0
FP_COMP_OPS_EXE:SSE_FP                271,052,506            0
FP_COMP_OPS_EXE:SSE_FP_PACKED                   0            0
FP_COMP_OPS_EXE:SSE_FP_SCALAR         269,182,020            0
FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION            0            0
FP_COMP_OPS_EXE:SSE2_INTEGER               12,524       11,005
FP_COMP_OPS_EXE:X87                           348  269,146,335
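
As a rough sanity check on the "roughly 270M" figure, here is a minimal Python sketch of the flop count for a naive matrix-matrix multiply. The matrix size N = 512 is an assumption (the size used in the tests is not stated here); it is chosen only because 2*N^3 lands near the measured counts. The function names are illustrative and not part of PAPI.

```python
def matmul_flops(n):
    """Run a naive n x n matrix multiply and count the flops it performs."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    flops = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]  # one multiply and one add
                flops += 2
            c[i][j] = acc
    return flops


def expected_flops(n):
    """Closed form for the triple loop above: 2 * n^3."""
    return 2 * n ** 3


# Check the closed form on a small size so the pure-Python loop stays fast.
assert matmul_flops(16) == expected_flops(16)

N = 512  # ASSUMPTION: the matrix size is not given in the text
MEASURED_SSE_DOUBLE = 270_212_912  # 64-bit SSE_DOUBLE_PRECISION from the table

expected = expected_flops(N)
rel_diff = (MEASURED_SSE_DOUBLE - expected) / expected
print(f"expected {expected:,}  measured {MEASURED_SSE_DOUBLE:,}  diff {rel_diff:.2%}")
```

Under that assumed size, the measured 64-bit count is within 1% of the closed-form value.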
Currently on Nehalem:
  PAPI_FP_OPS = FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION +
                FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION
  PAPI_FP_INS = FP_COMP_OPS_EXE:SSE_FP
You'll notice that on 64-bit PAPI_FP_INS gives roughly the same
result as PAPI_FP_OPS while using one fewer counter. So in theory
we can change PAPI_FP_OPS to be
  PAPI_FP_OPS = FP_COMP_OPS_EXE:SSE_FP +
                FP_COMP_OPS_EXE:X87
and get results that work for both 32-bit and 64-bit.
And yes, it looks like FP_COMP_OPS_EXE does not allow multiple umasks
to be combined in a single counter, which would have solved all of
our problems.
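
To make the comparison concrete, the sketch below evaluates the current and the proposed PAPI_FP_OPS definitions against the double-precision counts from the table above. The numbers are copied from that table; the helper names are made up for illustration and are not PAPI API. Since each event was measured in its own run, the sums are only indicative.

```python
# Counts copied from the double-precision table (one run per event).
COUNTS = {
    "64-bit": {
        "SSE_SINGLE_PRECISION": 0,
        "SSE_DOUBLE_PRECISION": 270_212_912,
        "SSE_FP": 271_052_506,
        "X87": 348,
    },
    "32-bit": {
        "SSE_SINGLE_PRECISION": 0,
        "SSE_DOUBLE_PRECISION": 0,
        "SSE_FP": 0,
        "X87": 269_146_335,
    },
}


def current_fp_ops(c):
    """Current preset: SSE_SINGLE_PRECISION + SSE_DOUBLE_PRECISION."""
    return c["SSE_SINGLE_PRECISION"] + c["SSE_DOUBLE_PRECISION"]


def proposed_fp_ops(c):
    """Proposed preset: SSE_FP + X87."""
    return c["SSE_FP"] + c["X87"]


for build, c in COUNTS.items():
    print(f"{build}: current {current_fp_ops(c):>11,}  proposed {proposed_fp_ops(c):>11,}")
```

With the current definition the 32-bit build reports 0, while the proposed definition stays near 270M on both builds.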
I ran further tests for matrix matrix multiply, using the "float" type
instead of "double" to see what happens when using single precision.
Event                                 64-bit double   64-bit float
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~  ~~~~~~~~~~~~~
FP_COMP_OPS_EXE:MMX                               0              0
FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION    270,212,912        434,006
FP_COMP_OPS_EXE:SSE_FP                  271,052,506    269,394,462
FP_COMP_OPS_EXE:SSE_FP_PACKED                     0              0
FP_COMP_OPS_EXE:SSE_FP_SCALAR           269,182,020    269,626,065
FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION              0    268,145,490
FP_COMP_OPS_EXE:SSE2_INTEGER                 12,524          6,670
FP_COMP_OPS_EXE:X87                             348             72
Event                                 32-bit double   32-bit float
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~~  ~~~~~~~~~~~~~
FP_COMP_OPS_EXE:MMX                               0              0
FP_COMP_OPS_EXE:SSE_DOUBLE_PRECISION              0              0
FP_COMP_OPS_EXE:SSE_FP                            0              0
FP_COMP_OPS_EXE:SSE_FP_PACKED                     0              0
FP_COMP_OPS_EXE:SSE_FP_SCALAR                     0              0
FP_COMP_OPS_EXE:SSE_SINGLE_PRECISION              0              0
FP_COMP_OPS_EXE:SSE2_INTEGER                 11,005          7,481
FP_COMP_OPS_EXE:X87                     269,146,335    270,182,265
As a side note, on Nehalem the following loop, compiled with -O3:
    movsd 2822190(%rip),%xmm0
    add $0x1,%eax
    movsd 2822187(%rip),%xmm1
    cmp %edi,%eax
    mulsd %xmm1,%xmm0
    addsd %xmm0,%xmm2
    jne 402472 do_flops+22
generates 2 flops per loop iteration, while the -O0 version:
    movsd a(%rip), %xmm1
    movsd b(%rip), %xmm0
    mulsd %xmm1, %xmm0
    movsd -16(%rbp), %xmm1
    addsd %xmm1, %xmm0
    movsd %xmm0, -16(%rbp)
    addl $1, -4(%rbp)
generates 3 flops per loop iteration.
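
For reference, here is a small Python sketch that tallies the instruction mnemonics in the two listings above. It only counts instruction names in the text; it is not a model of how FP_COMP_OPS_EXE attributes events, and it does not establish where the third counted flop in the -O0 loop comes from.

```python
from collections import Counter

# Listings copied from the text above.
O3_LOOP = """\
movsd 2822190(%rip),%xmm0
add $0x1,%eax
movsd 2822187(%rip),%xmm1
cmp %edi,%eax
mulsd %xmm1,%xmm0
addsd %xmm0,%xmm2
jne 402472 do_flops+22
"""

O0_LOOP = """\
movsd a(%rip), %xmm1
movsd b(%rip), %xmm0
mulsd %xmm1, %xmm0
movsd -16(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -16(%rbp)
addl $1, -4(%rbp)
"""


def tally(listing):
    """Count how often each mnemonic (first token on a line) appears."""
    return Counter(line.split()[0] for line in listing.splitlines() if line.strip())


for name, listing in (("-O3", O3_LOOP), ("-O0", O0_LOOP)):
    t = tally(listing)
    print(name, "mulsd:", t["mulsd"], "addsd:", t["addsd"], "movsd:", t["movsd"])
```

Both loops contain one mulsd and one addsd; they differ in the number of movsd instructions (two at -O3, four at -O0, including the load and store of the accumulator on the stack).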