exp-bbv  --  a tool to generate Simpoint BBVs (basic block vectors)

       by Vince Weaver  vince _at_ csl.cornell.edu
       
              
Background
~~~~~~~~~~
  The Simpoint algorithm (see http://www.cse.ucsd.edu/~calder/simpoint/
  for more information) is a method of speeding up architectural 
  simulations by exploiting the fact that programs have phase behavior,
  enabling you to extrapolate total performance while only simulating small 
  portions of the program.
  
  The general idea depends on the property that most program show
  phase-based behavior.  This means that at various times during program
  execution you will encounter intervals of time where the code behaves
  very similarly to an interval that happened already in the past.
  If you can detect these intervals and group them together, you can
  approximate the total behavior of the program by only simulating the bare 
  minimum number of intervals, and then scaling these results up to predict 
  what the actual results would have been if you simulated the entire program.
  
  This is useful in computer architecture research, as running a 
  benchmark on a cycle-accurate simulator can cause a slowdown of
  1000x, making it take days, weeks, or even longer to run until
  completion.  If you utilize Simpoints, you can simulate a small fraction 
  of the program but still get a good estimate of what the behavior would
  be for the entire program.
  
  See Sherwood et al [1] for more info.
  
  In order to calculate the phases, you need to collect so-called 
  Basic Block Vectors (BBVs) from the benchmark of interest.  These are a 
  list of all Basic Blocks entered [a basic block is a section of code
  with only one entry point and one exit point], with a count of how 
  many times each block was run.  This count is also multiplied by
  the number of instructions that are in the basic block.  The
  multiplication is meant to weigh the count so that instructions in 
  small Basic Blocks aren't counted as more important than instructions 
  in large Basic Blocks.
  
  The Basic Block Vector count is dumped at certain fixed intervals.
  A common way to do things is to do this every 100 million committed
  instructions.  You can use smaller or larger intervals.  Having
  fewer (bigger) intervals makes analysis go faster, and it also avoids 
  problems with warm-up when simulating (since when using simpoints
  you tend to fast-forward or otherwise skip to the interval of interest,
  you are starting from scratch metric wise.  If you are looking at things
  like caches or branch predictors, you are going to get wrong results
  at first because your structures are empty, rather than full of previous
  state they would have if you ran the whole way up until now.  Eventually
  the structures "warm up" and start producing the proper results.  Smaller
  intervals mean more of your simulation results will be skewed by the
  wrong values that are returned before the structures are ready).
  
  The actual on-disk format of a BBV files looks something like this:
  
T:45:1024 :189:99343
T:11:78573 :15:1353  :56:1
T:18:45 :12:135353 :56:78 314:4324263

   and so on.  Each new interval starts with a T.   Then there is a colon
   followed by a value.  This is a unique number identifying a basic block.
   This is followed by another colon, and then followed by a value that is the
   number of times the block was entered multiplied by the number of 
   instructions in the block.

   The Simpoint utility takes a BBV file as described above as an input.  
   You run simpoint like this:
   
      ./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
           -loadFVFile results.valgrind.bb.gz \
	   -k 5 -saveSimpoints results.simpts \
	   -saveSimpointWeights results.weights
      
   The Simpoint program does random linear projection using 15-dimensions,
   then does k-mean clustering to calculate which intervals are the one
   of interest.
   
   In this case we specified we want 5 with the -k 5 option.
   
   So as output we will have the "results.simpts" file which will simply
   list the 5 intervals of interest (which when we multiply by 100 million 
   [or whatever our interval size we chose] will give us the offset in number 
   of instructions from the beginning of the program to start simulating).
   
   If we then simulate 100 million instructions from the start of
   each of those intervals, gather our statistics, then weight them
   according to the weights printed in the "results.weights" file,
   we then should have an approximation of how the entire program
   would have behaved, in a small fraction of the time.
  
  
Valgrind Implementation
~~~~~~~~~~~~~~~~~~~~~~~

   This valgrind plugin generates the BBV output as described above.
   
   Luckily valgrind makes it easy to instrument code.  Ideally we would
   instrument at the basic block level, but this is complicated
   by the fact that valgrind uses super-blocks, which only have one
   entry point but can have multiple exit points.
   
   For this implementation we actually instrument at the super-block level.
   Having super-blocks can be seen as a benefit; it might give a better
   indication of phase to the clustering algorithms than plain basic blocks
   would.  Super-block generated BBVs give similar results when compared
   to real basic-block generated BBVs made by the PIN tool.  It is possible
   to use actual basic blocks by using the 
   "--vex-guest-chase-thresh=0" option to valgrind.
   
   Once started, valgrind begins translating the program being run
   in an as-needed fashion.  The binary code is translated into an
   intermediate language, this is then instrumented, and then the
   resulting code is compiled into a super-block and cached.  This
   final code is the code that is executed.

   At initial translation time we instrument every instruction,
   and have it call our BBV routine.  This is a lot of overhead,
   but it is simpler to implement than instrumenting all of the
   basic blocks inside of a super-block.  On x86 we also note
   if the instruction is a "rep" or "fldcw" instruction for 
   special-case handling.  (See "The REP Prefix" in the next section for why).
   
   While executing, our routine is called before where each individual
   instruction was in the original code.  We look up our current superblock
   in an Ordered Set to find a structure that holds block-specific
   statistics (we use the entry point address as the index into
   the hash table).  We then incrememnt by one the instruction count for this 
   superblock (assuming it's not a rep instruction).
    
   We also update the master instruction counter by one at the same time.
   If this overflows the interval size (by default 100 million, but this
   is configurable with a command line option) then we run the BBV generation
   code.  This routine prints the current BBV line to the output file, and 
   resets all the superblock counters to zero.
   
   
Pitfalls
~~~~~~~~
   
  The REP prefix
  ~~~~~~~~~~~~~~
  
   This plugin was validated against actual x86/x86_64 machines
   using performance counters.  With valgrind we were getting a much
   larger retired instruction count than the perf counters or the PIN 
   implementation tool were reporting.
   
   It turns out that the "rep" prefix (in conjuction with a 
   string instruction) may repeat an instruction many times, yet
   the processor only counts this as one committed instruction.
   This is documented in various Intel manuals.  Without special
   handling, valgrind counts each repetition.
      
   Our plugin takes this into account;  if we are in a rep 
   instruction we only update the various instruction counts
   every 4096 repeats, not every repeat.
   
 The FLDCW instruction
 ~~~~~~~~~~~~~~~~~~~~~
 
  The "fldcw" instruction on x86/x86_64 is counted as one committed
  instruction on most architectures, but on the Pentium 4 "NetBurst"
  architecture it counts as two.
  
  Valgrind only counts it as one, which is fine.  We record the number
  of fldcw instructions so that we can make comparisons with 
  Pentium 4 machines.  In the future we might enable counting
  per-interval these instructions.
  
 The art benchmark
 ~~~~~~~~~~~~~~~~~
 
   The spec2k benchmark suite includes "art" which is a C floating point
   benchmark.  When run under valgrind, the benchmark finished with half
   of the number of instructions committed than when run on actual hardware.
   
   It turns out that valgrind uses 64-bit floating point math for
   portability reasons, but by default on x86-Linux programs use
   80-bit floating point math.  The art benchmark unwisely uses
   the "==" C operator to compare two floating point numbers, and due
   to rounding errors between the 80-bit and 64-bit versions 
   the 64-bit version can finish early, while still generating
   the proper expected output from art.
   
   This makes it hard to compare simpoint results, because you will
   get different simpoints if you have only half the number of instructions.
   
   For comparison purposes, the following code was added to the beginning
   of the art benchmark to force it to use 64-bit math.  (There are other
   methods for working around this, including using the Intel compiler 
   which has options to force 64-bit math, or using the -msse2 option of gcc
   which uses 64-bit sse2 math.  Neither of these would work for the 
   Pentium III we were gathering the performance counter data on).
   
           #include <fpu_control.h>
   
           fpu_control_t cw;
	   _FPU_GETCW(cw);
	   cw &= ~_FPU_EXTENDED;
	   cw |= _FPU_DOUBLE;
	   _FPU_SETCW(cw);


 The dealII benchmark
 ~~~~~~~~~~~~~~~~~

  The SPEC 2006 benchmark "dealII" has a similar problem to the one
  described previously for art.  In this case, due to the 
  64-bit floating point being used, a comparison against an
  error epsilon never becomes true, and the benchmark gets
  stuck in an infinite loop.
  
  You can fix things to work by making:
     const long double tolerance= 5e-16;
  In the function QGauss<1>::QGauss (const unsigned int n)
    in quadrature_lib.cc



Validation
~~~~~~~~~~
 
 The code has been tested on x86 and compared against real 
 hardware.  This is described in a paper [2].
 
 The code has been compiled and run on PowerPC but has not been
 in any way validated.
 
 
Installing with Valgrind 3.4.1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  * Download valgrind from www.valgrind.org
    This tools assumes version 3.4.1
    
  * Unpack valgrind
  
  * Enter into the valgrind-3.4.1 directory and unpack this
    plugin (tar -xzvf exp-bbv.tar.gz)
    
  * Go back to the main valgrind-3.4.1 directory.
  
  * Edit the file "configure".  Search for the string none/docs/Makefile
    and immediately after it put: exp-bbv/Makefile
    
  * Now run "./configure" as you would if you were normally installing
    valgrind.
    
  * Run make, then make install of valgrind.
  
  * Now enter into the ./exp-bbv directory, and run make and make install.
  
  * The simpoint plugin should now be ready...

Installing, using Valgrind SVN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

+ get SVN version of valgrind
     svn co svn://svn.valgrind.org/valgrind/trunk valgrind

+ unpack the exp-bbv tree in the ./valgrind tree

+ edit the global version of Makefile.am and add
  exp-bbv to the EXP_TOOLS list
  
+ Add
      exp-bbv/Makefile
      exp-bbv/docs/Makefile
      exp-bbv/tests/Makefile
      exp-bbv/x86/Makefile
  to the end of the AC_OUTPUT list in configure.in
  
+ run ./autogen.sh

+ run configure as described in the "Installing with Valgrind 3.4.1" section
  


Using
~~~~~

   You run the tool something like this:
   
   valgrind --tool=exp-bbv --vex-guest-chase-thresh=0  \
            --interval_size=100000000 --bbfile=out.bbv /bin/ls
   
   The --vex-guest-chase-thresh=0 option forces valgrind to use basic blocks
      (as opposed to super blocks).
   The "interval_size" parameter is optional, it defaults to 100 million.
   The "bbfile" option specifies the output file name.  It defaults to out.bb
   /bin/ls is just a placeholder for any executable you'd like to run.

Performance
~~~~~~~~~~~
  Using valsim slows down execution by roughly a factor of 50.
  It depends on the machine being run on, and the benchmark.
  
  On the spec2000 benchmarks running on a 3.4GHz Pentium D processor,
    the slowdown ranges from 24x (mcf) to 340x (vortex.2).

References
~~~~~~~~~~
[1]  T Sherwood, E Perelman, G. Hamerly, B. Calder.  Automatically
     Characterizing Large Scale Program Behavior.  ASPLOS X, 2002.
     
[2]  V.M. Weaver, S.A. McKee, "Using Dynamic Binary Instrumentation to
     Generate Multi-Platform Simpoints: Methodology and Accuracy", 3rd EC
     International Conference on High Performance Embedded Architectures and
     Compilers (HiPEAC'08), Göteborg, Sweden, January 2008.
 
