qemu_bbv --  a tool to generate SimPoint BBVs (basic block vectors)

       by Vince Weaver  vince _at_ csl.cornell.edu
       
              
Background
~~~~~~~~~~
  The SimPoint algorithm (see http://www.cse.ucsd.edu/~calder/simpoint/
  for more information) is a method of speeding up architectural 
  simulations by exploiting the fact that programs have phase behavior,
  enabling you to extrapolate total performance while only simulating small 
  portions of the program.
  
  The general idea depends on the property that most program show
  phase-based behavior.  This means that at various times during program
  execution you will encounter intervals of time where the code behaves
  very similarly to an interval that happened already in the past.
  If you can detect these intervals and group them together, you can
  approximate the total behavior of the program by only simulating the bare 
  minimum number of intervals, and then scaling these results up to predict 
  what the actual results would have been if you simulated the entire program.
  
  This is useful in computer architecture research, as running a 
  benchmark on a cycle-accurate simulator can cause a slowdown of
  1000x, making it take days, weeks, or even longer to run until
  completion.  If you utilize SimPoints, you can simulate a small fraction 
  of the program but still get a good estimate of what the behavior would
  be for the entire program.
  
  See Sherwood et al [1] for more info.
  
  In order to calculate the phases, you need to collect so-called 
  Basic Block Vectors (BBVs) from the benchmark of interest.  These are a 
  list of all Basic Blocks entered [a basic block is a section of code
  with only one entry point and one exit point], with a count of how 
  many times each block was run.  This count is also multiplied by
  the number of instructions that are in the basic block.  The
  multiplication is meant to weigh the count so that instructions in 
  small Basic Blocks aren't counted as more important than instructions 
  in large Basic Blocks.
  
  The Basic Block Vector count is dumped at certain fixed intervals.
  A common way to do things is to do this every 100 million committed
  instructions.  You can use smaller or larger intervals.  Having
  fewer (bigger) intervals makes analysis go faster, and it also avoids 
  problems with warm-up when simulating (since when using SimPoints
  you tend to fast-forward or otherwise skip to the interval of interest,
  you are starting from scratch metric wise.  If you are looking at things
  like caches or branch predictors, you are going to get wrong results
  at first because your structures are empty, rather than full of previous
  state they would have if you ran the whole way up until now.  Eventually
  the structures "warm up" and start producing the proper results.  Smaller
  intervals mean more of your simulation results will be skewed by the
  wrong values that are returned before the structures are ready).
  
  The actual on-disk format of a BBV files looks something like this:
  
T:45:1024 :189:99343
T:11:78573 :15:1353  :56:1
T:18:45 :12:135353 :56:78 314:4324263

   and so on.  Each new interval starts with a T.   Then there is a colon
   followed by a value.  This is a unique number identifying a basic block.
   This is followed by another colon, and then followed by a value that is the
   number of times the block was entered multiplied by the number of 
   instructions in the block.

   The Simpoint utility takes a BBV file as described above as an input.  
   You run simpoint like this:
   
      ./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
           -loadFVFile results.qemu.bb.gz \
	   -k 5 -saveSimpoints results.simpts \
	   -saveSimpointWeights results.weights
      
   The Simpoint program does random linear projection using 15-dimensions,
   then does k-mean clustering to calculate which intervals are the one
   of interest.
   
   In this case we specified we want 5 with the -k 5 option.
   
   So as output we will have the "results.simpts" file which will simply
   list the 5 intervals of interest (which when we multiply by 100 million 
   [or whatever our interval size we chose] will give us the offset in number 
   of instructions from the beginning of the program to start simulating).
   
   If we then simulate 100 million instructions from the start of
   each of those intervals, gather our statistics, then weight them
   according to the weights printed in the "results.weights" file,
   we then should have an approximation of how the entire program
   would have behaved, in a small fraction of the time.
  
  
Qemu Implementation
~~~~~~~~~~~~~~~~~~~~~~~

   This qemu patch generates the BBV output as described above.
   
   Qemu is not designed for architecture research, so some intrusive changes
   have to be made to the code to get things to work.
   
   First, the "TranslationBlock" structure is modified so that it 
     can hold a "unique_id" field.
     
   Second, the "tb_alloc()" function is modified so that when a
     TranslationBlock (i.e., a BasicBlock) is translated it is given a
     unique id.
   
   Third, we add code at the instruction decode site for the various
     architectures.  This code inserts into the new (translated)
     basic block as call to our "dump_pc()" routine after each
     original instruction.
            
     This dump_pc() function updates the BBV count and total count,
       and if the total count is more than the interval size writes
       out a line of BBV data and resets all the counters.
     
       Note - on x86 processors a rep-prefixed instruction only
         counts as *one* retired instruction.  We make sure
	 to handle this properly.
     
     
   The patch might also contain some minor fixes the architectural
      simulation that have not made it upstream into the main Qemu
      repository yet.
     
     
Pitfalls
~~~~~~~~
   
  Stack size and gcc
  ~~~~~~~~~~~~~~~~~~
  
  The SPEC2000 gcc benchmarks may segfault part way through execution.
  By default the linux-user qemu binary does not allocating enough stack.  
  By default qemu-i386 only allocates 512kB of stack, and gcc requires
  at least 8MB.  To up this, use the "-s" option to the qemu-i386
  executable.
	   
  REPZ and REPNZ
  ~~~~~~~~~~~~~~
  
  On x86 the REPZ and REPNZ prefixes cause a string instruction to repeat
  many times.  On the x86 architecture, each instruction should only
  count as *one* instruction, not the number of repeats.
  Our code takes this into account.
     
Validation
~~~~~~~~~~
 
 The code has been tested on x86 and compared against real 
 hardware.  The results are described in our HiPEAC 2008 paper[2].
 
 Further validation is being done on the SPARC, x86_64, and MIPS
 results.
  
 
Installing
~~~~~~~~~~

  * Download git tree from 13 January 2010
    of qemu using git
    
    git clone git://git.savannah.nongnu.org/qemu.git

    (note, my git skills are still weak so you're on your own
     getting the right tree).
      
  * enter the git directory
  
  * run ./configure and then run make to build all of qemu

  * now patch with the included qemu_bbv patchset.  In the svn directory do
      patch -p1 < ./qemu_bbv_patch.20100113
      
  * Optionally patch using the alpha and mips patches included if you
    care about those architectures.
      
  * now enter your linux-user emulator of interest and run "make"
    For example, to re-build the i386-linux-user one
    
      cd i386-linux-user
      make qemu-i386
      
    And it should build the patched version for you.


Using
~~~~~

   You run the tool something like this:
   
   ./i386-linux-user/qemu-i386 -s 8048048 program
   
   where program is the program you want to run, and the -s option sets the
   stack size to use (gcc from SPEC2000 needs at least 8MB like above.  Other
   programs are probably fine with the default of 512kB).
   
   The interval_size is hardcoded to 100 million.
   The output file is hardcoded to qemusim.bbv in the same directory.


Limitations
~~~~~~~~~~~
 + Currently only user-space syscall-by-proxy mode is supported.


Performance
~~~~~~~~~~~
  Using qemusim slows down execution by an average of around
  27x over native on a Pentium D system.
    
  This is because we instrument every instruction, we could
    definitely speed this up by instrumenting only at the basic block level.

Accuracy
~~~~~~~~
  The total instruction count is considerably off of the values returned by
  pin, valgrind, and perfctr2 for a few benchmarks:
     art (it ignores forced 64-bit fpu mode, uses 80bit)
     
  Special care needs to be used to get the SPEC benchmarks to run 
  deterministically.  Most important is having your UNIX environment
  variables consistent across runs, and also turning off heap/stack
  randomization on more recent versions of Linux.  See our
  IISWC'08 paper for more details on this.

Thanks
~~~~~~
  To Shane Brennan who sent me his preliminary qemu/bbv code although I 
     ended up not using it.
  To Stuart Brady whose post on gathering PC values on the qemu list
     helped me figure out where to get started on this.

References
~~~~~~~~~~
[1]  T Sherwood, E Perelman, G. Hamerly, B. Calder.  Automatically
     Characterizing Large Scale Program Behavior.  ASPLOS X, 2002.

[2]  V.M. Weaver, S.A. McKee, "Using Dynamic Binary Instrumentation to
     Generate Multi-Platform Simpoints: Methodology and Accuracy", 3rd EC
     International Conference on High Performance Embedded Architectures and
     Compilers (HiPEAC'08), GÃ¶teborg, Sweden, January 2008.