qemusim --  a tool to generate SimPoint BBVs (basic block vectors)

       by Vince Weaver  vince _at_ csl.cornell.edu
       
              
Background
~~~~~~~~~~
  The SimPoint algorithm (see http://www.cse.ucsd.edu/~calder/simpoint/
  for more information) is a method of speeding up architectural 
  simulations by exploiting the fact that programs have phase behavior,
  enabling you to extrapolate total performance while only simulating small 
  portions of the program.
  
  The general idea depends on the property that most program show
  phase-based behavior.  This means that at various times during program
  execution you will encounter intervals of time where the code behaves
  very similarly to an interval that happened already in the past.
  If you can detect these intervals and group them together, you can
  approximate the total behavior of the program by only simulating the bare 
  minimum number of intervals, and then scaling these results up to predict 
  what the actual results would have been if you simulated the entire program.
  
  This is useful in computer architecture research, as running a 
  benchmark on a cycle-accurate simulator can cause a slowdown of
  1000x, making it take days, weeks, or even longer to run until
  completion.  If you utilize SimPoints, you can simulate a small fraction 
  of the program but still get a good estimate of what the behavior would
  be for the entire program.
  
  See Sherwood et al [1] for more info.
  
  In order to calculate the phases, you need to collect so-called 
  Basic Block Vectors (BBVs) from the benchmark of interest.  These are a 
  list of all Basic Blocks entered [a basic block is a section of code
  with only one entry point and one exit point], with a count of how 
  many times each block was run.  This count is also multiplied by
  the number of instructions that are in the basic block.  The
  multiplication is meant to weigh the count so that instructions in 
  small Basic Blocks aren't counted as more important than instructions 
  in large Basic Blocks.
  
  The Basic Block Vector count is dumped at certain fixed intervals.
  A common way to do things is to do this every 100 million committed
  instructions.  You can use smaller or larger intervals.  Having
  fewer (bigger) intervals makes analysis go faster, and it also avoids 
  problems with warm-up when simulating (since when using SimPoints
  you tend to fast-forward or otherwise skip to the interval of interest,
  you are starting from scratch metric wise.  If you are looking at things
  like caches or branch predictors, you are going to get wrong results
  at first because your structures are empty, rather than full of previous
  state they would have if you ran the whole way up until now.  Eventually
  the structures "warm up" and start producing the proper results.  Smaller
  intervals mean more of your simulation results will be skewed by the
  wrong values that are returned before the structures are ready).
  
  The actual on-disk format of a BBV files looks something like this:
  
T:45:1024 :189:99343
T:11:78573 :15:1353  :56:1
T:18:45 :12:135353 :56:78 314:4324263

   and so on.  Each new interval starts with a T.   Then there is a colon
   followed by a value.  This is a unique number identifying a basic block.
   This is followed by another colon, and then followed by a value that is the
   number of times the block was entered multiplied by the number of 
   instructions in the block.

   The Simpoint utility takes a BBV file as described above as an input.  
   You run simpoint like this:
   
      ./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \
           -loadFVFile results.valgrind.bb.gz \
	   -k 5 -saveSimpoints results.simpts \
	   -saveSimpointWeights results.weights
      
   The Simpoint program does random linear projection using 15-dimensions,
   then does k-mean clustering to calculate which intervals are the one
   of interest.
   
   In this case we specified we want 5 with the -k 5 option.
   
   So as output we will have the "results.simpts" file which will simply
   list the 5 intervals of interest (which when we multiply by 100 million 
   [or whatever our interval size we chose] will give us the offset in number 
   of instructions from the beginning of the program to start simulating).
   
   If we then simulate 100 million instructions from the start of
   each of those intervals, gather our statistics, then weight them
   according to the weights printed in the "results.weights" file,
   we then should have an approximation of how the entire program
   would have behaved, in a small fraction of the time.
  
  
Qemu Implementation
~~~~~~~~~~~~~~~~~~~~~~~

   This qemu patch generates the BBV output as described above.
   
   Qemu is not designed for architecture research, so some intrusive changes
   have to be made to the code to get things to work.
   
   First, the "TranslationBlock" structure is modified so that it 
     can hold a "unique_id" field.
     
   Second, the "tb_alloc()" function is modified so that when a
     TranslationBlock (i.e., a BasicBlock) is translated it is given a
     unique id.
   
   Third, target-i386/cpu.h is modified so that USE_CODE_COPY is not defined.
     This option enables straight code copying on x86/x86 translation
     instead of dynamic translation.  If we leave it enabled we basically
     do not instrument ALU instructions.  So this has to be disabled.
     
   Fourth, we add the actual simpoint generation code.
     + ./target-i386/translate.c is modified so that
       gen_intermediate_code_internal() (which is called at BBV translation)
       passes along the tb->unique_id value when
       disas_insn(dc, pc_ptr, tb->unique_id) decodes an instruction
     + ./target-i386/translate.c is also modified so that 
       disas_insn() takes the extra unique_id code as well
       as adds a call to  gen_op_dump_pc(s->pc,unique_id);
       before each instruction.
     + ./target-i386/op.c if modified to have a function
       op_dump_pc() which just calls  helper_dump_pc()
     + ./target-i386/helper.c is modified to have the function
       helper_dump_pc() which actually does the BBV generation.
       
       This helper function updates the BBV count and total count,
       and if the total count is more than the interval size writes
       out a line of BBV data and resets all the counters.
     
       Note - on x86 processors a rep-prefixed instruction only
         counts as *one* retired instruction.  We make sure
	 to handle this properly.
     
   Fifth, signal.c is patched to not pass along SIGSTP and SIGCONT signals.
     This is not normally needed, but the batch queue system I use to submit
     lots of jobs at once will suspend/resume jobs using this mechanism. 
     Without this patch the qemu jobs would kill themselves instead
     of suspending.
     
   Finally, qemu 0.9.1 has broken support for mmap().  When a large
     enough number of mmap() calls happens, eventually they will
     start failing.  This only tends to happen with the facerec
     spec2k benchmark, but does hit a lot of the spec2k6 benchmarks.
     This patchset contains a quick workaround to this problem, a proper
     fix will probably be developed at some later time.
     
     
Pitfalls
~~~~~~~~
   
  Stack size and gcc
  ~~~~~~~~~~~~~~~~~~
  
  The SPEC2000 gcc benchmarks would segfault part way through execution.
  It took a while to track this down, but the problem was the linux-user
  qemu binary was not allocating enough stack.  By default qemu-i386
  only allocates 512kB of stack space to emulate, and gcc requires
  at least 8MB.  To up this, use the "-s" option to the qemu-i386
  executable.
	   
  REPZ and REPNZ
  ~~~~~~~~~~~~~~
  
  On x86 the REPZ and REPNZ prefixes cause a string instruction to repeat
  many times.  On the x86 architecture, each instruction should only
  count as *one* instruction, not the number of repeats.
  Our code takes this into account.
     
Validation
~~~~~~~~~~
 
 The code has been tested on x86 and compared against real 
 hardware.  These is a forthcoming paper on this (the research is done).
  
 
Installing
~~~~~~~~~~

  * Download qemu version 0.9.1 from http://fabrice.bellard.free.fr/qemu/
    
  * Unpack qemu
  
  * cd into the qemu directory and do:
      patch -p1 < qemu-simpoint.patch
    
    It should show the files in question being patched.
  
  * Now run "./configure" and configure and build qemu as normal.
    
  * The simpoint modified qemu should now be ready in
      ./i386-linux-user/qemu-i386


Using
~~~~~

   You run the tool something like this:
   
   ./i386-linux-user/qemu-i386 -s 8048048 program
   
   where program is the program you want to run, and the -s option sets the
   stack size to use (gcc from SPEC2000 needs at least 8MB like above.  Other
   programs are probably fine with the default of 512kB).
   
   The interval_size is hardcoded to 100 million.
   The output file is hardcoded to qemusim.bbv in the same directory.

Performance
~~~~~~~~~~~
  Using qemusim slows down execution by an average of around
  27x over native on a Pentium D system.
    
  This is because we instrument every instruction, we could
    definitely speed this up by instrumenting only at the basic block level.

Accuracy
~~~~~~~~
  The total instruction count is considerably off of the values returned by
  pin, valgrind, and perfctr2 for a few benchmarks:
     art (it ignores forced 64-bit fpu mode, uses 80bit)
     fma3d, gcc, lucas, mcf, vortex (this might be fixed with version 0.2)

Thanks
~~~~~~
  To Shane Brennan who sent me his qemu/simpoint code although I ended up
     not using it.
  To Stuart Brady whose post on gathering PC values on the qemu list
     helped me figure out where to get started on this.

References
~~~~~~~~~~
[1]  T Sherwood, E Perelman, G. Hamerly, B. Calder.  Automatically
     Characterizing Large Scale Program Behavior.  ASPLOS X, 2002.

[2]  V.M. Weaver, S.A. McKee, "Using Dynamic Binary Instrumentation to
     Generate Multi-Platform Simpoints: Methodology and Accuracy", 3rd EC
     International Conference on High Performance Embedded Architectures and
     Compilers (HiPEAC'08), GÃ¶teborg, Sweden, January 2008.
