qemu_bbv -- a tool to generate SimPoint BBVs (basic block vectors) by Vince Weaver vince _at_ csl.cornell.edu Background ~~~~~~~~~~ The SimPoint algorithm (see http://www.cse.ucsd.edu/~calder/simpoint/ for more information) is a method of speeding up architectural simulations by exploiting the fact that programs have phase behavior, enabling you to extrapolate total performance while only simulating small portions of the program. The general idea depends on the property that most program show phase-based behavior. This means that at various times during program execution you will encounter intervals of time where the code behaves very similarly to an interval that happened already in the past. If you can detect these intervals and group them together, you can approximate the total behavior of the program by only simulating the bare minimum number of intervals, and then scaling these results up to predict what the actual results would have been if you simulated the entire program. This is useful in computer architecture research, as running a benchmark on a cycle-accurate simulator can cause a slowdown of 1000x, making it take days, weeks, or even longer to run until completion. If you utilize SimPoints, you can simulate a small fraction of the program but still get a good estimate of what the behavior would be for the entire program. See Sherwood et al [1] for more info. In order to calculate the phases, you need to collect so-called Basic Block Vectors (BBVs) from the benchmark of interest. These are a list of all Basic Blocks entered [a basic block is a section of code with only one entry point and one exit point], with a count of how many times each block was run. This count is also multiplied by the number of instructions that are in the basic block. The multiplication is meant to weigh the count so that instructions in small Basic Blocks aren't counted as more important than instructions in large Basic Blocks. The Basic Block Vector count is dumped at certain fixed intervals. A common way to do things is to do this every 100 million committed instructions. You can use smaller or larger intervals. Having fewer (bigger) intervals makes analysis go faster, and it also avoids problems with warm-up when simulating (since when using SimPoints you tend to fast-forward or otherwise skip to the interval of interest, you are starting from scratch metric wise. If you are looking at things like caches or branch predictors, you are going to get wrong results at first because your structures are empty, rather than full of previous state they would have if you ran the whole way up until now. Eventually the structures "warm up" and start producing the proper results. Smaller intervals mean more of your simulation results will be skewed by the wrong values that are returned before the structures are ready). The actual on-disk format of a BBV files looks something like this: T:45:1024 :189:99343 T:11:78573 :15:1353 :56:1 T:18:45 :12:135353 :56:78 314:4324263 and so on. Each new interval starts with a T. Then there is a colon followed by a value. This is a unique number identifying a basic block. This is followed by another colon, and then followed by a value that is the number of times the block was entered multiplied by the number of instructions in the block. The Simpoint utility takes a BBV file as described above as an input. You run simpoint like this: ./SimPoint.3.2/bin/simpoint -inputVectorsGzipped \ -loadFVFile results.qemu.bb.gz \ -k 5 -saveSimpoints results.simpts \ -saveSimpointWeights results.weights The Simpoint program does random linear projection using 15-dimensions, then does k-mean clustering to calculate which intervals are the one of interest. In this case we specified we want 5 with the -k 5 option. So as output we will have the "results.simpts" file which will simply list the 5 intervals of interest (which when we multiply by 100 million [or whatever our interval size we chose] will give us the offset in number of instructions from the beginning of the program to start simulating). If we then simulate 100 million instructions from the start of each of those intervals, gather our statistics, then weight them according to the weights printed in the "results.weights" file, we then should have an approximation of how the entire program would have behaved, in a small fraction of the time. Qemu Implementation ~~~~~~~~~~~~~~~~~~~~~~~ This qemu patch generates the BBV output as described above. Qemu is not designed for architecture research, so some intrusive changes have to be made to the code to get things to work. First, the "TranslationBlock" structure is modified so that it can hold a "unique_id" field. Second, the "tb_alloc()" function is modified so that when a TranslationBlock (i.e., a BasicBlock) is translated it is given a unique id. Third, we add code at the instruction decode site for the various architectures. This code inserts into the new (translated) basic block as call to our "dump_pc()" routine after each original instruction. This dump_pc() function updates the BBV count and total count, and if the total count is more than the interval size writes out a line of BBV data and resets all the counters. Note - on x86 processors a rep-prefixed instruction only counts as *one* retired instruction. We make sure to handle this properly. The patch might also contain some minor fixes the architectural simulation that have not made it upstream into the main Qemu repository yet. Pitfalls ~~~~~~~~ Stack size and gcc ~~~~~~~~~~~~~~~~~~ The SPEC2000 gcc benchmarks may segfault part way through execution. By default the linux-user qemu binary does not allocating enough stack. By default qemu-i386 only allocates 512kB of stack, and gcc requires at least 8MB. To up this, use the "-s" option to the qemu-i386 executable. REPZ and REPNZ ~~~~~~~~~~~~~~ On x86 the REPZ and REPNZ prefixes cause a string instruction to repeat many times. On the x86 architecture, each instruction should only count as *one* instruction, not the number of repeats. Our code takes this into account. Validation ~~~~~~~~~~ The code has been tested on x86 and compared against real hardware. The results are described in our HiPEAC 2008 paper[2]. Further validation is being done on the SPARC, x86_64, and MIPS results. Installing ~~~~~~~~~~ * Download git tree from 13 January 2010 of qemu using git git clone git://git.savannah.nongnu.org/qemu.git (note, my git skills are still weak so you're on your own getting the right tree). * enter the git directory * run ./configure and then run make to build all of qemu * now patch with the included qemu_bbv patchset. In the svn directory do patch -p1 < ./qemu_bbv_patch.20100113 * Optionally patch using the alpha and mips patches included if you care about those architectures. * now enter your linux-user emulator of interest and run "make" For example, to re-build the i386-linux-user one cd i386-linux-user make qemu-i386 And it should build the patched version for you. Using ~~~~~ You run the tool something like this: ./i386-linux-user/qemu-i386 -s 8048048 program where program is the program you want to run, and the -s option sets the stack size to use (gcc from SPEC2000 needs at least 8MB like above. Other programs are probably fine with the default of 512kB). The interval_size is hardcoded to 100 million. The output file is hardcoded to qemusim.bbv in the same directory. Limitations ~~~~~~~~~~~ + Currently only user-space syscall-by-proxy mode is supported. Performance ~~~~~~~~~~~ Using qemusim slows down execution by an average of around 27x over native on a Pentium D system. This is because we instrument every instruction, we could definitely speed this up by instrumenting only at the basic block level. Accuracy ~~~~~~~~ The total instruction count is considerably off of the values returned by pin, valgrind, and perfctr2 for a few benchmarks: art (it ignores forced 64-bit fpu mode, uses 80bit) Special care needs to be used to get the SPEC benchmarks to run deterministically. Most important is having your UNIX environment variables consistent across runs, and also turning off heap/stack randomization on more recent versions of Linux. See our IISWC'08 paper for more details on this. Thanks ~~~~~~ To Shane Brennan who sent me his preliminary qemu/bbv code although I ended up not using it. To Stuart Brady whose post on gathering PC values on the qemu list helped me figure out where to get started on this. References ~~~~~~~~~~ [1] T Sherwood, E Perelman, G. Hamerly, B. Calder. Automatically Characterizing Large Scale Program Behavior. ASPLOS X, 2002. [2] V.M. Weaver, S.A. McKee, "Using Dynamic Binary Instrumentation to Generate Multi-Platform Simpoints: Methodology and Accuracy", 3rd EC International Conference on High Performance Embedded Architectures and Compilers (HiPEAC'08), Göteborg, Sweden, January 2008.