***** CHAPTER 4 ***** ILP: Instruction Level Parallelism Pipeline CPI = Ideal pipeline CPI + Structural Stalls + RAW Stalls + WAR Stalls + WAW stalls + Control Stalls Technique Type of stall reduced Loop unrolling Control Basic pipeline scheduling RAW Scoreboarding RAW Register Renaming RAW Branch Prediction Control Multiple issue Ideal CPI Compiler dependence analysis Ideal CPI & data Software Pipelining Ideal CPI & data Speculation data & control memory disambiguation RAW to mem ILP: Hide the stalls of the pipeline by instruction scheduling. Stalls from: FP ALU, double loads/stores, etc. Simplest form is through loop unrolling: for (i=0; i<=1000; i++) x[i] = x[i] + s; Dependencies: Data dependencies which produce RAW hazard ADD r1, r0, r3 LD r5, 0(r1) Name dependence: WAW, WAR hazards Antidependence: WAR hazard with r1 ADD r5, r1, r3 SUB r1, r6, r7 Output dependence: WAW hazard with r1 ADD r1, r2, r3 SUB r1, r4, r5 Control dependence: S0; if (p1) S1; S2; Maintaining exception ordering Maintaining dataflow ordering Dynamic Scheduling: Example: DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14 Can execute SUBD while DIVD,ADDD. Building hardware to do out-of-order execution and completion Out-of-order execution: Performing instr. out of order but garrenting in-order correctness Out-of-order completion: Allowing instr. to complete out of order. Challenging to handle exception right. Scoreboarding: (Note: scoreboarding does not imply register renaming) Allow inst. to exec out of order if there are resources avail. Four steps: Issue: Instruction is issued only if no hazard on dest register. Otherwise, instruction fifo is stalled. (WAW) Read Operands: Source register is available is no instruction upstream is going to write to it. (RAW) Execution: Functional unit does it thang. Write result: Checks whether other upstream instructions need dest rest value before its changed (WAR) Costs: Lots of wires. Need a big buss connecting all of the functional units. Limitations: Amount of parallelism available in instructions. Number of scoreboard entries. Number of functional units. Antidependences and output dependences (fixed with reg renaming) Tomasulo Approach Use register renaming to eleminate WAR and WAW dependences. During issue rename destinations to free names. With more agressive ILP, Tomasulo needed to maintain deep pipes. Branch Prediction Branch prediction buffer (aka branch history table) Small buffer indexed by LS Bits of PC indicates whether branch was taken in past. Need more than one bit of branch predition: Basic loop will miss on first and last iterations Two bit prediction: Needs two misses predictions before changing Basic loop will miss only on last. Rarely need more than 2 bits. Size of table doesn't need to be that large. 4K entries seems to work fine. Correlating predictors (aka two-level prediction): Use result from last branch to determine which predictor to use for this one. Helps with nested if's. Use shift register to keep track of history. Branch Target buffer: Keep track of PC which branch will jump to. Predicting future PC as to not stall instruction fetch. Basic table of recent branch PC's and resulting target PC if branch is taken. Optimization: Store target instruction instead of target PC. Multiple Issue: Two flavors: Superscalar and VLIW Superscalar: Dynamic or static scheduling of instructions which can be performed in parallel. Simplest would be to allow FP and Integer pipes to operate in parallel. Carefulness about WAW and WAR through memory. Chalenges: Lots of hardware, lots of potential hazards. VLIW: Very long instruction word. One instruction might do two integer operations, two floating point, two mem ops, and a branch. For achitectures that need to rely on smart compilers. Chalenges: Large instruction bandwidth, code compatibility, compiler technology. Compiler Support fo ILP: Finding and fixing dependencies: Eliminating loop-carried dependencies. Loop unrolling. Elminiating name dependencies. Software pipelining: Interleave instructions from different iterations to eleminate dependencies. Trace scheduling: Compiler finds trace through code and reorders computation to result in the least number of instructions. Hardware Support for more ILP: Conditional instructions. Eleminates branch. a = b?c:d; Poison bits: Result registers have bit attached to specular execution indicating that they should cause an execption if taken. LW* R14, 0(R2) // Speculative, doesn't cause exeception unless R14 is used. Boosting: Instructions are boosted if executed speculatively. Boosted instructions are commited only when branch is confirmed. Hardware Based Speculation: Combines 3 ideas: Dynamic branch prediction, speculation to keep working past branches, and dynamic scheduling. Reorder Buffer: Provide a reorder buffer to maintain in-order instruction commit. Hold temporary values until stored back into registers. Instruction are only commited once they reach top or reorder buffer. If branches are mis-predicted, reorder buffer is flushed. Execeptions only taken at top of reoder buffer. (SKIPPED PERFORMANCE EVALUTATION)