***** CHAPTER 4 *****

ILP: Instruction Level Parallelism

Pipeline CPI = Ideal pipeline CPI +
	 Structural Stalls +
	 RAW Stalls +
	 WAR Stalls +
	 WAW stalls +
	 Control Stalls

Technique 			Type of stall reduced
Loop unrolling			Control
Basic pipeline scheduling 	RAW
Scoreboarding			RAW
Register Renaming		RAW
Branch Prediction		Control 
Multiple issue			Ideal CPI
Compiler dependence analysis	Ideal CPI & data
Software Pipelining		Ideal CPI & data
Speculation			data & control
memory disambiguation		RAW to mem

ILP:
  Hide the stalls of the pipeline by instruction scheduling.
  Stalls from:  FP ALU, double loads/stores, etc.

  Simplest form is through loop unrolling:
  for (i=0; i<=1000; i++)
	x[i] = x[i] + s;

Dependencies:
  Data dependencies which produce  RAW hazard
    ADD r1, r0, r3
    LD r5, 0(r1)

  Name dependence:  WAW, WAR hazards 
    Antidependence:  WAR hazard with r1
      ADD r5, r1, r3
      SUB r1, r6, r7
    Output dependence:  WAW hazard with r1
      ADD r1, r2, r3
      SUB r1, r4, r5
  
  Control dependence:
    S0;  
    if (p1)
      S1;
    S2;  
    
    Maintaining exception ordering
    Maintaining dataflow ordering

Dynamic Scheduling:
  Example:
  DIVD F0, F2, F4
  ADDD F10, F0, F8
  SUBD F12, F8, F14  

  Can execute SUBD while DIVD,ADDD.
  Building hardware to do out-of-order execution and completion

  Out-of-order execution:
   Performing instr. out of order but garrenting in-order correctness
  Out-of-order completion:
   Allowing instr. to complete out of order.
   Challenging to handle exception right.

Scoreboarding:
  (Note: scoreboarding does not imply register renaming)
  Allow inst. to exec out of order if there are resources avail.

  Four steps:
    Issue:  Instruction is issued only if no hazard on dest register.
            Otherwise, instruction fifo is stalled.  (WAW)

    Read Operands:  Source register is available is no instruction
                    upstream is going to write to it. (RAW)

    Execution:      Functional unit does it thang.
  
    Write result:   Checks whether other upstream instructions
                    need dest rest value before its changed (WAR)
    
  Costs:
    Lots of wires.  Need a big buss connecting all of the functional
    units.

  Limitations:
    Amount of parallelism available in instructions.
    Number of scoreboard entries.
    Number of functional units.
    Antidependences and output dependences (fixed with reg renaming)

Tomasulo Approach

  Use register renaming to eleminate WAR and WAW dependences.
  During issue rename destinations to free names.
  With more agressive ILP, Tomasulo needed to maintain deep pipes.

Branch Prediction

  Branch prediction buffer (aka branch history table)
    Small buffer indexed by LS Bits of PC indicates whether branch was
taken in past.

  Need more than one bit of branch predition:
    Basic loop will miss on first and last iterations

  Two bit prediction:  Needs two misses predictions before changing
    Basic loop will miss only on last.

  Rarely need more than 2 bits.  Size of table doesn't need to be that
large.  4K entries seems to work fine.

  Correlating predictors (aka two-level prediction):
    Use result from last branch to determine which predictor to use
for this one.  Helps with nested if's.  Use shift register to keep
track of history.

Branch Target buffer:
  Keep track of PC which branch will jump to.  Predicting future PC
as to not stall instruction fetch.  Basic table of recent branch PC's
and resulting target PC if branch is taken.

  Optimization:  Store target instruction instead of target PC.

Multiple Issue:
  Two flavors: Superscalar and VLIW

  Superscalar:
    Dynamic or static scheduling of instructions which can be
performed in parallel.  Simplest would be to allow FP and Integer
pipes to operate in parallel.  Carefulness about WAW and WAR through
memory.
    Chalenges:  Lots of hardware, lots of potential hazards.

  VLIW:
    Very long instruction word.  One instruction might do two integer
operations, two floating point, two mem ops, and a branch.  For
achitectures that need to rely on smart compilers.
    Chalenges:  Large instruction bandwidth, code compatibility,
compiler technology.

Compiler Support fo ILP:

  Finding and fixing dependencies:
    Eliminating loop-carried dependencies.
    Loop unrolling.
    Elminiating name dependencies.

  Software pipelining:
    Interleave instructions from different iterations to eleminate 
dependencies.


  Trace scheduling:
    Compiler finds trace through code and reorders computation to
result in the least number of instructions.

Hardware Support for more ILP:

  Conditional instructions.  Eleminates branch.
     a = b?c:d;

  Poison bits:
    Result registers have bit attached to specular execution
indicating that they should cause an execption if taken.
    LW* R14, 0(R2)  // Speculative, doesn't cause exeception 
unless R14 is used.

  Boosting:
    Instructions are boosted if executed speculatively.  Boosted
instructions are commited only when branch is confirmed.

Hardware Based Speculation:

  Combines 3 ideas:  Dynamic branch prediction, speculation to
keep working past branches, and dynamic scheduling.

  Reorder Buffer:
    Provide a reorder buffer to maintain in-order instruction commit.
    Hold temporary values until stored back into registers.
    Instruction are only commited once they reach top or reorder
buffer.
    If branches are mis-predicted, reorder buffer is flushed.  
    Execeptions only taken at top of reoder buffer.


(SKIPPED PERFORMANCE EVALUTATION)