Chapter 4: Advanced Pipelining and Instruction-Level Parallelism
-----------------------------------------------------------------

- Pipeline CPI = Ideal CPI + Structural stalls + RAW stalls +
  WAR stalls + WAW stalls + Control stalls

    Technique                                   Reduces
    -----------------------------------------   ------------------------
    Loop unrolling                              Control stalls
    Basic pipeline scheduling                   RAW stalls
    Dynamic scheduling w/ scoreboarding         RAW stalls
    Dynamic scheduling w/ register renaming     WAR and WAW stalls
    Dynamic branch prediction                   Control stalls
    Issuing multiple instructions per cycle     Ideal CPI
    Compiler dependence analysis                Ideal CPI / data stalls
    Software pipelining and trace scheduling    Ideal CPI / data stalls
    Speculation                                 All data/control stalls
    Dynamic memory disambiguation               RAW stalls w/ memory

- to obtain substantial performance, we need to exploit parallelism
  across multiple basic blocks;
- unrolling improves the performance of loops substantially, although
  it increases the code size (duhh); a C sketch of unrolling appears
  after this group of notes on dependences;
- dependences:
  - data dependences;
  - name dependences (no value being transmitted between instructions):
    + antidependence (WAR hazard);
    + output dependence (WAW hazard);
  - control dependence;
- register renaming is used to remove name dependences (it couldn't be
  different);
- an instruction that is control dependent on a branch cannot be moved
  before the branch;
- an instruction that is not control dependent on a branch cannot be
  moved after the branch;
- this is clear when performing loop unrolling, as the branches from
  the various iterations do not allow the compiler to exploit ILP
  between iterations; solution: remove the branches! of course, we
  need to assume something about the number of iterations;
- the two properties critical to program correctness, and normally
  preserved by control dependence, are the exception behavior and the
  data flow;
- control dependence is preserved by implementing control hazard
  detection that causes control stalls; techniques to reduce control
  stalls:
  - delayed branches;
  - loop unrolling;
  - conditional instructions;
  - compiler- and hardware-based speculation;
* a loop is parallel if it can be written without a cycle in the
  dependences (S1 may depend on S2, but then S2 must not depend on S1);
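- a minimal C sketch of unrolling by 4 (my own example, assuming the
  trip count n is a multiple of 4; a real compiler would add cleanup
  code for the remainder): removing three of every four branches
  leaves a larger straight-line body in which operations from
  different iterations can be scheduled together;

      /* unroll by 4: assumes n % 4 == 0 */
      void add_scalar(double *x, double s, int n) {
          int i;
          for (i = 0; i < n; i += 4) {
              x[i]     = x[i]     + s;   /* the four statements are      */
              x[i + 1] = x[i + 1] + s;   /* independent, so loads, adds, */
              x[i + 2] = x[i + 2] + s;   /* and stores from different    */
              x[i + 3] = x[i + 3] + s;   /* iterations can overlap       */
          }
      }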
- dynamic scheduling offers several advantages: it enables handling
  some cases where dependences are not known at compile time (they may
  involve a memory reference);
- dynamic scheduling introduces out-of-order completion, which creates
  problems for exception handling;
- divide the ID stage in two stages:
  - issue: decode instruction, check for structural hazards;
  - read operands: wait until no data hazards, then read operands;
- WAR hazards, which did NOT exist in the DLX floating-point or
  integer pipelines, may arise when instructions execute out of order;
  WAW hazards must also be detected;

- Scoreboard
  ----------

* the scoreboard decides when the instruction can read its operands
  and begin execution. If the scoreboard decides the instruction
  cannot execute immediately, it monitors every change in the hardware
  and decides when the instruction CAN execute. It also controls when
  an instruction can write its result into the destination register;
  thus, all hazard detection and resolution is centralized in the
  scoreboard;
- each (FP) operation undergoes four steps:
  * ISSUE - if a functional unit for the instruction is free and no
    other active instruction has the same destination register, the
    scoreboard issues the instruction to the functional unit and
    updates its internal data structure;
  * READ OPERANDS - a source operand is available if no earlier issued
    active instruction is going to write it, i.e., if the register
    containing the operand is not going to be written by a currently
    active functional unit;
  * EXECUTION - the functional unit begins execution upon receiving
    its operands; when the result is ready, it notifies the scoreboard
    that it has completed execution;
  * WRITE RESULT - an instruction is blocked in this stage while a
    previous instruction has still to read the destination register
    (WAR hazard);
- because the operands for an instruction are read only when both
  operands are available in the register file, this scoreboard does
  not take advantage of forwarding; however, this is not that bad, as
  instructions write their results as soon as they complete execution,
  rather than waiting for a fixed pipeline stage (such as WB);
- three parts to the scoreboard: instruction status, functional unit
  status, and register result status (a struct sketch of these tables
  follows this section);
- main cost of the CDC scoreboard: the large number of buses - about
  four times as many as would be required if the processor only
  executed instructions in order;
- the recently increasing interest in dynamic scheduling is motivated
  by attempts to issue more instructions per cycle and by ideas like
  speculation, which naturally build on dynamic scheduling;
- a scoreboard is limited by several factors:
  - the amount of parallelism available among instructions;
  - the number of scoreboard entries;
  - the number and types of functional units;
  - the presence of antidependences and output dependences, which lead
    to WAR and WAW stalls;
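- a rough C sketch of the three scoreboard tables above; the field
  names follow the usual textbook presentation, but the exact layout
  and types here are my own guesses, not the CDC 6600 design:

      enum stage { ISSUED, READ_OPERANDS, EXEC_COMPLETE, WROTE_RESULT };

      struct instruction_status {      /* one row per in-flight instruction */
          enum stage stage;
      };

      struct functional_unit_status {  /* one row per functional unit       */
          int busy;                    /* unit in use?                      */
          int op;                      /* operation being performed         */
          int Fi, Fj, Fk;              /* destination and source registers  */
          int Qj, Qk;                  /* units producing Fj/Fk (0 = none)  */
          int Rj, Rk;                  /* are Fj/Fk ready to be read?       */
      };

      struct register_result_status {  /* one entry per register            */
          int unit;                    /* unit that will write it (0 = none) */
      };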
- The Tomasulo approach (register renaming)
  -----------------------------------------

- combines key elements of the scoreboarding scheme with register
  renaming, which is used to avoid WAR and WAW hazards;
* in Tomasulo's scheme, register renaming is provided by the
  reservation stations, which buffer the operands of instructions
  waiting to issue, and by the issue logic; the basic idea is that a
  reservation station fetches and buffers an operand as soon as it is
  available, eliminating the need to get the operand from a register;
  in addition, pending instructions designate the reservation station
  that will provide their input;
- the use of register renaming to avoid WAR and WAW hazards is the
  main difference between scoreboarding and Tomasulo's scheme;
- two other differences in organization, compared to the scoreboard:
  - hazard detection and execution control are distributed: the
    reservation stations are responsible for that;
  - results are passed directly to functional units from the
    reservation stations where they are buffered, rather than going
    through the registers; this is done by putting the results on the
    common data bus (CDB);
- in Tomasulo, there are only three steps:
  * ISSUE - get an instruction from the floating-point queue; if the
    operation is a floating-point operation, issue it if there is an
    empty reservation station, and send the operands to it if they are
    available in the registers; if the operation is a load or store,
    it can be issued if there is an available buffer;
  * EXECUTE - if any of the operands is not available, monitor the
    CDB; when all operands are available, execute the operation;
  * WRITE RESULT - when the result is available, write it on the CDB
    and from there into the registers and any reservation stations
    waiting for the result;
- when an instruction has issued and is waiting for a result, it
  refers to the operand by the reservation station number, rather than
  by the number of the destination register (see the sketch of a
  reservation-station entry below);
* other approaches could use a register set with additional registers
  or a structure like the reorder buffer;
- for a loop, if we predict that branches are taken, using reservation
  stations allows multiple executions of the loop to proceed at once;
  the loop is dynamically unrolled;
- dynamic address disambiguation is performed at the load and store
  buffers;
- the major disadvantage of this approach is the complexity of the
  scheme, which requires a large amount of hardware; also, performance
  is limited by the single completion bus (CDB);
* the advantages of the Tomasulo approach versus compiler scheduling
  for an efficient single-issue pipeline are probably fewer than the
  costs of implementation; however, register renaming becomes more
  important as processors become more aggressive in their issue
  capability;
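- a C sketch of a reservation-station entry and the register status
  table (the fields follow the common textbook presentation, but the
  layout is my own): the Q fields hold the tag of the station that
  will produce a value, which is what implements the renaming;

      struct reservation_station {
          int    busy;
          int    op;            /* operation to perform                   */
          double Vj, Vk;        /* operand values, valid when Qj/Qk == 0  */
          int    Qj, Qk;        /* stations producing Vj/Vk (0 = ready)   */
      };

      struct register_status {
          int Qi;               /* station that will write this register; */
      };                        /* 0 means the register file is current   */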
- Reducing branch penalties with dynamic hardware prediction
  ----------------------------------------------------------

- the last chapter dealt with static branch prediction; we also saw
  delayed branches, scheduled by the compiler; this section deals with
  dynamic branch prediction, starting with the simpler techniques;

  1) Branch-prediction buffers
     -------------------------

- also called a branch history table: a small memory indexed by the
  lower portion of the address of the branch instruction; a bit says
  whether the branch was recently taken or not;
- performance shortcoming: even if a branch is almost always taken, we
  will likely predict incorrectly twice, rather than once, when it is
  not taken;
- to remedy this, two-bit prediction schemes are often used; in this
  case, a prediction must miss twice before it is changed (a C sketch
  of a two-bit predictor follows this subsection);
- studies of n-bit predictors have shown that two-bit predictors do
  almost as well;
* while this scheme is useful for most pipelines, the DLX pipeline
  finds out both whether the branch is taken and what the target of
  the branch is at roughly the same time; remember that the branch
  does a compare of a register against zero during the ID stage, which
  is when the effective address is also computed;
- general accuracy of two-bit branch-prediction buffers for the SPEC89
  benchmarks: from 82% to 99%;
* branch predictors that use the behavior of other branches to make a
  prediction are called correlating predictors or two-level
  predictors;
- example of a predictor that uses 1 bit of correlation: two separate
  prediction bits, one prediction used if the last executed branch was
  not taken and another used if the last branch was taken; note that
  in general the last branch executed is NOT the current branch;
- the example given in the book is called a (1,1) predictor since it
  uses the behavior of the last branch to choose from among a pair of
  one-bit predictors. In the general case, an (m,n) predictor uses the
  behavior of the last m branches to choose from 2^m branch
  predictors, each of which is an n-bit predictor for a single branch;
* experiments show that a (2,2) predictor with 1K entries is much
  better than a (0,2) predictor with 4K entries (no correlation);
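- a minimal C sketch of a two-bit saturating-counter predictor indexed
  by the low bits of the branch PC; the table size and the shift by 2
  (word-aligned instructions) are assumptions of the sketch, and a
  correlating (m,n) predictor would simply fold m bits of global
  branch history into the index:

      #define PRED_ENTRIES 1024

      static unsigned char counters[PRED_ENTRIES];       /* values 0..3 */

      static unsigned index_of(unsigned long pc) {
          return (unsigned)(pc >> 2) % PRED_ENTRIES;
      }

      int predict_taken(unsigned long pc) {
          return counters[index_of(pc)] >= 2;             /* 2,3 = taken */
      }

      void train(unsigned long pc, int taken) {
          unsigned char *c = &counters[index_of(pc)];
          if (taken  && *c < 3) (*c)++;    /* a prediction must miss      */
          if (!taken && *c > 0) (*c)--;    /* twice before it flips       */
      }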
  2) Branch-target buffers
     ---------------------

- to reduce the branch penalty on DLX, we need to know from what
  address to fetch by the end of IF; a branch-prediction cache that
  stores the predicted address for the next instruction after a branch
  is called a branch-target buffer or a branch-target cache;
- if the PC of the fetched instruction matches a PC in the buffer,
  then the corresponding predicted PC is used as the next PC; if a
  match is found, fetching begins immediately at the predicted PC;
* we only need to store the predicted-taken branches in the
  branch-target buffer;
- if the prediction was wrong, the fetched instruction is invalidated
  (turned into a no-op);
- it seems that the penalty for a misprediction is 2 cycles instead of
  1 because the buffer needs to be updated;

- Taking advantage of more ILP with multiple issue
  ------------------------------------------------

- objective: decrease the CPI to less than 1;
- superscalar processors issue more than one instruction at a time;
  they can be scheduled statically by the compiler or dynamically, in
  hardware, based on techniques like Tomasulo and scoreboarding;
- VLIWs, in contrast, issue a fixed number of instructions formatted
  either as one large instruction or as a fixed instruction packet;
  they are inherently statically scheduled by the compiler;
* a little problem with a two-issue superscalar version of DLX: loads
  now have a latency of 3 instructions (the other instruction in the
  same cycle and the two in the next cycle); the branch delay also
  becomes 3 instructions;
- another approach: VLIW (very long instruction word): since the
  burden of choosing the instructions to be issued simultaneously
  falls on the compiler, the hardware that a superscalar needs to make
  these issue decisions is unneeded in a VLIW processor;
- difficulties in expanding the issue rate:
  - inherent limitations of ILP in programs;
  - difficulties in building the underlying hardware;
  - limitations specific to either a superscalar or VLIW
    implementation;
- finding independent instructions to issue becomes even more
  difficult when dealing with multi-cycle operations, such as FP in
  DLX;
* duplicating functional units is not a problem, as the hardware cost
  scales linearly; however, there is a large increase in the memory
  bandwidth and register-file bandwidth;
- an excessive number of read and write ports cannot be supported
  without an increase in the silicon area of the register file and
  possible degradation of the clock speed;
- binary code compatibility is the major logistical problem of VLIW;
- it is not clear whether a superscalar machine is better than a
  vector machine at handling loop-intensive scientific applications;

- Compiler support for exploiting ILP
  -----------------------------------

- if on iteration i the loop references element i-5, it is said to
  have a dependence distance of 5; the greater the distance, the more
  potential parallelism can be obtained by loop unrolling;
- in general, determining whether a dependence exists in a loop is
  NP-complete;
- renaming can be used by the compiler to eliminate output dependences
  and antidependences;
- pointers and indirect indexing make dependence detection basically
  impossible;
- software pipelining: reorganizes the loop such that each iteration
  in the software-pipelined code is made from instructions chosen from
  different iterations of the original loop; can be thought of as
  symbolic loop unrolling (a C sketch follows this section);
- trace scheduling: composed of two processes: trace selection and
  trace compaction; a trace is a sequence of basic blocks and can be
  generated by loop unrolling and branch prediction; remember that
  this is done statically by the compiler;
- loop unrolling, software pipelining, and trace scheduling all aim at
  increasing the amount of ILP that can be exploited by a processor
  issuing more than one instruction on every clock cycle; the
  effectiveness of each of these techniques and their suitability for
  various architectural approaches are among the hottest topics being
  actively pursued by researchers and designers of high-speed
  processors;
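- a C sketch of software pipelining for the classic "add a scalar to a
  vector" loop (my own rendering, assuming n >= 2): each steady-state
  iteration stores the result of iteration i-2, does the add for
  iteration i-1, and loads for iteration i, so no statement in the
  body has a true (RAW) dependence on another statement of the same
  pass;

      void add_scalar_swp(double *x, double s, int n) {
          double loaded, summed;
          int i;

          loaded = x[0];               /* prologue: start the pipeline   */
          summed = loaded + s;
          loaded = x[1];

          for (i = 2; i < n; i++) {
              x[i - 2] = summed;       /* store for iteration i-2        */
              summed   = loaded + s;   /* add   for iteration i-1        */
              loaded   = x[i];         /* load  for iteration i          */
          }

          x[n - 2] = summed;           /* epilogue: drain the pipeline   */
          x[n - 1] = loaded + s;
      }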
- Hardware support for extracting more parallelism
  ------------------------------------------------

- when branch behavior is not well known, compiler techniques alone
  may not be enough;
- to exploit more parallelism, designers have explored an idea called
  speculation, which allows the execution of an instruction before the
  processor knows that the instruction should execute;
- first approach: the compiler chooses to make an instruction
  speculative and the hardware helps by making it easier to ignore the
  outcome of an incorrectly speculated instruction;

  1) Conditional or predicated instructions
     --------------------------------------

- a condition is executed with the instruction, which is turned into a
  no-op if the condition is false;
* conditional instructions can be used to remove the branch
  instructions in simple code sequences, such as:

      if (A == 0) {                   (S in R2, T in R3, A in R1)
          S = T;            ====>     CMOVE R2, R3, R1
      }

- in the example above, the conditional move is used to change a
  control dependence into a data dependence;
- we must guarantee that the speculated instruction does not introduce
  an exception (e.g., when a conditional instruction is moved before a
  branch on which it is dependent);
- the usefulness of conditional instructions is limited by several
  factors:
  - the ones that are annulled still take execution time;
  - they are most useful when the condition can be evaluated early;
  - their use is limited when the control flow involves more than a
    simple alternative sequence;
  - they may have some speed penalty compared with normal
    instructions, which may show up as a higher cycle count for such
    instructions or a slower clock rate overall (current architectures
    provide only simple conditional instructions - usually the
    conditional move is the only one);

  2) Compiler speculation with hardware support
     ------------------------------------------

- in moving an instruction across a branch, the compiler must ensure
  that the exception behavior is not changed and that the dynamic data
  dependences remain the same;
- three methods have been investigated to support more aggressive
  speculation without erroneous exception behavior:
  - the hardware and the OS ignore exceptions for speculated
    instructions;
  - poison bits are attached to the result register written by a
    speculated instruction when the instruction causes an exception;
    they cause a fault when a normal instruction tries to use the
    "poisoned" register;
  - a mechanism is used to indicate that an instruction is
    speculative; the hardware then buffers its result until the
    instruction is no longer speculative;
* exceptions that can be resumed can be accepted and processed for
  speculative instructions just as if they were normal instructions;
- speculative instructions can have their results saved in temporary
  registers and discarded if necessary;

- Poison bits
  -----------

- the use of poison bits allows compiler speculation with less change
  to the exception behavior; programs that caused termination without
  speculation will still cause exceptions when instructions are
  speculated;
* a poison bit is added to every register and another bit is added to
  every instruction to indicate that it is speculative; the poison bit
  of the destination register is set whenever a speculative
  instruction results in a terminating exception; all other exceptions
  are handled immediately; if a speculative instruction uses a
  register with a poison bit turned on, the destination register of
  the instruction has its poison bit turned on; if a normal
  instruction tries to use a poisoned register, it causes a fault;

- Speculative instructions with renaming
  --------------------------------------

- the main disadvantages of the previous schemes are the need to
  introduce copies to deal with register renaming and the possibility
  of exhausting the registers;
- an alternative is to move instructions past branches, flagging them
  as speculative, and to provide renaming and buffering in the
  hardware, much as Tomasulo's algorithm does; this scheme is called
  boosting;
- a boosted instruction is executed speculatively based on a branch;
  its results can be used by other boosted instructions; results of
  boosted instructions are committed to the registers only after the
  branch is resolved;
* now the same register can be used as the destination of a boosted
  instruction, as the results are not speculatively committed to the
  register;

          LW   R1, 0(R3)
          LW+  R1, 0(R2)    ; result committed only if branch is taken
          BNEZ R1, L3
          ADD  R1, R1, #4
      L3: SW   0(R3), R1

- Hardware-based speculation
  --------------------------

- combines three ideas: dynamic branch prediction to choose which
  instructions to execute, speculation to allow the execution of
  instructions before the control dependences are resolved, and
  dynamic scheduling to deal with the scheduling of different
  combinations of basic blocks;
- advantages of hardware-based speculation:
  - able to disambiguate memory references;
  - better when hardware-based branch prediction is better than
    software-based branch prediction done at compile time;
  - maintains a completely precise exception model even for speculated
    instructions;
  - does not require compensation or bookkeeping code;
- main disadvantage: complex and requires substantial hardware
  resources;
- when using Tomasulo, we must separate the bypassing of results among
  instructions, which is needed to execute an instruction
  speculatively, from the actual completion of an instruction; we also
  need to include another stage, the commit stage;
* the reorder buffer is used to hold the results of speculated
  instructions and is also used to pass results among instructions
  that may be speculated; hence, the reorder buffer is a source of
  operands for instructions, just as the reservation stations provide
  operands in Tomasulo's algorithm; the register file is not updated
  until the instruction commits; the CDB is now connected to the
  reorder buffer, which is then connected to the register file;
- each entry in the reorder buffer has three fields: the instruction
  type, the destination field, and the value field; it completely
  replaces the load and store buffers (see the struct sketch after the
  four steps below);
- four steps in execution:
  * ISSUE - issue the instruction if there is an empty reservation
    station and an empty slot in the reorder buffer;
  * EXECUTE - if any of the operands is not available, monitor the
    CDB; when they are available, execute the instruction;
  * WRITE RESULT - when the result is available, write it on the CDB
    (with the reorder buffer tag sent when the instruction issued) and
    from the CDB into the reorder buffer; mark the reservation station
    as available;
  * COMMIT - when an instruction, other than a branch with an
    incorrect prediction, reaches the head of the reorder buffer and
    its result is present in the buffer, update the register with the
    result (or perform a memory write if the instruction is a store);
    when a branch with an incorrect prediction reaches the head of the
    reorder buffer, it indicates that the speculation was wrong; the
    reorder buffer is flushed and execution is restarted at the
    correct successor of the branch;
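- a C sketch (my own simplification, with an assumed 64-entry circular
  buffer and made-up field and function names) of a reorder-buffer
  entry with the three fields above, plus the bookkeeping the commit
  step needs; commit only touches architectural state at the head, and
  a mispredicted branch at the head flushes everything younger;

      enum inst_type { ALU_OP, BRANCH, STORE };

      struct rob_entry {
          enum inst_type type;   /* instruction type                       */
          int  dest;             /* destination register (or store addr)   */
          long value;            /* result, valid once 'ready' is set      */
          int  ready;            /* set at WRITE RESULT                    */
          int  mispredicted;     /* set for a branch found to be wrong     */
      };

      #define ROB_SIZE 64

      void commit(struct rob_entry rob[], int *head, int *tail, long regs[]) {
          while (*head != *tail && rob[*head].ready) {
              struct rob_entry *e = &rob[*head];
              if (e->type == BRANCH && e->mispredicted) {
                  *tail = *head;        /* flush all speculative work       */
                  return;               /* ...and refetch from the correct  */
              }                         /*    successor (not shown)         */
              if (e->type == ALU_OP)
                  regs[e->dest] = e->value;  /* registers updated only here */
              /* a STORE would perform its memory write at this point       */
              *head = (*head + 1) % ROB_SIZE;
          }
      }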
- as the reorder buffer is cleared on a branch misprediction,
  prediction accuracy, misprediction detection, and misprediction
  recovery increase in importance;
- if a speculated instruction raises an exception, the exception is
  recorded in its reorder buffer entry;
- to measure the available ILP, the experiments shown in section 4.7
  assume infinite virtual registers (for renaming), perfect branch
  prediction, perfect jump prediction, and perfect memory
  disambiguation (a toy sketch of this kind of measurement follows
  these notes);
- with the perfect machine, the available parallelism was measured to
  lie between 18 and 150 instruction issues per cycle;
- as the instruction window decreases in size (it was assumed to be
  infinite), so does the available parallelism;
- as expected, branch prediction is critical; the available
  parallelism decreases as perfect branch prediction is replaced by
  real predictors;
- the use of a finite register set also decreases ILP substantially;
- assuming realistic capabilities, FP applications show much better
  parallelism than integer applications;
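- a toy C sketch of a data-flow-limited measurement (my own
  simplification, not the methodology of section 4.7): with perfect
  prediction and unlimited renaming, an instruction can issue one
  cycle after the latest instruction it truly depends on, so the
  schedule length - and hence the available ILP - is set only by the
  RAW dependences in the trace;

      #include <stdio.h>

      #define N 8   /* a tiny made-up "trace" of 8 instructions */

      int main(void) {
          /* deps[i][j] = 1 if instruction i reads a value produced by j */
          static int deps[N][N];
          int cycle[N], last = 0, i, j;

          deps[2][0] = 1;   /* example dependences */
          deps[3][2] = 1;
          deps[5][4] = 1;
          deps[7][6] = 1;

          for (i = 0; i < N; i++) {
              int c = 0;                        /* earliest legal cycle */
              for (j = 0; j < i; j++)
                  if (deps[i][j] && cycle[j] + 1 > c)
                      c = cycle[j] + 1;
              cycle[i] = c;
              if (c > last) last = c;
          }
          printf("available ILP = %.2f instructions/cycle\n",
                 (double)N / (last + 1));
          return 0;
      }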