Chapter 3: Pipelining
---


1. Instruction fetch (IF)
-------------------------

	IR <- Mem[PC];
	NPC <- PC + 4;

2. Instruction decode / register fetch (ID)
-------------------------------------------

	A <- Regs[IR6..10];
	B <- Regs[IR11..15];
	Imm <- ((IR16)^16##IR16..31)

3. Execution / effective address (EX)
-------------------------------------

  Memory reference:

	ALUOutput <- A + Imm;

  Register-Register ALU instruction:

	ALUOutput <- A func B;

  Register-immediate ALU instruction:

	ALUOutput <- A op Imm;

  Branch:

	ALUOutput <- NPC + Imm;
	Cond <- (A op 0)

4. Memory access / branch completion (MEM)
------------------------------------------

  Memory reference:

	LMD <- Mem[ALUOutput] or
	Mem[ALUOutput] <- B;

  Branch:

	if (cond) PC <- ALUOutput else PC <- NPC;

5. Write-back (WB)
------------------

  Register-register ALU:

	Regs[IR16..20] <- ALUOutput;
	
  Register-immediate ALU:

	Regs[IR11..15] <- ALUOutput;

  Load instruction:

	Regs[IR11..15] <- LMD;


- the use of different caches for instruction and data eliminates the
  conflict for a single memory that would arise between instruction
  fetch and data memory access for two instructions;

- imbalance among pipe stages reduces performance since the clock can
  run no faster than the slowest pipeline stage;

- hazards: structural hazards, data hazards, and control hazards;

- data hazards:

	- read after write (RAW;
	- write after write (WAW);
	- write after read (WAR);

- the process of letting an instruction move from the ID into the EX
  state of the DLX pipeline is called instruction issue.  For the DLX 
  integer pipeline, all the data hazards can be checked during the ID
  phase.  If a data hazard exists, the instruction is stalled BEFORE
  it is issued;

- when a hazard is detected, we need only change the control portion
  of the ID/EX pipeline register to all 0s, which happens to be a no-op;
  in addition, we simply recirculate the contents of the IF/ID
  registers to hold the stalled instruction;

- the simplest method of dealing with branches is to stall the
  pipeline as soon as we detect the branch until we reach the MEM
  stage, which determines the new PC;

- branch stall: originally 2 cycles (3 if IF needs to be repeated):

	IF	ID	EX	MEM	WB
		IF	stall	stall	IF	ID

- the number of clock cycles in a branch stall can be reduced by two
  steps:

	- find out whether the branch is taken or not taken earlier in
	  the pipeline;

	- compute the taken PC earlier;

- both of these steps should be taken as early as possible in the
  pipeline;

- with a separate adder to calculate the branch address in ID and a
  branch decision also made during ID, there is only a one-clock-cycle
  stall on branches:

	IF	ID	EX	MEM	WB
		IF	-	-	-	-
			IF	ID	EX	MEM	WB

- in some machines, branch hazards are even more expensive in clock
  cycles than in our example, since the time to evaluate the branch
  condition and compute the destination can be even longer;

- large, deeply pipeline machines often have branch penalties of six
  or seven clock cycles;

- 67% of the conditional branches are taken on average;

- 85% of the backwards conditional branches are taken on average
  (usually because of loops);

- predict-not-taken, or predict-untaken strategy:


	- untaken branch:

	branch	IF	ID	EX	MEM	WB
			IF	ID	EX	MEM	WB

	- taken branch:

	branch	IF	ID	EX	MEM	WB
			IF	idle	idle	idle
				IF	ID	EX	MEM 	WB

- because in DLX both the target address and the condition are
  evaluated in the same cycle (ID), there is no advantage in using a
  predict-taken approach;

- another scheme in use in some machines is to use delayed branches,
  filling the branch delay slots with unconditional instructions;

- the limitations on delayed-branch scheduling arise from (1) the
  restrictions on the instructions that are scheduled into the delay
  slots and (2) our ability to predict at compile time whether a
  branch is likely to be taken or not;

- to improve the ability of the compiler to fill branch delay slots,
  most machines with conditional branches have introduced a cancelling
  or nullifying branch. In a cancelling branch, the instruction
  includes the direction that the branch was predicted.  When the branch
  behaves as predicted, the instruction in the slot is simply executed
  as it would normally be.  If not, the instruction is turned into a
  no-op;

- the advantage of cancelling branches is that they eliminate the
  requirements on the instruction placed in the delay slot;

- two ways to statically predict branches: by examination of the
  program behavior and by the use of profile information collected 
  from earlier runs of the program;

- exceptions can be:

	+ synchronous or asynchronous: 

		- synchronous: occurs at the same place every time the
		  program is executed with same data and memory
		  allocation;

		- asynchronous: with the exception of hardware
		  malfunction, asynchronous evens are caused by
		  devices external to the processor and memory;

	+ user requested vs. coerced:

		- user requested: user task directly asks for it;

		- coerced: caused by some hardware event that is not
		  under the control of the user program;

	+ user maskable vs. user nonmaskable:

	+ within versus between instructions:

		- within: occurs in the middle of executing an
		  instruction; are usually synchronous;

		- between: events do not prevent instruction
		  completion;

	+ resume vs. terminate:

		- terminate: program's execution always stops after
		  the exception;

		- resume: program continues after exception handling;

- if a pipeline provides the ability for the machine to handle the
  exception, save the state, and restart without affecting the
  execution of the program, the pipeline or machine is said to be
  restartable (most machines today are restartable);

- when an exception occurs, the pipeline control can take the
  following steps to save the pipeline state safely:
	
	- force a trap instruction into the pipeline on the next IF;

	- until the trap is taken, turn off all writes for the
	  faulting instruction and for all instructions that follow 
 	  in the pipeline (can be done by zeroing into the pipeline 
	  latches);

	- after the exception-handling routing in the operating system
	  receives control, it immediately saves the PC of the
	  faulting instruction.  This value will be used to return from 
	  the exception later;

- when delayed branches are used, it is no longer possible to
  re-create the state of the machine with a single PC because the
  instructions in the pipeline may not be sequentially related.  So, we
  need to save and restore as many PCs as the length of the branch delay
  plus one;

- any machine with demand paging or IEEE arithmetic trap handlers must
  make its exceptions precise;

- exceptions in pipeline stages:

	- IF: page fault on instruction fetch, misaligned access,
	  protection violation;

	- ID: undefined or illegal opcode;

	- EX: arithmetic exception;

	- MEM: page fault on data fetch, misaligned access, protection
	  violation;

	- WB: none

- when pipelining a given instruction set is difficult, designers
  usually pipeline the microinstruction execution: a microinstruction
  is a simple instruction used in sequences to implement a more complex
  instruction set.  They are simple and look a lot like a RISC instr.
  set;

- latency of a functional unit: number of intervening clock cycles
  between an instruction that produces a result and an instruction
  that uses the result; usually the number of stages after the EX
  cycle that an instruction produces a result;

- initiation or repeat interval: number of cycles that must elapse
  between issuing two operations of a given type;

-	Instruction		Latency		  Initiation interval
	-----------		-------		  -------------------

	integer ALU		0			1
	
	data memory (loads)	1			1

	FP add			3			1

	FP multiply		6			1

	FP divide		14			15 (not pipelined)


- checks made in ID before an instruction can issue: structural hazards, 
  RAW data hazard, and WAW data hazard;
	
MIPS R4000
----------

- deeper pipeline allows it to achieve higher clock rates (8 stages);

- in addition to substantially increasing the amount of forwarding
  required, this longer latency pipeline increases both the load and
  branch delays:

	- load: 2 cycles;

	- branch: 3 cycles;

- the MIPS architecture has a single-cycle delayed branch; the R4000
  uses a predict-not-taken strategy for the remaining two cycles;

	- untaken branches: 1 delay slot;

	- taken branches: 1 delay slot + 2 stalls;

- four major causes of pipeline stalls in the R4000:

	- load stalls (moderate);

	- branch stalls (high);

	- FP result stalls (high);

	- FP structural stalls (low);

- the frequency of the last two types of stall show that reducing the
  latency of FP operations should be the first target, rather than
  more pipelining or replication of the functional units;

- limited parallelism in the instruction stream means that increasing
  the number of pipeline stages, called the pipeline depth, will
  eventually increase the CPI, due to dependences that require stalls;

- second, clock skew and latch overhead combine to limit the decrease
  in clock period obtaining by further pipelining (experiments show
  decrease in performance for more than 8 stages);