Chapter 3: Pipelining --- 1. Instruction fetch (IF) ------------------------- IR <- Mem[PC]; NPC <- PC + 4; 2. Instruction decode / register fetch (ID) ------------------------------------------- A <- Regs[IR6..10]; B <- Regs[IR11..15]; Imm <- ((IR16)^16##IR16..31) 3. Execution / effective address (EX) ------------------------------------- Memory reference: ALUOutput <- A + Imm; Register-Register ALU instruction: ALUOutput <- A func B; Register-immediate ALU instruction: ALUOutput <- A op Imm; Branch: ALUOutput <- NPC + Imm; Cond <- (A op 0) 4. Memory access / branch completion (MEM) ------------------------------------------ Memory reference: LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B; Branch: if (cond) PC <- ALUOutput else PC <- NPC; 5. Write-back (WB) ------------------ Register-register ALU: Regs[IR16..20] <- ALUOutput; Register-immediate ALU: Regs[IR11..15] <- ALUOutput; Load instruction: Regs[IR11..15] <- LMD; - the use of different caches for instruction and data eliminates the conflict for a single memory that would arise between instruction fetch and data memory access for two instructions; - imbalance among pipe stages reduces performance since the clock can run no faster than the slowest pipeline stage; - hazards: structural hazards, data hazards, and control hazards; - data hazards: - read after write (RAW; - write after write (WAW); - write after read (WAR); - the process of letting an instruction move from the ID into the EX state of the DLX pipeline is called instruction issue. For the DLX integer pipeline, all the data hazards can be checked during the ID phase. If a data hazard exists, the instruction is stalled BEFORE it is issued; - when a hazard is detected, we need only change the control portion of the ID/EX pipeline register to all 0s, which happens to be a no-op; in addition, we simply recirculate the contents of the IF/ID registers to hold the stalled instruction; - the simplest method of dealing with branches is to stall the pipeline as soon as we detect the branch until we reach the MEM stage, which determines the new PC; - branch stall: originally 2 cycles (3 if IF needs to be repeated): IF ID EX MEM WB IF stall stall IF ID - the number of clock cycles in a branch stall can be reduced by two steps: - find out whether the branch is taken or not taken earlier in the pipeline; - compute the taken PC earlier; - both of these steps should be taken as early as possible in the pipeline; - with a separate adder to calculate the branch address in ID and a branch decision also made during ID, there is only a one-clock-cycle stall on branches: IF ID EX MEM WB IF - - - - IF ID EX MEM WB - in some machines, branch hazards are even more expensive in clock cycles than in our example, since the time to evaluate the branch condition and compute the destination can be even longer; - large, deeply pipeline machines often have branch penalties of six or seven clock cycles; - 67% of the conditional branches are taken on average; - 85% of the backwards conditional branches are taken on average (usually because of loops); - predict-not-taken, or predict-untaken strategy: - untaken branch: branch IF ID EX MEM WB IF ID EX MEM WB - taken branch: branch IF ID EX MEM WB IF idle idle idle IF ID EX MEM WB - because in DLX both the target address and the condition are evaluated in the same cycle (ID), there is no advantage in using a predict-taken approach; - another scheme in use in some machines is to use delayed branches, filling the branch delay slots with unconditional instructions; - the limitations on delayed-branch scheduling arise from (1) the restrictions on the instructions that are scheduled into the delay slots and (2) our ability to predict at compile time whether a branch is likely to be taken or not; - to improve the ability of the compiler to fill branch delay slots, most machines with conditional branches have introduced a cancelling or nullifying branch. In a cancelling branch, the instruction includes the direction that the branch was predicted. When the branch behaves as predicted, the instruction in the slot is simply executed as it would normally be. If not, the instruction is turned into a no-op; - the advantage of cancelling branches is that they eliminate the requirements on the instruction placed in the delay slot; - two ways to statically predict branches: by examination of the program behavior and by the use of profile information collected from earlier runs of the program; - exceptions can be: + synchronous or asynchronous: - synchronous: occurs at the same place every time the program is executed with same data and memory allocation; - asynchronous: with the exception of hardware malfunction, asynchronous evens are caused by devices external to the processor and memory; + user requested vs. coerced: - user requested: user task directly asks for it; - coerced: caused by some hardware event that is not under the control of the user program; + user maskable vs. user nonmaskable: + within versus between instructions: - within: occurs in the middle of executing an instruction; are usually synchronous; - between: events do not prevent instruction completion; + resume vs. terminate: - terminate: program's execution always stops after the exception; - resume: program continues after exception handling; - if a pipeline provides the ability for the machine to handle the exception, save the state, and restart without affecting the execution of the program, the pipeline or machine is said to be restartable (most machines today are restartable); - when an exception occurs, the pipeline control can take the following steps to save the pipeline state safely: - force a trap instruction into the pipeline on the next IF; - until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow in the pipeline (can be done by zeroing into the pipeline latches); - after the exception-handling routing in the operating system receives control, it immediately saves the PC of the faulting instruction. This value will be used to return from the exception later; - when delayed branches are used, it is no longer possible to re-create the state of the machine with a single PC because the instructions in the pipeline may not be sequentially related. So, we need to save and restore as many PCs as the length of the branch delay plus one; - any machine with demand paging or IEEE arithmetic trap handlers must make its exceptions precise; - exceptions in pipeline stages: - IF: page fault on instruction fetch, misaligned access, protection violation; - ID: undefined or illegal opcode; - EX: arithmetic exception; - MEM: page fault on data fetch, misaligned access, protection violation; - WB: none - when pipelining a given instruction set is difficult, designers usually pipeline the microinstruction execution: a microinstruction is a simple instruction used in sequences to implement a more complex instruction set. They are simple and look a lot like a RISC instr. set; - latency of a functional unit: number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result; usually the number of stages after the EX cycle that an instruction produces a result; - initiation or repeat interval: number of cycles that must elapse between issuing two operations of a given type; - Instruction Latency Initiation interval ----------- ------- ------------------- integer ALU 0 1 data memory (loads) 1 1 FP add 3 1 FP multiply 6 1 FP divide 14 15 (not pipelined) - checks made in ID before an instruction can issue: structural hazards, RAW data hazard, and WAW data hazard; MIPS R4000 ---------- - deeper pipeline allows it to achieve higher clock rates (8 stages); - in addition to substantially increasing the amount of forwarding required, this longer latency pipeline increases both the load and branch delays: - load: 2 cycles; - branch: 3 cycles; - the MIPS architecture has a single-cycle delayed branch; the R4000 uses a predict-not-taken strategy for the remaining two cycles; - untaken branches: 1 delay slot; - taken branches: 1 delay slot + 2 stalls; - four major causes of pipeline stalls in the R4000: - load stalls (moderate); - branch stalls (high); - FP result stalls (high); - FP structural stalls (low); - the frequency of the last two types of stall show that reducing the latency of FP operations should be the first target, rather than more pipelining or replication of the functional units; - limited parallelism in the instruction stream means that increasing the number of pipeline stages, called the pipeline depth, will eventually increase the CPI, due to dependences that require stalls; - second, clock skew and latch overhead combine to limit the decrease in clock period obtaining by further pipelining (experiments show decrease in performance for more than 8 stages);