Chapter 4: Advanced Pipelining and Instruction-Level Parallelism
-----------------------------------------------------------------

- Pipeline CPI = Ideal CPI + Structural stalls + RAW stalls +
  WAR stalls + WAW stalls + Control stalls

    Technique                                   Reduces
    -----------------------------------------   ------------------------
    Loop unrolling                              Control stalls
    Basic pipeline scheduling                   RAW stalls
    Dynamic scheduling w/ scoreboarding         RAW stalls
    Dynamic scheduling w/ register renaming     WAR and WAW stalls
    Dynamic branch prediction                   Control stalls
    Issuing multiple instructions per cycle     Ideal CPI
    Compiler dependence analysis                Ideal CPI / data stalls
    Software pipelining and trace scheduling    Ideal CPI / data stalls
    Speculation                                 All data/control stalls
    Dynamic memory disambiguation               RAW stalls w/ memory

- to obtain substantial performance, we need to exploit parallelism
  across multiple basic blocks;
- unrolling improves the performance of loops substantially, although
  it increases the code size (duhh); a C sketch of unrolling appears
  after this group of notes on dependences;
- dependences:
  - data dependences;
  - name dependences (no value being transmitted between instructions):
    + antidependence (WAR hazard);
    + output dependence (WAW hazard);
  - control dependence;
- register renaming is used to remove name dependences (it couldn't be
  different);
- an instruction that is control dependent on a branch cannot be moved
  before the branch;
- an instruction that is not control dependent on a branch cannot be
  moved after the branch;
- this is clear when performing loop unrolling, as the branches from
  the various iterations do not allow the compiler to exploit ILP
  between iterations; solution: remove the branches! of course, we
  need to assume something about the number of iterations;
- the two properties critical to program correctness, and normally
  preserved by control dependence, are the exception behavior and the
  data flow;
- control dependence is preserved by implementing control hazard
  detection that causes control stalls; techniques to reduce control
  stalls:
  - delayed branches;
  - loop unrolling;
  - conditional instructions;
  - compiler- and hardware-based speculation;
* a loop is parallel if it can be written without a cycle in the
  dependences (S1 may depend on S2, but then S2 must not depend on S1);
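- a minimal C sketch of unrolling by 4 (my own example, assuming the
  trip count n is a multiple of 4; a real compiler would add cleanup
  code for the remainder): removing three of every four branches
  leaves a larger straight-line body in which operations from
  different iterations can be scheduled together;

      /* unroll by 4: assumes n % 4 == 0 */
      void add_scalar(double *x, double s, int n) {
          int i;
          for (i = 0; i < n; i += 4) {
              x[i]     = x[i]     + s;   /* the four statements are      */
              x[i + 1] = x[i + 1] + s;   /* independent, so loads, adds, */
              x[i + 2] = x[i + 2] + s;   /* and stores from different    */
              x[i + 3] = x[i + 3] + s;   /* iterations can overlap       */
          }
      }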
- dynamic scheduling offers several advantages: it enables handling
  some cases where dependences are not known at compile time (they may
  involve a memory reference);
- dynamic scheduling introduces out-of-order completion, which creates
  problems for exception handling;
- divide the ID stage in two stages:
  - issue: decode instruction, check for structural hazards;
  - read operands: wait until no data hazards, then read operands;
- WAR hazards, which did NOT exist in the DLX floating-point or
  integer pipelines, may arise when instructions execute out of order;
  WAW hazards must also be detected;

- Scoreboard
  ----------

* the scoreboard decides when the instruction can read its operands
  and begin execution. If the scoreboard decides the instruction
  cannot execute immediately, it monitors every change in the hardware
  and decides when the instruction CAN execute. It also controls when
  an instruction can write its result into the destination register;
  thus, all hazard detection and resolution is centralized in the
  scoreboard;
- each (FP) operation undergoes four steps:
  * ISSUE - if a functional unit for the instruction is free and no
    other active instruction has the same destination register, the
    scoreboard issues the instruction to the functional unit and
    updates its internal data structure;
  * READ OPERANDS - a source operand is available if no earlier issued
    active instruction is going to write it, i.e., if the register
    containing the operand is not going to be written by a currently
    active functional unit;
  * EXECUTION - the functional unit begins execution upon receiving
    its operands; when the result is ready, it notifies the scoreboard
    that it has completed execution;
  * WRITE RESULT - an instruction is blocked in this stage while a
    previous instruction has still to read the destination register
    (WAR hazard);
- because the operands for an instruction are read only when both
  operands are available in the register file, this scoreboard does
  not take advantage of forwarding; however, this is not that bad, as
  instructions write their results as soon as they complete execution,
  rather than waiting for a fixed pipeline stage (such as WB);
- three parts to the scoreboard: instruction status, functional unit
  status, and register result status (a struct sketch of these tables
  follows this section);
- main cost of the CDC scoreboard: the large number of buses - about
  four times as many as would be required if the processor only
  executed instructions in order;
- the recently increasing interest in dynamic scheduling is motivated
  by attempts to issue more instructions per cycle and by ideas like
  speculation, which naturally build on dynamic scheduling;
- a scoreboard is limited by several factors:
  - the amount of parallelism available among instructions;
  - the number of scoreboard entries;
  - the number and types of functional units;
  - the presence of antidependences and output dependences, which lead
    to WAR and WAW stalls;
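- a rough C sketch of the three scoreboard tables above; the field
  names follow the usual textbook presentation, but the exact layout
  and types here are my own guesses, not the CDC 6600 design:

      enum stage { ISSUED, READ_OPERANDS, EXEC_COMPLETE, WROTE_RESULT };

      struct instruction_status {      /* one row per in-flight instruction */
          enum stage stage;
      };

      struct functional_unit_status {  /* one row per functional unit       */
          int busy;                    /* unit in use?                      */
          int op;                      /* operation being performed         */
          int Fi, Fj, Fk;              /* destination and source registers  */
          int Qj, Qk;                  /* units producing Fj/Fk (0 = none)  */
          int Rj, Rk;                  /* are Fj/Fk ready to be read?       */
      };

      struct register_result_status {  /* one entry per register            */
          int unit;                    /* unit that will write it (0 = none) */
      };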
- The Tomasulo approach (register renaming)
  -----------------------------------------

- combines key elements of the scoreboarding scheme with register
  renaming, which is used to avoid WAR and WAW hazards;
* in Tomasulo's scheme, register renaming is provided by the
  reservation stations, which buffer the operands of instructions
  waiting to issue, and by the issue logic; the basic idea is that a
  reservation station fetches and buffers an operand as soon as it is
  available, eliminating the need to get the operand from a register;
  in addition, pending instructions designate the reservation station
  that will provide their input;
- the use of register renaming to avoid WAR and WAW hazards is the
  main difference between scoreboarding and Tomasulo's scheme;
- two other differences in organization, compared to the scoreboard:
  - hazard detection and execution control are distributed: the
    reservation stations are responsible for that;
  - results are passed directly to functional units from the
    reservation stations where they are buffered, rather than going
    through the registers; this is done by putting the results on the
    common data bus (CDB);
- in Tomasulo, there are only three steps:
  * ISSUE - get an instruction from the floating-point queue; if the
    operation is a floating-point operation, issue it if there is an
    empty reservation station, and send the operands to it if they are
    available in the registers; if the operation is a load or store,
    it can be issued if there is an available buffer;
  * EXECUTE - if any of the operands is not available, monitor the
    CDB; when all operands are available, execute the operation;
  * WRITE RESULT - when the result is available, write it on the CDB
    and from there into the registers and any reservation stations
    waiting for the result;
- when an instruction has issued and is waiting for a result, it
  refers to the operand by the reservation station number, rather than
  by the number of the destination register (see the sketch of a
  reservation-station entry below);
* other approaches could use a register set with additional registers
  or a structure like the reorder buffer;
- for a loop, if we predict that branches are taken, using reservation
  stations allows multiple executions of the loop to proceed at once;
  the loop is dynamically unrolled;
- dynamic address disambiguation is performed at the load and store
  buffers;
- the major disadvantage of this approach is the complexity of the
  scheme, which requires a large amount of hardware; also, performance
  is limited by the single completion bus (CDB);
* the advantages of the Tomasulo approach versus compiler scheduling
  for an efficient single-issue pipeline are probably fewer than the
  costs of implementation; however, register renaming becomes more
  important as processors become more aggressive in their issue
  capability;
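- a C sketch of a reservation-station entry and the register status
  table (the fields follow the common textbook presentation, but the
  layout is my own): the Q fields hold the tag of the station that
  will produce a value, which is what implements the renaming;

      struct reservation_station {
          int    busy;
          int    op;            /* operation to perform                   */
          double Vj, Vk;        /* operand values, valid when Qj/Qk == 0  */
          int    Qj, Qk;        /* stations producing Vj/Vk (0 = ready)   */
      };

      struct register_status {
          int Qi;               /* station that will write this register; */
      };                        /* 0 means the register file is current   */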
- Reducing branch penalties with dynamic hardware prediction
  ----------------------------------------------------------

- the last chapter dealt with static branch prediction; we also saw
  delayed branches, scheduled by the compiler; this section deals with
  dynamic branch prediction, starting with the simpler techniques;

  1) Branch-prediction buffers
     -------------------------

- also called a branch history table: a small memory indexed by the
  lower portion of the address of the branch instruction; a bit says
  whether the branch was recently taken or not;
- performance shortcoming: even if a branch is almost always taken, we
  will likely predict incorrectly twice, rather than once, when it is
  not taken;
- to remedy this, two-bit prediction schemes are often used; in this
  case, a prediction must miss twice before it is changed (a C sketch
  of a two-bit predictor follows this subsection);
- studies of n-bit predictors have shown that two-bit predictors do
  almost as well;
* while this scheme is useful for most pipelines, the DLX pipeline
  finds out both whether the branch is taken and what the target of
  the branch is at roughly the same time; remember that the branch
  does a compare of a register against zero during the ID stage, which
  is when the effective address is also computed;
- general accuracy of two-bit branch-prediction buffers for the SPEC89
  benchmarks: from 82% to 99%;
* branch predictors that use the behavior of other branches to make a
  prediction are called correlating predictors or two-level
  predictors;
- example of a predictor that uses 1 bit of correlation: two separate
  prediction bits, one prediction used if the last executed branch was
  not taken and another used if the last branch was taken; note that
  in general the last branch executed is NOT the current branch;
- the example given in the book is called a (1,1) predictor since it
  uses the behavior of the last branch to choose from among a pair of
  one-bit predictors. In the general case, an (m,n) predictor uses the
  behavior of the last m branches to choose from 2^m branch
  predictors, each of which is an n-bit predictor for a single branch;
* experiments show that a (2,2) predictor with 1K entries is much
  better than a (0,2) predictor with 4K entries (no correlation);
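- a minimal C sketch of a two-bit saturating-counter predictor indexed
  by the low bits of the branch PC; the table size and the shift by 2
  (word-aligned instructions) are assumptions of the sketch, and a
  correlating (m,n) predictor would simply fold m bits of global
  branch history into the index:

      #define PRED_ENTRIES 1024

      static unsigned char counters[PRED_ENTRIES];       /* values 0..3 */

      static unsigned index_of(unsigned long pc) {
          return (unsigned)(pc >> 2) % PRED_ENTRIES;
      }

      int predict_taken(unsigned long pc) {
          return counters[index_of(pc)] >= 2;             /* 2,3 = taken */
      }

      void train(unsigned long pc, int taken) {
          unsigned char *c = &counters[index_of(pc)];
          if (taken  && *c < 3) (*c)++;    /* a prediction must miss      */
          if (!taken && *c > 0) (*c)--;    /* twice before it flips       */
      }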
  2) Branch-target buffers
     ---------------------

- to reduce the branch penalty on DLX, we need to know from what
  address to fetch by the end of IF; a branch-prediction cache that
  stores the predicted address for the next instruction after a branch
  is called a branch-target buffer or a branch-target cache;
- if the PC of the fetched instruction matches a PC in the buffer,
  then the corresponding predicted PC is used as the next PC; if a
  match is found, fetching begins immediately at the predicted PC;
* we only need to store the predicted-taken branches in the
  branch-target buffer;
- if the prediction was wrong, the fetched instruction is invalidated
  (turned into a no-op);
- it seems that the penalty for a misprediction is 2 cycles instead of
  1 because the buffer needs to be updated;

- Taking advantage of more ILP with multiple issue
  ------------------------------------------------

- objective: decrease the CPI to less than 1;
- superscalar processors issue more than one instruction at a time;
  they can be scheduled statically by the compiler or dynamically, in
  hardware, based on techniques like Tomasulo and scoreboarding;
- VLIWs, in contrast, issue a fixed number of instructions formatted
  either as one large instruction or as a fixed instruction packet;
  they are inherently statically scheduled by the compiler;
* a little problem with a two-issue superscalar version of DLX: loads
  now have a latency of 3 instructions (the other instruction in the
  same cycle and the two in the next cycle); the branch delay also
  becomes 3 instructions;
- another approach: VLIW (very long instruction word): since the
  burden of choosing the instructions to be issued simultaneously
  falls on the compiler, the hardware that a superscalar needs to make
  these issue decisions is unneeded in a VLIW processor;
- difficulties in expanding the issue rate:
  - inherent limitations of ILP in programs;
  - difficulties in building the underlying hardware;
  - limitations specific to either a superscalar or VLIW
    implementation;
- finding independent instructions to issue becomes even more
  difficult when dealing with multi-cycle operations, such as FP in
  DLX;
* duplicating functional units is not a problem, as the hardware cost
  scales linearly; however, there is a large increase in the memory
  bandwidth and register-file bandwidth;
- an excessive number of read and write ports cannot be supported
  without an increase in the silicon area of the register file and
  possible degradation of the clock speed;
- binary code compatibility is the major logistical problem of VLIW;
- it is not clear whether a superscalar machine is better than a
  vector machine at handling loop-intensive scientific applications;

- Compiler support for exploiting ILP
  -----------------------------------

- if on iteration i the loop references element i-5, it is said to
  have a dependence distance of 5; the greater the distance, the more
  potential parallelism can be obtained by loop unrolling;
- in general, determining whether a dependence exists in a loop is
  NP-complete;
- renaming can be used by the compiler to eliminate output dependences
  and antidependences;
- pointers and indirect indexing make dependence detection basically
  impossible;
- software pipelining: reorganizes the loop such that each iteration
  in the software-pipelined code is made from instructions chosen from
  different iterations of the original loop; can be thought of as
  symbolic loop unrolling (a C sketch follows this section);
- trace scheduling: composed of two processes: trace selection and
  trace compaction; a trace is a sequence of basic blocks and can be
  generated by loop unrolling and branch prediction; remember that
  this is done statically by the compiler;
- loop unrolling, software pipelining, and trace scheduling all aim at
  increasing the amount of ILP that can be exploited by a processor
  issuing more than one instruction on every clock cycle; the
  effectiveness of each of these techniques and their suitability for
  various architectural approaches are among the hottest topics being
  actively pursued by researchers and designers of high-speed
  processors;
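- a C sketch of software pipelining for the classic "add a scalar to a
  vector" loop (my own rendering, assuming n >= 2): each steady-state
  iteration stores the result of iteration i-2, does the add for
  iteration i-1, and loads for iteration i, so no statement in the
  body has a true (RAW) dependence on another statement of the same
  pass;

      void add_scalar_swp(double *x, double s, int n) {
          double loaded, summed;
          int i;

          loaded = x[0];               /* prologue: start the pipeline   */
          summed = loaded + s;
          loaded = x[1];

          for (i = 2; i < n; i++) {
              x[i - 2] = summed;       /* store for iteration i-2        */
              summed   = loaded + s;   /* add   for iteration i-1        */
              loaded   = x[i];         /* load  for iteration i          */
          }

          x[n - 2] = summed;           /* epilogue: drain the pipeline   */
          x[n - 1] = loaded + s;
      }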
- Hardware support for extracting more parallelism
  ------------------------------------------------

- when branch behavior is not well known, compiler techniques alone
  may not be enough;
- to exploit more parallelism, designers have explored an idea called
  speculation, which allows the execution of an instruction before the
  processor knows that the instruction should execute;
- first approach: the compiler chooses to make an instruction
  speculative and the hardware helps by making it easier to ignore the
  outcome of an incorrectly speculated instruction;

  1) Conditional or predicated instructions
     --------------------------------------

- a condition is executed with the instruction, which is turned into a
  no-op if the condition is false;
* conditional instructions can be used to remove the branch
  instructions in simple code sequences, such as:

      if (A == 0) {                   (S in R2, T in R3, A in R1)
          S = T;            ====>     CMOVE R2, R3, R1
      }

- in the example above, the conditional move is used to change a
  control dependence into a data dependence;
- we must guarantee that the speculated instruction does not introduce
  an exception (e.g., when a conditional instruction is moved before a
  branch on which it is dependent);
- the usefulness of conditional instructions is limited by several
  factors:
  - the ones that are annulled still take execution time;
  - they are most useful when the condition can be evaluated early;
  - their use is limited when the control flow involves more than a
    simple alternative sequence;
  - they may have some speed penalty compared with normal
    instructions, which may show up as a higher cycle count for such
    instructions or a slower clock rate overall (current architectures
    provide only simple conditional instructions - usually the
    conditional move is the only one);

  2) Compiler speculation with hardware support
     ------------------------------------------

- in moving an instruction across a branch, the compiler must ensure
  that the exception behavior is not changed and that the dynamic data
  dependences remain the same;
- three methods have been investigated to support more aggressive
  speculation without erroneous exception behavior:
  - the hardware and the OS ignore exceptions for speculated
    instructions;
  - poison bits are attached to the result register written by a
    speculated instruction when the instruction causes an exception;
    they cause a fault when a normal instruction tries to use the
    "poisoned" register;
  - a mechanism is used to indicate that an instruction is
    speculative; the hardware then buffers its result until the
    instruction is no longer speculative;
* exceptions that can be resumed can be accepted and processed for
  speculative instructions just as if they were normal instructions;
- speculative instructions can have their results saved in temporary
  registers and discarded if necessary;

- Poison bits
  -----------

- the use of poison bits allows compiler speculation with less change
  to the exception behavior; programs that caused termination without
  speculation will still cause exceptions when instructions are
  speculated;
* a poison bit is added to every register and another bit is added to
  every instruction to indicate that it is speculative; the poison bit
  of the destination register is set whenever a speculative
  instruction results in a terminating exception; all other exceptions
  are handled immediately; if a speculative instruction uses a
  register with a poison bit turned on, the destination register of
  the instruction has its poison bit turned on; if a normal
  instruction tries to use a poisoned register, it causes a fault;

- Speculative instructions with renaming
  --------------------------------------

- the main disadvantages of the previous schemes are the need to
  introduce copies to deal with register renaming and the possibility
  of exhausting the registers;
- an alternative is to move instructions past branches, flagging them
  as speculative, and to provide renaming and buffering in the
  hardware, much as Tomasulo's algorithm does; this scheme is called
  boosting;
- a boosted instruction is executed speculatively based on a branch;
  its results can be used by other boosted instructions; results of
  boosted instructions are committed to the registers only after the
  branch is resolved;
* now the same register can be used as the destination of a boosted
  instruction, as the results are not speculatively committed to the
  register;

          LW   R1, 0(R3)
          LW+  R1, 0(R2)    ; result committed only if branch is taken
          BNEZ R1, L3
          ADD  R1, R1, #4
      L3: SW   0(R3), R1

- Hardware-based speculation
  --------------------------

- combines three ideas: dynamic branch prediction to choose which
  instructions to execute, speculation to allow the execution of
  instructions before the control dependences are resolved, and
  dynamic scheduling to deal with the scheduling of different
  combinations of basic blocks;
- advantages of hardware-based speculation:
  - able to disambiguate memory references;
  - better when hardware-based branch prediction is better than
    software-based branch prediction done at compile time;
  - maintains a completely precise exception model even for speculated
    instructions;
  - does not require compensation or bookkeeping code;
- main disadvantage: complex and requires substantial hardware
  resources;
- when using Tomasulo, we must separate the bypassing of results among
  instructions, which is needed to execute an instruction
  speculatively, from the actual completion of an instruction; we also
  need to include another stage, the commit stage;
* the reorder buffer is used to hold the results of speculated
  instructions and is also used to pass results among instructions
  that may be speculated; hence, the reorder buffer is a source of
  operands for instructions, just as the reservation stations provide
  operands in Tomasulo's algorithm; the register file is not updated
  until the instruction commits; the CDB is now connected to the
  reorder buffer, which is then connected to the register file;
- each entry in the reorder buffer has three fields: the instruction
  type, the destination field, and the value field; it completely
  replaces the load and store buffers (see the struct sketch after the
  four steps below);
- four steps in execution:
  * ISSUE - issue the instruction if there is an empty reservation
    station and an empty slot in the reorder buffer;
  * EXECUTE - if any of the operands is not available, monitor the
    CDB; when they are available, execute the instruction;
  * WRITE RESULT - when the result is available, write it on the CDB
    (with the reorder buffer tag sent when the instruction issued) and
    from the CDB into the reorder buffer; mark the reservation station
    as available;
  * COMMIT - when an instruction, other than a branch with an
    incorrect prediction, reaches the head of the reorder buffer and
    its result is present in the buffer, update the register with the
    result (or perform a memory write if the instruction is a store);
    when a branch with an incorrect prediction reaches the head of the
    reorder buffer, it indicates that the speculation was wrong; the
    reorder buffer is flushed and execution is restarted at the
    correct successor of the branch;
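- a C sketch (my own simplification, with an assumed 64-entry circular
  buffer and made-up field and function names) of a reorder-buffer
  entry with the three fields above, plus the bookkeeping the commit
  step needs; commit only touches architectural state at the head, and
  a mispredicted branch at the head flushes everything younger;

      enum inst_type { ALU_OP, BRANCH, STORE };

      struct rob_entry {
          enum inst_type type;   /* instruction type                       */
          int  dest;             /* destination register (or store addr)   */
          long value;            /* result, valid once 'ready' is set      */
          int  ready;            /* set at WRITE RESULT                    */
          int  mispredicted;     /* set for a branch found to be wrong     */
      };

      #define ROB_SIZE 64

      void commit(struct rob_entry rob[], int *head, int *tail, long regs[]) {
          while (*head != *tail && rob[*head].ready) {
              struct rob_entry *e = &rob[*head];
              if (e->type == BRANCH && e->mispredicted) {
                  *tail = *head;        /* flush all speculative work       */
                  return;               /* ...and refetch from the correct  */
              }                         /*    successor (not shown)         */
              if (e->type == ALU_OP)
                  regs[e->dest] = e->value;  /* registers updated only here */
              /* a STORE would perform its memory write at this point       */
              *head = (*head + 1) % ROB_SIZE;
          }
      }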
- as the reorder buffer is cleared on a branch misprediction,
  prediction accuracy, misprediction detection, and misprediction
  recovery increase in importance;
- if a speculated instruction raises an exception, the exception is
  recorded in its reorder buffer entry;
- to measure the available ILP, the experiments shown in section 4.7
  assume infinite virtual registers (for renaming), perfect branch
  prediction, perfect jump prediction, and perfect memory
  disambiguation (a toy sketch of this kind of measurement follows
  these notes);
- with the perfect machine, the available parallelism was measured to
  lie between 18 and 150 instruction issues per cycle;
- as the instruction window decreases in size (it was assumed to be
  infinite), so does the available parallelism;
- as expected, branch prediction is critical; the available
  parallelism decreases as perfect branch prediction is replaced by
  real predictors;
- the use of a finite register set also decreases ILP substantially;
- assuming realistic capabilities, FP applications show much better
  parallelism than integer applications;
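- a toy C sketch of a data-flow-limited measurement (my own
  simplification, not the methodology of section 4.7): with perfect
  prediction and unlimited renaming, an instruction can issue one
  cycle after the latest instruction it truly depends on, so the
  schedule length - and hence the available ILP - is set only by the
  RAW dependences in the trace;

      #include <stdio.h>

      #define N 8   /* a tiny made-up "trace" of 8 instructions */

      int main(void) {
          /* deps[i][j] = 1 if instruction i reads a value produced by j */
          static int deps[N][N];
          int cycle[N], last = 0, i, j;

          deps[2][0] = 1;   /* example dependences */
          deps[3][2] = 1;
          deps[5][4] = 1;
          deps[7][6] = 1;

          for (i = 0; i < N; i++) {
              int c = 0;                        /* earliest legal cycle */
              for (j = 0; j < i; j++)
                  if (deps[i][j] && cycle[j] + 1 > c)
                      c = cycle[j] + 1;
              cycle[i] = c;
              if (c > last) last = c;
          }
          printf("available ILP = %.2f instructions/cycle\n",
                 (double)N / (last + 1));
          return 0;
      }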