Chapter 11 Latency Tolerance

Three ways to reduce effective latency:
1. Reduce the access time to each level of the extended memory hierarchy.
2. Structure the system to reduce the frequency of high-latency accesses. Replication, caches.
3. Structure the application to reduce the frequency of high-latency accesses. Good partitioning.

Latency tolerance: hide latency by overlapping it with computation.

4 approaches to latency tolerance:
1. Block data transfer -- make messages longer.
2. Precommunication -- prefetching.
3. Proceeding past an outstanding communication event in the same thread.
4. Multithreading.

3 major types of overlap that can be exploited by latency tolerance schemes:
1. Within the communication pipeline: transmit several words at a time through the communication fabric.
2. Communication with communication: simultaneous communications with multiple nodes can be overlapped.
3. Computation with communication: the CPU works while the communication processor (CP) takes care of the communication.

Block data transfer:
Advantages:
1. The receiver only sees the propagation latency of the first word sent -- a new word arrives with every subsequent network cycle, as opposed to waiting for an ack before sending the next word.
2. Larger messages amortize the overhead associated with each message.
3. Larger messages may require only one ack.

Precommunication:
Essentially prefetching. Requires long-latency events to be predictable. Must not stall the processor.

Proceeding past communication in the same thread:
Basically non-blocking communication.

Multithreading:
Instead of continuing to execute the same thread after a communication, switch threads.

Tolerating latency requires:
1. Extra parallelism.
2. Increased bandwidth -- multiple outstanding, overlapping communications.
3. More sophisticated hardware and protocols.

Limitations of latency tolerance:
1. Application limitations -- limited amount of truly parallel code.
2. Communication architecture limitations -- the communication architecture may limit the amount of outstanding and overlapping communication. This can be further subdivided into problems involving overhead, assist occupancy, point-to-point bandwidth, and network capacity.
3. Processor limitations -- non-blocking reads/writes, etc.

LATENCY TOLERANCE IN EXPLICIT MESSAGE PASSING
Mainly sender-initiated communication. Evaluate using the four main approaches to latency tolerance:
1. Block data transfer -- make messages large to amortize overhead.
2. Precommunication -- there is high pressure on the receiver's buffers because the receiver may not use the messages for a while. To solve this problem, intersperse computation between the precommunications.
3. Proceeding past communication in the same thread -- use asynchronous messages (see the sketch after this list).
4. Multithreading -- context switch after a send/receive.
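The asynchronous-message option above can be sketched with the standard MPI non-blocking calls. This is only an illustration of "proceeding past communication in the same thread"; the buffer size, tag, and the do_independent_work() routine are made-up placeholders, not anything from the chapter.

    /* Sketch: overlap computation with communication using non-blocking MPI.
       The tag, buffer size, and do_independent_work() are illustrative only. */
    #include <mpi.h>

    #define N 1024

    static void do_independent_work(double *local, int n)
    {
        /* Placeholder for computation that does not touch recvbuf. */
        for (int i = 0; i < n; i++)
            local[i] = local[i] * 2.0 + 1.0;
    }

    void exchange_and_compute(double *sendbuf, double *recvbuf,
                              double *local, int peer)
    {
        MPI_Request reqs[2];

        /* Post the receive and the send, then return immediately
           instead of blocking the thread on the network. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Proceed past the outstanding communication in the same thread:
           do work that does not depend on recvbuf. */
        do_independent_work(local, N);

        /* Only block when the received data is actually needed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

The key point is that MPI_Isend/MPI_Irecv return before the transfer completes, so the thread only blocks in MPI_Waitall, when the data is actually needed.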
LATENCY TOLERANCE IN A SHARED ADDRESS SPACE
Mainly receiver-initiated communication. Without caching of shared data, there may also be sender-initiated communication via writes to remote memories.

BLOCK DATA TRANSFER IN A SHARED ADDRESS SPACE

MULTITHREADING IN A SHARED ADDRESS SPACE
Advantages over other approaches to latency tolerance:
1. Requires no special software analysis or support, other than having a programming model with explicit threads.
2. It is invoked dynamically, so it can handle unpredictable situations like cache misses.
3. Other techniques primarily target memory-access latency. Multithreading can handle any long-latency event, such as instruction latency (fp divide).
4. It does not change the memory consistency model, since it does not reorder the memory operations within a thread.

Unfortunately, it is not used in practice:
1. Requires substantial changes to the processor.
2. Not as effective for uniprocessor and desktop systems.

Context: the state that must be saved when switching between threads. Includes the processor registers, the program counter, the stack pointer, and the per-process parts of the processor status word (condition codes).

Two general approaches to multithreading, depending on when a context switch is made:
1. Blocked -- a thread runs continuously until it blocks on a high-latency event. A new thread is then switched in. Similar to what happens in modern OSs.
2. Interleaved -- switch threads on every processor cycle.

Blocked multithreading:
Hardware support involves maintaining multiple hardware register files and program counters for different threads. An active thread is a thread that is assigned one of these hardware copies. The number of active threads may be less than the number of ready threads, which are threads that are ready to run.

When an instruction stalls in the pipeline and we do a context switch, we must do something with the instructions already in the pipeline. Three choices:
1. Allow all instructions in the pipeline to complete while fetching instructions from the new thread. This is difficult, because instructions from multiple contexts will be in the pipeline at the same time, so the interlock control must be modified. Also, the instructions already in the pipeline may depend on the stalled data and thus completely stall the pipeline.
2. Allow all instructions in the pipeline to complete before fetching instructions from the new thread. Again, instructions already in the pipeline may depend on the stalled instruction and halt the pipeline.
3. Nullify the instructions in the pipeline and start fetching from the new thread. Easy to implement and the preferred choice. However, this scheme raises the context switch cost.

How to make the thread switch:
1. Synchronization events are followed by explicit context switch instructions.
2. Long pipeline-latency instructions, such as fp divide, are followed by a context switch instruction.

Usually 4-8 active thread contexts are enough.

Implementation issues for the blocked scheme:
1. State replication:
   a.) Shared large register file: a cache-like structure, indexed by context identifiers and register offsets.
   b.) Context status word: a global status word that specifies which context is currently running, whether switching is enabled, and a bit vector that indicates which active contexts are ready to run.
2. PC:
   a.) The PC chain holds the PCs for the instructions in the pipeline.
   b.) The exception PC (EPC) holds the address of the last instruction returned from the pipeline.
      i.) On a fault, all instructions in the pipeline are nullified and a handler is invoked. Upon return, the EPC points to the faulting instruction that must be re-executed.
      ii.) One EPC per active context.
3. Control:
   a.) Detect when to switch and do the switch: cache miss, explicit context switch, time-out (end of quantum).
   b.) Round-robin thread scheduling.
   c.) On a context switch, complete the following actions (see the sketch after this list):
      i.) Save the PC.
      ii.) Nullify all incomplete instructions in the pipeline.
      iii.) Start executing the new thread from its saved PC.
      iv.) Load the ASID into the TLB's register.
      v.) Load the control/status registers and the context identifier.
      vi.) Switch the register file to the new context's register file.
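The control actions above can be pictured as a small software model. This is a rough sketch only: the struct fields, the 4-context limit, and the round-robin search are illustrative assumptions, not a description of any real processor's blocked-multithreading hardware.

    /* Sketch: per-context hardware state and the switch actions listed above.
       Field names and the 4-context limit are illustrative assumptions. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ACTIVE_CONTEXTS 4    /* 4-8 active contexts are usually enough */
    #define NUM_REGS 32

    typedef struct {
        uint64_t pc;                 /* saved program counter                  */
        uint64_t regs[NUM_REGS];     /* per-context register file segment      */
        uint64_t status;             /* control/status word for this context   */
        uint16_t asid;               /* address-space ID loaded into the TLB   */
        bool     ready;              /* bit from the ready bit vector          */
    } hw_context;

    typedef struct {
        hw_context ctx[NUM_ACTIVE_CONTEXTS];
        int        current;          /* which context is currently running     */
        bool       switching_enabled;
    } context_status_word;

    /* On a cache miss, explicit switch instruction, or quantum time-out:
       save the PC, pick the next ready context round-robin, and resume it. */
    int blocked_context_switch(context_status_word *csw, uint64_t stalled_pc)
    {
        if (!csw->switching_enabled)
            return csw->current;

        csw->ctx[csw->current].pc = stalled_pc;   /* i.  save the PC            */
        /* ii. pipeline nullification happens in hardware, not modeled here     */

        for (int i = 1; i <= NUM_ACTIVE_CONTEXTS; i++) {
            int next = (csw->current + i) % NUM_ACTIVE_CONTEXTS;
            if (csw->ctx[next].ready) {
                csw->current = next;              /* iii-vi. resume from the    */
                return next;                      /* saved PC, ASID, status,    */
            }                                     /* and register file segment  */
        }
        return csw->current;   /* no other ready context: keep the current one */
    }

In real hardware these steps overlap with pipeline nullification; the model only shows the bookkeeping a context switch has to perform.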
Interleaved multithreading:
Also uses multiple register files. When a thread encounters a high-latency event, it is taken off the ready list until the event completes. Most interleaved schemes don't have caches, which means that a larger number of threads is needed to hide latency.

Advantages: no context switch overhead; no need to nullify instructions in the pipeline.
Disadvantages: typically higher hardware cost.

Technique: lookahead bits. Some interleaved machines do not have hardware interlocks to stall the pipeline in case of hazards. Instead, the compiler generates every instruction with a lookahead tag that specifies how far away the next dependent instruction is in the instruction stream. The machine then knows how many more instructions from the same thread can be executed before having to do a context switch.

Another approach is to provide a fully functional pipeline with hazard interlocks. If a cache miss occurs, all instructions *from that thread* are nullified, but instructions from other threads are allowed to continue. Less overhead than in the blocked case, where all instructions are nullified.

Implementation issues for the interleaved scheme:
1.) State replication:
   a.) Pressure is greater for fast access to different segments of the register file.
2.) PC:
   a.) The processor's PC chain must now carry a context ID for each pipeline stage.
   b.) To nullify instructions, the context ID must be broadcast to all pipeline stages.
   c.) The next PC can come from several places: the EPC, branch prediction, or a computed branch address.
   d.) How to handle exceptions: allow the exception handler thread to be interleaved with other threads, or nullify all instructions and only allow the exception handler to execute.
3.) Control:
   a.) Each context has an available signal indicating whether it can be scheduled on the next cycle.
   b.) Round-robin scheduling.

Integrating multithreading with multiple issue (SMT):
1.) Let available instructions from different threads be scheduled in the same cycle, filling issue slots more effectively.
2.) Instructions from different threads compete for issue slots in every cycle.
3.) Register renaming is important. For each instruction, we need to keep track of its context.

Lockup-free cache design:
1.) Need to keep the address, the type of request, the place to return the data, and the current status of the request (see the sketch below).
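That per-request bookkeeping is essentially a table of miss status holding registers (MSHRs). A minimal sketch, assuming a fixed-size table and illustrative field names:

    /* Sketch: miss status holding registers (MSHRs) for a lockup-free cache.
       Each entry records the address, the type of request, where to return the
       data, and the current status of the outstanding miss. Sizes are
       illustrative assumptions. */
    #include <stdint.h>

    #define NUM_MSHRS 8

    typedef enum { REQ_READ, REQ_WRITE, REQ_PREFETCH } req_type;
    typedef enum { MSHR_FREE, MSHR_PENDING, MSHR_DATA_RETURNED } mshr_status;

    typedef struct {
        uint64_t    block_addr;     /* address of the missing block            */
        req_type    type;           /* type of request                         */
        uint8_t     dest_reg;       /* place to return the data (register)     */
        uint8_t     context_id;     /* which thread/context issued the miss    */
        mshr_status status;         /* current status of the request           */
    } mshr_entry;

    static mshr_entry mshr_table[NUM_MSHRS];

    /* Allocate an entry for a new outstanding miss; return -1 if the cache
       must stall because all MSHRs are busy. */
    int mshr_allocate(uint64_t block_addr, req_type type,
                      uint8_t dest_reg, uint8_t context_id)
    {
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshr_table[i].status == MSHR_FREE) {
                mshr_table[i] = (mshr_entry){ block_addr, type,
                                              dest_reg, context_id,
                                              MSHR_PENDING };
                return i;
            }
        }
        return -1;   /* no free MSHR: only now does the processor have to stall */
    }

The processor only stalls when mshr_allocate() returns -1, i.e., when the number of outstanding misses exceeds the table size, which is what makes the cache lockup-free.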