Chapter 11: Latency Tolerance
---

Latency tolerance in the communication pipeline
-----------------------------------------------
- four key approaches:
  - block data transfer: make individual messages larger so that they communicate more than a word and can be pipelined through the network (overlaps communication with communication);
  - precommunication: generating the communication before the point where the operation naturally appears in the program, so that it is partially or entirely completed before the data is actually needed; of course, the precommunication transaction itself must not stall the processor until it completes, or no overlap is achieved;
  - proceeding past communication in the same thread: the communication happens at the original time, but the processor is allowed to proceed past it and find other independent computation or communication that comes later in the same process or thread;
  - multithreading: similar to the previous case, except that the independent work is found by switching to another thread that has been scheduled to run on the same processor;
- how much latency can actually be hidden depends on many factors involving both the application and the architecture; relevant application characteristics include the structure of the communication and how much other work can be overlapped with it; architectural issues include how much of the endpoint processing is performed on the main processor versus on the assist, whether communication can be overlapped with computation, other communication, or both, how many messages involving a given processor may be outstanding at a time, and the occupancies of the assist and of the stages of the network pipeline;
- main limitations: application limitations, limitations of the communication architecture, and processor limitations;
- making the communication architecture efficient (high bandwidth, low endpoint overhead or occupancy) is very important for tolerating latency effectively;

Latency tolerance in a shared address space
-------------------------------------------
- techniques need to be hardware-supported to be effective;
- for block transfer to maintain even local coherence, the assist has some work to do for every cache block and becomes an integral part of the transfer pipeline; it can therefore become the bottleneck in transfer bandwidth, affecting even those blocks for which no interaction with the caches is necessary;
- performance advantages of using block transfer:
  - amortized per-message overhead;
  - pipelined transfer of large chunks of data;
  - less wasted bandwidth;
  - replication of transferred data in the destination main memory;
  - bundling of synchronization with data transfer;
- potential performance disadvantages of block transfer:
  - higher overhead per transfer;
  - increased contention;
  - extra work;

Proceeding past long-latency events
-----------------------------------
- in multiprocessors, proceeding past memory operations before they complete or commit violates the sufficient conditions for SC; whether or not it violates SC itself depends on whether the operations are allowed to become visible out of program order;
- the extent of overlap possible is thus determined by both the machine mechanisms and the consistency model adopted;
- SC can take advantage of write buffers, but the benefits are limited; PC/TSO allow only reads to complete before previous writes;
- therefore, the weaker the consistency model, the more reordering and overlap optimizations become legal (a short C11 sketch of this contrast appears after the prefetching notes below);
- one study assuming an RC model found that a substantial portion of read latency can indeed be hidden using a dynamically scheduled processor with speculative execution, and that the amount of read latency that can be hidden increases with the size of the reorder buffer;
- the most interesting question is whether, with aggressive, dynamically scheduled processors, RC still buys performance gains over SC at the hardware/software interface; results indicate that RC is still beneficial, even though the gap has closed substantially;
- without hardware prefetching and speculative reads, RC provides substantial advantages over SC even with a dynamically scheduled processor;

Precommunication in a shared address space
------------------------------------------
- especially interesting in a cache-coherent machine, since shared nonlocal data may be precommunicated directly into a processor's cache rather than into a special buffer, and since precommunication interacts with the cache coherence protocol;
- two categories of prefetching: hardware-controlled and software-controlled; in the software case, prefetch instructions are inserted by the compiler;
- a binding prefetch means that the value of the prefetched data is bound at the time of the prefetch, i.e., when the process later reads the variable through a regular read, it will see the value that the variable had when it was prefetched, even if the value has been modified since the prefetch; a nonbinding prefetch means that the value brought in by a prefetch instruction remains subject to updates and invalidations, i.e., remains visible to the coherence protocol;
- other important issues are what data to prefetch (analysis) and when to initiate prefetches (scheduling);
- trade-offs between hardware- and software-controlled prefetching:
  - coverage;
  - reducing unnecessary prefetches;
  - maximizing effectiveness;
- prefetching is especially effective for programs with predictable access patterns and good spatial locality, as in the sketch below;
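A minimal sketch of software-controlled, nonbinding prefetching over a regular access pattern, assuming the GCC/Clang __builtin_prefetch builtin; PREFETCH_DISTANCE is an illustrative tuning knob, not a prescribed value:

```c
#include <stddef.h>

/* Software-controlled prefetching over a large array.
 * __builtin_prefetch(addr, rw, locality) issues a nonbinding hint:
 * the line is brought toward the cache but remains subject to
 * coherence actions, so a later invalidation is still honored and
 * the regular read below always returns a coherent value.
 * PREFETCH_DISTANCE is illustrative: it should roughly cover the
 * memory latency divided by the work per iteration. */
#define PREFETCH_DISTANCE 16

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], /*rw=*/0, /*locality=*/1);
        sum += a[i];   /* ordinary read; hits in cache if the prefetch arrived in time */
    }
    return sum;
}
```

Scheduling is the hard part: too small a distance hides little latency, while too large a distance risks the line being evicted or invalidated before use, turning it into an unnecessary prefetch.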
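And the C11 sketch referenced in the consistency discussion above: a programming-level analogue of the SC-versus-RC contrast, not the hardware mechanism itself. Under a release-consistency-style discipline, ordinary accesses between synchronization points may be overlapped and reordered; ordering is enforced only at the acquire and release operations:

```c
#include <stdatomic.h>

/* Programming-level analogue of release consistency using C11 atomics.
 * The ordinary writes to data1 and data2 may overlap and complete out
 * of program order; the releasing store guarantees only that both are
 * visible before flag is seen as 1. Making every access a seq_cst
 * atomic would instead give SC-like, fully ordered behavior. */
int data1, data2;
atomic_int flag;

void producer(void)
{
    data1 = 1;    /* ordinary writes: free to overlap ... */
    data2 = 2;    /* ... with each other */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void)
{
    /* the acquire load pairs with the release store above */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;  /* spin until the producer's release */
    int sum = data1 + data2;   /* guaranteed to observe 1 and 2 */
    (void)sum;
}
```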
Multithreading in a shared address space
----------------------------------------
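A minimal software analogue of the idea this heading names, assuming POSIX threads: while one thread is stalled on long-latency memory accesses, the processor (or an SMT core) can run another ready thread, overlapping one thread's stalls with the other's independent work. The workload and sizes are illustrative only:

```c
#include <pthread.h>
#include <stdio.h>

/* Two threads traverse separate large arrays; the frequent cache
 * misses in one thread's traversal are the long-latency events that
 * the other thread's independent work can help hide. */
#define N (1 << 22)
static double a[N], b[N];

static void *worker(void *arg)
{
    const double *v = arg;
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += v[i];              /* frequent misses: latency to tolerate */
    printf("partial sum: %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, a);   /* independent work ... */
    pthread_create(&t2, NULL, worker, b);   /* ... to switch between */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```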