Chapter 11: Latency Tolerance
---

Latency tolerance in the communication pipeline
-----------------------------------------------
- four key approaches:
  - block data transfer: make individual messages larger so that they communicate more than a word and can be pipelined through the network (overlaps communication with communication);
  - precommunication: generating the communication before the point where the operation naturally appears in the program, so that it is partially or entirely completed before the data is actually needed; of course, the precommunication transaction itself must not stall the processor until it completes, or no overlap is achieved;
  - proceeding past communication in the same thread: the communication happens at the original time, but the processor is allowed to proceed past it and find other independent computation or communication that comes later in the same process or thread;
  - multithreading: similar to the previous case, except that the independent work is found by switching to another thread that has been scheduled to run on the same processor;
- how much latency can actually be hidden depends on many factors involving both the application and the architecture; relevant application characteristics include the structure of the communication and how much other work can be overlapped with it; architectural issues include how much of the endpoint processing is performed on the main processor versus on the assist, whether communication can be overlapped with computation, other communication, or both, how many messages involving a given processor may be outstanding at a time, and the occupancies of the assist and of the stages of the network pipeline;
- main limitations: application limitations, limitations of the communication architecture, and processor limitations;
- making the communication architecture efficient (high bandwidth, low endpoint overhead or occupancy) is very important for tolerating latency effectively;

Latency tolerance in a shared address space
-------------------------------------------
- techniques need to be hardware-supported to be effective;
- for block transfer to maintain even local coherence, the assist has some work to do for every cache block and becomes an integral part of the transfer pipeline; it can therefore become the bottleneck in transfer bandwidth, affecting even those blocks for which no interaction with the caches is necessary;
- performance advantages of using block transfer:
  - amortized per-message overhead;
  - pipelined transfer of large chunks of data;
  - less wasted bandwidth;
  - replication of transferred data in the destination main memory;
  - bundling of synchronization with data transfer;
- potential performance disadvantages of block transfer:
  - higher overhead per transfer;
  - increased contention;
  - extra work;

Proceeding past long-latency events
-----------------------------------
- in multiprocessors, proceeding past memory operations before they complete or commit violates the sufficient conditions for SC; whether or not it violates SC itself depends on whether the operations are allowed to become visible out of program order;
- the extent of overlap possible is thus determined by both the machine mechanisms and the consistency model adopted;
- SC can take advantage of write buffers, but the benefits are limited; PC/TSO allow only reads to complete before previous writes;
- therefore, the weaker the consistency model, the more reordering and overlap optimizations become legal (a short C11 sketch of this contrast appears after the prefetching notes below);
- one study assuming an RC model found that a substantial portion of read latency can indeed be hidden using a dynamically scheduled processor with speculative execution, and that the amount of read latency that can be hidden increases with the size of the reorder buffer;
- the most interesting question is whether, with aggressive, dynamically scheduled processors, RC still buys performance gains over SC at the hardware/software interface; results indicate that RC is still beneficial, even though the gap has closed substantially;
- without hardware prefetching and speculative reads, RC provides substantial advantages over SC even with a dynamically scheduled processor;

Precommunication in a shared address space
------------------------------------------
- especially interesting in a cache-coherent machine, since shared nonlocal data may be precommunicated directly into a processor's cache rather than into a special buffer, and since precommunication interacts with the cache coherence protocol;
- two categories of prefetching: hardware-controlled and software-controlled; in the software case, prefetch instructions are inserted by the compiler;
- a binding prefetch means that the value of the prefetched data is bound at the time of the prefetch, i.e., when the process later reads the variable through a regular read, it will see the value that the variable had when it was prefetched, even if the value has been modified since the prefetch; a nonbinding prefetch means that the value brought in by a prefetch instruction remains subject to updates and invalidations, i.e., remains visible to the coherence protocol;
- other important issues are what data to prefetch (analysis) and when to initiate prefetches (scheduling);
- trade-offs between hardware- and software-controlled prefetching:
  - coverage;
  - reducing unnecessary prefetches;
  - maximizing effectiveness;
- prefetching is especially effective for programs with predictable access patterns and good spatial locality, as in the sketch below;
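A minimal sketch of software-controlled, nonbinding prefetching over a regular access pattern, assuming the GCC/Clang __builtin_prefetch builtin; PREFETCH_DISTANCE is an illustrative tuning knob, not a prescribed value:

```c
#include <stddef.h>

/* Software-controlled prefetching over a large array.
 * __builtin_prefetch(addr, rw, locality) issues a nonbinding hint:
 * the line is brought toward the cache but remains subject to
 * coherence actions, so a later invalidation is still honored and
 * the regular read below always returns a coherent value.
 * PREFETCH_DISTANCE is illustrative: it should roughly cover the
 * memory latency divided by the work per iteration. */
#define PREFETCH_DISTANCE 16

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], /*rw=*/0, /*locality=*/1);
        sum += a[i];   /* ordinary read; hits in cache if the prefetch arrived in time */
    }
    return sum;
}
```

Scheduling is the hard part: too small a distance hides little latency, while too large a distance risks the line being evicted or invalidated before use, turning it into an unnecessary prefetch.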
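And the C11 sketch referenced in the consistency discussion above: a programming-level analogue of the SC-versus-RC contrast, not the hardware mechanism itself. Under a release-consistency-style discipline, ordinary accesses between synchronization points may be overlapped and reordered; ordering is enforced only at the acquire and release operations:

```c
#include <stdatomic.h>

/* Programming-level analogue of release consistency using C11 atomics.
 * The ordinary writes to data1 and data2 may overlap and complete out
 * of program order; the releasing store guarantees only that both are
 * visible before flag is seen as 1. Making every access a seq_cst
 * atomic would instead give SC-like, fully ordered behavior. */
int data1, data2;
atomic_int flag;

void producer(void)
{
    data1 = 1;    /* ordinary writes: free to overlap ... */
    data2 = 2;    /* ... with each other */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void consumer(void)
{
    /* the acquire load pairs with the release store above */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;  /* spin until the producer's release */
    int sum = data1 + data2;   /* guaranteed to observe 1 and 2 */
    (void)sum;
}
```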
Multithreading in a shared address space
----------------------------------------
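A minimal software analogue of the idea this heading names, assuming POSIX threads: while one thread is stalled on long-latency memory accesses, the processor (or an SMT core) can run another ready thread, overlapping one thread's stalls with the other's independent work. The workload and sizes are illustrative only:

```c
#include <pthread.h>
#include <stdio.h>

/* Two threads traverse separate large arrays; the frequent cache
 * misses in one thread's traversal are the long-latency events that
 * the other thread's independent work can help hide. */
#define N (1 << 22)
static double a[N], b[N];

static void *worker(void *arg)
{
    const double *v = arg;
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += v[i];              /* frequent misses: latency to tolerate */
    printf("partial sum: %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, a);   /* independent work ... */
    pthread_create(&t2, NULL, worker, b);   /* ... to switch between */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```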