Chapter 11 Latency Tolerance

Three ways to reduce effective latency:
1. Reduce the access time to each level of the extended memory hierarchy.
2. Structure the system to reduce the frequency of high-latency accesses. Replication, caches.
3. Structure the application to reduce the frequency of high-latency accesses. Good partitioning.

Latency tolerance: hide latency by overlapping it with computation.

4 approaches to latency tolerance:
1. Block data transfer -- make messages longer.
2. Precommunication -- prefetching.
3. Proceeding past an outstanding communication event in the same thread.
4. Multithreading.

3 major types of overlap that can be exploited by latency tolerance schemes:
1. Within the communication pipeline: transmit several words at a time through the communication fabric.
2. Communication with communication: simultaneous communications with multiple nodes can be overlapped.
3. Computation with communication: the CPU works while the communication processor (CP) takes care of the communication.

Block data transfer:
Advantages:
1. The receiver only sees the propagation latency of the first word sent -- a new word arrives with every subsequent network cycle, as opposed to waiting for an ack before sending the next word.
2. Larger messages amortize the overhead associated with each message.
3. Larger messages may require only one ack.

Precommunication:
Essentially prefetching. Requires long-latency events to be predictable. Must not stall the processor.

Proceeding past communication in the same thread:
Basically non-blocking communication.

Multithreading:
Instead of continuing to execute the same thread after a communication, switch threads.

Tolerating latency requires:
1. Extra parallelism.
2. Increased bandwidth -- multiple outstanding, overlapping communications.
3. More sophisticated hardware and protocols.

Limitations of latency tolerance:
1. Application limitations -- limited amount of truly parallel code.
2. Communication architecture limitations -- the communication architecture may limit the amount of outstanding and overlapping communication. This can be further subdivided into problems involving overhead, assist occupancy, point-to-point bandwidth, and network capacity.
3. Processor limitations -- non-blocking reads/writes, etc.

LATENCY TOLERANCE IN EXPLICIT MESSAGE PASSING
Mainly sender-initiated communication. Evaluate using the four main approaches to latency tolerance:
1. Block data transfer -- make messages large to amortize overhead.
2. Precommunication -- there is high pressure on the receiver's buffers because the receiver may not use the messages for a while. To solve this problem, intersperse computation between the precommunications.
3. Proceeding past communication in the same thread -- use asynchronous messages (see the sketch after this list).
4. Multithreading -- context switch after a send/receive.
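The asynchronous-message option above can be sketched with the standard MPI non-blocking calls. This is only an illustration of "proceeding past communication in the same thread"; the buffer size, tag, and the do_independent_work() routine are made-up placeholders, not anything from the chapter.

    /* Sketch: overlap computation with communication using non-blocking MPI.
       The tag, buffer size, and do_independent_work() are illustrative only. */
    #include <mpi.h>

    #define N 1024

    static void do_independent_work(double *local, int n)
    {
        /* Placeholder for computation that does not touch recvbuf. */
        for (int i = 0; i < n; i++)
            local[i] = local[i] * 2.0 + 1.0;
    }

    void exchange_and_compute(double *sendbuf, double *recvbuf,
                              double *local, int peer)
    {
        MPI_Request reqs[2];

        /* Post the receive and the send, then return immediately
           instead of blocking the thread on the network. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Proceed past the outstanding communication in the same thread:
           do work that does not depend on recvbuf. */
        do_independent_work(local, N);

        /* Only block when the received data is actually needed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

The key point is that MPI_Isend/MPI_Irecv return before the transfer completes, so the thread only blocks in MPI_Waitall, when the data is actually needed.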
LATENCY TOLERANCE IN A SHARED ADDRESS SPACE
Mainly receiver-initiated communication. Without caching of shared data, there may also be sender-initiated communication via writes to remote memories.

BLOCK DATA TRANSFER IN A SHARED ADDRESS SPACE

MULTITHREADING IN A SHARED ADDRESS SPACE
Advantages over other approaches to latency tolerance:
1. Requires no special software analysis or support, other than having a programming model with explicit threads.
2. It is invoked dynamically, so it can handle unpredictable situations like cache misses.
3. Other techniques primarily target memory-access latency. Multithreading can handle any long-latency event, such as instruction latency (fp divide).
4. It does not change the memory consistency model, since it does not reorder the memory operations within a thread.

Unfortunately, it is not used in practice:
1. Requires substantial changes to the processor.
2. Not as effective for uniprocessor and desktop systems.

Context: the state that must be saved when switching between threads. Includes the processor registers, the program counter, the stack pointer, and the per-process parts of the processor status word (condition codes).

Two general approaches to multithreading, depending on when a context switch is made:
1. Blocked -- a thread runs continuously until it blocks on a high-latency event. A new thread is then switched in. Similar to what happens in modern OSs.
2. Interleaved -- switch threads on every processor cycle.

Blocked multithreading:
Hardware support involves maintaining multiple hardware register files and program counters for different threads. An active thread is a thread that is assigned one of these hardware copies. The number of active threads may be less than the number of ready threads, which are threads that are ready to run.

When an instruction stalls in the pipeline and we do a context switch, we must do something with the instructions already in the pipeline. Three choices:
1. Allow all instructions in the pipeline to complete while fetching instructions from the new thread. This is difficult, because instructions from multiple contexts will be in the pipeline at the same time, so the interlock control must be modified. Also, the instructions already in the pipeline may depend on the stalled data and thus completely stall the pipeline.
2. Allow all instructions in the pipeline to complete before fetching instructions from the new thread. Again, instructions already in the pipeline may depend on the stalled instruction and halt the pipeline.
3. Nullify the instructions in the pipeline and start fetching from the new thread. Easy to implement and the preferred choice. However, this scheme raises the context switch cost.

How to make the thread switch:
1. Synchronization events are followed by explicit context switch instructions.
2. Long pipeline-latency instructions, such as fp divide, are followed by a context switch instruction.

Usually 4-8 active thread contexts are enough.

Implementation issues for the blocked scheme:
1. State replication:
   a.) Shared large register file: a cache-like structure, indexed by context identifiers and register offsets.
   b.) Context status word: a global status word that specifies which context is currently running, whether switching is enabled, and a bit vector that indicates which active contexts are ready to run.
2. PC:
   a.) The PC chain holds the PCs for the instructions in the pipeline.
   b.) The exception PC (EPC) holds the address of the last instruction returned from the pipeline.
      i.) On a fault, all instructions in the pipeline are nullified and a handler is invoked. Upon return, the EPC points to the faulting instruction that must be re-executed.
      ii.) One EPC per active context.
3. Control:
   a.) Detect when to switch and do the switch: cache miss, explicit context switch, time-out (end of quantum).
   b.) Round-robin thread scheduling.
   c.) On a context switch, complete the following actions (see the sketch after this list):
      i.) Save the PC.
      ii.) Nullify all incomplete instructions in the pipeline.
      iii.) Start executing the new thread from its saved PC.
      iv.) Load the ASID into the TLB's register.
      v.) Load the control/status registers and the context identifier.
      vi.) Switch the register file to the new context's register file.
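The control actions above can be pictured as a small software model. This is a rough sketch only: the struct fields, the 4-context limit, and the round-robin search are illustrative assumptions, not a description of any real processor's blocked-multithreading hardware.

    /* Sketch: per-context hardware state and the switch actions listed above.
       Field names and the 4-context limit are illustrative assumptions. */
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ACTIVE_CONTEXTS 4    /* 4-8 active contexts are usually enough */
    #define NUM_REGS 32

    typedef struct {
        uint64_t pc;                 /* saved program counter                  */
        uint64_t regs[NUM_REGS];     /* per-context register file segment      */
        uint64_t status;             /* control/status word for this context   */
        uint16_t asid;               /* address-space ID loaded into the TLB   */
        bool     ready;              /* bit from the ready bit vector          */
    } hw_context;

    typedef struct {
        hw_context ctx[NUM_ACTIVE_CONTEXTS];
        int        current;          /* which context is currently running     */
        bool       switching_enabled;
    } context_status_word;

    /* On a cache miss, explicit switch instruction, or quantum time-out:
       save the PC, pick the next ready context round-robin, and resume it. */
    int blocked_context_switch(context_status_word *csw, uint64_t stalled_pc)
    {
        if (!csw->switching_enabled)
            return csw->current;

        csw->ctx[csw->current].pc = stalled_pc;   /* i.  save the PC            */
        /* ii. pipeline nullification happens in hardware, not modeled here     */

        for (int i = 1; i <= NUM_ACTIVE_CONTEXTS; i++) {
            int next = (csw->current + i) % NUM_ACTIVE_CONTEXTS;
            if (csw->ctx[next].ready) {
                csw->current = next;              /* iii-vi. resume from the    */
                return next;                      /* saved PC, ASID, status,    */
            }                                     /* and register file segment  */
        }
        return csw->current;   /* no other ready context: keep the current one */
    }

In real hardware these steps overlap with pipeline nullification; the model only shows the bookkeeping a context switch has to perform.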
Interleaved multithreading:
Also uses multiple register files. When a thread encounters a high-latency event, it is taken off the ready list until the event completes. Most interleaved schemes don't have caches, which means that a larger number of threads is needed to hide latency.

Advantages: no context switch overhead; no need to nullify instructions in the pipeline.
Disadvantages: typically higher hardware cost.

Technique: lookahead bits. Some interleaved machines do not have hardware interlocks to stall the pipeline in case of hazards. Instead, the compiler generates every instruction with a lookahead tag that specifies how far away the next dependent instruction is in the instruction stream. The machine then knows how many more instructions from the same thread can be executed before having to do a context switch.

Another approach is to provide a fully functional pipeline with hazard interlocks. If a cache miss occurs, all instructions *from that thread* are nullified, but instructions from other threads are allowed to continue. Less overhead than in the blocked case, where all instructions are nullified.

Implementation issues for the interleaved scheme:
1.) State replication:
   a.) Pressure is greater for fast access to different segments of the register file.
2.) PC:
   a.) The processor's PC chain must now carry a context ID for each pipeline stage.
   b.) To nullify instructions, the context ID must be broadcast to all pipeline stages.
   c.) The next PC can come from several places: the EPC, branch prediction, or a computed branch address.
   d.) How to handle exceptions: allow the exception handler thread to be interleaved with other threads, or nullify all instructions and only allow the exception handler to execute.
3.) Control:
   a.) Each context has an available signal indicating whether it can be scheduled on the next cycle.
   b.) Round-robin scheduling.

Integrating multithreading with multiple issue (SMT):
1.) Let available instructions from different threads be scheduled in the same cycle, filling issue slots more effectively.
2.) Instructions from different threads compete for issue slots in every cycle.
3.) Register renaming is important. For each instruction, we need to keep track of its context.

Lockup-free cache design:
1.) Need to keep the address, the type of request, the place to return the data, and the current status of the request (see the sketch below).
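That per-request bookkeeping is essentially a table of miss status holding registers (MSHRs). A minimal sketch, assuming a fixed-size table and illustrative field names:

    /* Sketch: miss status holding registers (MSHRs) for a lockup-free cache.
       Each entry records the address, the type of request, where to return the
       data, and the current status of the outstanding miss. Sizes are
       illustrative assumptions. */
    #include <stdint.h>

    #define NUM_MSHRS 8

    typedef enum { REQ_READ, REQ_WRITE, REQ_PREFETCH } req_type;
    typedef enum { MSHR_FREE, MSHR_PENDING, MSHR_DATA_RETURNED } mshr_status;

    typedef struct {
        uint64_t    block_addr;     /* address of the missing block            */
        req_type    type;           /* type of request                         */
        uint8_t     dest_reg;       /* place to return the data (register)     */
        uint8_t     context_id;     /* which thread/context issued the miss    */
        mshr_status status;         /* current status of the request           */
    } mshr_entry;

    static mshr_entry mshr_table[NUM_MSHRS];

    /* Allocate an entry for a new outstanding miss; return -1 if the cache
       must stall because all MSHRs are busy. */
    int mshr_allocate(uint64_t block_addr, req_type type,
                      uint8_t dest_reg, uint8_t context_id)
    {
        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshr_table[i].status == MSHR_FREE) {
                mshr_table[i] = (mshr_entry){ block_addr, type,
                                              dest_reg, context_id,
                                              MSHR_PENDING };
                return i;
            }
        }
        return -1;   /* no free MSHR: only now does the processor have to stall */
    }

The processor only stalls when mshr_allocate() returns -1, i.e., when the number of outstanding misses exceeds the table size, which is what makes the cache lockup-free.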