Memory Coherence in Shared Virtual Memory Systems
--- Li et al., 1989

Summary by David Wagner

   * Problem: manage virtual memory on a loosely coupled multiprocessor
     (where each processor has its own physical memory); each processor must
     see a coherent view of the same virtual address space.
   * Motivation: we'll get process migration, RPC for free. More intuitive
     programming model than message-passing.
   * Algorithms:
        o Page synchronization schemes: invalidation or write-broadcast.
          Write-broadcast needs special hardware support that's not
          currently available, so paper uses invalidation.
        o Page ownership: fixed (static) or dynamic. Statically chosen page
          ownership makes writes too expensive, so paper uses dynamic page
          ownership.
        o ``Page-ownership meta-data'' location: centralized, fixed
          distributed, or dynamic distributed. (Data is paged between CPUs
          according to usage patterns, so everyone must have a way to find
          the page location for fault resolution.)
        o Centralized algorithm: like a monitor; pages do not have fixed
          owners, and there is only one manager CPU that knows who the owner
          is for all pages. Centralized manager serializes page accesses.
          Disadvantage is that centralized manager becomes a bottleneck.
        o Improved centralized algorithm: let individual client CPUs handle
          serialization. This is a partial win, but only because it moves a
          bit closer to a distributed solution.
        o Fixed distributed algorithm: hash function to determine CPU that
          knows where a page is. Disadvantage is that this distribution
          function must be the same for all applications and page usage
          patterns.
        o Broadcast dynamic distributed algorithm: broadcast to find CPU
          that knows where a page is. Simple, but requires too many
          broadcasts. Doesn't scale.
        o True dynamic distributed algorithm: everyone knows the next-hop of
          a route to the CPU that knows where a page is ("probOwner" field
          in paper). Do path-compression (a la set union find algorithm,
          ...). Still broadcasts for invalidation after write faults.
        o Improvements: Force a broadcast after every M page faults (then
          everyone will know the true owner of the page, and paths get
          compressed to length 1); tune M for best performance. Distributed
          copy sets: instead of broadcasting on invalidation, send
          invalidate along reverse direction of all "probOwner" links; no
          broadcasts ever needed.
   * Implementation experience:
        o Implemented on a 8-CPU multiprocessor.
        o Benchmarked with four plausible parallel programs. Performance
          good when communication is not too frequent; also good when usage
          pattern exhibits high degree of locality and little contention.
          Dynamic distributed algorithm a big winner; forwarding is
          infrequent. Performance not compared to message-passing
          implementation of problems; only compared to uniprocessor.
        o "Super-linear speedup" (heh!): performance for N processors on
          highly parallelizable problem can be more than N times as fast as
          for one processor. This paradox arises because the parallel
          machine has N times as much physical memory-- you're comparing
          apples and oranges.