Chapter 8 Notes: Multiprocessors

SISD: Single instruction stream, single data stream
SIMD: Single instruction stream, multiple data streams
MISD: Multiple instruction streams, single data stream (not used)
MIMD: Multiple instruction streams, multiple data streams

UMA:  Uniform memory access via a centralized shared memory
DSM:  Distributed shared memory
NUMA: Non-uniform memory access (a.k.a. DSM)

Performance metrics:
  Communication bandwidth
  Communication latency
  Communication latency hiding

Shared memory vs. message passing:
  Shared memory:
    Already implemented in SMP multiprocessors
    Ease of programming / compiler support
    Low overhead for communication
    Hardware-controlled caching
  Message passing:
    Simpler hardware
    Explicit communication

Amdahl's Law:

                                 1
  Speedup = ----------------------------------------------
             Fraction_enhanced
             -----------------  +  (1 - Fraction_enhanced)
             Speedup_enhanced

  DO EXAMPLE ON PAGE 644

(Skipped applications section 8.2)

Coherence:
  Defines what values can be returned by a read.  A memory system is
  coherent if any read of a data item returns the most recently written
  value of that data item.  (Always involves the same memory location.)

  Coherence conditions, by example:
    1) A: mem[X] = 5; a = mem[X];   // a must be 5
       B: // does nothing
    2) A: mem[X] = 5;
       B: // wait a little
          a = mem[X];               // a must be 5
    3) A: mem[X] = 5;
       B: mem[X] = 6;
       Processor C may see 5 then 6, or 6 then 5, but the order cannot
       change afterward, and all other processors must see the same
       final value.

  In other words:
    1) Program order is preserved (a processor's read returns its own
       latest write).
    2) A write becomes visible to other processors within a finite
       (sufficiently long) time.
    3) Write serialization: writes to the same location are seen in the
       same order by all processors.

Enforcing coherence:
  Protocols:
    Directory based: sharing status stored in one location (the directory)
    Snooping: every cache monitors a shared bus

  Write invalidation: invalidate all other copies on a write.
    Enforces write serialization.
  Write update/broadcast: broadcast the written data to all nodes.

  Tradeoffs:
    Multiple writes to the same block require multiple broadcasts with
      update, but only one invalidation with invalidate.
    Invalidations operate on entire cache blocks, while broadcasts must
      happen for each written word in the block.  (Fixable with WB)
    The delay between a write and a read by another processor is lower
      with broadcast.
    Invalidation is favored in general.

(SKIPPED MESI implementation and performance results)

Distributed shared memory:
  Example: Cray T3D

Directory-based cache-coherence protocols:
  Cache block info stored at the home node.
  States: Shared, Uncached, Exclusive

Synchronization:
  Instruction level: atomic exchange, test-and-set,
    load-linked / store-conditional
  Spin locks
  Barriers
  Exponential back-off to eliminate contention
  Barriers: use combining trees to minimize contention
  Queuing locks
  Fetch-and-increment
  (Sketches of a simple spin lock and barrier follow this section.)
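The exchange/test-and-set spin lock with exponential back-off noted above can be sketched with C11 atomics.  This is a minimal illustration under my own assumptions, not code from the text: spin_lock_t, the back-off cap, and the use of sched_yield() as the delay are all made up here.

```c
/* Sketch: test-and-set spin lock with exponential back-off (C11 atomics).
 * Illustrative only; spin_lock_t and the back-off constants are assumed. */
#include <stdatomic.h>
#include <sched.h>                     /* sched_yield() */

typedef struct {
    atomic_flag held;                  /* clear = free, set = held */
} spin_lock_t;

#define SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

void spin_lock(spin_lock_t *l)
{
    unsigned delay = 1;                /* back-off, in yield iterations */

    /* atomic_flag_test_and_set is the atomic exchange the notes mention:
     * it sets the flag and returns the previous value in one atomic step. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        for (unsigned i = 0; i < delay; i++)
            sched_yield();             /* wait before retrying */
        if (delay < 1024)
            delay *= 2;                /* exponential back-off reduces bus/coherence contention */
    }
}

void spin_unlock(spin_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Each failed test-and-set is a write attempt that generates coherence traffic, which is exactly the contention the exponential back-off (and the queuing locks mentioned above) tries to reduce.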
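Likewise, fetch-and-increment can be used to build a barrier.  Below is a minimal sense-reversing barrier sketch using C11 atomic_fetch_add; barrier_t, NPROCS, and the simple spin-wait are assumptions for illustration (the combining-tree barrier in the notes would distribute this single counter to cut contention).

```c
/* Sketch: sense-reversing barrier built on fetch-and-increment
 * (C11 atomic_fetch_add).  barrier_t and NPROCS are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>

#define NPROCS 4                       /* assumed number of participants */

typedef struct {
    atomic_int  count;                 /* how many threads have arrived */
    atomic_bool sense;                 /* global sense, flips each episode */
} barrier_t;                           /* initialize count = 0, sense = false */

/* local_sense is per-thread and must start equal to the initial global sense. */
void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;                        /* flip private sense */
    if (atomic_fetch_add(&b->count, 1) == NPROCS - 1) {  /* last to arrive */
        atomic_store(&b->count, 0);                      /* reset for next episode */
        atomic_store(&b->sense, *local_sense);           /* release the waiters */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                            /* spin on global sense */
    }
}
```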
Consistency:
  Defines the behavior of reads and writes to *different* memory
  locations: when must a processor see a value that has been updated by
  another processor?

    A              B
    a = 0;         b = 0;
    ...            ...
    a = 1;         b = 1;
    if (b == 0)    if (a == 0)
      blah           blah

  Both ifs cannot be taken under sequential consistency (SC) rules.
  The basic way to enforce SC is to not allow a write to complete until
  all copies have been invalidated.

  Relaxed consistency models: REVIEW PAGE 717
    Sequential consistency (complete consistency; all orderings kept):
      R-R, R-W, W-R, W-W, S-W, S-R, R-S, W-S, S-S
    Total store order / processor consistency:
      R-R, R-W, W-W, S-W, S-R, R-S, W-S, S-S
      Relaxes a write followed by a read to a different address;
      requires synchronization to order the write before a later read.
    Partial store order:
      R-R, R-W, S-W, S-R, R-S, W-S, S-S
      Also relaxes write-to-write ordering.
    Weak ordering:
      No ordering among ordinary reads and writes; ordering is enforced
      only around synchronization (lock) operations.
    Release consistency:
      No ordinary read/write ordering; reads and writes after an
      acquire wait for the acquire, and a release waits for all earlier
      reads and writes.
  TSO is implemented by letting reads bypass buffered writes: a read
  does not need to wait for an earlier write to complete.

Memory system issues:
  Multiple caches in a processor:
    Inclusion: is L1 shadowed (contained) in L2?
      Separates coherence traffic from processor traffic: L2 can filter
      snoops so they do not disturb the processor's L1 accesses.
  Prefetching:
    Non-binding vs. binding prefetching.  Want non-binding in case
    another processor writes between the prefetch and the read; a
    binding prefetch would give you the old value.  (A small sketch
    appears at the end of these notes.)
  Virtual memory could be used to implement shared memory across a
  cluster of workstations.

SKIP SGI CHALLENGE MULTIPROCESSOR.

Fallacies and pitfalls:
  Measuring the performance of multiprocessors by linear speedup
    versus execution time.
  Amdahl's Law doesn't apply to parallel computers.
  You need fast sequential processors to get fast parallel machines.
  Multiprocessors are "free".
  Scalability is free.
  Neglecting data distribution.
  Linear speedups are needed to make multiprocessors cost-effective.
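To make the binding vs. non-binding prefetch distinction concrete, here is a small sketch using GCC/Clang's __builtin_prefetch, which is a non-binding hint.  The loop, the prefetch distance of 8 elements, and the function name are my own illustrative assumptions, not from the text.

```c
/* Sketch: non-binding prefetch in a loop (GCC/Clang __builtin_prefetch).
 * The prefetch distance of 8 elements is an arbitrary illustrative choice. */
void scale(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++) {
        /* Non-binding: only moves the block toward the cache.  If another
         * processor writes b[i+8] between now and the actual read, the
         * coherence protocol still delivers the new value. */
        if (i + 8 < n)
            __builtin_prefetch(&b[i + 8], 0 /* read */, 1 /* low locality */);

        /* A binding prefetch would instead load the value into a register
         * here; a later remote write would then be missed and a stale
         * value used. */
        a[i] = 2.0 * b[i];
    }
}
```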