Chapter 8 Notes: Multiprocessors

SISD: Single instruction stream, single data stream
SIMD: Single instruction stream, multiple data streams
MISD: Multiple instruction streams, single data stream (not used)
MIMD: Multiple instruction streams, multiple data streams

UMA:  Uniform memory access via a centralized shared memory
DSM:  Distributed shared memory
NUMA: Non-uniform memory access (a.k.a. DSM)

Performance metrics:
  Communication bandwidth
  Communication latency
  Communication latency hiding

Shared memory vs. message passing:
  Shared memory:
    Already implemented in SMP multiprocessors
    Ease of programming / compiler support
    Low overhead for communication
    Hardware-controlled caching
  Message passing:
    Simpler hardware
    Explicit communication

Amdahl's Law:

                                 1
  Speedup = ----------------------------------------------
             Fraction_enhanced
             -----------------  +  (1 - Fraction_enhanced)
             Speedup_enhanced

  DO EXAMPLE ON PAGE 644

(Skipped applications section 8.2)

Coherence:
  Defines what values can be returned by a read.  A memory system is
  coherent if any read of a data item returns the most recently written
  value of that data item.  (Always involves the same memory location.)

  Coherence conditions, by example:
    1) A: mem[X] = 5; a = mem[X];   // a must be 5
       B: // does nothing
    2) A: mem[X] = 5;
       B: // wait a little
          a = mem[X];               // a must be 5
    3) A: mem[X] = 5;
       B: mem[X] = 6;
       Processor C may see 5 then 6, or 6 then 5, but the order cannot
       change afterward, and all other processors must see the same
       final value.

  In other words:
    1) Program order is preserved (a processor's read returns its own
       latest write).
    2) A write becomes visible to other processors within a finite
       (sufficiently long) time.
    3) Write serialization: writes to the same location are seen in the
       same order by all processors.

Enforcing coherence:
  Protocols:
    Directory based: sharing status stored in one location (the directory)
    Snooping: every cache monitors a shared bus

  Write invalidation: invalidate all other copies on a write.
    Enforces write serialization.
  Write update/broadcast: broadcast the written data to all nodes.

  Tradeoffs:
    Multiple writes to the same block require multiple broadcasts with
      update, but only one invalidation with invalidate.
    Invalidations operate on entire cache blocks, while broadcasts must
      happen for each written word in the block.  (Fixable with WB)
    The delay between a write and a read by another processor is lower
      with broadcast.
    Invalidation is favored in general.

(SKIPPED MESI implementation and performance results)

Distributed shared memory:
  Example: Cray T3D

Directory-based cache-coherence protocols:
  Cache block info stored at the home node.
  States: Shared, Uncached, Exclusive

Synchronization:
  Instruction level: atomic exchange, test-and-set,
    load-linked / store-conditional
  Spin locks
  Barriers
  Exponential back-off to eliminate contention
  Barriers: use combining trees to minimize contention
  Queuing locks
  Fetch-and-increment
  (Sketches of a simple spin lock and barrier follow this section.)
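The exchange/test-and-set spin lock with exponential back-off noted above can be sketched with C11 atomics.  This is a minimal illustration under my own assumptions, not code from the text: spin_lock_t, the back-off cap, and the use of sched_yield() as the delay are all made up here.

```c
/* Sketch: test-and-set spin lock with exponential back-off (C11 atomics).
 * Illustrative only; spin_lock_t and the back-off constants are assumed. */
#include <stdatomic.h>
#include <sched.h>                     /* sched_yield() */

typedef struct {
    atomic_flag held;                  /* clear = free, set = held */
} spin_lock_t;

#define SPIN_LOCK_INIT { ATOMIC_FLAG_INIT }

void spin_lock(spin_lock_t *l)
{
    unsigned delay = 1;                /* back-off, in yield iterations */

    /* atomic_flag_test_and_set is the atomic exchange the notes mention:
     * it sets the flag and returns the previous value in one atomic step. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        for (unsigned i = 0; i < delay; i++)
            sched_yield();             /* wait before retrying */
        if (delay < 1024)
            delay *= 2;                /* exponential back-off reduces bus/coherence contention */
    }
}

void spin_unlock(spin_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Each failed test-and-set is a write attempt that generates coherence traffic, which is exactly the contention the exponential back-off (and the queuing locks mentioned above) tries to reduce.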
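Likewise, fetch-and-increment can be used to build a barrier.  Below is a minimal sense-reversing barrier sketch using C11 atomic_fetch_add; barrier_t, NPROCS, and the simple spin-wait are assumptions for illustration (the combining-tree barrier in the notes would distribute this single counter to cut contention).

```c
/* Sketch: sense-reversing barrier built on fetch-and-increment
 * (C11 atomic_fetch_add).  barrier_t and NPROCS are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>

#define NPROCS 4                       /* assumed number of participants */

typedef struct {
    atomic_int  count;                 /* how many threads have arrived */
    atomic_bool sense;                 /* global sense, flips each episode */
} barrier_t;                           /* initialize count = 0, sense = false */

/* local_sense is per-thread and must start equal to the initial global sense. */
void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;                        /* flip private sense */
    if (atomic_fetch_add(&b->count, 1) == NPROCS - 1) {  /* last to arrive */
        atomic_store(&b->count, 0);                      /* reset for next episode */
        atomic_store(&b->sense, *local_sense);           /* release the waiters */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                            /* spin on global sense */
    }
}
```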
Consistency:
  Defines the behavior of reads and writes to *different* memory
  locations: when must a processor see a value that has been updated by
  another processor?

    A              B
    a = 0;         b = 0;
    ...            ...
    a = 1;         b = 1;
    if (b == 0)    if (a == 0)
      blah           blah

  Both ifs cannot be taken under sequential consistency (SC) rules.
  The basic way to enforce SC is to not allow a write to complete until
  all copies have been invalidated.

  Relaxed consistency models: REVIEW PAGE 717
    Sequential consistency (complete consistency; all orderings kept):
      R-R, R-W, W-R, W-W, S-W, S-R, R-S, W-S, S-S
    Total store order / processor consistency:
      R-R, R-W, W-W, S-W, S-R, R-S, W-S, S-S
      Relaxes a write followed by a read to a different address;
      requires synchronization to order the write before a later read.
    Partial store order:
      R-R, R-W, S-W, S-R, R-S, W-S, S-S
      Also relaxes write-to-write ordering.
    Weak ordering:
      No ordering among ordinary reads and writes; ordering is enforced
      only around synchronization (lock) operations.
    Release consistency:
      No ordinary read/write ordering; reads and writes after an
      acquire wait for the acquire, and a release waits for all earlier
      reads and writes.
  TSO is implemented by letting reads bypass buffered writes: a read
  does not need to wait for an earlier write to complete.

Memory system issues:
  Multiple caches in a processor:
    Inclusion: is L1 shadowed (contained) in L2?
      Separates coherence traffic from processor traffic: L2 can filter
      snoops so they do not disturb the processor's L1 accesses.
  Prefetching:
    Non-binding vs. binding prefetching.  Want non-binding in case
    another processor writes between the prefetch and the read; a
    binding prefetch would give you the old value.  (A small sketch
    appears at the end of these notes.)
  Virtual memory could be used to implement shared memory across a
  cluster of workstations.

SKIP SGI CHALLENGE MULTIPROCESSOR.

Fallacies and pitfalls:
  Measuring the performance of multiprocessors by linear speedup
    versus execution time.
  Amdahl's Law doesn't apply to parallel computers.
  You need fast sequential processors to get fast parallel machines.
  Multiprocessors are "free".
  Scalability is free.
  Neglecting data distribution.
  Linear speedups are needed to make multiprocessors cost-effective.
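To make the binding vs. non-binding prefetch distinction concrete, here is a small sketch using GCC/Clang's __builtin_prefetch, which is a non-binding hint.  The loop, the prefetch distance of 8 elements, and the function name are my own illustrative assumptions, not from the text.

```c
/* Sketch: non-binding prefetch in a loop (GCC/Clang __builtin_prefetch).
 * The prefetch distance of 8 elements is an arbitrary illustrative choice. */
void scale(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++) {
        /* Non-binding: only moves the block toward the cache.  If another
         * processor writes b[i+8] between now and the actual read, the
         * coherence protocol still delivers the new value. */
        if (i + 8 < n)
            __builtin_prefetch(&b[i + 8], 0 /* read */, 1 /* low locality */);

        /* A binding prefetch would instead load the value into a register
         * here; a later remote write would then be missed and a stale
         * value used. */
        a[i] = 2.0 * b[i];
    }
}
```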