Memory Coherence in Shared Virtual Memory Systems --- Li et al., 1989 Summary by David Wagner * Problem: manage virtual memory on a loosely coupled multiprocessor (where each processor has its own physical memory); each processor must see a coherent view of the same virtual address space. * Motivation: we'll get process migration, RPC for free. More intuitive programming model than message-passing. * Algorithms: o Page synchronization schemes: invalidation or write-broadcast. Write-broadcast needs special hardware support that's not currently available, so paper uses invalidation. o Page ownership: fixed (static) or dynamic. Statically chosen page ownership makes writes too expensive, so paper uses dynamic page ownership. o ``Page-ownership meta-data'' location: centralized, fixed distributed, or dynamic distributed. (Data is paged between CPUs according to usage patterns, so everyone must have a way to find the page location for fault resolution.) o Centralized algorithm: like a monitor; pages do not have fixed owners, and there is only one manager CPU that knows who the owner is for all pages. Centralized manager serializes page accesses. Disadvantage is that centralized manager becomes a bottleneck. o Improved centralized algorithm: let individual client CPUs handle serialization. This is a partial win, but only because it moves a bit closer to a distributed solution. o Fixed distributed algorithm: hash function to determine CPU that knows where a page is. Disadvantage is that this distribution function must be the same for all applications and page usage patterns. o Broadcast dynamic distributed algorithm: broadcast to find CPU that knows where a page is. Simple, but requires too many broadcasts. Doesn't scale. o True dynamic distributed algorithm: everyone knows the next-hop of a route to the CPU that knows where a page is ("probOwner" field in paper). Do path-compression (a la set union find algorithm, ...). Still broadcasts for invalidation after write faults. o Improvements: Force a broadcast after every M page faults (then everyone will know the true owner of the page, and paths get compressed to length 1); tune M for best performance. Distributed copy sets: instead of broadcasting on invalidation, send invalidate along reverse direction of all "probOwner" links; no broadcasts ever needed. * Implementation experience: o Implemented on a 8-CPU multiprocessor. o Benchmarked with four plausible parallel programs. Performance good when communication is not too frequent; also good when usage pattern exhibits high degree of locality and little contention. Dynamic distributed algorithm a big winner; forwarding is infrequent. Performance not compared to message-passing implementation of problems; only compared to uniprocessor. o "Super-linear speedup" (heh!): performance for N processors on highly parallelizable problem can be more than N times as fast as for one processor. This paradox arises because the parallel machine has N times as much physical memory-- you're comparing apples and oranges.