Chapter 5: Shared Memory Multiprocessors
----------------------------------------
- symmetric multiprocessor (SMP): global physical address space and symmetric access to all of main memory;
- organization of the memory hierarchy:
  - shared cache: one cache for all processors. Used for small numbers of processors (2-8);
  - bus-based shared memory: each processor has its own cache and accesses main memory through a common bus. Widely used for small- to medium-scale machines of up to 20-30 processors. It is the dominant architecture in the market today. The bus determines the scaling limit;
  - dancehall: similar to bus-based shared memory, but memory is divided into multiple banks;
  - distributed memory: each processor has its own cache and a local portion of main memory (NUMA). Despite its asymmetric memory access time, this model improves over dancehall if locality is exploited;
- cache coherence: pervasive and performance critical. A read operation should return the last value written, independently of which processor executed the write;
- cache coherence problems (CHPs) occur with both write-through and write-back caches (worst with WB caches);
- CHPs arise even in uniprocessors when DMA is used to perform I/O operations; one solution is to mark the segments of VM used for DMA as uncacheable; another option is to flush the cache before DMA takes place;
- virtually all microprocessors today provide support for cache coherence;
- the appearance of a total, serial order on the operations to a memory location is what we expect from any coherent memory system;
- a multiprocessor memory system is coherent if:
  - operations issued by any particular process occur in the order in which they were issued to the memory system by that process;
  - the value returned by each read operation is the value written by the last write to that location in the serial order;
- write propagation: writes become visible to other processors;
- write serialization: all processors see the writes to a location in the same order;
- cache coherence through bus snooping:
  - processors can "snoop" on the bus and be aware of all memory operations;
  - a snooping cache controller may take action if a bus transaction is relevant to its state;
  - the operations are seen by all processors in the same (bus) order;
- easiest way to provide snooping: use write-through caches (all writes generate bus transactions);
- invalidation-based vs. update-based protocols;
- a bus transaction consists of three phases: arbitration, command/address, and data;
- arbitration: the bus arbiter selects one of the requesting processors and thereby creates the serial order of transactions;
- every cache block has a state associated with it (valid, invalid, dirty, etc.);
- in multiprocessor machines, a cache block has one state in each processor's cache;
- to show that a protocol provides coherence, we need to show that for any execution under the protocol a total order on the memory operations to each location can be constructed that satisfies the program order and write serialization conditions;
- however, using write-through caches is not very realistic, as most multiprocessors use write-back caches to enhance performance;
* coherence says nothing about when a write will become visible. For example, suppose program P1 writes both variables a and b. Cache coherence does not guarantee that the new value of a will be seen by other processors before the new value of b. Coherence also says nothing about the order in which the reads issued by P2 to different locations are performed with respect to P1;
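- a minimal C sketch (my illustration, not from the chapter) of this last point; C11 relaxed atomics roughly model "coherence only": each location gets a total order of writes, but writes to different locations are not ordered with respect to each other, so P2 may observe the new b and still the old a:

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      atomic_int a = 0, b = 0;

      void *p1(void *arg) {
          atomic_store_explicit(&a, 1, memory_order_relaxed);  /* write a first */
          atomic_store_explicit(&b, 1, memory_order_relaxed);  /* then write b  */
          return NULL;
      }

      void *p2(void *arg) {
          int rb = atomic_load_explicit(&b, memory_order_relaxed);
          int ra = atomic_load_explicit(&a, memory_order_relaxed);
          /* coherence alone allows rb == 1 && ra == 0: the new value of b
             became visible to P2 before the new value of a */
          printf("a=%d b=%d\n", ra, rb);
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, p1, NULL);
          pthread_create(&t2, NULL, p2, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }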
* a memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e., to become visible to the processors) with respect to one another. This includes operations to the same location or to different ones. In this sense, memory consistency subsumes coherence;
* a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program;
- conditions for sequential consistency (SC):
  - program order requirement: memory operations of a process must appear to become visible (to itself and to others) in program order;
  - the total order of operations must be consistent for all processes, which requires that the operations appear atomic;
- the tricky part of the second requirement is making writes atomic:
  - write atomicity: the position in the total order at which a write appears to perform should be the same with respect to all processes. Write atomicity extends the write serialization property needed for cache coherence (here the writes need not be to the same memory location);
- sequentially consistent execution: some interleaving (total order) of the operations yields the same result;
- sequentially consistent system: any possible execution is sequentially consistent;
* sufficient conditions for preserving SC:
  * every process issues memory operations in program order;
  * after a write operation is issued, the issuing process waits for the write to complete before issuing its next operation;
  * after a read operation is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor, the processor should wait until that write has performed with respect to all processors;
- the third condition is the most demanding when implementing SC;
- some compiler optimizations violate the three conditions for SC explained above. The problem in parallel programs is that the out-of-order processing of operations to different shared variables by one process can be detected by (and affect) other processes;
- preserving the sufficient conditions for SC in multiprocessors is quite a strong requirement, since it limits compiler reordering and out-of-order processing techniques;
* in this chapter, the authors assume that the compiler does not reorder memory operations;
- if a cache does not have exclusivity for a data block, it cannot write a new value into the block before first putting a transaction on the bus to communicate with the others;
- when a cache does not have exclusivity, a write operation is always treated as a write miss, in order to generate the required bus transaction;
- if a cache has the block in the modified (dirty) state, it is clearly the owner and it has exclusivity;
- a read-exclusive operation is used to read a block and acquire exclusivity at the same time;
- because of locality of reference, invalidation-based protocols have been found to be more robust;

A Three-State (MSI) Write-back Invalidation Protocol
----------------------------------------------------
- cache entry states: modified, shared, or invalid (MSI);
- a processor issues reads (PrRd) and writes (PrWr). The bus allows the transactions bus read (BusRd), bus read exclusive (BusRdX), and bus write back (BusWB);
- BusRd: generated by a PrRd that misses in the cache;
- BusRdX: generated by a PrWr to a block that is either not in the cache, or is in the cache but not in the modified state;
- BusWB: generated by the controller on a write back;
- if a cache has the block in the modified state (owner) and observes a BusRd transaction on the bus, it flushes the data onto the bus, in lieu of memory, and demotes its copy of the block to the shared state. The memory and the requesting cache both pick up the block;
- a write miss generates a read-exclusive bus transaction, which causes the other copies to be invalidated;
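- a simplified sketch (my own C rendering, not the book's state-transition tables) of the MSI controller's behavior for one block; bus transactions are just printed, and data transfers appear only as comments:

      #include <stdio.h>

      typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

      /* processor-side events for one cache block */
      msi_state_t pr_read(msi_state_t s) {
          if (s == INVALID) {
              printf("BusRd\n");      /* read miss: fetch the block */
              return SHARED;
          }
          return s;                   /* SHARED or MODIFIED: hit, no bus traffic */
      }

      msi_state_t pr_write(msi_state_t s) {
          if (s != MODIFIED)
              printf("BusRdX\n");     /* acquire exclusivity; other copies invalidate */
          return MODIFIED;
      }

      /* bus-side (snooped) events for the same block */
      msi_state_t bus_rd(msi_state_t s) {
          if (s == MODIFIED) {
              printf("Flush\n");      /* owner supplies the data in lieu of memory */
              return SHARED;          /* demote to shared */
          }
          return s;
      }

      msi_state_t bus_rdx(msi_state_t s) {
          if (s == MODIFIED)
              printf("Flush\n");      /* supply dirty data before giving up the block */
          return INVALID;             /* another cache wants exclusivity */
      }

      int main(void) {
          msi_state_t s = INVALID;
          s = pr_read(s);             /* BusRd,  INVALID  -> SHARED   */
          s = pr_write(s);            /* BusRdX, SHARED   -> MODIFIED */
          s = bus_rd(s);              /* Flush,  MODIFIED -> SHARED   */
          return 0;
      }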
A Four-State (MESI) Write-back Invalidation Protocol
----------------------------------------------------
- a fourth state, exclusive-clean (or simply exclusive), is introduced relative to MSI; it indicates an intermediate level between shared and modified: it implies exclusivity but not ownership (memory has a valid copy), and in this state the cache need not respond to requests for the block;
- when a block is first read, it enters the cache in the shared (S) or exclusive (E) state, depending on whether another cache also has a copy of the block;
- this requires an additional wired-OR bus line, the shared signal, which other caches assert if they hold a copy; it indicates whether any other cache currently holds the data;

A Four-State (Dragon) Write-back Update Protocol
------------------------------------------------
- proposed by researchers at Xerox PARC for their Dragon machine;
- this is an UPDATE-BASED protocol;
- four states: exclusive-clean (E), shared-clean (Sc), shared-modified (Sm), and modified (M);
- Sm means that potentially two or more caches have this block, main memory is not up to date, and it is this cache's responsibility to update main memory when the block is replaced from the cache (this cache is the owner);
- a block can be in the Sm state in only one cache at a time (remember that this is an update-based protocol);
- there is no invalid state, since Dragon is an update-based protocol;
- the transitions of a cache coherence protocol generate traffic on the data bus; the overall traffic of representative applications gives an estimate of how many processors a given machine can handle;
- cache-coherent MPs introduce coherence misses, which occur when blocks of data are shared among multiple caches:
  - true sharing vs. false sharing misses;
- technology pushes in the direction of large cache block sizes (e.g., DRAM organizations and access modes, and the need to amortize transfer overhead over high-bandwidth data transfers);
? the only way to reduce true sharing misses is to increase the block size and to increase the spatial locality of communicated data;
- cache blocks that are too large cause:
  - unneeded data prefetching;
  - increased conflict misses;
  - increased false sharing;
- subblocks can be used to decrease false sharing;
- another solution is to use a small block size and more aggressive prefetching;
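- a small C illustration of false sharing (mine, not from the text; a 64-byte cache block is assumed): the two counters are logically private, but without the padding they would share one block, so each write would invalidate the other processor's copy:

      #include <pthread.h>

      #define CACHE_BLOCK 64                      /* assumed block size */

      struct padded_counter {
          volatile long value;                    /* volatile keeps the per-iteration writes */
          char pad[CACHE_BLOCK - sizeof(long)];   /* push the next counter into another block */
      };

      struct padded_counter counters[2];          /* without pad, both counters share one block */

      void *worker(void *arg) {
          long id = (long)arg;
          for (long i = 0; i < 100000000L; i++)
              counters[id].value++;               /* private data, but possibly a shared block */
          return NULL;
      }

      int main(void) {
          pthread_t t[2];
          for (long id = 0; id < 2; id++)
              pthread_create(&t[id], NULL, worker, (void *)id);
          for (int id = 0; id < 2; id++)
              pthread_join(t[id], NULL);
          return 0;
      }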
- the performance of update-based vs. invalidation-based protocols is workload-dependent; hardware can implement both and decide which one to use dynamically; this can be done at the page level by assigning bits in the PTEs held in the TLB;
- competitive schemes change the protocol for a block in hardware, at run time, based on observed access patterns;
- alternatives include "counting" updates (after k updates with no local access, the block is locally invalidated) and the use of probabilistic (randomized) algorithms;
- updates are single-word writes rather than full cache block transfers; even so, update traffic is quite substantial;
  - ex.: consecutive writes to the same block, each of which generates an update;
- however, the increased bandwidth demand, the complexity of supporting updates, the trend toward larger cache blocks, and the pack-rat phenomenon with the important case of multiprogrammed sequential workloads underlie the trend away from update-based protocols in the industry;
- synchronization: how synchronization operations can be implemented on a bus-based cache-coherent MP through a combination of software algorithms and hardware primitives;
* the difficulty in implementing high-level synchronization operations in hardware is not the acquire or release component, but the waiting algorithm;
- mutual exclusion:
  - hardware locks;
  - simple software lock algorithms;
  - test&set lock with backoff;
  - test-and-test&set lock;
  - load-locked, store-conditional;
  - ticket lock;
  - array-based lock;
- hardware locks:
  - fixed set of lock lines on the bus;
  - inflexible (limited number of simultaneous locks);
- simple software locks:
  - usually constructed using atomic test&set or compare&swap instructions;
  - the problem with the test&set lock is the traffic generated while waiting: every attempt to get the lock generates a write to the cache block holding the lock variable, which in turn generates a bus transaction;
- enhancements to the simple lock algorithm:
  - test&set with backoff: reduce the frequency with which processes issue test&set instructions;
  - test-and-test&set: busy-wait only with read operations instead of issuing test&sets (writes);
- goals for locks:
  - low latency;
  - low traffic;
  - scalability;
  - low storage cost;
  - fairness;
- load-locked, store-conditional (LL-SC):

      load-locked        Reg, LockVar
      ...
      ...
      store-conditional  LockVar, Reg

  - the SC tries to write the register back to the memory location if and only if no other processor has written to that location since this processor completed its LL instruction;
  - if the SC succeeds, the LL-SC pair executed atomically;
  - like test-and-test&set, LL-SC does not generate bus traffic while busy-waiting. However, LL-SC is not a fair lock.
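- a sketch (mine, using C11 atomics; sched_yield() stands in for a delay loop) of a test-and-test&set lock with exponential backoff; on machines with LL-SC, the compare-exchange below typically compiles to an LL/SC loop:

      #include <stdatomic.h>
      #include <sched.h>

      typedef atomic_int lock_t;                  /* 0 = free, 1 = held */

      void lock_acquire(lock_t *l) {
          int backoff = 1;
          for (;;) {
              /* "test": spin with reads only -- hits in the local cache,
                 so no bus traffic while the lock stays held */
              while (atomic_load(l) != 0)
                  for (int i = 0; i < backoff; i++)
                      sched_yield();              /* stand-in for a delay loop */
              /* "test&set": one atomic read-modify-write to try to grab it */
              int expected = 0;
              if (atomic_compare_exchange_strong(l, &expected, 1))
                  return;                         /* success: lock acquired */
              if (backoff < 1024)
                  backoff *= 2;                   /* lost the race: back off */
          }
      }

      void lock_release(lock_t *l) {
          atomic_store(l, 0);
      }

  - the read-only spin keeps waiting traffic off the bus, and the backoff reduces the burst of write attempts when the lock is released.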
- two undesirable properties remain:
  1) when a lock is released, only one processor should attempt to obtain it (rather than a storm of test&sets);
  2) when a lock is released, a single read miss is better than multiple ones (multiple misses occur even with LL-SC);
- Ticket lock (TL):
  - accomplishes (1);
  - every process gets a different ticket number and busy-waits on a global "now-serving" number;
  - an unlock is done by incrementing the now-serving number;
  - atomic operation used: fetch&increment;
  - not a large difference in traffic between TL and LL-SC;
  - TL shows more fairness than LL-SC (ticket order);
- Array-based lock (ABL):
  - accomplishes both (1) and (2);
  - fetch&increment returns not a ticket number but a unique location on which to busy-wait;
  - the release operation writes an "unlock" value into the next memory location;
  - fair (FIFO);
  - fewer read misses than TL (processes busy-wait on different memory locations);
* the LL-SC lock performs well (best one in the experiments), particularly at lower processor counts, because it is not fair! a process that releases a lock can reacquire it faster than the other processors can (the microbenchmark was a tight acquire/release loop in each processor);
- simple LL-SC locks perform well on buses with sufficient bandwidth; LL-SC gets worse than ABL and TL for more than 16 processors;
- when exponential backoff is used, LL-SC delivers the best average lock transfer time in all cases (again, good performance due to unfairness);
* a shared data structure is said to be "lock-free" if the operations defined on it do not require mutual exclusion over multiple instructions;
* if an operation is guaranteed to finish in a finite amount of time, the data structure is said to be "nonblocking";
* if the operations can guarantee that every (nonfaulting) process will complete its operation in a finite amount of time, the data structure is "wait-free";
- these ideas work like updating a counter using LL-SC;
- LL-SC is used to update the data structure directly, NOT to acquire a lock;

Point-to-Point Event Synchronization
------------------------------------
- software algorithms:
  - flags (binary variables);
  - or use the value of the shared variable itself as a flag;
- hardware support: full-empty bit
  - one bit associated with every word in memory;
  - the producer sets the bit and the consumer clears it;
  - not used much;

Global (Barrier) Event Synchronization
--------------------------------------
- implemented using locks, shared counters, and flags;
- centralized barrier:
  - a counter keeps the number of processes that have arrived;
  - after incrementing the counter, each process checks whether its value is p (the number of processes);
  - if it is the last process, it writes the flag value on which the other processes are busy-waiting;
  - problem: if p-1 processes proceed and reach the barrier a second time while some process Px has not yet seen the flag being set (it was swapped out, for example), the flag is reset for the second instance before Px sees it, so Px waits indefinitely at the first barrier while all the others wait for it at the second;
- centralized barrier with sense reversal:
  - a second counter could be used to make sure all processes have left the barrier before the flag is reset; however, this incurs further latency;
  - solution: processes wait for the flag to become 0 in one round and 1 in the next, i.e., instead of the flag always going from 0 to 1 when processes pass the barrier, a variable in each process indicates which value to wait for. This way, the error case discussed above does not happen (the late process will still see the value it is waiting for, while the other processes wait for it);
  - this method is called sense reversal;
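- a sketch (mine, in C11; names are illustrative) of the centralized sense-reversing barrier; counter and flag start at 0, and each process keeps its own local_sense, also starting at 0:

      #include <stdatomic.h>

      struct barrier {
          atomic_int counter;                       /* arrivals in the current episode */
          atomic_int flag;                          /* release flag, toggles every episode */
          int p;                                    /* number of participating processes */
      };                                            /* initialize counter = 0, flag = 0 */

      void barrier_wait(struct barrier *b, int *local_sense) {
          *local_sense = !*local_sense;             /* the value to wait for this time */
          if (atomic_fetch_add(&b->counter, 1) + 1 == b->p) {
              atomic_store(&b->counter, 0);         /* last arrival: reset for reuse ... */
              atomic_store(&b->flag, *local_sense); /* ... and release everybody */
          } else {
              while (atomic_load(&b->flag) != *local_sense)
                  ;                                 /* busy-wait on this episode's sense */
          }
      }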
- in terms of bus transactions, barriers can be expensive;
- improving barrier algorithms for a bus:
  - using a tree structure;
  - bus snooping (all processors are waiting on the flag variable, so in theory a single read miss would be enough, like a broadcast of the new value);
- hardware barriers: a separate synchronization bus;
- implications for software:
  - increase spatial locality;
  - minimize false sharing;
  - watch for conflict misses: e.g., two consecutive rows used by a process may map to the same positions in the cache; allocating power-of-2-sized arrays makes this worse;
  - use per-process heaps (also to minimize false sharing);
  - pad arrays (again, to keep data for different processes in different cache blocks; see the small example below);
  - also be aware of false sharing at the PAGE level!!
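- a tiny C sketch (mine; a small direct-mapped cache of about 8 KB is assumed) of the conflict-miss/padding point: with a power-of-2 row length, consecutive rows of a[][] map onto the same cache sets, and a few elements of padding per row shift the mapping:

      #define N    1024                   /* power-of-2 row length (8 KB of doubles per row) */
      #define PAD  8                      /* a few extra elements of padding per row */

      double a[N][N];                     /* rows i and i+1 map onto the same cache sets
                                             in a small (e.g. 8 KB) direct-mapped cache */
      double a_padded[N][N + PAD];        /* padding shifts each row's mapping */

      int main(void) {
          double sum = 0.0;
          for (int j = 0; j < N; j++)     /* a sweep over two consecutive rows: the padded
                                             layout avoids the self-conflicts of a[][] */
              sum += a[0][j] + a[1][j] + a_padded[0][j] + a_padded[1][j];
          return (int)sum;
      }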