Chapter 8: Multiprocessors
--------------------------

Synchronization
---------------
- implementing a single atomic operation introduces some challenges,
  since it requires both a memory read and a write in a single,
  uninterruptible instruction; this complicates the implementation of
  coherence, since the hardware cannot allow any other operation
  between the read and the write, and yet must not deadlock;
- one alternative: load linked (LL) + store conditional (SC): if the
  contents of the location specified by the load linked are changed
  before the store conditional to the same address occurs, the store
  conditional fails; if the processor does a context switch between
  the two instructions, the store conditional always fails; these
  instructions are typically implemented by keeping track of the
  address specified in the LL instruction in a register, often called
  the link register; if an interrupt occurs, or if the cache block
  matching the address in the link register is invalidated (for
  example, by another SC), the link register is cleared; the SC simply
  checks that its address matches that in the link register, failing
  if they differ;
- the number of instructions between LL and SC should be minimized;
- implementing spin locks with cache coherence has two main
  advantages: first, it is much cheaper to spin on a local cached copy
  than to fetch the lock from memory on every attempt; second, lock
  accesses have a high degree of locality (the processor that used the
  lock last is likely to use it again soon);
- instead of trying to write the lock on every iteration, the better
  implementation shown on p. 698 keeps reading the lock location until
  it detects that the lock is free, and only then attempts the atomic
  exchange (a C11 sketch of this scheme, combined with the exponential
  back-off of the next section, appears below after the queuing-lock
  notes);
- synchronization performance can be a real bottleneck when there is
  substantial contention among multiple processes; when contention is
  low, we are mainly concerned with the latency of the synchronization
  primitive;
- the more serious problem in these examples is the serialization of
  each process's attempt to complete the synchronization; this
  serialization is a problem when there is contention, because it
  greatly increases the time to complete the operation; for example,
  if the time to complete all 20 lock and unlock operations depended
  only on the latency in the uncontended case, then it would take 2000
  rather than 40000 cycles to complete the synchr. operations;

Synchronization mechanisms for large-scale machines
---------------------------------------------------
- we want to minimize serialization when contention is significant;
- alternative 1: exponential back-off;
- alternative 2: queuing locks:
  - can be implemented in either hardware or software;
  - delay for the 20-lock problem decreases to 2050 cycles;
  - the hardware must be prepared to reclaim such locks, since the
    process that requested the lock may have been context-switched and
    may not even be scheduled again on the same processor;
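- [illustrative sketch, not from the text] a minimal C11 version of
  the "spin on the cached copy, then try the exchange" lock described
  above, with the exponential back-off idea folded in; the names
  spin_lock/spin_unlock/delay and the back-off constants are arbitrary
  choices, and on LL/SC machines the atomic exchange would typically
  be compiled into an LL/SC loop:

      #include <stdatomic.h>

      static atomic_int lock = 0;            /* 0 = free, 1 = held */

      /* crude busy-wait delay; a real implementation would use a
         pause/yield hint instead */
      static void delay(unsigned cycles) {
          for (volatile unsigned i = 0; i < cycles; i++)
              ;
      }

      static void spin_lock(atomic_int *l) {
          unsigned backoff = 4;              /* initial back-off, arbitrary */
          for (;;) {
              /* test: spin on the locally cached copy, generating no
                 coherence traffic while someone else holds the lock */
              while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                  ;
              /* test-and-set: one atomic exchange (an LL/SC loop on
                 LL/SC machines); acquire ordering keeps the critical
                 section from moving above the lock acquisition */
              if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
                  return;                    /* lock obtained */
              /* lost the race: back off exponentially before retrying */
              delay(backoff);
              if (backoff < 1024)
                  backoff *= 2;
          }
      }

      static void spin_unlock(atomic_int *l) {
          /* release ordering makes the critical section's writes
             visible before the lock is seen as free */
          atomic_store_explicit(l, 0, memory_order_release);
      }

  used as spin_lock(&lock); ... spin_unlock(&lock); while the lock is
  held, the waiting loop only reads the locally cached copy and
  generates no coherence traffic; traffic occurs only when the lock is
  released and the waiters race on the exchange, which the back-off
  then spreads out;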
Models of Memory Consistency
----------------------------
- the consistency question: "When must a processor see a value that
  has been updated by another processor?"
- sequential consistency (SC) requires that the result of any
  execution be the same as if the accesses executed by each processor
  were kept in order and the accesses among different processors were
  interleaved;

  * THE SIMPLEST WAY TO IMPLEMENT SEQUENTIAL CONSISTENCY IS TO REQUIRE
  * A PROCESSOR TO DELAY THE COMPLETION OF ANY MEMORY ACCESS UNTIL ALL
  * THE INVALIDATIONS CAUSED BY THAT ACCESS ARE COMPLETED.

- SC reduces potential performance, especially in a machine with a
  large number of processors or long interconnect delays;
- to provide better performance, designers have developed less
  restrictive memory consistency models that allow for faster
  hardware;
- a program is said to be synchronized if all accesses to shared
  variables are ordered by synchronization operations; synchronized
  programs are also called data-race-free;
- it is a broadly accepted observation that most programs are
  synchronized; even with SC, reasoning about the outcome of
  unsynchronized programs is very difficult;
- there are two types of restriction on memory orders: read fences and
  write fences; fences are fixed points in a computation that ensure
  that no read or write is moved across the fence; for example, a
  write fence executed by processor P ensures that:
  - all writes by P that occur before P executed the write fence
    operation have completed, and
  - no writes that occur after the fence in P are initiated before the
    fence;
- in SC, all reads are read fences and all writes are write fences;
- from a performance viewpoint, the processor would like to execute
  reads as early as possible and complete writes as late as possible;
- there are a number of relaxed models that all maintain the property
  that the execution semantics of a synchronized program is the same
  under the relaxed model as it would be under an SC model;
- it is simpler if we define the models in terms of what orderings
  among reads and writes PERFORMED BY A SINGLE PROCESSOR are preserved
  by each model; there are four such orderings:
  - R->R: a read followed by a read;
  - R->W: a read followed by a write, which is always preserved if the
    operations are to the same address, since this is an
    antidependence;
  - W->W: a write followed by a write, which is always preserved if
    they are to the same address, since this is an output dependence;
  - W->R: a write followed by a read, which is always preserved if
    they are to the same address, since this is a true dependence;

  * IF THERE IS A DEPENDENCE BETWEEN THE READ AND THE WRITE, THEN
  * UNIPROCESSOR PROGRAM SEMANTICS DEMAND THAT THE OPERATIONS BE
  * ORDERED; IF THERE IS NO DEPENDENCE, THE MEMORY CONSISTENCY MODEL
  * DETERMINES WHAT ORDERS MUST BE PRESERVED; AN SC MODEL REQUIRES
  * THAT ALL FOUR ORDERINGS BE PRESERVED;

- for example, relaxing the ordering W->R means that we allow a read
  that is later than a write to complete before the write has
  completed; letting the read occur after the write miss has been
  handled but before the invalidations are done does not preserve the
  ordering (the two-flag sketch below shows why this ordering matters
  for flag-based synchronization);

  * A CONSISTENCY MODEL DOES NOT, IN REALITY, RESTRICT THE ORDERING OF
  * EVENTS; INSTEAD, IT SAYS WHAT POSSIBLE ORDERINGS CAN BE OBSERVED;
  * FOR EXAMPLE, IN SC THE SYSTEM MUST APPEAR TO PRESERVE THE FOUR
  * ORDERINGS, ALTHOUGH IN PRACTICE IT CAN ALLOW REORDERING; THIS
  * ALLOWS IMPLEMENTATIONS TO USE TRICKS THAT REORDER EVENTS WITHOUT
  * ALLOWING THE REORDERING TO BE OBSERVED;

- the consistency model must also define the orderings imposed between
  synchronization variable accesses, which act as fences, and all
  other accesses; for weaker consistency models, we need to specify
  the ordering restrictions involving ordinary variables; the simplest
  ordering restriction is that every synchronization access is a
  memory fence; if we let S stand for a synchr. variable access, we
  can also write this with the ordering notation just shown as S->W,
  W->S, etc.;
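- [illustrative sketch, not from the text] the classic two-flag
  example behind the W->R discussion, written with C11 atomics and
  threads; each thread writes its own flag and then reads the other's;
  under SC, or with the seq_cst fences below acting as write fences,
  the outcome seenA == 0 && seenB == 0 is impossible, but if the
  fences are removed and W->R is relaxed, both reads may complete
  before either write is visible and both threads can see 0; the names
  flagA/flagB/seenA/seenB are arbitrary:

      #include <stdatomic.h>
      #include <stdio.h>
      #include <threads.h>

      static atomic_int flagA = 0, flagB = 0;
      static int seenA, seenB;

      static int thread1(void *arg) {
          (void)arg;
          atomic_store_explicit(&flagA, 1, memory_order_relaxed);     /* W */
          /* write fence: the store above must complete before the
             read below is performed */
          atomic_thread_fence(memory_order_seq_cst);
          seenB = atomic_load_explicit(&flagB, memory_order_relaxed); /* R */
          return 0;
      }

      static int thread2(void *arg) {
          (void)arg;
          atomic_store_explicit(&flagB, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);
          seenA = atomic_load_explicit(&flagA, memory_order_relaxed);
          return 0;
      }

      int main(void) {
          thrd_t t1, t2;
          thrd_create(&t1, thread1, NULL);
          thrd_create(&t2, thread2, NULL);
          thrd_join(t1, NULL);
          thrd_join(t2, NULL);
          /* with the fences, seenA == 0 && seenB == 0 cannot happen;
             without them, W->R may be relaxed and it can */
          printf("seenA=%d seenB=%d\n", seenA, seenB);
          return 0;
      }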
- Processor Consistency (PC) or Total Store Ordering (TSO):
  - relaxes the ordering between a write and a read (to a different
    address), eliminating the order W->R;
  - the processor allows a read to proceed before it is guaranteed
    that an earlier write by that processor has been seen by all the
    other processors;
  - by relaxing only this one ordering, many applications (even those
    that are unsynchronized) operate correctly, although a
    synchronization operation is necessary to ensure that a write
    completes before a read is done;
  - this model is equivalent to making the writes be write fences;
- Partial Store Ordering (PSO):
  - allows nonconflicting writes to potentially complete out of order,
    by relaxing the W->W ordering;
  - a write operation need only cause a stall when a synchronization
    operation, which causes a write fence, is encountered;
- Weak Ordering (WO):
  - eliminates the R->R and R->W orderings in addition to the other
    two orderings;
  - does not preserve ordering among references, except for the
    following:
    - a read or write is completed before any synchronization
      operation executed in program order by the processor after the
      read or write;
    - a synchr. operation is always completed before any reads or
      writes that occur in program order after the operation;
  - the only orderings imposed by WO are those created by
    synchronization operations;
- Release Consistency (RC):
  - more relaxed than WO;
  - distinguishes between synchr. operations: release vs. acquire;
  - a read or write that precedes an acquire need not complete before
    the acquire, and a read or write that follows a release need not
    wait for the release (the acquire/release split is sketched in C11
    below, after the implementation notes);
  - provides one of the least restrictive models that is easily
    checkable, and ensures that synchronized programs will see a
    sequentially consistent execution;
- Summary of the orderings preserved by each model:

    SC:     R->R, R->W, W->R, W->W,
            S->W, S->R, R->S, W->S, S->S
    TSO/PC: R->R, R->W, W->W,
            S->W, S->R, R->S, W->S, S->S
    PSO:    R->R, R->W,
            S->W, S->R, R->S, W->S, S->S
    WO:     S->W, S->R, R->S, W->S, S->S
    RC:     Sa->W, Sa->R, R->Sr, W->Sr,
            Sa->Sa, Sa->Sr, Sr->Sa, Sr->Sr

- our directory protocols already satisfy SC, since the processor
  stalls until an operation is complete and the directory first
  invalidates all sharers before responding to a write miss;
- PC is typically implemented by allowing read misses to bypass
  pending writes; a write buffer that can check whether any pending
  write in the buffer is to the same address as a read miss, together
  with a memory and interconnection system that can support two
  outstanding references per node, is sufficient to implement this
  scheme;
- in RC we must ensure that all pending invalidates for all
  outstanding writes complete before we allow a release to complete,
  so we simply check the pending invalidation counts on any
  outstanding write when a release is executed; this is made easier by
  limiting the number of outstanding writes;
- at the present time, most machines being built support some sort of
  weak consistency model, varying from processor consistency to RC,
  and almost all also support SC as an option;
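- [illustrative sketch, not from the text] the acquire/release split
  that RC formalizes, expressed here with C11 memory orders; the
  release store to the flag must wait for the earlier ordinary write
  to data, and the acquire load keeps the later read of data from
  being performed before it, while accesses before the acquire or
  after the release remain free to be reordered; the names
  data/ready/producer/consumer are arbitrary:

      #include <stdatomic.h>
      #include <stdio.h>
      #include <threads.h>

      static int data;                    /* ordinary shared variable */
      static atomic_int ready = 0;        /* synchronization variable */

      static int producer(void *arg) {
          (void)arg;
          data = 42;                      /* ordinary write */
          /* release: the write to data must complete before the flag
             is observed as set; later accesses need not wait */
          atomic_store_explicit(&ready, 1, memory_order_release);
          return 0;
      }

      static int consumer(void *arg) {
          (void)arg;
          /* acquire: later reads and writes may not be performed
             before the flag is seen as set */
          while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
              ;
          printf("data = %d\n", data);    /* always prints 42 */
          return 0;
      }

      int main(void) {
          thrd_t p, c;
          thrd_create(&p, producer, NULL);
          thrd_create(&c, consumer, NULL);
          thrd_join(p, NULL);
          thrd_join(c, NULL);
          return 0;
      }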
Distributed Shared Memory (DSM) or Shared Virtual Memory (SVM)
--------------------------------------------------------------
- pages become the units of coherence, rather than cache blocks;
- just as in a directory system, each page has a home node, and the
  operating system running on that node is responsible for tracking
  who has copies of that page;
- this dependence on software and the larger unit of coherence are the
  main reasons for the performance gap between DSM and the
  hardware-based approach;
- SVM has become an acceptable substitute for loosely coupled message
  passing, since in both cases the frequency of communication must be
  low, and communication that is structured in larger blocks is
  preferred;