Chapter 6: Snoop-Based Multiprocessor Design
--------------------------------------------

- latency and bandwidth of a protocol depend on the bus design, the cache
  design, and their integration with memory;
- implementation objectives: correctness, high performance, and minimal
  extra hardware;
* fundamental challenge: maintaining the illusion of order required
* by coherence and by the memory consistency model;
- for coherence, the implementation should ensure that stale copies are
  invalidated (or updated) on writes, and it should provide write
  serialization;
- deadlock vs. livelock;
* main problem: even though the bus is atomic, the set of actions needed
* to satisfy a memory operation uses other resources as well (such as
* cache controllers) and is NOT atomic;
- during a block transfer from cache to memory, the cache controller:
  - asserts its request for the bus;
  - waits for the bus grant;
  - drives the address and command lines;
  - waits for the command to be accepted by the relevant device;
  - transfers the data;
- two controllers per cache: a bus-side controller and a processor-side
  controller;
* to support concurrent accesses by the two controllers, a coherent
* cache design may use dual-ported RAM for the tags and state, or it
* may duplicate the tag and state for every block; of course,
* modifications to the tags are serialized/synchronized;
- for snooping caches, each cache must check the address against its
  tags, and the collective result must be reported on the bus before the
  transaction can proceed;
- one purpose of the snoop result is to inform main memory whether it
  should respond to the request or not (some cache may be holding a
  modified copy); the questions are: WHEN is the snoop result reported,
  and IN WHAT form;
- WHEN:
  1) within a fixed number of clock cycles from the issue of the address
     on the bus;
     - advantages: the design of main memory is not affected, and the
       cache-to-cache handshake is very simple;
     - disadvantages: extra hardware and potentially larger snoop latency;
  2) variable-delay snoop: main memory assumes that one of the caches
     will supply the data until all cache controllers have snooped and
     indicated otherwise;
  3) main memory maintains a bit per block that indicates whether the
     block is modified in one of the caches;
- usually, wired-OR signal lines are used to pass snoop information
  between cache controllers (p. 384);
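- a minimal C sketch of how the collective snoop result could be formed
  over wired-OR lines (shared, dirty, inhibit); the type names and the
  three-way decision below are illustrative assumptions, not taken from a
  specific bus:

      /* Combine per-cache snoop replies over wired-OR bus lines.
       * Names and encoding are illustrative only. */
      #include <stdbool.h>
      #include <stdio.h>

      typedef struct {
          bool shared;   /* cache holds a clean copy of the block   */
          bool dirty;    /* cache holds the block in modified state */
          bool inhibit;  /* cache has not finished its snoop yet    */
      } SnoopReply;

      /* Wired-OR: a bus line is asserted if any controller asserts it. */
      static SnoopReply combine_snoop(const SnoopReply *replies, int n)
      {
          SnoopReply lines = { false, false, false };
          for (int i = 0; i < n; i++) {
              lines.shared  |= replies[i].shared;
              lines.dirty   |= replies[i].dirty;
              lines.inhibit |= replies[i].inhibit;
          }
          return lines;
      }

      int main(void)
      {
          SnoopReply replies[3] = {
              { .shared = false, .dirty = false, .inhibit = false },
              { .shared = false, .dirty = true,  .inhibit = false }, /* modified copy  */
              { .shared = false, .dirty = false, .inhibit = true  }, /* still snooping */
          };
          SnoopReply r = combine_snoop(replies, 3);

          if (r.inhibit)
              puts("transaction extended: some cache has not finished snooping");
          else if (r.dirty)
              puts("a cache will supply the block; memory aborts its fetch");
          else
              puts("memory responds; requester loads the block clean");
          return 0;
      }
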
- write-backs complicate the implementation (two bus transactions) and are
  usually dealt with by using a write-back buffer (additional storage that
  works like a victim cache);
- even when bus transactions are atomic, the whole operation of servicing
  a memory request is NOT;
- a convenient way to deal with the nonatomic nature of state transitions,
  and to revise requests and actions based on observed events, is to
  extend the protocol state diagram with intermediate or transient states;
  the original states in MESI are then called stable states;
* if a processor is allowed to return from a write after the local cache
* has been updated but before the controller acquires the bus for the
* transaction, this complicates coherence (if the transactions are to the
* same block) as well as SC (if they are to different blocks);
- there is no need to wait for the whole transaction; the processor can
  proceed with reads and writes once the transaction is on the bus; this
  is sufficient for coherence and SC and is what makes it possible to
  implement them with pipelined buses, multilevel memory hierarchies, and
  multiple outstanding operations per processor;
- implementing atomic operations: overall, traffic and locality
  considerations tend to dominate, and lock variables are usually
  cacheable, so processors can busy-wait on local copies;

- Multilevel Cache Hierarchies
  ----------------------------
- one obvious way to handle multilevel caches is to have independent bus
  snooping hardware for each level; this is bad for several reasons:
  - L1 is usually on the processor chip, and an on-chip snooper would
    consume precious pins to monitor the shared bus;
  - duplicating the tags may consume too much on-chip real estate;
  - duplication of effort (L2 already contains L1);
- inclusion property:
  - L2 contains everything in L1;
  - if a block is in the modified state in L1, its copy in L2 is also
    marked modified;
- with the inclusion property (as extended above), only a snooper at L2
  is necessary, as it has all the information needed;
- now we have information flowing in both directions:
  - L1 accesses L2 for cache miss handling and block state changes;
  - L2 forwards to L1 blocks invalidated/updated by bus transactions;
- problems that make maintaining the inclusion property (IP) difficult:
  - L1 and L2 may have different replacement algorithms: while a block is
    kept by L1, it may be evicted by L2;
  - separate data and instruction caches;
  - different cache block sizes;
* fortunately, in one of the most commonly encountered cases, inclusion
* is maintained automatically: L1 is direct mapped, L2 is either direct
* mapped or set associative with any replacement policy as long as the
* new block brought in is put in both L1 and L2, the block size is the
* same for both caches, and the number of sets in L1 is smaller than in
* L2;
- however, this does not always hold; instead, inclusion is maintained
  explicitly by extending the mechanisms used for propagating coherence
  events in the hierarchy of caches;
- information about invalidations/updates can be passed from L2 to L1 in
  two different ways: send all bus transactions to L1 (even if the given
  block is not present there), or keep some state at L2 (a bit per block)
  to identify which blocks in L2 are also present in L1;
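- a small C sketch of the second approach (a per-block "in L1" bit at L2),
  showing how the L2 bus-side controller might decide whether to forward a
  bus invalidation upward; structure and field names are hypothetical:

      /* L2 decides whether to forward a bus invalidation to L1,
       * assuming inclusion and one "in-L1" bit per L2 block. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } State;

      typedef struct {
          uint64_t tag;
          State    state;
          bool     in_l1;  /* set when the block was also filled into L1 */
      } L2Line;

      /* Called when an invalidation for a block with tag addr_tag is
       * observed on the bus.  Returns true if L1 must be told as well. */
      static bool l2_handle_bus_invalidate(L2Line *line, uint64_t addr_tag)
      {
          if (line->tag != addr_tag || line->state == INVALID)
              return false;            /* not present: nothing to do      */

          line->state = INVALID;       /* invalidate the L2 copy          */
          if (line->in_l1) {
              line->in_l1 = false;
              return true;             /* forward the invalidation upward */
          }
          return false;                /* L2 filtered the transaction     */
      }

      int main(void)
      {
          L2Line line = { .tag = 0x42, .state = SHARED, .in_l1 = true };
          if (l2_handle_bus_invalidate(&line, 0x42))
              puts("invalidation forwarded to L1");
          else
              puts("invalidation filtered at L2");
          return 0;
      }
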
- the same tradeoff occurs for writes to L1: L1 can be made write-through
  (so all modifications also reach L2), or blocks in L2 can be marked
  "modified-but-stale", indicating that a newer version must be fetched
  from L1 if the block has to be supplied as part of a bus transaction
  (which is handled by L2);
- p. 397 describes transaction propagation examples, both downward and
  upward;
* interestingly, dual tags are less critical when we have multilevel
* caches, as L2 serves as a filter for the L1 cache, screening out
* irrelevant transactions from the bus;

- Split-transaction Bus
  ---------------------
- in a split-transaction bus (STB), transactions that require a response
  are split into two independent subtransactions: a request transaction
  and a response transaction; the advantage, of course, is that by
  pipelining bus operations the bus is utilized more efficiently; the
  disadvantage is increased complexity;
- major issues when supporting STBs:
  - a new request can appear on the bus before the snoop and/or servicing
    of an earlier request are complete; in particular, these may be
    conflicting requests (to the same block);
  - the number of buffers for incoming requests and potential data
    responses from the bus to a cache controller is usually fixed and
    small; FLOW CONTROL IS THUS NEEDED;
  - since requests from the bus are buffered, we need to revisit the
    issue of when and how snoop responses and data responses are produced
    on the bus;
- going back to the definition of snooping, there are three phases in a
  transaction: a request is put on the bus, snoop results are sent by the
  other caches (to communicate the presence of the requested block in a
  state that the requester needs to know about), and finally data is sent
  to the requesting cache, if needed;
- SGI Challenge:
  - eight outstanding requests at a time;
  - no conflicting requests can be issued simultaneously;
  - flow control for the buffers is implemented with negative
    acknowledgments (NACKs);

<\begin SGI Challenge>
- the STB essentially consists of two separate buses: a request bus for
  command and address and a response bus for data;
- each request granted the bus is assigned a unique tag (a 3-bit buffer
  location);
- a response consists of data on the response bus together with the tag
  value on the 3-bit-wide tag lines;
- cache blocks are 128 bytes while the data bus is 256 bits wide, so four
  bus cycles plus a one-cycle turnaround are required for the response
  phase;
- a uniform pipeline strategy is followed, so the request phase is also
  five cycles: arbitration, resolution, address, decode, and
  acknowledgment; overall, a complete request-response transaction takes
  three or more of these five-cycle phases: an address request phase, a
  data request phase (which uses the data bus arbitration logic and
  obtains access to the data bus for the response subtransaction), and a
  data transfer or response phase (which uses the data bus); three
  different memory operations can be in the three different phases at the
  same time;
- to keep track of the eight outstanding requests on the bus, each cache
  controller maintains an eight-entry table, called a request table;
  whenever a new request is issued on the bus, it is added to all request
  tables at the same index as part of the arbitration process; the index
  is the 3-bit tag assigned to that request;
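- a C sketch of such an eight-entry request table and the conflict check a
  controller could perform before issuing a request; field names and the
  allocation policy are illustrative assumptions:

      /* Per-controller request table on a split-transaction bus:
       * the 3-bit tag is simply the table index. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_ENTRIES 8   /* up to eight outstanding requests */

      typedef struct {
          bool     valid;
          uint64_t block_addr;  /* block-aligned address of the request */
      } RequestEntry;

      typedef struct {
          RequestEntry entry[NUM_ENTRIES];
      } RequestTable;

      /* True if some outstanding transaction already targets this block. */
      static bool conflicts(const RequestTable *t, uint64_t block_addr)
      {
          for (int tag = 0; tag < NUM_ENTRIES; tag++)
              if (t->entry[tag].valid && t->entry[tag].block_addr == block_addr)
                  return true;
          return false;
      }

      /* Allocate a free entry for a request that won arbitration; every
       * controller performs the same allocation, so all tables stay in
       * sync.  Returns the 3-bit tag, or -1 if all eight slots are busy. */
      static int allocate(RequestTable *t, uint64_t block_addr)
      {
          for (int tag = 0; tag < NUM_ENTRIES; tag++) {
              if (!t->entry[tag].valid) {
                  t->entry[tag].valid = true;
                  t->entry[tag].block_addr = block_addr;
                  return tag;
              }
          }
          return -1;
      }

      int main(void)
      {
          RequestTable table = { 0 };
          int tag = allocate(&table, 0x1000);
          printf("request for 0x1000 got tag %d\n", tag);
          printf("second request to 0x1000 conflicts: %s\n",
                 conflicts(&table, 0x1000) ? "yes" : "no");
          return 0;
      }
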
- avoiding conflicting requests is easy: since every controller has a
  record of the pending transactions issued on the bus in its request
  table, no request is issued for a block that already has a transaction
  outstanding;
- each controller sees all the requests made by the other controllers
  through the shared bus; therefore, if controller C1 wants a block for
  which there is already an entry in the request table (made by another
  controller), it may decide to "snoop" the response to that other
  request, asserting the "shared" signal when the data is transferred so
  that the originator knows the block is now shared;
* suppose a processor issues a BusRd request on the bus. What should the
* other cache controllers and the main memory controller do? The request
* is entered in the request table of all controllers, and they start
* checking their caches for the requested memory block. The main memory
* subsystem has no idea whether the block is dirty in one of the caches,
* so it independently starts fetching the block. There are three
* scenarios to consider (a sketch of the memory-side decision appears a
* few items below):
*   - one of the caches may determine that it has the block in modified
*     state and may acquire the bus to generate a response before main
*     memory can respond; on seeing this response, the memory controller
*     aborts its fetch; another cache controller may use the inhibit line
*     to extend the response transaction, while main memory asserts the
*     NACK signal if it does not have the buffer space needed to pick up
*     the modified value from the bus;
*   - main memory may fetch the data and acquire the bus before the cache
*     controller holding the block dirty has finished its snoop and/or
*     acquired the bus. That controller will first assert the inhibit
*     line until it has finished its snoop, then assert the dirty line
*     and release the inhibit line, indicating to memory that it has the
*     latest copy; memory then cancels its response transaction and does
*     not put the data on the bus;
*   - the simplest scenario is that no other cache has the block dirty:
*     main memory acquires the bus and generates the response;
- once a BusRdX or BusUpgr has obtained the bus, the associated write is
  committed. However, commitment of a write does NOT guarantee that the
  value produced by the write is already visible to all other processors;
  only actual completion guarantees that. Further mechanisms are needed to
  preserve SC;
- Example 6.3 is a good example of a possible violation of SC;
- condition necessary for SC in this case: a processor should not be
  allowed to actually see the new value due to a write before previous
  writes (in bus order, as usual) are visible to it; this can be done in
  two ways (discussed in more detail on page 408):
  - not letting certain types of incoming transactions from bus to cache
    be reordered in the incoming queues;
  - allowing these reorderings in the queues, but then ensuring that the
    important orders are preserved at the necessary points in the
    machine;
<\end SGI Challenge>
- a simpler approach is to treat all requests in FIFO order. Although
  this approach is simpler, it can have performance problems;
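- referring back to the three BusRd scenarios above, a simplified C model
  of the main-memory controller's decision, assuming wired-OR inhibit and
  dirty lines; this is an illustrative sketch, not the actual SGI logic:

      /* Memory controller's reaction to a BusRd, driven by the
       * wired-OR inhibit and dirty lines (simplified model). */
      #include <stdbool.h>
      #include <stdio.h>

      typedef enum {
          MEM_WAIT,      /* inhibit asserted: keep waiting, fetch speculatively */
          MEM_CANCEL,    /* a cache owns the block dirty: cancel the response   */
          MEM_RESPOND    /* no dirty copy anywhere: memory supplies the data    */
      } MemAction;

      static MemAction memory_on_busrd(bool inhibit_line, bool dirty_line)
      {
          if (inhibit_line)
              return MEM_WAIT;     /* some cache has not finished its snoop    */
          if (dirty_line)
              return MEM_CANCEL;   /* owner will generate the response instead */
          return MEM_RESPOND;
      }

      int main(void)
      {
          /* Second scenario from the notes: the owner first holds inhibit,
           * then releases it and asserts dirty once its snoop completes.  */
          printf("%d\n", memory_on_busrd(true,  false));  /* MEM_WAIT    */
          printf("%d\n", memory_on_busrd(false, true));   /* MEM_CANCEL  */
          printf("%d\n", memory_on_busrd(false, false));  /* MEM_RESPOND */
          return 0;
      }
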
- STB with Multilevel Caches (Challenge-like machine)
  ---------------------------------------------------
- usual scenario:

        +---------------+
        |   Processor   |
        +-------+-------+
                |
        +-------+-------+
        |      L1       |
        +-------+-------+
                |
        +-------+-------+
        |      L2       |
        +-------+-------+
                |
        --------+-------------------- BUS

- to maintain high bandwidth while allowing the individual units (e.g.,
  controllers and caches) to operate at their own rates, queues are placed
  between the levels of the hierarchy; this raises locking and
  serialization issues;
- buffer deadlock can occur if L1 is write-back, for example: with a
  write-back L1, data flows in both directions (L2->L1 and L1->L2), and
  both levels may end up waiting for each other with full buffers; this
  does not occur if, for example, L1 is not write-back;

- Case studies: SGI Challenge and Sun Enterprise 6000
  ---------------------------------------------------
<\begin SGI Challenge>
- up to 18/36 MIPS processors;
- Powerpath-2 bus with a peak bandwidth of 1.2 GB/s;
- interleaved memory accessible through the bus, with uniform access time;
- Powerpath-2 (PP2) is nonmultiplexed, having a 256-bit-wide data portion
  and a separate 40-bit-wide address portion, plus command and other
  signals; it is clocked at 47.6 MHz and is a split-transaction design
  supporting eight outstanding read requests;
- the bus supports 16 slots, 9 of which can be used by 4-processor boards;
- the L2 block size is 128 bytes, i.e., a block can be transferred in 4
  bus cycles;
- total of 329 signals:
  - 256 data;
  - 8 data parity;
  - 40 address;
  - 8 command;
  - 2 address + command parity;
  - 8 data resource ID;
  - 7 miscellaneous;
- all transactions take 5 cycles: arbitration, resolution, address,
  decode, and acknowledge; when no transactions are occurring, each bus
  controller drops into a two-state idle machine;
- since the bus is an STB, the address and data buses must be arbitrated
  for separately;
- during the arbitration cycle, the 48 address+command lines are used for
  arbitration. The lower 16 of these lines are used by the 16 boards to
  request the data bus, and the middle 16 lines are used for address bus
  arbitration. For transactions that require both at the same time (e.g.,
  write-backs), the corresponding bits for both uses can be set high. The
  top 16 lines are used to make urgent, higher-priority requests;
- at the end of the arbitration cycle, all bus interface ASICs capture the
  48-bit state of the address+command lines and thus see the bus requests
  from all boards; a distributed arbitration scheme is used, and the
  results are computed during the resolution cycle (a sketch follows after
  the cycle-by-cycle description);
- during the address cycle, the address bus winner drives the address and
  command buses with the corresponding information. Simultaneously, the
  data bus winner drives the data resource ID corresponding to its
  response (the data resource ID is the 3-bit global tag assigned to the
  read request when it was issued on the bus);
- during the decode cycle, no signals are driven on the address bus. Each
  bus interface slot decides how to respond to the transaction. In
  addition, all slots prepare to supply the proper cache coherence
  information;
- during the acknowledge cycle, each bus interface slot responds to the
  data/address bus transaction. The 48 bits are used by the controllers to
  provide feedback, including 16 bits that indicate, for each cache,
  whether the requested block is present or not (used to determine the
  state in which the block will be loaded by the requestor);
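- a C sketch of the idea behind distributed arbitration: every board
  captures the same request bits and runs the same deterministic function,
  so all boards agree on the winner without a central arbiter; the
  round-robin priority used here is an assumption for illustration, not
  the documented Powerpath-2 policy:

      /* Distributed arbitration sketch: all boards evaluate the same
       * function on the same captured request bits. */
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_BOARDS 16

      /* Pick the winner among the requesting boards, scanning round-robin
       * starting just after the previous winner.  Urgent requests (the top
       * 16 lines in the notes) are considered first. */
      static int resolve(uint16_t urgent_req, uint16_t normal_req, int last_winner)
      {
          uint16_t candidates = urgent_req ? urgent_req : normal_req;
          if (!candidates)
              return -1;                       /* nobody wants the bus */
          for (int i = 1; i <= NUM_BOARDS; i++) {
              int board = (last_winner + i) % NUM_BOARDS;
              if (candidates & (1u << board))
                  return board;
          }
          return -1;                           /* unreachable */
      }

      int main(void)
      {
          /* Boards 3 and 9 request the address bus; board 5 won last time. */
          uint16_t req = (1u << 3) | (1u << 9);
          printf("winner: board %d\n", resolve(0, req, 5));   /* board 9 */
          return 0;
      }
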
- once a slot has become the master (the source of the data), the 128
  bytes of data are transferred in four consecutive cycles over the
  256-bit-wide datapath. This four-cycle sequence begins with the
  acknowledge cycle and ends at the address cycle of the following
  five-cycle bus phase;
- each processor board has:
  - one address chip (A-chip);
  - one cache coherence chip (CC-chip) PER PROCESSOR;
  - four bit-sliced data chips (D-chips) that interface with the data
    bus; they are quite simple and are shared among the processors;
- for a request satisfied by main memory, the data appears on the bus 12
  cycles after the address appeared (250 ns);
- L2 cache miss penalty: approx. 1 us (explained in the book);
- uses the Illinois MESI protocol by default;
- I/O devices are connected through a second bus, the HIO I/O bus
  (64 bits wide, 320 MB/s), which forms the PowerChannel-2 I/O subsystem;
- the I/O subsystem interfaces to the system data bus just like another
  processor board: DMA reads and writes look like processor operations
  and are handled in the same way;
<\end SGI Challenge>

<\begin Sun Enterprise 6000>
- up to 30 UltraSparc processors with the Gigaplane system bus of
  2.67 GB/s;
- even though some memory is physically local to a pair of processors,
  all of memory is accessed through the bus and hence has uniform access
  time;
- Sun Gigaplane bus: nonmultiplexed STB with 256-bit data lines and
  41-bit physical addresses, clocked at 83.5 MHz; the bus can support up
  to 112 outstanding transactions, including up to 7 from each board;
- the bus has 388 signals:
  - 256 data;
  - 32 ECC;
  - 43 for address with parity;
  - 7 for ID tag;
  - 18 for arbitration;
  - 32 for configuration;
- the bus protocol is very different from the Challenge's, using
  collision-based speculative arbitration techniques to avoid the cost of
  bus arbitration;
- if a request collision occurs, the requestor that wins simply drives
  the address again in the next cycle, as it would with conventional
  arbitration (this is sort of an optimistic protocol);
- five cycles after the address is on the bus, all boards assert their
  snoop signals (shared, owned, mapped, and ignore) on the state bus
  lines;
- as in the Challenge, invalidations are ordered by the BusRdX requests
  appearing on the address bus and are handled in FIFO fashion by the
  cache subsystem;
- although the UltraSparc implements a five-state MOESI protocol in the
  L2 caches, the D-tags (fig. 6.19) maintain an approximation using only
  three states: owned, shared, and invalid; that is the minimum
  information needed to know when to place data on the bus (see the
  sketch below);
<\end Sun Enterprise 6000>
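- a C sketch of how such a three-state D-tag might approximate the five
  MOESI states, keeping just enough information to know when the board
  must supply data; the exact mapping is an assumption consistent with the
  notes, not the documented Enterprise implementation:

      /* Approximate five MOESI states with three D-tag states. */
      #include <stdio.h>

      typedef enum { M_MODIFIED, M_OWNED, M_EXCLUSIVE, M_SHARED, M_INVALID } Moesi;
      typedef enum { D_OWNED, D_SHARED, D_INVALID } Dtag;

      static Dtag dtag_of(Moesi s)
      {
          switch (s) {
          case M_MODIFIED:
          case M_OWNED:
          case M_EXCLUSIVE:   /* exclusive clean: this cache could silently go
                                 to M, so the board is tracked as "owned"      */
              return D_OWNED;
          case M_SHARED:
              return D_SHARED;
          default:
              return D_INVALID;
          }
      }

      int main(void)
      {
          const char *name[] = { "owned", "shared", "invalid" };
          printf("MODIFIED  -> %s\n", name[dtag_of(M_MODIFIED)]);
          printf("EXCLUSIVE -> %s\n", name[dtag_of(M_EXCLUSIVE)]);
          printf("SHARED    -> %s\n", name[dtag_of(M_SHARED)]);
          return 0;
      }
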
- advantages of having multiple processors share a single L1 cache (the
  same idea applies to a shared L2):
  - no coherence protocol is needed at this level;
  - reduces the latency of communication;
  - "prefetching" can occur across processors (P1 misses on a block that
    is also going to be used by P2);
  - more effective use of long cache blocks, as there is no false
    sharing;
  - if the working sets of different processors overlap, the shared cache
    for n processors can be smaller than n independent caches combined;
- today, shared L1 caches are being investigated for single-chip
  multiprocessors; the tradeoff: is a double-P4 better than a single P5?
  As the number of transistors on a chip increases, both alternatives
  become possible (multiprocessor vs. a more powerful processor);
- disadvantages of sharing L1:
  - higher bandwidth demand;
  - hit latency to a shared cache is higher than to a private one;
  - higher cache complexity;
  - since a shared cache needs to be larger than a single-processor
    cache, it is usually slower;
  - instead of constructive interference (as in the working-set example),
    destructive interference can occur;
- virtually indexed caches:
  - the problem of the same VA referring to different data across
    processes can be handled by flushing the cache on context switches or
    by associating address-space identifiers (ASIDs) with cache blocks;
  - the other problem is synonyms: different VAs that refer to the same
    physical address;
  - one available solution is to use VAs to look up the cache but to put
    PAs on the bus; to support this, caches need to maintain both VA and
    PA tags, pointing to each other (p. 438);

- TLB coherence
  -------------
- a PTE can come to reside in the TLBs of multiple processors, so there
  is a coherence problem if one of the copies is modified; solutions for
  TLB coherence vary, some working entirely through the OS, in software;
- four approaches to handling TLB coherence:
  1) TLBs are not needed if virtually addressed caches are used. If cache
     misses are rare, the page table can be used directly; as PTEs get
     cached, the OS must flush the PTEs from all caches when the memory
     copy is changed (e.g., a protection-bit change);
  2) TLB shootdown: the TLB coherence procedure is invoked by a processor
     (the "initiator") when it makes changes to PTEs that may be cached
     by other TLBs; an interrupt is sent to the other processors, which
     then examine their TLBs (see the sketch at the end of these notes);
  3) some processors, like the MIPS, use software-loaded TLBs. The
     coherence problem can be addressed using ASIDs, avoiding the need to
     flush the entire TLB on a context switch;
  4) some processors provide hardware instructions to invalidate entries
     in other processors' TLBs;
- section 6.6.4 discusses cache coherence on rings; SC is a little
  trickier there, since processors may see a pair of transactions in
  different orders;
* scaling data bandwidth is easy; the real challenge is scaling the
* snoop bandwidth;
- some alternatives for scaling data bandwidth: wider buses, multiple
  buses;
* the general solution to building scalable cache-coherent machines is
* to distribute memory physically among the nodes and use a scalable
* interconnect, together with coherence protocols that do not rely on
* snooping (chap. 8); it is difficult to predict what the future holds
* for buses and the scale at which they will be used, although they are
* likely to play an important role for some time to come;
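- a simplified, single-threaded C sketch of TLB shootdown (approach 2
  above); the locking, PTE-write, TLB-invalidate, and IPI primitives are
  hypothetical stand-ins for what an OS and architecture would provide:

      /* TLB shootdown sketch: initiator updates a PTE, interrupts the
       * other processors, and waits for all of them to acknowledge. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_CPUS 4

      static volatile bool ack[NUM_CPUS];
      static volatile uint64_t pending_va;

      /* Hypothetical primitives (stubs so the sketch compiles and runs). */
      static void lock_page_table(void)                 { }
      static void unlock_page_table(void)               { }
      static void write_pte(uint64_t va, uint64_t pte)  { (void)va; (void)pte; }
      static void local_tlb_invalidate(uint64_t va)     { (void)va; }

      /* Interrupt handler run on each processor that receives the IPI. */
      static void tlb_shootdown_handler(int cpu)
      {
          local_tlb_invalidate(pending_va);
          ack[cpu] = true;
      }

      /* In this single-threaded sketch the "IPI" just calls the handler
       * directly; on real hardware it is delivered asynchronously. */
      static void send_ipi(int cpu)
      {
          printf("IPI -> cpu %d\n", cpu);
          tlb_shootdown_handler(cpu);
      }

      /* Initiator: update the PTE, interrupt the other processors, and
       * wait until all have flushed the stale entry from their TLBs. */
      static void tlb_shootdown(int self, uint64_t va, uint64_t new_pte)
      {
          lock_page_table();
          write_pte(va, new_pte);
          local_tlb_invalidate(va);
          pending_va = va;

          for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
              ack[cpu] = (cpu == self);
              if (cpu != self)
                  send_ipi(cpu);
          }
          for (int cpu = 0; cpu < NUM_CPUS; cpu++)
              while (!ack[cpu])
                  ;                         /* spin until everyone responds */
          unlock_page_table();
      }

      int main(void)
      {
          tlb_shootdown(0, 0x1000, 0);      /* e.g., revoke a mapping */
          puts("all TLBs consistent");
          return 0;
      }
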