Chapter 8: Directory-Based Cache Coherence
------------------------------------------
- at this point in our design, the communication assist provides a shared address space in hardware; however, we have not yet discussed cache coherence for scalable multiprocessors;
- in fact, to avoid cache coherence and simplify memory consistency, some machines disable hardware caching of logically shared but physically remote data, restricting the programming model;
- this chapter deals with coherence and consistency in machines without a snoopable interconnect such as a bus;
- as we have seen, coherence algorithms may become the bottleneck of a system and need to scale as well;
- this chapter deals with hardware support for coherence in scalable multiprocessors;
- since there is no shared bus anymore, a directory is used to maintain state explicitly, where requests can go and look it up;
- from the directory, the node determines where the valid cached copies (if any) are and what further actions to take; it then communicates with the cached copies as necessary using additional network transactions;
- hierarchical buses have been tried as an alternative to directories; however, they do not apply to general network topologies such as meshes and cubes, and we will see that they have problems with latency and bandwidth;
- invalidations and updates are performed by the communication assist rather than the main processor;
- another popular approach is a limited, two-level protocol hierarchy. Each node of the machine is itself a multiprocessor. The caches within a node are kept coherent by an "inner protocol", while coherence across nodes is maintained by another, possibly different, "outer protocol". To the outer protocol, each multiprocessor node looks like a single cache, and coherence within the node is the responsibility of the inner protocol;
- a common organization is for the outer protocol to be a directory protocol and the inner one to be a snooping protocol. However, other combinations such as snooping-snooping, directory-directory, and even directory-snooping may be used (fig 8.4, p. 557);
- if coherent replication is provided in main memory, additional support for keeping caches coherent may not be necessary; this chapter assumes that data is automatically replicated only in caches, not in main memory;
- machines of this type are called cache-coherent, nonuniform memory access, or CC-NUMA, architectures;
- access faults (cache misses) are generated on an access to a block that is not in the cache or on a write access to a block that is present but in shared state;
- different coherence protocols provide different mechanisms for:
  - finding out enough information about the state of the location (cache block) in other caches;
  - locating those other copies, to perform updates or invalidations;
  - communicating with the other copies;
- directory-based protocols avoid broadcasts;
- definitions:
  - home node: node in whose main memory the block is allocated;
  - dirty node: node that has a copy of the block in modified state;
  - owner node: holds the valid copy of a block and must supply the data when needed; in directory protocols, this is either the home node or the dirty node;
  - exclusive node: has an exclusive copy of the block, either clean or dirty; thus, the dirty node is always the exclusive node;
  - local node, or requesting node: node containing a processor that issues a request for the block;
- Simple directory scheme
  -----------------------
- on a read miss, the accessing node contacts the home node, and the directory indicates from which node the data may be obtained;
- on a write miss (writes to shared blocks count as write misses as well), the directory identifies the copies of the block, and invalidation or update network transactions may be sent to these copies; for invalidations, for example, an explicit ack from each copy of the block is now needed, since we can no longer count on the shared bus; we also cannot guarantee ordering with respect to other transactions;
- a natural way to organize a directory is to maintain the directory information for a block together with the block in main memory, that is, at the home node for the block;
- for example, a bit vector with p presence bits can be used together with some state bits; assume that one dirty bit is used (see the entry sketch just below);
- the directory does not necessarily need to know the exact state (e.g., MESI) in each cache, but only enough information to determine what actions to take;
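- to make this concrete, here is a minimal C sketch of a full-bit-vector directory entry; the struct and function names are mine (illustrative only), assuming a 64-node machine; real directories implement this in hardware, alongside the block in main memory:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_NODES 64

    typedef struct {
        uint64_t presence;  /* bit i set => node i may hold a cached copy */
        bool     dirty;     /* set => exactly one node holds a modified copy */
    } dir_entry_t;

    /* On a write miss, the home walks the presence bits to find the copies
     * that must be invalidated -- no broadcast needed.  Returns how many
     * explicit acks the writer must collect. */
    int sharers_to_invalidate(const dir_entry_t *e, int requester,
                              int out[NUM_NODES])
    {
        int n = 0;
        for (int i = 0; i < NUM_NODES; i++)
            if (((e->presence >> i) & 1) && i != requester)
                out[n++] = i;
        return n;
    }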
- because of communication delays, the actual state in the cache may differ from the one in the directory, and this stale information may cause the directory to generate messages based on old information;
- in typical parallel applications, the number of sharers of a block is expected to be small;
- presence bits can generate high overheads for large numbers of processors;
- there are two major classes of alternatives for finding the source of the directory information for a block:
  - flat directory schemes: the source of the directory information for a block is in a fixed place, usually at the home, which is determined from the physical address of the block;
  - hierarchical directory schemes: the source of directory information is not known a priori; the directory information for each block is logically organized as a hierarchical data structure (a tree); the processing nodes, each with its portion of memory, are at the leaves of the tree; the internal nodes of the tree are simply hierarchically maintained directory information for the block; a node keeps track of whether each of its children has a copy of a block; every processing node in the system not only serves as a leaf node for the blocks it contains but also stores directory information as an internal tree node for other blocks;
- flat schemes can be memory-based or cache-based. Memory-based schemes store the directory information about all cached copies at the home node of the block, while cache-based schemes distribute the information among the copies themselves. In the latter case, the home simply contains a pointer to (or the identity of) the node that has the first cached copy of the block, in a distributed linked-list organization;
- thank God, hierarchical schemes are not popular in modern systems;
- the rest of the chapter deals with memory- and cache-based flat schemes;
- examples of flat, memory-based systems: Stanford FLASH/DASH, MIT Alewife, SGI Origin, HAL;
- examples of flat, cache-based systems: IEEE SCI, Sequent NUMA-Q;
- Flat, Memory-based directory schemes (FMB)
  ------------------------------------------
- a bit vector is the most straightforward way to implement FMB schemes;
- the main disadvantage of the bit vector is its storage overhead; two ways to reduce the overhead:
  - increase the block size (duh);
  - put more than one processor in one node, which is the unit visible to the directory protocol (Stanford DASH uses 4 processors per node);
- these two mechanisms make FMB schemes attractive: in a machine with 256 processors, 4 per node, and 128-byte blocks, only 6.25% of memory is spent on bit vectors (64 bits per 1,024-bit block);
- however, total storage still grows as P*P*M, i.e., each of the P nodes keeps a bit vector with P entries for each of its M memory blocks;
- we can reduce the number of bits per directory entry, or directory width, by not letting it grow proportionally to P; or we can reduce the total number of directory entries, or directory height, by not having an entry per memory block;
- directory width is reduced by the use of limited pointer directories, which are motivated by the earlier observation that most of the time only a few caches have a copy of a block when the block is written; instead of maintaining a yes/no bit for each processor, a limited pointer directory keeps i pointers to processors that have the block cached (each pointer takes log2(P) bits); special strategies are necessary when more processors hold the block than there is space allocated for pointers; one solution is to broadcast invalidations to all nodes (a sketch of such an entry follows);
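- a hypothetical C sketch of a limited pointer entry with i = 4 pointers for a 256-node machine (8 bits per pointer instead of 256 presence bits); the names and the overflow policy shown (broadcast) are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_PTRS 4

    typedef struct {
        uint8_t ptr[MAX_PTRS]; /* node ids of up to MAX_PTRS sharers */
        uint8_t count;         /* pointers currently in use */
        bool    overflow;      /* set once sharers exceed MAX_PTRS */
        bool    dirty;
    } lp_entry_t;

    /* Record a new sharer; on overflow, fall back to a special strategy --
     * here, mark the entry so the next write broadcasts invalidations. */
    void add_sharer(lp_entry_t *e, uint8_t node)
    {
        if (e->count < MAX_PTRS)
            e->ptr[e->count++] = node;
        else
            e->overflow = true; /* next write invalidates all nodes */
    }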
- directory height can be reduced by organizing the directory itself as a cache, given that the sum of all the caches (possible copies) is much smaller than the total amount of memory, which means that most directory entries will be unused most of the time anyway;
- Flat, Cache-based directory schemes (FCB)
  -----------------------------------------
- there is still a home main memory for the block, but the directory entry at the home node does not contain the identities of all sharers, only a pointer to the first sharer in the list (head pointer) plus a few state bits;
- each cached copy is a node in a doubly linked list;
- on a read miss, the requestor contacts the home node, then contacts the head node, and becomes the new head, pointing to the old one; the new head is also recorded at the home node (see the list-insertion sketch at the end of this subsection);
- on a write miss, the requestor obtains the head pointer as before and is set as the new head, but it must also traverse the list, contacting each node in order, to locate the copies and invalidate them; acknowledgments for these invalidations are sent to the writer; if the data is needed by the writer, it is provided either by the home (clean) or by the current head of the list (the owner, which has a dirty copy of the block); observe that these accesses are serialized, as the writer needs to contact one node in order to discover the next node in the list;
- write backs or block replacements require that the node delete itself from the list, contacting the nodes before and after it in the list;
- advantages over FMB schemes:
  - directory overhead is small (every memory block has a single head pointer);
  - linked lists record the order in which accesses were made, making it easier to provide fairness;
  - the work of sending invalidations is transferred from the home node (memory-based) to the sharers (the writer, on a write miss);
- synchronization is very important and tricky when dealing with these linked lists; the IEEE 1596-1992 (SCI) standard deals with exactly this!
- update-based protocols are not very popular in scalable machines, as they make it much more difficult to preserve the desired memory consistency model;
- two important measures for invalidation-based protocols:
  - invalidation frequency: the frequency with which processors issue writes that may require invalidating other copies (writes to shared blocks);
  - invalidation size distribution: the number of invalidations (sharers) needed on such a write;
- directory schemes are advantageous if the invalidation size is small and the frequency is significant enough to make broadcast a performance problem;
- using 64 processors and 1 processor per node, simulation results with infinite caches show that invalidation sizes are rarely bigger than 5;
- data access patterns are important in understanding invalidation patterns; important data access patterns:
  * read-only: never written once initialized; there are no invalidating writes, so data in this category is not an issue for directories;
  * producer-consumer: the invalidation size is determined by how many consumers there have been each time the producer writes the value;
  * migratory: data migrates from one processor to another, being written and usually read by each processor; e.g., a global sum to which each processor adds its local sum;
  * irregular read-write: irregular or unpredictable read and write access patterns; these usually lead to wide-ranging invalidation size distributions;
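- the list-insertion sketch promised above: a C model of how a read miss makes the requestor the new head of the distributed sharing list; node ids index a plain array here, but in hardware each pointer update is a separate network transaction, which is why two round trips (home, then old head) are needed:

    #define NIL (-1)

    typedef struct { int head; } home_entry_t;        /* at the home node */
    typedef struct { int next, prev; } list_node_t;   /* per cached copy */

    /* Read miss by node `req`: contact the home for the head pointer,
     * then splice ourselves in front of the old head. */
    void read_miss(home_entry_t *home, list_node_t nodes[], int req)
    {
        int old_head = home->head;
        nodes[req].prev = NIL;          /* we become the new head */
        nodes[req].next = old_head;     /* point at the previous head */
        if (old_head != NIL)
            nodes[old_head].prev = req; /* old head learns its new neighbor */
        home->head = req;               /* home records the new head */
    }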
- two major performance goals at the protocol layer: reduce the number of network transactions generated per memory operation, and reduce the number of actions, especially network transactions, that are on the critical path of the processor, thus reducing uncontended latency;
- consider a read miss to a remotely allocated block that is dirty in a third node in an FMB protocol; with strict request-response there are two request-response interactions, between the requestor and the home and between the requestor and the owner; there is also a message from the owner to the home to update the state of the block (from dirty to shared), giving a total of five messages;
- intervention forwarding: the home does not respond to the requestor but forwards the request to the owner; the owner replies to the home with the data or an acknowledgment; the home updates the state of the block and forwards this message to the requestor; this scheme reduces the number of messages to four;
  - disadvantage: the transaction state is kept at the home instead of at the requestor;
- reply forwarding: the home also forwards the request to the owner, but the latter also receives the identity of the requestor, to which it responds directly; the owner also sends a revision message to the home, but this message is not on the critical path of the read miss;
- now the state is maintained at the requestor, which makes reply forwarding the preferred choice in real systems;
- a possible optimization: while the directory lookup is executed, the requested block can be speculatively fetched from memory (assuming it is clean); if the block is discovered to be dirty, the memory fetch is discarded;
- similar techniques can be applied to cache-based schemes (fig 8.13);
- Serialization for coherence
  ---------------------------
- coherence: all processors must see the writes to a given memory location as having happened in the same order;
- several types of solutions can be used to ensure serialization to a location; most of them use additional directory states called busy states or pending states; a block being in busy state at the directory indicates that a previous request that came to the home for that block is still in progress and has not completed; when a new request comes to the home and finds the directory state busy, serialization may be provided by one of the following mechanisms (a toy model of mechanism 3 follows the list):
  1) buffer at the home: the request may be buffered at the home as a pending request until the previous request in progress for the block has completed, regardless of whether the previous request was forwarded to a dirty node or a strict request-response was used; disadvantages include the fact that the home needs to know when the write has completed and the possibility of buffer overflow (which could be solved by implementing the rest of the buffer in main memory);
  2) buffer at the requestors: pending requests are buffered at the requestors themselves (e.g., in a distributed list of pending requests, as SCI does);
  3) NACK and retry: an incoming request may be NACKed by the home if the block is busy; the request will be retried at a later time;
  4) forward to the dirty node: accesses are serialized by the home node when the block is clean and by the owner when the block is dirty; if the dirty node ceases to be dirty before the request arrives at it, the request will probably be NACKed and retried;
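- a toy C model of mechanism 3, the busy state with NACK-and-retry; the state and reply names are mine, and a real protocol distinguishes many more states:

    typedef enum { UNOWNED, SHARED, EXCLUSIVE, BUSY } dir_state_t;
    typedef enum { REPLY_ACCEPT, REPLY_NACK } reply_t;

    typedef struct { dir_state_t state; int owner; } dir_line_t;

    /* The home accepts at most one outstanding request per block; a request
     * that finds the block BUSY is bounced back to the requestor, which
     * retries later.  The serialization order for the block is therefore
     * the order in which the home ACCEPTS requests, not the arrival order. */
    reply_t handle_request(dir_line_t *line)
    {
        if (line->state == BUSY)
            return REPLY_NACK;  /* not buffered at the home: retry later */
        line->state = BUSY;     /* busy until the current transaction
                                   completes, e.g., until all invalidation
                                   acks have been collected */
        /* ... forward intervention / send data, then leave BUSY ... */
        return REPLY_ACCEPT;
    }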
- recall the two components of preserving the sufficient conditions for satisfying SC:
  - detecting write completion (needed to preserve program order);
  - ensuring write atomicity;
- these two conditions are easily achieved in a distributed network with no cached shared data;
- to achieve SC with multiple copies, a processor may wait, after issuing a write, until all ACKs for that write have been received;
- in the distributed case, it seems easier to provide SC in invalidation-based than in update-based protocols; in update-based protocols, a second phase of the protocol is needed to let the replicas know when they can use the new value;
- atomicity: the owner does not allow access to a new value until all invalidation ACKs for the write that generated it have returned;
<\begin SGI Origin>
- FMB directory system;
- NACKs are used instead of buffering (at the home node) to provide serialization; the serialization order is determined by the order in which the home accepts requests, not the order in which they arrive;
- another general requirement for serialization: any node, not just the serializing entity, should not apply a transaction corresponding to a new memory operation on a block until an outstanding memory operation on that block is complete (as far as that node is concerned);
- support for automatic page migration: the home node keeps a counter of local/remote accesses to each PAGE, one counter per node;
- invalidating TLB entries as a result of a page migration is VERY expensive (100 microseconds); Origin uses lazy TLB invalidation, supported by its "poisoned" directory state; TLB entries are therefore invalidated only when actually used by the other processors;
- if the processor (or cache) provides data as a reply to an incoming intervention, it is the logic in the PI's (inside the Hub, the CA) outgoing FIFO that expands the reply into the multiple messages required by the coherence protocol;
<\end SGI Origin>
<\begin Sequent NUMA-Q>
- FCB directory protocol: SCI (Scalable Coherent Interface);
- processing nodes connected by high-speed links in a ring configuration;
- each node, called a "quad": 4 Pentium Pros;
- the IQ-Link plugs into the quad memory bus and takes the place of the Hub in the Origin;
- memory cache: a 32MB (expandable), four-way set-associative remote access cache for blocks fetched to the node from remote memory; this cache represents the quad to the directory protocol; the processor caches are kept coherent with the remote cache through a snooping bus protocol (inclusion is maintained between the remote cache and the processors' caches);
- there is a linked list of sharers per block; the head pointer of the list is stored in the memory of the home node;
- states of a block at the home:
  - home: no remote cache in the system has a copy of the block;
  - fresh: one or more remote caches may have a read-only copy, and the copy in memory is valid;
  - gone: another remote cache contains a writable copy (exclusive or dirty);
- lists are manipulated using three primitive operations (a rollout sketch follows this list):
  - list construction: a sharer is added at the head of the list;
  - rollout: a node removes itself from the list by contacting its neighbors so they can update their pointers (this is one ugly aspect of cache-based schemes);
  - purging (invalidation): the node at the head invalidates all other nodes; only the head node can invoke this operation;
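- a C sketch of rollout, the second primitive above; in the real protocol each pointer patch is a network transaction to a remote IQ-Link, whereas here the neighbors are just array entries, and all names are mine:

    #define NIL (-1)

    typedef struct { int head; } sci_home_t;
    typedef struct { int next, prev; } sci_node_t;

    /* Node `me` removes itself from the distributed sharing list, e.g.,
     * on a block replacement, by patching its neighbors' pointers. */
    void rollout(sci_home_t *home, sci_node_t nodes[], int me)
    {
        int p = nodes[me].prev, n = nodes[me].next;
        if (p != NIL)
            nodes[p].next = n;  /* predecessor now skips over us */
        else
            home->head = n;     /* we were the head: update the home */
        if (n != NIL)
            nodes[n].prev = p;  /* successor points back past us */
        nodes[me].next = nodes[me].prev = NIL;
    }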
- the home node serializes accesses to blocks;
- the SCI standard does not specify a memory consistency model; NUMA-Q implements processor consistency;
- when a second write request to a given block arrives at the home node, the home returns the identity of the current owner, which may still be in an intermediate state (in the protocol FSM); the requestor then adds itself to the list of pending requests at the owner; there are entries in the remote cache for this pending list;
- latency tends to be larger than in FMB protocols (maintaining the list of sharers is clearly a bottleneck);
- to decrease the interference between I/O and memory transfers, NUMA-Q provides a separate communication substrate through the PCI buses for interquad I/O transfers; a Fibre Channel link connects to a PCI bus on each node; I/O is globally addressable;
- read miss and write miss examples are described in detail on p. 638;
- when extending a bus-based system to be the node of a larger cache-coherent machine, it is essential that the bus be split-transaction;
- interesting: a block can be in the HOME state (unowned) at the home node even if one of the home's local processors has a dirty copy of it; on a read miss to this block, the bus and the directory "cooperate" to provide the requestor with the latest value (this is like a lazy scheme);
<\end Sequent NUMA-Q>
- 4 parameters in the communication architecture: processor overhead, occupancy of the communication assist, network transit delay, and network bandwidth;
- network delay and assist occupancy are the more critical issues nowadays;
- assist occupancy tends to be higher in FCB schemes;
- in CC-NUMA machines, memory management is typically done at the granularity of pages;
- overall software objectives: minimize false sharing, artifactual communication (through better data placement), and synchronization costs;
- Reducing directory storage overhead
  -----------------------------------
- the broadcast scheme for directory-based (FMB) protocols: set a broadcast bit when the number of pointers to sharers (available in the directory) is exceeded; if a write occurs while this bit is set, invalidation messages are sent (broadcast) to all nodes;
- another option is to fix the maximum number of sharers at any given time, invalidating one of the existing sharers if another one shows up;
- another approach: change the granularity of the bit vector (from nodes to groups of nodes) when the maximum number of pointers is exceeded (see the sketch below);
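- a C sketch of that coarse-vector fallback: when the precise pointers run out, the same storage is reinterpreted as one bit per GROUP of nodes, trading precision (extra invalidations) for bounded storage; the sizes and names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_NODES  256
    #define GROUP_SIZE 8          /* 256 nodes -> 32 group bits */

    typedef struct {
        uint32_t bits;            /* group-grained presence vector */
        bool     coarse;          /* set once pointer capacity is exceeded */
    } cv_entry_t;

    /* In coarse mode, only the sharer's group is recorded ... */
    void add_sharer_coarse(cv_entry_t *e, int node)
    {
        e->coarse = true;
        e->bits |= 1u << (node / GROUP_SIZE);
    }

    /* ... so a write must invalidate every node of a marked group: the
     * directory no longer knows which member actually cached the block. */
    bool must_invalidate(const cv_entry_t *e, int node)
    {
        return e->coarse && ((e->bits >> (node / GROUP_SIZE)) & 1);
    }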
- Hierarchical snooping
  ---------------------
- bus-based nodes are connected by a second bus; a coherence monitor listens to both buses and forwards the necessary transactions between them;
- one tricky point is that the remote cache now has to be bigger if the system wants to maintain the inclusion property within a node (a node now means multiple processors);
- Hierarchical directory schemes
  ------------------------------
- the processing nodes are at the leaves of the tree, and main memory is distributed along with the processing nodes; every block has a home memory (leaf) in which it is allocated, but this does not mean that the directory information is maintained or rooted there; the internal nodes of the tree are not processing nodes and only hold directory information; each such directory node keeps track of all memory blocks that are being cached or recorded by its subtrees; it uses a presence vector per block to tell which of its subtrees have copies of the block and a bit to tell whether one of them has it dirty; it also records information about local memory blocks that are being cached by processing nodes outside its subtree; all this information is used to decide when requests originating within a subtree should be propagated further up the hierarchy;
- in contrast to bus-based architectures, hierarchies of directories are logical; they do not need to map to physical hierarchies (the network could be a mesh, for example);
- in general, the advantages of hierarchical schemes are tightly related to the amount of locality shown by memory accesses, since the delay is high if all the buses/levels need to be traversed to serve a high percentage of the memory accesses;
- for hierarchical directories, the latency problem is that the number of network transactions sent up and down the hierarchy to satisfy a request tends to be larger than in a flat, memory-based scheme; even though these transactions may be more localized in the network, each one is a full-fledged network transaction that also requires either looking up or modifying the directory at its (intermediate) destination node; this increased endpoint overhead at the nodes along the critical path tends to far outweigh any reduction in the total number of network hops traversed (and hence network delay), especially given the characteristics of modern networks;
- some interesting open questions for hardware-coherent shared address space systems include whether their performance on real applications will indeed scale to large processor counts (and whether significant changes to current protocols will be needed for this), whether the appropriate node for a scalable system will be a small-scale multiprocessor or a uniprocessor, the extent to which commodity communication architectures will be successful in supporting this abstraction efficiently, and the success with which a communication assist can be designed that supports the most appropriate mechanisms for both cache coherence and explicit message passing;