Chapter 8: Directory-Based Cache Coherence
------------------------------------------
- at this point in our design, the communication assist provides a shared address space in hardware; however, we have not yet discussed cache coherence for scalable multiprocessors;
- in fact, to avoid cache coherence and simplify memory consistency, some machines disable hardware caching of logically shared but physically remote data, restricting the programming model;
- this chapter deals with coherence and consistency in machines without a snoopable interconnect such as a bus;
- as we have seen, coherence algorithms may become the bottleneck of a system and need to scale as well;
- this chapter deals with hardware support for coherence in scalable multiprocessors;
- since there is no shared bus anymore, a directory is used to maintain state explicitly, where requests can go and look it up;
- from the directory, the node determines where the valid cached copies (if any) are and what further actions to take; it then communicates with the cached copies as necessary using additional network transactions;
- hierarchical buses have been tried as an alternative to directories; however, they do not apply to general network topologies such as meshes and cubes, and we will see that they have problems with latency and bandwidth;
- invalidations and updates are performed by the communication assist rather than the main processor;
- another popular approach is a limited, two-level protocol hierarchy. Each node of the machine is itself a multiprocessor. The caches within a node are kept coherent by an "inner protocol", while coherence across nodes is maintained by another, possibly different, "outer protocol". To the outer protocol, each multiprocessor node looks like a single cache, and coherence within the node is the responsibility of the inner protocol;
- a common organization is for the outer protocol to be a directory protocol and the inner one to be a snooping protocol. However, other combinations such as snooping-snooping, directory-directory, and even directory-snooping may be used (fig 8.4, p. 557);
- if coherent replication is provided in main memory, additional support for keeping caches coherent may not be necessary; this chapter assumes that data is automatically replicated only in caches, not in main memory;
- machines of this type are called cache-coherent, nonuniform memory access, or CC-NUMA, architectures;
- access faults (cache misses) are generated on an access to a block that is not in the cache or on a write access to a block that is present but in shared state;
- different coherence protocols provide different mechanisms for:
  - finding out enough information about the state of the location (cache block) in other caches;
  - locating those other copies, to perform updates or invalidations;
  - communicating with the other copies;
- directory-based protocols avoid broadcasts;
- definitions:
  - home node: node in whose main memory the block is allocated;
  - dirty node: node that has a copy of the block in modified state;
  - owner node: holds the valid copy of a block and must supply the data when needed; in directory protocols, this is either the home node or the dirty node;
  - exclusive node: has an exclusive copy of the block, either clean or dirty; thus, the dirty node is always the exclusive node;
  - local node, or requesting node: node containing a processor that issues a request for the block;
- Simple directory scheme
  -----------------------
- on a read miss, the accessing node contacts the home node, and the directory indicates from which node the data may be obtained;
- on a write miss (writes to shared blocks count as write misses as well), the directory identifies the copies of the block, and invalidation or update network transactions may be sent to these copies; for invalidations, for example, an explicit ack from each copy of the block is now needed, since we can no longer count on the shared bus; we also cannot guarantee ordering with respect to other transactions;
- a natural way to organize a directory is to maintain the directory information for a block together with the block in main memory, that is, at the home node for the block;
- for example, a bit vector with p presence bits can be used together with some state bits; assume that one dirty bit is used (see the entry sketch just below);
- the directory does not necessarily need to know the exact state (e.g., MESI) in each cache, but only enough information to determine what actions to take;
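- to make this concrete, here is a minimal C sketch of a full-bit-vector directory entry; the struct and function names are mine (illustrative only), assuming a 64-node machine; real directories implement this in hardware, alongside the block in main memory:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_NODES 64

    typedef struct {
        uint64_t presence;  /* bit i set => node i may hold a cached copy */
        bool     dirty;     /* set => exactly one node holds a modified copy */
    } dir_entry_t;

    /* On a write miss, the home walks the presence bits to find the copies
     * that must be invalidated -- no broadcast needed.  Returns how many
     * explicit acks the writer must collect. */
    int sharers_to_invalidate(const dir_entry_t *e, int requester,
                              int out[NUM_NODES])
    {
        int n = 0;
        for (int i = 0; i < NUM_NODES; i++)
            if (((e->presence >> i) & 1) && i != requester)
                out[n++] = i;
        return n;
    }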
- because of communication delays, the actual state in the cache may differ from the one in the directory, and this stale information may cause the directory to generate messages based on old information;
- in typical parallel applications, the number of sharers of a block is expected to be small;
- presence bits can generate high overheads for large numbers of processors;
- there are two major classes of alternatives for finding the source of the directory information for a block:
  - flat directory schemes: the source of the directory information for a block is in a fixed place, usually at the home, which is determined from the physical address of the block;
  - hierarchical directory schemes: the source of directory information is not known a priori; the directory information for each block is logically organized as a hierarchical data structure (a tree); the processing nodes, each with its portion of memory, are at the leaves of the tree; the internal nodes of the tree are simply hierarchically maintained directory information for the block; a node keeps track of whether each of its children has a copy of a block; every processing node in the system not only serves as a leaf node for the blocks it contains but also stores directory information as an internal tree node for other blocks;
- flat schemes can be memory-based or cache-based. Memory-based schemes store the directory information about all cached copies at the home node of the block, while cache-based schemes distribute the information among the copies themselves. In the latter case, the home simply contains a pointer to (or the identity of) the node that has the first cached copy of the block, in a distributed linked-list organization;
- thank God, hierarchical schemes are not popular in modern systems;
- the rest of the chapter deals with memory- and cache-based flat schemes;
- examples of flat, memory-based systems: Stanford FLASH/DASH, MIT Alewife, SGI Origin, HAL;
- examples of flat, cache-based systems: IEEE SCI, Sequent NUMA-Q;
- Flat, Memory-based directory schemes (FMB)
  ------------------------------------------
- a bit vector is the most straightforward way to implement FMB schemes;
- the main disadvantage of the bit vector is its storage overhead; two ways to reduce the overhead:
  - increase the block size (duh);
  - put more than one processor in one node, which is the unit visible to the directory protocol (Stanford DASH uses 4 processors per node);
- these two mechanisms make FMB schemes attractive: in a machine with 256 processors, 4 per node, and 128-byte blocks, only 6.25% of memory is spent on bit vectors (64 bits per 1,024-bit block);
- however, total storage still grows as P*P*M, i.e., each of the P nodes keeps a bit vector with P entries for each of its M memory blocks;
- we can reduce the number of bits per directory entry, or directory width, by not letting it grow proportionally to P; or we can reduce the total number of directory entries, or directory height, by not having an entry per memory block;
- directory width is reduced by the use of limited pointer directories, which are motivated by the earlier observation that most of the time only a few caches have a copy of a block when the block is written; instead of maintaining a yes/no bit for each processor, a limited pointer directory keeps i pointers to processors that have the block cached (each pointer takes log2(P) bits); special strategies are necessary when more processors hold the block than there is space allocated for pointers; one solution is to broadcast invalidations to all nodes (a sketch of such an entry follows);
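- a hypothetical C sketch of a limited pointer entry with i = 4 pointers for a 256-node machine (8 bits per pointer instead of 256 presence bits); the names and the overflow policy shown (broadcast) are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_PTRS 4

    typedef struct {
        uint8_t ptr[MAX_PTRS]; /* node ids of up to MAX_PTRS sharers */
        uint8_t count;         /* pointers currently in use */
        bool    overflow;      /* set once sharers exceed MAX_PTRS */
        bool    dirty;
    } lp_entry_t;

    /* Record a new sharer; on overflow, fall back to a special strategy --
     * here, mark the entry so the next write broadcasts invalidations. */
    void add_sharer(lp_entry_t *e, uint8_t node)
    {
        if (e->count < MAX_PTRS)
            e->ptr[e->count++] = node;
        else
            e->overflow = true; /* next write invalidates all nodes */
    }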
- directory height can be reduced by organizing the directory itself as a cache, given that the sum of all the caches (possible copies) is much smaller than the total amount of memory, which means that most directory entries will be unused most of the time anyway;
- Flat, Cache-based directory schemes (FCB)
  -----------------------------------------
- there is still a home main memory for the block, but the directory entry at the home node does not contain the identities of all sharers, only a pointer to the first sharer in the list (head pointer) plus a few state bits;
- each cached copy is a node in a doubly linked list;
- on a read miss, the requestor contacts the home node, then contacts the head node, and becomes the new head, pointing to the old one; the new head is also recorded at the home node (see the list-insertion sketch at the end of this subsection);
- on a write miss, the requestor obtains the head pointer as before and is set as the new head, but it must also traverse the list, contacting each node in order, to locate the copies and invalidate them; acknowledgments for these invalidations are sent to the writer; if the data is needed by the writer, it is provided either by the home (clean) or by the current head of the list (the owner, which has a dirty copy of the block); observe that these accesses are serialized, as the writer needs to contact one node in order to discover the next node in the list;
- write backs or block replacements require that the node delete itself from the list, contacting the nodes before and after it in the list;
- advantages over FMB schemes:
  - directory overhead is small (every memory block has a single head pointer);
  - linked lists record the order in which accesses were made, making it easier to provide fairness;
  - the work of sending invalidations is transferred from the home node (memory-based) to the sharers (the writer, on a write miss);
- synchronization is very important and tricky when dealing with these linked lists; the IEEE 1596-1992 (SCI) standard deals with exactly this!
- update-based protocols are not very popular in scalable machines, as they make it much more difficult to preserve the desired memory consistency model;
- two important measures for invalidation-based protocols:
  - invalidation frequency: the frequency with which processors issue writes that may require invalidating other copies (writes to shared blocks);
  - invalidation size distribution: the number of invalidations (sharers) needed on such a write;
- directory schemes are advantageous if the invalidation size is small and the frequency is significant enough to make broadcast a performance problem;
- using 64 processors and 1 processor per node, simulation results with infinite caches show that invalidation sizes are rarely bigger than 5;
- data access patterns are important in understanding invalidation patterns; important data access patterns:
  * read-only: never written once initialized; there are no invalidating writes, so data in this category is not an issue for directories;
  * producer-consumer: the invalidation size is determined by how many consumers there have been each time the producer writes the value;
  * migratory: data migrates from one processor to another, being written and usually read by each processor; e.g., a global sum to which each processor adds its local sum;
  * irregular read-write: irregular or unpredictable read and write access patterns; these usually lead to wide-ranging invalidation size distributions;
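- the list-insertion sketch promised above: a C model of how a read miss makes the requestor the new head of the distributed sharing list; node ids index a plain array here, but in hardware each pointer update is a separate network transaction, which is why two round trips (home, then old head) are needed:

    #define NIL (-1)

    typedef struct { int head; } home_entry_t;        /* at the home node */
    typedef struct { int next, prev; } list_node_t;   /* per cached copy */

    /* Read miss by node `req`: contact the home for the head pointer,
     * then splice ourselves in front of the old head. */
    void read_miss(home_entry_t *home, list_node_t nodes[], int req)
    {
        int old_head = home->head;
        nodes[req].prev = NIL;          /* we become the new head */
        nodes[req].next = old_head;     /* point at the previous head */
        if (old_head != NIL)
            nodes[old_head].prev = req; /* old head learns its new neighbor */
        home->head = req;               /* home records the new head */
    }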
- two major performance goals at the protocol layer: reduce the number of network transactions generated per memory operation, and reduce the number of actions, especially network transactions, that are on the critical path of the processor, thus reducing uncontended latency;
- consider a read miss to a remotely allocated block that is dirty in a third node in an FMB protocol; with strict request-response there are two request-response interactions, between the requestor and the home and between the requestor and the owner; there is also a message from the owner to the home to update the state of the block (from dirty to shared), giving a total of five messages;
- intervention forwarding: the home does not respond to the requestor but forwards the request to the owner; the owner replies to the home with the data or an acknowledgment; the home updates the state of the block and forwards this message to the requestor; this scheme reduces the number of messages to four;
  - disadvantage: the transaction state is kept at the home instead of at the requestor;
- reply forwarding: the home also forwards the request to the owner, but the latter also receives the identity of the requestor, to which it responds directly; the owner also sends a revision message to the home, but this message is not on the critical path of the read miss;
- now the state is maintained at the requestor, which makes reply forwarding the preferred choice in real systems;
- a possible optimization: while the directory lookup is executed, the requested block can be speculatively fetched from memory (assuming it is clean); if the block is discovered to be dirty, the memory fetch is discarded;
- similar techniques can be applied to cache-based schemes (fig 8.13);
- Serialization for coherence
  ---------------------------
- coherence: all processors must see the writes to a given memory location as having happened in the same order;
- several types of solutions can be used to ensure serialization to a location; most of them use additional directory states called busy states or pending states; a block being in busy state at the directory indicates that a previous request that came to the home for that block is still in progress and has not completed; when a new request comes to the home and finds the directory state busy, serialization may be provided by one of the following mechanisms (a toy model of mechanism 3 follows the list):
  1) buffer at the home: the request may be buffered at the home as a pending request until the previous request in progress for the block has completed, regardless of whether the previous request was forwarded to a dirty node or a strict request-response was used; disadvantages include the fact that the home needs to know when the write has completed and the possibility of buffer overflow (which could be solved by implementing the rest of the buffer in main memory);
  2) buffer at the requestors: pending requests are buffered at the requestors themselves (e.g., in a distributed list of pending requests, as SCI does);
  3) NACK and retry: an incoming request may be NACKed by the home if the block is busy; the request will be retried at a later time;
  4) forward to the dirty node: accesses are serialized by the home node when the block is clean and by the owner when the block is dirty; if the dirty node ceases to be dirty before the request arrives at it, the request will probably be NACKed and retried;
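- a toy C model of mechanism 3, the busy state with NACK-and-retry; the state and reply names are mine, and a real protocol distinguishes many more states:

    typedef enum { UNOWNED, SHARED, EXCLUSIVE, BUSY } dir_state_t;
    typedef enum { REPLY_ACCEPT, REPLY_NACK } reply_t;

    typedef struct { dir_state_t state; int owner; } dir_line_t;

    /* The home accepts at most one outstanding request per block; a request
     * that finds the block BUSY is bounced back to the requestor, which
     * retries later.  The serialization order for the block is therefore
     * the order in which the home ACCEPTS requests, not the arrival order. */
    reply_t handle_request(dir_line_t *line)
    {
        if (line->state == BUSY)
            return REPLY_NACK;  /* not buffered at the home: retry later */
        line->state = BUSY;     /* busy until the current transaction
                                   completes, e.g., until all invalidation
                                   acks have been collected */
        /* ... forward intervention / send data, then leave BUSY ... */
        return REPLY_ACCEPT;
    }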
- recall the two components of preserving the sufficient conditions for satisfying SC:
  - detecting write completion (needed to preserve program order);
  - ensuring write atomicity;
- these two conditions are easily achieved in a distributed network with no cached shared data;
- to achieve SC with multiple copies, a processor may wait, after issuing a write, until all ACKs for that write have been received;
- in the distributed case, it seems easier to provide SC in invalidation-based than in update-based protocols; in update-based protocols, a second phase of the protocol is needed to let the replicas know when they can use the new value;
- atomicity: the owner does not allow access to a new value until all invalidation ACKs for the write that generated it have returned;
<\begin SGI Origin>
- FMB directory system;
- NACKs are used instead of buffering (at the home node) to provide serialization; the serialization order is determined by the order in which the home accepts requests, not the order in which they arrive;
- another general requirement for serialization: any node, not just the serializing entity, should not apply a transaction corresponding to a new memory operation on a block until an outstanding memory operation on that block is complete (as far as that node is concerned);
- support for automatic page migration: the home node keeps a counter of local/remote accesses to each PAGE, one counter per node;
- invalidating TLB entries as a result of a page migration is VERY expensive (100 microseconds); Origin uses lazy TLB invalidation, supported by its "poisoned" directory state; TLB entries are therefore invalidated only when actually used by the other processors;
- if the processor (or cache) provides data as a reply to an incoming intervention, it is the logic in the PI's (inside the Hub, the CA) outgoing FIFO that expands the reply into the multiple messages required by the coherence protocol;
<\end SGI Origin>
<\begin Sequent NUMA-Q>
- FCB directory protocol: SCI (Scalable Coherent Interface);
- processing nodes connected by high-speed links in a ring configuration;
- each node, called a "quad": 4 Pentium Pros;
- the IQ-Link plugs into the quad memory bus and takes the place of the Hub in the Origin;
- memory cache: a 32MB (expandable), four-way set-associative remote access cache for blocks fetched to the node from remote memory; this cache represents the quad to the directory protocol; the processor caches are kept coherent with the remote cache through a snooping bus protocol (inclusion is maintained between the remote cache and the processors' caches);
- there is a linked list of sharers per block; the head pointer of the list is stored in the memory of the home node;
- states of a block at the home:
  - home: no remote cache in the system has a copy of the block;
  - fresh: one or more remote caches may have a read-only copy, and the copy in memory is valid;
  - gone: another remote cache contains a writable copy (exclusive or dirty);
- lists are manipulated using three primitive operations (a rollout sketch follows this list):
  - list construction: a sharer is added at the head of the list;
  - rollout: a node removes itself from the list by contacting its neighbors so they can update their pointers (this is one ugly aspect of cache-based schemes);
  - purging (invalidation): the node at the head invalidates all other nodes; only the head node can invoke this operation;
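- a C sketch of rollout, the second primitive above; in the real protocol each pointer patch is a network transaction to a remote IQ-Link, whereas here the neighbors are just array entries, and all names are mine:

    #define NIL (-1)

    typedef struct { int head; } sci_home_t;
    typedef struct { int next, prev; } sci_node_t;

    /* Node `me` removes itself from the distributed sharing list, e.g.,
     * on a block replacement, by patching its neighbors' pointers. */
    void rollout(sci_home_t *home, sci_node_t nodes[], int me)
    {
        int p = nodes[me].prev, n = nodes[me].next;
        if (p != NIL)
            nodes[p].next = n;  /* predecessor now skips over us */
        else
            home->head = n;     /* we were the head: update the home */
        if (n != NIL)
            nodes[n].prev = p;  /* successor points back past us */
        nodes[me].next = nodes[me].prev = NIL;
    }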
- the home node serializes accesses to blocks;
- the SCI standard does not specify a memory consistency model; NUMA-Q implements processor consistency;
- when a second write request to a given block arrives at the home node, the home returns the identity of the current owner, which may still be in an intermediate state (in the protocol FSM); the requestor then adds itself to the list of pending requests at the owner; there are entries in the remote cache for this pending list;
- latency tends to be larger than in FMB protocols (maintaining the list of sharers is clearly a bottleneck);
- to decrease the interference between I/O and memory transfers, NUMA-Q provides a separate communication substrate through the PCI buses for interquad I/O transfers; a Fibre Channel link connects to a PCI bus on each node; I/O is globally addressable;
- read miss and write miss examples are described in detail on p. 638;
- when extending a bus-based system to be the node of a larger cache-coherent machine, it is essential that the bus be split-transaction;
- interesting: a block can be in the HOME state (unowned) at the home node even if one of the home's local processors has a dirty copy of it; on a read miss to this block, the bus and the directory "cooperate" to provide the requestor with the latest value (this is like a lazy scheme);
<\end Sequent NUMA-Q>
- 4 parameters in the communication architecture: processor overhead, occupancy of the communication assist, network transit delay, and network bandwidth;
- network delay and assist occupancy are the more critical issues nowadays;
- assist occupancy tends to be higher in FCB schemes;
- in CC-NUMA machines, memory management is typically done at the granularity of pages;
- overall software objectives: minimize false sharing, artifactual communication (through better data placement), and synchronization costs;
- Reducing directory storage overhead
  -----------------------------------
- the broadcast scheme for directory-based (FMB) protocols: set a broadcast bit when the number of pointers to sharers (available in the directory) is exceeded; if a write occurs while this bit is set, invalidation messages are sent (broadcast) to all nodes;
- another option is to fix the maximum number of sharers at any given time, invalidating one of the existing sharers if another one shows up;
- another approach: change the granularity of the bit vector (from nodes to groups of nodes) when the maximum number of pointers is exceeded (see the sketch below);
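- a C sketch of that coarse-vector fallback: when the precise pointers run out, the same storage is reinterpreted as one bit per GROUP of nodes, trading precision (extra invalidations) for bounded storage; the sizes and names are illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_NODES  256
    #define GROUP_SIZE 8          /* 256 nodes -> 32 group bits */

    typedef struct {
        uint32_t bits;            /* group-grained presence vector */
        bool     coarse;          /* set once pointer capacity is exceeded */
    } cv_entry_t;

    /* In coarse mode, only the sharer's group is recorded ... */
    void add_sharer_coarse(cv_entry_t *e, int node)
    {
        e->coarse = true;
        e->bits |= 1u << (node / GROUP_SIZE);
    }

    /* ... so a write must invalidate every node of a marked group: the
     * directory no longer knows which member actually cached the block. */
    bool must_invalidate(const cv_entry_t *e, int node)
    {
        return e->coarse && ((e->bits >> (node / GROUP_SIZE)) & 1);
    }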
- Hierarchical snooping
  ---------------------
- bus-based nodes are connected by a second bus; a coherence monitor listens to both buses and forwards the necessary transactions between them;
- one tricky point is that the remote cache now has to be bigger if the system wants to maintain the inclusion property within a node (a node now means multiple processors);
- Hierarchical directory schemes
  ------------------------------
- the processing nodes are at the leaves of the tree, and main memory is distributed along with the processing nodes; every block has a home memory (leaf) in which it is allocated, but this does not mean that the directory information is maintained or rooted there; the internal nodes of the tree are not processing nodes and only hold directory information; each such directory node keeps track of all memory blocks that are being cached or recorded by its subtrees; it uses a presence vector per block to tell which of its subtrees have copies of the block and a bit to tell whether one of them has it dirty; it also records information about local memory blocks that are being cached by processing nodes outside its subtree; all this information is used to decide when requests originating within a subtree should be propagated further up the hierarchy;
- in contrast to bus-based architectures, hierarchies of directories are logical; they do not need to map to physical hierarchies (the network could be a mesh, for example);
- in general, the advantages of hierarchical schemes are tightly related to the amount of locality shown by memory accesses, since the delay is high if all the buses/levels need to be traversed to serve a high percentage of the memory accesses;
- for hierarchical directories, the latency problem is that the number of network transactions sent up and down the hierarchy to satisfy a request tends to be larger than in a flat, memory-based scheme; even though these transactions may be more localized in the network, each one is a full-fledged network transaction that also requires either looking up or modifying the directory at its (intermediate) destination node; this increased endpoint overhead at the nodes along the critical path tends to far outweigh any reduction in the total number of network hops traversed (and hence network delay), especially given the characteristics of modern networks;
- some interesting open questions for hardware-coherent shared address space systems include whether their performance on real applications will indeed scale to large processor counts (and whether significant changes to current protocols will be needed for this), whether the appropriate node for a scalable system will be a small-scale multiprocessor or a uniprocessor, the extent to which commodity communication architectures will be successful in supporting this abstraction efficiently, and the success with which a communication assist can be designed that supports the most appropriate mechanisms for both cache coherence and explicit message passing;