Chapter 6: Snoop-Based Multiprocessor Design
--------------------------------------------

- latency and bandwidth of a protocol depend on the bus design, the cache
  design, and their integration with memory;
- implementation objectives: correctness, high performance, and minimal
  extra hardware;
* fundamental challenge: maintaining the illusion of order required
* by coherence and by the memory consistency model;
- for coherence, the implementation should ensure that stale copies are
  invalidated (or updated) on writes, and it should provide write
  serialization;
- deadlock vs. livelock;
* main problem: even though the bus is atomic, the set of actions needed
* to satisfy a memory operation uses other resources as well (such as
* cache controllers) and is NOT atomic;
- during a block transfer from cache to memory, the cache controller:
  - asserts its request for the bus;
  - waits for the bus grant;
  - drives the address and command lines;
  - waits for the command to be accepted by the relevant device;
  - transfers the data;
- two controllers per cache: a bus-side controller and a processor-side
  controller;
* to support concurrent accesses by the two controllers, a coherent
* cache design may use dual-ported RAM for the tags and state, or it
* may duplicate the tag and state for every block; of course,
* modifications to the tags are serialized/synchronized;
- for snooping caches, each cache must check the address against its
  tags, and the collective result must be reported on the bus before the
  transaction can proceed;
- one purpose of the snoop result is to inform main memory whether it
  should respond to the request or not (some cache may be holding a
  modified copy); the questions are: WHEN is the snoop result reported,
  and IN WHAT form;
- WHEN:
  1) within a fixed number of clock cycles from the issue of the address
     on the bus;
     - advantages: the design of main memory is not affected, and the
       cache-to-cache handshake is very simple;
     - disadvantages: extra hardware and potentially larger snoop latency;
  2) variable-delay snoop: main memory assumes that one of the caches
     will supply the data until all cache controllers have snooped and
     indicated otherwise;
  3) main memory maintains a bit per block that indicates whether the
     block is modified in one of the caches;
- usually, wired-OR signal lines are used to pass snoop information
  between cache controllers (p. 384);
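- a minimal C sketch of how the collective snoop result could be formed
  over wired-OR lines (shared, dirty, inhibit); the type names and the
  three-way decision below are illustrative assumptions, not taken from a
  specific bus:

      /* Combine per-cache snoop replies over wired-OR bus lines.
       * Names and encoding are illustrative only. */
      #include <stdbool.h>
      #include <stdio.h>

      typedef struct {
          bool shared;   /* cache holds a clean copy of the block   */
          bool dirty;    /* cache holds the block in modified state */
          bool inhibit;  /* cache has not finished its snoop yet    */
      } SnoopReply;

      /* Wired-OR: a bus line is asserted if any controller asserts it. */
      static SnoopReply combine_snoop(const SnoopReply *replies, int n)
      {
          SnoopReply lines = { false, false, false };
          for (int i = 0; i < n; i++) {
              lines.shared  |= replies[i].shared;
              lines.dirty   |= replies[i].dirty;
              lines.inhibit |= replies[i].inhibit;
          }
          return lines;
      }

      int main(void)
      {
          SnoopReply replies[3] = {
              { .shared = false, .dirty = false, .inhibit = false },
              { .shared = false, .dirty = true,  .inhibit = false }, /* modified copy  */
              { .shared = false, .dirty = false, .inhibit = true  }, /* still snooping */
          };
          SnoopReply r = combine_snoop(replies, 3);

          if (r.inhibit)
              puts("transaction extended: some cache has not finished snooping");
          else if (r.dirty)
              puts("a cache will supply the block; memory aborts its fetch");
          else
              puts("memory responds; requester loads the block clean");
          return 0;
      }
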
- write-backs complicate the implementation (two bus transactions) and are
  usually dealt with by using a write-back buffer (additional storage that
  works like a victim cache);
- even when bus transactions are atomic, the whole operation of servicing
  a memory request is NOT;
- a convenient way to deal with the nonatomic nature of state transitions,
  and to revise requests and actions based on observed events, is to
  extend the protocol state diagram with intermediate or transient states;
  the original states in MESI are then called stable states;
* if a processor is allowed to return from a write after the local cache
* has been updated but before the controller acquires the bus for the
* transaction, this complicates coherence (if the transactions are to the
* same block) as well as SC (if they are to different blocks);
- there is no need to wait for the whole transaction; the processor can
  proceed with reads and writes once the transaction is on the bus; this
  is sufficient for coherence and SC and is what makes it possible to
  implement them with pipelined buses, multilevel memory hierarchies, and
  multiple outstanding operations per processor;
- implementing atomic operations: overall, traffic and locality
  considerations tend to dominate, and lock variables are usually
  cacheable, so processors can busy-wait on local copies;

- Multilevel Cache Hierarchies
  ----------------------------
- one obvious way to handle multilevel caches is to have independent bus
  snooping hardware for each level; this is bad for several reasons:
  - L1 is usually on the processor chip, and an on-chip snooper would
    consume precious pins to monitor the shared bus;
  - duplicating the tags may consume too much on-chip real estate;
  - duplication of effort (L2 already contains L1);
- inclusion property:
  - L2 contains everything in L1;
  - if a block is in the modified state in L1, its copy in L2 is also
    marked modified;
- with the inclusion property (as extended above), only a snooper at L2
  is necessary, as it has all the information needed;
- now we have information flowing in both directions:
  - L1 accesses L2 for cache miss handling and block state changes;
  - L2 forwards to L1 blocks invalidated/updated by bus transactions;
- problems that make maintaining the inclusion property (IP) difficult:
  - L1 and L2 may have different replacement algorithms: while a block is
    kept by L1, it may be evicted by L2;
  - separate data and instruction caches;
  - different cache block sizes;
* fortunately, in one of the most commonly encountered cases, inclusion
* is maintained automatically: L1 is direct mapped, L2 is either direct
* mapped or set associative with any replacement policy as long as the
* new block brought in is put in both L1 and L2, the block size is the
* same for both caches, and the number of sets in L1 is smaller than in
* L2;
- however, this does not always hold; instead, inclusion is maintained
  explicitly by extending the mechanisms used for propagating coherence
  events in the hierarchy of caches;
- information about invalidations/updates can be passed from L2 to L1 in
  two different ways: send all bus transactions to L1 (even if the given
  block is not present there), or keep some state at L2 (a bit per block)
  to identify which blocks in L2 are also present in L1;
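- a small C sketch of the second approach (a per-block "in L1" bit at L2),
  showing how the L2 bus-side controller might decide whether to forward a
  bus invalidation upward; structure and field names are hypothetical:

      /* L2 decides whether to forward a bus invalidation to L1,
       * assuming inclusion and one "in-L1" bit per L2 block. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } State;

      typedef struct {
          uint64_t tag;
          State    state;
          bool     in_l1;  /* set when the block was also filled into L1 */
      } L2Line;

      /* Called when an invalidation for a block with tag addr_tag is
       * observed on the bus.  Returns true if L1 must be told as well. */
      static bool l2_handle_bus_invalidate(L2Line *line, uint64_t addr_tag)
      {
          if (line->tag != addr_tag || line->state == INVALID)
              return false;            /* not present: nothing to do      */

          line->state = INVALID;       /* invalidate the L2 copy          */
          if (line->in_l1) {
              line->in_l1 = false;
              return true;             /* forward the invalidation upward */
          }
          return false;                /* L2 filtered the transaction     */
      }

      int main(void)
      {
          L2Line line = { .tag = 0x42, .state = SHARED, .in_l1 = true };
          if (l2_handle_bus_invalidate(&line, 0x42))
              puts("invalidation forwarded to L1");
          else
              puts("invalidation filtered at L2");
          return 0;
      }
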
- the same tradeoff occurs for writes to L1: L1 can be made write-through
  (so all modifications also reach L2), or blocks in L2 can be marked
  "modified-but-stale", indicating that a newer version must be fetched
  from L1 if the block has to be supplied as part of a bus transaction
  (which is handled by L2);
- p. 397 describes transaction propagation examples, both downward and
  upward;
* interestingly, dual tags are less critical when we have multilevel
* caches, as L2 serves as a filter for the L1 cache, screening out
* irrelevant transactions from the bus;

- Split-transaction Bus
  ---------------------
- in a split-transaction bus (STB), transactions that require a response
  are split into two independent subtransactions: a request transaction
  and a response transaction; the advantage, of course, is that by
  pipelining bus operations the bus is utilized more efficiently; the
  disadvantage is increased complexity;
- major issues when supporting STBs:
  - a new request can appear on the bus before the snoop and/or servicing
    of an earlier request are complete; in particular, these may be
    conflicting requests (to the same block);
  - the number of buffers for incoming requests and potential data
    responses from the bus to a cache controller is usually fixed and
    small; FLOW CONTROL IS THUS NEEDED;
  - since requests from the bus are buffered, we need to revisit the
    issue of when and how snoop responses and data responses are produced
    on the bus;
- going back to the definition of snooping, there are three phases in a
  transaction: a request is put on the bus, snoop results are sent by the
  other caches (to communicate the presence of the requested block in a
  state that the requester needs to know about), and finally data is sent
  to the requesting cache, if needed;
- SGI Challenge:
  - eight outstanding requests at a time;
  - no conflicting requests can be issued simultaneously;
  - flow control for the buffers is implemented with negative
    acknowledgments (NACKs);

<\begin SGI Challenge>
- the STB essentially consists of two separate buses: a request bus for
  command and address and a response bus for data;
- each request granted the bus is assigned a unique tag (a 3-bit buffer
  location);
- a response consists of data on the response bus together with the tag
  value on the 3-bit-wide tag lines;
- cache blocks are 128 bytes while the data bus is 256 bits wide, so four
  bus cycles plus a one-cycle turnaround are required for the response
  phase;
- a uniform pipeline strategy is followed, so the request phase is also
  five cycles: arbitration, resolution, address, decode, and
  acknowledgment; overall, a complete request-response transaction takes
  three or more of these five-cycle phases: an address request phase, a
  data request phase (which uses the data bus arbitration logic and
  obtains access to the data bus for the response subtransaction), and a
  data transfer or response phase (which uses the data bus); three
  different memory operations can be in the three different phases at the
  same time;
- to keep track of the eight outstanding requests on the bus, each cache
  controller maintains an eight-entry table, called a request table;
  whenever a new request is issued on the bus, it is added to all request
  tables at the same index as part of the arbitration process; the index
  is the 3-bit tag assigned to that request;
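- a C sketch of such an eight-entry request table and the conflict check a
  controller could perform before issuing a request; field names and the
  allocation policy are illustrative assumptions:

      /* Per-controller request table on a split-transaction bus:
       * the 3-bit tag is simply the table index. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_ENTRIES 8   /* up to eight outstanding requests */

      typedef struct {
          bool     valid;
          uint64_t block_addr;  /* block-aligned address of the request */
      } RequestEntry;

      typedef struct {
          RequestEntry entry[NUM_ENTRIES];
      } RequestTable;

      /* True if some outstanding transaction already targets this block. */
      static bool conflicts(const RequestTable *t, uint64_t block_addr)
      {
          for (int tag = 0; tag < NUM_ENTRIES; tag++)
              if (t->entry[tag].valid && t->entry[tag].block_addr == block_addr)
                  return true;
          return false;
      }

      /* Allocate a free entry for a request that won arbitration; every
       * controller performs the same allocation, so all tables stay in
       * sync.  Returns the 3-bit tag, or -1 if all eight slots are busy. */
      static int allocate(RequestTable *t, uint64_t block_addr)
      {
          for (int tag = 0; tag < NUM_ENTRIES; tag++) {
              if (!t->entry[tag].valid) {
                  t->entry[tag].valid = true;
                  t->entry[tag].block_addr = block_addr;
                  return tag;
              }
          }
          return -1;
      }

      int main(void)
      {
          RequestTable table = { 0 };
          int tag = allocate(&table, 0x1000);
          printf("request for 0x1000 got tag %d\n", tag);
          printf("second request to 0x1000 conflicts: %s\n",
                 conflicts(&table, 0x1000) ? "yes" : "no");
          return 0;
      }
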
- avoiding conflicting requests is easy: since every controller has a
  record of the pending transactions issued on the bus in its request
  table, no request is issued for a block that already has a transaction
  outstanding;
- each controller sees all the requests made by the other controllers
  through the shared bus; therefore, if controller C1 wants a block for
  which there is already an entry in the request table (made by another
  controller), it may decide to "snoop" the response to that other
  request, asserting the "shared" signal when the data is transferred so
  that the originator knows the block is now shared;
* suppose a processor issues a BusRd request on the bus. What should the
* other cache controllers and the main memory controller do? The request
* is entered in the request table of all controllers, and they start
* checking their caches for the requested memory block. The main memory
* subsystem has no idea whether the block is dirty in one of the caches,
* so it independently starts fetching the block. There are three
* scenarios to consider (a sketch of the memory-side decision appears a
* few items below):
*   - one of the caches may determine that it has the block in modified
*     state and may acquire the bus to generate a response before main
*     memory can respond; on seeing this response, the memory controller
*     aborts its fetch; another cache controller may use the inhibit line
*     to extend the response transaction, while main memory asserts the
*     NACK signal if it does not have the buffer space needed to pick up
*     the modified value from the bus;
*   - main memory may fetch the data and acquire the bus before the cache
*     controller holding the block dirty has finished its snoop and/or
*     acquired the bus. That controller will first assert the inhibit
*     line until it has finished its snoop, then assert the dirty line
*     and release the inhibit line, indicating to memory that it has the
*     latest copy; memory then cancels its response transaction and does
*     not put the data on the bus;
*   - the simplest scenario is that no other cache has the block dirty:
*     main memory acquires the bus and generates the response;
- once a BusRdX or BusUpgr has obtained the bus, the associated write is
  committed. However, commitment of a write does NOT guarantee that the
  value produced by the write is already visible to all other processors;
  only actual completion guarantees that. Further mechanisms are needed to
  preserve SC;
- Example 6.3 is a good example of a possible violation of SC;
- condition necessary for SC in this case: a processor should not be
  allowed to actually see the new value due to a write before previous
  writes (in bus order, as usual) are visible to it; this can be done in
  two ways (discussed in more detail on page 408):
  - not letting certain types of incoming transactions from bus to cache
    be reordered in the incoming queues;
  - allowing these reorderings in the queues, but then ensuring that the
    important orders are preserved at the necessary points in the
    machine;
<\end SGI Challenge>
- a simpler approach is to treat all requests in FIFO order. Although
  this approach is simpler, it can have performance problems;
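- referring back to the three BusRd scenarios above, a simplified C model
  of the main-memory controller's decision, assuming wired-OR inhibit and
  dirty lines; this is an illustrative sketch, not the actual SGI logic:

      /* Memory controller's reaction to a BusRd, driven by the
       * wired-OR inhibit and dirty lines (simplified model). */
      #include <stdbool.h>
      #include <stdio.h>

      typedef enum {
          MEM_WAIT,      /* inhibit asserted: keep waiting, fetch speculatively */
          MEM_CANCEL,    /* a cache owns the block dirty: cancel the response   */
          MEM_RESPOND    /* no dirty copy anywhere: memory supplies the data    */
      } MemAction;

      static MemAction memory_on_busrd(bool inhibit_line, bool dirty_line)
      {
          if (inhibit_line)
              return MEM_WAIT;     /* some cache has not finished its snoop    */
          if (dirty_line)
              return MEM_CANCEL;   /* owner will generate the response instead */
          return MEM_RESPOND;
      }

      int main(void)
      {
          /* Second scenario from the notes: the owner first holds inhibit,
           * then releases it and asserts dirty once its snoop completes.  */
          printf("%d\n", memory_on_busrd(true,  false));  /* MEM_WAIT    */
          printf("%d\n", memory_on_busrd(false, true));   /* MEM_CANCEL  */
          printf("%d\n", memory_on_busrd(false, false));  /* MEM_RESPOND */
          return 0;
      }
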
- STB with Multilevel Caches (Challenge-like machine)
  ---------------------------------------------------
- usual scenario:

        +---------------+
        |   Processor   |
        +-------+-------+
                |
        +-------+-------+
        |      L1       |
        +-------+-------+
                |
        +-------+-------+
        |      L2       |
        +-------+-------+
                |
        --------+-------------------- BUS

- to maintain high bandwidth while allowing the individual units (e.g.,
  controllers and caches) to operate at their own rates, queues are placed
  between the levels of the hierarchy; this raises locking and
  serialization issues;
- buffer deadlock can occur if L1 is write-back, for example: with a
  write-back L1, data flows in both directions (L2->L1 and L1->L2), and
  both levels may end up waiting for each other with full buffers; this
  does not occur if, for example, L1 is not write-back;

- Case studies: SGI Challenge and Sun Enterprise 6000
  ---------------------------------------------------
<\begin SGI Challenge>
- up to 18/36 MIPS processors;
- Powerpath-2 bus with a peak bandwidth of 1.2 GB/s;
- interleaved memory accessible through the bus, with uniform access time;
- Powerpath-2 (PP2) is nonmultiplexed, having a 256-bit-wide data portion
  and a separate 40-bit-wide address portion, plus command and other
  signals; it is clocked at 47.6 MHz and is a split-transaction design
  supporting eight outstanding read requests;
- the bus supports 16 slots, 9 of which can be used by 4-processor boards;
- the L2 block size is 128 bytes, i.e., a block can be transferred in 4
  bus cycles;
- total of 329 signals:
  - 256 data;
  - 8 data parity;
  - 40 address;
  - 8 command;
  - 2 address + command parity;
  - 8 data resource ID;
  - 7 miscellaneous;
- all transactions take 5 cycles: arbitration, resolution, address,
  decode, and acknowledge; when no transactions are occurring, each bus
  controller drops into a two-state idle machine;
- since the bus is an STB, the address and data buses must be arbitrated
  for separately;
- during the arbitration cycle, the 48 address+command lines are used for
  arbitration. The lower 16 of these lines are used by the 16 boards to
  request the data bus, and the middle 16 lines are used for address bus
  arbitration. For transactions that require both at the same time (e.g.,
  write-backs), the corresponding bits for both uses can be set high. The
  top 16 lines are used to make urgent, higher-priority requests;
- at the end of the arbitration cycle, all bus interface ASICs capture the
  48-bit state of the address+command lines and thus see the bus requests
  from all boards; a distributed arbitration scheme is used, and the
  results are computed during the resolution cycle (a sketch follows after
  the cycle-by-cycle description);
- during the address cycle, the address bus winner drives the address and
  command buses with the corresponding information. Simultaneously, the
  data bus winner drives the data resource ID corresponding to its
  response (the data resource ID is the 3-bit global tag assigned to the
  read request when it was issued on the bus);
- during the decode cycle, no signals are driven on the address bus. Each
  bus interface slot decides how to respond to the transaction. In
  addition, all slots prepare to supply the proper cache coherence
  information;
- during the acknowledge cycle, each bus interface slot responds to the
  data/address bus transaction. The 48 bits are used by the controllers to
  provide feedback, including 16 bits that indicate, for each cache,
  whether the requested block is present or not (used to determine the
  state in which the block will be loaded by the requestor);
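- a C sketch of the idea behind distributed arbitration: every board
  captures the same request bits and runs the same deterministic function,
  so all boards agree on the winner without a central arbiter; the
  round-robin priority used here is an assumption for illustration, not
  the documented Powerpath-2 policy:

      /* Distributed arbitration sketch: all boards evaluate the same
       * function on the same captured request bits. */
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_BOARDS 16

      /* Pick the winner among the requesting boards, scanning round-robin
       * starting just after the previous winner.  Urgent requests (the top
       * 16 lines in the notes) are considered first. */
      static int resolve(uint16_t urgent_req, uint16_t normal_req, int last_winner)
      {
          uint16_t candidates = urgent_req ? urgent_req : normal_req;
          if (!candidates)
              return -1;                       /* nobody wants the bus */
          for (int i = 1; i <= NUM_BOARDS; i++) {
              int board = (last_winner + i) % NUM_BOARDS;
              if (candidates & (1u << board))
                  return board;
          }
          return -1;                           /* unreachable */
      }

      int main(void)
      {
          /* Boards 3 and 9 request the address bus; board 5 won last time. */
          uint16_t req = (1u << 3) | (1u << 9);
          printf("winner: board %d\n", resolve(0, req, 5));   /* board 9 */
          return 0;
      }
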
- once a slot has become the master (the source of the data), the 128
  bytes of data are transferred in four consecutive cycles over the
  256-bit-wide datapath. This four-cycle sequence begins with the
  acknowledge cycle and ends at the address cycle of the following
  five-cycle bus phase;
- each processor board has:
  - one address chip (A-chip);
  - one cache coherence chip (CC-chip) PER PROCESSOR;
  - four bit-sliced data chips (D-chips) that interface with the data
    bus; they are quite simple and are shared among the processors;
- for a request satisfied by main memory, the data appears on the bus 12
  cycles after the address appeared (250 ns);
- L2 cache miss penalty: approx. 1 us (explained in the book);
- uses the Illinois MESI protocol by default;
- I/O devices are connected through a second bus, the HIO I/O bus
  (64 bits wide, 320 MB/s), which forms the PowerChannel-2 I/O subsystem;
- the I/O subsystem interfaces to the system data bus just like another
  processor board: DMA reads and writes look like processor operations
  and are handled in the same way;
<\end SGI Challenge>

<\begin Sun Enterprise 6000>
- up to 30 UltraSparc processors with the Gigaplane system bus of
  2.67 GB/s;
- even though some memory is physically local to a pair of processors,
  all of memory is accessed through the bus and hence has uniform access
  time;
- Sun Gigaplane bus: nonmultiplexed STB with 256-bit data lines and
  41-bit physical addresses, clocked at 83.5 MHz; the bus can support up
  to 112 outstanding transactions, including up to 7 from each board;
- the bus has 388 signals:
  - 256 data;
  - 32 ECC;
  - 43 for address with parity;
  - 7 for ID tag;
  - 18 for arbitration;
  - 32 for configuration;
- the bus protocol is very different from the Challenge's, using
  collision-based speculative arbitration techniques to avoid the cost of
  bus arbitration;
- if a request collision occurs, the requestor that wins simply drives
  the address again in the next cycle, as it would with conventional
  arbitration (this is sort of an optimistic protocol);
- five cycles after the address is on the bus, all boards assert their
  snoop signals (shared, owned, mapped, and ignore) on the state bus
  lines;
- as in the Challenge, invalidations are ordered by the BusRdX requests
  appearing on the address bus and are handled in FIFO fashion by the
  cache subsystem;
- although the UltraSparc implements a five-state MOESI protocol in the
  L2 caches, the D-tags (fig. 6.19) maintain an approximation using only
  three states: owned, shared, and invalid; that is the minimum
  information needed to know when to place data on the bus (see the
  sketch below);
<\end Sun Enterprise 6000>
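- a C sketch of how such a three-state D-tag might approximate the five
  MOESI states, keeping just enough information to know when the board
  must supply data; the exact mapping is an assumption consistent with the
  notes, not the documented Enterprise implementation:

      /* Approximate five MOESI states with three D-tag states. */
      #include <stdio.h>

      typedef enum { M_MODIFIED, M_OWNED, M_EXCLUSIVE, M_SHARED, M_INVALID } Moesi;
      typedef enum { D_OWNED, D_SHARED, D_INVALID } Dtag;

      static Dtag dtag_of(Moesi s)
      {
          switch (s) {
          case M_MODIFIED:
          case M_OWNED:
          case M_EXCLUSIVE:   /* exclusive clean: this cache could silently go
                                 to M, so the board is tracked as "owned"      */
              return D_OWNED;
          case M_SHARED:
              return D_SHARED;
          default:
              return D_INVALID;
          }
      }

      int main(void)
      {
          const char *name[] = { "owned", "shared", "invalid" };
          printf("MODIFIED  -> %s\n", name[dtag_of(M_MODIFIED)]);
          printf("EXCLUSIVE -> %s\n", name[dtag_of(M_EXCLUSIVE)]);
          printf("SHARED    -> %s\n", name[dtag_of(M_SHARED)]);
          return 0;
      }
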
- advantages of having multiple processors share a single L1 cache (the
  same idea applies to a shared L2):
  - no coherence protocol is needed at this level;
  - reduces the latency of communication;
  - "prefetching" can occur across processors (P1 misses on a block that
    is also going to be used by P2);
  - more effective use of long cache blocks, as there is no false
    sharing;
  - if the working sets of different processors overlap, the shared cache
    for n processors can be smaller than n independent caches combined;
- today, shared L1 caches are being investigated for single-chip
  multiprocessors; the tradeoff: is a double-P4 better than a single P5?
  As the number of transistors on a chip increases, both alternatives
  become possible (multiprocessor vs. a more powerful processor);
- disadvantages of sharing L1:
  - higher bandwidth demand;
  - hit latency to a shared cache is higher than to a private one;
  - higher cache complexity;
  - since a shared cache needs to be larger than a single-processor
    cache, it is usually slower;
  - instead of constructive interference (as in the working-set example),
    destructive interference can occur;
- virtually indexed caches:
  - the problem of the same VA referring to different data across
    processes can be handled by flushing the cache on context switches or
    by associating address-space identifiers (ASIDs) with cache blocks;
  - the other problem is synonyms: different VAs that refer to the same
    physical address;
  - one available solution is to use VAs to look up the cache but to put
    PAs on the bus; to support this, caches need to maintain both VA and
    PA tags, pointing to each other (p. 438);

- TLB coherence
  -------------
- a PTE can come to reside in the TLBs of multiple processors, so there
  is a coherence problem if one of the copies is modified; solutions for
  TLB coherence vary, some working entirely through the OS, in software;
- four approaches to handling TLB coherence:
  1) TLBs are not needed if virtually addressed caches are used. If cache
     misses are rare, the page table can be used directly; as PTEs get
     cached, the OS must flush the PTEs from all caches when the memory
     copy is changed (e.g., a protection-bit change);
  2) TLB shootdown: the TLB coherence procedure is invoked by a processor
     (the "initiator") when it makes changes to PTEs that may be cached
     by other TLBs; an interrupt is sent to the other processors, which
     then examine their TLBs (see the sketch at the end of these notes);
  3) some processors, like the MIPS, use software-loaded TLBs. The
     coherence problem can be addressed using ASIDs, avoiding the need to
     flush the entire TLB on a context switch;
  4) some processors provide hardware instructions to invalidate entries
     in other processors' TLBs;
- section 6.6.4 discusses cache coherence on rings; SC is a little
  trickier there, since processors may see a pair of transactions in
  different orders;
* scaling data bandwidth is easy; the real challenge is scaling the
* snoop bandwidth;
- some alternatives for scaling data bandwidth: wider buses, multiple
  buses;
* the general solution to building scalable cache-coherent machines is
* to distribute memory physically among the nodes and use a scalable
* interconnect, together with coherence protocols that do not rely on
* snooping (chap. 8); it is difficult to predict what the future holds
* for buses and the scale at which they will be used, although they are
* likely to play an important role for some time to come;
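- a simplified, single-threaded C sketch of TLB shootdown (approach 2
  above); the locking, PTE-write, TLB-invalidate, and IPI primitives are
  hypothetical stand-ins for what an OS and architecture would provide:

      /* TLB shootdown sketch: initiator updates a PTE, interrupts the
       * other processors, and waits for all of them to acknowledge. */
      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      #define NUM_CPUS 4

      static volatile bool ack[NUM_CPUS];
      static volatile uint64_t pending_va;

      /* Hypothetical primitives (stubs so the sketch compiles and runs). */
      static void lock_page_table(void)                 { }
      static void unlock_page_table(void)               { }
      static void write_pte(uint64_t va, uint64_t pte)  { (void)va; (void)pte; }
      static void local_tlb_invalidate(uint64_t va)     { (void)va; }

      /* Interrupt handler run on each processor that receives the IPI. */
      static void tlb_shootdown_handler(int cpu)
      {
          local_tlb_invalidate(pending_va);
          ack[cpu] = true;
      }

      /* In this single-threaded sketch the "IPI" just calls the handler
       * directly; on real hardware it is delivered asynchronously. */
      static void send_ipi(int cpu)
      {
          printf("IPI -> cpu %d\n", cpu);
          tlb_shootdown_handler(cpu);
      }

      /* Initiator: update the PTE, interrupt the other processors, and
       * wait until all have flushed the stale entry from their TLBs. */
      static void tlb_shootdown(int self, uint64_t va, uint64_t new_pte)
      {
          lock_page_table();
          write_pte(va, new_pte);
          local_tlb_invalidate(va);
          pending_va = va;

          for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
              ack[cpu] = (cpu == self);
              if (cpu != self)
                  send_ipi(cpu);
          }
          for (int cpu = 0; cpu < NUM_CPUS; cpu++)
              while (!ack[cpu])
                  ;                         /* spin until everyone responds */
          unlock_page_table();
      }

      int main(void)
      {
          tlb_shootdown(0, 0x1000, 0);      /* e.g., revoke a mapping */
          puts("all TLBs consistent");
          return 0;
      }
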