Chapter 5: Shared Memory Multiprocessors
----------------------------------------
- symmetric multiprocessor (SMP): global physical address space and symmetric access to all of main memory;
- organization of the memory hierarchy:
  - shared cache: one cache for all processors. Used for small numbers of processors (2-8);
  - bus-based shared memory: each processor has its own cache and accesses main memory through a common bus. Widely used for small- to medium-scale machines of up to 20-30 processors. It is the dominant architecture in the market today. The bus determines the scaling limit;
  - dancehall: similar to bus-based shared memory, but memory is divided into multiple banks;
  - distributed memory: each processor has its own cache and a local portion of main memory (NUMA). Despite its asymmetric memory access time, this model improves over dancehall if locality is exploited;
- cache coherence: pervasive and performance critical. A read operation should return the last value written, independently of which processor executed the write;
- cache coherence problems (CHPs) occur with both write-through and write-back caches (worst with WB caches);
- CHPs arise even in uniprocessors when DMA is used to perform I/O operations; one solution is to mark the segments of VM used for DMA as uncacheable; another option is to flush the cache before DMA takes place;
- virtually all microprocessors today provide support for cache coherence;
- the appearance of a total, serial order on the operations to a memory location is what we expect from any coherent memory system;
- a multiprocessor memory system is coherent if:
  - operations issued by any particular process occur in the order in which they were issued to the memory system by that process;
  - the value returned by each read operation is the value written by the last write to that location in the serial order;
- write propagation: writes become visible to other processors;
- write serialization: all processors see the writes to a location in the same order;
- cache coherence through bus snooping:
  - processors can "snoop" on the bus and be aware of all memory operations;
  - a snooping cache controller may take action if a bus transaction is relevant to its state;
  - the operations are seen by all processors in the same (bus) order;
- easiest way to provide snooping: use write-through caches (all writes generate bus transactions);
- invalidation-based vs. update-based protocols;
- a bus transaction consists of three phases: arbitration, command/address, and data;
- arbitration: the bus arbiter selects one of the requesting processors and thereby creates the serial order of transactions;
- every cache block has a state associated with it (valid, invalid, dirty, etc.);
- in multiprocessor machines, a cache block has one state in each processor's cache;
- to show that a protocol provides coherence, we need to show that for any execution under the protocol a total order on the memory operations to each location can be constructed that satisfies the program order and write serialization conditions;
- however, using write-through caches is not very realistic, as most multiprocessors use write-back caches to enhance performance;
* coherence says nothing about when a write will become visible. For example, suppose program P1 writes both variables a and b. Cache coherence does not guarantee that the new value of a will be seen by other processors before the new value of b. Coherence also says nothing about the order in which the reads issued by P2 to different locations are performed with respect to P1;
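- a minimal C sketch (my illustration, not from the chapter) of this last point; C11 relaxed atomics roughly model "coherence only": each location gets a total order of writes, but writes to different locations are not ordered with respect to each other, so P2 may observe the new b and still the old a:

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      atomic_int a = 0, b = 0;

      void *p1(void *arg) {
          atomic_store_explicit(&a, 1, memory_order_relaxed);  /* write a first */
          atomic_store_explicit(&b, 1, memory_order_relaxed);  /* then write b  */
          return NULL;
      }

      void *p2(void *arg) {
          int rb = atomic_load_explicit(&b, memory_order_relaxed);
          int ra = atomic_load_explicit(&a, memory_order_relaxed);
          /* coherence alone allows rb == 1 && ra == 0: the new value of b
             became visible to P2 before the new value of a */
          printf("a=%d b=%d\n", ra, rb);
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          pthread_create(&t1, NULL, p1, NULL);
          pthread_create(&t2, NULL, p2, NULL);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);
          return 0;
      }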
* a memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e., to become visible to the processors) with respect to one another. This includes operations to the same location or to different ones. In this sense, memory consistency subsumes coherence;
* a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program;
- conditions for sequential consistency (SC):
  - program order requirement: memory operations of a process must appear to become visible (to itself and to others) in program order;
  - the total order of operations must be consistent for all processes, which requires that the operations appear atomic;
- the tricky part of the second requirement is making writes atomic:
  - write atomicity: the position in the total order at which a write appears to perform should be the same with respect to all processes. Write atomicity extends the write serialization property needed for cache coherence (here the writes need not be to the same memory location);
- sequentially consistent execution: some interleaving (total order) of the operations yields the same result;
- sequentially consistent system: any possible execution is sequentially consistent;
* sufficient conditions for preserving SC:
  * every process issues memory operations in program order;
  * after a write operation is issued, the issuing process waits for the write to complete before issuing its next operation;
  * after a read operation is issued, the issuing process waits for the read to complete, and for the write whose value is being returned by the read to complete, before issuing its next operation. That is, if the write whose value is being returned has performed with respect to this processor, the processor should wait until that write has performed with respect to all processors;
- the third condition is the most demanding when implementing SC;
- some compiler optimizations violate the three conditions for SC explained above. The problem in parallel programs is that the out-of-order processing of operations to different shared variables by one process can be detected by (and affect) other processes;
- preserving the sufficient conditions for SC in multiprocessors is quite a strong requirement, since it limits compiler reordering and out-of-order processing techniques;
* in this chapter, the authors assume that the compiler does not reorder memory operations;
- if a cache does not have exclusivity for a data block, it cannot write a new value into the block before first putting a transaction on the bus to communicate with the others;
- when a cache does not have exclusivity, a write operation is always treated as a write miss, in order to generate the required bus transaction;
- if a cache has the block in the modified (dirty) state, it is clearly the owner and it has exclusivity;
- a read-exclusive operation is used to read a block and acquire exclusivity at the same time;
- because of locality of reference, invalidation-based protocols have been found to be more robust;

A Three-State (MSI) Write-back Invalidation Protocol
----------------------------------------------------
- cache entry states: modified, shared, or invalid (MSI);
- a processor issues reads (PrRd) and writes (PrWr). The bus allows the transactions bus read (BusRd), bus read exclusive (BusRdX), and bus write back (BusWB);
- BusRd: generated by a PrRd that misses in the cache;
- BusRdX: generated by a PrWr to a block that is either not in the cache, or is in the cache but not in the modified state;
- BusWB: generated by the controller on a write back;
- if a cache has the block in the modified state (owner) and observes a BusRd transaction on the bus, it flushes the data onto the bus, in lieu of memory, and demotes its copy of the block to the shared state. The memory and the requesting cache both pick up the block;
- a write miss generates a read-exclusive bus transaction, which causes the other copies to be invalidated;
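- a simplified sketch (my own C rendering, not the book's state-transition tables) of the MSI controller's behavior for one block; bus transactions are just printed, and data transfers appear only as comments:

      #include <stdio.h>

      typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

      /* processor-side events for one cache block */
      msi_state_t pr_read(msi_state_t s) {
          if (s == INVALID) {
              printf("BusRd\n");      /* read miss: fetch the block */
              return SHARED;
          }
          return s;                   /* SHARED or MODIFIED: hit, no bus traffic */
      }

      msi_state_t pr_write(msi_state_t s) {
          if (s != MODIFIED)
              printf("BusRdX\n");     /* acquire exclusivity; other copies invalidate */
          return MODIFIED;
      }

      /* bus-side (snooped) events for the same block */
      msi_state_t bus_rd(msi_state_t s) {
          if (s == MODIFIED) {
              printf("Flush\n");      /* owner supplies the data in lieu of memory */
              return SHARED;          /* demote to shared */
          }
          return s;
      }

      msi_state_t bus_rdx(msi_state_t s) {
          if (s == MODIFIED)
              printf("Flush\n");      /* supply dirty data before giving up the block */
          return INVALID;             /* another cache wants exclusivity */
      }

      int main(void) {
          msi_state_t s = INVALID;
          s = pr_read(s);             /* BusRd,  INVALID  -> SHARED   */
          s = pr_write(s);            /* BusRdX, SHARED   -> MODIFIED */
          s = bus_rd(s);              /* Flush,  MODIFIED -> SHARED   */
          return 0;
      }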
A Four-State (MESI) Write-back Invalidation Protocol
----------------------------------------------------
- a fourth state, exclusive-clean (or simply exclusive), is introduced relative to MSI; it indicates an intermediate level between shared and modified: it implies exclusivity but not ownership (memory has a valid copy), and in this state the cache need not respond to requests for the block;
- when a block is first read, it enters the cache in the shared (S) or exclusive (E) state, depending on whether another cache also has a copy of the block;
- this requires an additional wired-OR bus line, the shared signal, which other caches assert if they hold a copy; it indicates whether any other cache currently holds the data;

A Four-State (Dragon) Write-back Update Protocol
------------------------------------------------
- proposed by researchers at Xerox PARC for their Dragon machine;
- this is an UPDATE-BASED protocol;
- four states: exclusive-clean (E), shared-clean (Sc), shared-modified (Sm), and modified (M);
- Sm means that potentially two or more caches have this block, main memory is not up to date, and it is this cache's responsibility to update main memory when the block is replaced from the cache (this cache is the owner);
- a block can be in the Sm state in only one cache at a time (remember that this is an update-based protocol);
- there is no invalid state, since Dragon is an update-based protocol;
- the transitions of a cache coherence protocol generate traffic on the data bus; the overall traffic of representative applications gives an estimate of how many processors a given machine can handle;
- cache-coherent MPs introduce coherence misses, which occur when blocks of data are shared among multiple caches:
  - true sharing vs. false sharing misses;
- technology pushes in the direction of large cache block sizes (e.g., DRAM organizations and access modes, and the need to amortize transfer overhead over high-bandwidth data transfers);
? the only way to reduce true sharing misses is to increase the block size and to increase the spatial locality of communicated data;
- cache blocks that are too large cause:
  - unneeded data prefetching;
  - increased conflict misses;
  - increased false sharing;
- subblocks can be used to decrease false sharing;
- another solution is to use a small block size and more aggressive prefetching;
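- a small C illustration of false sharing (mine, not from the text; a 64-byte cache block is assumed): the two counters are logically private, but without the padding they would share one block, so each write would invalidate the other processor's copy:

      #include <pthread.h>

      #define CACHE_BLOCK 64                      /* assumed block size */

      struct padded_counter {
          volatile long value;                    /* volatile keeps the per-iteration writes */
          char pad[CACHE_BLOCK - sizeof(long)];   /* push the next counter into another block */
      };

      struct padded_counter counters[2];          /* without pad, both counters share one block */

      void *worker(void *arg) {
          long id = (long)arg;
          for (long i = 0; i < 100000000L; i++)
              counters[id].value++;               /* private data, but possibly a shared block */
          return NULL;
      }

      int main(void) {
          pthread_t t[2];
          for (long id = 0; id < 2; id++)
              pthread_create(&t[id], NULL, worker, (void *)id);
          for (int id = 0; id < 2; id++)
              pthread_join(t[id], NULL);
          return 0;
      }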
- the performance of update-based vs. invalidation-based protocols is workload-dependent; hardware can implement both and decide which one to use dynamically; this can be done at the page level by assigning bits in the PTEs held in the TLB;
- competitive schemes change the protocol for a block in hardware, at run time, based on observed access patterns;
- alternatives include "counting" updates (after k updates with no local access, the block is locally invalidated) and the use of probabilistic (randomized) algorithms;
- updates are single-word writes rather than full cache block transfers; even so, update traffic is quite substantial;
  - ex.: consecutive writes to the same block, each of which generates an update;
- however, the increased bandwidth demand, the complexity of supporting updates, the trend toward larger cache blocks, and the pack-rat phenomenon with the important case of multiprogrammed sequential workloads underlie the trend away from update-based protocols in the industry;
- synchronization: how synchronization operations can be implemented on a bus-based cache-coherent MP through a combination of software algorithms and hardware primitives;
* the difficulty in implementing high-level synchronization operations in hardware is not the acquire or release component, but the waiting algorithm;
- mutual exclusion:
  - hardware locks;
  - simple software lock algorithms;
  - test&set lock with backoff;
  - test-and-test&set lock;
  - load-locked, store-conditional;
  - ticket lock;
  - array-based lock;
- hardware locks:
  - fixed set of lock lines on the bus;
  - inflexible (limited number of simultaneous locks);
- simple software locks:
  - usually constructed using atomic test&set or compare&swap instructions;
  - the problem with the test&set lock is the traffic generated while waiting: every attempt to get the lock generates a write to the cache block holding the lock variable, which in turn generates a bus transaction;
- enhancements to the simple lock algorithm:
  - test&set with backoff: reduce the frequency with which processes issue test&set instructions;
  - test-and-test&set: busy-wait only with read operations instead of issuing test&sets (writes);
- goals for locks:
  - low latency;
  - low traffic;
  - scalability;
  - low storage cost;
  - fairness;
- load-locked, store-conditional (LL-SC):

      load-locked        Reg, LockVar
      ...
      ...
      store-conditional  LockVar, Reg

  - the SC tries to write the register back to the memory location if and only if no other processor has written to that location since this processor completed its LL instruction;
  - if the SC succeeds, the LL-SC pair executed atomically;
  - like test-and-test&set, LL-SC does not generate bus traffic while busy-waiting. However, LL-SC is not a fair lock.
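- a sketch (mine, using C11 atomics; sched_yield() stands in for a delay loop) of a test-and-test&set lock with exponential backoff; on machines with LL-SC, the compare-exchange below typically compiles to an LL/SC loop:

      #include <stdatomic.h>
      #include <sched.h>

      typedef atomic_int lock_t;                  /* 0 = free, 1 = held */

      void lock_acquire(lock_t *l) {
          int backoff = 1;
          for (;;) {
              /* "test": spin with reads only -- hits in the local cache,
                 so no bus traffic while the lock stays held */
              while (atomic_load(l) != 0)
                  for (int i = 0; i < backoff; i++)
                      sched_yield();              /* stand-in for a delay loop */
              /* "test&set": one atomic read-modify-write to try to grab it */
              int expected = 0;
              if (atomic_compare_exchange_strong(l, &expected, 1))
                  return;                         /* success: lock acquired */
              if (backoff < 1024)
                  backoff *= 2;                   /* lost the race: back off */
          }
      }

      void lock_release(lock_t *l) {
          atomic_store(l, 0);
      }

  - the read-only spin keeps waiting traffic off the bus, and the backoff reduces the burst of write attempts when the lock is released.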
- two undesirable properties remain:
  1) when a lock is released, only one processor should attempt to obtain it (rather than a storm of test&sets);
  2) when a lock is released, a single read miss is better than multiple ones (multiple misses occur even with LL-SC);
- Ticket lock (TL):
  - accomplishes (1);
  - every process gets a different ticket number and busy-waits on a global "now-serving" number;
  - an unlock is done by incrementing the now-serving number;
  - atomic operation used: fetch&increment;
  - not a large difference in traffic between TL and LL-SC;
  - TL shows more fairness than LL-SC (ticket order);
- Array-based lock (ABL):
  - accomplishes both (1) and (2);
  - fetch&increment returns not a ticket number but a unique location on which to busy-wait;
  - the release operation writes an "unlock" value into the next memory location;
  - fair (FIFO);
  - fewer read misses than TL (processes busy-wait on different memory locations);
* the LL-SC lock performs well (best one in the experiments), particularly at lower processor counts, because it is not fair! a process that releases a lock can reacquire it faster than the other processors can (the microbenchmark was a tight acquire/release loop in each processor);
- simple LL-SC locks perform well on buses with sufficient bandwidth; LL-SC gets worse than ABL and TL for more than 16 processors;
- when exponential backoff is used, LL-SC delivers the best average lock transfer time in all cases (again, good performance due to unfairness);
* a shared data structure is said to be "lock-free" if the operations defined on it do not require mutual exclusion over multiple instructions;
* if an operation is guaranteed to finish in a finite amount of time, the data structure is said to be "nonblocking";
* if the operations can guarantee that every (nonfaulting) process will complete its operation in a finite amount of time, the data structure is "wait-free";
- these ideas work like updating a counter using LL-SC;
- LL-SC is used to update the data structure directly, NOT to acquire a lock;

Point-to-Point Event Synchronization
------------------------------------
- software algorithms:
  - flags (binary variables);
  - or use the value of the shared variable itself as a flag;
- hardware support: full-empty bit
  - one bit associated with every word in memory;
  - the producer sets the bit and the consumer clears it;
  - not used much;

Global (Barrier) Event Synchronization
--------------------------------------
- implemented using locks, shared counters, and flags;
- centralized barrier:
  - a counter keeps the number of processes that have arrived;
  - after incrementing the counter, each process checks whether its value is p (the number of processes);
  - if it is the last process, it writes the flag value on which the other processes are busy-waiting;
  - problem: if p-1 processes proceed and reach the barrier a second time while some process Px has not yet seen the flag being set (it was swapped out, for example), the flag is reset for the second instance before Px sees it, so Px waits indefinitely at the first barrier while all the others wait for it at the second;
- centralized barrier with sense reversal:
  - a second counter could be used to make sure all processes have left the barrier before the flag is reset; however, this incurs further latency;
  - solution: processes wait for the flag to become 0 in one round and 1 in the next, i.e., instead of the flag always going from 0 to 1 when processes pass the barrier, a variable in each process indicates which value to wait for. This way, the error case discussed above does not happen (the late process will still see the value it is waiting for, while the other processes wait for it);
  - this method is called sense reversal;
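- a sketch (mine, in C11; names are illustrative) of the centralized sense-reversing barrier; counter and flag start at 0, and each process keeps its own local_sense, also starting at 0:

      #include <stdatomic.h>

      struct barrier {
          atomic_int counter;                       /* arrivals in the current episode */
          atomic_int flag;                          /* release flag, toggles every episode */
          int p;                                    /* number of participating processes */
      };                                            /* initialize counter = 0, flag = 0 */

      void barrier_wait(struct barrier *b, int *local_sense) {
          *local_sense = !*local_sense;             /* the value to wait for this time */
          if (atomic_fetch_add(&b->counter, 1) + 1 == b->p) {
              atomic_store(&b->counter, 0);         /* last arrival: reset for reuse ... */
              atomic_store(&b->flag, *local_sense); /* ... and release everybody */
          } else {
              while (atomic_load(&b->flag) != *local_sense)
                  ;                                 /* busy-wait on this episode's sense */
          }
      }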
- in terms of bus transactions, barriers can be expensive;
- improving barrier algorithms for a bus:
  - using a tree structure;
  - bus snooping (all processors are waiting on the flag variable, so in theory a single read miss would be enough, like a broadcast of the new value);
- hardware barriers: a separate synchronization bus;
- implications for software:
  - increase spatial locality;
  - minimize false sharing;
  - watch for conflict misses: e.g., two consecutive rows used by a process may map to the same positions in the cache; allocating power-of-2-sized arrays makes this worse;
  - use per-process heaps (also to minimize false sharing);
  - pad arrays (again, to keep data for different processes in different cache blocks; see the small example below);
  - also be aware of false sharing at the PAGE level!!
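- a tiny C sketch (mine; a small direct-mapped cache of about 8 KB is assumed) of the conflict-miss/padding point: with a power-of-2 row length, consecutive rows of a[][] map onto the same cache sets, and a few elements of padding per row shift the mapping:

      #define N    1024                   /* power-of-2 row length (8 KB of doubles per row) */
      #define PAD  8                      /* a few extra elements of padding per row */

      double a[N][N];                     /* rows i and i+1 map onto the same cache sets
                                             in a small (e.g. 8 KB) direct-mapped cache */
      double a_padded[N][N + PAD];        /* padding shifts each row's mapping */

      int main(void) {
          double sum = 0.0;
          for (int j = 0; j < N; j++)     /* a sweep over two consecutive rows: the padded
                                             layout avoids the self-conflicts of a[][] */
              sum += a[0][j] + a[1][j] + a_padded[0][j] + a_padded[1][j];
          return (int)sum;
      }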