Chapter 8: Multiprocessors
--------------------------

Synchronization
---------------
- implementing a single atomic operation introduces some challenges,
  since it requires both a memory read and a write in a single,
  uninterruptible instruction; this complicates the implementation of
  coherence, since the hardware cannot allow any other operation
  between the read and the write, and yet must not deadlock;
- one alternative: load linked (LL) + store conditional (SC): if the
  contents of the location specified by the load linked are changed
  before the store conditional to the same address occurs, the store
  conditional fails; if the processor does a context switch between
  the two instructions, the store conditional always fails; these
  instructions are typically implemented by keeping track of the
  address specified in the LL instruction in a register, often called
  the link register; if an interrupt occurs, or if the cache block
  matching the address in the link register is invalidated (for
  example, by another SC), the link register is cleared; the SC simply
  checks that its address matches that in the link register, failing
  if they differ;
- the number of instructions between LL and SC should be minimized;
- implementing spin locks with cache coherence has two main
  advantages: first, it is much cheaper to spin on a local cached copy
  than to fetch the lock from memory on every attempt; second, lock
  accesses have a high degree of locality (the processor that used the
  lock last is likely to use it again soon);
- instead of trying to write the lock on every iteration, the better
  implementation shown on p. 698 keeps reading the lock location until
  it detects that the lock is free, and only then attempts the atomic
  exchange (a C11 sketch of this scheme, combined with the exponential
  back-off of the next section, appears below after the queuing-lock
  notes);
- synchronization performance can be a real bottleneck when there is
  substantial contention among multiple processes; when contention is
  low, we are mainly concerned with the latency of the synchronization
  primitive;
- the more serious problem in these examples is the serialization of
  each process's attempt to complete the synchronization; this
  serialization is a problem when there is contention, because it
  greatly increases the time to complete the operation; for example,
  if the time to complete all 20 lock and unlock operations depended
  only on the latency in the uncontended case, then it would take 2000
  rather than 40000 cycles to complete the synchr. operations;

Synchronization mechanisms for large-scale machines
---------------------------------------------------
- we want to minimize serialization when contention is significant;
- alternative 1: exponential back-off;
- alternative 2: queuing locks:
  - can be implemented in either hardware or software;
  - delay for the 20-lock problem decreases to 2050 cycles;
  - the hardware must be prepared to reclaim such locks, since the
    process that requested the lock may have been context-switched and
    may not even be scheduled again on the same processor;
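- [illustrative sketch, not from the text] a minimal C11 version of
  the "spin on the cached copy, then try the exchange" lock described
  above, with the exponential back-off idea folded in; the names
  spin_lock/spin_unlock/delay and the back-off constants are arbitrary
  choices, and on LL/SC machines the atomic exchange would typically
  be compiled into an LL/SC loop:

      #include <stdatomic.h>

      static atomic_int lock = 0;            /* 0 = free, 1 = held */

      /* crude busy-wait delay; a real implementation would use a
         pause/yield hint instead */
      static void delay(unsigned cycles) {
          for (volatile unsigned i = 0; i < cycles; i++)
              ;
      }

      static void spin_lock(atomic_int *l) {
          unsigned backoff = 4;              /* initial back-off, arbitrary */
          for (;;) {
              /* test: spin on the locally cached copy, generating no
                 coherence traffic while someone else holds the lock */
              while (atomic_load_explicit(l, memory_order_relaxed) != 0)
                  ;
              /* test-and-set: one atomic exchange (an LL/SC loop on
                 LL/SC machines); acquire ordering keeps the critical
                 section from moving above the lock acquisition */
              if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
                  return;                    /* lock obtained */
              /* lost the race: back off exponentially before retrying */
              delay(backoff);
              if (backoff < 1024)
                  backoff *= 2;
          }
      }

      static void spin_unlock(atomic_int *l) {
          /* release ordering makes the critical section's writes
             visible before the lock is seen as free */
          atomic_store_explicit(l, 0, memory_order_release);
      }

  used as spin_lock(&lock); ... spin_unlock(&lock); while the lock is
  held, the waiting loop only reads the locally cached copy and
  generates no coherence traffic; traffic occurs only when the lock is
  released and the waiters race on the exchange, which the back-off
  then spreads out;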
Models of Memory Consistency
----------------------------
- the consistency question: "When must a processor see a value that
  has been updated by another processor?"
- sequential consistency (SC) requires that the result of any
  execution be the same as if the accesses executed by each processor
  were kept in order and the accesses among different processors were
  interleaved;

  * THE SIMPLEST WAY TO IMPLEMENT SEQUENTIAL CONSISTENCY IS TO REQUIRE
  * A PROCESSOR TO DELAY THE COMPLETION OF ANY MEMORY ACCESS UNTIL ALL
  * THE INVALIDATIONS CAUSED BY THAT ACCESS ARE COMPLETED.

- SC reduces potential performance, especially in a machine with a
  large number of processors or long interconnect delays;
- to provide better performance, designers have developed less
  restrictive memory consistency models that allow for faster
  hardware;
- a program is said to be synchronized if all accesses to shared
  variables are ordered by synchronization operations; synchronized
  programs are also called data-race-free;
- it is a broadly accepted observation that most programs are
  synchronized; even with SC, reasoning about the outcome of
  unsynchronized programs is very difficult;
- there are two types of restriction on memory orders: read fences and
  write fences; fences are fixed points in a computation that ensure
  that no read or write is moved across the fence; for example, a
  write fence executed by processor P ensures that:
  - all writes by P that occur before P executed the write fence
    operation have completed, and
  - no writes that occur after the fence in P are initiated before the
    fence;
- in SC, all reads are read fences and all writes are write fences;
- from a performance viewpoint, the processor would like to execute
  reads as early as possible and complete writes as late as possible;
- there are a number of relaxed models that all maintain the property
  that the execution semantics of a synchronized program is the same
  under the relaxed model as it would be under an SC model;
- it is simpler if we define the models in terms of what orderings
  among reads and writes PERFORMED BY A SINGLE PROCESSOR are preserved
  by each model; there are four such orderings:
  - R->R: a read followed by a read;
  - R->W: a read followed by a write, which is always preserved if the
    operations are to the same address, since this is an
    antidependence;
  - W->W: a write followed by a write, which is always preserved if
    they are to the same address, since this is an output dependence;
  - W->R: a write followed by a read, which is always preserved if
    they are to the same address, since this is a true dependence;

  * IF THERE IS A DEPENDENCE BETWEEN THE READ AND THE WRITE, THEN
  * UNIPROCESSOR PROGRAM SEMANTICS DEMAND THAT THE OPERATIONS BE
  * ORDERED; IF THERE IS NO DEPENDENCE, THE MEMORY CONSISTENCY MODEL
  * DETERMINES WHAT ORDERS MUST BE PRESERVED; AN SC MODEL REQUIRES
  * THAT ALL FOUR ORDERINGS BE PRESERVED;

- for example, relaxing the ordering W->R means that we allow a read
  that is later than a write to complete before the write has
  completed; letting the read occur after the write miss has been
  handled but before the invalidations are done does not preserve the
  ordering (the two-flag sketch below shows why this ordering matters
  for flag-based synchronization);

  * A CONSISTENCY MODEL DOES NOT, IN REALITY, RESTRICT THE ORDERING OF
  * EVENTS; INSTEAD, IT SAYS WHAT POSSIBLE ORDERINGS CAN BE OBSERVED;
  * FOR EXAMPLE, IN SC THE SYSTEM MUST APPEAR TO PRESERVE THE FOUR
  * ORDERINGS, ALTHOUGH IN PRACTICE IT CAN ALLOW REORDERING; THIS
  * ALLOWS IMPLEMENTATIONS TO USE TRICKS THAT REORDER EVENTS WITHOUT
  * ALLOWING THE REORDERING TO BE OBSERVED;

- the consistency model must also define the orderings imposed between
  synchronization variable accesses, which act as fences, and all
  other accesses; for weaker consistency models, we need to specify
  the ordering restrictions involving ordinary variables; the simplest
  ordering restriction is that every synchronization access is a
  memory fence; if we let S stand for a synchr. variable access, we
  can also write this with the ordering notation just shown as S->W,
  W->S, etc.;
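- [illustrative sketch, not from the text] the classic two-flag
  example behind the W->R discussion, written with C11 atomics and
  threads; each thread writes its own flag and then reads the other's;
  under SC, or with the seq_cst fences below acting as write fences,
  the outcome seenA == 0 && seenB == 0 is impossible, but if the
  fences are removed and W->R is relaxed, both reads may complete
  before either write is visible and both threads can see 0; the names
  flagA/flagB/seenA/seenB are arbitrary:

      #include <stdatomic.h>
      #include <stdio.h>
      #include <threads.h>

      static atomic_int flagA = 0, flagB = 0;
      static int seenA, seenB;

      static int thread1(void *arg) {
          (void)arg;
          atomic_store_explicit(&flagA, 1, memory_order_relaxed);     /* W */
          /* write fence: the store above must complete before the
             read below is performed */
          atomic_thread_fence(memory_order_seq_cst);
          seenB = atomic_load_explicit(&flagB, memory_order_relaxed); /* R */
          return 0;
      }

      static int thread2(void *arg) {
          (void)arg;
          atomic_store_explicit(&flagB, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);
          seenA = atomic_load_explicit(&flagA, memory_order_relaxed);
          return 0;
      }

      int main(void) {
          thrd_t t1, t2;
          thrd_create(&t1, thread1, NULL);
          thrd_create(&t2, thread2, NULL);
          thrd_join(t1, NULL);
          thrd_join(t2, NULL);
          /* with the fences, seenA == 0 && seenB == 0 cannot happen;
             without them, W->R may be relaxed and it can */
          printf("seenA=%d seenB=%d\n", seenA, seenB);
          return 0;
      }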
- Processor Consistency (PC) or Total Store Ordering (TSO):
  - relaxes the ordering between a write and a read (to a different
    address), eliminating the order W->R;
  - the processor allows a read to proceed before it is guaranteed
    that an earlier write by that processor has been seen by all the
    other processors;
  - by relaxing only this one ordering, many applications (even those
    that are unsynchronized) operate correctly, although a
    synchronization operation is necessary to ensure that a write
    completes before a read is done;
  - this model is equivalent to making the writes be write fences;
- Partial Store Ordering (PSO):
  - allows nonconflicting writes to potentially complete out of order,
    by relaxing the W->W ordering;
  - a write operation need only cause a stall when a synchronization
    operation, which causes a write fence, is encountered;
- Weak Ordering (WO):
  - eliminates the R->R and R->W orderings in addition to the other
    two orderings;
  - does not preserve ordering among references, except for the
    following:
    - a read or write is completed before any synchronization
      operation executed in program order by the processor after the
      read or write;
    - a synchr. operation is always completed before any reads or
      writes that occur in program order after the operation;
  - the only orderings imposed by WO are those created by
    synchronization operations;
- Release Consistency (RC):
  - more relaxed than WO;
  - distinguishes between synchr. operations: release vs. acquire;
  - a read or write that precedes an acquire need not complete before
    the acquire, and a read or write that follows a release need not
    wait for the release (the acquire/release split is sketched in C11
    below, after the implementation notes);
  - provides one of the least restrictive models that is easily
    checkable, and ensures that synchronized programs will see a
    sequentially consistent execution;
- Summary of the orderings preserved by each model:

    SC:     R->R, R->W, W->R, W->W,
            S->W, S->R, R->S, W->S, S->S
    TSO/PC: R->R, R->W, W->W,
            S->W, S->R, R->S, W->S, S->S
    PSO:    R->R, R->W,
            S->W, S->R, R->S, W->S, S->S
    WO:     S->W, S->R, R->S, W->S, S->S
    RC:     Sa->W, Sa->R, R->Sr, W->Sr,
            Sa->Sa, Sa->Sr, Sr->Sa, Sr->Sr

- our directory protocols already satisfy SC, since the processor
  stalls until an operation is complete and the directory first
  invalidates all sharers before responding to a write miss;
- PC is typically implemented by allowing read misses to bypass
  pending writes; a write buffer that can check whether any pending
  write in the buffer is to the same address as a read miss, together
  with a memory and interconnection system that can support two
  outstanding references per node, is sufficient to implement this
  scheme;
- in RC we must ensure that all pending invalidates for all
  outstanding writes complete before we allow a release to complete,
  so we simply check the pending invalidation counts on any
  outstanding write when a release is executed; this is made easier by
  limiting the number of outstanding writes;
- at the present time, most machines being built support some sort of
  weak consistency model, varying from processor consistency to RC,
  and almost all also support SC as an option;
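- [illustrative sketch, not from the text] the acquire/release split
  that RC formalizes, expressed here with C11 memory orders; the
  release store to the flag must wait for the earlier ordinary write
  to data, and the acquire load keeps the later read of data from
  being performed before it, while accesses before the acquire or
  after the release remain free to be reordered; the names
  data/ready/producer/consumer are arbitrary:

      #include <stdatomic.h>
      #include <stdio.h>
      #include <threads.h>

      static int data;                    /* ordinary shared variable */
      static atomic_int ready = 0;        /* synchronization variable */

      static int producer(void *arg) {
          (void)arg;
          data = 42;                      /* ordinary write */
          /* release: the write to data must complete before the flag
             is observed as set; later accesses need not wait */
          atomic_store_explicit(&ready, 1, memory_order_release);
          return 0;
      }

      static int consumer(void *arg) {
          (void)arg;
          /* acquire: later reads and writes may not be performed
             before the flag is seen as set */
          while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
              ;
          printf("data = %d\n", data);    /* always prints 42 */
          return 0;
      }

      int main(void) {
          thrd_t p, c;
          thrd_create(&p, producer, NULL);
          thrd_create(&c, consumer, NULL);
          thrd_join(p, NULL);
          thrd_join(c, NULL);
          return 0;
      }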
Distributed Shared Memory (DSM) or Shared Virtual Memory (SVM)
--------------------------------------------------------------
- pages become the units of coherence, rather than cache blocks;
- just as in a directory system, each page has a home node, and the
  operating system running on that node is responsible for tracking
  who has copies of that page;
- this dependence on software and the larger unit of coherence are the
  main reasons for the performance gap between DSM and the
  hardware-based approach;
- SVM has become an acceptable substitute for loosely coupled message
  passing, since in both cases the frequency of communication must be
  low, and communication that is structured in larger blocks is
  preferred;