Chapter 16: Distributed Shared Memory
---

- DSM is used as an abstraction for sharing data between computers that do not share physical memory;
- allows a shared-memory programming model;
- the amount of communication is strongly related to the consistency model adopted by a DSM system;
- the main point of DSM is that it spares the programmer the concerns of message passing when writing applications;
- less appropriate for client-server communication;
- Message passing (MP) is good because:
  - architectural differences are not a problem, because marshalling takes care of them;
  - processes are more protected from one another;
  - simpler to implement;
  - easier synchronization;
  - communication is explicit (can be good and bad);
- DSM is good because:
  - no marshalling/unmarshalling necessary;
  - known programming paradigm;
  - communication is not explicit (again, good and bad);
- experiments [Munin] show that parallel programs developed for DSM can be made to perform as well as functionally equivalent programs written for message-passing platforms on the same hardware (at least with a small number of computers);
- the performance of DSM depends on many factors, particularly the pattern of data sharing;
- variable marshalling is needed when MP is used;
- MP gives more protection and takes care of different data representations between computers;
- processes that use DSM may have non-overlapping lifetimes (results stay in memory), which is not the case with MP; this makes DSM persistent over time;
- DSM may be provided in hardware (NUMA), over paged virtual memory (Ivy, Munin, and others), or as middleware (JavaSpaces, TSpaces);
- in Ivy (which uses VM), the DSM is a region of virtual memory occupying the same address range in the address space of every participating process; this approach depends on homogeneous computers, with the same data and paging formats;
- the middleware approach is quite different because it is not intended to reuse existing shared-memory code;
  * interesting to see that microkernel architectures make it easier to implement DSM, as page-fault handlers are implemented in user space and can communicate as they wish in order to handle a fault;
- a DSM system may be byte-oriented, object-oriented, or based on immutable data. Linda (and its derivatives JavaSpaces and TSpaces) uses the last approach, viewing objects as immutable tuples: an update is done by first issuing a "take" and then a "write" (see the sketch below);
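  * a minimal Python sketch of this take-then-write update on an immutable tuple space; TupleSpace, write, take, read and the wildcard matching below are illustrative stand-ins, not the actual Linda or JavaSpaces API:

    import threading

    class TupleSpace:
        def __init__(self):
            self._tuples = []
            self._cond = threading.Condition()

        def write(self, tup):
            """Add an immutable tuple to the space (Linda's 'out')."""
            with self._cond:
                self._tuples.append(tuple(tup))
                self._cond.notify_all()

        def take(self, template):
            """Remove and return a tuple matching the template (Linda's 'in').
            None fields act as wildcards; blocks until a match exists."""
            with self._cond:
                while True:
                    for t in self._tuples:
                        if self._matches(t, template):
                            self._tuples.remove(t)
                            return t
                    self._cond.wait()

        def read(self, template):
            """Return (without removing) a matching tuple (Linda's 'rd')."""
            with self._cond:
                while True:
                    for t in self._tuples:
                        if self._matches(t, template):
                            return t
                    self._cond.wait()

        @staticmethod
        def _matches(t, template):
            return len(t) == len(template) and all(
                f is None or f == v for f, v in zip(template, t))

    # tuples are immutable, so "incrementing" the shared counter means
    # taking the old tuple and writing a new one
    space = TupleSpace()
    space.write(("count", 0))
    old = space.take(("count", None))     # remove the current tuple
    space.write(("count", old[1] + 1))    # write a new immutable tuple

  * this only shows why immutability forces updates to be take+write; a real tuple space would also match on field types and be accessible remotely;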
- the main consistency models that can be practically realized in DSM implementations are sequential consistency and models that are based on weak consistency;
- perfect consistency is only possible with a total ordering mechanism; this model represents the concept of linearizability, also called atomic consistency;
- a DSM is linearizable if for any execution there is some interleaving of the operations issued by all the clients that satisfies:
  - L1: the interleaved sequence of operations meets the specification of a (single) correct copy of the objects;
  - L2: the order of operations in the interleaving is consistent with the real times at which the operations occurred in the actual execution;
- specialized to memory reads and writes (R(x)a denotes a read of location x that returns a; W(x)a denotes a write of a to x), L1 can be written as:
  - L1': the interleaved sequence of operations is such that if R(x)a occurs in the sequence, then either the last write operation that occurs before it in the interleaved sequence is W(x)a, or no write operation occurs before it and a is the initial value of x;
- linearizability is too strict for most practical purposes;
- sequential consistency (SC, due to Lamport) is the model most used in practice;
- a DSM is sequentially consistent if for any execution there is some interleaving of the operations issued by all the clients that satisfies:
  - SC1: the interleaved sequence of operations is such that if R(x)a occurs in the sequence, then either the last write operation that occurs before it in the sequence is W(x)a, or no write operation occurs before it and a is the initial value of x;
  - SC2: the order of operations in the interleaving is consistent with the program order in which each individual client executes them;
- SC1 is equivalent to L1. SC2 uses program order rather than temporal order (as in L2), which is what makes SC feasible to implement;
- note that memory operations upon the entire DSM have to be taken into account to satisfy the conditions of SC, and not just the operations on each individual memory location;
  * it seemed to me that SC was a direct application of Lamport's "happens-before" relation, but that is claimed to be Causal Consistency (CC)... What is the difference then?
- SC is costly to implement;
- coherence is a weaker model: SC applied on a location-by-location basis;
- update options: write-update (WU) and write-invalidate (WI);
  - WU: updates are made locally and multicast to the other replica managers, which immediately update their copies (multiple-readers/multiple-writers sharing);
  - WI: multiple-readers/single-writer sharing. When a process wants to write to a local copy, a multicast message is sent to invalidate all other copies (a sketch of the idea follows below);
- granularity of sharing affects performance, e.g., if a page is the basic unit, 8 KB may be sent because a single byte was updated!
  * false sharing: process A reads location 1 while B writes location 2, both in the same page. The bigger the granularity, the bigger the chances of false sharing;
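  * a minimal Python sketch of the write-invalidate idea for a single shared page (single process, no real networking); PageManager, read_fault and write_fault are illustrative names, not any particular system's API:

    class Node:
        def __init__(self, name):
            self.name = name
            self.copy = None          # local copy of the page (None = invalid)

        def invalidate(self):
            self.copy = None          # drop the now-stale local copy

    class PageManager:
        """Tracks which nodes hold a valid copy and who currently owns the page."""
        def __init__(self, initial):
            self.value = initial
            self.holders = set()      # nodes with a valid read copy
            self.owner = None         # single writer, if any

        def read_fault(self, node):
            node.copy = self.value    # ship the page; node becomes a reader
            self.holders.add(node)

        def write_fault(self, node, new_value):
            # "multicast" an invalidation to every other copy holder
            for h in self.holders - {node}:
                h.invalidate()
            self.holders = {node}     # node now holds the single valid copy...
            self.owner = node         # ...and ownership is transferred to it
            node.copy = new_value
            self.value = new_value

    a, b, mgr = Node("A"), Node("B"), PageManager(initial=0)
    mgr.read_fault(a)                 # A and B read: multiple-readers sharing
    mgr.read_fault(b)
    mgr.write_fault(a, 42)            # A writes: B's copy is invalidated
    assert b.copy is None and a.copy == 42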
- too much communication overhead leads to thrashing;
- Munin uses WU on a page-based basis, buffering multiple writes to the same page (it makes a copy of the page before the local changes occur);
- in WI, sequence numbers may be used to prevent stale copies from being taken in write mode;
- in WI, ownership of a page is transferred when a write fault occurs;
- Release Consistency (RC): reduces DSM overhead by exploiting the fact that programmers use synchronization objects;
- RC satisfies these requirements:
  - RC1: before an ordinary read or write operation occurs, all previous acquire accesses must be performed;
  - RC2: before a release operation is allowed to perform, all previous ordinary read and write operations must be performed;
  - RC3: acquire and release operations are sequentially consistent with respect to one another;
- the DSM runtime can only enforce RC if it is aware of synchronization accesses;
- in RC, a process need not be blocked while it makes updates inside the critical section; updates are propagated only when the lock is released (see the sketch after this list);
- RC can be used with either WI or WU;
- RC is implemented in the V-kernel;
- note that updates may only be done when a process acquires the lock (processes acquire it even for reading!);
- Munin divides data items into the following categories:
  - read-only;
  - migratory: both read and write permissions are given at the same time;
  - write-shared: used by the programmer when processes access different parts of an array, for example. After the accesses, only the differences are propagated;
  - producer-consumer: shared by a fixed set of processes (only one of which writes);
  - reduction: data always follows the pattern lock, read, write, unlock. Items have a fixed owner, to which updates are propagated;
  - result: updates done by "slave" processes are propagated just to the "master";
  - conventional: conventional RC (?);
- uniform vs. hybrid models: hybrid models differentiate between ordinary and synchronization memory accesses;
- other uniform consistency models:
  - causal consistency;
  - processor consistency;
  - pipelined RAM;
- other hybrid models:
  - entry consistency: a lock may be held by several readers;
  - scope consistency: automatic (compiler-generated) EC;
  - weak consistency: does not differentiate between acquire and release synchronization accesses;
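  * a minimal Python sketch of the RC idea of buffering ordinary writes under a lock and propagating them only at release time (eager propagation, roughly in the spirit of RC1-RC3); Replica and RCLock are illustrative assumptions, not Munin's or any real system's implementation:

    class Replica:
        def __init__(self):
            self.store = {}

        def apply(self, updates):
            self.store.update(updates)

    class RCLock:
        def __init__(self, local, remote_replicas):
            self.local = local
            self.remotes = remote_replicas
            self.buffer = {}

        def acquire(self):
            # RC1: ordinary accesses only happen after the acquire; in a real
            # system this is also where the latest remote updates would be
            # pulled in before the critical section starts
            self.buffer = {}

        def write(self, var, value):
            # ordinary write: performed locally, buffered, not yet visible
            # to other replicas (the writer is not blocked)
            self.local.store[var] = value
            self.buffer[var] = value

        def release(self):
            # RC2: all buffered ordinary writes are propagated before the
            # release completes, so the next acquirer sees a consistent state
            for r in self.remotes:
                r.apply(self.buffer)
            self.buffer = {}

    local, other = Replica(), Replica()
    lock = RCLock(local, [other])
    lock.acquire()
    lock.write("x", 1)                # visible locally only
    lock.write("y", 2)
    assert "x" not in other.store     # nothing propagated yet
    lock.release()
    assert other.store == {"x": 1, "y": 2}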