Chapter 7: Scalable Multiprocessors
-----------------------------------

- objective: scale to hundreds or thousands of processors;
- a shared bus has a maximum length of a foot or two, a fixed number of slots, and a fixed maximum bandwidth;
- with shared-bus architectures, all processors are trusted; this is completely different in NOWs and other scalable systems;
- massively parallel processors (MPPs) employ sophisticated packaging and a fast proprietary network, so that a large number of processors can be located in a confined space, with low latency and high bandwidth;
- four aspects of scalability:
  - how bandwidth/throughput increases;
  - how latency increases;
  - how the cost of the system increases;
  - how the systems are packaged and put together;
- multiple, independent wires (links) are needed to scale; we call the number of outputs (or inputs) the DEGREE of the switch;
* key property of a scalable system: provide a large number of
* independent communication paths, such that bandwidth increases as
* nodes are added to the system;
- time to transfer n bytes between 2 nodes:
      T(n) = Overhead + Channel time + Routing delay
* store-and-forward routing is impractical in large-scale parallel
* machines;
- with a fixed-degree switch, the routing delay necessarily increases as processors are added; however, the increase should be small compared to the overhead and channel time;
- scaling cost: Cost(p,m) = Fixed cost + Incremental cost(p,m)
- definition of a cost-effective system: Speedup(p) > Costup(p), where Costup(p) = Cost(p) / Cost(1);
* is there any system that does not satisfy the "cost-effective"
* definition above? doesn't make a lot of sense to me...
- physical scaling: some designs have integrated the communication architecture directly into the processor chip;
- hypercube or n-cube: with N = 2^n nodes, each node connects directly to n = log2(N) other nodes;
- board-level integration is another approach; it uses standard microprocessor components; boards may contain bus-based multiprocessors;
- system-level integration: connecting almost complete machines (e.g. IBM SP2);
- practical design: balance between computational performance, communication performance, and cost;
- in large-scale parallel machines, bus transactions give way to network transactions;
- in network transactions, source and destination are decoupled;
- there is no global arbitration, and many network transactions can be initiated simultaneously;
- explicit naming is needed to perform routing;
- address translation process: global address -> module name + local address;
- synchronous vs. asynchronous message passing; blocking vs. non-blocking;
* a send can be buffered or not. In a non-buffered send, there is a
* 3-message transaction, where the send is matched to a receive
* posted at the destination and only then is the data copied into the
* final buffer. With buffered sends, a temporary buffer is used if
* there is no matching receive;
- if temporary buffer space is small, a 3-message handshake (like the one for non-buffered sends) can be used;
- basically, there are 3 alternatives:
  - shared address space;
  - message passing;
  - active messages, which are essentially a restricted form of RPC;
- active messages: each message specifies a handler to be invoked at the destination when the message arrives (see the sketch below);
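A minimal sketch of the active-message idea in C; it is not the book's or any particular library's API. The handler table, packet layout, and the one-slot in-memory "network" are all made up for illustration; the point is only that the destination runs the named handler directly on the incoming data instead of matching it against a posted receive.

    /* Sketch of an active message: each packet carries the index of a
     * handler that the destination invokes as soon as the packet is
     * pulled from the network.  A one-slot in-memory "network" stands
     * in for the real interconnect so the sketch is self-contained. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define AM_MAX_PAYLOAD 64

    typedef struct {
        uint32_t handler_id;               /* which handler to run     */
        uint32_t len;                      /* payload length in bytes  */
        uint8_t  payload[AM_MAX_PAYLOAD];
    } am_packet_t;

    typedef void (*am_handler_t)(const void *payload, uint32_t len);

    static am_handler_t handler_table[16]; /* registered on every node */

    static am_packet_t net_slot;           /* stand-in for the network */
    static int         net_full = 0;

    /* sender side: name the handler and inject the packet
     * (len is assumed to be <= AM_MAX_PAYLOAD) */
    static void am_send(uint32_t handler_id, const void *payload, uint32_t len)
    {
        net_slot.handler_id = handler_id;
        net_slot.len = len;
        memcpy(net_slot.payload, payload, len);
        net_full = 1;
    }

    /* receiver side: no receive matching or buffering -- the handler
     * is invoked directly on the incoming data */
    static void am_poll(void)
    {
        if (net_full) {
            net_full = 0;
            handler_table[net_slot.handler_id](net_slot.payload, net_slot.len);
        }
    }

    /* example handler: apply a remote increment to a local counter */
    static long counter = 0;
    static void incr_handler(const void *payload, uint32_t len)
    {
        long delta;
        (void)len;
        memcpy(&delta, payload, sizeof delta);
        counter += delta;
    }

    int main(void)
    {
        handler_table[0] = incr_handler;
        long delta = 5;
        am_send(0, &delta, sizeof delta);  /* "remote" node sends        */
        am_poll();                         /* destination drains network */
        printf("counter = %ld\n", counter);
        return 0;
    }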
- common challenges:
  - input buffer overflow;
  - fetch deadlock: occurs when the output buffers are full and the node wants to process a request it has just received (how can it send the response with the output buffers full?); the solution is to have separate buffer pools for requests and responses;

- Physical DMA
  ------------
- no interpretation is placed on the information within a network transaction;
- blind physical DMA: the destination communication assist deposits the transaction data into storage, whereupon it will be interpreted by the processor;

- User level access
  -----------------
- messages can be sent directly to/from user-level buffers; each network transaction carries a user/system flag; a region of the address space is mapped to the network input and output ports as well as to the status registers. While user-level receive calls are not made, messages stay in the input buffers (in the NIC); one issue to be considered is whether this portion of the VA space is cacheable or not (p. 493);
- most basic level of hardware implementation;

- Dedicated message processing
  ----------------------------
- use of dedicated hardware, but without binding the interpretation to the hardware design; the interpretation is done in software on a dedicated communication processor (CP);
- the CP can be integrated with the NIC or connected to a shared bus together with the main processor;
- details of communication are hidden from the main processor P. As P and CP have shared memory, sending a message is basically writing to shared memory;
- both DMA and user-level access can be implemented;
* flow of CONTROL: P -> CP -> NIC -> NIC -> CP -> P
* flow of DATA:    M(P) -> M(CP) -dma-> NIC -> NIC -dma-> M(CP) -> M(P)
* the data is basically copied into shared memory by P; after that, CP
* initiates DMA to transfer the data from shared memory to the buffers
* in the NIC; however, as communication is via shared memory,
* communication performance is strongly influenced by the efficiency
* of the cache coherence protocol;
- before P can send a second message, the first one needs to be invalidated from CP's cache (assuming the same buffer is reused);
- ex: Intel Paragon - DMA used between shared memory and the NIC;

- Meiko CS-2
  ----------
- integrated CPs (NICs);
- several logical CPs implemented by a single time-multiplexed processor;

- Shared Physical address space
  -----------------------------
- a VA is translated to a PA that indicates the location of a memory block;
- combinations of shared/private and remote/local may complicate the work of cache coherence and memory consistency protocols. Ex: private/local data is easier to handle than shared/remote data (covered in chap. 9);
- Ex: CRAY T3D: 2048 nodes, each with a 150 MHz Alpha 21064 and 64 MB; additional registers are used to extend the PA from 32 bits to 48 bits (5-bit register index + 27-bit offset ==> 21-bit node number + 27-bit offset); a sketch of this expansion follows below;
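A small illustration of that kind of address extension in C. The field widths come straight from the note above (5-bit register index, 27-bit offset, 21-bit node number), but the array and function names are made up for the sketch and are not the T3D's actual hardware interface.

    /* Sketch of external-register address extension: the top 5 bits of
     * a 32-bit PA select a register holding a 21-bit node number, which
     * is concatenated with the 27-bit local offset to form a 48-bit
     * global address. */
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS  27
    #define INDEX_BITS    5
    #define OFFSET_MASK  ((1u << OFFSET_BITS) - 1)

    static uint32_t annex[1 << INDEX_BITS];  /* each entry: 21-bit node number */

    /* expand a 32-bit local PA into a 48-bit (node, offset) global address */
    static uint64_t expand_pa(uint32_t pa)
    {
        uint32_t index  = pa >> OFFSET_BITS;     /* top 5 bits  */
        uint32_t offset = pa & OFFSET_MASK;      /* low 27 bits */
        uint64_t node   = annex[index];          /* 21-bit node */
        return (node << OFFSET_BITS) | offset;
    }

    int main(void)
    {
        annex[3] = 0x1ABCD;        /* pretend register 3 names node 0x1ABCD */
        uint32_t pa = (3u << OFFSET_BITS) | 0x123456;
        printf("global address = 0x%012llx\n",
               (unsigned long long)expand_pa(pa));
        return 0;
    }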
- Clusters and Networks of Workstations (NOWs)
  --------------------------------------------
- a consequence of the availability of scalable communication networks (with falling prices, of course);
- NOWs: good for problems with a high computation-to-communication ratio; however, this is becoming less of an issue with better communication technologies: switched Ethernet, FDDI, HIPPI, ATM, Myrinet, etc.;

- Myrinet SBUS Lanai
  ------------------
- 3 DMA engines (Net -> NIC, NIC -> Net, NIC SRAM <-> host memory);
- the NIC SRAM is memory-mapped;

- PCI memory channel
  ------------------
- reflective memory: establishes a connection between a region of the address space of one process (a transmit region) and a receive region in another process. The transmit region usually consists of virtual address space mapped to the NIC memory. Receive regions are usually in the process's memory, with the NIC performing a DMA operation on packet reception;

- Performance Lessons (sec. 7.8)          ***** READ AGAIN *****
  ------------------------------
- Paragon: an inefficient bus-based coherence protocol creates a high send overhead;
* the cost of uncached operations, misses, and synchronization
* instructions, generally considered to be infrequent events and
* therefore a low priority for architecture optimization, is critical
* to communication performance;
- depending on how you scale a problem, you can significantly change the communication-to-computation ratio, creating a huge problem for machines with higher communication costs (e.g. IBM SP2); we can see this in figure 7.35;
- in the measurements shown, even though communication costs increase as the number of processors increases, a good (almost perfect) speedup is achieved by some machines because of a significant decrease in cache misses (a higher percentage of the local working set gets cached at each node);

- Locks
  -----
- a problem with the array-based lock on machines with physically distributed memory is that we don't know a priori which array location a given process will spin on, so that location cannot be allocated in that process's local memory (and spinning on local memory is, of course, better);
- solution: software queueing lock. The idea is to maintain a distributed linked list, i.e. a queue of the waiters on a lock. The process at the head holds the lock; every other node in the list is waiting on the lock, and each one's spin variable is allocated in that process's local memory; a tail pointer points to the last enqueued process;
- processes spin on flags in their local nodes;
- one remote write per acquire and another one per release to update the list pointers (right? not clear from the text); see the sketch below;
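A compact sketch of such a software queueing lock, written with C11 atomics; this is essentially the classic MCS lock, and the type and function names below are mine, not the book's. Each process brings its own queue node, enqueues itself with an atomic swap on the tail pointer, and then spins only on the flag inside its own node, which on a distributed-memory machine would be allocated in local memory.

    /* Software queueing (MCS-style) lock sketch using C11 atomics.
     * Usage: one shared qlock_t (tail initialized to NULL) and one
     * qnode_t per acquiring process, passed to acquire and release. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool             locked;  /* true while this waiter must spin */
    } qnode_t;

    typedef struct {
        _Atomic(qnode_t *) tail;         /* last enqueued process, or NULL */
    } qlock_t;

    void qlock_acquire(qlock_t *lock, qnode_t *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);

        /* enqueue ourselves with an atomic swap on the tail pointer */
        qnode_t *prev = atomic_exchange(&lock->tail, me);
        if (prev != NULL) {
            /* someone holds the lock: link into the predecessor's node
             * and spin on our own (local) flag */
            atomic_store(&prev->next, me);
            while (atomic_load(&me->locked))
                ;   /* spin locally */
        }
    }

    void qlock_release(qlock_t *lock, qnode_t *me)
    {
        qnode_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no known successor: try to reset the tail to empty */
            qnode_t *expected = me;
            if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
                return;                 /* queue really was empty */
            /* a successor is enqueueing; wait for it to link itself in */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        /* wake the successor by writing its (remote) flag */
        atomic_store(&succ->locked, false);
    }

In this sketch, acquire performs an atomic swap on the shared tail pointer (and, if there is a predecessor, one write to link into its node), and release performs one write to the successor's flag; everything else is spinning on local memory, which is the point of the scheme for distributed-memory machines.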