Chapter 3: Programming for Performance
---
- this chapter focuses on software techniques for enhancing the performance of parallel programs;
- partitioning for performance:
    - balancing the workload;
    - reducing the time spent waiting at synchronization events;
    - reducing the extra work done to determine and manage a good assignment;
    - reducing communication;
- identifying enough concurrency: data and function parallelism
    - data parallelism: different processors work on different parts of the data set;
    - function parallelism: entirely different calculations can be performed concurrently, on either the same or different data; also called control parallelism or task parallelism;
    - function parallelism is usually more difficult to exploit in a load-balanced way, since different functions involve different amounts of work;
- static vs. dynamic workload assignment (see the scheduling sketch further below):
    - in static assignment, an algorithm maps the work to processes before execution, and this mapping does not change at run time;
    - dynamic partitioning can be:
        - semistatic: compute a first partition statically and change it dynamically if needed;
        - dynamic tasking: used when the work distribution or the system environment is too unpredictable. The work is divided into tasks, and a pool of available tasks is maintained for the processes to execute;
    - in dynamic tasking, task queues can be centralized or distributed; distributing them minimizes synchronization costs. With distributed queues, a process that runs out of tasks in its own queue may obtain work from other processes by task stealing, i.e., querying other processes' queues. Since tasks are inserted into the queues dynamically, stealing requires a termination-detection algorithm so that processes eventually stop searching for tasks to steal;
    - as dynamic partitioning increases communication costs, static techniques are preferred whenever they can provide good load balance;
- task granularity: the amount of work associated with a task; finer granularity makes load balancing easier but increases task-management overhead;
- the most important performance goal to be traded off against load balance is reducing interprocess communication;
- communication-to-computation ratio;
- domain decomposition: exploits the spatial locality of the computation;

DATA ACCESS AND COMMUNICATION
- artifactual communication sources:
    - poor allocation of data;
    - unnecessary data in a transfer;
    - unnecessary transfers due to other system granularities (e.g., coherence units in cache systems);
    - redundant communication of data (when global state is difficult to maintain);
    - finite replication capacity (limited cache size, for example);
- inherent communication is the communication that would occur even with unlimited capacity for local replication, perfect transfer granularity, and perfect global knowledge;
- reducing artifactual communication:
    1) exploiting temporal locality:
        - the same object is referenced more than once;
        - adjacent objects are handled by the same process;
        - use of blocking (tiling) so that the working set fits in the cache (see the tiling sketch after this list);
    2) exploiting spatial locality:
        - if a memory location is referenced now, memory locations close to it are likely to be referenced in the near future;
        - because data is transferred at a fixed granularity, poor spatial locality degrades performance substantially;
        - also important are the granularities of allocation and coherence; the interaction of the latter with poor spatial locality introduces false sharing (see the padding sketch after this list);
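A minimal sketch of removing false sharing by padding, assuming a 64-byte coherence (cache-line) granularity; the counter layout, thread count, and iteration count are illustrative, not taken from the text:

```c
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64        /* assumed coherence (cache-line) size in bytes */
#define NTHREADS    4
#define ITERS       10000000L

/* Without the pad field the four counters would share one cache line, so
   logically independent writes by different threads would keep invalidating
   each other's copies (false sharing). Padding each counter out to a full
   line removes this artifactual communication without changing the logic. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;          /* each thread writes only its own line */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    for (int id = 0; id < NTHREADS; id++)
        printf("counter[%d] = %ld\n", id, counters[id].value);
    return 0;
}
```

Compile with, e.g., -pthread; the same layout trick applies to any frequently written per-thread data.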
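A sketch of the blocking (tiling) idea mentioned under temporal locality above, using matrix multiplication as the example; the matrix size N and block size B are assumptions to be tuned to the actual cache:

```c
#include <stdio.h>

#define N 1024
#define B 64    /* block size; assumed so that three BxB tiles fit in cache */

/* Naive triple loop: every element of b is streamed from memory N times,
   so the working set exceeds the cache and temporal locality is lost. */
void matmul_naive(const double *a, const double *b, double *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i * N + k] * b[k * N + j];
            c[i * N + j] = s;
        }
}

/* Blocked (tiled) version: the same arithmetic, restructured so that the
   BxB tiles currently in use are reused many times before being evicted. */
void matmul_blocked(const double *a, const double *b, double *c) {
    for (int i = 0; i < N * N; i++) c[i] = 0.0;
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = a[i * N + k];
                        for (int j = jj; j < jj + B; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}

static double A[N * N], Bmat[N * N], Cmat[N * N];

int main(void) {
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; Bmat[i] = 2.0; }
    matmul_blocked(A, Bmat, Cmat);
    printf("Cmat[0] = %f\n", Cmat[0]);   /* expect 2.0 * N */
    return 0;
}
```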
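A sketch of static vs. dynamic assignment from the partitioning notes above, expressed with OpenMP scheduling clauses; the work() routine with irregular, iteration-dependent cost is a hypothetical stand-in for a real task, and the chunk size of 32 is only one possible choice of task granularity:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical task whose cost varies with i, so a static split is unbalanced. */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < (i % 64) * 1000; k++)
        x += sin(k) * cos(k);
    return x;
}

int main(void) {
    const int N = 4096;
    double total = 0.0;

    /* Static assignment: iterations are mapped to threads before execution
       and the mapping never changes; cheapest to manage, but load balance
       suffers when iteration costs differ. */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work(i);

    /* Dynamic tasking: idle threads grab chunks of 32 iterations from a
       shared pool; the chunk size sets the task granularity and trades
       load balance against queue/synchronization overhead. */
    #pragma omp parallel for schedule(dynamic, 32) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work(i);

    printf("%f\n", total);
    return 0;
}
```

Compile with, e.g., -fopenmp -lm; timing the two loops separately shows the load-balance vs. overhead trade-off directly.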
- cost of communication:
    C = Frequency * (Overhead + Delay + Length/Bandwidth + Contention - Overlap)
- the objective is to reduce the overhead, delay, communication volume, and contention, while trying to maximize the overlap between waiting and computing;
- overhead can be reduced by sending fewer, longer messages; this is easier when explicit (send/receive) communication is used;
- reducing delay basically means optimizing the network hardware and reducing the number of network hops; mapping parallel algorithms onto the network topology is important to reduce communication costs;
- network contention vs. endpoint contention (hot spots);
- overlapping communication with computation is possible if the application provides enough concurrency (see the non-blocking MPI sketch at the end of this section);
- parallel computing adds synchronization and remote-transfer delays, so the speedup does not simply track the number of processors;
- the following costs are found in parallel programs (a + indicates a cost that is higher than, or does not exist in, the corresponding sequential program):
    - busy-useful: time doing useful work;
    - busy-overhead (+);
    - data-local (+/-);
    - data-remote (+);
    - synchronization (+);
- the data-local portion may decrease in parallel programs, since each processor works on a subset of the data and may therefore show better locality of reference. When this reduction is significant compared with the cost of remote communication, we may get superlinear speedups;
- aspects of parallel programming better handled by shared memory:
    - naming: data location is implicit with SM;
    - replication: must be explicit with MP;
    - communication overhead;
- aspects better handled by message passing:
    - message size: the transfer block size is defined by the message itself;
    - synchronization;
    - hardware or design cost;
    - performance predictability;
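A sketch of overlapping communication with computation, referenced in the cost-of-communication notes above; it assumes a ring exchange between MPI processes, and compute_interior/compute_boundary are hypothetical placeholders for work that does not, or does, depend on the incoming data:

```c
#include <mpi.h>

#define N 100000

/* Placeholder routines: interior work needs no remote data, boundary work does. */
static void compute_interior(double *u, int n)                     { (void)u; (void)n; }
static void compute_boundary(double *u, const double *halo, int n) { (void)u; (void)halo; (void)n; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double u[N], halo[N];
    int right = (rank + 1) % size;          /* illustrative ring: send right, ... */
    int left  = (rank - 1 + size) % size;   /* ... receive from the left          */

    MPI_Request reqs[2];
    /* Start the exchange early with non-blocking calls, ... */
    MPI_Irecv(halo, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(u,    N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do the work that does not need the incoming data while the
       transfer is in flight (this is the Overlap term in the cost model), ... */
    compute_interior(u, N);

    /* ... and only wait for completion where the data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo, N);

    MPI_Finalize();
    return 0;
}
```

The interior computation runs while the transfer is in flight, which is exactly the overlap subtracted in the communication-cost formula.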