Chapter 3: Programming for Performance
---
- this chapter focuses on software techniques for enhancing the performance of parallel programs;
- partitioning for performance:
    - balancing the workload;
    - reducing the time spent waiting at synchronization events;
    - reducing the extra work done to determine and manage a good assignment;
    - reducing communication;
- identifying enough concurrency: data and function parallelism
    - data parallelism: different processors work on different parts of the data set;
    - function parallelism: entirely different calculations can be performed concurrently, on either the same or different data; also called control parallelism or task parallelism;
    - function parallelism is usually more difficult to exploit in a load-balanced way, since different functions involve different amounts of work;
- static vs. dynamic workload assignment (see the scheduling sketch further below):
    - in static assignment, an algorithm maps the work to processes before execution, and this mapping does not change at run time;
    - dynamic partitioning can be:
        - semistatic: compute a first partition statically and change it dynamically if needed;
        - dynamic tasking: used when the work distribution or the system environment is too unpredictable. The work is divided into tasks, and a pool of available tasks is maintained for the processes to execute;
    - in dynamic tasking, task queues can be centralized or distributed; distributing them minimizes synchronization costs. With distributed queues, a process that runs out of tasks in its own queue may obtain work from other processes by task stealing, i.e., querying other processes' queues. Since tasks are inserted into the queues dynamically, stealing requires a termination-detection algorithm so that processes eventually stop searching for tasks to steal;
    - as dynamic partitioning increases communication costs, static techniques are preferred whenever they can provide good load balance;
- task granularity: the amount of work associated with a task; finer granularity makes load balancing easier but increases task-management overhead;
- the most important performance goal to be traded off against load balance is reducing interprocess communication;
- communication-to-computation ratio;
- domain decomposition: exploits the spatial locality of the computation;

DATA ACCESS AND COMMUNICATION
- artifactual communication sources:
    - poor allocation of data;
    - unnecessary data in a transfer;
    - unnecessary transfers due to other system granularities (e.g., coherence units in cache systems);
    - redundant communication of data (when global state is difficult to maintain);
    - finite replication capacity (limited cache size, for example);
- inherent communication is the communication that would occur even with unlimited capacity for local replication, perfect transfer granularity, and perfect global knowledge;
- reducing artifactual communication:
    1) exploiting temporal locality:
        - the same object is referenced more than once;
        - adjacent objects are handled by the same process;
        - use of blocking (tiling) so that the working set fits in the cache (see the tiling sketch after this list);
    2) exploiting spatial locality:
        - if a memory location is referenced now, memory locations close to it are likely to be referenced in the near future;
        - because data is transferred at a fixed granularity, poor spatial locality degrades performance substantially;
        - also important are the granularities of allocation and coherence; the interaction of the latter with poor spatial locality introduces false sharing (see the padding sketch after this list);
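A minimal sketch of removing false sharing by padding, assuming a 64-byte coherence (cache-line) granularity; the counter layout, thread count, and iteration count are illustrative, not taken from the text:

```c
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64        /* assumed coherence (cache-line) size in bytes */
#define NTHREADS    4
#define ITERS       10000000L

/* Without the pad field the four counters would share one cache line, so
   logically independent writes by different threads would keep invalidating
   each other's copies (false sharing). Padding each counter out to a full
   line removes this artifactual communication without changing the logic. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;          /* each thread writes only its own line */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long id = 0; id < NTHREADS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NTHREADS; id++)
        pthread_join(t[id], NULL);
    for (int id = 0; id < NTHREADS; id++)
        printf("counter[%d] = %ld\n", id, counters[id].value);
    return 0;
}
```

Compile with, e.g., -pthread; the same layout trick applies to any frequently written per-thread data.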
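A sketch of the blocking (tiling) idea mentioned under temporal locality above, using matrix multiplication as the example; the matrix size N and block size B are assumptions to be tuned to the actual cache:

```c
#include <stdio.h>

#define N 1024
#define B 64    /* block size; assumed so that three BxB tiles fit in cache */

/* Naive triple loop: every element of b is streamed from memory N times,
   so the working set exceeds the cache and temporal locality is lost. */
void matmul_naive(const double *a, const double *b, double *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i * N + k] * b[k * N + j];
            c[i * N + j] = s;
        }
}

/* Blocked (tiled) version: the same arithmetic, restructured so that the
   BxB tiles currently in use are reused many times before being evicted. */
void matmul_blocked(const double *a, const double *b, double *c) {
    for (int i = 0; i < N * N; i++) c[i] = 0.0;
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = a[i * N + k];
                        for (int j = jj; j < jj + B; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}

static double A[N * N], Bmat[N * N], Cmat[N * N];

int main(void) {
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; Bmat[i] = 2.0; }
    matmul_blocked(A, Bmat, Cmat);
    printf("Cmat[0] = %f\n", Cmat[0]);   /* expect 2.0 * N */
    return 0;
}
```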
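A sketch of static vs. dynamic assignment from the partitioning notes above, expressed with OpenMP scheduling clauses; the work() routine with irregular, iteration-dependent cost is a hypothetical stand-in for a real task, and the chunk size of 32 is only one possible choice of task granularity:

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical task whose cost varies with i, so a static split is unbalanced. */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < (i % 64) * 1000; k++)
        x += sin(k) * cos(k);
    return x;
}

int main(void) {
    const int N = 4096;
    double total = 0.0;

    /* Static assignment: iterations are mapped to threads before execution
       and the mapping never changes; cheapest to manage, but load balance
       suffers when iteration costs differ. */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work(i);

    /* Dynamic tasking: idle threads grab chunks of 32 iterations from a
       shared pool; the chunk size sets the task granularity and trades
       load balance against queue/synchronization overhead. */
    #pragma omp parallel for schedule(dynamic, 32) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work(i);

    printf("%f\n", total);
    return 0;
}
```

Compile with, e.g., -fopenmp -lm; timing the two loops separately shows the load-balance vs. overhead trade-off directly.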
- cost of communication:
    C = Frequency * (Overhead + Delay + Length/Bandwidth + Contention - Overlap)
- the objective is to reduce the overhead, delay, communication volume, and contention, while trying to maximize the overlap between waiting and computing;
- overhead can be reduced by sending fewer, longer messages; this is easier when explicit (send/receive) communication is used;
- reducing delay basically means optimizing the network hardware and reducing the number of network hops; mapping parallel algorithms onto the network topology is important to reduce communication costs;
- network contention vs. endpoint contention (hot spots);
- overlapping communication with computation is possible if the application provides enough concurrency (see the non-blocking MPI sketch at the end of this section);
- parallel computing adds synchronization and remote-transfer delays, so the speedup does not simply track the number of processors;
- the following costs are found in parallel programs (a + indicates a cost that is higher than, or does not exist in, the corresponding sequential program):
    - busy-useful: time doing useful work;
    - busy-overhead (+);
    - data-local (+/-);
    - data-remote (+);
    - synchronization (+);
- the data-local portion may decrease in parallel programs, since each processor works on a subset of the data and may therefore show better locality of reference. When this reduction is significant compared with the cost of remote communication, we may get superlinear speedups;
- aspects of parallel programming better handled by shared memory:
    - naming: data location is implicit with SM;
    - replication: must be explicit with MP;
    - communication overhead;
- aspects better handled by message passing:
    - message size: the transfer block size is defined by the message itself;
    - synchronization;
    - hardware or design cost;
    - performance predictability;
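A sketch of overlapping communication with computation, referenced in the cost-of-communication notes above; it assumes a ring exchange between MPI processes, and compute_interior/compute_boundary are hypothetical placeholders for work that does not, or does, depend on the incoming data:

```c
#include <mpi.h>

#define N 100000

/* Placeholder routines: interior work needs no remote data, boundary work does. */
static void compute_interior(double *u, int n)                     { (void)u; (void)n; }
static void compute_boundary(double *u, const double *halo, int n) { (void)u; (void)halo; (void)n; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double u[N], halo[N];
    int right = (rank + 1) % size;          /* illustrative ring: send right, ... */
    int left  = (rank - 1 + size) % size;   /* ... receive from the left          */

    MPI_Request reqs[2];
    /* Start the exchange early with non-blocking calls, ... */
    MPI_Irecv(halo, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(u,    N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... do the work that does not need the incoming data while the
       transfer is in flight (this is the Overlap term in the cost model), ... */
    compute_interior(u, N);

    /* ... and only wait for completion where the data is actually needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u, halo, N);

    MPI_Finalize();
    return 0;
}
```

The interior computation runs while the transfer is in flight, which is exactly the overlap subtracted in the communication-cost formula.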