Chapter 7: Scalable Multiprocessors
-----------------------------------

- objective: scale to hundreds or thousands of processors;
- a shared bus has a maximum length of a foot or two, a fixed number of slots, and a fixed maximum bandwidth;
- with shared-bus architectures, all processors are trusted; this is completely different in NOWs and other scalable systems;
- massively parallel processors (MPPs) employ sophisticated packaging and a fast proprietary network, so that a large number of processors can be located in a confined space, with low latency and high bandwidth;
- four aspects of scalability:
  - how bandwidth/throughput increases;
  - how latency increases;
  - how the cost of the system increases;
  - how the systems are packaged and put together;
- multiple, independent wires (links) are needed to scale; we call the number of outputs (or inputs) the DEGREE of the switch;
* key property of a scalable system: provide a large number of
* independent communication paths, such that bandwidth increases as
* nodes are added to the system;
- time to transfer n bytes between 2 nodes:
      T(n) = Overhead + Channel time + Routing delay
* store-and-forward routing is impractical in large-scale parallel
* machines;
- with a fixed-degree switch, the routing delay necessarily increases as processors are added; however, the increase should be small compared to the overhead and channel time;
- scaling cost: Cost(p,m) = Fixed cost + Incremental cost(p,m)
- definition of a cost-effective system: Speedup(p) > Costup(p), where Costup(p) = Cost(p) / Cost(1);
* is there any system that does not satisfy the "cost-effective"
* definition above? doesn't make a lot of sense to me...
- physical scaling: some designs have integrated the communication architecture directly into the processor chip;
- hypercube or n-cube: with N = 2^n nodes, each node connects directly to n = log2(N) other nodes;
- board-level integration is another approach; it uses standard microprocessor components; boards may contain bus-based multiprocessors;
- system-level integration: connecting almost complete machines (e.g. IBM SP2);
- practical design: balance between computational performance, communication performance, and cost;
- in large-scale parallel machines, bus transactions give way to network transactions;
- in network transactions, source and destination are decoupled;
- there is no global arbitration, and many network transactions can be initiated simultaneously;
- explicit naming is needed to perform routing;
- address translation process: global address -> module name + local address;
- synchronous vs. asynchronous message passing; blocking vs. non-blocking;
* a send can be buffered or not. In a non-buffered send, there is a
* 3-message transaction, where the send is matched to a receive
* posted at the destination and only then is the data copied into the
* final buffer. With buffered sends, a temporary buffer is used if
* there is no matching receive;
- if temporary buffer space is small, a 3-message handshake (like the one for non-buffered sends) can be used;
- basically, there are 3 alternatives:
  - shared address space;
  - message passing;
  - active messages, which are essentially a restricted form of RPC;
- active messages: each message specifies a handler to be invoked at the destination when the message arrives (see the sketch below);
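A minimal sketch of the active-message idea in C; it is not the book's or any particular library's API. The handler table, packet layout, and the one-slot in-memory "network" are all made up for illustration; the point is only that the destination runs the named handler directly on the incoming data instead of matching it against a posted receive.

    /* Sketch of an active message: each packet carries the index of a
     * handler that the destination invokes as soon as the packet is
     * pulled from the network.  A one-slot in-memory "network" stands
     * in for the real interconnect so the sketch is self-contained. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define AM_MAX_PAYLOAD 64

    typedef struct {
        uint32_t handler_id;               /* which handler to run     */
        uint32_t len;                      /* payload length in bytes  */
        uint8_t  payload[AM_MAX_PAYLOAD];
    } am_packet_t;

    typedef void (*am_handler_t)(const void *payload, uint32_t len);

    static am_handler_t handler_table[16]; /* registered on every node */

    static am_packet_t net_slot;           /* stand-in for the network */
    static int         net_full = 0;

    /* sender side: name the handler and inject the packet
     * (len is assumed to be <= AM_MAX_PAYLOAD) */
    static void am_send(uint32_t handler_id, const void *payload, uint32_t len)
    {
        net_slot.handler_id = handler_id;
        net_slot.len = len;
        memcpy(net_slot.payload, payload, len);
        net_full = 1;
    }

    /* receiver side: no receive matching or buffering -- the handler
     * is invoked directly on the incoming data */
    static void am_poll(void)
    {
        if (net_full) {
            net_full = 0;
            handler_table[net_slot.handler_id](net_slot.payload, net_slot.len);
        }
    }

    /* example handler: apply a remote increment to a local counter */
    static long counter = 0;
    static void incr_handler(const void *payload, uint32_t len)
    {
        long delta;
        (void)len;
        memcpy(&delta, payload, sizeof delta);
        counter += delta;
    }

    int main(void)
    {
        handler_table[0] = incr_handler;
        long delta = 5;
        am_send(0, &delta, sizeof delta);  /* "remote" node sends        */
        am_poll();                         /* destination drains network */
        printf("counter = %ld\n", counter);
        return 0;
    }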
- common challenges:
  - input buffer overflow;
  - fetch deadlock: occurs when the output buffers are full and the node wants to process a request it has just received (how can it send the response with the output buffers full?); the solution is to have separate buffer pools for requests and responses;

- Physical DMA
  ------------
- no interpretation is placed on the information within a network transaction;
- blind physical DMA: the destination communication assist deposits the transaction data into storage, whereupon it will be interpreted by the processor;

- User level access
  -----------------
- messages can be sent directly to/from user-level buffers; each network transaction carries a user/system flag; a region of the address space is mapped to the network input and output ports as well as to the status registers. While user-level receive calls are not made, messages stay in the input buffers (in the NIC); one issue to be considered is whether this portion of the VA space is cacheable or not (p. 493);
- most basic level of hardware implementation;

- Dedicated message processing
  ----------------------------
- use of dedicated hardware, but without binding the interpretation to the hardware design; the interpretation is done in software on a dedicated communication processor (CP);
- the CP can be integrated with the NIC or connected to a shared bus together with the main processor;
- details of communication are hidden from the main processor P. As P and CP have shared memory, sending a message is basically writing to shared memory;
- both DMA and user-level access can be implemented;
* flow of CONTROL: P -> CP -> NIC -> NIC -> CP -> P
* flow of DATA:    M(P) -> M(CP) -dma-> NIC -> NIC -dma-> M(CP) -> M(P)
* the data is basically copied into shared memory by P; after that, CP
* initiates DMA to transfer the data from shared memory to the buffers
* in the NIC; however, as communication is via shared memory,
* communication performance is strongly influenced by the efficiency
* of the cache coherence protocol;
- before P can send a second message, the first one needs to be invalidated from CP's cache (assuming the same buffer is reused);
- ex: Intel Paragon - DMA used between shared memory and the NIC;

- Meiko CS-2
  ----------
- integrated CPs (NICs);
- several logical CPs implemented by a single time-multiplexed processor;

- Shared Physical address space
  -----------------------------
- a VA is translated to a PA that indicates the location of a memory block;
- combinations of shared/private and remote/local may complicate the work of cache coherence and memory consistency protocols. Ex: private/local data is easier to handle than shared/remote data (covered in chap. 9);
- Ex: CRAY T3D: 2048 nodes, each with a 150 MHz Alpha 21064 and 64 MB; additional registers are used to extend the PA from 32 bits to 48 bits (5-bit register index + 27-bit offset ==> 21-bit node number + 27-bit offset); a sketch of this expansion follows below;
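A small illustration of that kind of address extension in C. The field widths come straight from the note above (5-bit register index, 27-bit offset, 21-bit node number), but the array and function names are made up for the sketch and are not the T3D's actual hardware interface.

    /* Sketch of external-register address extension: the top 5 bits of
     * a 32-bit PA select a register holding a 21-bit node number, which
     * is concatenated with the 27-bit local offset to form a 48-bit
     * global address. */
    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS  27
    #define INDEX_BITS    5
    #define OFFSET_MASK  ((1u << OFFSET_BITS) - 1)

    static uint32_t annex[1 << INDEX_BITS];  /* each entry: 21-bit node number */

    /* expand a 32-bit local PA into a 48-bit (node, offset) global address */
    static uint64_t expand_pa(uint32_t pa)
    {
        uint32_t index  = pa >> OFFSET_BITS;     /* top 5 bits  */
        uint32_t offset = pa & OFFSET_MASK;      /* low 27 bits */
        uint64_t node   = annex[index];          /* 21-bit node */
        return (node << OFFSET_BITS) | offset;
    }

    int main(void)
    {
        annex[3] = 0x1ABCD;        /* pretend register 3 names node 0x1ABCD */
        uint32_t pa = (3u << OFFSET_BITS) | 0x123456;
        printf("global address = 0x%012llx\n",
               (unsigned long long)expand_pa(pa));
        return 0;
    }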
- Clusters and Networks of Workstations (NOWs)
  --------------------------------------------
- a consequence of the availability of scalable communication networks (with falling prices, of course);
- NOWs: good for problems with a high computation-to-communication ratio; however, this is becoming less of an issue with better communication technologies: switched Ethernet, FDDI, HIPPI, ATM, Myrinet, etc.;

- Myrinet SBUS Lanai
  ------------------
- 3 DMA engines (Net -> NIC, NIC -> Net, NIC SRAM <-> host memory);
- the NIC SRAM is memory-mapped;

- PCI memory channel
  ------------------
- reflective memory: establishes a connection between a region of the address space of one process (a transmit region) and a receive region in another process. The transmit region usually consists of virtual address space mapped to the NIC memory. Receive regions are usually in the process's memory, with the NIC performing a DMA operation on packet reception;

- Performance Lessons (sec. 7.8)          ***** READ AGAIN *****
  ------------------------------
- Paragon: an inefficient bus-based coherence protocol creates a high send overhead;
* the cost of uncached operations, misses, and synchronization
* instructions, generally considered to be infrequent events and
* therefore a low priority for architecture optimization, is critical
* to communication performance;
- depending on how you scale a problem, you can significantly change the communication-to-computation ratio, creating a huge problem for machines with higher communication costs (e.g. IBM SP2); we can see this in figure 7.35;
- in the measurements shown, even though communication costs increase as the number of processors increases, a good (almost perfect) speedup is achieved by some machines because of a significant decrease in cache misses (a higher percentage of the local working set gets cached at each node);

- Locks
  -----
- a problem with the array-based lock on machines with physically distributed memory is that we don't know a priori which array location a given process will spin on, so that location cannot be allocated in that process's local memory (and spinning on local memory is, of course, better);
- solution: software queueing lock. The idea is to maintain a distributed linked list, i.e. a queue of the waiters on a lock. The process at the head holds the lock; every other node in the list is waiting on the lock, and each one's spin variable is allocated in that process's local memory; a tail pointer points to the last enqueued process;
- processes spin on flags in their local nodes;
- one remote write per acquire and another one per release to update the list pointers (right? not clear from the text); see the sketch below;
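A compact sketch of such a software queueing lock, written with C11 atomics; this is essentially the classic MCS lock, and the type and function names below are mine, not the book's. Each process brings its own queue node, enqueues itself with an atomic swap on the tail pointer, and then spins only on the flag inside its own node, which on a distributed-memory machine would be allocated in local memory.

    /* Software queueing (MCS-style) lock sketch using C11 atomics.
     * Usage: one shared qlock_t (tail initialized to NULL) and one
     * qnode_t per acquiring process, passed to acquire and release. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool             locked;  /* true while this waiter must spin */
    } qnode_t;

    typedef struct {
        _Atomic(qnode_t *) tail;         /* last enqueued process, or NULL */
    } qlock_t;

    void qlock_acquire(qlock_t *lock, qnode_t *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);

        /* enqueue ourselves with an atomic swap on the tail pointer */
        qnode_t *prev = atomic_exchange(&lock->tail, me);
        if (prev != NULL) {
            /* someone holds the lock: link into the predecessor's node
             * and spin on our own (local) flag */
            atomic_store(&prev->next, me);
            while (atomic_load(&me->locked))
                ;   /* spin locally */
        }
    }

    void qlock_release(qlock_t *lock, qnode_t *me)
    {
        qnode_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no known successor: try to reset the tail to empty */
            qnode_t *expected = me;
            if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
                return;                 /* queue really was empty */
            /* a successor is enqueueing; wait for it to link itself in */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        /* wake the successor by writing its (remote) flag */
        atomic_store(&succ->locked, false);
    }

In this sketch, acquire performs an atomic swap on the shared tail pointer (and, if there is a predecessor, one write to link into its node), and release performs one write to the successor's flag; everything else is spinning on local memory, which is the point of the scheme for distributed-memory machines.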