Simple But Effective Techniques for NUMA Memory Management
Bolosky, Fitzgerald, Scott

Overview

* How do we manage memory on a NUMA machine, where each processor has a fast, unshared, local memory and there is a slow, shared, global memory?
  o candidates for doing the management: hardware, OS, libraries or compiler, explicit application control
  o hardware is too expensive; application control unnecessarily burdens programmers
  o this paper puts the management in the OS

Implementation

* used the Mach OS: memory management is divided into machine-independent and machine-dependent parts, separated by a well-defined pmap interface
* used the IBM ACE multiprocessor: up to 8 processors or 128 MB of global memory (but not both at the same time); global memory is about twice as slow as local memory (8 MB per processor)
* took the existing pmap (machine-dependent) layer and divided it into two modules: a pmap manager (which exports the pmap interface to the machine-independent part of Mach) and an mmu interface
* wrote two other modules: a NUMA manager (which maintains consistency of pages cached in local memories) and a NUMA policy module (which decides whether a page should be placed in local or global memory)

The NUMA Manager

* local memories are used as a cache for global memory
* Mach is told that the available memory is only as big as the global memory
* each page is in one of these states:
  o read-only: may appear in 0 or more local memories, must have read-only MMU protection in all of them
  o local-writable: in exactly 1 local memory, may be writable
  o global-writable: in global memory only, may be writable
* the NUMA manager calls the NUMA policy module through a single function, cache_policy(logical_page, protection), which returns LOCAL or GLOBAL; the actions then taken by the manager are summarized by the small FSM given in Tables 1 and 2 of the paper (see the state-machine sketch after these notes)

NUMA Policy

* initially place all pages in the local memory of whichever processor uses them first
* read-only pages are replicated in many local memories
* privately writable pages are moved to the processor that writes them
* shared writable pages (at least 1 writer, at least 1 other reader or writer) bounce between local caches as the manager keeps the caches consistent
* these moves between caches are counted for each page, and the page is placed in global memory (i.e., removed from all local caches) once a threshold (a global constant, default 4) is passed; it stays there until it is freed (see the policy sketch after these notes)

Changes to the Machine-independent Part of Mach

* the pmap interface had to be slightly extended to handle the NUMA architecture (see the interface sketch after these notes):
  o two new calls were added: pmap_free_page, which is called when a physical page frame is freed and starts a lazy cleanup of the frame, and pmap_free_page_sync, which is called when a new frame is allocated and waits for that cleanup to finish; these calls are needed to reset the cache state
  o a minimum-allowed-permissions parameter was added to pmap_enter, so that, for example, a shared-writable page can be mapped read-only in order to force a write fault
  o a target-processor argument was added to pmap_enter, so the NUMA manager knows which processor should receive the page

Other Notes

* the new pmap level itself must remain pinned in (global) memory
* "false sharing", where unshared objects used by different processors happen to be located on the same memory page, can cause a significant performance hit; compiler support may help here
* how do you handle transient behaviour? perhaps allow applications to hint at run time that a certain object will or will not be shared soon
* how do you deal with process migration? currently, pages just end up in global memory (ick)
* they implemented a single policy and found that it performed "well", but they did not compare it to any other scheme, either hardware or software
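
State-machine sketch. The page-state handling described under "The NUMA Manager" can be pictured with a minimal C sketch. This is a hypothetical rendering, not the paper's code: the types, the helper functions (replicate_read_only, move_to_local, move_to_global), and the fault entry point are all assumed names; the real transitions are the FSM given in Tables 1 and 2 of the paper.

    #include <stdbool.h>

    typedef enum { LOCAL, GLOBAL } placement_t;   /* returned by the policy module */
    typedef enum { READ_ONLY, LOCAL_WRITABLE, GLOBAL_WRITABLE } page_state_t;

    typedef struct numa_page {
        page_state_t state;      /* current caching state of the logical page */
        int          owner_cpu;  /* meaningful only in the LOCAL_WRITABLE state */
    } numa_page_t;

    /* Provided by the NUMA policy module. */
    extern placement_t cache_policy(numa_page_t *page, bool writable);

    /* Hypothetical helpers, assumed to go through the mmu interface and to
     * invalidate or downgrade any conflicting copies as a side effect. */
    extern void replicate_read_only(numa_page_t *page, int cpu);
    extern void move_to_local(numa_page_t *page, int cpu);
    extern void move_to_global(numa_page_t *page);

    /* Called when cpu faults on a page: ask the policy module where the page
     * should live, carry out the placement, and record the new state. */
    void numa_manager_fault(numa_page_t *page, int cpu, bool writable)
    {
        if (cache_policy(page, writable) == GLOBAL) {
            move_to_global(page);            /* evict all local copies */
            page->state = GLOBAL_WRITABLE;
        } else if (!writable) {
            replicate_read_only(page, cpu);  /* one more read-only local copy */
            page->state = READ_ONLY;
        } else {
            move_to_local(page, cpu);        /* single writable copy on cpu */
            page->state = LOCAL_WRITABLE;
            page->owner_cpu = cpu;
        }
    }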
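
Policy sketch. The move-counting policy can be sketched the same way. This is a self-contained illustration, independent of the sketch above: the default threshold of 4 comes from the paper, while the struct fields and the causes_move flag are assumptions made for the example.

    #define MOVE_THRESHOLD 4          /* global constant, default 4 in the paper */

    typedef enum { LOCAL, GLOBAL } placement_t;

    typedef struct policy_page {
        int move_count;               /* times the page has bounced between local caches */
        int pinned_global;            /* once set, the page stays global until freed */
    } policy_page_t;

    /* Decide where the page should live; called by the NUMA manager on a fault.
     * causes_move is nonzero if servicing this fault would pull the page out of
     * another processor's local cache. */
    placement_t cache_policy(policy_page_t *page, int causes_move)
    {
        if (page->pinned_global)
            return GLOBAL;            /* remains in global memory until freed */

        if (causes_move && ++page->move_count > MOVE_THRESHOLD) {
            page->pinned_global = 1;  /* too much ping-ponging: stop caching it */
            return GLOBAL;
        }
        return LOCAL;                 /* default: cache in the faulting processor's memory */
    }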
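
Interface sketch. A rough idea of what the extended pmap interface might look like as C declarations. The names pmap_enter, pmap_free_page, and pmap_free_page_sync come from the paper, but the argument types, ordering, and return types shown here are assumptions rather than Mach's actual signatures.

    typedef struct pmap  *pmap_t;
    typedef unsigned long vm_offset_t;
    typedef int           vm_prot_t;

    /* pmap_enter gains two NUMA-related arguments: the minimum protection that
     * must be granted (so a shared-writable page can be mapped read-only to
     * force a write fault) and the target processor that should receive the page. */
    void pmap_enter(pmap_t pmap, vm_offset_t va, vm_offset_t pa,
                    vm_prot_t prot, vm_prot_t min_prot, int target_cpu);

    /* Called when a physical page frame is freed; starts a lazy cleanup of the
     * frame's NUMA cache state. */
    void pmap_free_page(vm_offset_t pa);

    /* Called when a frame is allocated; waits for any pending cleanup of that
     * frame to finish so its cache state is known to be reset. */
    void pmap_free_page_sync(vm_offset_t pa);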