Simple But Effective Techniques for NUMA Memory Management
Bolosky, Fitzgerald, Scott

Overview

* How do we manage memory on a NUMA machine, where each processor has a fast, unshared, local memory and there is a slow, shared, global memory?
  o candidates for doing the management: hardware, OS, libraries or compiler, explicit application control
  o hardware is too expensive; application control unnecessarily burdens programmers
  o this paper puts the management in the OS

Implementation

* used the Mach OS: memory management is divided into machine-independent and machine-dependent parts, separated by a well-defined pmap interface
* used the IBM ACE multiprocessor: up to 8 processors or 128 MB of global memory (but not both at the same time); global memory is about twice as slow as local memory (8 MB per processor)
* took the existing pmap (machine-dependent) layer and divided it into two modules: a pmap manager (which exports the pmap interface to the machine-independent part of Mach) and an mmu interface
* wrote two other modules: a NUMA manager (which maintains consistency of pages cached in local memories) and a NUMA policy module (which decides whether a page should be placed in local or global memory)

The NUMA Manager

* local memories are used as a cache for global memory
* Mach is told that the available memory is only as big as the global memory
* each page is in one of these states:
  o read-only: may appear in 0 or more local memories, must have read-only MMU protection in all of them
  o local-writable: in exactly 1 local memory, may be writable
  o global-writable: in global memory only, may be writable
* the NUMA manager calls the NUMA policy module through a single function, cache_policy(logical_page, protection), which returns LOCAL or GLOBAL; the actions then taken by the manager are summarized by the small FSM given in Tables 1 and 2 of the paper (see the state-machine sketch after these notes)

NUMA Policy

* initially place all pages in the local memory of whichever processor uses them first
* read-only pages are replicated in many local memories
* privately writable pages are moved to the processor that writes them
* shared writable pages (at least 1 writer, at least 1 other reader or writer) bounce between local caches as the manager keeps the caches consistent
* these moves between caches are counted for each page, and the page is placed in global memory (i.e., removed from all local caches) once a threshold (a global constant, default 4) is passed; it stays there until it is freed (see the policy sketch after these notes)

Changes to the Machine-independent Part of Mach

* the pmap interface had to be slightly extended to handle the NUMA architecture (see the interface sketch after these notes):
  o two new calls were added: pmap_free_page, which is called when a physical page frame is freed and starts a lazy cleanup of the frame, and pmap_free_page_sync, which is called when a new frame is allocated and waits for that cleanup to finish; these calls are needed to reset the cache state
  o a minimum-allowed-permissions parameter was added to pmap_enter, so that, for example, a shared-writable page can be mapped read-only in order to force a write fault
  o a target-processor argument was added to pmap_enter, so the NUMA manager knows which processor should receive the page

Other Notes

* the new pmap level itself must remain pinned in (global) memory
* "false sharing", where unshared objects used by different processors happen to be located on the same memory page, can cause a significant performance hit; compiler support may help here
* how do you handle transient behaviour? perhaps allow applications to hint at run time that a certain object will or will not be shared soon
* how do you deal with process migration? currently, pages just end up in global memory (ick)
* they implemented a single policy and found that it performed "well", but they did not compare it to any other scheme, either hardware or software
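
State-machine sketch. The page-state handling described under "The NUMA Manager" can be pictured with a minimal C sketch. This is a hypothetical rendering, not the paper's code: the types, the helper functions (replicate_read_only, move_to_local, move_to_global), and the fault entry point are all assumed names; the real transitions are the FSM given in Tables 1 and 2 of the paper.

    #include <stdbool.h>

    typedef enum { LOCAL, GLOBAL } placement_t;   /* returned by the policy module */
    typedef enum { READ_ONLY, LOCAL_WRITABLE, GLOBAL_WRITABLE } page_state_t;

    typedef struct numa_page {
        page_state_t state;      /* current caching state of the logical page */
        int          owner_cpu;  /* meaningful only in the LOCAL_WRITABLE state */
    } numa_page_t;

    /* Provided by the NUMA policy module. */
    extern placement_t cache_policy(numa_page_t *page, bool writable);

    /* Hypothetical helpers, assumed to go through the mmu interface and to
     * invalidate or downgrade any conflicting copies as a side effect. */
    extern void replicate_read_only(numa_page_t *page, int cpu);
    extern void move_to_local(numa_page_t *page, int cpu);
    extern void move_to_global(numa_page_t *page);

    /* Called when cpu faults on a page: ask the policy module where the page
     * should live, carry out the placement, and record the new state. */
    void numa_manager_fault(numa_page_t *page, int cpu, bool writable)
    {
        if (cache_policy(page, writable) == GLOBAL) {
            move_to_global(page);            /* evict all local copies */
            page->state = GLOBAL_WRITABLE;
        } else if (!writable) {
            replicate_read_only(page, cpu);  /* one more read-only local copy */
            page->state = READ_ONLY;
        } else {
            move_to_local(page, cpu);        /* single writable copy on cpu */
            page->state = LOCAL_WRITABLE;
            page->owner_cpu = cpu;
        }
    }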
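
Policy sketch. The move-counting policy can be sketched the same way. This is a self-contained illustration, independent of the sketch above: the default threshold of 4 comes from the paper, while the struct fields and the causes_move flag are assumptions made for the example.

    #define MOVE_THRESHOLD 4          /* global constant, default 4 in the paper */

    typedef enum { LOCAL, GLOBAL } placement_t;

    typedef struct policy_page {
        int move_count;               /* times the page has bounced between local caches */
        int pinned_global;            /* once set, the page stays global until freed */
    } policy_page_t;

    /* Decide where the page should live; called by the NUMA manager on a fault.
     * causes_move is nonzero if servicing this fault would pull the page out of
     * another processor's local cache. */
    placement_t cache_policy(policy_page_t *page, int causes_move)
    {
        if (page->pinned_global)
            return GLOBAL;            /* remains in global memory until freed */

        if (causes_move && ++page->move_count > MOVE_THRESHOLD) {
            page->pinned_global = 1;  /* too much ping-ponging: stop caching it */
            return GLOBAL;
        }
        return LOCAL;                 /* default: cache in the faulting processor's memory */
    }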
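
Interface sketch. A rough idea of what the extended pmap interface might look like as C declarations. The names pmap_enter, pmap_free_page, and pmap_free_page_sync come from the paper, but the argument types, ordering, and return types shown here are assumptions rather than Mach's actual signatures.

    typedef struct pmap  *pmap_t;
    typedef unsigned long vm_offset_t;
    typedef int           vm_prot_t;

    /* pmap_enter gains two NUMA-related arguments: the minimum protection that
     * must be granted (so a shared-writable page can be mapped read-only to
     * force a write fault) and the target processor that should receive the page. */
    void pmap_enter(pmap_t pmap, vm_offset_t va, vm_offset_t pa,
                    vm_prot_t prot, vm_prot_t min_prot, int target_cpu);

    /* Called when a physical page frame is freed; starts a lazy cleanup of the
     * frame's NUMA cache state. */
    void pmap_free_page(vm_offset_t pa);

    /* Called when a frame is allocated; waits for any pending cleanup of that
     * frame to finish so its cache state is known to be reset. */
    void pmap_free_page_sync(vm_offset_t pa);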