A New Page Table for 64-bit Address Spaces
--- Talluri et al.


- increasing address space size adversely affects page table an translation
  lookaside buffer (TLB) cost or performance;

- the main idea of the paper is to show why conventional page tables
  are not suited to 64-bit address spaces, and define new table,
  called clustered page table.

- clustered page tables are effective when spatial locality makes it
  likely that consecutive pages are in contemporaneous use;

- the second contribution of the paper is to extend page tables to support
  superpage and subblock PTEs, used to feed the TLB.

- a linear page table stores all PTEs for a process in a single array, 
  indexed by the virtual page number (VPN).  As they are very large, 
  they reside in virtual space, using page faults to dynamically populate
  the table.

- the extension of linear page tables to 64-bit addresses uses a virtual
  array with 4 thousand trillion entries and a 6-level tree.  However, 
  this design IS practical, as a portion of the TLB is reserved for mappings 
  to the page tables and the tree is rarely traversed;

- a forward-mapped page table store PTEs in n-ary trees, with each level
  of the tree indexed using fixed address fields in the VPN.  However, 
  7 levels are needed to 64-bit address spaces, which the overhead
  not acceptable.

- hashed page tables are other choices.  Each PTE in the hash table,
  stores the mapping information for one page, a tag identifying the VPN,
  and a next pointer.  A drawback is that the tag and the pointer are
* 8 bytes each, resulting in 16 bytes of overhead for each 8 bytes of 
  mapping information;  One PTE for each physical page table.

- clustered page tables are a variant of hashed page tables that store
  mapping information for several consecutives pages, with a single tag
  and next pointer.

- each node in the hash table stores one tag but stores mapping information
  for multiple base pages that belong to the same page block - an aligned
  group of consecutive pages.  The number of base pages in a page block   
  is the subblock factor.

- the VPN is split into a virtual page block number (VPBN) and a block
  offset (Boff); 

- shorter hash table lists (buckets);

- CPT can require more space than hash tables, if the address space is 
  really sparse, say 1 page used for each cluster;

- CPT require the OS to acquire a single lock for an entire page block; 
  this can restrict concurrent page table lookups (however, you can use
  reader/writer locks, allowing concurrent reads.)

- to increase the hit ratio in 64-bit architectures, designers are 
  using superpages, power-of-two multiples of base pages that are
  aligned in both virtual and physical memory;

- when subblocking is used in the TLB, the data are is much harder (contains
  multiple PPNs.)  One way to reduce the area is to store a single
  PPN and require physical pages for a PTE to be stored in a aligned
  block of physical memory (this is called partial-subblocking.)  Valid 
  bits make a partial-subblock TLB different from a superpage TLB, as just 
  a subset of a subblock may be resident at a given time;

- instead of having the TLB miss handler construct a superpage or 
  partial-subblock PTE, a more efficient way  would be to modify the 
  page tables to store superpage or partial-subblock PTEs.

- there are at least two solutions for supporting superpages that work 
  for any page table: Replicate PTEs and Multiple Page Tables.  The first
  solution stores a superpage PTE at the page table site of evey 
  base page PTE.  The drawbacks of Replicate PTEs is that we loose
  the space advantage of superpages in the page table and make more
  difficult the updating of a mapping.  When using Multiple Page Tables,
  one page table is kept for each page size in use.

- Linear Intermediate Nodes is used by aggregating a subtree in a linear
  page table for a superpage.  Therefore, each entry in the tree can be
  either a pointer to the next level or a superpage PTE.

- The same idea can be applied to forward-mapped page tables (Forward-Mapped
  Intermediate Nodes.)

- another possibility is to assume a fixed superpage size and use this
  to index a hash page table.  Among the disadvantages of this method is
  the problem of handling bigger superpages.

- Talluri et al. conclude that the Replicate PTE method is probably the
  best method.

- another technique for increasing the memory mapped by a TLB is
  complete subblocking.  In this case, there are block misses and subblock
  misses.  Subblock prefetching can be used, but increase the TLB miss
  penalty (how bad it is depends on the page table used.)

- Talluri adds a field (S) in the clustered page table PTEs to indicate
  if the PTE is a subblock/superpage PTE or not.  This way, the same
  table can have both S-PTEs and base page PTEs. "we are able to 
  service TLB misses to both partial-subblock and base page PTEs without
  increasing the TLB miss penalty while using less memory for partial-
  subblock PTEs."

- the authors define ways to handle superpages bigger than the total
  size of a PTE.