Architectural Support for Translation Table Management in Large Address Space Machines --- Huck et al., 1993

- the management of page tables varies from full hardware implementations to complete software algorithms;
- the Hashed Page Table (HPT) is presented in the paper; they claim a 5% to 10% improvement in simulations of over 4 billion instructions;
- the instruction set provides management instructions to enable and disable translation, change translations, and control the protection mode;
- the page table entry (PTE) holds protection information and status bits, usually reference and dirty bits;
- the PTE is a superset of the information found in a TLB entry;
- a significant number of CPU cycles can be consumed in TLB management. Measurements show overhead values ranging between 5% and 40%;
- main objectives of a VM system:
  - minimize the time to service TLB misses;
  - minimize the physical memory needed to maintain the translations for the currently mapped pages;
  - maximize the flexibility for software to support a variety of VM mechanisms and capabilities;
- computer systems may use VM systems based on hardware, software, or both. Hardware = performance, software = flexibility;
- implementations (organizations) usually optimize for a defined workload;
* factors like OOP and memory-mapped files encourage more sharing, increase the size of the address space, and decrease locality;
- hash tables: theory suggests prime-number sizes; practice uses powers of two;
- two dominant architectures: forward-mapped (usually multi-level) and inverted (reverse-mapped);
- inverted page tables (IPTs) have typically been used by large address space architectures;
* interesting: if lots of TLB misses occur, the cache may seem to have a better hit rate, as there will be more accesses to the page table;

Forward-mapped Page Tables:
---------------------------
- most architectures with this structure allow portions of the hierarchy to be unallocated by using validity bits in the higher levels;
- some architectures provide a short circuit, linking lower-level entries higher in the hierarchy, when the address space is sparsely used;
- generally a per-process table, with some control register pointing to the first level;
- the page table itself can be referenced virtually or physically. If physically, then some table manager controls the amount of physical memory being used;
- address space identifiers (ASIDs), or PIDs, are used to generate a global naming system, removing the need to flush the TLB on every context switch;
- provides a flexible VM structure, good for aliasing, copy-on-write, and independent protection views;
- total cost to service a miss:
  - overhead to suspend execution;
  - memory and/or cache references to each level;
  - possible TLB miss with virtual tables;
  - possible page fault on the table itself;
  - possible updates of the dirty and reference bits.
- good for largely sequential or densely packed addresses; bad for sparse addresses;
- an interesting variant is a single flat page table, indexed by the VPN;
* the introduction of just a single additional cache miss in TLB miss handling can greatly reduce performance;
* cache miss times are increasing from 10-20 cycles toward 30-70 cycles. Why? The increasing gap between memory access time and processor speed?
* TLB hit rate is also an issue. Sharing may be done by having multiple entries in the TLB or just one for the physical address. So, there are two bad things: sharing through multiple entries in the TLB and the lack of PIDs, forcing a TLB flush on every context switch;
- the space required for page tables is a function of the amount and distribution of allocated virtual memory. In the best case, all entries in the leaf pages are used;
- best case: 1% of overhead. Worst case (one used entry per leaf page): 100%. Typical case: 4% of overhead;
- PTE size probably needs to increase to address the increasing physical memory size;
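To make that miss cost concrete, here is a minimal sketch (in C) of a software handler walking a hypothetical two-level forward-mapped table. The names (root_table, translate), the 10/10/12 bit split, and the PTE layout are assumptions for illustration, not the paper's actual formats; each level costs one memory reference, and a NULL level-1 entry models the validity bits that let subtrees stay unallocated.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LEVEL_BITS 10   /* assumed: 10 index bits per level */
#define PAGE_SHIFT 12   /* assumed: 4 KB pages              */
#define VALID_BIT  0x1u /* assumed: bit 0 = valid           */

typedef uint32_t pte_t; /* leaf entry: PPN plus status bits */

/* Level-1 table of pointers to leaf tables; NULL marks an
 * unallocated subtree (the higher-level validity check). */
static pte_t *root_table[1 << LEVEL_BITS];

/* Walk the two levels; returns the PTE, or 0 when the walk fails
 * (unallocated subtree or invalid entry -> page fault path). */
pte_t translate(uint32_t vaddr)
{
    uint32_t l1 = (vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & ((1u << LEVEL_BITS) - 1);
    uint32_t l2 = (vaddr >> PAGE_SHIFT) & ((1u << LEVEL_BITS) - 1);

    pte_t *leaf = root_table[l1];   /* 1st memory reference */
    if (leaf == NULL)
        return 0;

    pte_t pte = leaf[l2];           /* 2nd memory reference */
    if (!(pte & VALID_BIT))
        return 0;
    return pte;                     /* would be loaded into the TLB */
}

int main(void)
{
    /* Install one mapping: VPN 2 -> PPN 7, valid. */
    root_table[0] = calloc(1 << LEVEL_BITS, sizeof(pte_t));
    root_table[0][2] = (7u << PAGE_SHIFT) | VALID_BIT;

    printf("PTE: 0x%08x\n", translate(2u << PAGE_SHIFT)); /* hit           */
    printf("PTE: 0x%08x\n", translate(1u << 31));         /* 0: unmapped   */
    return 0;
}
```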
Inverted Page Tables:
---------------------
- it is a single table with an entry for each physical page;
- to convert VPN->PPN, a hash anchor table (HAT) is used. It is indexed by a function of the virtual address and provides a pointer to a linked list of potential IPT entries;
- it is a global table, shared by all processes;
- the biggest difficulty is the support for address aliasing. Some systems support sharing by using a global address (the same virtual address in both address spaces);
* it is interesting to note that in a balanced system, each entry in the page table will be referenced only a few times (will have a small chain of candidates). This increases the probability of a cache miss when an entry is looked up. The same holds for each of the candidates;
- TLB miss overhead:
  - CPU overhead to suspend execution;
  - memory reference to the HAT pointer;
  - memory reference to the IPT;
  - possible memory references for the chain elements;
  - possible update of the status bits.
* generally these memory references suffer high cache miss rates;
- the size of the table is a linear function of the amount of physical memory;
- memory-mapped I/O systems can waste significant storage in an IPT if it cannot be efficiently packed;

Hashed Page Table:
------------------
- combines the hash table and the IPT into a single data structure;
- generates fewer memory references and better memory usage;
- an entry (HPTE) contains both the VPN and the PPN;
- a hash of the virtual address is used to index the table;
- this basically avoids using a separate hash table;
- by making the HPT very large, it acts as a complete replacement for the older hash table + IPT structure;
- since the HPT entry is nearly identical to a TLB entry, it is easy to construct a hardware miss handler;
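Since an HPTE carries both the VPN tag and the PPN, a single probe of one structure resolves the common case. Below is a sketch of such a lookup under assumed details: the XOR-fold hash, the field names, and the valid bit are illustrative, and a real tag would also include the ASID; the paper defines the actual HPTE format.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* One HPTE: holds the VPN tag *and* the PPN, unlike a classic IPT
 * entry; 'next' is the collision chain, empty in the common case. */
typedef struct hpte {
    uint64_t     vpn;
    uint64_t     ppn;
    uint32_t     flags;   /* protection/status; bit 0 = valid (assumed) */
    struct hpte *next;
} hpte_t;

#define HPT_SIZE 4096u    /* power of two, as in practice */
static hpte_t hpt[HPT_SIZE];

/* Illustrative hash: XOR-fold the VPN, mask by the table size. */
static inline uint32_t hpt_hash(uint64_t vpn)
{
    return (uint32_t)(vpn ^ (vpn >> 12)) & (HPT_SIZE - 1);
}

/* One memory reference in the no-collision case, versus a HAT
 * reference plus IPT reference(s) in an inverted table. */
hpte_t *hpt_lookup(uint64_t vpn)
{
    for (hpte_t *e = &hpt[hpt_hash(vpn)]; e != NULL; e = e->next)
        if ((e->flags & 1u) && e->vpn == vpn)
            return e;
    return NULL;         /* trap to software / page fault path */
}

int main(void)
{
    uint64_t vpn = 0xABCDEu;
    hpte_t *slot = &hpt[hpt_hash(vpn)];
    slot->vpn = vpn; slot->ppn = 42; slot->flags = 1;   /* insert */

    hpte_t *e = hpt_lookup(vpn);
    printf("ppn = %llu\n", e ? (unsigned long long)e->ppn : 0ull);
    return 0;
}
```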
----

- TLB software miss algorithm in a forward-mapped page table:
  - move the faulting address to a general register;
  - determine if the reference is to system or user space;
  - move the user root table pointer to a general register;
  - determine the privilege level;
  - determine if the reference is to a different process's address space;
  - calculate the index into the root table;
  - load the root table entry;
  - calculate the index into the leaf page;
  - load the PTE;
  - ...
- the author uses some tricks to minimize the TLB miss handling time, such as aligning the table according to its size (the base address is a multiple of the table size), substituting a simple OR for the addition when calculating the address of a hashed entry (a small sketch follows the conclusions);

Conclusions:
------------
- by making a simple change to traditional IPTs, HPTs decrease the number of memory references, decreasing the number of cache misses and the TLB miss handling time;
- varying the participation of hardware and software, they define a native HPT and a hybrid HPT. In both of them, the hardware searches only the first bucket. In a native HPT, the hardware implements the table and the software just searches the collision lists. In a hybrid HPT, the hardware searches just an HPT cache, trapping to software if the entry is not found.
- the HPT's performance is independent of the physical memory size, the amount of allocated VM, and the sparseness of the VM.
- if there are lots of TLB misses, there may be more data cache hits because of the larger number of accesses to the page table;
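A tiny self-checking sketch of the alignment trick mentioned above (all sizes and addresses here are made-up values): if the table's base address is a multiple of its power-of-two size, the base's low bits are zero, so OR-ing in the entry offset yields the same address as adding it.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE  0x10000u   /* 64 KB table, power of two (assumed) */
#define ENTRY_SHIFT 4          /* assumed 16-byte entries             */

int main(void)
{
    uint32_t base = 0x40000u;  /* multiple of TABLE_SIZE              */
    uint32_t hash = 0x0ABCu;   /* some hashed value                   */
    uint32_t off  = (hash << ENTRY_SHIFT) & (TABLE_SIZE - 1);

    /* Because base's low bits are zero, OR and ADD agree, so the
     * handler can skip the carry-propagating addition. */
    assert((base | off) == (base + off));
    return 0;
}
```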