Chapter 5: Memory-Hierarchy Design
----------------------------------

- principle of locality:
  - spatial vs. temporal locality;
- associativity:
  - direct mapped;
  - fully associative;
  - set associative;
- address division (a bit-slicing sketch appears at the end of these notes):
  - block address:
    - tag;
    - index;
  - block offset;
  - the index is used to select the set in the cache (fully associative
    caches have no index);
- cache replacement within sets: random vs. LRU;
- write handling: write-through (WT) or write-back (WB);
  - write-back usually uses a dirty bit;
  - write-through may use write buffers;
  - write-allocate (usually used with WB caches) vs. no-write-allocate
    (usually used with WT caches);
- 2^(index bits) = cache size / (block size * associativity)
- Average Memory Access Time (AMAT):
    AMAT = hit time + miss rate * miss penalty
  (e.g., 1-cycle hit, 5% miss rate, 40-cycle penalty:
   AMAT = 1 + 0.05 * 40 = 3 cycles);
- AMAT is useful, but NOT a substitute for execution time;
- Mem stall cycles = reads * read miss rate * read miss penalty
                   + writes * write miss rate * write miss penalty
  or, combining reads and writes:
    Mem stall cycles = mem accesses * miss rate * miss penalty;
- the AMAT and CPU time formulas may lead to different conclusions when
  comparing architectures, as CPU time accounts for the effect of clock
  cycle time variations on all instructions (example on p. 387);

- REDUCING CACHE MISSES
  ---------------------

- the three kinds of cache misses:
  - compulsory: first access to a block;
  - capacity: would also miss in a fully associative cache of the same size;
  - conflict: would NOT miss in a fully associative cache of the same size;
    * increasing associativity decreases conflict misses;

1) Larger block size
   - reduces compulsory misses;
   - larger blocks take advantage of spatial locality;
   - increases miss penalty;
   - may increase capacity and conflict misses;

2) Higher associativity
   - eight-way associativity is almost as good as full associativity for
     common cache sizes;
   - increases hit time;

3) Victim caches
   - small, fully associative cache holding recently evicted blocks;

4) Pseudo-associative caches
   - behaves just like a direct-mapped cache on a hit; on a miss, another
     entry is checked for a match;
   - one fast and one slow hit time;
   - although an attractive idea on paper, variable hit times can
     complicate a pipelined CPU design;

5) Hardware prefetching of instructions and data
   - can actually hurt performance if it interferes with demand misses;

6) Compiler-controlled prefetching
   - prefetch instructions inserted by the compiler;
   - adds instruction overhead;
   - register vs. cache prefetch;
   - makes sense only if the processor can proceed while the data is being
     prefetched;

7) Compiler optimizations
   - merging arrays;
   - loop interchange: exchanging loop nesting so arrays are walked in the
     order they are laid out in memory;
   - loop fusion;
   - blocking: instead of operating on whole rows or columns, operate on
     blocks (see the sketch after this list).
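A minimal sketch of blocking in C, using matrix multiply as the example.
N and BLOCK are assumed values, not from the notes; BLOCK is chosen so a
32x32 tile of b (8 KB of doubles) stays cache-resident across the i loop:

    #include <stddef.h>

    #define N     512   /* matrix dimension (assumed) */
    #define BLOCK 32    /* tile size (assumed); tune to the cache */

    /* c += a * b, tiled over jj/kk.  The BLOCK x BLOCK tile of b is
     * reused for every row i before moving on, cutting the capacity
     * misses of the naive i/j/k loop nest.  Assumes c starts zeroed
     * and N is a multiple of BLOCK. */
    void matmul_blocked(const double a[N][N], const double b[N][N],
                        double c[N][N])
    {
        for (size_t jj = 0; jj < N; jj += BLOCK)
            for (size_t kk = 0; kk < N; kk += BLOCK)
                for (size_t i = 0; i < N; i++)
                    for (size_t j = jj; j < jj + BLOCK; j++) {
                        double sum = c[i][j];
                        for (size_t k = kk; k < kk + BLOCK; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
    }

Loop interchange is the simpler cousin: swap the loop order so the
innermost index walks the arrays in memory order, with no tiling at all.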
- REDUCING CACHE MISS PENALTY
  ---------------------------

1) Giving priority to read misses over writes
   - write buffers complicate read misses (a read miss may need data that
     is still sitting in the write buffer);

2) Sub-block placement for reduced miss penalty
   - keep one tag for a large block, with a valid bit per sub-block; on a
     miss only the needed sub-block is loaded, reducing the penalty;

3) Early restart and critical word first
   - don't wait for the full block to be loaded into the cache; give the
     CPU the word it wants as soon as it arrives (early restart);
   - critical word first: fetch the missed word first;

4) Nonblocking caches to reduce stalls on cache misses
   - to be used with scoreboarding or Tomasulo's algorithm;
   - the cache continues serving hits while handling a miss;

5) Second-level caches
   - local vs. global miss rate;
   - may satisfy the multilevel inclusion property;

- REDUCING HIT TIME
  -----------------

1) Small and simple caches
   - more important as CPU speed increases;

2) Avoiding address translation during indexing of the cache
   - virtual vs. physical caches;
   - aliasing is a problem with virtual caches; I/O is also a pain, as it
     uses physical addresses;
   - alternative: use the page offset to index the cache while the virtual
     part of the address is translated, overlapping the tag read with the
     address translation; such caches are said to be virtually indexed,
     physically tagged;
   - with this scheme a direct-mapped cache cannot be bigger than the page
     size; increasing associativity is the trick used to build bigger
     caches (e.g., with 8 KB pages, an 8-way set-associative cache can
     reach 64 KB);
   - another alternative is for the OS to implement page coloring,
     guaranteeing that the last few bits of the physical and virtual page
     addresses are identical;

3) Pipelining writes for fast write hits
   - write hits usually take longer than read hits because the tag must be
     checked before writing the data;
   - solution: pipeline writes;
   - delayed write buffer;

- IMPROVING MAIN MEMORY PERFORMANCE
  ---------------------------------

1) Wider main memory
   - usually between L2 and main memory;

2) Simple interleaved memory
   - spread sequential addresses across different banks of main memory;

3) Independent memory banks
   - each bank has separate address lines and possibly a separate data bus;
   - nonblocking caches only make sense when used with independent memory
     banks;

4) Avoiding memory bank conflicts
   - miss-under-miss, scatter/gather I/O, multiprocessors: performance
     depends on whether requests go to different banks;
   - a prime number of banks can help, but the indexing logic gets more
     complicated;

5) DRAM-specific interleaving
   - DRAM access is divided into row access and column access;
   - nibble mode, page mode, static column;

- VIRTUAL MEMORY
  --------------

- pages vs. segments;
  - hybrid approach: paged segments;
- TLB: cache of the latest VA -> PA translations (toy lookup sketch below);
- Alpha: multilevel page tables;
- Intel 8086: segmented memory;
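Bit-slicing sketch referenced from the address-division notes at the top.
The geometry (32 KB cache, 64-byte blocks, 4-way set associative) is an
assumed example chosen so the index formula gives round numbers:
2^index = 32768 / (64 * 4) = 128 sets, i.e. 7 index bits above 6 offset bits.

    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_SIZE  (32 * 1024)  /* assumed geometry */
    #define BLOCK_SIZE  64
    #define ASSOC       4
    #define NUM_SETS    (CACHE_SIZE / (BLOCK_SIZE * ASSOC))  /* 128 */
    #define OFFSET_BITS 6            /* log2(BLOCK_SIZE) */
    #define INDEX_BITS  7            /* log2(NUM_SETS)   */

    int main(void)
    {
        uint32_t addr   = 0x12345678;               /* arbitrary address */
        uint32_t offset = addr & (BLOCK_SIZE - 1);  /* low 6 bits        */
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }

A fully associative cache would have INDEX_BITS = 0, matching the note
above that such caches have no index field.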
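And a toy direct-mapped TLB to go with the virtual-memory bullets above.
Entry count and page size are assumptions for illustration; real TLBs are
usually fully or highly associative:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS   13   /* 8 KB pages (assumed) */
    #define TLB_ENTRIES 64   /* assumed; direct mapped for brevity */

    struct tlb_entry {
        bool     valid;
        uint64_t vpn;   /* virtual page number; doubles as the tag */
        uint64_t pfn;   /* physical frame number */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Translate va to a physical address.  Returns false on a TLB miss,
     * in which case the page table would be walked and the translation
     * cached in the TLB. */
    bool tlb_lookup(uint64_t va, uint64_t *pa)
    {
        uint64_t vpn    = va >> PAGE_BITS;                 /* VPN field   */
        uint64_t offset = va & ((1ull << PAGE_BITS) - 1);  /* page offset */
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

        if (e->valid && e->vpn == vpn) {            /* hit: tags match  */
            *pa = (e->pfn << PAGE_BITS) | offset;   /* offset unchanged */
            return true;
        }
        return false;                               /* miss: walk tables */
    }

Note that the page offset passes through untranslated; that is the same
property the virtually indexed, physically tagged scheme above exploits.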