Chapter 5: Memory-Hierarchy Design
----------------------------------

- principle of locality:
  - spatial vs. temporal locality;
- associativity:
  - direct mapped;
  - fully associative;
  - set associative;
- address division (a bit-slicing sketch appears at the end of these notes):
  - block address:
    - tag;
    - index;
  - block offset;
  - the index is used to select the set in the cache (fully associative
    caches have no index);
- cache replacement within sets: random vs. LRU;
- write handling: write-through (WT) or write-back (WB);
  - write-back usually uses a dirty bit;
  - write-through may use write buffers;
  - write-allocate (usually used with WB caches) vs. no-write-allocate
    (usually used with WT caches);
- 2^(index bits) = cache size / (block size * associativity)
- Average Memory Access Time (AMAT):
    AMAT = hit time + miss rate * miss penalty
  (e.g., 1-cycle hit, 5% miss rate, 40-cycle penalty:
   AMAT = 1 + 0.05 * 40 = 3 cycles);
- AMAT is useful, but NOT a substitute for execution time;
- Mem stall cycles = reads * read miss rate * read miss penalty
                   + writes * write miss rate * write miss penalty
  or, combining reads and writes:
    Mem stall cycles = mem accesses * miss rate * miss penalty;
- the AMAT and CPU time formulas may lead to different conclusions when
  comparing architectures, as CPU time accounts for the effect of clock
  cycle time variations on all instructions (example on p. 387);

- REDUCING CACHE MISSES
  ---------------------

- the three kinds of cache misses:
  - compulsory: first access to a block;
  - capacity: would also miss in a fully associative cache of the same size;
  - conflict: would NOT miss in a fully associative cache of the same size;
    * increasing associativity decreases conflict misses;

1) Larger block size
   - reduces compulsory misses;
   - larger blocks take advantage of spatial locality;
   - increases miss penalty;
   - may increase capacity and conflict misses;

2) Higher associativity
   - eight-way associativity is almost as good as full associativity for
     common cache sizes;
   - increases hit time;

3) Victim caches
   - small, fully associative cache holding recently evicted blocks;

4) Pseudo-associative caches
   - behaves just like a direct-mapped cache on a hit; on a miss, another
     entry is checked for a match;
   - one fast and one slow hit time;
   - although an attractive idea on paper, variable hit times can
     complicate a pipelined CPU design;

5) Hardware prefetching of instructions and data
   - can actually hurt performance if it interferes with demand misses;

6) Compiler-controlled prefetching
   - prefetch instructions inserted by the compiler;
   - adds instruction overhead;
   - register vs. cache prefetch;
   - makes sense only if the processor can proceed while the data is being
     prefetched;

7) Compiler optimizations
   - merging arrays;
   - loop interchange: exchanging loop nesting so arrays are walked in the
     order they are laid out in memory;
   - loop fusion;
   - blocking: instead of operating on whole rows or columns, operate on
     blocks (see the sketch after this list).
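A minimal sketch of blocking in C, using matrix multiply as the example.
N and BLOCK are assumed values, not from the notes; BLOCK is chosen so a
32x32 tile of b (8 KB of doubles) stays cache-resident across the i loop:

    #include <stddef.h>

    #define N     512   /* matrix dimension (assumed) */
    #define BLOCK 32    /* tile size (assumed); tune to the cache */

    /* c += a * b, tiled over jj/kk.  The BLOCK x BLOCK tile of b is
     * reused for every row i before moving on, cutting the capacity
     * misses of the naive i/j/k loop nest.  Assumes c starts zeroed
     * and N is a multiple of BLOCK. */
    void matmul_blocked(const double a[N][N], const double b[N][N],
                        double c[N][N])
    {
        for (size_t jj = 0; jj < N; jj += BLOCK)
            for (size_t kk = 0; kk < N; kk += BLOCK)
                for (size_t i = 0; i < N; i++)
                    for (size_t j = jj; j < jj + BLOCK; j++) {
                        double sum = c[i][j];
                        for (size_t k = kk; k < kk + BLOCK; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
    }

Loop interchange is the simpler cousin: swap the loop order so the
innermost index walks the arrays in memory order, with no tiling at all.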
- REDUCING CACHE MISS PENALTY
  ---------------------------

1) Giving priority to read misses over writes
   - write buffers complicate read misses (a read miss may need data that
     is still sitting in the write buffer);

2) Sub-block placement for reduced miss penalty
   - keep one tag for a large block, with a valid bit per sub-block; on a
     miss only the needed sub-block is loaded, reducing the penalty;

3) Early restart and critical word first
   - don't wait for the full block to be loaded into the cache; give the
     CPU the word it wants as soon as it arrives (early restart);
   - critical word first: fetch the missed word first;

4) Nonblocking caches to reduce stalls on cache misses
   - to be used with scoreboarding or Tomasulo's algorithm;
   - the cache continues serving hits while handling a miss;

5) Second-level caches
   - local vs. global miss rate;
   - may satisfy the multilevel inclusion property;

- REDUCING HIT TIME
  -----------------

1) Small and simple caches
   - more important as CPU speed increases;

2) Avoiding address translation during indexing of the cache
   - virtual vs. physical caches;
   - aliasing is a problem with virtual caches; I/O is also a pain, as it
     uses physical addresses;
   - alternative: use the page offset to index the cache while the virtual
     part of the address is translated, overlapping the tag read with the
     address translation; such caches are said to be virtually indexed,
     physically tagged;
   - with this scheme a direct-mapped cache cannot be bigger than the page
     size; increasing associativity is the trick used to build bigger
     caches (e.g., with 8 KB pages, an 8-way set-associative cache can
     reach 64 KB);
   - another alternative is for the OS to implement page coloring,
     guaranteeing that the last few bits of the physical and virtual page
     addresses are identical;

3) Pipelining writes for fast write hits
   - write hits usually take longer than read hits because the tag must be
     checked before writing the data;
   - solution: pipeline writes;
   - delayed write buffer;

- IMPROVING MAIN MEMORY PERFORMANCE
  ---------------------------------

1) Wider main memory
   - usually between L2 and main memory;

2) Simple interleaved memory
   - spread sequential addresses across different banks of main memory;

3) Independent memory banks
   - each bank has separate address lines and possibly a separate data bus;
   - nonblocking caches only make sense when used with independent memory
     banks;

4) Avoiding memory bank conflicts
   - miss-under-miss, scatter/gather I/O, multiprocessors: performance
     depends on whether requests go to different banks;
   - a prime number of banks can help, but the indexing logic gets more
     complicated;

5) DRAM-specific interleaving
   - DRAM access is divided into row access and column access;
   - nibble mode, page mode, static column;

- VIRTUAL MEMORY
  --------------

- pages vs. segments;
  - hybrid approach: paged segments;
- TLB: cache of the latest VA -> PA translations (toy lookup sketch below);
- Alpha: multilevel page tables;
- Intel 8086: segmented memory;
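Bit-slicing sketch referenced from the address-division notes at the top.
The geometry (32 KB cache, 64-byte blocks, 4-way set associative) is an
assumed example chosen so the index formula gives round numbers:
2^index = 32768 / (64 * 4) = 128 sets, i.e. 7 index bits above 6 offset bits.

    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_SIZE  (32 * 1024)  /* assumed geometry */
    #define BLOCK_SIZE  64
    #define ASSOC       4
    #define NUM_SETS    (CACHE_SIZE / (BLOCK_SIZE * ASSOC))  /* 128 */
    #define OFFSET_BITS 6            /* log2(BLOCK_SIZE) */
    #define INDEX_BITS  7            /* log2(NUM_SETS)   */

    int main(void)
    {
        uint32_t addr   = 0x12345678;               /* arbitrary address */
        uint32_t offset = addr & (BLOCK_SIZE - 1);  /* low 6 bits        */
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }

A fully associative cache would have INDEX_BITS = 0, matching the note
above that such caches have no index field.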
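And a toy direct-mapped TLB to go with the virtual-memory bullets above.
Entry count and page size are assumptions for illustration; real TLBs are
usually fully or highly associative:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_BITS   13   /* 8 KB pages (assumed) */
    #define TLB_ENTRIES 64   /* assumed; direct mapped for brevity */

    struct tlb_entry {
        bool     valid;
        uint64_t vpn;   /* virtual page number; doubles as the tag */
        uint64_t pfn;   /* physical frame number */
    };

    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Translate va to a physical address.  Returns false on a TLB miss,
     * in which case the page table would be walked and the translation
     * cached in the TLB. */
    bool tlb_lookup(uint64_t va, uint64_t *pa)
    {
        uint64_t vpn    = va >> PAGE_BITS;                 /* VPN field   */
        uint64_t offset = va & ((1ull << PAGE_BITS) - 1);  /* page offset */
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];

        if (e->valid && e->vpn == vpn) {            /* hit: tags match  */
            *pa = (e->pfn << PAGE_BITS) | offset;   /* offset unchanged */
            return true;
        }
        return false;                               /* miss: walk tables */
    }

Note that the page offset passes through untranslated; that is the same
property the virtually indexed, physically tagged scheme above exploits.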