CS61C Spring 1999
Section 23 TA: Gek Siong Low (cs61c-tb)
Week 10 Discussion notes
Caches
These discussion notes may seem a little wordy. The best way to illustrate caches is to draw them out and go through a few examples, so look at your lecture notes. I'll come up with a few examples later.
The big idea in caches is the memory hierarchy. You can use smaller but faster memory to improve the performance of ordinary memory by taking advantage of certain properties of programs. As we go down the memory hierarchy, things get cheaper but have lower performance. The best policy is to keep the things you'll need often nearby in the faster storage, while things you access infrequently stay out in the slower storage. The memory hierarchy is a form of abstraction: the program doesn't explicitly manage the hierarchy; rather, the hierarchy is invisible to the program.
Why do caches work?
Temporal Locality
Locality in time. The idea is simply that data you accessed just now is probably going to be accessed again, so it is a good idea to keep it close by. This property is best illustrated by loops, where you access the same variables over and over again.
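For example, a simple summation loop (a made-up sketch, not from any assignment) touches the same few variables on every iteration:

    /* Temporal locality: sum and i are reused on every pass through the loop,
       so after the first access they should stay in the cache. */
    int sum_array(int *a, int n)
    {
        int i, sum = 0;
        for (i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }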
Spatial Locality
Locality in space. Because of the way programs tend to access memory, it is more efficient to load entire blocks of memory into the faster cache: if one item of data is accessed, there is a high probability that the next item accessed is going to be nearby, if not the very next item. Accessing arrays exhibits this property best. If you are writing nested loops, it is important to consider how they are nested in order to take advantage of the cache.
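Here is a rough illustration, assuming a 2-D array laid out row by row as C does; the array name and sizes are made up. The first version walks memory sequentially, so each block loaded into the cache gets used for several elements; the second jumps a whole row's worth of bytes on every access, so nearly every access touches a different block.

    #define ROWS 512
    #define COLS 512
    int m[ROWS][COLS];

    /* Good nesting: the inner loop walks consecutive addresses. */
    long sum_row_major(void)
    {
        int i, j;
        long sum = 0;
        for (i = 0; i < ROWS; i++)
            for (j = 0; j < COLS; j++)
                sum += m[i][j];
        return sum;
    }

    /* Bad nesting: the inner loop strides by COLS * sizeof(int) bytes. */
    long sum_col_major(void)
    {
        int i, j;
        long sum = 0;
        for (j = 0; j < COLS; j++)
            for (i = 0; i < ROWS; i++)
                sum += m[i][j];
        return sum;
    }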
How do caches work?
Direct Mapped Caches
This is essentially a simple hash table. Multiple addresses get mapped to the same cache index by a modulus operation. The illustration in the lecture notes demonstrates this clearly. However, to make life a lot easier for hardware designers, cache sizes are always powers of 2. Calculating cache indices and byte offsets is then simply a matter of looking at the bit pattern (shifting and masking) rather than doing a modulus operation.
Suppose that in a 32-bit address space we have a block size of 2^M bytes and a cache size of 2^N bytes. The number of cache blocks is thus (2^N)/(2^M) = 2^(N-M), so we require (N-M) bits for the cache index and M bits for the byte offset. The cache index tells you which block it is, and the byte offset tells you which byte within the block it is. Entire blocks are loaded into the cache at a time (remember the principle of spatial locality).
That leaves (32-N) bits, and that is the tag. The tag is used to determine which address the data in the cache block comes from, and it is stored as part of the cache. There is one tag for each cache block.
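As a sketch of the shifting and masking, assuming 32-bit addresses (M and N are just the parameters from the paragraph above; the particular values here are made up):

    #define M 4     /* 2^M = 16-byte blocks */
    #define N 10    /* 2^N = 1 KB cache     */

    unsigned byte_offset(unsigned addr) { return addr & ((1u << M) - 1); }
    unsigned cache_index(unsigned addr) { return (addr >> M) & ((1u << (N - M)) - 1); }
    unsigned cache_tag(unsigned addr)   { return addr >> N;  /* top (32-N) bits */ }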
So we know where the data comes from, which lets us tell whether there is a cache hit or a miss. But how do we know if the cache block actually contains real data? It may be undefined, that is, no data has been loaded into the block yet. Hence, we need a valid bit for each block. Once data is loaded into a block, its valid bit is set. Once the cache has completely filled up, all the valid bits are set and they no longer tell us anything.
Why don't we use the first N bits for the index? If we used the first (high-order) bits instead of the last (low-order) bits, entire contiguous regions of memory would map to the same cache index, and that is not what we want: by the principle of spatial locality, the cache would keep missing. Using the last bits spreads the mapping out evenly. As a concrete (made-up) example, with 16-byte blocks, the consecutive block addresses 0x00, 0x10, 0x20, ... differ only in their low-order bits, so indexing with the low bits sends them to indices 0, 1, 2, ..., while indexing with the high bits would send them all to the same index.
Fully-Associative Cache
The idea behind a fully-associative cache is that the modulus mapping of a direct-mapped cache is not efficient in terms of storage space. Many cache blocks may be left unused, while other cache blocks get evicted and replaced over and over again because many memory references happen to map to those indices. In a fully-associative cache, we do away with the mapping altogether and let the cache store data anywhere. Blocks are simply stored until the cache is full; after that, some block is selected for replacement. Storage space is thus used as fully as possible.
Obviously, allowing data to be stored anywhere in the cache complicates matters. How do we know where in the cache the data is? Searching through the cache linearly is not efficient. The answer is to throw hardware at the problem: all the cache entries are compared in parallel (i.e., at the same time) by the cache hardware.
There are many ways to decide which cache block to evict when we run out of space in the cache and have to load in a new block. One way is to choose the Least Recently Used (LRU) block, but that involves keeping a time stamp for each block, and there are many ways to implement a time stamp. If there are many blocks, it may be difficult to keep track of all the time stamps and run the LRU algorithm, so another option is a random block replacement policy, which surprisingly performs quite well on average.
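A rough sketch of the two policies, with made-up names; each cache block here carries a last_used time stamp:

    #include <stdlib.h>

    #define NUM_BLOCKS 8

    struct block {
        int valid;
        unsigned tag;
        unsigned long last_used;   /* time stamp of the most recent access */
    };

    struct block blocks[NUM_BLOCKS];

    /* LRU: evict the block with the oldest time stamp. */
    int pick_victim_lru(void)
    {
        int i, victim = 0;
        for (i = 1; i < NUM_BLOCKS; i++)
            if (blocks[i].last_used < blocks[victim].last_used)
                victim = i;
        return victim;
    }

    /* Random replacement: no bookkeeping at all. */
    int pick_victim_random(void)
    {
        return rand() % NUM_BLOCKS;
    }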
N-way Set-Associative Cache
A fully-associative cache is too extreme, but how do we address the inefficient block usage raised above? The answer is a hybrid of the two methods. Instead of running a search on all the cache blocks in parallel, we split the cache into N direct-mapped caches and perform the parallel search only across those N caches. This reduces the hardware required and partially solves the problem.
Therefore you have N blocks in a set, each one with its own tag (because the set is essentially N different direct-mapped caches operating in parallel). When you load a block, the hardware chooses one of the N blocks in the set to put it in.
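Here is a sketch of a lookup in an N-way set-associative cache, with the hardware's parallel tag compare written as a loop over the ways (all sizes and names here are made up):

    #define WAYS       4    /* 4-way set associative */
    #define SETS       64   /* number of sets        */
    #define BLOCK_BITS 4    /* 16-byte blocks        */
    #define SET_BITS   6    /* log2(SETS)            */

    struct line {
        int valid;
        unsigned tag;
        /* the data bytes themselves are omitted */
    };

    struct line cache[SETS][WAYS];

    /* Returns 1 on a hit, 0 on a miss.  Real hardware compares all
       WAYS tags at the same time; the loop only models that. */
    int lookup(unsigned addr)
    {
        unsigned set      = (addr >> BLOCK_BITS) & (SETS - 1);
        unsigned addr_tag = addr >> (BLOCK_BITS + SET_BITS);
        int way;
        for (way = 0; way < WAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == addr_tag)
                return 1;   /* hit in this way */
        return 0;           /* miss: one of the WAYS lines in the set gets replaced */
    }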
A note on how the ideas are presented
If you look at notes from previous semesters, you may find that associative caches are described as having multiple blocks per cache entry. It is essentially the same idea with slightly different terminology. Also, they don't present associative caches as several direct-mapped caches operating in parallel. I find it easier to understand associative caches the way they are taught now, although the exact details of the tags and cache indices may turn out to be a little confusing (they are different from the direct-mapped case). Some of the other TAs may have notes describing all the little details.
Design Trade-offs
Here's a summary of the trade-offs involved in designing caches:
1. A larger cache is better, but costs more.
2. More associativity generally means fewer misses.
3. A bigger block size generally means fewer misses, but you incur a larger
   miss penalty because you have to bring in more bytes on each miss.
4. If your block size gets too large, you'll actually get more misses,
   because the number of blocks becomes too low to be effective (some
   numbers follow below).
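To put some made-up numbers on points 3 and 4: with a 4 KB cache, 16-byte blocks give you 4096/16 = 256 blocks to spread your data across, while 1 KB blocks leave only 4 blocks, so frequently-used addresses that map to the same block will keep evicting each other even though most of each huge block goes unused.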
Multi-Level Caches
Well, there is no reason why you can't have more than one cache, but there are only so many caches you can place between the processor and main memory before performance starts to degrade rather than improve, from having to go through all those layers. Common nowadays are L2 caches (short for "level-2 cache"), which are cheaper and thus can be larger, but slower, although still much faster than main memory. The question in implementing more levels of cache is whether the overall performance of memory accesses can still be improved despite the increase in overhead (e.g. miss penalty).
Handling Writes
One point to remember is that writes to memory don't result in a cache miss the way reads do; the cache is there to speed up reads. However, there are a few issues you have to consider when writing to memory.
Write Through
Whenever you write to the cache, you also write the data all the way down to main memory. The problem with write-through is that every write operates at the speed of main memory. Not good.
Write Buffer
This is a standard solution: try putting a buffer in between and see if it works better. It does perform better than waiting for every write to finish, but this scheme depends on writes arriving less frequently than main memory can drain the buffer.
Write Back
This is a cleverer approach to the problem. When you write, you update just the cache block. Main memory is updated only when the cache block is evicted and replaced. To make this more intelligent, you need to know whether the data in the cache block has been modified, so that you update main memory only when it has. Hence, a "modified" bit (often called a dirty bit) is introduced.
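A rough sketch of the write-back bookkeeping, with made-up names:

    struct wb_block {
        int valid;
        int modified;    /* the "modified" (dirty) bit */
        unsigned tag;
        /* data bytes omitted */
    };

    /* Write hit: update only the cache, and remember that we did. */
    void write_hit(struct wb_block *b)
    {
        /* ... store the new data into the cache block ... */
        b->modified = 1;
    }

    /* Eviction: main memory needs updating only if the block was modified. */
    void evict(struct wb_block *b)
    {
        if (b->valid && b->modified) {
            /* ... write the block's data back to main memory ... */
        }
        b->valid = 0;
        b->modified = 0;
    }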
Written by : Gek Siong Low (cs61c-tb), Spring 1999