Question 1

(a) In a four-way set associative cache, two of the address bits are used to select which cache block within a set will be chosen.

FALSE. The whole idea of associativity is that the blocks within a set are equivalent, so any address that goes with this set can go in any of the four blocks. Some bits from the address do determine WHICH SET will be used, but not which block within a set.

Some people interpreted the question as asking about how to find which block holds an address that's already in the cache. Some address bits are relevant in that situation, namely the tag bits, but there will be more than two of those.

(b) Programs using arrays will be more likely than programs using linked lists to benefit from a wide cache design.

TRUE. Wide caches are used to take advantage of spatial locality of reference: If we refer to some address, we're probably going to use its neighbors soon. But linked lists can have their elements scattered all throughout memory; only arrays show spatial locality of reference. (This, *not* the use of recursion, is the main reason why Lisp-like languages are still a little slower than C-like languages, although arrays *are* available in Lisp for programmers more concerned with efficiency than with flexibility.)

(c) Increasing the size of a direct mapped cache decreases conflict misses.

TRUE. If the cache is bigger, then fewer memory addresses are in competition for a given cache slot. (This would be true for any cache, even an associative one.) Some people said "false" because they read that the way to decrease conflict misses is to increase associativity. That's also true; for a given cache size, there will be fewer conflict misses with increased associativity. But that doesn't mean associativity is the *only* way to improve the hit rate!

(d) The choice of write-through or write-back design involves a tradeoff between miss rate and miss penalty.

FALSE. The tradeoff is between HIT COST and miss penalty. A write-through cache spends a little more time on each write that hits the cache; a write-back cache spends much more time on some writes that miss the cache (namely, the cases in which the chosen cache block is dirty). But since this design choice affects the contents of memory, NOT the contents of the cache, it doesn't affect the miss rate.

(e) The choice of direct mapped or associative design involves a tradeoff between miss rate and hit cost.

TRUE. A direct mapped cache is simpler, and so the hit cost is lower. An associative cache will have a somewhat better chance to avoid conflict misses, and therefore improves (decreases) the miss rate. But empirical studies show that associativity has only a tiny effect on the miss rate, so most caches are now direct mapped.

Question 2

    size    words   policy        offset   set         tag
    (wds)   (wds)

    1024    8       direct        24       31          61797
                    mapped        0x18     0x1f        0xf165
                                  11000    0011111

    4096    16      4-way set     56       15          61797
                    associative   0x38     0xf         0xf165
                                  111000   001111

    4096    4       fully         8        not         15820095
                    associative   0x8      applicable  0xf1653f
                                  1000

The number of bits in the offset depends on the width of a cache slot. The slot sizes in bytes are 8*4=32=2^5, 16*4=64=2^6, and 4*4=16=2^4. Therefore, the offset sizes are 5, 6, and 4 bits. The offset is the rightmost part of the address, which is

    0000 1111 0001 0110 0101 0011 1111 1000

The values listed above under the offset column are the rightmost five, six, and four bits of this address.

The number of bits in the set number depends on the number of sets in the cache. We determine this in two steps. First, how many slots (or blocks) are in the cache? This is the total cache size divided by the width of one block. For these three caches there are 1024/8=128=2^7, 4096/16=256=2^8, and 4096/4=1024=2^10 slots.

The next step is to divide the total number of slots (computed above) by the number of slots in each set. For a direct mapped cache, each slot is its own set. (In other words, each set contains only one slot; there is only one place that a given address can appear in the cache.) So this cache has 2^7 sets, so the seven bits to the left of the offset are the set number. For a 4-way set associative cache, each set contains four slots, so this cache has 256/4=64=2^6 sets, so its set number takes up six bits. For a fully associative cache, the entire cache is a single set, and there is no set number.

Some people thought that the set number is in the high-order (leftmost) bits of the word. If that were true, then consecutive addresses would compete for the same cache set. Locality of reference tells us that such a cache would have a very high miss rate even when most of the slots were unused! The idea is to distribute nearby addresses among all the sets.

Finally, the tag is everything not used up for the offset or set number.
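(Not part of the exam, but for concreteness: the arithmetic above is mechanical enough to express as a short C program. The function and variable names here are invented for illustration; 4-byte words are assumed, as in the calculations above. For the fully associative cache the "set" field it prints is vacuous -- there is only one set.)

    #include <stdio.h>
    #include <stdint.h>

    /* Split a byte address into tag / set / offset, given the block size
       in bytes and the number of sets (both powers of two).  A sketch
       for checking the table above, not part of the original exam. */
    static void split(uint32_t addr, uint32_t block_bytes, uint32_t num_sets)
    {
        uint32_t offset = addr & (block_bytes - 1);
        uint32_t set    = (addr / block_bytes) & (num_sets - 1);
        uint32_t tag    = addr / block_bytes / num_sets;
        printf("offset 0x%x  set 0x%x  tag 0x%x\n",
               (unsigned)offset, (unsigned)set, (unsigned)tag);
    }

    int main(void)
    {
        uint32_t addr = 0x0f1653f8;
        split(addr, 8 * 4,  1024 / 8);       /* direct mapped: 128 sets */
        split(addr, 16 * 4, 4096 / 16 / 4);  /* 4-way: 64 sets          */
        split(addr, 4 * 4,  1);              /* fully assoc.: one set   */
        return 0;
    }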
Question 3

What is the cache size? All of the times for an array size of 256 bytes are small (about 100 ns), whereas for an array of 512 bytes most of the times suddenly are around 900 ns. So probably the cache size is 256 bytes.

What is the block size? For array sizes at or above 512, a stride of 4 gives access times around 450 ns, whereas larger strides take about 900 ns. So a stride of 8 moves to the next block, whereas a stride of 4 is within the same block half the time. Therefore the block size is 8 bytes.

What is the allocation policy? For an array size of 512, strides of 128 and 256 lead to lower access times (around 300 ns). For an array size of 1024, a stride of 256 leads to a similar low time. In other words, if the stride is 1/2 or 1/4 of the array size, then not every reference leads to a cache miss, whereas if the stride is 1/8 or less of the array size, every reference misses. This suggests a 4-way set associative cache: with four or fewer conflicting blocks in use, a 4-way set associative cache can hold them all at once.

Question 4

(a) size=64K words, stride=16K words. Although the program only uses four words of the array, all four of them will occupy the same cache slot. (It's a direct mapped cache, so there is no flexibility about which slot to use.) Therefore the hit rate is near zero for this example.

(b) size=16K words, stride=2 words. This array entirely fits in the 16K-word cache, so we'll have a near-100% hit rate. (The initialization pass fills up the cache with the array, and so on the second pass everything is where it should be.)

(c) size=32K words, stride=2 words. This array is twice the size of the cache, and the stride is small enough so that every slot will be used (twice over). In the initialization phase, the cache fills up with the first half of the array, then fills up with the second half. So when the second pass starts, we're looking for the first half and finding the second half, then looking for the second half and finding the first half. So this would be all misses, except for the fact that the cache width (4 words) is bigger than the stride (2 words), so every second attempt is a cache hit -- we get about a 50% hit rate.
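(Again not part of the exam: here is a rough C sketch of the classic stride benchmark that produces tables like the one in Question 3 and the access pattern of Question 4. clock() is far too coarse for a real measurement -- a real experiment repeats each inner loop many times and divides -- but the shape of the program is the point.)

    #include <stdio.h>
    #include <time.h>

    #define MAX_SIZE (1 << 20)
    static char a[MAX_SIZE];

    int main(void)
    {
        for (int size = 256; size <= MAX_SIZE; size *= 2) {
            for (int stride = 1; stride <= size / 2; stride *= 2) {
                /* Pass 1 warms the cache; pass 2 is the one we time. */
                for (int i = 0; i < size; i += stride) a[i]++;
                clock_t t0 = clock();
                for (int i = 0; i < size; i += stride) a[i]++;
                clock_t t1 = clock();
                printf("size %7d stride %7d ticks %ld\n",
                       size, stride, (long)(t1 - t0));
            }
        }
        return 0;
    }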
Question 5

The width is 8 bytes, so there are 3 bits of offset within a cache row. The number of rows is 64/8 = 8, and the cache is 4-way set associative, so there are 2 sets, which means we need a 1-bit index to select the set. That adds up to 4 bits, i.e., the rightmost hex digit (9), which is 1001 in binary, divided as 1 001. The remaining address bits are the tag.

    tag = 1232123
    set = 1
    offset = 1

So we want the row in set 1 whose tag is 1232123; that's the fifth row of the entire cache. In that row, we want byte 1 of word 0, which is the 56.

Question 6

The cache holds 2^14 words; each slot holds eight (2^3) words. That means there are (2^14)/(2^3) = 2^11 slots. Since the cache is two-way set associative, these slots are grouped in sets of two, so there are 2^10 of these sets. Therefore we need 10 bits for the INDEX, the part of the address that selects a set. Each slot is eight words, which is 32 bytes (2^5), so we need five bits for the OFFSET, the part of the address that selects a byte or word within a cache slot. This leaves 32 - (10+5) = 17 bits for the tag.

Question 7

Cache A: Two blocks, one word each
Cache B: One block of two words

(a) a[0]++; a[1]++; a[2]++; a[3]++;

Cache A: a[0] misses, into first block; a[1] misses, into second block; a[2] misses, into first block; a[3] misses, into second block.
Cache B: a[0] misses, loads a[0] and a[1]; a[1] hits; a[2] misses, loads a[2] and a[3]; a[3] hits.

So cache B has a better hit rate (2 hits) than cache A (0 hits).

(b) a[1]++; a[3]++; a[1]++; a[3]++;

Cache A: a[1] misses, into second block; a[3] misses, into second block; a[1] misses, into second block; a[3] misses, into second block.
Cache B: a[1] misses, loads a[0] and a[1]; a[3] misses, loads a[2] and a[3]; a[1] misses, loads a[0] and a[1]; a[3] misses, loads a[2] and a[3].

So the two caches have the same hit rate (0 hits).

(c) a[1]++; a[2]++; a[1]++; a[2]++;

Cache A: a[1] misses, into second block; a[2] misses, into first block; a[1] hits; a[2] hits.
Cache B: a[1] misses, loads a[0] and a[1]; a[2] misses, loads a[2] and a[3]; a[1] misses, loads a[0] and a[1]; a[2] misses, loads a[2] and a[3].

So cache A has a better hit rate (2 hits) than cache B (0 hits).

So the two answers are C and A.

Question 8

A. Find 0x7f55a356 in the cache. Since the cache is 16 bytes wide, the lowest 4 bits (4 = lg 16) of the address are the offset within a block. That's the last hex digit, the 6. So we're looking for the cache block that would contain block number 0x7f55a35. The cache contains 4K bytes, and one block is 16 bytes, so there are 4K/16 = (2^12)/(2^4) = 2^8 blocks in the cache. Since the cache is direct mapped, every block is a set of one, and the address uniquely determines which block (i.e., which set) to use. We find out by looking at the lowest 8 bits (because there are 2^8 blocks) of the block number, which is 0x35. What's left, 0x7f55a, is the tag. So (B) is the correct answer.

B. How many sets? A block is 32 bytes; there are five blocks to a set. So the size of a set is 32*5=160 bytes. The entire cache size is 80K = 80*(2^10) bytes, so the number of sets is 80K/160 = 2^9 = 512 sets.

C. TLB mapping. A page is 4K bytes = 2^12 bytes. That means that the bottom 12 bits of the address (3 hex digits) are the offset, and the rest is the virtual page number. For this virtual address, the offset is 0x678 and the virtual page number is 0x12345. We look up the latter in the TLB and find it in slot number 2, corresponding to physical page number 0x200. Combining this with the offset, the physical address is 0x200678.
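(The translation in part C, written out as a C sketch. The TLB layout -- a small array of VPN/PPN pairs searched associatively -- and all the names are assumptions for illustration, not the exam's data structures.)

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS 12                       /* 4K-byte pages */

    struct tlb_entry { uint32_t vpn, ppn; };

    /* A toy TLB containing the one entry that matters in part C. */
    static struct tlb_entry tlb[] = {
        /* ... other slots would go here ... */
        { 0x12345, 0x200 },                    /* slot 2 in the question */
    };

    static int translate(uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn    = vaddr >> PAGE_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        for (unsigned i = 0; i < sizeof tlb / sizeof tlb[0]; i++) {
            if (tlb[i].vpn == vpn) {
                *paddr = (tlb[i].ppn << PAGE_BITS) | offset;
                return 1;                      /* TLB hit */
            }
        }
        return 0;                  /* TLB miss: walk the page table */
    }

    int main(void)
    {
        uint32_t pa;
        if (translate(0x12345678, &pa))
            printf("0x%x\n", (unsigned)pa);    /* prints 0x200678 */
        return 0;
    }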
D. What can't be paged? The answer is (B); if the disk I/O handler were paged out, then it could never be brought back into memory, and neither could anything else.

It's okay to page out the page tables; they are in the operating system's virtual address space, and the same mechanism that pages other stuff in and out can deal with missing page table entries. (Of course there has to be access to SOME memory, so maybe part of the system's own page table will be in the unmapped portion of memory.)

It's okay to page out the system call handler. System calls are instructions executed by a user process; if the handler needed for a certain system call isn't available, it could be paged in later, just as anything else that a user process needs can be.

And of course if user programs couldn't be paged out, that would defeat the whole purpose of paging!

Question 9

VM is different from cache because the MISS PENALTY is so much greater for VM -- if you don't find the page in physical memory, you have to do a disk operation to get it.

Question 10

A page is 2^12 bytes, so the virtual address is divided into a 20-bit virtual page number and a 12-bit offset. So the address 0x40fa25d4 is

    VPN: 40fa2    offset: 5d4

We look up the VPN in the TLB and find it in slot 6:

    VPN 40fa2, physical page 076

Therefore the desired physical address is 0x0765d4.

Now we are done with the TLB and need to find this address in the cache. The cache width is 2^5 bytes, so there are five bits of cache offset. The total cache size is 2^9 bytes, so there are 2^4 = 16 cache blocks. The cache is 2-way set associative, so these 16 blocks are organized as 8 (= 2^3) sets of 2 blocks each. So there are three bits of set number.

    Physical address = 0x765d4 = 0111 0110 0101 1101 0100 binary
                               = 011101100101 110 10100
                                              --- -----
                                             set#  offset

so we are looking for an offset of 0x14 in a block of set #6. Set #6 has two blocks, whose keys are a1c5 and 0765. The second of these matches the leftmost bits of the physical address, so we use that block (block #0xd), and at offset 0x14 we find the data, 0x46.

Question 11

Parts a-c are about the cache. The cache size is 8K bytes, which is 2^13. The cache width is 32 words, or 128 bytes, which is 2^7. Therefore there are (2^13)/(2^7) = 2^6 = 64 cache rows. Since the cache is 4-way set associative, these rows are grouped as 16 sets of 4 rows per set. So a memory address is

    +---------------------+----+-------+
    |         21          |  4 |   7   |
    |         tag         | set| offset|
    +---------------------+----+-------+

(a) block offset is 7 bits, because a cache row is 2^7 bytes wide.
(b) set index is 4 bits, because there are 16 = 2^4 sets.
(c) What's left, 32 - 7 - 4 = 21 bits, is the tag.

Parts d-f are about virtual memory. The page size is 4K bytes = 2^12. So for VM purposes a 32-bit memory address (both physical and virtual addresses are 32 bits) is

    +--------------------+------------+
    |         20         |     12     |
    |     page number    |   offset   |
    +--------------------+------------+

(d) page offset is 12 bits, because a page holds 2^12 bytes.
(e) and (f) page numbers are the remaining 32-12 = 20 bits.
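(The field widths in Question 11 -- and in Questions 2, 6, and 10 -- all come from the same three logarithms, so here they are in C. The names and the lg helper are my own, for illustration.)

    #include <stdio.h>

    /* log2 for exact powers of two */
    static int lg(unsigned n) { int b = 0; while (n > 1) { n >>= 1; b++; } return b; }

    int main(void)
    {
        unsigned cache_bytes = 8 * 1024;   /* 8K-byte cache         */
        unsigned block_bytes = 32 * 4;     /* 32 words = 128 bytes  */
        unsigned assoc       = 4;          /* 4-way set associative */
        unsigned addr_bits   = 32;

        int offset = lg(block_bytes);                         /* 7  */
        int index  = lg(cache_bytes / block_bytes / assoc);   /* 4  */
        int tag    = addr_bits - index - offset;              /* 21 */
        printf("tag %d  index %d  offset %d\n", tag, index, offset);
        return 0;
    }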
Question 12

Why cache in hardware, VM largely in software? There are many reasons for this, and we accepted a wide range of possible answers. For example:

* Virtual memory uses the disk, whose latency is large and unpredictable; the processor can't simply wait out a VM miss, but must instead switch to a different process, and that requires software support.

* Since the miss penalty is not so severe for a cache miss as for a page fault, the cache can use random replacement, which is relatively easy to do in hardware; by contrast, VM generally uses an LRU algorithm, which requires complicated bookkeeping best done with software support.

* Since the backup store for VM is the disk, the page fault handler must understand the file system structure, to know which part of the disk to use; this structure is not part of the disk hardware but is imposed by the operating system. (That's why PC diskettes and Mac diskettes are incompatible even though they are physically the same.)

* If a cache miss ran some software, the possibility arises that part of that software might not be in the cache, which would create a recursive cache miss. The cache miss handler software would have to be permanently locked into the cache.

A typical two-point answer would be one that didn't show a clear understanding of the difference between a hit and a miss, e.g., making an argument that handling each cache hit in software would defeat the purpose of caching. The same is true for VM: TLB hits are handled by the hardware! It's only on a miss that the question of hardware vs. software control is relevant.

Question 13

Two point answer: The bit is set whenever something is stored in any address within the page; the operating system examines the bit when that page must be reassigned. If the bit is not set, then the operating system need not write the contents of the page to disk before reading in the new contents.

One point answer: The bit is set whenever something is stored in any address within the page.

One point answer: The bit is used to keep track of whether anything has been modified and must be written back. [This is a one point answer because it's not clear whether you understand the difference between the cache and the virtual memory system.]

Zero point answer: The bit is set when the page table or TLB entry is written; it's used to decide whether or not to write the page table or TLB entry back to memory. [It's not the page table or TLB entry we're concerned with, but rather the data in the page to which that entry refers!]

Question 14

a. Reading a page table entry from main memory: Could be either. Whether this operation is built into the hardware or left for the operating system software, a memory reference is still required. Therefore the gain from hardwiring this operation is slight. The MIPS processor leaves this task for the operating system, although other processors have specific hardware for the purpose.

b. Reading a TLB entry: Definitely hardware. The whole point of having a TLB is to cache the page table, so that the virtual memory translation process will be really fast in most cases. If every TLB read required a processor exception handler to be called, every memory reference would be really slow!

c. Setting the write enable bit for a page: Definitely software. The write enable bit (unlike the dirty bit, which also has to do with writing) indicates a policy decision, not a record of the history of memory use. There is no way the hardware could make this decision; the operating system must keep track of which pages should be writable and which should not.

Question 15

When is M on, R off? Answer: This page has been modified on some previous clock tick, but not used during this clock tick, and the operating system is turning off all the R bits every tick in order to get LRU information. (The reason this is a question at all is that every write *is* a reference, so the M bit is never turned on unless the R bit is also turned on at the same time.)
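(To make the roles of the two bits in Questions 13 and 15 concrete, here is a hedged C sketch of the per-tick R-bit sweep and of the write-back decision at replacement time. The struct, the function names, and the stub are invented for illustration.)

    #include <stdbool.h>

    struct page {
        bool valid;
        bool referenced;  /* R: turned on by hardware on any access */
        bool modified;    /* M: turned on by hardware on any store  */
        /* frame number, disk address, etc. would go here */
    };

    static void write_page_to_disk(struct page *p) { (void)p; /* stub */ }

    /* Once per clock tick: clear every R bit, so that at the next tick
       the R bits identify exactly the pages used during this tick --
       the cheap LRU information mentioned in Question 15. */
    void tick(struct page pages[], int n)
    {
        for (int i = 0; i < n; i++)
            pages[i].referenced = false;
    }

    /* When a page frame must be reassigned (Question 13): only a
       modified page has to be written back; a clean page's disk copy
       is still current, so the write can be skipped. */
    void evict(struct page *p)
    {
        if (p->modified)
            write_page_to_disk(p);
        p->valid = false;
        p->modified = false;
    }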
Question 16

Answer: *TLB slot* replacement policy and *page* replacement policy are two different things, and unrelated. The statement would be true if it said that the *page* replacement policy must be LRU, but in fact the TLB slot policy is more likely to be Random.

Question 17

If a desired page isn't found in the process's physical memory, the system looks for it in the process's virtual memory: FALSE. Virtual memory isn't a place that can be looked in. VM is a function that maps virtual addresses to physical addresses. Information must be in *some* physical place! A statement that would be true: "... looks for it in the swapping space on the disk."

Considering VM as a kind of caching, what corresponds to the cache width is the size of a page. TRUE. The cache width says how much information fits into one cache slot. The page size says how much information fits into one "memory slot" -- the smallest unit of information that will be loaded into memory at a time.

Considering VM as a kind of caching, what corresponds to the cache size is the size of the virtual address space. FALSE. What corresponds to the cache size is the size of the *physical* memory -- the amount of information that can be "cached" in memory at once.

If the TLB design doesn't include a Writeable bit, the software can simulate it using the Dirty bit. FALSE. It's the other way around; if there's no Dirty bit the system can simulate it by disabling writing on every page until the page is first written, which will cause an exception so the system can remember that this page is modified.

Question 18

(a) What hardware supports LRU? The Reference bit (or Use bit) in the TLB or page table entry.

(b) How can an OS support LRU without it? Periodically turn off all the Valid bits, and use the resulting page faults to control a software-maintained reference bit, as in the sketch below.

Alternative answer to (b), less general but accepted: On the MIPS, where there is no page table hardware, the OS could assume that any entry that is about to be flushed from the TLB is fairly recently used, and turn on a software reference bit in the software-maintained page table. (In other words, pretend that *every* valid TLB entry has its reference bit set.)
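(A hedged C sketch of the first answer to (b). Everything here -- the struct, the extra OS-only fields, the function names -- is invented to illustrate the trick: the hardware only sees the valid bit, and the OS turns the resulting faults into reference information.)

    #include <stdbool.h>

    struct pte {
        bool valid;            /* the bit the hardware actually checks  */
        bool soft_referenced;  /* maintained by the OS, invisible to HW */
        bool resident;         /* OS's own record: page is in memory    */
        /* frame number, protection bits, etc. would go here */
    };

    /* Run periodically: mark everything invalid, even pages that are
       still in memory, and forget the old reference information. */
    void sweep(struct pte pt[], int n)
    {
        for (int i = 0; i < n; i++) {
            pt[i].valid = false;
            pt[i].soft_referenced = false;
        }
    }

    /* Page fault handler: if the page was resident all along, the
       "fault" merely tells us the page was touched; record that and
       restart the instruction, with no disk I/O at all. */
    void handle_fault(struct pte *e)
    {
        if (e->resident) {
            e->soft_referenced = true;
            e->valid = true;
        } else {
            /* a genuine miss: choose a victim, page in from disk, etc. */
        }
    }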
Question 19

Many people lost points on this question for saying things like "the RISC philosophy is to make things as simple as possible and this isn't simple." Well, yes, but you could say that about almost anything! The point of the question was to see if you could focus on the *specific* bottlenecks in particular proposed instructions.

A. Block Transfer.

The main answer we were expecting here was "it would delay the pipeline to have multiple memory references in a single instruction cycle" and indeed many people said that.

Some people, copying from an answer in the sample exam, said "it would require either multiple bus cycles, which would delay the pipeline, or widening the bus." Come on! The question from which you copied that answer was about a proposed 64-bit memory transfer instruction. This one is a proposed any-number-of-bits transfer! There is no way we could widen the bus to accommodate, for example, a megabyte memory-to-memory copy. People who did this got some credit for mentioning the pipeline but didn't leave us with a strong feeling of confidence that you understand what a bus is, or a pipeline, or a memory, or a transfer, or a computer.

Several people suggested that this instruction would be too wide to fit in a word, because it would require three pointer-length operands. But this objection is easily overcome; the three operands could be registers containing the pointers and the length. This answer got half credit.

Some people mentioned that an instruction that might take a very long time would leave the machine uninterruptable for that long time, which could lead to I/O data loss. This is a very good answer, showing some real thinking -- connecting the instruction semantics with the not-obviously-related question of interrupt handling. Of course this issue came up in the PDP-10 also, and here's how it was solved: If an interrupt is requested during a BLT instruction, the processor updates the pointers in the register operands to reflect how far the copying has gone, then remembers the address of the BLT itself, rather than the address of the next instruction, as the exception PC. After the interrupt is handled, the BLT continues. (That solution would run into difficulties in the MIPS architecture if the BLT happened to occupy the delay slot of a branch.)

A few people suggested that block transfer is an unnecessary idea because arrays are assigned by reference (copying a pointer) rather than by value. That isn't a very good answer. For one thing, it concerns the design of the C language, not of the MIPS architecture. (Pascal passes arrays by value.) Second, even in C, structs are passed by value, so block copies are still useful.
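(The PDP-10's restartable-BLT trick can be paraphrased in C: keep the copy's progress in the "register operands" themselves, so that stopping and re-entering the routine is harmless. A sketch under that interpretation; the interrupt flag and all names are mine, for illustration only.)

    #include <stddef.h>
    #include <stdbool.h>

    volatile bool interrupt_pending;   /* assumed flag, for illustration */

    /* Copy *len bytes from *src to *dst.  On an interrupt we simply
       return, having already advanced all three "register operands";
       re-executing the instruction -- calling blt() again -- finishes
       the job. */
    void blt(char **dst, char **src, size_t *len)
    {
        while (*len > 0) {
            if (interrupt_pending)
                return;          /* exception PC points back at the BLT */
            *(*dst)++ = *(*src)++;
            --*len;
        }
    }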
B. Jump to Subroutine.

Here are the two main answers we were expecting:

1. A procedure called by JSR can't be called recursively.

2. Allowing write access to code pages would prevent the use of shared code.

We gave half credit for the response "this would block the pipeline because two memory references would be required." In fact only one memory reference would be required; a branch or jump instruction does *not* fetch the next instruction, but merely puts a new address into the processor's PC register. So a JSR would combine a jump (no memory access required) with a store. On the other hand, the proposed jump-indirect instruction for returning from a procedure *would* block the pipe, not because of doing two loads (it only does one) but because the load would have to happen too early in the processing of the instruction. (An ordinary load actually doesn't complete until after the next instruction!) We gave full credit for saying that jump-indirect would block the pipe, even if you didn't fully understand why.

One student mentioned that storing the return address near the procedure's code would require setting the dirty bit for that page when calling the procedure, which would increase the paging load on the system. Good answer.

Several students suggested that using memory to save the return address instead of a register would require additional memory allocation. Pfui. First of all, the amount of memory required to hold return addresses is a tiny fraction of a program's memory requirement; second, in most cases the return address ends up in memory anyway, because $31 is saved on the stack.

In the apple-pie-motherhood category was the answer "a RISC instruction shouldn't have two purposes." Consider that the MIPS architecture *does* include the instructions BLTZAL and BGEZAL, which combine test-and-branch with procedure calling.

A couple of people said "this would violate stack discipline." Like the answer about passing arrays by reference earlier, this confuses a language design issue with an architecture issue. The MIPS JAL instruction doesn't have anything to do with using a stack or not using a stack, and neither does the PDP-10 JSR, really; the difference is between storing the old PC in a register and storing the old PC in memory. And in any case, "stack discipline" doesn't mean having a stack. Scheme, which violates stack discipline, is still generally implemented using a stack for most purposes. Logo is an example of a language that *does* obey stack discipline, but is generally implemented without the use of a hardware stack. (The environment is instead maintained using linked lists, as in the 61A metacircular evaluator.)

Several people gave arguments that, if true, would also prove that JAL doesn't fit the MIPS architecture! For example, several people thought it problematic that (in the actual words of the problem) "The instruction stored the PC at that address..." because what has to be stored is the address of the next instruction, not the address of the JSR instruction. But of course exactly the same argument applies to the PC that JAL stores in $31! The stored PC is the address following the procedure call, and the same is true for JSR.

A few people argued that the JSR mechanism wouldn't work for exception handling, for various reasons. This is irrelevant, because the MIPS doesn't use JAL for exception handling either! Since there is already a separate mechanism for exceptions, changing the ordinary procedure calling mechanism is independent of it. Actually this objection was particularly ironic because, in the PDP-10, JSR was used precisely for exception handling. The idea is that an interrupt might occur when no stack is set up, and no registers may be available for use by the exception handler. The MIPS solves this problem by building a special piece of hardware, the Exception PC register, for the sole purpose of keeping the old PC value. The PDP-10 solves the problem by saving the old PC in memory, making the assumption that the exception handler won't be reentrant. (It can be made reentrant by copying the old PC from the place where JSR put it to some newly-allocated space.) On the PDP-10, instead of jumping to a fixed address, an exception executed the instruction at a fixed address. Typically that instruction would be a JSR, which would save the PC without relying on any assumptions about registers.

(The normal bread-and-butter procedure calling instruction on the PDP-10, by the way, was PUSHJ, Push and Jump. It took two operands, a register and a jump address. The register must have been previously set up to point to a stack, like $29. "Push" means to move the pointer by one word and store something there. So the stack was allocated a word at a time, rather than a frame at a time.)
Question 20

(a) $0 always being 0 is useful --

-- to copy a value from one register to another:
        ADD $d,$s,$0

-- to check if a value is equal to zero:
        BEQ $t,$0,label
   [But, note, it's NOT needed to check if a value is less than zero or greater than zero; the BLTZ etc. instructions do NOT refer to $0.]

-- to refer to small absolute memory addresses:
        LW $d,addr($0)

-- to discard the result of an instruction:
        LA $26,0xffff0004
        LW $0,0($26)        ;ignore received character

-- to load a small immediate value into a register:
        ADDI $d,$0,value

[Ironically, "set a register to zero" is NOT as good an answer. Although ADD $d,$0,$0 works, so do lots of other instructions that don't require $0. See part (b).]

A few people suggested that $0 is useful with AND and OR instructions to set or clear bits in a word, but that doesn't make sense. ANDing with $0 will clear all the bits -- the same as loading 0 into the destination -- while ORing with $0 has no effect at all!

A few people cheated by saying essentially the same thing twice, e.g., "set a register to zero" and "initialize a counter to zero."

(b) If this feature were not in hardware it could easily be simulated by the software. The assembler could simply insert an instruction such as

        XOR $0,$0,$0

at the beginning of every program and also just after any instruction that has $0 as its destination. Or the programmer could just code that instruction explicitly, and refrain from assigning another value to it!

Some people suggested ways in which the operating system would get involved. This is needlessly complicated. Similarly, solutions requiring a special memory location set to zero miss the point; it's not that hard to set a register to zero!

Some people interpreted the question as meaning something like, "could software *enforce* keeping $0 zero against the wishes of the programmer?" But that misses the big picture. The point is, we decided in part (a) that it's *useful* to the programmer to have a zero-valued register, given the MIPS architecture. So the question in (b) is, do we absolutely have to implement this useful feature in hardware, or could it be done as a software convention?

Question 21

PUSH can't be done. The key point is that two arithmetic operations would be required: one for the effective address calculation (adding the offset to the stack pointer register) and one to increment the register.

LWR can be done. Some students worried that the instruction format couldn't fit, but it would be a fine R-format instruction, with three register operands. It would be strange to have an assembly language instruction that looks as if it should be I-format but is really R-format, but not impossible. In the Arithmetic cycle, the ALU adds the two source registers just as for an ADD instruction, but instead of using the result to provide a new register value in the Register-write cycle, the result is used as the effective address for the Memory cycle.

INCSW can't be done. Again, two arithmetic operations are required, one for the effective address calculation and one for the increment.

ADDM can't be done. It would require that the Memory read come before the Arithmetic, but the pipeline has A before M.

EXCH can't be done. It would require two memory references, one read and one write. This is the most obvious way in which RISC machines differ from earlier designs: there are no read-then-store instructions.

LWZ can be done. You know that branch instructions complete the testing of the source register values during the Decode cycle; this instruction could do the same thing. Then the instruction would continue as a LW only if the equal-to-zero test came out true.
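(A whimsical restatement of the reasoning above in C: each verdict follows from counting a proposed instruction's demands against what the pipeline provides -- one ALU operation, one memory reference, and A before M. The numbers in the table are just this answer's analysis encoded as data, not anything from the exam.)

    #include <stdio.h>
    #include <stdbool.h>

    struct demand {
        const char *name;
        int alu_ops;         /* the pipeline provides one A stage      */
        int mem_refs;        /* ... and one M stage                    */
        bool mem_before_alu; /* needs M before A? the pipe has A first */
    };

    int main(void)
    {
        struct demand d[] = {
            { "PUSH",  2, 1, false },  /* address calc + SP increment  */
            { "LWR",   1, 1, false },
            { "INCSW", 2, 1, false },  /* address calc + increment     */
            { "ADDM",  1, 1, true  },  /* needs the loaded value first */
            { "EXCH",  1, 2, false },  /* one read + one write         */
            { "LWZ",   1, 1, false },  /* zero test happens in Decode  */
        };
        for (unsigned i = 0; i < sizeof d / sizeof d[0]; i++) {
            bool ok = d[i].alu_ops <= 1 && d[i].mem_refs <= 1
                      && !d[i].mem_before_alu;
            printf("%-5s %s\n", d[i].name,
                   ok ? "can be done" : "can't be done");
        }
        return 0;
    }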
Question 22

Two very good answers that several students gave:

* A load and store cycle in the same instruction might block the pipeline, since its circuitry was designed for only one memory data reference per instruction.

* Since this instruction changes values in two places (a register and a memory location), it would be hard to restart after an exception if half-completed, with one target changed but the other still with its old value.

We gave two points to "an additional temporary register would be needed" (to implement the algorithm TEMP <- MEMORY, MEMORY <- REGISTER, REGISTER <- TEMP). This is true, but any processor is full of buffer registers and perhaps one of those is already usable for this purpose.

By the way, an interesting side note to the hard-to-restart argument is that historically, many architecture designers have taken pains to have at least one uninterruptable ("atomic") load-and-store instruction, because that simplifies the problem of allowing a processor to reserve some system resource in a multiprocessor environment. Suppose two processors both want to use the system's widget at the same time. There is a word in memory that will contain, say, -1 if the widget is free, or the processor number of the processor that's using it. So each processor would say

        LI   $8, my-processor-number
        EXCH $8, widget-address

and then if the value in $8 is -1, that processor knows it has permission to use the widget. It is possible to do this sort of resource allocation without an atomic load-and-store instruction, but it's trickier.

Question 23

Pipelining is a very cheap form of parallelism because it takes advantage of what would otherwise be idle time for components needed in the processor even without pipelining. (For example, there must be an instruction decoder, and there must be an ALU; each instruction uses only one of those at a time.) By contrast, other forms of parallelism require either duplication of circuitry within one processor (multiple ALUs, for example, within a superscalar or vector processor) or duplication of entire processors, for coarser-scale parallelism.

We had in mind parallelism within a single processor, but the question was worded broadly enough to include multiprocessor systems, so we also accepted this answer: Parallelism across processors makes programming more complicated because the user must worry about synchronization problems. [That doesn't apply to superscalar processors, even though operations are happening in parallel, because the hardware ensures that there are no races.]

Common wrong answer: The MIPS design has fixed-width instructions, etc., which make pipelining easy. This is wrong because it has the design decisions backwards. It is BECAUSE the MIPS designers wanted to build a pipelined machine that they made architecture decisions (such as the fixed-width instruction format) to simplify pipelining.

Question 24

R-format: NEVER. The operands of an R-format instruction are registers, not memory addresses, so address relocation is irrelevant to them.

I-format: SOMETIMES. Some people said "never" because branch instructions, which are I-format, aren't relocated. That's true, and most immediate instructions aren't either. But the MAL LA (Load Address) instruction expands to an LUI followed by an ORI, both of which require relocation.

J-format: We accepted ALWAYS or SOMETIMES. The operand of a jump is an address, and generally does require relocation. The only exception, and the reason we accepted SOMETIMES, is that if what you're writing is a kernel, you might need a jump to an absolute address, e.g.,

        J 0x80000080

although that particular example is actually unlikely.

Question 25

(a)
        r_vaddr    r_type       r_extern    text or data?

        0          R_JMPADDR    0           text
        0x14       R_JMPADDR    1           text
        0x24       R_JMPADDR    0           text

Common errors: Entries for the branches. An entry for each label rather than for the instruction that refers to it. Something other than a number in r_vaddr.
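(To show what becomes of the part (a) entries, here is a hedged C sketch of a loader applying one local (r_extern = 0) R_JMPADDR relocation: the low 26 bits of a J or JAL hold a word address, so the loader adds the word address at which the text was actually loaded. The struct is a simplification for illustration, not the real object-file header layout.)

    #include <stdint.h>

    struct reloc { uint32_t r_vaddr; unsigned r_type, r_extern; };

    /* Patch the 26-bit target field of the J/JAL instruction at
       r_vaddr, assuming the text was assembled at address 0 and is
       being loaded at load_addr.  For an external (r_extern = 1)
       entry, the resolved symbol's address would be added instead. */
    void apply_jmpaddr(uint32_t text[], const struct reloc *r,
                       uint32_t load_addr)
    {
        uint32_t *insn   = &text[r->r_vaddr / 4];
        uint32_t target  = ((*insn & 0x03ffffff) + (load_addr >> 2))
                           & 0x03ffffff;
        *insn = (*insn & 0xfc000000) | target;
    }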
(b)
        LOOP:   JAL GETCHAR             0c000200
                BLTZ $2, SKIPPED        04400003
                LI $8, ' '              20080020 or 34080020 or 24080020
                BNE $2, $8, SKIPPED     14480001 or 15020001
                J LOOP                  0800004b

The alternatives for the LI instruction are ADDI, ORI, and ADDIU. The alternate for BNE is because we accepted an answer with the two register fields in the wrong order, since it doesn't matter for BNE.

The most common problems were forgetting to divide the jump addresses by four and forgetting that branch addresses are PC-relative. Another one was using LUI as the translation of LI. (If the immediate value were more than 16 bits wide, then the LI would have to be *two* machine instructions, LUI followed by ORI. But certainly LUI alone makes no sense.)

Question 26

Memory data references are made by computing an address in a register, then using that register as the base in the LW/SW instruction. It's the prior instruction(s) to compute the address that require relocation.

[If MIPS instructions were wide enough to include an address, then a LW/SW instruction could have a relocatable "offset" with $0 as the base register. But they're not, so addresses never appear in LW/SW instructions. The only nonzero offsets are absolute, mainly for fields within a struct/object when we have a register pointing to the beginning of the block of memory. You might imagine that instead of saying

        LUI $1, addr-left-half
        ORI $1, addr-right-half
        LW  $8, 0($1)

we might say

        LUI $1, addr-left-half
        LW  $8, addr-right-half($1)

but in fact that isn't done because the offset would be added to the base, not ORed, and so there might be a carry into the left half, which would give an incorrect left-half value.]

Common wrong answer: Most LW/SW instructions are stack-relative, and therefore don't need relocation, but the ones for global/static data do need relocation. Nope, it's the LUI and the ORI that are relocated.

Common wrong answer: Only instruction fetches are relocated, not data references. (Not usually stated this plainly!) Of course this isn't true.

Question 27

ch = *cp++;  This means the same as ch = *cp; cp = cp+1; so it's translated by (A):

        lb   $16, 0($4)         # ch = *cp;
        addi $4, $4, 1          # cp = cp+1;

ch = (*cp)++;  This means the same as ch = *cp; *cp = *cp+1; so it's translated by (F):

        lb   $16, 0($4)         # ch = *cp;
        addi $8, $16, 1         # $8 = ch+1; /* which is *cp + 1 */
        sb   $8, 0($4)          # *cp = ch+1;

ch = (*cp)+1;  This doesn't modify cp or *cp, so it's translated by (C):

        lb   $16, 0($4)         # ch = *cp;
        addi $16, $16, 1        # ch = ch+1;

ch = *(cp+1);  This also doesn't modify cp or *cp, but uses a temp pointer: (B)

        addi $8, $4, 1          # $8 = cp+1;
        lb   $16, 0($8)         # ch = *($8);

Note that it could also be translated more simply:

        lb   $16, 1($4)         # ch = *(cp+1);

ch = *++cp;  This means the same as cp = cp+1; ch = *cp; so it's translated by (E):

        addi $4, $4, 1          # cp = cp+1;
        lb   $16, 0($4)         # ch = *cp;

The unused answer was (D):

        lb   $16, 0($4)
        addi $16, $16, 1
        sb   $16, 0($4)

which translates ch = ++(*cp);
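(One can confirm the five meanings empirically; here is a small self-contained C program, mine rather than the exam's, that runs each statement against a fresh buffer and prints ch, how far cp moved, and the buffer afterward.)

    #include <stdio.h>

    int main(void)
    {
        char buf[] = "abc";
        char *cp, ch;

        cp = buf; ch = *cp++;   printf("%c %td %s\n", ch, cp - buf, buf); /* a 1 abc */
        cp = buf; ch = (*cp)++; printf("%c %td %s\n", ch, cp - buf, buf); /* a 0 bbc */
        buf[0] = 'a';           /* undo the store before continuing */
        cp = buf; ch = (*cp)+1; printf("%c %td %s\n", ch, cp - buf, buf); /* b 0 abc */
        cp = buf; ch = *(cp+1); printf("%c %td %s\n", ch, cp - buf, buf); /* b 0 abc */
        cp = buf; ch = *++cp;   printf("%c %td %s\n", ch, cp - buf, buf); /* b 1 abc */
        return 0;
    }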