Question 1

(a) In a four-way set associative cache, two of the address bits are used to select which cache block within a set will be chosen.

FALSE. The whole idea of associativity is that the blocks within a set are equivalent, so any address that goes with this set can go in any of the four blocks. Some bits from the address do determine WHICH SET will be used, but not which block within a set.

Some people interpreted the question as asking about how to find which block holds an address that's already in the cache. Some address bits are relevant in that situation, namely the tag bits, but there will be more than two of those.

(b) Programs using arrays will be more likely than programs using linked lists to benefit from a wide cache design.

TRUE. Wide caches are used to take advantage of spatial locality of reference: If we refer to some address, we're probably going to use its neighbors soon. But linked lists can have their elements scattered all throughout memory; only arrays show spatial locality of reference. (This, *not* the use of recursion, is the main reason why Lisp-like languages are still a little slower than C-like languages, although arrays *are* available in Lisp for programmers more concerned with efficiency than with flexibility.)

(c) Increasing the size of a direct mapped cache decreases conflict misses.

TRUE. If the cache is bigger, then fewer memory addresses are in competition for a given cache slot. (This would be true for any cache, even an associative one.) Some people said "false" because they read that the way to decrease conflict misses is to increase associativity. That's also true; for a given cache size, there will be fewer conflict misses with increased associativity. But that doesn't mean associativity is the *only* way to improve the hit rate!

(d) The choice of write-through or write-back design involves a tradeoff between miss rate and miss penalty.

FALSE. The tradeoff is between HIT COST and miss penalty. A write-through cache spends a little more time on each write that hits the cache; a write-back cache spends much more time on some writes that miss the cache (namely, the cases in which the chosen cache block is dirty). But since this design choice affects the contents of memory, NOT the contents of the cache, it doesn't affect the miss rate.

(e) The choice of direct mapped or associative design involves a tradeoff between miss rate and hit cost.

TRUE. A direct mapped cache is simpler, and so the hit cost is lower. An associative cache will have a somewhat better chance to avoid conflict misses, and therefore improves (decreases) the miss rate. But empirical studies show that associativity has only a tiny effect on the miss rate, so most caches are now direct mapped.

Question 2

    size    words   policy        offset   set         tag
    (wds)   (wds)

    1024    8       direct        24       31          61797
                    mapped        0x18     0x1f        0xf165
                                  11000    0011111

    4096    16      4-way set     56       15          61797
                    associative   0x38     0xf         0xf165
                                  111000   001111

    4096    4       fully         8        not         15820095
                    associative   0x8      applicable  0xf1653f
                                  1000

The number of bits in the offset depends on the width of a cache slot. The slot sizes in bytes are 8*4=32=2^5, 16*4=64=2^6, and 4*4=16=2^4. Therefore, the offset sizes are 5, 6, and 4 bits. The offset is the rightmost part of the address, which is

    0000 1111 0001 0110 0101 0011 1111 1000

The values listed above under the offset column are the rightmost five, six, and four bits of this address.

The number of bits in the set number depends on the number of sets in the cache. We determine this in two steps. First, how many slots (or blocks) are in the cache? This is the total cache size divided by the width of one block. For these three caches there are 1024/8=128=2^7, 4096/16=256=2^8, and 4096/4=1024=2^10 slots.

The next step is to divide the total number of slots (computed above) by the number of slots in each set. For a direct mapped cache, each slot is its own set. (In other words, each set contains only one slot; there is only one place that a given address can appear in the cache.) So this cache has 2^7 sets, so the seven bits to the left of the offset are the set number. For a 4-way set associative cache, each set contains four slots, so this cache has 256/4=64=2^6 sets, so its set number takes up six bits. For a fully associative cache, the entire cache is a single set, and there is no set number.

Some people thought that the set number is in the high-order (leftmost) bits of the word. If that were true, then consecutive addresses would compete for the same cache set. Locality of reference tells us that such a cache would have a very high miss rate even when most of the slots were unused! The idea is to distribute nearby addresses among all the sets.

Finally, the tag is everything not used up for the offset or set number.
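(Not part of the exam, but for concreteness: the arithmetic above is mechanical enough to express as a short C program. The function and variable names here are invented for illustration; 4-byte words are assumed, as in the calculations above. For the fully associative cache the "set" field it prints is vacuous -- there is only one set.)

    #include <stdio.h>
    #include <stdint.h>

    /* Split a byte address into tag / set / offset, given the block size
       in bytes and the number of sets (both powers of two).  A sketch
       for checking the table above, not part of the original exam. */
    static void split(uint32_t addr, uint32_t block_bytes, uint32_t num_sets)
    {
        uint32_t offset = addr & (block_bytes - 1);
        uint32_t set    = (addr / block_bytes) & (num_sets - 1);
        uint32_t tag    = addr / block_bytes / num_sets;
        printf("offset 0x%x  set 0x%x  tag 0x%x\n",
               (unsigned)offset, (unsigned)set, (unsigned)tag);
    }

    int main(void)
    {
        uint32_t addr = 0x0f1653f8;
        split(addr, 8 * 4,  1024 / 8);       /* direct mapped: 128 sets */
        split(addr, 16 * 4, 4096 / 16 / 4);  /* 4-way: 64 sets          */
        split(addr, 4 * 4,  1);              /* fully assoc.: one set   */
        return 0;
    }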
Question 3

What is the cache size? All of the times for an array size of 256 bytes are small (about 100 ns), whereas for an array of 512 bytes most of the times suddenly are around 900 ns. So probably the cache size is 256 bytes.

What is the block size? For array sizes at or above 512, a stride of 4 gives access times around 450 ns, whereas larger strides take about 900 ns. So a stride of 8 moves to the next block, whereas a stride of 4 is within the same block half the time. Therefore the block size is 8 bytes.

What is the allocation policy? For an array size of 512, strides of 128 and 256 lead to lower access times (around 300 ns). For an array size of 1024, a stride of 256 leads to a similar low time. In other words, if the stride is 1/2 or 1/4 of the array size, then not every reference leads to a cache miss, whereas if the stride is 1/8 or less of the array size, every reference misses. This suggests a 4-way set associative cache: with four or fewer conflicting blocks in use, a 4-way set associative cache can hold them all at once.

Question 4

(a) size=64K words, stride=16K words. Although the program only uses four words of the array, all four of them will occupy the same cache slot. (It's a direct mapped cache, so there is no flexibility about which slot to use.) Therefore the hit rate is near zero for this example.

(b) size=16K words, stride=2 words. This array entirely fits in the 16K-word cache, so we'll have a near-100% hit rate. (The initialization pass fills up the cache with the array, and so on the second pass everything is where it should be.)

(c) size=32K words, stride=2 words. This array is twice the size of the cache, and the stride is small enough so that every slot will be used (twice over). In the initialization phase, the cache fills up with the first half of the array, then fills up with the second half. So when the second pass starts, we're looking for the first half and finding the second half, then looking for the second half and finding the first half. So this would be all misses, except for the fact that the cache width (4 words) is bigger than the stride (2 words), so every second attempt is a cache hit -- we get about a 50% hit rate.
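(Again not part of the exam: here is a rough C sketch of the classic stride benchmark that produces tables like the one in Question 3 and the access pattern of Question 4. clock() is far too coarse for a real measurement -- a real experiment repeats each inner loop many times and divides -- but the shape of the program is the point.)

    #include <stdio.h>
    #include <time.h>

    #define MAX_SIZE (1 << 20)
    static char a[MAX_SIZE];

    int main(void)
    {
        for (int size = 256; size <= MAX_SIZE; size *= 2) {
            for (int stride = 1; stride <= size / 2; stride *= 2) {
                /* Pass 1 warms the cache; pass 2 is the one we time. */
                for (int i = 0; i < size; i += stride) a[i]++;
                clock_t t0 = clock();
                for (int i = 0; i < size; i += stride) a[i]++;
                clock_t t1 = clock();
                printf("size %7d stride %7d ticks %ld\n",
                       size, stride, (long)(t1 - t0));
            }
        }
        return 0;
    }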
Question 5

The width is 8 bytes, so there are 3 bits of offset within a cache row. The number of rows is 64/8 = 8, and the cache is 4-way set associative, so there are 2 sets, which means we need a 1-bit index to select the set. That adds up to 4 bits, i.e., the rightmost hex digit (9), which is 1001 in binary, divided as 1 001. The remaining address bits are the tag.

    tag = 1232123
    set = 1
    offset = 1

So we want the row in set 1 whose tag is 1232123; that's the fifth row of the entire cache. In that row, we want byte 1 of word 0, which is the 56.

Question 6

The cache holds 2^14 words; each slot holds eight (2^3) words. That means there are (2^14)/(2^3) = 2^11 slots. Since the cache is two-way set associative, these slots are grouped in sets of two, so there are 2^10 of these sets. Therefore we need 10 bits for the INDEX, the part of the address that selects a set. Each slot is eight words, which is 32 bytes (2^5), so we need five bits for the OFFSET, the part of the address that selects a byte or word within a cache slot. This leaves 32 - (10+5) = 17 bits for the tag.

Question 7

Cache A: Two blocks, one word each
Cache B: One block of two words

(a) a[0]++; a[1]++; a[2]++; a[3]++;

Cache A: a[0] misses, into first block; a[1] misses, into second block; a[2] misses, into first block; a[3] misses, into second block.
Cache B: a[0] misses, loads a[0] and a[1]; a[1] hits; a[2] misses, loads a[2] and a[3]; a[3] hits.

So cache B has a better hit rate (2 hits) than cache A (0 hits).

(b) a[1]++; a[3]++; a[1]++; a[3]++;

Cache A: a[1] misses, into second block; a[3] misses, into second block; a[1] misses, into second block; a[3] misses, into second block.
Cache B: a[1] misses, loads a[0] and a[1]; a[3] misses, loads a[2] and a[3]; a[1] misses, loads a[0] and a[1]; a[3] misses, loads a[2] and a[3].

So the two caches have the same hit rate (0 hits).

(c) a[1]++; a[2]++; a[1]++; a[2]++;

Cache A: a[1] misses, into second block; a[2] misses, into first block; a[1] hits; a[2] hits.
Cache B: a[1] misses, loads a[0] and a[1]; a[2] misses, loads a[2] and a[3]; a[1] misses, loads a[0] and a[1]; a[2] misses, loads a[2] and a[3].

So cache A has a better hit rate (2 hits) than cache B (0 hits).

So the two answers are C and A.

Question 8

A. Find 0x7f55a356 in the cache. Since the cache is 16 bytes wide, the lowest 4 bits (4 = lg 16) of the address are the offset within a block. That's the last hex digit, the 6. So we're looking for the cache block that would contain block number 0x7f55a35. The cache contains 4K bytes, and one block is 16 bytes, so there are 4K/16 = (2^12)/(2^4) = 2^8 blocks in the cache. Since the cache is direct mapped, every block is a set of one, and the address uniquely determines which block (i.e., which set) to use. We find out by looking at the lowest 8 bits (because there are 2^8 blocks) of the block number, which is 0x35. What's left, 0x7f55a, is the tag. So (B) is the correct answer.

B. How many sets? A block is 32 bytes; there are five blocks to a set. So the size of a set is 32*5=160 bytes. The entire cache size is 80K = 80*(2^10) bytes, so the number of sets is 80K/160 = 2^9 = 512 sets.

C. TLB mapping. A page is 4K bytes = 2^12 bytes. That means that the bottom 12 bits of the address (3 hex digits) are the offset, and the rest is the virtual page number. For this virtual address, the offset is 0x678 and the virtual page number is 0x12345. We look up the latter in the TLB and find it in slot number 2, corresponding to physical page number 0x200. Combining this with the offset, the physical address is 0x200678.
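(The translation in part C, written out as a C sketch. The TLB layout -- a small array of VPN/PPN pairs searched associatively -- and all the names are assumptions for illustration, not the exam's data structures.)

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_BITS 12                       /* 4K-byte pages */

    struct tlb_entry { uint32_t vpn, ppn; };

    /* A toy TLB containing the one entry that matters in part C. */
    static struct tlb_entry tlb[] = {
        /* ... other slots would go here ... */
        { 0x12345, 0x200 },                    /* slot 2 in the question */
    };

    static int translate(uint32_t vaddr, uint32_t *paddr)
    {
        uint32_t vpn    = vaddr >> PAGE_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
        for (unsigned i = 0; i < sizeof tlb / sizeof tlb[0]; i++) {
            if (tlb[i].vpn == vpn) {
                *paddr = (tlb[i].ppn << PAGE_BITS) | offset;
                return 1;                      /* TLB hit */
            }
        }
        return 0;                  /* TLB miss: walk the page table */
    }

    int main(void)
    {
        uint32_t pa;
        if (translate(0x12345678, &pa))
            printf("0x%x\n", (unsigned)pa);    /* prints 0x200678 */
        return 0;
    }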
D. What can't be paged? The answer is (B); if the disk I/O handler were paged out, then it could never be brought back into memory, and neither could anything else.

It's okay to page out the page tables; they are in the operating system's virtual address space, and the same mechanism that pages other stuff in and out can deal with missing page table entries. (Of course there has to be access to SOME memory, so maybe part of the system's own page table will be in the unmapped portion of memory.)

It's okay to page out the system call handler. System calls are instructions executed by a user process; if the handler needed for a certain system call isn't available, it could be paged in later, just as anything else that a user process needs can be.

And of course if user programs couldn't be paged out, that would defeat the whole purpose of paging!

Question 9

VM is different from cache because the MISS PENALTY is so much greater for VM -- if you don't find the page in physical memory, you have to do a disk operation to get it.

Question 10

A page is 2^12 bytes, so the virtual address is divided into a 20-bit virtual page number and a 12-bit offset. So the address 0x40fa25d4 is

    VPN: 40fa2    offset: 5d4

We look up the VPN in the TLB and find it in slot 6:

    VPN 40fa2, physical page 076

Therefore the desired physical address is 0x0765d4.

Now we are done with the TLB and need to find this address in the cache. The cache width is 2^5 bytes, so there are five bits of cache offset. The total cache size is 2^9 bytes, so there are 2^4 = 16 cache blocks. The cache is 2-way set associative, so these 16 blocks are organized as 8 (= 2^3) sets of 2 blocks each. So there are three bits of set number.

    Physical address = 0x765d4 = 0111 0110 0101 1101 0100 binary
                               = 011101100101 110 10100
                                              --- -----
                                             set#  offset

so we are looking for an offset of 0x14 in a block of set #6. Set #6 has two blocks, whose keys are a1c5 and 0765. The second of these matches the leftmost bits of the physical address, so we use that block (block #0xd), and at offset 0x14 we find the data, 0x46.

Question 11

Parts a-c are about the cache. The cache size is 8K bytes, which is 2^13. The cache width is 32 words, or 128 bytes, which is 2^7. Therefore there are (2^13)/(2^7) = 2^6 = 64 cache rows. Since the cache is 4-way set associative, these rows are grouped as 16 sets of 4 rows per set. So a memory address is

    +---------------------+----+-------+
    |         21          |  4 |   7   |
    |         tag         | set| offset|
    +---------------------+----+-------+

(a) block offset is 7 bits, because a cache row is 2^7 bytes wide.
(b) set index is 4 bits, because there are 16 = 2^4 sets.
(c) What's left, 32 - 7 - 4 = 21 bits, is the tag.

Parts d-f are about virtual memory. The page size is 4K bytes = 2^12. So for VM purposes a 32-bit memory address (both physical and virtual addresses are 32 bits) is

    +--------------------+------------+
    |         20         |     12     |
    |     page number    |   offset   |
    +--------------------+------------+

(d) page offset is 12 bits, because a page holds 2^12 bytes.
(e) and (f) page numbers are the remaining 32-12 = 20 bits.
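(The field widths in Question 11 -- and in Questions 2, 6, and 10 -- all come from the same three logarithms, so here they are in C. The names and the lg helper are my own, for illustration.)

    #include <stdio.h>

    /* log2 for exact powers of two */
    static int lg(unsigned n) { int b = 0; while (n > 1) { n >>= 1; b++; } return b; }

    int main(void)
    {
        unsigned cache_bytes = 8 * 1024;   /* 8K-byte cache         */
        unsigned block_bytes = 32 * 4;     /* 32 words = 128 bytes  */
        unsigned assoc       = 4;          /* 4-way set associative */
        unsigned addr_bits   = 32;

        int offset = lg(block_bytes);                         /* 7  */
        int index  = lg(cache_bytes / block_bytes / assoc);   /* 4  */
        int tag    = addr_bits - index - offset;              /* 21 */
        printf("tag %d  index %d  offset %d\n", tag, index, offset);
        return 0;
    }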
Question 12

Why cache in hardware, VM largely in software? There are many reasons for this, and we accepted a wide range of possible answers. For example:

* Virtual memory uses the disk, whose latency is large and unpredictable; the processor can't simply wait out a VM miss, but must instead switch to a different process, and that requires software support.

* Since the miss penalty is not so severe for a cache miss as for a page fault, the cache can use random replacement, which is relatively easy to do in hardware; by contrast, VM generally uses an LRU algorithm, which requires complicated bookkeeping best done with software support.

* Since the backup store for VM is the disk, the page fault handler must understand the file system structure, to know which part of the disk to use; this structure is not part of the disk hardware but is imposed by the operating system. (That's why PC diskettes and Mac diskettes are incompatible even though they are physically the same.)

* If a cache miss ran some software, the possibility arises that part of that software might not be in the cache, which would create a recursive cache miss. The cache miss handler software would have to be permanently locked into the cache.

A typical two-point answer would be one that didn't show a clear understanding of the difference between a hit and a miss, e.g., making an argument that handling each cache hit in software would defeat the purpose of caching. The same is true for VM: TLB hits are handled by the hardware! It's only on a miss that the question of hardware vs. software control is relevant.

Question 13

Two point answer: The bit is set whenever something is stored in any address within the page; the operating system examines the bit when that page must be reassigned. If the bit is not set, then the operating system need not write the contents of the page to disk before reading in the new contents.

One point answer: The bit is set whenever something is stored in any address within the page.

One point answer: The bit is used to keep track of whether anything has been modified and must be written back. [This is a one point answer because it's not clear whether you understand the difference between the cache and the virtual memory system.]

Zero point answer: The bit is set when the page table or TLB entry is written; it's used to decide whether or not to write the page table or TLB entry back to memory. [It's not the page table or TLB entry we're concerned with, but rather the data in the page to which that entry refers!]

Question 14

a. Reading a page table entry from main memory: Could be either. Whether this operation is built into the hardware or left for the operating system software, a memory reference is still required. Therefore the gain from hardwiring this operation is slight. The MIPS processor leaves this task for the operating system, although other processors have specific hardware for the purpose.

b. Reading a TLB entry: Definitely hardware. The whole point of having a TLB is to cache the page table, so that the virtual memory translation process will be really fast in most cases. If every TLB read required a processor exception handler to be called, every memory reference would be really slow!

c. Setting the write enable bit for a page: Definitely software. The write enable bit (unlike the dirty bit, which also has to do with writing) indicates a policy decision, not a record of the history of memory use. There is no way the hardware could make this decision; the operating system must keep track of which pages should be writable and which should not.

Question 15

When is M on, R off? Answer: This page has been modified on some previous clock tick, but not used during this clock tick, and the operating system is turning off all the R bits every tick in order to get LRU information. (The reason this is a question at all is that every write *is* a reference, so the M bit is never turned on unless the R bit is also turned on at the same time.)
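(To make the roles of the two bits in Questions 13 and 15 concrete, here is a hedged C sketch of the per-tick R-bit sweep and of the write-back decision at replacement time. The struct, the function names, and the stub are invented for illustration.)

    #include <stdbool.h>

    struct page {
        bool valid;
        bool referenced;  /* R: turned on by hardware on any access */
        bool modified;    /* M: turned on by hardware on any store  */
        /* frame number, disk address, etc. would go here */
    };

    static void write_page_to_disk(struct page *p) { (void)p; /* stub */ }

    /* Once per clock tick: clear every R bit, so that at the next tick
       the R bits identify exactly the pages used during this tick --
       the cheap LRU information mentioned in Question 15. */
    void tick(struct page pages[], int n)
    {
        for (int i = 0; i < n; i++)
            pages[i].referenced = false;
    }

    /* When a page frame must be reassigned (Question 13): only a
       modified page has to be written back; a clean page's disk copy
       is still current, so the write can be skipped. */
    void evict(struct page *p)
    {
        if (p->modified)
            write_page_to_disk(p);
        p->valid = false;
        p->modified = false;
    }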
Question 16

Answer: *TLB slot* replacement policy and *page* replacement policy are two different things, and unrelated. The statement would be true if it said that the *page* replacement policy must be LRU, but in fact the TLB slot policy is more likely to be Random.

Question 17

If a desired page isn't found in the process's physical memory, the system looks for it in the process's virtual memory: FALSE. Virtual memory isn't a place that can be looked in. VM is a function that maps virtual addresses to physical addresses. Information must be in *some* physical place! A statement that would be true: "... looks for it in the swapping space on the disk."

Considering VM as a kind of caching, what corresponds to the cache width is the size of a page. TRUE. The cache width says how much information fits into one cache slot. The page size says how much information fits into one "memory slot" -- the smallest unit of information that will be loaded into memory at a time.

Considering VM as a kind of caching, what corresponds to the cache size is the size of the virtual address space. FALSE. What corresponds to the cache size is the size of the *physical* memory -- the amount of information that can be "cached" in memory at once.

If the TLB design doesn't include a Writeable bit, the software can simulate it using the Dirty bit. FALSE. It's the other way around; if there's no Dirty bit the system can simulate it by disabling writing on every page until the page is first written, which will cause an exception so the system can remember that this page is modified.

Question 18

(a) What hardware supports LRU? The Reference bit (or Use bit) in the TLB or page table entry.

(b) How can an OS support LRU without it? Periodically turn off all the Valid bits, and use the resulting page faults to control a software-maintained reference bit, as in the sketch below.

Alternative answer to (b), less general but accepted: On the MIPS, where there is no page table hardware, the OS could assume that any entry that is about to be flushed from the TLB is fairly recently used, and turn on a software reference bit in the software-maintained page table. (In other words, pretend that *every* valid TLB entry has its reference bit set.)
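(A hedged C sketch of the first answer to (b). Everything here -- the struct, the extra OS-only fields, the function names -- is invented to illustrate the trick: the hardware only sees the valid bit, and the OS turns the resulting faults into reference information.)

    #include <stdbool.h>

    struct pte {
        bool valid;            /* the bit the hardware actually checks  */
        bool soft_referenced;  /* maintained by the OS, invisible to HW */
        bool resident;         /* OS's own record: page is in memory    */
        /* frame number, protection bits, etc. would go here */
    };

    /* Run periodically: mark everything invalid, even pages that are
       still in memory, and forget the old reference information. */
    void sweep(struct pte pt[], int n)
    {
        for (int i = 0; i < n; i++) {
            pt[i].valid = false;
            pt[i].soft_referenced = false;
        }
    }

    /* Page fault handler: if the page was resident all along, the
       "fault" merely tells us the page was touched; record that and
       restart the instruction, with no disk I/O at all. */
    void handle_fault(struct pte *e)
    {
        if (e->resident) {
            e->soft_referenced = true;
            e->valid = true;
        } else {
            /* a genuine miss: choose a victim, page in from disk, etc. */
        }
    }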
Question 19

Many people lost points on this question for saying things like "the RISC philosophy is to make things as simple as possible and this isn't simple." Well, yes, but you could say that about almost anything! The point of the question was to see if you could focus on the *specific* bottlenecks in particular proposed instructions.

A. Block Transfer.

The main answer we were expecting here was "it would delay the pipeline to have multiple memory references in a single instruction cycle" and indeed many people said that.

Some people, copying from an answer in the sample exam, said "it would require either multiple bus cycles, which would delay the pipeline, or widening the bus." Come on! The question from which you copied that answer was about a proposed 64-bit memory transfer instruction. This one is a proposed any-number-of-bits transfer! There is no way we could widen the bus to accommodate, for example, a megabyte memory-to-memory copy. People who did this got some credit for mentioning the pipeline but didn't leave us with a strong feeling of confidence that you understand what a bus is, or a pipeline, or a memory, or a transfer, or a computer.

Several people suggested that this instruction would be too wide to fit in a word, because it would require three pointer-length operands. But this objection is easily overcome; the three operands could be registers containing the pointers and the length. This answer got half credit.

Some people mentioned that an instruction that might take a very long time would leave the machine uninterruptable for that long time, which could lead to I/O data loss. This is a very good answer, showing some real thinking -- connecting the instruction semantics with the not-obviously-related question of interrupt handling. Of course this issue came up in the PDP-10 also, and here's how it was solved: If an interrupt is requested during a BLT instruction, the processor updates the pointers in the register operands to reflect how far the copying has gone, then remembers the address of the BLT itself, rather than the address of the next instruction, as the exception PC. After the interrupt is handled, the BLT continues. (That solution would run into difficulties in the MIPS architecture if the BLT happened to occupy the delay slot of a branch.)

A few people suggested that block transfer is an unnecessary idea because arrays are assigned by reference (copying a pointer) rather than by value. That isn't a very good answer. For one thing, it concerns the design of the C language, not of the MIPS architecture. (Pascal passes arrays by value.) Second, even in C, structs are passed by value, so block copies are still useful.
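(The PDP-10's restartable-BLT trick can be paraphrased in C: keep the copy's progress in the "register operands" themselves, so that stopping and re-entering the routine is harmless. A sketch under that interpretation; the interrupt flag and all names are mine, for illustration only.)

    #include <stddef.h>
    #include <stdbool.h>

    volatile bool interrupt_pending;   /* assumed flag, for illustration */

    /* Copy *len bytes from *src to *dst.  On an interrupt we simply
       return, having already advanced all three "register operands";
       re-executing the instruction -- calling blt() again -- finishes
       the job. */
    void blt(char **dst, char **src, size_t *len)
    {
        while (*len > 0) {
            if (interrupt_pending)
                return;          /* exception PC points back at the BLT */
            *(*dst)++ = *(*src)++;
            --*len;
        }
    }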
B. Jump to Subroutine.

Here are the two main answers we were expecting:

1. A procedure called by JSR can't be called recursively.

2. Allowing write access to code pages would prevent the use of shared code.

We gave half credit for the response "this would block the pipeline because two memory references would be required." In fact only one memory reference would be required; a branch or jump instruction does *not* fetch the next instruction, but merely puts a new address into the processor's PC register. So a JSR would combine a jump (no memory access required) with a store. On the other hand, the proposed jump-indirect instruction for returning from a procedure *would* block the pipe, not because of doing two loads (it only does one) but because the load would have to happen too early in the processing of the instruction. (An ordinary load actually doesn't complete until after the next instruction!) We gave full credit for saying that jump-indirect would block the pipe, even if you didn't fully understand why.

One student mentioned that storing the return address near the procedure's code would require setting the dirty bit for that page when calling the procedure, which would increase the paging load on the system. Good answer.

Several students suggested that using memory to save the return address instead of a register would require additional memory allocation. Pfui. First of all, the amount of memory required to hold return addresses is a tiny fraction of a program's memory requirement; second, in most cases the return address ends up in memory anyway, because $31 is saved on the stack.

In the apple-pie-motherhood category was the answer "a RISC instruction shouldn't have two purposes." Consider that the MIPS architecture *does* include the instructions BLTZAL and BGEZAL, which combine test-and-branch with procedure calling.

A couple of people said "this would violate stack discipline." Like the answer about passing arrays by reference earlier, this confuses a language design issue with an architecture issue. The MIPS JAL instruction doesn't have anything to do with using a stack or not using a stack, and neither does the PDP-10 JSR, really; the difference is between storing the old PC in a register and storing the old PC in memory. And in any case, "stack discipline" doesn't mean having a stack. Scheme, which violates stack discipline, is still generally implemented using a stack for most purposes. Logo is an example of a language that *does* obey stack discipline, but is generally implemented without the use of a hardware stack. (The environment is instead maintained using linked lists, as in the 61A metacircular evaluator.)

Several people gave arguments that, if true, would also prove that JAL doesn't fit the MIPS architecture! For example, several people thought it problematic that (in the actual words of the problem) "The instruction stored the PC at that address..." because what has to be stored is the address of the next instruction, not the address of the JSR instruction. But of course exactly the same argument applies to the PC that JAL stores in $31! The stored PC is the address following the procedure call, and the same is true for JSR.

A few people argued that the JSR mechanism wouldn't work for exception handling, for various reasons. This is irrelevant, because the MIPS doesn't use JAL for exception handling either! Since there is already a separate mechanism for exceptions, changing the ordinary procedure calling mechanism is independent of it. Actually this objection was particularly ironic because, in the PDP-10, JSR was used precisely for exception handling. The idea is that an interrupt might occur when no stack is set up, and no registers may be available for use by the exception handler. The MIPS solves this problem by building a special piece of hardware, the Exception PC register, for the sole purpose of keeping the old PC value. The PDP-10 solves the problem by saving the old PC in memory, making the assumption that the exception handler won't be reentrant. (It can be made reentrant by copying the old PC from the place where JSR put it to some newly-allocated space.) On the PDP-10, instead of jumping to a fixed address, an exception executed the instruction at a fixed address. Typically that instruction would be a JSR, which would save the PC without relying on any assumptions about registers.

(The normal bread-and-butter procedure calling instruction on the PDP-10, by the way, was PUSHJ, Push and Jump. It took two operands, a register and a jump address. The register must have been previously set up to point to a stack, like $29. "Push" means to move the pointer by one word and store something there. So the stack was allocated a word at a time, rather than a frame at a time.)
Question 20

(a) $0 always being 0 is useful --

-- to copy a value from one register to another:
        ADD $d,$s,$0

-- to check if a value is equal to zero:
        BEQ $t,$0,label
   [But, note, it's NOT needed to check if a value is less than zero or greater than zero; the BLTZ etc. instructions do NOT refer to $0.]

-- to refer to small absolute memory addresses:
        LW $d,addr($0)

-- to discard the result of an instruction:
        LA $26,0xffff0004
        LW $0,0($26)        ;ignore received character

-- to load a small immediate value into a register:
        ADDI $d,$0,value

[Ironically, "set a register to zero" is NOT as good an answer. Although ADD $d,$0,$0 works, so do lots of other instructions that don't require $0. See part (b).]

A few people suggested that $0 is useful with AND and OR instructions to set or clear bits in a word, but that doesn't make sense. ANDing with $0 will clear all the bits -- the same as loading 0 into the destination -- while ORing with $0 has no effect at all!

A few people cheated by saying essentially the same thing twice, e.g., "set a register to zero" and "initialize a counter to zero."

(b) If this feature were not in hardware it could easily be simulated by the software. The assembler could simply insert an instruction such as

        XOR $0,$0,$0

at the beginning of every program and also just after any instruction that has $0 as its destination. Or the programmer could just code that instruction explicitly, and refrain from assigning another value to it!

Some people suggested ways in which the operating system would get involved. This is needlessly complicated. Similarly, solutions requiring a special memory location set to zero miss the point; it's not that hard to set a register to zero!

Some people interpreted the question as meaning something like, "could software *enforce* keeping $0 zero against the wishes of the programmer?" But that misses the big picture. The point is, we decided in part (a) that it's *useful* to the programmer to have a zero-valued register, given the MIPS architecture. So the question in (b) is, do we absolutely have to implement this useful feature in hardware, or could it be done as a software convention?

Question 21

PUSH can't be done. The key point is that two arithmetic operations would be required: one for the effective address calculation (adding the offset to the stack pointer register) and one to increment the register.

LWR can be done. Some students worried that the instruction format couldn't fit, but it would be a fine R-format instruction, with three register operands. It would be strange to have an assembly language instruction that looks as if it should be I-format but is really R-format, but not impossible. In the Arithmetic cycle, the ALU adds the two source registers just as for an ADD instruction, but instead of using the result to provide a new register value in the Register-write cycle, the result is used as the effective address for the Memory cycle.

INCSW can't be done. Again, two arithmetic operations are required, one for the effective address calculation and one for the increment.

ADDM can't be done. It would require that the Memory read come before the Arithmetic, but the pipeline has A before M.

EXCH can't be done. It would require two memory references, one read and one write. This is the most obvious way in which RISC machines differ from earlier designs: there are no read-then-store instructions.

LWZ can be done. You know that branch instructions complete the testing of the source register values during the Decode cycle; this instruction could do the same thing. Then the instruction would continue as a LW only if the equal-to-zero test came out true.
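(A whimsical restatement of the reasoning above in C: each verdict follows from counting a proposed instruction's demands against what the pipeline provides -- one ALU operation, one memory reference, and A before M. The numbers in the table are just this answer's analysis encoded as data, not anything from the exam.)

    #include <stdio.h>
    #include <stdbool.h>

    struct demand {
        const char *name;
        int alu_ops;         /* the pipeline provides one A stage      */
        int mem_refs;        /* ... and one M stage                    */
        bool mem_before_alu; /* needs M before A? the pipe has A first */
    };

    int main(void)
    {
        struct demand d[] = {
            { "PUSH",  2, 1, false },  /* address calc + SP increment  */
            { "LWR",   1, 1, false },
            { "INCSW", 2, 1, false },  /* address calc + increment     */
            { "ADDM",  1, 1, true  },  /* needs the loaded value first */
            { "EXCH",  1, 2, false },  /* one read + one write         */
            { "LWZ",   1, 1, false },  /* zero test happens in Decode  */
        };
        for (unsigned i = 0; i < sizeof d / sizeof d[0]; i++) {
            bool ok = d[i].alu_ops <= 1 && d[i].mem_refs <= 1
                      && !d[i].mem_before_alu;
            printf("%-5s %s\n", d[i].name,
                   ok ? "can be done" : "can't be done");
        }
        return 0;
    }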
Question 22

Two very good answers that several students gave:

* A load and store cycle in the same instruction might block the pipeline, since its circuitry was designed for only one memory data reference per instruction.

* Since this instruction changes values in two places (a register and a memory location), it would be hard to restart after an exception if half-completed, with one target changed but the other still with its old value.

We gave two points to "an additional temporary register would be needed" (to implement the algorithm TEMP <- MEMORY, MEMORY <- REGISTER, REGISTER <- TEMP). This is true, but any processor is full of buffer registers and perhaps one of those is already usable for this purpose.

By the way, an interesting side note to the hard-to-restart argument is that historically, many architecture designers have taken pains to have at least one uninterruptable ("atomic") load-and-store instruction, because that simplifies the problem of allowing a processor to reserve some system resource in a multiprocessor environment. Suppose two processors both want to use the system's widget at the same time. There is a word in memory that will contain, say, -1 if the widget is free, or the processor number of the processor that's using it. So each processor would say

        LI   $8, my-processor-number
        EXCH $8, widget-address

and then if the value in $8 is -1, that processor knows it has permission to use the widget. It is possible to do this sort of resource allocation without an atomic load-and-store instruction, but it's trickier.

Question 23

Pipelining is a very cheap form of parallelism because it takes advantage of what would otherwise be idle time for components needed in the processor even without pipelining. (For example, there must be an instruction decoder, and there must be an ALU; each instruction uses only one of those at a time.) By contrast, other forms of parallelism require either duplication of circuitry within one processor (multiple ALUs, for example, within a superscalar or vector processor) or duplication of entire processors, for coarser-scale parallelism.

We had in mind parallelism within a single processor, but the question was worded broadly enough to include multiprocessor systems, so we also accepted this answer: Parallelism across processors makes programming more complicated because the user must worry about synchronization problems. [That doesn't apply to superscalar processors, even though operations are happening in parallel, because the hardware ensures that there are no races.]

Common wrong answer: The MIPS design has fixed-width instructions, etc., which make pipelining easy. This is wrong because it has the design decisions backwards. It is BECAUSE the MIPS designers wanted to build a pipelined machine that they made architecture decisions (such as the fixed-width instruction format) to simplify pipelining.

Question 24

R-format: NEVER. The operands of an R-format instruction are registers, not memory addresses, so address relocation is irrelevant to them.

I-format: SOMETIMES. Some people said "never" because branch instructions, which are I-format, aren't relocated. That's true, and most immediate instructions aren't either. But the MAL LA (Load Address) instruction expands to an LUI followed by an ORI, both of which require relocation.

J-format: We accepted ALWAYS or SOMETIMES. The operand of a jump is an address, and generally does require relocation. The only exception, and the reason we accepted SOMETIMES, is that if what you're writing is a kernel, you might need a jump to an absolute address, e.g.,

        J 0x80000080

although that particular example is actually unlikely.

Question 25

(a)
        r_vaddr    r_type       r_extern    text or data?

        0          R_JMPADDR    0           text
        0x14       R_JMPADDR    1           text
        0x24       R_JMPADDR    0           text

Common errors: Entries for the branches. An entry for each label rather than for the instruction that refers to it. Something other than a number in r_vaddr.
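(To show what becomes of the part (a) entries, here is a hedged C sketch of a loader applying one local (r_extern = 0) R_JMPADDR relocation: the low 26 bits of a J or JAL hold a word address, so the loader adds the word address at which the text was actually loaded. The struct is a simplification for illustration, not the real object-file header layout.)

    #include <stdint.h>

    struct reloc { uint32_t r_vaddr; unsigned r_type, r_extern; };

    /* Patch the 26-bit target field of the J/JAL instruction at
       r_vaddr, assuming the text was assembled at address 0 and is
       being loaded at load_addr.  For an external (r_extern = 1)
       entry, the resolved symbol's address would be added instead. */
    void apply_jmpaddr(uint32_t text[], const struct reloc *r,
                       uint32_t load_addr)
    {
        uint32_t *insn   = &text[r->r_vaddr / 4];
        uint32_t target  = ((*insn & 0x03ffffff) + (load_addr >> 2))
                           & 0x03ffffff;
        *insn = (*insn & 0xfc000000) | target;
    }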
(b)
        LOOP:   JAL GETCHAR             0c000200
                BLTZ $2, SKIPPED        04400003
                LI $8, ' '              20080020 or 34080020 or 24080020
                BNE $2, $8, SKIPPED     14480001 or 15020001
                J LOOP                  0800004b

The alternatives for the LI instruction are ADDI, ORI, and ADDIU. The alternate for BNE is because we accepted an answer with the two register fields in the wrong order, since it doesn't matter for BNE.

The most common problems were forgetting to divide the jump addresses by four and forgetting that branch addresses are PC-relative. Another one was using LUI as the translation of LI. (If the immediate value were more than 16 bits wide, then the LI would have to be *two* machine instructions, LUI followed by ORI. But certainly LUI alone makes no sense.)

Question 26

Memory data references are made by computing an address in a register, then using that register as the base in the LW/SW instruction. It's the prior instruction(s) to compute the address that require relocation.

[If MIPS instructions were wide enough to include an address, then a LW/SW instruction could have a relocatable "offset" with $0 as the base register. But they're not, so addresses never appear in LW/SW instructions. The only nonzero offsets are absolute, mainly for fields within a struct/object when we have a register pointing to the beginning of the block of memory. You might imagine that instead of saying

        LUI $1, addr-left-half
        ORI $1, addr-right-half
        LW  $8, 0($1)

we might say

        LUI $1, addr-left-half
        LW  $8, addr-right-half($1)

but in fact that isn't done because the offset would be added to the base, not ORed, and so there might be a carry into the left half, which would give an incorrect left-half value.]

Common wrong answer: Most LW/SW instructions are stack-relative, and therefore don't need relocation, but the ones for global/static data do need relocation. Nope, it's the LUI and the ORI that are relocated.

Common wrong answer: Only instruction fetches are relocated, not data references. (Not usually stated this plainly!) Of course this isn't true.

Question 27

ch = *cp++;  This means the same as ch = *cp; cp = cp+1; so it's translated by (A):

        lb   $16, 0($4)         # ch = *cp;
        addi $4, $4, 1          # cp = cp+1;

ch = (*cp)++;  This means the same as ch = *cp; *cp = *cp+1; so it's translated by (F):

        lb   $16, 0($4)         # ch = *cp;
        addi $8, $16, 1         # $8 = ch+1; /* which is *cp + 1 */
        sb   $8, 0($4)          # *cp = ch+1;

ch = (*cp)+1;  This doesn't modify cp or *cp, so it's translated by (C):

        lb   $16, 0($4)         # ch = *cp;
        addi $16, $16, 1        # ch = ch+1;

ch = *(cp+1);  This also doesn't modify cp or *cp, but uses a temp pointer: (B)

        addi $8, $4, 1          # $8 = cp+1;
        lb   $16, 0($8)         # ch = *($8);

Note that it could also be translated more simply:

        lb   $16, 1($4)         # ch = *(cp+1);

ch = *++cp;  This means the same as cp = cp+1; ch = *cp; so it's translated by (E):

        addi $4, $4, 1          # cp = cp+1;
        lb   $16, 0($4)         # ch = *cp;

The unused answer was (D):

        lb   $16, 0($4)
        addi $16, $16, 1
        sb   $16, 0($4)

which translates ch = ++(*cp);
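(One can confirm the five meanings empirically; here is a small self-contained C program, mine rather than the exam's, that runs each statement against a fresh buffer and prints ch, how far cp moved, and the buffer afterward.)

    #include <stdio.h>

    int main(void)
    {
        char buf[] = "abc";
        char *cp, ch;

        cp = buf; ch = *cp++;   printf("%c %td %s\n", ch, cp - buf, buf); /* a 1 abc */
        cp = buf; ch = (*cp)++; printf("%c %td %s\n", ch, cp - buf, buf); /* a 0 bbc */
        buf[0] = 'a';           /* undo the store before continuing */
        cp = buf; ch = (*cp)+1; printf("%c %td %s\n", ch, cp - buf, buf); /* b 0 abc */
        cp = buf; ch = *(cp+1); printf("%c %td %s\n", ch, cp - buf, buf); /* b 0 abc */
        cp = buf; ch = *++cp;   printf("%c %td %s\n", ch, cp - buf, buf); /* b 1 abc */
        return 0;
    }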