CS61C Spring 1999
Section 23 TA: Gek Siong Low (cs61c-tb)
Week 6 Discussion notes
What compilers really do
Okay, so you have learned how to translate C to MIPS assembly, and then to machine code. Great, now I know what a compiler does. Or do I? In fact, there's more to a compiler than simply translating a high-level language source to machine code.
There are several steps in compiling and running a program, as mentioned in lecture:
High-level source (that's your C program) --> Compiler --> Assembly program (*.s files) --> Assembler --> Object files (*.o) --> Linker --> Executable program (*.out) --> Loader --> Program runs in memory (finally!)
Why do it like this?
There are several reasons why we break up a compiler into these pieces instead of jumping
straight into generating machine code.
1. Generating machine code directly is too big a task. This makes the compiler very complex.
2. Abstraction (one of the BIG ideas). Breaking up the compiler into several (largely) independent stages
is good software engineering. The product will be less prone to bugs (ideally), and bugs can be more easily
located and fixed.
3. You can perform optimization. It is much easier to optimize assembly language than to do it
by trying to manipulate the machine code directly. You do not have to deal with issues like
memory references getting mangled when you cut a few instructions here, add a few instructions
there, etc.
4. It is possible to get different languages working together. You can reference an external
function in another module which may not be written in C at all. If the various compilers
generate the same type of object files, the linker can link them together without any problem.
Some OSes, such as Linux and Windows, have dynamically linked libraries, which are basically object
files that can be linked to the program during runtime. These dlls (as they are known in Windows)
can be written in any language. It's slightly different from the sequence of stages given above, but
you get the idea.
5. If you have a compiler that does everything in one step, every time you make a change to
a small section of your code the compiler has to rebuild everything. By having intermediate
stages, and intermediate files, the compiler can figure out which files require recompiling, rebuild
only those, and then relink, which is much faster and more efficient.
Note that the sequence of stages given above may not be explicit. Some compilers appear to skip the assembly language step and do not produce any assembly files, but in fact they still do it internally. For example, gcc does not explicitly generate the .s files unless you tell it to do so.
Each stage demystified
C to Assembly
Well, there isn't much to talk about for this stage. You have done translation from C to MIPS assembly and that's basically it. The nitty-gritty of this stage forms the bulk of CS164. Note that the assembly language programs produced may contain external references, that is, references to data and functions (basically just labels) defined somewhere out there, and may also have data and functions that can be referenced by other modules (exported data and functions).
There may be additional code generated that is not in your source files, such as bounds checking code, exception handling, memory management routines, etc. So don't be bewildered when the assembly program has so much more stuff than what you wrote in your C program.
Assembler
The assembler produces object files, which are just the assembled code/text + data segments + some additional information required by the linker. One issue here is the problem of forward references, which are references to labels that are declared later in the program. One solution is to assemble the program in two passes, one to gather all the label names and locations and put them in a symbol table, and the second to do the actual assembly to machine code translation. It is possible to write one-pass assemblers that keep track of the forward references and fill them in when the labels are found, but that's more complicated to write, of course.
Note that there is no meaning in defining the text or data segments to start at any particular location. They are just pieces of the final executable, and their final placement in the executable may be anywhere. Therefore, the assembler usually just assumes that the text and data segments start at address 0. The values don't matter; what matters is that the information is provided to the linker so that these segments can be relocated without problems.
Pseudoinstructions may expand into two or more real instructions (for example, la $t0, label becomes a lui/ori pair), so watch out for that.
Linker
Before we talk about what gets put into the object files, we have to understand what a linker does. The task of the linker is to take all the text and data segments in the object files and glue them together to form the executable program. However, the various segments can end up in any order, so information must be provided in the object files on how to relocate them.
Example:
    .o file 1        .o file 2
    ---------        ---------
    text 1           text 2
    data 1           data 2
    info 1           info 2

may be linked into:

    text 1
    text 2
    data 1
    data 2
So what gets put into the object file (other than the text/data segments and the object file header) are:
1. What locations in the text/data segments require the linker to fix (relocate).
2. Whether each fix is for a text reference or a data reference, and whether it is local or external.
3. The symbol table.
4. Exported symbols (i.e. symbols that can be referenced by other modules).
5. Unresolved symbols (they are probably external references).
6. Debug information (optional).
A relocation algorithm (from CS61C Spring 1998)
o Decide on the new starting address of each segment of each .o file
o For each file, do:
o For each item that requires relocation, do:
o if it's a local reference, add the new starting address (for that segment)
to the reference address
o if it's an external reference,
o look up the symbol table for that symbol
o find that symbol in the symbol tables for the other files
o get the actual address for the symbol
actual address = starting address for that segment (where we found the symbol)
+ relative position of the symbol from that starting address
o if the symbol cannot be found, maybe it's from one of the system libraries,
so search there next
Branches do not require relocation because they use relative addressing. The instructions that require relocation are those that use absolute addresses, so j and jal need to be relocated. Variable references also require relocation. Up until now you may have been referencing variables by putting the label of the variable into a register and loading and storing through that register. That label which you put into the register will require relocating. In practice, static variables are referenced relative to $gp, so if $gp changes, you have to recalculate the new offsets from $gp.
Loader
The output from the linker is the executable itself, with all references resolved (otherwise the linker will fail and exit). The program is ready to be loaded and executed. Running a program is actually quite simple: you just copy the program into the correct memory location, clear the registers, reset the stack pointer, and set the PC to the address of the first instruction. Note that addresses when loaded must match those when linked; otherwise you'll need to do some form of relinking of addresses prior to running the program.
The loader is usually part of the OS. Every executable will have a header, probably containing a magic number, which the loader uses to check whether it is a valid executable for that machine and OS. That's why the Solaris machines don't stupidly load and run a program which you compiled on an HP.
There is no relocation information left in the executable, but it may still contain the symbol table and debugging information, for use by a debugger such as gdb.
You may ask "how do we run multiple programs at the same time if we don't relocate the program during runtime to separate memory locations?" The answer is that it is best done by hardware, not the loader, and that is what is done nowadays. We'll cover that when we get to virtual memory. Virtual memory allows every program to think that it is the only program running on the machine, but in reality it is just running in a virtual address space. The hardware handles the mapping between the virtual addresses and the physical addresses on the machine. Details later.
Compilers vs Interpreters and Virtual Machines
Since we are talking about compilers here, we may as well talk about interpreters. As everybody knows, Scheme is an interpreted language. Spim is an interpreter too, and that is what you'll be writing for the simulator project: an interpreter for a subset of the MIPS assembly language. The obvious disadvantage of an interpreter is that since we have to basically "compile" every single instruction as we go along, it's going to be very inefficient and slow. But interpreters offer other advantages too.
1. It is usually more difficult to write a compiler than an interpreter.
2. Interpreters work at an even higher level than compilers. They can give much better error messages and are more
interactive. Compilers try to match this interactivity by using debugging information and symbol tables.
3. Interpreted source code runs less efficiently, but it is smaller.
4. Interpreted code is machine independent.
5. Interpreted code can even generate more code on the fly and execute it!
There are some languages that fall halfway between being compiled and interpreted. Instead of translating the source into machine code, the compiler translates it into an intermediate code, which is then run on a virtual machine. Java is the best-known example of such a language, but it is by no means a recent idea. Pascal has been doing that for years with P-code, and some forms of the Basic language also did that (I think QuickBasic in the 80's did that).
Virtual machines are yet another example of the "Abstraction" BIG idea. Spim is also a virtual machine (sort of). It runs MIPS programs (although not directly) even though the underlying machine is not MIPS-based.
Written by : Gek Siong Low (cs61c-tb), Spring 1999