This is a tutorial which guides you through the process of making a genetic algorithm (GA) program. To make and run the program, you'll need to use a C compiler on your computer.

Ariel View
A GA tries to simulate the process of evolution that happens on Earth. First you create a bunch of organisms who each have a unique set of genes (usually chosen randomly). They are the first generation. Then you evaluate and rank the fitness of each of the organizms according to some criteria (e.g. which creature gets the fattest after two years). Then you chose some of the more fit organisms and let them reproduce with each other to produce the second generation. One the first generation has reproduced, all the members of the first generation "die". Then you evaluate the second generation and the cycle repeats.

The GA keeps producing new generations until a perfect organism has been created or we get to, say, the 1000th generation and decide to stop. A run is the process of doing all the generations.

Program Outline
The program can be seen as a small hierarchy of functions which are all in one file:

Function Hierarchy
main() AllocateMemory() DoOneRun() InitializeOrganisms() EvaluateOrganisms() ProduceNextGeneration() SelectOneOrganism()

Here is the code for main():

main()'s Code
#include <stdio.h> int main(){ int finalGeneration; AllocateMemory(); finalGeneration = DoOneRun(); printf("The final generation was: %d\n", finalGeneration); }

We include stdio.h so we can use the printf() function to print how many generations it took to get a perfect organism.

Our first real task is to allocate memory to store the organisms in. The only information we need to store about each organism is its genetics (i.e. its set of genes). We'll start with 100 organisms who each have 20 genes. We'll store each organism's genes as an array of char variables (not a "string" per se). Here is the code for AllocateMemory():

AllocateMemory()'s Code
#define NUMBER_ORGANISMS 100 #define NUMBER_GENES 20 char **currentGeneration, **nextGeneration; // globals char *modelOrganism; int *organismsFitnesses; void AllocateMemory(void){ int organism; currentGeneration = (char**)malloc(sizeof(char*) * NUMBER_ORGANISMS); nextGeneration = (char**)malloc(sizeof(char*) * NUMBER_ORGANISMS); modelOrganism = (char*)malloc(sizeof(char) * NUMBER_GENES); organismsFitnesses = (int*)malloc(sizeof(int) * NUMBER_ORGANISMS); for(organism=0; organism<NUMBER_ORGANISMS; ++organism){ currentGeneration[organism] = (char*)malloc(sizeof(char) * NUMBER_GENES); nextGeneration[organism] = (char*)malloc(sizeof(char) * NUMBER_GENES); } }

Notice that currentGeneration and nextGeneration are two global pointers which we'll use so we can manipulate the organism data in our various functions. modelOrganism is what we'll use as the "perfect" organism. Later on, we'll be comparing our regular organisms with the model to see how perfect they are. Conceptually though, the model organism isn't a real organism the way the others are. I hope the rest of the code isn't too dense to understand, I'll be skipping over most of the coding-specific issues for brevity.

First we initialize the organisms, then we use a simple loop to go from one generation to the next until a perfect generation is found. A perfect generation is one that has at least one organism which has the exact same genes as the model organism. When we get to a perfect generation, we return the generation number.

DoOneRun()'s Code
#define FALSE 0 #define TRUE 1 int DoOneRun(void){ int generations = 1; int perfectGeneration = FALSE; InitializeOrganisms(); while(TRUE){ perfectGeneration = EvaluateOrganisms(); if( perfectGeneration==TRUE ) return generations; ProduceNextGeneration(); ++generations; } }

You'll notice I've written the code in a very simple way so it is easier to learn. Latter on, I'll give a more concise and efficient version of the program.

Before we start each run, we need to randomize the genes of the normal organisms and the model. Notice that we don't need to do anything to the nextGeneration data structure since it is only a temporary holding area we use when we're producing the following generation.

Each gene can be one of four alleles: C, G, A, or T. But for simplicity, we'll represent their values as 0, 1, 2, and 3, respectively.

InitializeOrganisms()'s Code
#include <stdlib.h> #define ALLELES 4 void InitializeOrganisms(void){ int organism; int gene; // initialize the normal organisms for(organism=0; organism<NUMBER_ORGANISMS; ++organism){ for(gene=0; gene<NUMBER_GENES; ++gene){ currentGeneration[organism][gene] = rand()%ALLELES; } } // initialize the model organism for(gene=0; gene<NUMBER_GENES; ++gene){ modelOrganism[gene] = rand()%ALLELES; } }

The evaluation stage has two purposes. Primarily, we have to determine the fitness of all the organisms so that later on, in ProduceNextGeneration(), we'll know which were the better organisms and therefore which should reproduce more often. Our secondary purpose is to decide if we've got a perfect generation, one with at least one organism that has the same genetics as the model.

Each organism's fitness is simply the number of its genes that match the model (they have to be in the right position too). So, if the organism we're considering is AAAAAACAAAAAAAAAAAAA and the model is CCCCCCCCCCCCCCCCCCCC, then the organism gets a fitness of 1. The one point comes from both the model and the organism having a 'C' as their seventh gene.

We introduce another global variable totalOfFitnesses here. It is simply the sum of each organisms fitness. Ignore it for now, we'll use it later on in SelectOneMember(). Here's the code:

EvaluateOrganisms()'s Code
#define MAXIMUM_FITNESS NUMBER_GENES int totalOfFitnesses; int EvaluateOrganisms(void){ int organism; int gene; int currentOrganismsFitnessTally; totalOfFitnesses = 0; for(organism=0; organism<NUMBER_ORGANISMS; ++organism){ currentOrganismsFitnessTally = 0; // tally up the current organism's fitness for(gene=0; gene<NUMBER_GENES; ++gene){ if( currentGeneration[organism][gene] == modelOrganism[gene] ){ ++currentOrganismsFitnessTally; } } // save the tally in the fitnesses data structure // and add its fitness to the generation's total organismsFitnesses[organism] = currentOrganismsFitnessTally; totalOfFitnesses += currentOrganismsFitnessTally; // check if we have a perfect generation if( currentOrganismsFitnessTally == MAXIMUM_FITNESS ){ return TRUE; } } return FALSE; }

Once we've figured out the fitness of all the organisms in our current generation, we can select the best organisms and reproduce them. We'll temporarily store each organism of the new generation, the children, in the nextGeneration data structure. After we've created all the children, we copy them into the currentGeneration data structure and the reproduction phase is complete.

Now a little more detail on how the children are created. For each child, we first use SelectOneMember() twice to get the two parents. Then we randomly select a crossover point and start copying over the parents genes to the child. The genes to the left of the crossover point are copied over from parent one, while the genes right of the crossover point come from parent two.

Each time we copy over a gene, there is a chance it will mutate into a random gene.

ProduceNextGeneration()'s Code
#define MUTATION_RATE 0.001 void ProduceNextGeneration(void){ int organism; int gene; int parentOne; int parentTwo; int crossoverPoint; int mutateThisGene; // fill the nextGeneration data structure with the // children for(organism=0; organism<NUMBER_ORGANISMS; ++organism){ parentOne = SelectOneOrganism(); parentTwo = SelectOneOrganism(); crossoverPoint = rand() % NUMBER_GENES; for(gene=0; gene<NUMBER_GENES; ++gene){ // copy over a single gene mutateThisGene = rand() % (int)(1.0 / MUTATION_RATE); if(mutateThisGene == 0){ // we decided to make this gene a mutation nextGeneration[organism][gene] = rand() % ALLELES; } else { // we decided to copy this gene from a parent if (gene < crossoverPoint){ nextGeneration[organism][gene] = currentGeneration[parentOne][gene]; } else { nextGeneration[organism][gene] = currentGeneration[parentTwo][gene]; } } } } // copy the children in nextGeneration into // currentGeneration for(organism=0; organism<NUMBER_ORGANISMS; ++organism){ for(gene=0; gene<NUMBER_GENES; ++gene){ currentGeneration[organism][gene] = nextGeneration[organism][gene]; } } }

How you select organisms for reproduction will determine how effective your GA is. The simplest method is roulette wheel sampling, which we'll use here. Metaphorically, each organism is "assigned" a slice of the roulette wheel. The size of the slice each organism gets is proportional to its fitness. Then, we "spin" the wheel and whichever slice we land on, that organism gets selected.

How we'll implement this is best shown with a visual example. Imagine we have only five organisms with fitnesses of 3, 2, 0, 5, and 2. So the totalOfFitnesses is 12. I'll abbreviate organismsFitnesses below to oF. We "line up" the organisms from left to right as follows:


Notice oF[2] didn't get any slots since its fitness was 0. Anyway, the process is, first we pick a random number from 1 to 12. Then we start adding up organisms from left to right until our running total is as big as that number. Suppose we randomly chose 9. Then we'd start by adding oF[0] to runningTotal to get 3, which isn't as big as than 9 so we keep going. Next we add oF[1] so runningTotal is 5. After adding oF[2] the total is still 5. Finally, when we add oF[3], and runningTotal becomes 10, which is as big as 9, so we select oF[3].

Here's how we implement it:

SelectOneOrganism()'s Code
int SelectOneOrganism(void){ int organism; int runningTotal; int randomSelectPoint; runningTotal = 0; randomSelectPoint = rand() % (totalOfFitnesses + 1); for(organism=0; organism<NUMBER_ORGANISMS; ++organism){ runningTotal += organismsFitnesses[organism]; if(runningTotal >= randomSelectPoint) return organism; } }

The Complete Code
Now that we've seen all the functions, here is the complete program. Here's an insanely dense version.

John LeFlohic
February 24, 1999