Learning | ||

Patrick Doyle | AI Qual Summary | November 8, 19100 |

- Definitions
- Introduction
- Inductive Learning
- PAC Learning
- Connectionist Learning
- Reinforcement Learning
- Genetic Algorithms and Evolutionary Programming
- Learning with Knowledge
- Important Learning Systems
- Timeline and Accomplishments
- Open Problems in Learning
- Sources

**learning**- [Simon] "Changes in the system that are adaptive in the sense that
they enable the system to do the same task or tasks drawn from the
same population more efficiently the next time." Dietterich's is "an
increase in 'knowledge'" where "knowledge" is
*knowledge in principle.* **attribute**- A measure of some important feature of an object. The three kinds
are
*nominal*(a finite set of unordered attributes, e.g., color),*linear*(an attribute with a linearly-ordered set of mutually exclusive values), or*tree structured*(an attribute with a finite set of hierarchically ordered values, e.g., shapes). **analogy**- A mapping of knowledge from one domain into another which conveys that a system of relations holding in the former also holds in the latter.
**bounded inconsistency**- The data is noisy, but only instances close to the target concept boundary are misclassified.
**utility problem**- The more knowledge that is added to a system, the more intelligently it can perform -- but more slowly it works, because it must decide what knowledge to apply.
**concept**- A classification rule that partitions a domain into two parts: those instances that do, and those that do not, satisfy it.
**learning frame problem**- Over time, the learned concept may change; it is needful to detect this and modify the concept.
**within-trial transfer**- Chunking early in the task is already useful later in the task.
**across-task transfer**- Learning of one task improves performance on another.
**speedup****learning**- Learning to improve the system's performance on some task it already knows how to do.
**supervised****learning**- A situation in which both the inputs and the outputs of a component can be observed (e.g., the agent predicts it can stop its motion in 10 feet, and it actually stops in 15). Outputs may be provided by a teacher.
**reinforcement****learning**- The agent is given an evaluation of its action (rear-ending
another car is a bad idea), but not told the correct action. The
reward or punishment is called a
*reinforcement*. **unsupervised****learning**- Learning when there is no information about what the correct outputs are. Unsupervised learners can learn to predict future percepts based upon present ones, but cannot learn which actions to take without a utility function.
**inductive****learning**- A form of supervised learning. Given a set of examples (inputs and their evaluations via some function), the system tries to approximate the evaluation function.
**bias**- A preference for one hypothesis over another. Since in most learning situations there are a variety of possible consistent hypotheses, all learning algorithms have some sort of bias.
**decision trees**- A simple structure for inductive learning. Given an instance of the problem, a decision tree returns a "yes" or "no" decision about it; thus decision trees are Boolean classifiers. Each branching node in the tree represents a test on some aspect of the instance.
**training set**- Used in supervised learning, a training set is a set of problem instances (described as a set of properties and their values), together with a classification of the instance.
**test set**- A set of instances and their classificiations used to test the accuracy of a learned hypothesis. The training set is used to create the hypothesis, and the test set to see how well it works.
**Ockham's razor**- The most likely hypothesis is the simplest one that is consistent with all observations. Propounded by the 13th century philosopher William of Ockham.
**noise**- The appearance in the data of contradictory information, e.g., two identical problem instances with different classifications.
**overfitting**- Finding meaningless regularity in data due to a large number of possible hypotheses.
**version space**- The set of all possible hypotheses consistent with the current data. The idea was introduced by Tom Mitchell in 1977.
**generalization hierarchy**- A hierarchy of increasingly-general predicates, designed to cut down on the number of disjunctions needed to represent a hypothesis.
**PAC****learning**- Probably Approximately Correct learning. A system is PAC if
*Pr[error(f, h) > e] < d*, where*e*is the*accuracy parameter*and*d*is the*confidence parameter*. **Ockham algorithm**- An algorithm capable of finding a consistent hypothesis that achieves a "significant" compression of the data it represents. If an Ockham algorithm exists for a class of concepts, that class is PAC-learnable, and vice versa.
**Kolmogorov complexity**- Also called
*algorithmic complexity*. The complexity of an algorithm is measured by the length of the shortest universal Turing machine program that correctly reproduces the observed data. **connectionist****learning**- Learning in which the data structure is a set of nodes connected by weighted links, each node passing a 0 or 1 to other links depending on whether a function of its inputs reaches its activation level.
**linearly separable functions**- Functions for which a line (
*n*-dimensional plane in an*n+1*-dimensional space) separates members from non-members. **back-propagation**- [Rumelhart] The standard method for learning in multi-layer feed-forward networks, it is identical to the perceptron learning rule except for the addition of a back-propagated error term.
**neural network**- A computational model somewhat similar to the human brain; it has many simple units that work in parallel with no central control. Connections between units are weighted, and these weights can be modified by the learning system.
**epoch**- One time step in a neural network; the network runs, a result is computed, and the weights are adjusted accordingly. In reinforcement learning, the period from the initial state to the final state.
**Bayesian****learning**- Learning that treats the problem of building hypotheses as a particular case of the problem of making predictions. The probabilities of various hypotheses are estimated, and predictions are made using the posterior probabilities of the hypotheses to weight them.
**reinforcement****learning**- A kind of learning in which the agent is given
*rewards*or*punishments*based upon its performance, but is not directly told what the correct actions are. **Q-learning**- Learning an
*action-value*function, that is, learning the expected utility of taking a particular action in a given state. **reward-to-go**- The sum of the rewards from the agent's current state until a terminal state is reached; a concept used in reinforcement learning.
**adaptive dynamic programming**- Adaptive dynamic programming is any kind of reinforcement learning method that works by solving the utility equations using a dynamic programming algorithm.
**bandit problem**- A formal model for decision making. The problem is formulated as a set of arms on a slot machine, each of which has a certain payoff. The question is whether to pull an untried arm or continue to use the one that has paid off best. The goal is to maximize expected utility over the lifetime of the agent.
**input generalization**- Process by which an estimated utility function can be learned as a function of certain aspects of the envirionment, rather than the complete environment.
**explanation-based****learning***Oh, it never occurred to me you could do that.*This deductive learning converts first principles into usable rules. The agent would have been able to deduce this without the observations, if it had thought about it.**relevance-based****learning**- Learning in which background knowledge together with the classifications and descriptions suggests a hypothesis. The agent wouldn't have been able to come to this conclusion without the observations.
**knowledge-based inductive****learning**- Learning in which both the background knowledge and the new hypothesis combine to explain the observations. This inductive learning generates new knowledge.
**memoization**- Accumulating a database of input/output values to avoid having to reason again about an already-solved problem.
**functional dependency**- A logical sentence saying, roughly, that something is a function of another thing, e.g., language is a function of nationality.
**constructive induction algorithms**- Inductive learning algorithms that generate new predicates.
**cumulative****learning**- Learning in which the agent improves the learning ability as more knowledge is acquired.
**VC-dimension**- [Vapnik and Chervonenkis] A measure
of the expressive power of a hypothesis space. It is the size of the
largest set of examples which a hypothesis from the space can fit.
For example, if
*H*is the real-interval space, the VC-dimension is 2, since it cannot accomodate a hypothesis in which only the middle of three consecutive points is negative, while the endpoints are positive._{int} **decision lists**- Essentially, CONS lists. A decision list is a list of pairs
*(function, value)*, the last function of which is*True*. The value of a decision list, given an input, is the first value for which the function of the input is true. Decision lists with conjunctive clauses of size*k*properly include classes like*k*-CNF and*k*-DNF. **incremental****learning**- A kind of learning in which, as more examples are shown over time, the system improves its hypotheses. Version space algorithms are incremental learners; induction of decision trees is not incremental (once the tree is fixed, more information doesn't help).
**generalization-to-n problem**- A problem for explanation-based learning systems; their inability to learn iterative procedures. Various solutions have been proposed, including formation of recursive rules containing the iterate as an argument.

Learning agents consist of four main components. They are:

**learning element**-- the part of the agent responsible for improving its performance**performance element**-- the part that chooses the actions to take**critic**-- tells the learning element how the agent is doing**problem generator**-- suggests actions that could lead to new, informative experiences (suboptimal from the point of view of the performance element, but designed to improve that element)

In designing a learning system, there are four major issues to consider:

**components**-- which parts of the performance element are to be improved**representation**of those components**feedback**available to the system**prior information**available to the system

**All learning can be thought of as learning the representation of
a function.**

There are six main types of learning.

**speed-up****learning**- A type of deductive learning that requires no additional input,
but improves the agent's performance over time. There are two kinds,
*rote learning*and*generalization*(e.g., EBL). Data caching is an example of how it would be used. **learning by taking advice**- Deductive learning in which the system can reason about new information added to its knowledge base. McCarthy proposed the "advice taker" which was such a system, and TEIRESIAS [Davis, 1976] was the first such system.
**learning from examples**- Inductive learning in which concepts are learned from sets of labeled instances.
**clustering**- Unsupervised, inductive learning in which "natural classes" are found for data instances, as well as ways of classifying them. Examples include COBWEB, AUTOCLASS.
**learning by****analogy**- Inductive learning in which a system transfers knowledge from one database into a that of a different domain.
**discovery**- Both inductive and deductive learning in which an agent learns without help from a teacher. It is deductive if it proves theorems and discovers concepts about those theorems; it is inductive when it raises conjectures.

Inductive learning is a kind of learning in which, given a set of
examples an agent tries to estimate or create an evaluation function.
Most inductive learning is supervised learning, in which examples
provided with classifications. (The alternative is
*clustering*.) More formally, an *example* is a pair
*(x, f(x))*, where *x* is the input and *f(x)* is
the output of the function applied to *x*. The task of
*pure inductive inference* (or *induction*) is, given a
set of examples of *f*, to find a hypothesis *h* that
approximates *f*.

A decision tree is a simple inductive learning structure. Given an instance of an object or situation, which is specified by a set of properties, the tree returns a "yes" or "no" decision about that instance. Each internal node in the tree represents a test on one of those properties, and the branches from the node are labeled with the possible outcomes of the test. Each leaf node is a Boolean classifier for the input instance.

From a logical viewpoint, decision trees are attempting to learn
sets of implication sentences; this set is called the *goal
predicate*. Note that these implications can only test
propositions on the input object. However, **any Boolean function
can be represented as a decision tree.**

Note that some Boolean functions are very inefficently represented
by decision trees. The *parity* function is one example; it
requires an exponentially large tree, since it must test all of the
attributes. Another difficult function is the *majority
function*. (I think we could characterize this kind of function
roughly as one that can't be determined on the basis only of parts of
the input; the problem is that the whole thing has to be examined.
The whole point of the decision tree is to find ways that only parts
of the input need to be examined in order to reach a decision. --
POD) The point of the decision tree algorithm is to find a "smallish"
tree that correctly classifies most of the training examples. The
push for a simple tree is inspired by Ockham's razor: the simplest
explanation is likely the right one.

The decision tree algorithm works roughly as follows:

if there are no examples to split on, return a default classification. if all examples have the same classification, return it. otherwise, choose the property with the highest information gain: create a new decision tree rooted on a test of that property. for each possible value of that property: recursively call this algorithm with the subset of examples that match the property on the new tree, and their classification (by majority vote in case of conflict). add a branch to the subtree pointing to the result. return this tree.

The approach is to test the attribute with the highest information gain first, and then those with successively smaller gains, until all examples are classified.

There are two problems that arise in this induction. If a split leads to a branch with no extant examples, its classification is given by some preset default. If a set of identical examples has different classifications (noise), then some method must be used to determine what classification to use; majority-vote is a simple one.

When there is a large set of possible hypotheses, learning
algorithms can find meaningless regularities, such as deciding the day
of the week affects the roll of a die. This problem is called
*overfitting* the data. A related problem is deciding when
adding more levels to the tree will not improve its predictive
accuracy; we do not want a needlessly-complex tree.

A solution to both problems is to use *decision tree
pruning*, in which the statistical relevance of an attribute is
measure as well as its information gain. Using the chi-square test
for stochastic independence, it is possible to determine the
likelihood with which one can reject the hypothesis that the attribute
is independent of the training instances. The algorithm can then be
modified to prevent testing any attribute whose irrelevance cannot be
rejected with a very high probability.

Another approach is to use *cross-validation*, a technique
that tests subsets of the known data to see how well the current
hypothesis would classify them; these results can be averaged over
several hypotheses to see which would produce the best results. It
can be used in conjunction with other solutions to produce better
trees.

There are several ways to expand upon the trees described. One is to allow them to handle missing data, where the values of some attributes in the test set are not known. Several approaches have been suggested: assign a value based upon the distribution of values in other instances, or construct another decision tree just to decide the value. The simplest is to give it the most common value, and in practice this seems to produce nearly as good results as the other methods.

A second extension is to allow multi-valued attributes, beyond Boolean. Yet another is to permit continuous-valued attributes, which would need to be discretized in some way to be used in a decision tree. Another is allowing more than two possible classifications; ID3 does not do this, but later systems do.

Two problems with ID3 that were later solved by Quinlan: the smaller the subtree gets, the harder to intelligently choose an attribute to split on, and decision trees can't represent DNF concepts efficiently. These problems were solved by converting the tree into production rules of the form

Statistically insignificant conditions from the left-hand side are removed, and then entire rules that do not significantly improve performance (determined by comparing classification using rulesets with and without the rule under consideration) are deleted. This can both improve accuracy and dramatically reduce the complexity of the learned concept. (Example of endocrinology, where average 52.4 nodes in a tree were converted to an average of 1.8 rules, and error dropped from 1.8% to 1.3%.)

ID5 [Utgoff, 1988] is an improved version of ID3 that supports incremental learning. It processes the examples one at a time and produces an updated decision tree after each example. The basic idea of growing the tree top-down remains the same, but as the examples are moved through the tree, they are stored at the leaves. If a leaf node comes to have both positive and negative examples, a new subtree is created at that node, splitting on the attribute with the highest information gain for those examples.

EPAM (the Elementary Perceiver And Memorizer) [Feigenbaum, 1961]
was the first system to use decision trees (then called
*discrimination nets*). CLS [Hunt, et al., 1966] used
heuristic lookahead to construct trees. ID3 [Quinlan, 1979] added the
idea of using information content to choose the attribute to split; it
also initially chooses a *window* (a subset) of the training
examples, and tries to learn a concept that will correctly classify
all based on that window; if not, it increases window size. Quinlan
later constructed C4.5, an industral version of ID3. Utgoff's ID5 was
an extension of ID3 that allowed many-valued classifications as well
as incremental learning.

[Michie, 1986] BP designed a system called GASOIL that built gas-oil separation systems for offshore oil platforms; at this time it was the world's leargest commercial expert system.

[Sammut, et al., 1992] A tree was designed to learn to fly a Cessna on a flight simulator. The C4.5 system was used to induce the tree.

A subfield of mathematics founded by Claude Shannon in the late 1940's. In our context, it is used to measure the amount of information available in a result and how more data increases the amount of information.

Information is measured in *bits*, where one bit is enough
information to answer a single yes/no question. In general, if the
possible answers *v _{i}* have probabilities

This is the average information content of the events, weighed by their probabilities.

For decision trees, we want to split on the attibute that gives us
the largest *information gain*; that is, the attribute that
gives us the most information. Gain is defined thus:

where *A* is an attribute, and *p* and *n* are
the numbers of examples classified positively or negatively for that
attribute. Remainder is the amount of information still required
after splitting on that attribute. Splitting will generate *i*
new subsets, each of which has *p _{i}* positive and

We can look at inductive learning generally as trying to find a logical description that is equivalent to the (unknown) evaluation function. By making our hypothesis more or less specific we can tailor it to match the evaluation function.

The way in which we improve the hypothesis is by modifying it when
we encounter *false positives* (false instances we classify as
true) and *false negatives* (true instances we classify as
false). There are two kinds of search to do this.

First described by John Stuart Mill in 1843. The algorithm is extremely simple; if a new example is encountered that our hypothesis misclassifies, then change the hypothesis as follows.

If it is a false positive, *specialize* the hypothesis not
to cover it. This can be done by dropping disjuncts or adding new
terms.

If it is a false negative, *generalize* the hypothesis by
adding disjuncts or dropping terms.

If no consistent specalization/generalization can be found, backtrack.

(Think of the Russell and Norvig example of two sets, the positive instances inside the negative ones, and extending or shrinking the positive set as more data comes in.)

The earliest machine learning system to use this approach was the arch-learning program of [Winston, 1970]. However, it naturally suffers from inefficiency in large search spaces; after every modification the past instances must be checked to ensure they are still classified correctly, and it is difficult to find good search heuristics for generalizing or specializing the definition.

Rather than backtracking, this approach keeps all hypotheses that
are consistent with the data seen so far. As more data becomes
available, this *version space* shrinks. The algorithm for
doing this is called the *candidate elimination* learning
algorithm or the *version space learning* algorithm [Mitchell,
1977], and consists simply of constraining the version space to be
consistent with all data seen so far, updating it as each new instance
is seen.

This is a least-commitment approach since it never favors one
possible hypothesis over another; all remaining hypotheses are
consistent. Note that this method implicitly assumes that there is a
partial ordering on all of the hypotheses in the space, the
*more-specific-than* ordering. A generalization
*G _{1}* is more specific than a generalization

The obvious problem is that this method potentially requires an
enormous number of hypotheses to record. The solution is to use
*boundary sets* to circumscribe the space of possible
hypotheses. The two boundary sets are the *G-set* (the most
general boundary) and the *S-set* (the most specific boundary).
Every member of the G-set is consistent, and there are no more general
consistent hypotheses; every member of the S-set is consistent and
there are no more specific hypotheses.

The algorithm works as follows. Initially, the G-set is simply
*True*, and the S-set *False*. For every new instance,
there are four possible cases:

**false positive for S**-- The hypothesis_{i}*S*is too general, but as it has no consistent specializations, we throw it out._{i}**false negative for S**--_{i}*S*is too specific, so we replace it by all of its immediate generalizations._{i}**false positive for G**--_{i}*G*is too general, so we replace it by all of its immediate specializations._{i}**false negative for G**-- The hypothesis_{i}*G*is too specific, but as it has no consistent generalizations, we throw it out._{i}

This process is repeated until one of three things happens.
Eventually, either there is only one hypothesis left in the version
space (in which case we return it), the version space
*collapses* (either S or G becomes empty), meaning there is no
consistent hypothesis, or we run out of examples and our version space
still has several hypotheses, so we can use their collective
evaluation (breaking disagreements with majority vote).

The main problems with this approach are two. If there is
*any* noise, or insufficient attributes for classification, the
version space will collapse. Also, if unlimited disjunction is
allowed in the hypothesis space, S will contain only the most-specific
hypothesis (the conjunction of the positive examples), and G will
contain only the most-general hypothesis (the negation of the
disjunctions of the negative examples). The latter problem can
partially be solved by using a *generalization hierarchy*.

Such learning systems cannot handle noisy data. One solution is to
maintain several *S* and *G* sets, consistent with
decreasing numbers of training instances.

The pure version-space algorithm was first used in META-DENDRAL [Buchanan and Mitchell, 1978]. It was also used in LEX [Mitchell, 1983] which learned to solve symbolic integration problems.

Strategy | Time Complexity | Space Complexity |
---|---|---|

DFS (current best hypothesis) | O(pn) |
O(p + n) |

Version Space | O(sg(p + n) + s^{2}p + g^{2}n) |
O(s + g) |

In this table, *p* and *n* are the number of positive
and negative training instances, respectively, and *s* and
*g* are the largest sizes of sets *S* and *G*.

The complexity of the depth-first approach stems from the need to reexamine old instances after each revision to the hypothesis; in particular, each time a positive instance forces the hypothesis to be changed, all past negative hypotheses must be examined. Similarly, revising the hypothesis in response to negative instances requires reexamining all positive instances.

In the version space strategy, no training instances need be saved,
so the space complexity is just the largest sizes of *S* and
*G*. Notice that for this strategy, processing time grows
linearly with the number of training instances. However, it grows as
the square of the sizes of the boundary sets.

[Introduced by Valiant, 1984] The field of *computational
learning theory* attempts to provide a theoretical framework for
learning systems. The question is, how can we justify the accuracy of
a hypothesis *h* with respect to some function *f* if we
don't know what *f* is?

The argument is, roughly, that incorrect hypotheses will be
discovered relatively quickly because they will incorrectly identify
instances. Any hypothesis that is consistent with a large enough set
of training examples is unlikely to be seriously wrong. That is, it
is *probably approximately correct*. PAC learning is the part
of computational learning theory that studies this idea.

The key assumption underlying this argument is that the training
set and the test set are randomly drawn from the same population of
examples, and that they are drawn using the *same probability
distribution*. This idea (due to Valiant) is the
*stationarity* assumption. It is required to associate any
future instances with the ones seen so far.

In particular, we would like to bound the likelihood that our hypothesis is not within a certain range of the correct hypothesis. Specifically, we define the error of a hypothesis thus:

(where D is a distribution over the samples). A hypothesis *h*
is called *approximately correct* if *error(h) <= e*.
An approximately correct hypothesis can be thought of as lying within
an *e-ball* of the true hypothesis in the hypothesis space.
All hypothesis outside of this ball are in the set
*H _{bad}*.

We calculate the probability that a hypothesis
*h _{b}* is in

For *H _{bad}* to contain a consistent hypothesis, at
least one of its hypotheses must be consistent. The probability of
this occurring is bounded by the sum of the individual probabilities,
thus

We would like to reduce the probability of this event to some
number *d*, the *confidence parameter*, so that

which we can do if we show

training examples to the system. Note that this approach assumes that
the true function *f* is somewhere in the hypothesis space.
Roughly speaking, by theorem of Blumer, the number of examples
required for learning is proportional to the log of the size of the
hypothesis space.

**Theorem [Blumer et al., 1989]**: A space of hypotheses
*H* is PAC-learnable iff it has finite VC-dimension.

Some examples of PAC-learnable classes in polynomial time:
*k*-CNF, *k*-DNF, *k*-DL, Boolean conjunctions.
In contrast, *k*-term-DNF and *k*-3NN (a 3-layer neural
network) are NP-hard PAC-learning problems. Note that the size of the
hypothesis space is not necessarily the whole problem;
*k*-term-DNF is a proper subset of *k*-CNF, but the
former is not polynomially learnable while the latter is. Note also
that this means in order to prove that learning a concept class is
computationally intractable, it must be shown intractable
*regardless of the representation employed by the learning
algorithm.*

The connectionist approach is inspired by the model of the human
brain as an enormous parallel computer, in which small computational
units feed simple, low-bandwidth signals to one another, and from
which intelligence arises. It attempts to replicate this behavior on
an abstract level with what are called *neural networks*, an
idea introduced by McCullough and Pitts in 1943.

More specifically, a neural network consists of a set of
*nodes* (or *units*), *links* that connect one
node to another, and *weights* associated with each link. Some
nodes receive inputs via links; others directly from the environment,
and some nodes send outputs out of the network. Learning usually
occurs by adjusting the weights on the links.

Each unit has a set of weighted inputs, an *activation
level*, and a way to compute its activation level at the next time
step. This is done by applying an *activation function* to the
weighted sum of the node's inputs. Generally, the weighted sum (also
called the *input function*) is a strictly linear sum, while
the activation function may be nonlinear. If the value of the
activation function is above a *threshold*, the node "fires."

Generally, all nodes share the same activation function and threshold value, and only the topology and weights change.

The two fundamental types of network structure are
*feed-forward* and *recurrent*. A feed-forward network
is a directed acyclic graph; information flows in one direction only,
and there are no cycles. Such networks cannot represent internal
state. Usually, neural networks are also *layered*, meaning
that nodes are organized into groups of layers, and links only go from
nodes to nodes in adjacent layers.

Recurrent networks allow loops, and as a result can represent
state, though they are much more complex to analyze. *Hopfield
networks* and *Boltzmann machines* are examples of
recurrent networks; Hopfield networks are the best understood. All
connections in Hopfield networks are bidirectional with symmetric
weights, all units have outputs of 1 or -1, and the activation
function is the sign function. Also, all nodes in a Hopfield network
are both input and output nodes. Interestingly, it has been shown
that a Hopfield network can reliably recognize *0.138N* training
examples, where *N* is the number of units in the network.

Boltzmann machines allow non-input/output units, and they use a stochastic evaluation function that is based upon the sum of the total weighted input. Boltzmann machines are formally equivalent to a certain kind of belief network evaluated with a stochastic simulation algorithm.

One problem in building neural networks is deciding on the initial
topology, e.g., how many nodes there are and how they are connected.
Genetic algorithms have been used to explore this problem, but it is a
large search space and this is a computationally intensive approach.
The *optimal brain damage* method uses information theory to
determine whether weights can be removed from the network without loss
of performance, and possibly improving it. The alternative of making
the network larger has been tested with the *tiling algorithm*
[Mezard and Nadal, 1989] which takes an approach similar to induction
on decision trees; it expands a unit by adding new ones to cover
instances it misclassified. *Cross-validation* techniques can
be used to determine when the network size is right.

Perceptrons are single-layer, feed-forward networks that were first
studied in the 1950's. They are only capable of learning *linearly
separable* functions. That is, if we view *F* features as
defining an *F*-dimensional space, the network can recognize
any class that involves placing a single hyperplane between the
instances of two classes. So, for example, they can easily represent
**AND**, **OR**, or **NOT**, but cannot represent **XOR**.

Perceptrons learn by updating the weights on their links in
response to the difference between their output value and the correct
output value. The updating rule (due to Frank Rosenblatt, 1960) is as
follows. Define *Err* as the difference between the correct
output and actual output. Then the learning rule for each weight is

where *A* is a constant called the *learning rate*.

Of course, this was too good to last, and in *Perceptrons*
[Minsky and Papert, 1969] it was observed how limited linearly
separable functions were. Work on perceptrons withered, and neural
networks didn't come into vogue again until the 1980's, when
multi-layer networks became the focus.

The standard method for learning in multi-layer feed-forward
networks is *back-propagation* [Bryson and Ho, 1969]. Such
networks have an input layer, and output layer, and one or more
*hidden layers* in between. The difficulty is to divide the
blame for an erroneous output among the nodes in the hidden layers.

The back-propagation rule is similar to the perceptron learning
rule. If *Err _{i}* is the error at the output node,
then the weight update for the link from unit

where *g'* is the derivative of the activation function, and
*a _{j}* is the activation of the unit

This updates the weights leading to the output node. To update the
weights on the interior links, we use the idea that the hidden node
*j* is responsible for part of the error in each of the nodes
to which it connects. Thus the error at the output is divided
according to the strength of the connection between the output node
and the hidden node, and propogated backward to previous layers.
Specifically,

Thus the updating rule for internal nodes is

Lastly, the weight updating rule for the weights from the input layer to the hidden layer is is

where *k* is the input node and *j* the hidden node, and
*I _{k}* is the input value of

A neural network requires *2 ^{n}/n* hidden units to
represent all Boolean functions of

In general, if the number of hidden nodes is too large, the network may learn only the training examples, while if the number is too small it may never converge on a set of weights consistent with the training examples.

Multi-layer feed-forward networks can represent any continuous function with a single hidden layer, and any function with two hidden layers [Cybenko, 1988, 1989].

John Denker remarked that "neural networks are the second best way of doing just about anything." They provide passable performance on a wide variety of problems that are difficult to solve well using other methods.

NETtalk [Sejnowski and Rosenberg, 1987] was designed to learn how to pronounce written text. Input was a seven-character centered on the target character, and output was a set of Booleans controlling the form of the sound to be produced. It learned 95% accuracy on its training set, but had only 78% accuracy on the test set. Not spectacularly good, but important because it impressed many people with the potential of neural networks.

Other applications include a ZIP code recognition [Le Cun et al., 1989] system that achieves 99% accuracy on handwritten codes, and driving [Pomerleau, 1993] in the ALVINN system at CMU. ALVINN controls the NavLab vehicles, and translates inputs from a video image into steering control directions. ALVINN performs exceptionally well on the particular road-type it learns, but poorly on other terrain types. The extended MANIAC system [Jochem et al., 1993] has multiple ALVINN subnets combined to handle different road types.

*(NOTE: As I am largely unfamiliar with belief networks, I would
take the material in this section with a grain of salt, if I were
you. -- POD)*

Bayesian learning maintains a number of hypotheses about the data, each one weighted its posterior probability when a prediction is made. The idea is that, rather than keeping only one hypothesis, many are entertained, and weighted based on their likelihoods.

Of course, maintaining and reasoning with a large number of
hypotheses can be intractable. The most common approximation is to
use a *most probable* hypothesis, that is, an
*H _{i}* of

To find *H _{MAP}*, we apply Bayes' rule:

Since *P(D)* is fixed across the hypotheses, we only need to
maximize the numerator. The first term represents the probability
that this particular data set would be seen, given
*H _{i}* as the model of the world. The second is the
prior probability assigned to the model.

There are four kinds of belief networks, depending upon whether the structure of the network is known or unknown, and whether the variables in the network are observable or hidden.

**known structure, fully observable**-- In this case the only learnable part is the conditional probability tables. These can be estimated directly using the statistics of the sample data set.

**unknown structure, fully observable**-- Here the problem is to reconstruct the network topology. The problem can be thought of as a search through structure space, and fitting data to each structure reduces to the fixed-structure problem, so the MAP or ML probability value can be used as a heuristic in hill-climbing or SA search.

**known structure, hidden variables**-- This is analagous to neural network learning.

**unknown structure, hidden variables**-- When some variables are unobservable, it becomes difficult to apply prior techniques for recovering structure, but they require averaging over all possible values of the unknown variables. No good general algorithms are known for handling this case.

The two formalisms can be compared as representations, as inference systems, and as learning systems.

Both kinds of network are attribute-based representations. Both can handle either discrete or continuous output. The major difference is that belief networks are localized representations, while neural networks are distributed. Belief network nodes represent propositions with clearly defined semantics and relationships to other nodes. In neural networks, nodes generally don't represent specific propositions, and the calculations would not treat them in a semantically-meaningful way. The effect is that human beings can neither construct nor understand neural network representations, where both can be done with belief networks.

Belief networks handle two kinds of activation, both in terms of the values a proposition may take, and the probabilities assigned to each value. Neural network outputs could be values or probabilities, but they cannot handle both simultaneously.

Trained feed-forward neural network inference can execute in linear time, where in belief networks inference is NP-hard. However, a neural network may have to be exponentially larger to represent the same things that a belief network can.

As for learning, belief networks have the advantage of being easier
to give prior knowledge; also, since they represent propositions
locally, it may be easier for them to converge, since they are
directly affected only by a small number of other propositions.

As opposed to *supervised learning*, reinforcement learning
takes place in an environment where the agent cannot directly compare
the results of its action to a desired result. Instead, it is given
some *reward* or *punishment* that relates to its
actions. It may win or lose a game, or be told it has made a good
move or a poor one. The job of reinforcement learning is to find a
successful function using these rewards.

The reason reinforcement learning is harder than supervised learning is that the agent is never told what the right action is, only whether it is doing well or poorly, and in some cases (such as chess) it may only receive feedback after a long string of actions.

There are two basic kinds of information an agent can try to learn.

**utility function**-- The agent learns the utility of being in various states, and chooses actions to maximize the expected utility of their outcomes. This requires the agent keep a model of the environment.

**action-value**-- The agent learns an action-value function giving the expected utility of performing an action in a given state. This is called*Q-learning*. This is the*model-free*approach.

Assuming an environment consisting of a set of states, some
terminal and some nonterminal, and a model that specifies the
probabilities of transition from state to state, an agent learns
passively by observing a set of *training sequences*, which
consist of a set of state transitions followed by a reward.

The goal is to use the reward information to learn the expected
utility of each of the nonterminal states. **An important
simplifying assumption is that the utility of a sequence is the sum of
the rewards accumulated in the states of the sequence**. That is,
the utility function is *additive*.

A passive learning agent keeps an estimate *U* of the
utility of each state, a table *N* of how many times each state
was seen, and a table *M* of transition probabilities. There
are a variety of ways the agent can update its table *U*.

One simple updating method is the *least mean squares* (LMS)
approach [Widrow and Hoff, 1960]. It assumes that the observed
reward-to-go of a state in a sequence provides direct evidence of the
actual reward-to-go. The approach is simply to keep the utility as a
running average of the rewards based upon the number of times the state
has been seen. This approach minimizes the mean square error with
respect to the observed data.

This approach converges very slowly, because it ignores the fact
that **the actual utility of a state is the probability-weighted
average of its successors' utilities, plus its own reward**. LMS
disregards these probabilities.

If the transition probabilities and the rewards of the states are known (which will usually happen after a reasonably small set of training examples), then the actual utilities can be computed directly as

where *U(i)* is the utility of state *i*, *R* is
its reward, and *M _{ij}* is the probability of
transition from state

[Richard Sutton] Temporal difference learning uses the difference
in utility values between successive states to adjust them from one
epoch to another. The key idea is to use the observed transitions to
adjust the values of the observed states so that they agree with the
ADP constraint equations. Practically, this means updating the
utility of state *i* so that it agrees better with its
successor *j*. This is done with the
*temporal-difference* (TD) equation:

where *a* is a *learning rate* parameter.

Temporal difference learning is a way of approximating the ADP constraint equations without solving them for all possible states. The idea generally is to define conditions that hold over local transitions when the utility estimates are correct, and then create update rules that nudge the estimates toward this equation.

This approach will cause *U(i)* to converge to the correct
value if the learning rate parameter decreases with the number of
times a state has been visited [Dayan, 1992]. In general, as the
number of training sequences tends to infinity, TD will converge on
the same utilities as ADP.

Since neither temporal difference learning nor LMS actually use the
model *M* of state transition probabilities, they will operate
unchanged in an unknown environment. The ADP approach, however,
updates its estimated model of an unknown environment after each step,
and this model is used to revise the utility estimates.

Any method for learning stochastic functions can be used to learn
the environment model; in particular, in a simple environment the
transition probability *M _{ij}* is just the percentage
of times state

The basic difference between TD and ADP is that TD adjusts a state to agree with the observed successor, while ADP makes a state agree with all successors that might occur, weighted by their probabilities. More importantly, ADP's adjustments may need to be propagated across all of the utility equations, while TD's affect only the current equation. TD is essentially a crude first approximation to ADP.

A middle-ground can be found by bounding or ordering the number of
adjustments made in ADP, beyond the simple one made in TD. The
*prioritized-sweeping* heuristic prefers only to make
adjustments to states whose *likely* successors have just
undergone *large* adjustments in their utility estimates. Such
approximate ADP systems can be very nearly as efficient as ADP in
terms of convergence, but operate much more quickly.

The difference between active and passive agents is that passive
agents learn a fixed policy, while the active agent must decide what
action to take and how it will affect its rewards. To represent an
active agent, the environment model *M* is extended to give the
probability of a transition from a state *i* to a state
*j*, *given an action a*. Utility is modified to be the
reward of the state plus the maximum utility expected depending upon
the agent's action:

An ADP agent is extended to learn transition probabilities given actions; this is simply another dimension in its transition table. A TD agent must similarly be extended to have a model of the environment.

An action-value function assigns an expected utility to the result
of performing a given action in a given state. If *Q(a, i)* is
the value of doing action *a* in state *i*, then

The equations for Q-learning are similar to those for state-based learning agents. The difference is that Q-learning agents do not need models of the world. The equilibrium equation, which can be used directly (as with ADP agents) is

The temporal difference version does not require that a model be learned; its update equation is

The first significant reinforcement learning system was used in Arthur Samuel's checker-playing program. It used a weighted linear function to evaluate positions, though it did not use observed rewards in its learning process.

TD-gammon [Tesauro, 1992] has an evaluation function represented by a fully-connected neural network with one hidden layer of 80 nodes; with the inclusion of some precomputed board features in its input, it was able to reach world-class play after about 300,000 training games.

A case of reinforcement learning in robotics is the famous
*cart-pole* balancing problem. The problem is to control the
position of the cart (along a single axis) so as to keep a pole
balanced on top of it upright, while staying within the limits of the
track length. Actions are usually to jerk left or right, the
so-called *bang-bang control* approach. The first work on this
problem was the BOXES system [Michie and Chambers, 1968], in which
state space was partitioned into boxes, and reinforcement propogated
into the boxes. More recent simulated work using neural networks
[Furuta et al., 1984] simulated the *triple-inverted-pendulum*
problem, in which three poles balance one atop another on a cart.

Genetic algorithms are inspired by natural evolution. In the natural world, organisms that are poorly suited for an environment die off, while those well-suited for it prosper. Children are much like their parents, though occasionally mutations occur; most are harmful, but rarely there is an improvement in the strain. Genetic algorithms simulate this behavior.

Genetic algorithms search the space of individuals for good
candidates. The "goodness" of an individual is measured by some
*fitness function*. Search takes place in parallel, with many
individuals in each generation. The approach is a hill-climbing one,
since in each generation the offspring of the best candidates are
preserved.

In the standard genetic algorithm approach, each individual is a
bit-string that encodes its characteristics. Each element of the
string is called a *gene*, so the string is likened to a strand
of DNA.

The algorithm consists of looping through generations. In each
generation, a subset of the population is selected to reproduce;
usually this is a random selection in which the probability of choice
is proportional to fitness. Selection is usually done with
replacement (so a fit individual may reproduce many times).
Reproduction occurs by randomly pairing all of the individuals in the
selection pool, and then generating two new individuals by performing
*crossover*, in which the initial *n* bits (where
*n* is random) of the parents are exchanged. There is a small
chance that one of the genes in the resulting individuals will
*mutate* to a new value.

Genetic algorithms are broadly applicable and have the advantage that they require little knowledge encoded in the system. However, as might be expected from a knowledge-poor approach, they give very poor performance on some problems.

*Evolutionary programming* is usually taken to mean a more
complex form of genetic programming in which the individuals are more
complex structures -- for example, sections of program code -- and the
crossover and mutation operations correspondingly more elaborate.

*Classifier systems* are genetic programs that develop rules
for classifying input instances. Each rule is weighted, and has some
set of feature patterns as a condition and some pattern as an action.
On each cycle, the rules that match make bids proportional to their
weights. One or more rules are then selected for application with
probabilities based on their bids. Weights are adjusted according to
the contribution of the rules to desired behavior; the *bucket
brigade* algorithm [Holland, 1985] is an effective method of doing
this.

By considering the kinds of logical constraints placed upon
different kinds of knowledge-based learning, we can classify them more
clearly. Examples are composed of *Descriptions* and
*Classifications*, and we are trying to find a
*Hypothesis* to explain the data.

Inductive learning can be characterized by the following
*entailment constraint*:

That is, given our hypothesis and descriptions of problem instances, we want to generate classifications. This is inductive learning. Other kinds of learning that use prior knowledge are described below.

This kind of learning occurs when the system finds an explanation of an instance it has seen, and generalizes the explanation. The general rule follows logically from the background knowledge possessed by the system. The entailment constraints for EBL are

Background |= Hypothesis

The agent does not actually learn anything factually new, since the hypothesis was entailed by background knowledge. This kind of learning is regarded as a way to convert first principles into useful specialized knowledge (converting problem-solving search into pattern-matching search).

The basic idea is to construct an explanation of the observed result, and then generalize the explanation. More specifically, while constructing a proof of the solution, a parallel proof is performed, in which each constant of the first is made into a variable. Then a new rule is built in which the left-hand side is the leaves of the proof tree, and the right-hand side is the variabilized goal, up to any bindings that must be made with the generalized proof. Any conditions true regardless of the variables are dropped.

Note that by pruning the tree before the leaves, even more general
rules may be learned. However, the more general, the more computation
may be required to apply the rule. One approach is to require the
*operationality* of the subgoals in the new rule -- that they
be "easy" to solve.

This is a kind of learning in which background knowledge relates
the *relevance* of a set of features in an instance to the
general goal predicate. For example, if I see men in the Forum in
Rome speaking Latin, and I know that if seeing someone in a city
speaking a language usually means all people in the city speak that
language, I can conclude Romans speak Latin.

In general, background knowledge, together with the observations, allows the agent to form a new, general rule to explain the observations. The entailment constraint for RBL is

Background ^ Descriptions ^ Classifications |= Hypothesis

This is a deductive form of learning, because it cannot produce hypotheses that go beyond the background knowledge and observations.

We presume that our knowledge base has a set of *functional
dependencies* or *determiners* that support the
construction of hypotheses. The learning algorithm then tries to find
the minimal consistent determination (e.g., a sentence of the form "P
determines Q," meaning that if the examples match on P they match on
Q).

This is a kind of learning in which our background knowledge,
together with our observations, lead us to make a hypothesis that
explains the examples we see. If I see the Old Man from Scene 24 on
the Bridge of Despair, and notice that he asks a simple question of
every *other* knight that attempts to cross, I can hypothesize
that only the odd-numbered knights are able to cross the Gorge of
Eternal Peril. The entailment constraint in this case is

Such knowledge-based inductive learning has been studied mainly in
the field of *inductive logic programming*. Such systems
reduce learning complexity in two ways. First, by requiring all new
hypotheses to be consistent with existing knowledge, they reduce the
search space of hypotheses. Secondly, the more prior knowledge
available, the less new knowledge required in the hypothesis to
explain the observations.

Attribute-based learning algorithms are incapable of learning predicates. One of the advantages of ILP algorithms is their much broader range of applicability.

Inverse resolution is based upon the fact that if the example
*Classifications* follow from the *Background ^ Hypothesis ^
Descriptions*, then it must be possible to prove this by
resolution. The goal is to invert the resolution process in order to
find the initial hypothesis.

An inverse resolution step takes a resolvent *C*, and
produces two clauses *C _{1}* and

This allows a very large search space, since the resolving clauses could be taken from anywhere; from the hypotheses, the example descriptions, the negated classifications, or already-generated clauses. Possible constraints include using restricted resolution strateiges, and making sure that introduced clauses are consistent with the observations.

Inverted resolution strategies do, in principle, have the ability to discover first-order theories, though until effective means of handling the search space are found this is a difficult problem. However, they will invent new predicates through the process of resolution, which can help to reduce the size of the goal predicate.

Examples of ILP systems using inverse resolution are GOLEM [Muggleton, 1992], which generated predictions of protein structure from sequence data; it was also used to predict the theraputic use of drugs based on their molecular structures [King, 1992]. This work is similar to META-DENDRAL's, but the problem domains are considerably more complex, and GOLEM is entirely general purpose.

The other approach to ILP is essentially a generalization of decision tree learning to first-order logic. The idea is to start with a very general rule and specialize it gradually so it fits the data. In the first-order case, first-order literals are used instead of attributes, and the hypothesis is a set of clauses rather than a decision tree.

The following is a description of the approach used in FOIL [Quinlan, 1990], one of the early systems to do this.

First, we are given a predicate we want to learn (e.g.,
*Grandfather(x, y)*), and a set of training examples. We
create a Horn clause, initially empty, that implies our predicate. We
then add new clauses, one at a time, each one giving us the best
increase in accuracy in classifying the examples. This continues
until we have a clause that agrees with some subset of the positive
examples and none of the negative examples. Then the positive
examples are removed from the training set as they are satisfied,
until no positive examples remain.

The two kinds of choices are: how to generate new literals to add to the clause, and how to choose which literal to add.

There are three kinds of literals that can be created, literals using predicates, equality and inequality literals, and arithmetic comparisons. In the first case, the literals can be negated or unnegated, any existing predicates can be used, and all arguments are variables, but must use at least one variable from the head of the clause or an earlier literal. (Notice that FOIL can learn recursive definitions since it can use the predicate from the head of the clause.)

Naturally this leads to a large branching space. Type information
may be used to restrict the space, so, for example, the system may
know that a literal such as *Parent(x, n)* where *x* is
a person and *n* a number makes no sense.

Choosing a literal uses a heuristic process similar to information gain.

FOIL and its descendants have been used to learn a variety of
complex definitions; a particular example is learning definitions from
a Prolog textbook given a small set of examples and the
previously-learned definitions [Quinlan and Cameron-Jones, 1993].

(For a more comprehensive list, see Ronny Kohavi's learning summary.)

[Hunt, 1966]. Concept Learning System. CLS was a learning algorithm that learned concepts and used them to classify new cases. CLS was the precursor to decision trees; it lead to Quinlan's ID3 system.

[Feigenbaum, 1963] Elementary Perceiver and Memorizer. It memorized pairs of nonsense syllables by using discrimination nets in which it could find images of syllabues it had seen. It stored as its cue only enough information to disambiguate the syllable from others seen at the time the association was formed. Thus old cues might be inadequate to retrieve data later, so the system could "forget."

[Arthur Samuel, 1947-67] The program used minimax search and a static evaulation in the game. It rote-learned positions by their minimax values. It could thus reuse the value in other games where the situation reappeared. He used a least-recently-used approach to caching.

To learn, the system used sixteen features multiplied by weights. The weights changed through learning. The program actually knew 38 features, but only used 16 for evaluation at any given time; occasionally one was deleted and replaced by another in the evaluation function.

[Winston, 1975] An early structural learning program that used semantic nets to describe block structures. Matching and hill climbing allowed the program to learn more complex structures; one major problem was that a teacher had to guide the program through a set of helpful examples.

[Mitchell and Utgoff] A concept learning program that created and refined heuristics that suggested whether a given operator would be applied to a given problem state in a forward-search problem solver. Each operator had associated with it a set of states in which it "should" be applied.

LEX's domain was integral calculus. It used candidate elimination to determine good and bad instances of operator applications The algorithm refined the version spaces of the operators. Multiple boundary sets were used to handle noise.

- Through hand coding by programmers.
- Through a static analysis of the domain's operators.
- Through looking at traces of its own problem-solving behavior.

PRODIGY learns control rules from experience, but *unlike SOAR
it also learns from its failures*. If it pursues a path that
fails, it will try to find an explanation of why it failed, and use
that to build control knowledge to avoid such paths in the future.

PRODIGY keeps a utility measure for each control rule. This takes into account average savings provided by the rule, frequency of use, and cost of matching. If a proposed rule has negative utility it is discarded.

[Cheeseman, 1988] Given a set of training cases, it hypothesizes a set of classes. For any given case it provides probabilities to predict the class into which it will fall. AUTOCLASS found meaningful new classes of stars from infra-red spectral data.

A connectionist, hardware implementation of a semantic network using local representations. Each node is a simple processing unit, capable of storing a few single-bit markers and performing simple Boolean operations on them. Each link is also a simple processor, wired to two or more units. Link units can also perform Boolean operations, which usually consist of passing a marker from one node to another.

[Lenat, 1977] Amateur Mathematician worked from basic concepts of set theory to discover a fair amount of standard number theory. It used a variety of general-purpose AI techniques, including a frame system to represent concepts. One of AM's major tasks was to create new concepts and fill their slots.

An agenda controls the discovery process. AM operates in cycles, each cycle choosing the most promising task on the agenda. AM discovered the concept of prime numbers, the Unique Factorization Theorem, Goldbach's Conjecture (even numbers greater than 2 are the sum of two primes).

AM's performance was limited by its static heuristics; the main source of its power was the large set of heuristics about what things are "interesting." AM has an implicit bias toward learning number theory concepts.

[Langley, 1981] A model of data-driven scientific discovery. BACON creates proportionalities in order to derive relations between data values. It uses other heuristics to postulate intrinsic properties of objects and to reason by analogy.

It "discovered" Kepler's third law, Ohm's law, Joule's law, and
conservation of momentum. There were several versions of BACON; early
versions had trouble dealing with noise in the data, and were sensitive to
the order in which data were presented. (See *Handbook of Artificial
Intelligence*, vol. III, for details.)

Date | Learning System | Authors | Significant Contributions |
---|---|---|---|

1958 | Perceptrons | Rosenblatt | introduced LTU, perceptron learning rule |

1959-1967 | Checkers Program | Samuel | minimax search, weights on heuristic evaluation function, changing terms, rote-learning |

1963 | EPAM | Feigenbaum | first computational model of many human verbal-learning behaviors (discrimination nets) |

1966 | CLS | Hunt | precursor of decision trees (lead to ID3) |

1969 | Perceptrons book |
Minsky and Papert | proved that perceptrons can only learn linearly-separable functions |

1971 | STRIPS MACROPs | Fikes and Nilsson | earliest EBL |

1973 | HACKER | Sussman | ? |

1975 | Winston's Learning Program | Winston | early structural learner, used semantic nets, hill climbing |

1978 | Meta-DENDRAL | Buchanan and Mitchell | automatic theory formation; introduced version spaces |

1979 | ID3 | Quinlan | decision tree learning system, uses information gain |

1983 | LEX | Mitchell | heuristic learner, uses candidate elimination, multiple boundary sets in version space |

1989 | PRODIGY | Minton | learns from success and failure, utility measures on control rules |

1990's | ILP | Quinlan, et al. | learn general first-order theories |

**choice of good generalization languages**-- bad choices lead to too large a search space, an inability to find the concept, or exponential resource consumption; good choices lead to good results, but it isn't clear how to design "good" languages in general

**using expectations and prior knowledge**-- generalizations can be acceptable wrt training instances, but also acceptable wrt prior expectations (e.g., model-based constraints built into the language); the problem is to find ways to combine prior knowledge with training data to constrain learning

**inconsistency**-- problems when (1) the generalization language can't describe the target generalization or (2) the training instances contain errors

**partially learned generalizations**-- in general cases, unlikely the training data will determine a unique generalization; problem is how to cope with this

Russell and Norvig, *Artificial Intelligence: A Modern
Approach*, Chapters 18-21.

Kohavi, *AI Qual Note #19: Learning*.

Mitchell, *Generalization as Search*.

`Dietterich, Machine Learning.
`

See the AI Qual Reading List for further details on these sources.

Patrick Doyle | pdoyle@cs.stanford.edu |