Learning
Patrick Doyle AI Qual Summary, November 8, 2000


Definitions

learning
[Simon] "Changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently the next time." Dietterich's is "an increase in 'knowledge'" where "knowledge" is knowledge in principle.


attribute
A measure of some important feature of an object. The three kinds are nominal (a finite set of unordered values, e.g., color), linear (a linearly ordered set of mutually exclusive values), and tree-structured (a finite set of hierarchically ordered values, e.g., shapes).


analogy
A mapping of knowledge from one domain into another which conveys that a system of relations holding in the former also holds in the latter.


bounded inconsistency
The data is noisy, but only instances close to the target concept boundary are misclassified.


utility problem
The more knowledge that is added to a system, the more intelligently it can perform -- but the more slowly it works, because it must decide which knowledge to apply.


concept
A classification rule that partitions a domain into two parts: those instances that do, and those that do not, satisfy it.


learning frame problem
Over time, the learned concept may change; the system must detect this and modify the concept accordingly.


within-trial transfer
Chunking early in the task is already useful later in the task.


across-task transfer
Learning of one task improves performance on another.


speedup learning
Learning to improve the system's performance on some task it already knows how to do.


supervised learning
A situation in which both the inputs and the outputs of a component can be observed (e.g., the agent predicts it can stop its motion in 10 feet, and it actually stops in 15). Outputs may be provided by a teacher.


reinforcement learning
The agent is given an evaluation of its action (rear-ending another car is a bad idea), but not told the correct action. The reward or punishment is called a reinforcement.


unsupervised learning
Learning when there is no information about what the correct outputs are. Unsupervised learners can learn to predict future percepts based upon present ones, but cannot learn which actions to take without a utility function.


inductive learning
A form of supervised learning. Given a set of examples (inputs and their evaluations via some function), the system tries to approximate the evaluation function.


bias
A preference for one hypothesis over another. Since in most learning situations there are a variety of possible consistent hypotheses, all learning algorithms have some sort of bias.


decision trees
A simple structure for inductive learning. Given an instance of the problem, a decision tree returns a "yes" or "no" decision about it; thus decision trees are Boolean classifiers. Each branching node in the tree represents a test on some aspect of the instance.


training set
Used in supervised learning, a training set is a set of problem instances (described as a set of properties and their values), together with a classification of the instance.


test set
A set of instances and their classifications used to test the accuracy of a learned hypothesis. The training set is used to create the hypothesis, and the test set to see how well it works.


Ockham's razor
The most likely hypothesis is the simplest one that is consistent with all observations. Propounded by the 14th-century philosopher William of Ockham.


noise
The appearance in the data of contradictory information, e.g., two identical problem instances with different classifications.


overfitting
Finding meaningless regularity in data due to a large number of possible hypotheses.


version space
The set of all possible hypotheses consistent with the current data. The idea was introduced by Tom Mitchell in 1977.


generalization hierarchy
A hierarchy of increasingly-general predicates, designed to cut down on the number of disjunctions needed to represent a hypothesis.


PAC learning
Probably Approximately Correct learning. A system is PAC if Pr[error(f, h) > e] < d, where e is the accuracy parameter and d is the confidence parameter.


Ockham algorithm
An algorithm capable of finding a consistent hypothesis that achieves a "significant" compression of the data it represents. If an Ockham algorithm exists for a class of concepts, that class is PAC-learnable, and vice versa.


Kolmogorov complexity
Also called algorithmic complexity. The complexity of a data set is measured by the length of the shortest program for a universal Turing machine that correctly reproduces the observed data.


connectionist learning
Learning in which the data structure is a set of nodes connected by weighted links, each node passing a 0 or 1 along its outgoing links depending on whether a function of its inputs reaches its activation threshold.


linearly separable functions
Functions for which a line (n-dimensional plane in an n+1-dimensional space) separates members from non-members.


back-propagation
[Rumelhart] The standard method for learning in multi-layer feed-forward networks, it is identical to the perceptron learning rule except for the addition of a back-propagated error term.


neural network
A computational model somewhat similar to the human brain; it has many simple units that work in parallel with no central control. Connections between units are weighted, and these weights can be modified by the learning system.


epoch
One cycle through the training examples of a neural network: the network is run on the examples, the results are computed, and the weights are adjusted accordingly. In reinforcement learning, the period from the initial state to the final state.


Bayesian learning
Learning that treats the problem of building hypotheses as a particular case of the problem of making predictions. The probabilities of various hypotheses are estimated, and predictions are made using the posterior probabilities of the hypotheses to weight them.


reinforcement learning
A kind of learning in which the agent is given rewards or punishments based upon its performance, but is not directly told what the correct actions are.


Q-learning
Learning an action-value function, that is, learning the expected utility of taking a particular action in a given state.


reward-to-go
The sum of the rewards from the agent's current state until a terminal state is reached; a concept used in reinforcement learning.


adaptive dynamic programming
Adaptive dynamic programming is any kind of reinforcement learning method that works by solving the utility equations using a dynamic programming algorithm.


bandit problem
A formal model for decision making. The problem is formulated as a set of arms on a slot machine, each of which has a certain payoff. The question is whether to pull an untried arm or continue to use the one that has paid off best. The goal is to maximize expected utility over the lifetime of the agent.


input generalization
Process by which an estimated utility function can be learned as a function of certain aspects of the environment, rather than the complete environment.


explanation-based learning
Deductive learning that converts first principles into usable rules ("Oh, it never occurred to me you could do that"). The agent would have been able to deduce the result without the observations, if it had thought about it.


relevance-based learning
Learning in which background knowledge together with the classifications and descriptions suggests a hypothesis. The agent wouldn't have been able to come to this conclusion without the observations.


knowledge-based inductive learning
Learning in which both the background knowledge and the new hypothesis combine to explain the observations. This inductive learning generates new knowledge.


memoization
Accumulating a database of input/output values to avoid having to reason again about an already-solved problem.


functional dependency
A logical sentence saying, roughly, that something is a function of another thing, e.g., language is a function of nationality.


constructive induction algorithms
Inductive learning algorithms that generate new predicates.


cumulative learning
Learning in which the agent improves the learning ability as more knowledge is acquired.


VC-dimension
[Vapnik and Chervonenkis] A measure of the expressive power of a hypothesis space: the size of the largest set of examples that hypotheses from the space can fit under every possible labeling (i.e., shatter). For example, if Hint is the space of intervals on the real line, the VC-dimension is 2, since no interval can accommodate a labeling in which only the middle of three consecutive points is negative while the endpoints are positive.


decision lists
Essentially, CONS lists. A decision list is a list of pairs (function, value), the last function of which is True. The value of a decision list, given an input, is the first value for which the function of the input is true. Decision lists with conjunctive clauses of size k properly include classes like k-CNF and k-DNF.


incremental learning
A kind of learning in which, as more examples are shown over time, the system improves its hypotheses. Version space algorithms are incremental learners; induction of decision trees is not incremental (once the tree is fixed, more information doesn't help).


generalization-to-n problem
A problem for explanation-based learning systems; their inability to learn iterative procedures. Various solutions have been proposed, including formation of recursive rules containing the iterate as an argument.




Introduction

The Basic Learning Model

Learning agents consist of four main components: the performance element, which takes in percepts and decides on actions; the learning element, which makes improvements to the performance element; the critic, which tells the learning element how well the agent is doing; and the problem generator, which suggests actions that will lead to new and informative experiences.

In designing a learning system, there are four major issues to consider: which components of the performance element are to be improved, what representation is used for those components, what feedback is available, and what prior knowledge the agent already has.

All learning can be thought of as learning the representation of a function.

Types of Learning

There are six main types of learning.

speed-up learning
A type of deductive learning that requires no additional input, but improves the agent's performance over time. There are two kinds, rote learning and generalization (e.g., EBL). Data caching is an example of how it would be used.


learning by taking advice
Deductive learning in which the system can reason about new information added to its knowledge base. McCarthy proposed the "advice taker" which was such a system, and TEIRESIAS [Davis, 1976] was the first such system.


learning from examples
Inductive learning in which concepts are learned from sets of labeled instances.


clustering
Unsupervised, inductive learning in which "natural classes" are found for data instances, as well as ways of classifying them. Examples include COBWEB, AUTOCLASS.


learning by analogy
Inductive learning in which a system transfers knowledge from one domain into a different domain.


discovery
Both inductive and deductive learning in which an agent learns without help from a teacher. It is deductive if it proves theorems and discovers concepts about those theorems; it is inductive when it raises conjectures.



Inductive Learning

Inductive learning is a kind of learning in which, given a set of examples, an agent tries to estimate or create an evaluation function. Most inductive learning is supervised learning, in which examples are provided along with their classifications. (The alternative is clustering.) More formally, an example is a pair (x, f(x)), where x is the input and f(x) is the output of the function applied to x. The task of pure inductive inference (or induction) is, given a set of examples of f, to find a hypothesis h that approximates f.

Decision Trees

A decision tree is a simple inductive learning structure. Given an instance of an object or situation, which is specified by a set of properties, the tree returns a "yes" or "no" decision about that instance. Each internal node in the tree represents a test on one of those properties, and the branches from the node are labeled with the possible outcomes of the test. Each leaf node is a Boolean classifier for the input instance.

From a logical viewpoint, decision trees are attempting to learn sets of implication sentences; this set is called the goal predicate. Note that these implications can only test propositions on the input object. However, any Boolean function can be represented as a decision tree.

Note that some Boolean functions are very inefficiently represented by decision trees. The parity function is one example; it requires an exponentially large tree, since it must test all of the attributes. Another difficult function is the majority function. (I think we could characterize this kind of function roughly as one that can't be determined on the basis only of parts of the input; the problem is that the whole thing has to be examined. The whole point of the decision tree is to find ways that only parts of the input need to be examined in order to reach a decision. -- POD) The point of the decision tree algorithm is to find a "smallish" tree that correctly classifies most of the training examples. The push for a simple tree is inspired by Ockham's razor: the simplest explanation is likely the right one.

Algorithm

The decision tree algorithm works roughly as follows:

if there are no examples to split on, return a default classification.
if all examples have the same classification, return it.
otherwise, choose the property with the highest information gain:
   create a new decision tree rooted on a test of that property.
   for each possible value of that property:
      recursively call this algorithm on the subset of examples that have
         that value for the property, passing the majority classification
         of the current examples as the default (to settle conflicts).
      add a branch for that value to the new tree, pointing to the result.
   return this tree.

The approach is to test the attribute with the highest information gain first, and then those with successively smaller gains, until all examples are classified.

There are two problems that arise in this induction. If a split leads to a branch with no extant examples, its classification is given by some preset default. If a set of identical examples has different classifications (noise), then some method must be used to determine what classification to use; majority-vote is a simple one.
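The following is a minimal Python sketch of this procedure (my own, not part of the original notes). Examples are assumed to be (attribute-dictionary, label) pairs with discrete attribute values, and the information-gain computation follows the formulas given in the Information Theory section below.

import math
from collections import Counter

def entropy(labels):
    # I(p1, ..., pn) = SUM of -p_i log2 p_i over the label distribution.
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attr):
    # Gain(A) = I(before the split) - Remainder(A).
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {ex[attr] for ex, _ in examples}:
        subset = [label for ex, label in examples if ex[attr] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, attributes, default=False):
    # Returns either a classification or an (attribute, {value: subtree}) pair.
    if not examples:
        return default
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:           # identical descriptions with conflicting labels (noise)
        return majority
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for value in {ex[best] for ex, _ in examples}:
        subset = [(ex, lab) for ex, lab in examples if ex[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best],
                                     default=majority)
    return (best, branches)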

Overfitting and Diminishing Utility

When there is a large set of possible hypotheses, learning algorithms can find meaningless regularities, such as deciding the day of the week affects the roll of a die. This problem is called overfitting the data. A related problem is deciding when adding more levels to the tree will not improve its predictive accuracy; we do not want a needlessly-complex tree.

A solution to both problems is to use decision tree pruning, in which the statistical relevance of an attribute is measured as well as its information gain. Using the chi-square test for stochastic independence, it is possible to determine the likelihood with which one can reject the hypothesis that the attribute is independent of the classification. The algorithm can then be modified to prevent testing any attribute whose irrelevance cannot be rejected with a very high probability.

Another approach is to use cross-validation, a technique that tests subsets of the known data to see how well the current hypothesis would classify them; these results can be averaged over several hypotheses to see which would produce the best results. It can be used in conjunction with other solutions to produce better trees.

Extensions of Basic Decision Trees

There are several ways to expand upon the trees described. One is to allow them to handle missing data, where the values of some attributes in the test set are not known. Several approaches have been suggested: assign a value based upon the distribution of values in other instances, or construct another decision tree just to decide the value. The simplest is to give it the most common value, and in practice this seems to produce nearly as good results as the other methods.

A second extension is to allow multi-valued attributes, beyond Boolean. Yet another is to permit continuous-valued attributes, which would need to be discretized in some way to be used in a decision tree. Another is allowing more than two possible classifications; ID3 does not do this, but later systems do.

Two problems with ID3 were later solved by Quinlan: as a subtree gets smaller, it becomes harder to choose a good attribute to split on, and decision trees cannot represent DNF concepts efficiently. These problems were addressed by converting the tree into production rules of the form

f1 ^ f2 ^ ... ^ fk -> class.

Statistically insignificant conditions from the left-hand side are removed, and then entire rules that do not significantly improve performance (determined by comparing classification using rulesets with and without the rule under consideration) are deleted. This can both improve accuracy and dramatically reduce the complexity of the learned concept. (In an endocrinology example, trees averaging 52.4 nodes were converted to an average of 1.8 rules, and the error rate dropped from 1.8% to 1.3%.)

ID5 [Utgoff, 1988] is an improved version of ID3 that supports incremental learning. It processes the examples one at a time and produces an updated decision tree after each example. The basic idea of growing the tree top-down remains the same, but as the examples are moved through the tree, they are stored at the leaves. If a leaf node comes to have both positive and negative examples, a new subtree is created at that node, splitting on the attribute with the highest information gain for those examples.

History

EPAM (the Elementary Perceiver And Memorizer) [Feigenbaum, 1961] was the first system to use decision trees (then called discrimination nets). CLS [Hunt, et al., 1966] used heuristic lookahead to construct trees. ID3 [Quinlan, 1979] added the idea of using information content to choose the attribute to split on; it also initially chooses a window (a subset) of the training examples and tries to learn a concept that will correctly classify them all; if it cannot, it increases the window size. Quinlan later constructed C4.5, an industrial version of ID3. Utgoff's ID5 was an extension of ID3 that allowed many-valued classifications as well as incremental learning.

Applications of Decision Trees

[Michie, 1986] BP designed a system called GASOIL that built gas-oil separation systems for offshore oil platforms; at the time it was the world's largest commercial expert system.

[Sammut, et al., 1992] A tree was designed to learn to fly a Cessna on a flight simulator. The C4.5 system was used to induce the tree.

Information Theory

A subfield of mathematics founded by Claude Shannon in the late 1940's. In our context, it is used to measure the amount of information available in a result and how more data increases the amount of information.

Information is measured in bits, where one bit is enough information to answer a single yes/no question. In general, if the possible answers vi have probabilities P(vi), then the information content I of the answer is given by

I(P(v1), ..., P(vn)) = SUM_{i=1 to n} -P(vi) log2 P(vi)

This is the average information content of the events, weighted by their probabilities.

For decision trees, we want to split on the attribute that gives us the largest information gain; that is, the attribute that gives us the most information. Gain is defined thus:

Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)

where A is an attribute, and p and n are the numbers of positive and negative examples in the training set. Remainder(A) is the amount of information still required after splitting on that attribute. Splitting on A generates v new subsets, the ith of which has pi positive and ni negative examples. Remainder is defined as

Remainder(A) = SUM_{i=1 to v} (pi + ni)/(p + n) * I(pi/(pi + ni), ni/(pi + ni))
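As a small worked illustration of these formulas (my own numbers, not from the notes), suppose the training set contains 6 positive and 6 negative examples, and some attribute splits them into branches holding (4+, 0-) and (2+, 6-) examples:

import math

def I(*probs):
    # Information content: SUM of -p log2 p over the given probabilities.
    return -sum(p * math.log2(p) for p in probs if p > 0)

p, n = 6, 6
before = I(p / (p + n), n / (p + n))          # 1 bit for an even split

branches = [(4, 0), (2, 6)]                   # (pi, ni) for each value of the attribute
remainder = sum((pi + ni) / (p + n) * I(pi / (pi + ni), ni / (pi + ni))
                for pi, ni in branches)

print(before - remainder)                     # Gain(A), roughly 0.46 bits here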

Inducing General Logical Descriptions

We can look at inductive learning generally as trying to find a logical description that is equivalent to the (unknown) evaluation function. By making our hypothesis more or less specific we can tailor it to match the evaluation function.

The way in which we improve the hypothesis is by modifying it when we encounter false positives (false instances we classify as true) and false negatives (true instances we classify as false). There are two kinds of search to do this.

Current-Best-Hypothesis Search

First described by John Stuart Mill in 1843. The algorithm is extremely simple; if a new example is encountered that our hypothesis misclassifies, then change the hypothesis as follows.

If it is a false positive, specialize the hypothesis not to cover it. This can be done by dropping disjuncts or adding new terms.

If it is a false negative, generalize the hypothesis by adding disjuncts or dropping terms.

If no consistent specialization/generalization can be found, backtrack.

(Think of the Russell and Norvig example of two sets, the positive instances inside the negative ones, and extending or shrinking the positive set as more data comes in.)

The earliest machine learning system to use this approach was the arch-learning program of [Winston, 1970]. However, it naturally suffers from inefficiency in large search spaces; after every modification the past instances must be checked to ensure they are still classified correctly, and it is difficult to find good search heuristics for generalizing or specializing the definition.

Least-Commitment Search

Rather than backtracking, this approach keeps all hypotheses that are consistent with the data seen so far. As more data becomes available, this version space shrinks. The algorithm for doing this is called the candidate elimination learning algorithm or the version space learning algorithm [Mitchell, 1977], and consists simply of constraining the version space to be consistent with all data seen so far, updating it as each new instance is seen.

This is a least-commitment approach since it never favors one possible hypothesis over another; all remaining hypotheses are consistent. Note that this method implicitly assumes that there is a partial ordering on all of the hypotheses in the space, the more-specific-than ordering. A generalization G1 is more specific than a generalization G2 if it matches a proper subset of the instances that G2 matches.

The obvious problem is that this method potentially requires an enormous number of hypotheses to be recorded. The solution is to use boundary sets to circumscribe the space of possible hypotheses. The two boundary sets are the G-set (the most general boundary) and the S-set (the most specific boundary). Every member of the G-set is consistent, and there are no more-general consistent hypotheses; every member of the S-set is consistent, and there are no more-specific consistent hypotheses.

The algorithm works as follows. Initially, the G-set is simply True, and the S-set False. For every new instance, there are four possible cases:

1. The instance is a false positive for a member Si of S: Si is too general, and since S is the most-specific boundary there are no consistent specializations of it, so Si is removed from S.
2. The instance is a false negative for Si: Si is too specific, so it is replaced by all of its immediate generalizations.
3. The instance is a false positive for a member Gi of G: Gi is too general, so it is replaced by all of its immediate specializations.
4. The instance is a false negative for Gi: Gi is too specific, and since G is the most-general boundary there are no consistent generalizations of it, so Gi is removed from G.

This process is repeated until one of three things happens: there is only one hypothesis left in the version space (in which case we return it); the version space collapses (S or G becomes empty), meaning there is no consistent hypothesis; or we run out of examples while the version space still contains several hypotheses, in which case we can use their collective evaluation (breaking disagreements by majority vote).

There are two main problems with this approach. If there is any noise, or insufficient attributes for classification, the version space will collapse. Also, if unlimited disjunction is allowed in the hypothesis space, S will contain only the single most-specific hypothesis (the disjunction of the descriptions of the positive examples), and G will contain only the most-general hypothesis (the negation of the disjunction of the descriptions of the negative examples). The latter problem can be partially solved by using a generalization hierarchy.

Such learning systems cannot handle noisy data. One solution is to maintain several S and G sets, consistent with decreasing numbers of training instances.

The pure version-space algorithm was first used in META-DENDRAL [Buchanan and Mitchell, 1978]. It was also used in LEX [Mitchell, 1983] which learned to solve symbolic integration problems.

Computational Complexity

Strategy                         Time Complexity                    Space Complexity
DFS (current best hypothesis)    O(pn)                              O(p + n)
Version space                    O(sg(p + n) + s^2 p + g^2 n)       O(s + g)

In this table, p and n are the number of positive and negative training instances, respectively, and s and g are the largest sizes of sets S and G.

The complexity of the depth-first approach stems from the need to reexamine old instances after each revision to the hypothesis; in particular, each time a positive instance forces the hypothesis to be generalized, all past negative instances must be rechecked. Similarly, revising the hypothesis in response to negative instances requires reexamining all past positive instances.

In the version space strategy, no training instances need be saved, so the space complexity is just the largest sizes of S and G. Notice that for this strategy, processing time grows linearly with the number of training instances. However, it grows as the square of the sizes of the boundary sets.


PAC Learning

[Introduced by Valiant, 1984] The field of computational learning theory attempts to provide a theoretical framework for learning systems. The question is, how can we justify the accuracy of a hypothesis h with respect to some function f if we don't know what f is?

The argument is, roughly, that incorrect hypotheses will be discovered relatively quickly because they will incorrectly identify instances. Any hypothesis that is consistent with a large enough set of training examples is unlikely to be seriously wrong. That is, it is probably approximately correct. PAC learning is the part of computational learning theory that studies this idea.

The key assumption underlying this argument is that the training set and the test set are randomly drawn from the same population of examples, and that they are drawn using the same probability distribution. This idea (due to Valiant) is the stationarity assumption. It is required to associate any future instances with the ones seen so far.

In particular, we would like to bound the likelihood that our hypothesis is not within a certain range of the correct hypothesis. Specifically, we define the error of a hypothesis thus:

error(h) = Pr(h(x) != f(x) | x drawn from D)

(where D is a distribution over the samples). A hypothesis h is called approximately correct if error(h) <= e. An approximately correct hypothesis can be thought of as lying within an e-ball of the true hypothesis in the hypothesis space. All hypotheses outside this ball form the set Hbad.

We calculate the probability that a hypothesis hb in Hbad remains consistent with the data as follows. By supposition, error(hb) > e, so the probability that hb agrees with any single example is less than 1 - e. The probability that it agrees with all m examples is therefore at most (1 - e)^m.

For Hbad to contain a consistent hypothesis, at least one of its hypotheses must be consistent. The probability of this occurring is bounded by the sum of the individual probabilities, thus

P(Hbad contains a consistent hypothesis) <= |Hbad|(1 - e)^m <= |H|(1 - e)^m.

We would like to reduce the probability of this event to some number d, the confidence parameter, so that

|H|(1 - e)^m <= d

which we can achieve by showing the system at least

m >= (1/e) (ln(1/d) + ln |H|)

training examples. Note that this approach assumes that the true function f is somewhere in the hypothesis space. Roughly speaking, by a theorem of Blumer et al., the number of examples required for learning is proportional to the logarithm of the size of the hypothesis space.
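For illustration (my own example, not from the notes), the bound is easy to evaluate numerically; here for the space of Boolean conjunctions over n attributes, whose size is roughly 3^n since each attribute can appear positively, appear negatively, or be absent:

import math

def pac_sample_bound(epsilon, delta, hypothesis_space_size):
    # m >= (1/e)(ln(1/d) + ln |H|) examples suffice for PAC learning.
    return math.ceil((1 / epsilon) *
                     (math.log(1 / delta) + math.log(hypothesis_space_size)))

n = 10
print(pac_sample_bound(epsilon=0.1, delta=0.05, hypothesis_space_size=3 ** n))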

Theorem [Blumer et al., 1989]: A space of hypotheses H is PAC-learnable iff it has finite VC-dimension.

Some examples of PAC-learnable classes in polynomial time: k-CNF, k-DNF, k-DL, Boolean conjunctions. In contrast, k-term-DNF and k-3NN (a 3-layer neural network) are NP-hard PAC-learning problems. Note that the size of the hypothesis space is not necessarily the whole problem; k-term-DNF is a proper subset of k-CNF, but the former is not polynomially learnable while the latter is. Note also that this means in order to prove that learning a concept class is computationally intractable, it must be shown intractable regardless of the representation employed by the learning algorithm.


Connectionist Learning

The connectionist approach is inspired by the model of the human brain as an enormous parallel computer, in which small computational units feed simple, low-bandwidth signals to one another, and from which intelligence arises. It attempts to replicate this behavior on an abstract level with what are called neural networks, an idea introduced by McCulloch and Pitts in 1943.

Neural Networks

More specifically, a neural network consists of a set of nodes (or units), links that connect one node to another, and weights associated with each link. Some nodes receive inputs via links; others directly from the environment, and some nodes send outputs out of the network. Learning usually occurs by adjusting the weights on the links.

Each unit has a set of weighted inputs, an activation level, and a way to compute its activation level at the next time step. This is done by applying an activation function to the weighted sum of the node's inputs. Generally, the weighted sum (also called the input function) is a strictly linear sum, while the activation function may be nonlinear. If the value of the activation function is above a threshold, the node "fires."

Generally, all nodes share the same activation function and threshold value, and only the topology and weights change.

Network Structures

The two fundamental types of network structure are feed-forward and recurrent. A feed-forward network is a directed acyclic graph; information flows in one direction only, and there are no cycles. Such networks cannot represent internal state. Usually, neural networks are also layered, meaning that nodes are organized into groups of layers, and links only go from nodes to nodes in adjacent layers.

Recurrent networks allow loops, and as a result can represent state, though they are much more complex to analyze. Hopfield networks and Boltzmann machines are examples of recurrent networks; Hopfield networks are the best understood. All connections in Hopfield networks are bidirectional with symmetric weights, all units have outputs of 1 or -1, and the activation function is the sign function. Also, all nodes in a Hopfield network are both input and output nodes. Interestingly, it has been shown that a Hopfield network can reliably recognize 0.138N training examples, where N is the number of units in the network.

Boltzmann machines allow non-input/output units, and they use a stochastic evaluation function that is based upon the sum of the total weighted input. Boltzmann machines are formally equivalent to a certain kind of belief network evaluated with a stochastic simulation algorithm.

One problem in building neural networks is deciding on the initial topology, e.g., how many nodes there are and how they are connected. Genetic algorithms have been used to explore this problem, but it is a large search space and this is a computationally intensive approach. The optimal brain damage method uses information theory to determine whether weights can be removed from the network without loss of performance, and possibly improving it. The alternative of making the network larger has been tested with the tiling algorithm [Mezard and Nadal, 1989] which takes an approach similar to induction on decision trees; it expands a unit by adding new ones to cover instances it misclassified. Cross-validation techniques can be used to determine when the network size is right.

Perceptrons

Perceptrons are single-layer, feed-forward networks that were first studied in the 1950's. They are only capable of learning linearly separable functions. That is, if we view F features as defining an F-dimensional space, the network can recognize any class that involves placing a single hyperplane between the instances of two classes. So, for example, they can easily represent AND, OR, or NOT, but cannot represent XOR.

Perceptrons learn by updating the weights on their links in response to the difference between their output value and the correct output value. The updating rule (due to Frank Rosenblatt, 1960) is as follows. Define Err as the difference between the correct output and actual output. Then the learning rule for each weight is

Wj <- Wj + A x Ij x Err

where A is a constant called the learning rate.
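A minimal Python sketch of this rule (mine, not from the notes), using a step activation with threshold zero and an extra weight for a constant bias input:

def train_perceptron(examples, n_inputs, rate=0.1, epochs=100):
    # examples: list of (inputs, target) pairs with target 0 or 1.
    # The last weight is for a constant bias input of 1.
    w = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for inputs, target in examples:
            x = list(inputs) + [1.0]
            output = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            err = target - output                 # Err = correct output - actual output
            for j in range(len(w)):
                w[j] += rate * x[j] * err         # Wj <- Wj + A x Ij x Err
    return w

# AND is linearly separable, so the rule converges to a correct set of weights.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(data, n_inputs=2))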

Of course, this was too good to last, and in Perceptrons [Minsky and Papert, 1969] it was observed how limited linearly separable functions were. Work on perceptrons withered, and neural networks didn't come into vogue again until the 1980's, when multi-layer networks became the focus.

Multi-Layer Feed-Forward Networks

The standard method for learning in multi-layer feed-forward networks is back-propagation [Bryson and Ho, 1969]. Such networks have an input layer, an output layer, and one or more hidden layers in between. The difficulty is to divide the blame for an erroneous output among the nodes in the hidden layers.

The back-propagation rule is similar to the perceptron learning rule. If Erri is the error at the output node, then the weight update for the link from unit j to unit i (the output node) is

Wj,i <- Wj,i + A x aj x Erri x g'(ini)

where g' is the derivative of the activation function, and aj is the activation of the unit j. (Note that this means the activation function must have a derivative, so the sigmoid function is usually used rather than the step function.) Define Di as Erri x g'(ini).

This updates the weights leading to the output node. To update the weights on the interior links, we use the idea that the hidden node j is responsible for part of the error in each of the nodes to which it connects. Thus the error at the output is divided according to the strength of the connection between the output node and the hidden node, and propagated backward to previous layers. Specifically,

Dj = g'(inj) x SUMi Wj,i Di

Thus the general updating rule for the link from any unit j to a unit i in the next layer is

Wj,i <- Wj,i + A x aj x Di.

Lastly, the weight updating rule for the weights from the input layer to the hidden layer is

Wk,j <- Wk,j + A x Ik x Dj

where k is the input node and j the hidden node, and Ik is the input value of k.
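The following is a sketch of these update rules for a single hidden layer, using the sigmoid activation and numpy (my own illustration, not from the notes; constant bias inputs are added to each layer, which the equations above do not show, so that the XOR example below is representable):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_epoch(X, T, W_kj, W_ji, rate=0.5):
    # One pass over the training set, updating the input-to-hidden weights W_kj
    # and the hidden-to-output weights W_ji in place.
    for x, t in zip(X, T):
        x = np.append(x, 1.0)                  # constant bias input
        in_j = W_kj.T @ x                      # weighted sums at the hidden units
        a_j = sigmoid(in_j)
        a_jb = np.append(a_j, 1.0)             # constant bias unit for the output layer
        in_i = W_ji.T @ a_jb
        a_i = sigmoid(in_i)                    # network outputs
        # D_i = Err_i x g'(in_i), where g'(in) = g(in)(1 - g(in)) for the sigmoid.
        D_i = (t - a_i) * a_i * (1.0 - a_i)
        # D_j = g'(in_j) x SUM_i W_j,i D_i (bias weights excluded from the sum).
        D_j = a_j * (1.0 - a_j) * (W_ji[:-1] @ D_i)
        W_ji += rate * np.outer(a_jb, D_i)     # W_j,i <- W_j,i + A x a_j x D_i
        W_kj += rate * np.outer(x, D_j)        # W_k,j <- W_k,j + A x I_k x D_j

# Example: a 2-3-1 network trained on XOR, which a perceptron cannot represent.
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
W_kj = rng.normal(scale=0.5, size=(3, 3))      # 2 inputs + bias -> 3 hidden units
W_ji = rng.normal(scale=0.5, size=(4, 1))      # 3 hidden units + bias -> 1 output
for _ in range(5000):
    backprop_epoch(X, T, W_kj, W_ji)
# After training, the outputs should be close to the XOR targets, though
# convergence can depend on the random initialization and the learning rate.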

A neural network requires 2^n/n hidden units to represent all Boolean functions of n inputs. For m training examples and W weights, each epoch in the learning process takes O(mW) time; but in the worst case, the number of epochs can be exponential in the number of inputs.

In general, if the number of hidden nodes is too large, the network may learn only the training examples, while if the number is too small it may never converge on a set of weights consistent with the training examples.

Multi-layer feed-forward networks can represent any continuous function with a single hidden layer, and any function with two hidden layers [Cybenko, 1988, 1989].

Applications of Neural Networks

John Denker remarked that "neural networks are the second best way of doing just about anything." They provide passable performance on a wide variety of problems that are difficult to solve well using other methods.

NETtalk [Sejnowski and Rosenberg, 1987] was designed to learn how to pronounce written text. Input was a seven-character window centered on the target character, and output was a set of Booleans controlling the form of the sound to be produced. It reached 95% accuracy on its training set, but had only 78% accuracy on the test set. Not spectacularly good, but important because it impressed many people with the potential of neural networks.

Other applications include a ZIP code recognition [Le Cun et al., 1989] system that achieves 99% accuracy on handwritten codes, and driving [Pomerleau, 1993] in the ALVINN system at CMU. ALVINN controls the NavLab vehicles, and translates inputs from a video image into steering control directions. ALVINN performs exceptionally well on the particular road-type it learns, but poorly on other terrain types. The extended MANIAC system [Jochem et al., 1993] has multiple ALVINN subnets combined to handle different road types.

Bayesian Learning in Belief Networks

(NOTE: As I am largely unfamiliar with belief networks, I would take the material in this section with a grain of salt, if I were you. -- POD)

Bayesian learning maintains a number of hypotheses about the data, each one weighted by its posterior probability when a prediction is made. The idea is that, rather than keeping only one hypothesis, many are entertained, and weighted based on their likelihoods.

Of course, maintaining and reasoning with a large number of hypotheses can be intractable. The most common approximation is to use a most probable hypothesis, that is, a hypothesis Hi in H that maximizes P(Hi | D), where D is the data. This is often called the maximum a posteriori (MAP) hypothesis HMAP:

P(X | D) ~= P(X | HMAP) x P(HMAP | D)

To find HMAP, we apply Bayes' rule:

P(Hi | D) = [P(D | Hi) x P(Hi)] / P(D)

Since P(D) is fixed across the hypotheses, we only need to maximize the numerator. The first term represents the probability that this particular data set would be seen, given Hi as the model of the world. The second is the prior probability assigned to the model.
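A tiny illustration of choosing the MAP hypothesis (made-up numbers, not from the notes): since P(D) is the same for every hypothesis, the argmax can be taken over the unnormalized products P(D | Hi) P(Hi).

priors      = {"h1": 0.5, "h2": 0.3, "h3": 0.2}     # hypothetical P(Hi)
likelihoods = {"h1": 0.01, "h2": 0.05, "h3": 0.20}  # hypothetical P(D | Hi)

unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
h_map = max(unnormalized, key=unnormalized.get)
print(h_map)    # "h3" here, since 0.20 * 0.2 = 0.04 beats 0.015 and 0.005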

Belief Network Learning Problems

There are four kinds of belief network learning problems, depending upon whether the structure of the network is known or unknown, and whether the variables in the network are observable or hidden.

Comparison of Belief and Neural Networks

The two formalisms can be compared as representations, as inference systems, and as learning systems.

Both kinds of network are attribute-based representations. Both can handle either discrete or continuous output. The major difference is that belief networks are localized representations, while neural networks are distributed. Belief network nodes represent propositions with clearly defined semantics and relationships to other nodes. In neural networks, nodes generally don't represent specific propositions, and the calculations would not treat them in a semantically meaningful way. The effect is that human beings can neither construct nor understand neural network representations, whereas both can be done with belief networks.

Belief networks handle two kinds of activation, both in terms of the values a proposition may take, and the probabilities assigned to each value. Neural network outputs could be values or probabilities, but they cannot handle both simultaneously.

Trained feed-forward neural network inference can execute in linear time, where in belief networks inference is NP-hard. However, a neural network may have to be exponentially larger to represent the same things that a belief network can.

As for learning, belief networks have the advantage of being easier to give prior knowledge; also, since they represent propositions locally, it may be easier for them to converge, since they are directly affected only by a small number of other propositions.


Reinforcement Learning

As opposed to supervised learning, reinforcement learning takes place in an environment where the agent cannot directly compare the results of its action to a desired result. Instead, it is given some reward or punishment that relates to its actions. It may win or lose a game, or be told it has made a good move or a poor one. The job of reinforcement learning is to find a successful function using these rewards.

The reason reinforcement learning is harder than supervised learning is that the agent is never told what the right action is, only whether it is doing well or poorly, and in some cases (such as chess) it may only receive feedback after a long string of actions.

There are two basic kinds of information an agent can try to learn: a utility function on states, which it can then use to select actions, or an action-value function giving the expected utility of taking a given action in a given state (Q-learning).

Passive Learning in a Known Environment

Assuming an environment consisting of a set of states, some terminal and some nonterminal, and a model that specifies the probabilities of transition from state to state, an agent learns passively by observing a set of training sequences, which consist of a set of state transitions followed by a reward.

The goal is to use the reward information to learn the expected utility of each of the nonterminal states. An important simplifying assumption is that the utility of a sequence is the sum of the rewards accumulated in the states of the sequence. That is, the utility function is additive.

A passive learning agent keeps an estimate U of the utility of each state, a table N of how many times each state was seen, and a table M of transition probabilities. There are a variety of ways the agent can update its table U.

Naive Updating

One simple updating method is the least mean squares (LMS) approach [Widrow and Hoff, 1960]. It assumes that the observed reward-to-go of a state in a sequence provides direct evidence of the actual reward-to-go. The approach is simply to keep the utility as a running average of the rewards based upon the number of times the state has been seen. This approach minimizes the mean square error with respect to the observed data.

This approach converges very slowly, because it ignores the fact that the actual utility of a state is the probability-weighted average of its successors' utilities, plus its own reward. LMS disregards these probabilities.

Adaptive Dynamic Programming

If the transition probabilities and the rewards of the states are known (which will usually happen after a reasonably small set of training examples), then the actual utilities can be computed directly as

U(i) = R(i) + SUMj MijU(j)

where U(i) is the utility of state i, R is its reward, and Mij is the probability of transition from state i to state j. This is identical to a single value determination in the policy iteration algorithm for Markov decision processes. Adaptive dynamic programming is any kind of reinforcement learning method that works by solving the utility equations using a dynamic programming algorithm. It is exact, but of course highly inefficient in large state spaces.
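Because these equations are linear in the unknown utilities, they can be solved directly; the following numpy sketch (my own, with a made-up three-state chain) assumes that the transition rows for terminal states are all zero, so that I - M is invertible.

import numpy as np

def solve_utilities(M, R):
    # Solve U = R + M U, i.e. (I - M) U = R, for the utility of every state.
    # M[i, j] is the probability of a transition from state i to state j.
    n = len(R)
    return np.linalg.solve(np.eye(n) - M, R)

# Hypothetical chain: state 0 -> 1 -> 2 (terminal), reward -0.04 in each
# nonterminal state and +1 in the terminal state.
M = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
R = np.array([-0.04, -0.04, 1.0])
print(solve_utilities(M, R))    # roughly [0.92, 0.96, 1.0]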

Temporal Difference Learning

[Richard Sutton] Temporal difference learning uses the difference in utility values between successive states to adjust them from one epoch to another. The key idea is to use the observed transitions to adjust the values of the observed states so that they agree with the ADP constraint equations. Practically, this means updating the utility of state i so that it agrees better with its successor j. This is done with the temporal-difference (TD) equation:

U(i) <- U(i) + alpha (R(i) + U(j) - U(i))

where alpha is a learning rate parameter.

Temporal difference learning is a way of approximating the ADP constraint equations without solving them for all possible states. The idea generally is to define conditions that hold over local transitions when the utility estimates are correct, and then create update rules that nudge the estimates toward this equation.

This approach will cause U(i) to converge to the correct value if the learning rate parameter decreases with the number of times a state has been visited [Dayan, 1992]. In general, as the number of training sequences tends to infinity, TD will converge on the same utilities as ADP.
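A sketch of the TD update applied to observed training sequences (mine, not from the notes); each sequence is a list of (state, reward) pairs ending in a terminal state, and the learning rate is held fixed for simplicity rather than decayed as Dayan's convergence result requires.

from collections import defaultdict

def td_estimate(sequences, alpha=0.1, passes=100):
    # Temporal-difference estimation of state utilities from training sequences.
    U = defaultdict(float)
    for _ in range(passes):
        for seq in sequences:
            for (i, r_i), (j, _) in zip(seq, seq[1:]):
                # U(i) <- U(i) + alpha (R(i) + U(j) - U(i))
                U[i] += alpha * (r_i + U[j] - U[i])
            terminal, terminal_reward = seq[-1]
            U[terminal] = terminal_reward     # a terminal state's utility is its reward
    return dict(U)

# Hypothetical training sequence for the three-state chain used above.
print(td_estimate([[("s0", -0.04), ("s1", -0.04), ("s2", 1.0)]]))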

Passive Learning in an Unknown Environment

Since neither temporal difference learning nor LMS actually use the model M of state transition probabilities, they will operate unchanged in an unknown environment. The ADP approach, however, updates its estimated model of an unknown environment after each step, and this model is used to revise the utility estimates.

Any method for learning stochastic functions can be used to learn the environment model; in particular, in a simple environment the transition probability Mij is just the percentage of times state i has transitioned to j.

The basic difference between TD and ADP is that TD adjusts a state to agree with the observed successor, while ADP makes a state agree with all successors that might occur, weighted by their probabilities. More importantly, ADP's adjustments may need to be propagated across all of the utility equations, while TD's affect only the current equation. TD is essentially a crude first approximation to ADP.

A middle ground can be found by bounding or ordering the number of adjustments made in ADP, beyond the single one made in TD. The prioritized-sweeping heuristic prefers to make adjustments only to states whose likely successors have just undergone large adjustments in their utility estimates. Such approximate ADP systems can be very nearly as efficient as ADP in terms of convergence, but operate much more quickly.

Active Learning in an Unknown Environment

The difference between active and passive agents is that passive agents learn a fixed policy, while the active agent must decide what action to take and how it will affect its rewards. To represent an active agent, the environment model M is extended to give the probability of a transition from a state i to a state j, given an action a. Utility is modified to be the reward of the state plus the maximum utility expected depending upon the agent's action:

U(i) = R(i) + max_a SUM_j M^a_ij U(j)

An ADP agent is extended to learn transition probabilities given actions; this is simply another dimension in its transition table. A TD agent must similarly be extended to have a model of the environment.

Learning Action-Value Functions

An action-value function assigns an expected utility to the result of performing a given action in a given state. If Q(a, i) is the value of doing action a in state i, then

U(i) = maxa Q(a, i)

The equations for Q-learning are similar to those for state-based learning agents. The difference is that Q-learning agents do not need models of the world. The equilibrium equation, which can be used directly (as with ADP agents) is

Q(a, i) = R(i) + SUM_j M^a_ij max_a' Q(a', j)

The temporal difference version does not require that a model be learned; its update equation is

Q(a, i) <- Q(a, i) + alpha (R(i) + max_a' Q(a', j) - Q(a, i))

where alpha is the learning rate and j is the observed successor state.
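A sketch of one step of this update (my own, not from the notes); alpha is the learning rate, and the observed step (i, a, R(i), j) is assumed to come from whatever exploration policy the agent follows.

from collections import defaultdict

def q_update(Q, i, a, r_i, j, actions, alpha=0.1):
    # Q is a dict keyed by (action, state); actions are those available in state j.
    best_next = max(Q[(a2, j)] for a2 in actions) if actions else 0.0
    # Q(a, i) <- Q(a, i) + alpha (R(i) + max_a' Q(a', j) - Q(a, i))
    Q[(a, i)] += alpha * (r_i + best_next - Q[(a, i)])

# Hypothetical usage: one observed step in a two-action world.
Q = defaultdict(float)
q_update(Q, i="s0", a="right", r_i=-0.04, j="s1", actions={"left", "right"})
print(Q[("right", "s0")])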

Applications of Reinforcement Learning

The first significant reinforcement learning system was used in Arthur Samuel's checker-playing program. It used a weighted linear function to evaluate positions, though it did not use observed rewards in its learning process.

TD-gammon [Tesauro, 1992] has an evaluation function represented by a fully-connected neural network with one hidden layer of 80 nodes; with the inclusion of some precomputed board features in its input, it was able to reach world-class play after about 300,000 training games.

A case of reinforcement learning in robotics is the famous cart-pole balancing problem. The problem is to control the position of the cart (along a single axis) so as to keep a pole balanced on top of it upright, while staying within the limits of the track length. Actions are usually to jerk left or right, the so-called bang-bang control approach. The first work on this problem was the BOXES system [Michie and Chambers, 1968], in which the state space was partitioned into boxes and reinforcement was propagated into the boxes. More recent work using neural networks [Furuta et al., 1984] simulated the triple-inverted-pendulum problem, in which three poles balance one atop another on a cart.

Genetic Algorithms and Evolutionary Programming

Genetic algorithms are inspired by natural evolution. In the natural world, organisms that are poorly suited for an environment die off, while those well-suited for it prosper. Children are much like their parents, though occasionally mutations occur; most are harmful, but rarely there is an improvement in the strain. Genetic algorithms simulate this behavior.

Genetic algorithms search the space of individuals for good candidates. The "goodness" of an individual is measured by some fitness function. Search takes place in parallel, with many individuals in each generation. The approach is a hill-climbing one, since in each generation the offspring of the best candidates are preserved.

In the standard genetic algorithm approach, each individual is a bit-string that encodes its characteristics. Each element of the string is called a gene, so the string is likened to a strand of DNA.

The algorithm consists of looping through generations. In each generation, a subset of the population is selected to reproduce; usually this is a random selection in which the probability of choice is proportional to fitness. Selection is usually done with replacement (so a fit individual may reproduce many times). Reproduction occurs by randomly pairing all of the individuals in the selection pool, and then generating two new individuals by performing crossover, in which the initial n bits (where n is random) of the parents are exchanged. There is a small chance that one of the genes in the resulting individuals will mutate to a new value.
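A minimal sketch of this loop (mine, not from the notes), evolving bit-strings with fitness-proportional selection, single-point crossover, and a small per-gene mutation rate; the fitness function here is the toy "count the 1 genes" problem.

import random

def evolve(pop_size=20, length=16, generations=50, mutation_rate=0.01):
    fitness = lambda ind: sum(ind)                    # toy fitness: number of 1 genes
    population = [[random.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        weights = [fitness(ind) + 1e-9 for ind in population]  # avoid an all-zero pool
        next_gen = []
        while len(next_gen) < pop_size:
            # Selection with replacement, probability proportional to fitness.
            p1, p2 = random.choices(population, weights=weights, k=2)
            cut = random.randrange(1, length)         # single crossover point
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                for g in range(length):               # occasional mutation
                    if random.random() < mutation_rate:
                        child[g] = 1 - child[g]
                next_gen.append(child)
        population = next_gen[:pop_size]
    return max(population, key=fitness)

print(evolve())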

Genetic algorithms are broadly applicable and have the advantage that they require little knowledge encoded in the system. However, as might be expected from a knowledge-poor approach, they give very poor performance on some problems.

Evolutionary programming is usually taken to mean a more complex form of genetic programming in which the individuals are more complex structures -- for example, sections of program code -- and the crossover and mutation operations correspondingly more elaborate.

Classifier systems are genetic programs that develop rules for classifying input instances. Each rule is weighted, and has some set of feature patterns as a condition and some pattern as an action. On each cycle, the rules that match make bids proportional to their weights. One or more rules are then selected for application with probabilities based on their bids. Weights are adjusted according to the contribution of the rules to desired behavior; the bucket brigade algorithm [Holland, 1985] is an effective method of doing this.

Learning with Knowledge

By considering the kinds of logical constraints placed upon different kinds of knowledge-based learning, we can classify them more clearly. Examples are composed of Descriptions and Classifications, and we are trying to find a Hypothesis to explain the data.

Inductive learning can be characterized by the following entailment constraint:

Hypothesis ^ Descriptions |= Classifications

That is, given our hypothesis and descriptions of problem instances, we want to generate classifications. This is inductive learning. Other kinds of learning that use prior knowledge are described below.

Explanation-Based Learning

This kind of learning occurs when the system finds an explanation of an instance it has seen, and generalizes the explanation. The general rule follows logically from the background knowledge possessed by the system. The entailment constraints for EBL are

Hypothesis ^ Descriptions |= Classifications
Background |= Hypothesis

The agent does not actually learn anything factually new, since the hypothesis was entailed by background knowledge. This kind of learning is regarded as a way to convert first principles into useful specialized knowledge (converting problem-solving search into pattern-matching search).

The basic idea is to construct an explanation of the observed result, and then generalize the explanation. More specifically, while constructing a proof of the solution, a parallel proof is performed, in which each constant of the first is made into a variable. Then a new rule is built in which the left-hand side is the leaves of the proof tree, and the right-hand side is the variabilized goal, up to any bindings that must be made with the generalized proof. Any conditions true regardless of the variables are dropped.

Note that by pruning the tree before the leaves, even more general rules may be learned. However, the more general, the more computation may be required to apply the rule. One approach is to require the operationality of the subgoals in the new rule -- that they be "easy" to solve.

Relevance-Based Learning

This is a kind of learning in which background knowledge relates the relevance of a set of features in an instance to the general goal predicate. For example, if I see men in the Forum in Rome speaking Latin, and I know that seeing someone in a city speaking a language usually means that everyone in that city speaks that language, I can conclude that Romans speak Latin.

In general, background knowledge, together with the observations, allows the agent to form a new, general rule to explain the observations. The entailment constraint for RBL is

Hypothesis ^ Descriptions |= Classifications
Background ^ Descriptions ^ Classifications |= Hypothesis

This is a deductive form of learning, because it cannot produce hypotheses that go beyond the background knowledge and observations.

We presume that our knowledge base has a set of functional dependencies or determinations that support the construction of hypotheses. The learning algorithm then tries to find the minimal consistent determination (e.g., a sentence of the form "P determines Q," meaning that examples matching on P also match on Q).

Knowledge-Based Inductive Learning

This is a kind of learning in which our background knowledge, together with our observations, lead us to make a hypothesis that explains the examples we see. If I see the Old Man from Scene 24 on the Bridge of Despair, and notice that he asks a simple question of every other knight that attempts to cross, I can hypothesize that only the odd-numbered knights are able to cross the Gorge of Eternal Peril. The entailment constraint in this case is

Background ^ Hypothesis ^ Descriptions |= Classifications

Such knowledge-based inductive learning has been studied mainly in the field of inductive logic programming. Such systems reduce learning complexity in two ways. First, by requiring all new hypotheses to be consistent with existing knowledge, they reduce the search space of hypotheses. Secondly, the more prior knowledge available, the less new knowledge required in the hypothesis to explain the observations.

Attribute-based learning algorithms are incapable of learning predicates. One of the advantages of ILP algorithms is their much broader range of applicability.

Inverse Resolution

Inverse resolution is based upon the fact that if the example Classifications follow from the Background ^ Hypothesis ^ Descriptions, then it must be possible to prove this by resolution. The goal is to invert the resolution process in order to find the initial hypothesis.

An inverse resolution step takes a resolvent C, and produces two clauses C1 and C2 which produce it via resolution; alternately, it takes C and C1 and produces C2.

This allows a very large search space, since the resolving clauses could be taken from anywhere: from the hypotheses, the example descriptions, the negated classifications, or already-generated clauses. Possible constraints include using restricted resolution strategies, and making sure that introduced clauses are consistent with the observations.

Inverse resolution strategies do, in principle, have the ability to discover first-order theories, though until effective means of handling the search space are found this remains a difficult problem. They can also invent new predicates through the process of inverse resolution, which can help to reduce the size of the goal predicate.

An example of an ILP system using inverse resolution is GOLEM [Muggleton, 1992], which generated predictions of protein structure from sequence data; it was also used to predict the therapeutic use of drugs based on their molecular structures [King, 1992]. This work is similar to META-DENDRAL's, but the problem domains are considerably more complex, and GOLEM is entirely general purpose.

Top-Down Learning Methods

The other approach to ILP is essentially a generalization of decision tree learning to first-order logic. The idea is to start with a very general rule and specialize it gradually so it fits the data. In the first-order case, first-order literals are used instead of attributes, and the hypothesis is a set of clauses rather than a decision tree.

The following is a description of the approach used in FOIL [Quinlan, 1990], one of the early systems to do this.

First, we are given a predicate we want to learn (e.g., Grandfather(x, y)) and a set of training examples. We create a Horn clause, with an initially empty body, that implies our predicate. We then add literals to the body one at a time, each one giving us the best increase in accuracy in classifying the examples. This continues until we have a clause that agrees with some subset of the positive examples and none of the negative examples. The positive examples covered by that clause are then removed from the training set, and the process repeats until no positive examples remain.

The two kinds of choices are: how to generate new literals to add to the clause, and how to choose which literal to add.

There are three kinds of literals that can be created: literals using predicates, equality and inequality literals, and arithmetic comparisons. In the first case the literals may be negated or unnegated, any existing predicate can be used, and all arguments must be variables; each new literal must use at least one variable that already appears in the head of the clause or in an earlier literal. (Notice that FOIL can learn recursive definitions, since it can use the predicate from the head of the clause.)

Naturally this leads to a large branching factor in the search. Type information may be used to restrict the space; for example, the system may know that a literal such as Parent(x, n), where x is a person and n a number, makes no sense.

Choosing which literal to add uses a heuristic similar to information gain.
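
A minimal sketch of the covering loop just described, in Python-flavored pseudocode (the literal generator, gain function, and coverage test are placeholders, and FOIL proper maintains variable bindings over tuples rather than whole examples):

    # Sketch of FOIL's outer loop: grow one clause at a time by greedily
    # adding the best-scoring literal until the clause covers no negative
    # examples, then remove the positives it covers and repeat.
    def foil(target, positives, negatives, gen_literals, gain, covers):
        clauses = []
        pos = set(positives)
        while pos:                                  # until every positive is covered
            body = []                               # start with the empty, most general body
            p, n = set(pos), set(negatives)
            while n:                                # specialize until no negatives are covered
                best = max(gen_literals(target, body),
                           key=lambda lit: gain(lit, body, p, n))
                body.append(best)
                p = {e for e in p if covers(target, body, e)}
                n = {e for e in n if covers(target, body, e)}
            clauses.append((body, target))          # Horn clause: body => target
            pos = {e for e in pos if not covers(target, body, e)}
        return clauses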

FOIL and its descendants have been used to learn a variety of complex definitions; a particular example is learning definitions from a Prolog textbook given a small set of examples and the previously-learned definitions [Quinlan and Cameron-Jones, 1993].


Important Learning Systems

(For a more comprehensive list, see Ronny Kohavi's learning summary.)

CLS

[Hunt, 1966]. Concept Learning System. CLS was a learning algorithm that learned concepts and used them to classify new cases. CLS was the precursor of decision trees; it led to Quinlan's ID3 system.

EPAM

[Feigenbaum, 1963] Elementary Perceiver and Memorizer. It memorized pairs of nonsense syllables by using discrimination nets in which it could find images of syllables it had seen. It stored as its cue only enough information to disambiguate the syllable from the others seen at the time the association was formed. Thus old cues might be inadequate to retrieve data later, so the system could "forget."

Samuel's Checker Program

[Arthur Samuel, 1947-67] The program used minimax search and a static evaluation function. It rote-learned positions together with their minimax values, so that a value could be reused in later games when the same position reappeared. A least-recently-used policy was used to manage the cache.

To learn, the system evaluated positions with a weighted sum of sixteen features, and the weights changed through learning. The program actually knew 38 features, but used only 16 in the evaluation function at any given time; occasionally one was deleted and replaced by another.
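
The core idea, a weighted linear combination of board features whose weights are nudged toward better value estimates, can be sketched as follows (the features and the update rule here are illustrative stand-ins, not Samuel's actual scheme):

    # Toy linear evaluation in the spirit of Samuel's checkers learner.
    def evaluate(board, features, weights):
        # Weighted sum of feature values for the given position.
        return sum(w * f(board) for f, w in zip(features, weights))

    def adjust_weights(board, features, weights, target, rate=0.01):
        # Nudge the static evaluation toward a better estimate of the
        # position's value (e.g., one obtained by deeper search); a crude
        # stand-in for Samuel's actual weight-adjustment procedure.
        error = target - evaluate(board, features, weights)
        return [w + rate * error * f(board) for f, w in zip(features, weights)]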

Winston's Learning Program

[Winston, 1975] An early structural learning program that used semantic nets to describe block structures. Matching and hill climbing allowed the program to learn more complex structures; one major problem was that a teacher had to guide the program through a set of helpful examples.

LEX

[Mitchell and Utgoff] A concept learning program that created and refined heuristics suggesting whether a given operator should be applied to a given problem state in a forward-search problem solver. Each operator had associated with it a set of states in which it "should" be applied.

LEX's domain was integral calculus. It used candidate elimination to determine good and bad instances of operator applications; the algorithm refined the version spaces of the operators. Multiple boundary sets were used to handle noise.

Meta-DENDRAL

[Buchanan and Mitchell, 1978] A program that does automatic theory formation using the version space learning algorithm. It is designed to infer theories (rule-sets) for heuristic DENDRAL, which represents knowledge about mass spectroscopy as production rules.

HACKER

[Sussman, 1975] HACKER learns by doing; it tries to generalize subroutines and bugs. It has a gallery of critics that inspect the plans it generates for known generalized bugs. It uses the meaning of operators to guide the generalization process.

PRODIGY

[Minton, 1989] A general-purpose problem-solving system that uses several learning methods. Most learning is directed at automatically constructing a set of control rules to improve search in a given domain. It can acquire control rules in three ways:
  1. Through hand coding by programmers.
  2. Through a static analysis of the domain's operators.
  3. Through looking at traces of its own problem-solving behavior.

PRODIGY learns control rules from experience, but unlike SOAR it also learns from its failures. If it pursues a path that fails, it will try to find an explanation of why it failed, and use that to build control knowledge to avoid such paths in the future.

PRODIGY keeps a utility measure for each control rule. This takes into account average savings provided by the rule, frequency of use, and cost of matching. If a proposed rule has negative utility it is discarded.
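
A common formulation of this measure (roughly Minton's; treat the exact form as an assumption) weighs the savings a rule provides by how often it applies and subtracts the cost of matching it:

    # Utility estimate for a control rule, in the spirit of PRODIGY:
    # expected savings when the rule fires, weighted by how often it
    # applies, minus the average cost of testing whether it applies.
    def rule_utility(avg_savings, application_freq, avg_match_cost):
        return avg_savings * application_freq - avg_match_cost

    # Rules whose estimated utility goes negative are discarded.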

AUTOCLASS

[Cheeseman, 1988] Given a set of training cases, it hypothesizes a set of classes. For any given case it provides probabilities to predict the class into which it will fall. AUTOCLASS found meaningful new classes of stars from infra-red spectral data.

NETL

[Fahlman, 1979] A connectionist, hardware implementation of a semantic network using local representations. Each node is a simple processing unit, capable of storing a few single-bit markers and performing simple Boolean operations on them. Each link is also a simple processor, wired to two or more units. Link units can also perform Boolean operations, which usually consist of passing a marker from one node to another.
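
A serial software caricature of marker passing (NETL itself was conceived as parallel hardware; the link structure and query below are invented for illustration):

    # Propagate a single-bit marker from a start node along the links of a
    # semantic net; intersection queries then look for nodes that carry
    # several markers at once.
    def propagate_marker(links, start, marker, markers):
        frontier = [start]
        while frontier:
            node = frontier.pop()
            node_markers = markers.setdefault(node, set())
            if marker in node_markers:
                continue                      # already visited with this marker
            node_markers.add(marker)
            frontier.extend(links.get(node, ()))
        return markers

    # E.g., propagating one marker from "bird" and another from "pet" over
    # IS-A links and intersecting the marker sets finds nodes that are both.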

AM / Eurisko

[Lenat, 1977] The Automated Mathematician worked from basic concepts of set theory to discover a fair amount of standard number theory. It used a variety of general-purpose AI techniques, including a frame system to represent concepts. One of AM's major tasks was to create new concepts and fill their slots.

An agenda controls the discovery process. AM operates in cycles, each cycle choosing the most promising task on the agenda. AM discovered the concept of prime numbers, the Unique Factorization Theorem, and Goldbach's Conjecture (that every even number greater than 2 is the sum of two primes).

AM's performance was limited by its static heuristics; the main source of its power was the large set of heuristics about what things are "interesting." AM has an implicit bias toward learning number theory concepts.

BACON

[Langley, 1981] A model of data-driven scientific discovery. BACON creates proportionalities in order to derive relations between data values. It uses other heuristics to postulate intrinsic properties of objects and to reason by analogy.

It "discovered" Kepler's third law, Ohm's law, Joule's law, and conservation of momentum. There were several versions of BACON; early versions had trouble dealing with noise in the data, and were sensitive to the order in which data were presented. (See Handbook of Artificial Intelligence, vol. III, for details.)


Timeline and Accomplishments

Date        Learning System              Authors                 Significant Contributions
1958        Perceptrons                  Rosenblatt              introduced the LTU and the perceptron learning rule
1959-1967   Checkers Program             Samuel                  minimax search, weights on heuristic evaluation function, term replacement, rote learning
1963        EPAM                         Feigenbaum              first computational model of many human verbal-learning behaviors (discrimination nets)
1966        CLS                          Hunt                    precursor of decision trees (led to ID3)
1969        Perceptrons (book)           Minsky and Papert       proved that perceptrons can learn only linearly separable functions
1971        STRIPS MACROPs               Fikes and Nilsson       earliest EBL
1973        HACKER                       Sussman                 ?
1975        Winston's Learning Program   Winston                 early structural learner; used semantic nets, hill climbing
1978        Meta-DENDRAL                 Buchanan and Mitchell   automatic theory formation; introduced version spaces
1979        ID3                          Quinlan                 decision tree learning system; uses information gain
1983        LEX                          Mitchell                heuristic learner; candidate elimination, multiple boundary sets in version space
1989        PRODIGY                      Minton                  learns from success and failure; utility measures on control rules
1990s       ILP                          Quinlan, et al.         learning of general first-order theories



Open Problems in Learning




Sources

Russell and Norvig, Artificial Intelligence: A Modern Approach, Chapters 18-21.

Kohavi, AI Qual Note #19: Learning.

Mitchell, Generalization as Search.

Dietterich, Machine Learning.

See the AI Qual Reading List for further details on these sources.


Patrick Doyle pdoyle@cs.stanford.edu