Summary of Tom M. Mitchell's "Generalization as Search"
(by Christian Shelton)
This paper beings by giving examples of the wide range of abilities all
grouped as learning. Mitchell then limits the scope of the paper to
generalization by highlighting it as a central capability in many aspects
of learning. Generalization is described here as the ability to take a
set of events or descriptions and their classification (binary in this
case) and predict the classification of other events or descriptions following
the same set of "rules" as the first set.
Definitions
A set of definitions are required to understand this general conversation
about generalization:
- instance language
- the language used to represent the observations to
be generalized
- generalization language
- the language used to represent the learned
generalization. It should be noted that a generalization language often
times will not allow all possible generalizations.
- matching predicate
- a test to determine whether or not an instance is
classified as positive in a given generalization language.
- training instance
- an instance from the instance language and its
classification used for learning the generalization
- Generalization Problem
-
"Given:
- A language in which to describe instances
- A language in which to describe generalizations
- A matching predicate that matches generalizations to instances
- A set of positive and negative training instances of a target
generalization to be learned
Determine: Generalizations within the provided language that are
consistent with the presented training instances (i.e., plausible
descriptions of the target generalization)."
Ordering and Search
"The above generalization problem is essentially a search problem. The
generalization language corresponds to an hypothesis space (search space) of
possible solutions, and the learning task is to examine this hypothesis
space, subject to constraints imposed by the training instances, to
determine plausible generalizations."
An important feature of this hypothesis space is that it is partially
ordered by the "more-specific-than" operator. "This partial ordering is
important because it provides a powerful basis for organizing the
search through the hypothesis space."
Thus there are three major ways of searching this space:
- Depth-first search: obviously this one exists. Here, one begins with
a very specific definition consistent with the first observation. Then,
for each observation, make the definition either more restrictive (for
negative observations) or more relaxed (for positive observations) so that
the current "working" generalization satisfies the current observation.
In doing so, one must look back at all of the previous observations to insure
that consistency is maintained. As well, there are many different ways of
changing the "working" generalization. Depth-first search explores all of
these combinations through traditional backtracking with all of its problems
and advantages.
- Breadth-first search (specific to general): Here, one constructs a
set S such that for each s in S, s is consistent with the observations thus
far and as specific as possible. Or S = { s | s is a generalization that is
consistent with the observed instances, and there is no generalization which
is both more specific than s and consistent with the observed instances}.
In such a search, S is initialized based on the first instance and then
for each negative instance, members of S are removed and for each positive
instance, the members of S are generalized and then checked for consistency.
- Version Space strategy: This is one specific to general and one general to
specific breadth-first search done at the same time. One defines
a set G similar to S, but such that the members are as general as possible
(p. 100). Then the S set is advanced as before and the G set is advanced
in the complimentary manner. Thus, the whole space of possible generalizations
is "represented" by these two "bounding" sets in the sense that all possible
generalizations could be created from these two sets. Specifically, a
generalization is represented if it is more specific than or equal to some
member of G and more general than or equal to some member of S. There are
a few restraints that should be noted for this technique:
- As with the breadth-first search, this only works when a the more
specific than operator and more general than operator can computed by direct
examination.
- In addition, this technique assumes a most general and most specific
generalization. This might not always be possible.
Considerations when Comparing Search Methods
- not efficiency
- "ability to detect the point at which the target generalization is
completely determined by the training instances" (when only one generalization
is possible -- the data is completely learned)
This is very simple with the version space search. In this case, the
generalization has been completely learned when S and G are equal and contain
only one generalization.
- the ability to "use incompletely determined generalizations in a
reasonable manner"
The version space search also works well here. An incomplete generalization
consists of an S and a G set which "bound" all possible generalizations. Thus,
a new set can be classified as positive if it matches every element of S and
negative if it doesn't match any elements of G with certainty even though the
exact generalization has not (yet) been learned. This is correct under the
assumption that all training sets were correct and that the generalization
language will allow a complete description of the desired generalization.
Note that the breadth-first strategy does all such unambiguous classifications
for positive instances, but not for negative instances (as the G set is not
computed).
- "ability to direct the presentation of training instances to obtain
informative instances"
Now we consider the case where the program can control the training sets
presented to it. It would be most advantageous for the program to pick a
training set which would match 1/2 of the plausible generalizations. In
the Version Space search, this corresponds to an instance which matches
1/2 of S and G. For the breadth-first search, this is one which matches
1/2 of S.
Complexity
(where p is the number of positive sets, n is the number of negative sets,
s is the maximum size of the set S and g is the maximum size of the set G)
Processing Time:
- Depth-First: O(pn)
- Breadth-First: O(spn + ssp)
- Version Space: O(sg(p+n) + ssp + ggn)
Storage Space:
- Depth-First: O(p + n)
- Breadth-First: O(s + n)
- Version Space: O(s + g)
Other Generalization Strategies
These are non-data-driven methods (the above three were all data-driven).
- Generate-and-test
- a class of strategies that produce hypotheses independently of the data given
and then test them against all of the data.
- Statistical Pattern Recognition
- a class of strategies that represent instances as points in an n-dimensional
space and decision surfaces as surfaces in these spaces. These methods tend
to be better than the search methods when there are errors in the training set.
Further Issues
The following are open problems in machine learning
- The generalization language: a significant problem exists in choosing the
correct generalization language so that the generalization is easy to find
and possible to describe. It can also influence the resources required to
search the space. If the generalization is made poorly, matching generalizations
to instances can take a lot of computing time. As well, the parameters of
the search space can change drastically. The choice of the generalization
language can dictate whether the space will be narrow and deep or wide and
shallow.
- Using prior knowledge/expectations: One way is to impose restraints on
the generalization language. Or one can put them into the hypothesis
generator in generate-and-test cases. This can be important in limiting
the search space.
- Inconsistency: If "(1) the generalization language is insufficient to
describe the target generalization, or (2) the training instances contain errors"
then it will be impossible to find a target generalization and for the search
methods given, the algorithm will not be able to tell which of these two
cases caused the failure. Gracefully dealing with this kind of problem can
be very important for practical applications.
- Partially Learn Generalizations: It is important to be able to use
partially learned generalizations as it is "unlikely in realistic applications
that sufficient training data will be available to fully determine every
needed generalization."