Vision
Patrick Doyle | AI Qual Summary | May 13, 1997
The vision process consists of four sequential operations of increasing complexity:
(Nilsson roughly groups these stages into two parts: image processing, which does filtering to reduce noise and highlight important discontinuities, and scene analysis, which attempts to create an iconic or a feature-based description of the original scene.)
Classifying 3D objects is difficult because of the large number of possible orientations of each object. Because 2D images are highly ambiguous, having multiple images can aid this process. Stereo vision (having multiple, simultaneous views) is one way of doing this.
Image-level invariants are weak, while 3D invariants are strong and valuable for object-level interpretation. The inverse projection problem is simple to state but useless to solve directly, since there is no unique inverse: continuous infinities (of very high order) of surfaces are projectively equivalent, i.e., project to the same image.
While no low-level image can tell us what an object is, the object's surroundings provide us with top-down expectations. Such expectations are critical for interpreting visual scenes, but resolving them can be tricky.
Vision researchers for many years held that object identification required image processing on several levels, and that matching could only occur at the highest level.
This has turned out not to be the case; matching can be done at the
primal sketch level, rather than at the volume description level.
The simplest way to form an image is to use a pinhole camera. This is, basically, a box that has a small hole on one side to let the light through; if pointed at an object P at coordinates (X, Y, Z), it produces an inverted image on its rear wall (image plane) at coordinates (x, y). If the distance between the pinhole and the image plane (the focal length) is f, then we have that

x = -f X / Z, y = -f Y / Z.

This is known as perspective projection. The process can equivalently be modeled with the projection plane a distance f in front of the pinhole, in which case the image is upright and the minus signs disappear.
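As a concrete illustration, here is a minimal sketch (mine, not from the sources) of these projection equations, using the upright-image convention with the projection plane in front of the pinhole:

```python
import numpy as np

# Perspective projection for a pinhole camera with focal length f, using the
# upright-image convention (projection plane a distance f in front of the
# pinhole), so x = f * X / Z and y = f * Y / Z.
def perspective_project(points, f=1.0):
    """points: (N, 3) array of scene coordinates (X, Y, Z), with Z > 0."""
    points = np.asarray(points, dtype=float)
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * X / Z, f * Y / Z], axis=1)

# A point twice as far away projects half as far from the optical axis.
print(perspective_project([[1.0, 1.0, 2.0], [1.0, 1.0, 4.0]], f=0.05))
```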
If the object is relatively shallow compared to its distance from the camera, we can approximate perspective projection by scaled orthographic projection. The idea is that if the depth of points on the object varies in some range Z0 +/- DZ, where DZ << Z0, then the perspective scaling factor f/Z can be approximated by a constant f/Z0.
Note that under orthographic projection, parallel lines stay
parallel, and do not converge. This method is useful only as an
approximation for parts of the scene with little internal depth
variation.
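A small numerical sketch (mine) of the approximation: for a shallow object, the constant scale f/Z0 reproduces the perspective image closely.

```python
import numpy as np

# Scaled orthographic projection: replace the per-point scale f / Z with the
# constant f / Z0, where Z0 is the object's nominal depth.
def scaled_orthographic_project(points, f, z0):
    points = np.asarray(points, dtype=float)
    return (f / z0) * points[:, :2]

f = 0.05
pts = np.array([[0.3, 0.2, 10.0], [0.3, 0.2, 10.2]])   # DZ = 0.2 << Z0 = 10
persp = np.stack([f * pts[:, 0] / pts[:, 2], f * pts[:, 1] / pts[:, 2]], axis=1)
ortho = scaled_orthographic_project(pts, f, z0=pts[:, 2].mean())
print(np.abs(persp - ortho).max())   # on the order of 1e-5: a good approximation here
```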
Assume that the original image is represented as an m x n array of numbers (the image array), which divides the image plane into cells called pixels. These numbers represent the intensity at those points. Certain irregularities in the image can be smoothed out by an averaging operation.
This operation involves sliding an averaging window across the image array. The window is centered at each pixel in turn, and the weighted sum of the pixel values within the window is computed. This sum then replaces the original value at that pixel. This sliding and summing operation is called convolution.
Averaging tends to suppress isolated specks of noise, but also reduces the image crispness and loses small image components.
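A minimal sketch of this averaging convolution (mine; the 3 x 3 uniform window is just an example choice):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(64, 64)            # stand-in for the m x n image array
window = np.full((3, 3), 1.0 / 9.0)       # uniform 3 x 3 averaging window
# Slide the window over the image, replacing each pixel by the weighted sum
# of the pixel values under the window.
smoothed = convolve2d(image, window, mode="same", boundary="symm")
```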
A very common function used for smoothing is the two-dimensional Gaussian,

G(x, y) = (1 / (2 pi sigma^2)) exp(-(x^2 + y^2) / (2 sigma^2)),

where sigma is the standard deviation of the Gaussian, and determines the "width" of the surface; hence, the degree of smoothing.
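A sketch (mine) of a discrete version of this kernel; truncating at three standard deviations is a common choice, not something from the sources.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Discrete 2D Gaussian G(x, y) = exp(-(x^2 + y^2) / (2 sigma^2)) / (2 pi sigma^2)."""
    if radius is None:
        radius = int(3 * sigma)           # truncate at about 3 standard deviations
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()                    # renormalize so overall brightness is preserved

print(gaussian_kernel(sigma=1.0).shape)   # (7, 7); larger sigma means more smoothing
```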
At the edge between two flat faces, brightness should change sharply from one value to another. In practice the value is generally corrupted, because input devices don't usually produce simple, clean images. Problems also arise because images are complex -- edges may be slightly rounded, and there may be mutual-illumination effects, scratches, dust, etc.
In order to detect edges with arbitrary orientation in an image, the image is convolved with two filters, a horizontal one and a vertical one. Specifically, fV = G'(x)G(y) and fH = G'(y)G(x), where G(x) is the one-dimensional version of the Gaussian function defined above, and G'(x) is its derivative.
The Canny edge-detection algorithm convolves the image I(x, y) with each of these filters in turn, yielding RV(x, y) and RH(x, y). It then computes RV^2 + RH^2 and marks peaks of this magnitude that are above some threshold.
(This is from Russell and Norvig, and the above section is from Nilsson. I think Nilsson's operator can be used with its own derivative to avoid having to do convolution twice, but it's not really clear to me. Nilsson also introduces the idea of a Laplacian operator with which the Gaussian is composed, to get the edge detection. Forgive the confusion! -- POD)
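A rough sketch (mine, not the full Canny algorithm; non-maximum suppression and hysteresis thresholding are omitted) of the convolve-and-threshold step described above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_strength(image, sigma=1.0, threshold=0.1):
    # order=(0, 1) differentiates along x and smooths along y, i.e. fV = G'(x)G(y);
    # order=(1, 0) gives fH = G'(y)G(x).
    rv = gaussian_filter(image, sigma, order=(0, 1))
    rh = gaussian_filter(image, sigma, order=(1, 0))
    mag = rv**2 + rh**2
    return mag > threshold * mag.max()    # crude stand-in for "peaks above a threshold"

image = np.zeros((32, 32)); image[:, 16:] = 1.0   # a vertical step edge
edges = edge_strength(image)
print(edges[:, 14:18].any(), edges[:, :10].any())  # True False: only the step responds
```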
Convolution with two-dimensional Gaussian point-spread functions is equivalent to sequential convolution with one-dimensional Gaussian point-spread functions, one vertical and one horizontal. Thus two-dimensional convolution can be made fast.
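A quick numerical check (mine) of this equivalence using library Gaussian filters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_filter1d

image = np.random.rand(32, 32)
two_d = gaussian_filter(image, sigma=2.0)                      # 2D Gaussian smoothing
vert_then_horiz = gaussian_filter1d(gaussian_filter1d(image, 2.0, axis=0), 2.0, axis=1)
print(np.allclose(two_d, vert_then_horiz))                     # True: the results agree
```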
When the image is convolved with a wide sombrero (Laplacian-of-Gaussian) function, details blur out. The wider the function, the more the image is blurred and the fewer zero crossings remain; fewer zero crossings mean fewer accidental mismatches. A large sombrero width, however, also loses the details needed to make precise distance judgments.
Extracting 3D information from images has three aspects:
[Horn and Schunck] A method for computing optical flow by differentiating the brightness distribution of an image with respect to time. Optical flow is the distribution of apparent velocities of movement of brightness patterns in an image. It encodes rich information about scene structure and observer motion. One problem is that it is not possible to determine the component of the flow field perpendicular to the intensity gradient, that is, along the iso-brightness contours.
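A rough sketch (mine, following the iterative scheme in Horn and Schunck's paper rather than anything in these notes) of computing a flow field from two frames; the smoothness weight alpha and the neighborhood averaging are what let the method fill in the flow component along iso-brightness contours:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def horn_schunck(frame1, frame2, alpha=1.0, n_iters=100):
    """Estimate a dense flow field (u, v) between two grayscale float frames."""
    Ix = np.gradient(frame1, axis=1)      # spatial brightness derivatives
    Iy = np.gradient(frame1, axis=0)
    It = frame2 - frame1                  # temporal brightness derivative
    u = np.zeros_like(frame1)
    v = np.zeros_like(frame1)
    for _ in range(n_iters):
        u_avg = uniform_filter(u, size=3)     # local averages of the current flow
        v_avg = uniform_filter(v, size=3)
        t = (Ix * u_avg + Iy * v_avg + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_avg - Ix * t                    # pull the averaged flow toward the
        v = v_avg - Iy * t                    # brightness-constancy constraint
    return u, v
```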
The idea here is to use two sources separated in space (as opposed to time, which is what happens in optical flow) to extract additional data. If the two images are superimposed, there will be a disparity in the location of each image feature. This disparity allows the distance to the corresponding scene point to be determined.
Assume two lenses whose image planes (the surfaces where the image falls, as on the retina) lie a focal length f behind them, and whose optical axes are separated by a baseline distance. Consider a point P at perpendicular distance d from the plane of the lenses, lying a horizontal distance l from the left lens axis and r from the right lens axis, so that l + r equals the baseline. P appears in the left image at a distance a from the left lens axis, and in the right image at a distance b from the right lens axis. By similar triangles,

a / f = l / d and b / f = r / d.

Solving for d we have

d = f (l + r) / (a + b),

that is, the focal length times the baseline divided by a + b.
The sum a + b is called the disparity. The distance to the point is inversely proportional to it.
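A tiny numerical sketch (mine) of the resulting relation, with the baseline written as l + r:

```python
def depth_from_disparity(f, baseline, disparity):
    """d = f * (l + r) / (a + b): focal length times baseline over disparity."""
    return f * baseline / disparity

print(depth_from_disparity(f=0.05, baseline=0.1, disparity=0.002))  # 2.5
print(depth_from_disparity(f=0.05, baseline=0.1, disparity=0.001))  # 5.0: half the disparity, twice the depth
```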
The main problem with stereo vision is the correspondence problem. Corresponding points must always lie along epipolar lines in the images. This epipolar constraint reduces the correspondence problem from two dimensions to one.
Area-based approaches that attempt to correlate points have trouble when corresponding areas do not look similar, either because the surface reflectance has a significant specular component (so the brightness of a point depends on the viewpoint) or because there is a differing amount of foreshortening in the two views. An alternative approach does edge-finding and then tries to match the edges, but this gives depth information only along the edges, so additional work must be done to interpolate depth across the rest of the scene.
Texture is an important visual cue to the properties of a surface. It is easy to distinguish textures such as velvet, woolen weaves or foliage even in black-and-white images. The distinguishing characteristic is regularity of the texels.
Mathematical analysis allows one to compute expressions for the rate of change of various texel features in an image, such as area, foreshortening, and density. These texture gradients are functions of the surface's shape as well as slant and tilt with respect to the viewer.
Texture can be used to determine shape via a two-step process: (a) measure the texture gradients, and (b) estimate the surface shape, slant, and tilt that could give rise to them.
Shading -- the variation in light intensity received from different parts of a surface in a scene -- is determined by the geometry of the scene and the reflectance properties of the surfaces. In computer graphics, the goal is to determine the image brightness given the scene geometry and reflectance. In computer vision, one would like to invert this process, but it has turned out to be difficult in any but the simplest cases.
The real problem is that of interreflectance -- objects are illuminated not only by the light source, but by reflections of the light off of other objects. Reflectance maps fail completely in this situation.
An important special case is Lambertian reflectance, where the intensity varies as the dot product of the local surface normal and the direction of the light source.
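A minimal sketch (mine) of that relationship, with a hypothetical albedo parameter rho folded in:

```python
import numpy as np

def lambertian_brightness(normal, light_dir, albedo=1.0):
    """Brightness proportional to albedo * max(0, n . s) for unit vectors n and s."""
    n = np.asarray(normal, dtype=float)
    s = np.asarray(light_dir, dtype=float)
    n /= np.linalg.norm(n)
    s /= np.linalg.norm(s)
    return albedo * max(0.0, float(n @ s))

print(lambertian_brightness([0, 0, 1], [0, 0, 1]))   # 1.0: light along the normal
print(lambertian_brightness([0, 0, 1], [1, 0, 1]))   # ~0.707: light at 45 degrees
```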
The goal here is to determine the orientation of a surface given some shading information. The amount of light reflected from a surface depends on its material, and also upon the incident angle and emergent angle.
A reflectance map can be drawn using iso-brightness lines. If the light source is aimed straight at a sphere, the lines will be circles. At 45 degrees they will be offset to a side, and at 90 degrees they will be straight lines on half the sphere projection.
To determine surface orientation from perceived brightness, assumptions about smoothness can be used: the orientation at each point should be consistent with the perceived brightness there, and it should not vary much from the orientations at neighboring points.
[Horn, Woodham, and Silver] An approach to shape from shading. The camera position is fixed, and light sources are set up at two known points. The image from the first light restricts the surface orientation at (x, y) to the iso-brightness contour in the reflectance map corresponding to the brightness value computed from I1(x, y). Similarly the surface normal is constrained by the iso-brightness contour defined by I2(x, y) (the second light). A third light source provides complete disambiguation.
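A sketch (mine) of the underlying computation under Lambertian reflectance: each measured intensity constrains the albedo-scaled normal to a plane, and three known, non-coplanar source directions pin it down with a 3 x 3 linear solve.

```python
import numpy as np

def photometric_stereo(intensities, light_dirs):
    """Recover albedo rho and unit normal n from I_k = rho * (s_k . n)."""
    g = np.linalg.solve(np.asarray(light_dirs, dtype=float),
                        np.asarray(intensities, dtype=float))
    rho = np.linalg.norm(g)
    return rho, g / rho

# Three unit light directions (rows) and a synthetic Lambertian measurement.
S = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
S /= np.linalg.norm(S, axis=1, keepdims=True)
true_n, true_rho = np.array([0.0, 0.0, 1.0]), 0.8
I = true_rho * S @ true_n
print(photometric_stereo(I, S))            # recovers rho = 0.8 and n = (0, 0, 1)
```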
This approach tries to match elements of the image to templates stored in system memory.
Pixel templates come in four types:
Range sensors include a signal transmitter and a receiver. The round-trip travel time of the returning signal is measured, together with its intensity. In practice, the signals used are generally ultrasound and laser light. Ultrasound is adversely affected by surface specularity, and it also has poor spatial resolution, so it isn't often used. Usually the result is n x m pixels of ranges.
The schemes used are pulse time delay, which measures the time of flight directly; AM phase shift, which measures a phase shift proportional to the time of flight; and FM beat, whose beat frequency is also proportional to the flight time.
Problems include specular reflection, slow measurement, and
ambiguity in AM phase shift.
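A small numerical sketch (mine) of the first two schemes; the modulation frequency and phase values are made up.

```python
import numpy as np

C = 3.0e8                                  # speed of light in m/s

def range_from_pulse(round_trip_time):
    """Pulse time delay: the signal covers twice the range in the measured time."""
    return C * round_trip_time / 2.0

def range_from_am_phase(phase_shift, mod_freq):
    """AM phase shift: range proportional to phase, ambiguous modulo half a modulation wavelength."""
    wavelength = C / mod_freq
    return (phase_shift / (2 * np.pi)) * (wavelength / 2.0)

print(range_from_pulse(66.7e-9))           # ~10 m
print(range_from_am_phase(np.pi, 10e6))    # 7.5 m, but only known up to multiples of 15 m
```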
The object recognition problem can be defined as follows. Given a scene consisting of one or more objects chosen from a predefined collection, and an image of the scene taken from an unknown position and orientation, determine which objects are present and what the positions and orientations of those objects are relative to the viewer.
The two most popular representations for 3-D objects are polyhedral approximations and generalized cylinders. The former are general, but "painfully" long if high accuracy is required in modeling curved objects. They are also cumbersome for users to input.
Generalized cylinders provide compact descriptions of a wide variety of objects, and have been used in a number of object-recognition systems. However, they are not always ideal for representing arbitrary objects; the objects may have to be decomposed into many parts, and there may be several alternative decompositions. Also, the problem of effective shape representation for curved objects is largely unsolved.
This approach is used to identify 3-D objects from their projections on the image plane, with an initially unknown pose. The idea is to write equations relating the object's distinguished features in the database to those in the perceived image.
The result of [Huttenlocher and Ullman, 1990] states that: Given three noncollinear points m1, m2, and m3 in the model, and their projections on the image plane p1, p2, p3 under scaled orthographic projection, there exist exactly two transformations from the three-dimensional model coordinate frame to a two-dimensional image coordinate frame.
A generate-and-test approach produces correspondences between images and models and tests them for validity. The worst-case complexity of the alignment algorithm is O(m^4 n^3 log n), where m and n are the number of model and image points respectively. Various improvements (hypothesizing matches only between pairs of points, pose clustering, and randomization) can reduce the complexity to O(m n^3).
A problem with alignment methods is that they involve trying each model in the library, resulting in a complexity proportional to the number of models. One solution is to use geometric invariants as the shape representation.
The idea is to model objects in terms of shape descriptors that do not change with pose. A simple example is the "cross-ratio" of four points on a line. Invariants are significant because they can be used as index functions into models in the database. Hypotheses about matching objects can then be projected back into the image for verification.
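A small sketch (mine) checking this invariance numerically: the cross-ratio of four collinear points is unchanged by a 1-D projective transformation (the particular map coefficients below are arbitrary).

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross-ratio of four collinear points given by their parameters along the line."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

def projective_map(x, p=2.0, q=1.0, r=0.5, s=3.0):
    """An arbitrary nondegenerate 1-D projective transformation x -> (px + q) / (rx + s)."""
    return (p * x + q) / (r * x + s)

pts = np.array([0.0, 1.0, 2.0, 5.0])
print(cross_ratio(*pts))                   # 1.6
print(cross_ratio(*projective_map(pts)))   # 1.6 again: the value does not depend on pose
```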
Another advantage of invariant representation is that models can be
acquired directly from images; it is particularly useful in
applications such as recognition from satellite images.
[Huffman and Clowes, 1971] Huffman and Clowes independently attempted the first systematic approach to polyhedral scene analysis. They limited the analysis to scenes with opaque, trihedral solids -- objects in which exactly three plane surfaces come together at each vertex. Cracks ("edges" across which the tangent planes are continuous) were not allowed. This means that there are 18 possible trihedral vertices (out of 64 conceivable) in this world. In this blocks world, there are only three types of edges between the blocks: convex, concave, and occluding.
There are therefore only four ways to label a line: it can be convex, concave, or a boundary line (facing either direction). The direction of a boundary line is determined by observing which side of the line corresponds to a face of the object that causes it.
Waltz showed that, in this environment, the work required to analyze the drawings grows roughly linearly with the number of lines. Waltz's algorithm [1975] always finds the unique, correct labeling if one exists, but if the figure is ambiguous it will terminate with at least one vertex that has multiple labels. (Waltz's algorithm actually handled the Huffman-Clowes world with shadows, cracks, and separably concave edges.)
Subsequent work by [Mackworth, 1973] and [Sugihara, 1984]
generalized to arbitrary polyhedra and [Malik, 1987] to piecewise
smooth curved objects.
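A toy sketch (mine) of the constraint-propagation idea behind Waltz's algorithm. The two-entry "junction catalog" here is made up for illustration; the real Huffman-Clowes catalog contains the 18 trihedral junction labelings mentioned above.

```python
LABELS = {'+', '-', '>', '<'}          # convex, concave, and the two boundary directions

# Each junction lists the lines meeting there and the label combinations that
# are physically realizable for them (a made-up miniature catalog).
junctions = {
    'J1': (('a', 'b'), {('+', '+'), ('>', '<')}),
    'J2': (('b', 'c'), {('+', '-'), ('+', '+')}),
}
domains = {line: set(LABELS) for line in 'abc'}   # candidate labels per line

changed = True
while changed:                          # propagate until no domain shrinks further
    changed = False
    for lines, allowed in junctions.values():
        feasible = [combo for combo in allowed
                    if all(lab in domains[l] for l, lab in zip(lines, combo))]
        for i, line in enumerate(lines):
            supported = {combo[i] for combo in feasible}
            if domains[line] - supported:
                domains[line] &= supported
                changed = True

print(domains)   # J2 forces 'b' to '+', which in turn prunes 'a' to '+' at J1
```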
Human vision is rapid only because a massive amount of wetware is devoted to it; vision systems occupy a major portion of the cortex.
Helmholtz (1821-1894) did the early (and still relevant) work on vision. He claimed that there was a clear separation between low-level (early) and high-level (later) processing of images in the brain. Note that early and later in this context are not temporal terms, but distinguish image processing that takes place between the retina and the cortex from cortical processing.
Ernst Mach separated task analysis from its processing mechanisms.
A rough outline of approaches and emphases in vision:
Era | Theme | Comments |
---|---|---|
Very early work (c. 1970) | Sequential | Bottom-up, knowledge-free pattern recognition approaches using edge-detection and other low-level image operators |
Early work (c. 1975) | Heterogeneous | Top-down use of ad hoc domain knowledge |
Middle period (c. 1980) | Marr's Paradigm | Computational models of low-level vision modules |
Recent (c. 1985) | Model-Based | Knowledge-based, 3-D vision, systematic use of models |
Present | Systemic | Vision as part of larger implemented systems |
[Roberts, 1965] The first computer vision system. The universe of discourse was the blocks world. Processing concentrated on extracting line drawings, and an inference step matched them against objects in a database.
Line drawings were taken to represent a segmentation of the image into meaningful pieces. Edges were detected by rapid changes in the value of a function at a point, accompanied by large values in the derivative at that point.
[Brooks, 1981] A general vision system. ACRONYM is "complete" in the sense that its interpretation makes use of all available information, including multi-sensor data, knowledge of experts, geometric models of objects, and prediction of expectations.
It uses viewpoint-independent, 3D object models in the form of part/whole graphs, described in terms of generalized cylinder primitives. The predicted image features are shapes represented by ribbons and ellipses.
ACRONYM has weak segmentation; the ribbon finder locates spurious ribbons, and misses small ones. Grouping ribbons is only done in interpretation, not in segmentation. The line finder and ribbon finder both perform badly on textures. The top-down approach works poorly in complex scenes.
ACRONYM has been used as the basis of robot systems simulators, and
for automated object grasping, using a rule base to determine which
surfaces are accessible in the initial position, which accessible in
the final position, and how to grasp with maximum stability.
Ronny Kohavi, AI Qual Note #15: Vision.
Russell and Norvig, Artificial Intelligence: A Modern Approach, Chapter 24.
Nilsson, Draft of New AI Textbook, Chapter 6.
See the AI Qual Reading List for further details on these sources.
Patrick Doyle | pdoyle@cs.stanford.edu