Vision
Patrick Doyle AI Qual Summary May 13, 1997



Definitions

the vision problem
The real, three-dimensional world is converted into a two-dimensional image by the sensing apparatus. The vision problem is to reverse this process; that is, somehow the three-dimensional reality must be inferred from the two-dimensional image.


low-level (early) vision
Solutions using general-purpose assumptions and special-purpose hardware.


high-level (later) vision
Solutions using special-purpose assumptions and general-purpose hardware.


segmentation
The dual of edge detection: rather than marking the boundaries between regions, segmentation groups the image pixels into regions corresponding to meaningful scene entities (a fuller definition appears below).


primal sketch
A representation used to extract intensity changes corresponding to events in the scene. There are three stages in creating the primal sketch:

  1. Feature points are found at which intensity changes are deemed "significant." To find feature points the image is convolved with directional operators in "sufficiently many" directions. A single rotationally symmetric operator (the Laplacian) gives precisely the same results if a condition called linear variation holds.

    Because differentiation commutes with convolution, instead of smoothing with the Gaussian and then applying the Laplacian, we can convolve once with a single combined operator, the Laplacian of the Gaussian: (del^2 G) * image, where G is the Gaussian operator and * denotes convolution. (A minimal code sketch follows this list.)

  2. Feature points are grouped to form line segments or small, closed contours.

  3. Line segments are interpreted as scene events.
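
A minimal sketch of step 1, assuming NumPy and SciPy are available: it convolves the image with a Laplacian-of-Gaussian operator in one step and marks zero crossings whose local contrast exceeds a threshold. The function and parameter names are illustrative, not from the sources.

  import numpy as np
  from scipy import ndimage

  def feature_points(image, sigma=2.0, threshold=0.1):
      # Convolve with a Laplacian-of-Gaussian operator in a single pass.
      log = ndimage.gaussian_laplace(np.asarray(image, dtype=float), sigma=sigma)
      # Mark pixels where the response changes sign between a pixel and its
      # neighbor along either axis, with enough contrast to be "significant."
      marks = np.zeros(log.shape, dtype=bool)
      for axis in (0, 1):
          shifted = np.roll(log, -1, axis=axis)
          marks |= ((log * shifted) < 0) & (np.abs(log - shifted) > threshold)
      return marks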


generalized cylinders
[Binford] An approach in which all objects are considered to be composed of instances of a small set of prototype volumes, such as spheres, blocks, and triangular prisms. A generalized cylinder describes a volume by sweeping a cross-sectional area along a space curve (called the spine) while deforming it according to some sweeping rule.


ribbons
The two-dimensional analogue of generalized cylinders.


point-spread function
A function that performs in one step both the smoothing and the double differentiation of an input brightness array; the image is convolved with it to produce the output.


Lambertian surface
A surface that looks equally bright from all possible viewpoints. The observed brightness therefore depends only on the angle at which the light from the source is incident upon the surface. Lambertian surfaces are what are called matte or non-specular. For such surfaces, the brightest point is the one at which the surface normal points toward the light source.


pyramid
A pyramid is a data structure. It can be thought of as a sequence of two-dimensional arrays. The dimensions of the arrays double at each step in the sequence, so we have

P = < A_1x1, A_2x2, A_4x4, ..., A_512x512 >


These arrays consist of pixels that may contain binary, gray-scale, multi-spectral, or local feature information. When going up a level (toward the coarser arrays), each value is the average of the corresponding 2x2 block of values in the level below.

An example of a typical use of pyramids is seen in the following edge-detection algorithm. A level s is examined for edges: all edges found at level s in the input pyramid will be indicated in the output pyramid at level s. If the edge strength at a pixel is greater than some threshold, the edge is refined recursively. The four corresponding pixels will be examined at the next higher level of resolution.
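
A minimal sketch of pyramid construction by 2x2 averaging, assuming NumPy and a square image whose side is a power of two; the names are illustrative.

  import numpy as np

  def build_pyramid(image):
      # Levels are stored finest-first while building, then reversed so the
      # result matches the definition P = < A_1x1, ..., A_512x512 >.
      levels = [np.asarray(image, dtype=float)]
      while levels[-1].shape[0] > 1:
          a = levels[-1]
          h, w = a.shape
          # Each coarser value is the average of a 2x2 block one level below.
          levels.append(a.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
      return levels[::-1]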


quad trees
Data structures similar to pyramids; they have nodes that correspond to the cells of a pyramid, and each nonterminal node has four child nodes in the level below.


two-and-a-half-dimension
Marr's intermediate, viewer-centered representation in which the surfaces implicit in the primal sketch are made explicit: the depth and orientation of the visible surfaces, together with their discontinuities (see Marr's Paradigm, below).


texel
A primitive unit of texture (such as velvet or grass), used in determining shape from texture.


epipolar plane
The plane in which the triangle used in range finding by triangulation lies (the plane containing the scene point and the two viewpoints).


epipolar line
The line along which the epipolar plane intersects a camera's image plane (in the usual setup, the horizontal epipolar plane intersects the vertical image plane to form this line).


emergent angle
Angle between the surface normal of an object and the direction to the viewer.


incident angle
Angle between the surface normal of an object and the direction of the light source.


phase angle
The angle between the direction to the light source and the direction to the viewer; when both lie in the same plane as the surface normal, it is the sum of the emergent angle and the incident angle.


iso-brightness lines
Lines along which the brightness is constant (for example, contours of constant brightness in a reflectance map). Damn all this terminology anyway.


boundary lines
Lines where one object face hides another.


Waltz's set
In a world of blocks in which all figures are composed of trihedral vertices, this is the set of all possible junction labels, allowing for shadows and cracks.


Müller-Lyer illusion
Two arrowheads that delimit two line segments of equal length. The one enclosed in the convex region appears shorter than the one in the concave region. (This is the parallelogram with the two line segments in it -- POD)


Kanizsa subjective edge
The invisible box formed by the four Pac-Man corners facing inwards. The inference is that knowledge from the most abstract reasoning levels influences the most primitive ones.


convolution
The process of "sliding and summing" over an image intensity array with an averaging window, in which a local average of intensities is computed and replaces the original array element value.


perspective projection
Image projection through a pinhole camera (see Image Formation, below).


Lambert's cosine law
The intensity of light reflected from a perfectly diffusing surface is given by E = r E0 cos(theta), where E0 is the intensity of the light source, r is the albedo (varying from 0, which is perfectly black, to 1, which is pure white), and theta is the angle between the light direction and the surface normal.


specular reflection
Specularly reflected light is light reflected from the outer surface of an object.


segmentation
The process of organizing the array of image pixels into regions corresponding to "semantically meaningful" entities in the scene.


reflectance map
An image map that specifies the brightness of a surface patch as a function of its surface normal. Used to determine shape from shading.


limb
The locus of points on a surface where the line of sight is tangent to the surface (think about the Sun -- POD)


edge
A surface normal discontinuity.


general viewpoint condition
Small movements of the vision sensor do not cause the character of perceived junctions to change. Used in Huffman and Clowes' blocks world.


geometric invariant
A shape descriptor that has the same value whether measured on the object itself or from a perspective image of the object, and that is unaffected by the object's pose.



Overview

The vision process consists of four sequential operations of increasing complexity:

  1. Signal processing -- Enhancing the image, either for human consumption or as input to another program.

  2. Measurement analysis -- Determining the 2D extent of an image consisting of one object.

  3. Pattern recognition -- Categorizing an object as one of a finite set of possibilities.

  4. Image understanding -- Locating objects in an image, classifying them, and then building a 3D model of the scene.

(Nilsson roughly groups these stages into two parts: image processing, which does filtering to reduce noise and highlight important discontinuities, and scene analysis, which attempts to create an iconic or a feature-based description of the original scene.)

Classifying 3D objects is difficult because of the large number of possible orientations of each object. Because 2D images are highly ambiguous, having multiple images can aid this process. Stereo vision (having multiple, simultaneous views) is one way of doing this.

Image-level invariants are weak, while 3D invariants are strong and valuable for object-level interpretation. The inverse projection problem is simple to pose but useless, since there is no unique inverse: any image is consistent with continuous infinities of projectively equivalent surfaces.

While no low-level image can tell us what an object is, the object's surroundings provide us with top-down expectations. Such expectations are critical for interpreting visual scenes, but resolving them can be tricky.

Marr's Paradigm -- Shape from Images

Vision researchers for many years held that object identification required image processing on several levels, and that matching could only occur on the highest:

  1. At the lowest descriptive level, brightness values are conveyed explicitly in the image.

  2. The brightness changes are described explicitly in the primal sketch.

  3. The surfaces implicit in the primal sketch are described explicitly in the two-and-one-half-dimensional sketch.

  4. The volumes implicit in the two-and-one-half-dimensional sketch are described explicitly in the volume description.

This has turned out not to be the case; matching can be done at the primal sketch level, rather than at the volume description level.


Image Formation

Pinhole Camera

The simplest way to form an image is to use a pinhole camera. This is, basically, a box that has a small hole on one side to let the light through; if pointed at an object P at coordinates (X, Y, Z), it produces an inverted image on its rear wall (image plane) at coordinates (x, y, z). If the distance between the pinhole and the image plane (the focal length) is f, then we have that

x = -fX / Z and y = -fY / Z

This is known as perspective projection. The process can equivalently be modeled with the projection plane placed a distance f in front of the pinhole -- in which case the image is upright.

If the object is relatively shallow compared to its distance from the camera, we can approximate perspective projection by scaled orthographic projection. The idea is that if the depth of points on the object varies in some range Z0 +/- DZ, where DZ << Z0, then the perspective scaling factor f/Z can be approximated by a constant f/Z0.

Note that under orthographic projection, parallel lines stay parallel, and do not converge. This method is useful only as an approximation for parts of the scene with little internal depth variation.
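
A minimal sketch of the two projection models just described, assuming NumPy; the function names, f, and Z0 are parameters of the illustration.

  import numpy as np

  def perspective_project(points, f):
      # Pinhole model: (X, Y, Z) -> (-f X / Z, -f Y / Z).
      X, Y, Z = np.asarray(points, dtype=float).T
      return np.stack([-f * X / Z, -f * Y / Z], axis=1)

  def scaled_orthographic_project(points, f, Z0):
      # Shallow-object approximation: the scaling factor f / Z is replaced
      # by the constant f / Z0.
      X, Y, _ = np.asarray(points, dtype=float).T
      return np.stack([-(f / Z0) * X, -(f / Z0) * Y], axis=1)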


Low-Level Operators

Averaging

Assume that the original image is represented as an m x n array of numbers (called the image array) that divides the image plane into cells called pixels. These numbers represent the intensity at those points. Certain irregularities in the image can be smoothed out by an averaging operation.

This operation involves sliding an averaging window across the image array. The window is centered at each pixel in turn, and the weighted sum of the pixel values within the window is computed. This sum then replaces the original value at that pixel. This sliding and summing operation is called convolution.

Averaging tends to suppress isolated specks of noise, but also reduces the image crispness and loses small image components.

A very common function used for smoothing is the Gaussian of two dimensions,

G(x, y) = (1 / (2 pi sigma^2)) e^(-(x^2 + y^2) / (2 sigma^2))

where sigma is the standard deviation of the Gaussian, and determines the "width" of the surface and hence the degree of smoothing.
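
A minimal sketch of the averaging operation, assuming NumPy and SciPy: it samples the 2-D Gaussian above into a window of weights and convolves it with the image. The parameter values are illustrative.

  import numpy as np
  from scipy import ndimage

  def gaussian_kernel(sigma, radius):
      # Discrete samples of G(x, y) = (1 / (2 pi sigma^2)) e^(-(x^2 + y^2) / (2 sigma^2)),
      # renormalized so the window weights sum to one.
      ax = np.arange(-radius, radius + 1)
      xx, yy = np.meshgrid(ax, ax)
      g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
      return g / g.sum()

  def smooth(image, sigma=1.0):
      # Slide the averaging window over the image and replace each pixel by
      # the weighted sum of its neighborhood (convolution).
      kernel = gaussian_kernel(sigma, radius=int(3 * sigma) + 1)
      return ndimage.convolve(np.asarray(image, dtype=float), kernel, mode="nearest")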

The Edge Operator

At the edge between two flat faces, brightness should change sharply from one value to another. In practice the value is generally corrupted, because input devices don't usually produce simple, clean images. Problems also arise because images are complex -- edges may be slightly rounded, and there may be mutual-illumination effects, scratches, dust, etc.

In order to detect edges with arbitrary orientation in an image, the image is convolved with two filters, a horizontal one and a vertical one. Specifically, fV = G'(x)G(y) and fH = G'(y)G(x), where G(x) is the one-dimensional version of the Gaussian function defined above, and G'(x) is its derivative.

The Canny edge-detection algorithm is then to convolve the image I(x, y) with each of those filters in turn, yielding RV(x, y) and RH(x, y). It then computes RV^2 + RH^2 and marks peaks of this combined value that are above some threshold.

(This is from Russell and Norvig, and the above section is from Nilsson. I think Nilsson's operator can be used with its own derivative to avoid having to do convolution twice, but it's not really clear to me. Nilsson also introduces the idea of a Laplacian operator with which the Gaussian is composed, to get the edge detection. Forgive the confusion! -- POD)
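
A minimal sketch of the convolve-and-threshold step described above, assuming NumPy/SciPy; it uses derivative-of-Gaussian filters for fV and fH and omits Canny's non-maximum suppression and hysteresis steps. Parameter values are illustrative.

  import numpy as np
  from scipy import ndimage

  def edge_candidates(image, sigma=1.0, threshold=0.1):
      img = np.asarray(image, dtype=float)
      # order=(0, 1) differentiates along x (columns), giving fV = G'(x)G(y);
      # order=(1, 0) differentiates along y (rows), giving fH = G'(y)G(x).
      RV = ndimage.gaussian_filter(img, sigma, order=(0, 1))
      RH = ndimage.gaussian_filter(img, sigma, order=(1, 0))
      strength = RV**2 + RH**2
      return strength > threshold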

Convolution with two-dimensional Gaussian point-spread functions is equivalent to sequential convolution with one-dimensional Gaussian point-spread functions, one vertical and one horizontal. Thus two-dimensional convolution can be made fast.

When using a convolution with a wide sombrero function, details blur out. The wider the function, the more the image is blurred, and the number of zero crossings decreases. The fewer zero crossings, the smaller the number of accidental mismatches. A large sombrero width also loses the details needed to make precise distance judgments.


Extracting 3D Information Using Vision

Extracting 3D information from images has three aspects:

  1. Segmentation of the scene into distinct objects.
  2. Determining the position and orientation of each object relative to the observer.
  3. Determining the shape of each object.

Shape from Motion / Optical Flow

[Horn and Schunck] A method for computing optical flow by differentiating the brightness distribution of an image with respect to time. Optical flow is the distribution of velocities of apparent movement caused by smoothly changing brightness patterns. Optical flows encode rich information about a scene and observer motion. One problem is that it is not possible to determine the component of the flow field perpendicular to the intensity gradient, that is, along the iso-brightness contours.
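
The standard formulation behind this limitation (a reasoning step not spelled out above, but standard for Horn and Schunck's method): if I(x, y, t) is the image brightness and the brightness of a moving scene point is assumed constant, then dI/dt = 0 along the motion, which expands to

Ix u + Iy v + It = 0

where (u, v) is the optical flow at the pixel and Ix, Iy, It are the partial derivatives of brightness with respect to x, y, and t. This is a single linear constraint per pixel, so only the flow component along the brightness gradient is determined; that is exactly the aperture problem noted above, and Horn and Schunck add a global smoothness term to recover the full flow field.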

Binocular Stereopsis

The idea here is to use two sources separated in space (as opposed to time, which is what happens in optical flow) to extract additional data. If the two images are superimposed, there will be a disparity in the location of any given image feature. This allows the distance to the corresponding point in the scene to be determined.

Assume two lenses, separated from the image planes (the place where the image falls on the retina) by a focal length f, and separated from one another by a baseline distance B. A scene point P is at a horizontal distance l from the left lens axis and r from the right. P appears in the left image at a distance a from the left lens axis, and in the right image at a distance b from the right lens axis. By similar triangles

d/l = (d + f) / (l + a) and d/r = (d + f) / (r + b)

Solving for d we have

d = (f x (l + r)) / (a + b) = (f x B) / (a + b), since l + r = B for this geometry.

The sum a + b is called the disparity. The distance to the point is inversely proportional to it.
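
A minimal sketch of the depth computation under the geometry above (focal length f, baseline B, disparity a + b); the numbers in the example are illustrative.

  def depth_from_disparity(f, baseline, disparity):
      # d = f (l + r) / (a + b) = f B / (a + b) for the geometry above.
      if disparity <= 0:
          raise ValueError("disparity must be positive")
      return f * baseline / disparity

  # Example: f = 0.05 m, baseline B = 0.10 m, disparity a + b = 0.001 m  ->  d = 5 m.
  print(depth_from_disparity(0.05, 0.10, 0.001))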

The main problem with stereo vision is the correspondence problem. Corresponding points must always lie along epipolar lines in the images. This epipolar constraint reduces the correspondence problem from two dimensions to one.

Area-based approaches that attempt to correlate points have trouble when corresponding areas do not look similar, either because the surface reflectance has a significant specular component, so the brightness of a point depends on the viewpoint, or there is a differing amount of foreshortening in the two views. An alternative approach does edge-finding and then tries to match the edges, but this approach gives limited depth information, so additional work must be done to interpolate depth in the scene.

Shape from Texture

Texture is an important visual cue to the properties of a surface. It is easy to distinguish textures such as velvet, woolen weaves or foliage even in black-and-white images. The distinguishing characteristic is regularity of the texels.

Mathematical analysis allows one to compute expressions for the rate of change of various texel features in an image, such as area, foreshortening, and density. These texture gradients are functions of the surface's shape as well as slant and tilt with respect to the viewer.

Texture can be used to determine shape via a two-step process: (a) measure the texture gradients, and (b) estimate the surface shape, slant, and tilt that could give rise to them.

Shape from Shading

Shading -- the variation in light intensity received from different parts of a surface in a scene -- is determined by the geometry of the scene and the reflectance properties of the surfaces. In computer graphics, the goal is to determine the image brightness given the scene geometry and reflectance. In computer vision, one would like to invert this process, but it has turned out to be difficult in any but the simplest cases.

The real problem is that of interreflectance -- objects are illuminated not only by the light source, but by reflections of the light off of other objects. Reflectance maps fail completely in this situation.

An important special case is Lambertian reflectance, where the intensity varies as the vector dot product of the local surface normal and the direction of the input source.

Computing Surface Direction

The goal here is to determine the orientation of a surface given some shading information. The amount of light reflected from a surface depends on its material, and also upon the incident angle and emergent angle.

A reflectance map can be drawn using iso-brightness lines. If the light source is aimed straight at a sphere, the lines will be circles. At 45 degrees they will be offset to a side, and at 90 degrees they will be straight lines on half the sphere projection.

To determine surface direction from perceived brightness, two assumptions can be used: the surface orientation at each point should be consistent with the perceived brightness there (via the reflectance map), and it should not vary much from the orientations at neighboring points (smoothness).

Photometric Stereo

[Horn, Woodham, and Silver] An approach to shape from shading. The camera position is fixed, and light sources are set up at two known points. The image from the first light restricts the surface orientation at (x, y) to the iso-brightness contour in the reflectance map corresponding to the brightness value computed from I1(x, y). Similarly the surface normal is constrained by the iso-brightness contour defined by I2(x, y) (the second light). A third light source provides complete disambiguation.
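
A minimal sketch of the three-light case for a single pixel, assuming NumPy and a Lambertian surface: with the three unit light directions stacked as the rows of a matrix L and the three measured brightnesses I, the product of albedo and unit normal, g, satisfies L g = I, so g can be recovered by solving the linear system. The names are illustrative.

  import numpy as np

  def photometric_stereo_pixel(light_dirs, intensities):
      # light_dirs: 3x3 matrix, one unit light direction per row.
      # intensities: brightness of this pixel in each of the three images.
      L = np.asarray(light_dirs, dtype=float)
      I = np.asarray(intensities, dtype=float)
      g = np.linalg.solve(L, I)      # g = albedo * unit surface normal
      albedo = np.linalg.norm(g)
      return g / albedo, albedo      # (unit normal, albedo)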

Template Matching

This approach tries to match elements of the image to templates stored in system memory.

Pixel-Level Templates

Pixel templates come in four types:

Time of Flight

Range sensors include a signal transmitter and a receiver. The round-trip travel time of the returning signal is measured, together with its intensity. In practice, the signals used are generally ultrasound and laser light. Ultrasound is adversely affected by surface specularity, and it also has poor spatial resolution, so it isn't often used. Usually the result is n x m pixels of ranges.

The schemes used are pulse time delay, which measures the time of flight directly, AM phase shift, which measures the shift that is proportional to the time of flight, and FM beat, also proportional to the flight time.

Problems include specular reflection, slow measurement, and ambiguity in AM phase shift.
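
For the pulse time delay scheme, a minimal sketch of the range computation; the speed value assumes laser light in air, and the example time is illustrative.

  def range_from_round_trip(t_seconds, speed=3.0e8):
      # The signal travels to the surface and back, so range = speed * t / 2.
      return speed * t_seconds / 2.0

  # Example: a 20-nanosecond round trip corresponds to a range of about 3 m.
  print(range_from_round_trip(20e-9))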


Object Representation and Recognition

The object recognition problem can be defined as follows. Given a scene consisting of one or more objects chosen from a predefined collection, and an image of the scene taken from an unknown position and orientation, determine which objects are present and what the positions and orientations of those objects are relative to the viewer.

The two most popular representations for 3-D objects are polyhedral approximations and generalized cylinders. The former are general, but "painfully" long if high accuracy is required in modeling curved objects. They are also cumbersome for users to input.

Generalized cylinders provide compact descriptions of a wide variety of objects, and have been used in a number of object-recognition systems. However, they are not always ideal for representing arbitrary objects; the objects may have to be decomposed into many parts, and there may be several alternative decompositions. Also, the problem of effective shape representation for curved objects is largely unsolved.

The Alignment Method

This approach is used to identify 3-D objects from their projections on the image plane, with an initially unknown pose. The idea is to write equations relating the object's distinguished features in the database to those in the perceived image.

The result of [Huttenlocher and Ullman, 1990] states that: Given three noncollinear points m1, m2, and m3 in the model, and their projections on the image plane p1, p2, p3 under scaled orthographic projection, there exist exactly two transformations from the three-dimensional model coordinate frame to a two-dimensional image coordinate frame that map m1, m2, m3 to p1, p2, p3.

A generate-and-test approach produces correspondences between images and models and tests them for validity. The worst-case complexity of the alignment algorithm is O(m^4 n^3 log n), where m and n are the number of model and image points respectively. Various improvements (hypothesizing matches only between pairs of points, pose clustering, and randomization) can reduce the complexity to O(m n^3).

Using Projective Invariants

A problem with alignment methods is that they involve trying each model in the library, resulting in a complexity proportional to the number of models. One solution is to use geometric invariants as the shape representation.

The idea is to model objects in terms of shape descriptors that do not change with pose. A simple example is the "cross-ratio" of four points on a line. Invariants are significant because they can be used as index functions into models in the database. Hypotheses about matching objects can then be projected back into the image for verification.
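
A minimal sketch of the cross-ratio example: for four collinear points A, B, C, D (in that order along the line), the value (AC x BD) / (BC x AD) is preserved by perspective projection of the line, so the same number can be computed from the model and from the image. The point coordinates below are illustrative.

  import numpy as np

  def cross_ratio(a, b, c, d):
      # Cross-ratio of four collinear points, invariant under perspective projection.
      dist = lambda p, q: np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
      return (dist(a, c) * dist(b, d)) / (dist(b, c) * dist(a, d))

  # Four collinear points on the x-axis; any perspective image of this line
  # yields the same value for the corresponding image points.
  print(cross_ratio((0, 0), (1, 0), (3, 0), (7, 0)))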

Another advantage of invariant representation is that models can be acquired directly from images; it is particularly useful in applications such as recognition from satellite images.


Vision in a Simple Microworld -- Waltz's Algorithm

[Huffman and Clowes, 1971] Huffman and Clowes independently attempted the first systematic approach to polyhedral scene analysis. They limited the analysis to scenes with opaque, trihedral solids -- objects in which exactly three plane surfaces come together at each vertex. Cracks ("edges" across which the tangent planes are continuous) were not allowed. This means that there are 18 possible trihedral vertices (out of 64 conceivable) in this world. In this blocks world, there are only three types of edges between the blocks: convex edges, concave edges, and occluding (boundary) edges.

In this blocks world, there are only four ways to label a line. The line can be convex, concave, or a boundary line (facing either direction). The direction of a boundary line is determined by observing which side of the line corresponds to a face of the object that causes it.

Waltz showed that, in this environment, the work required to analyze the drawings grows roughly linearly with the number of lines. Waltz's algorithm [1975] always finds the unique, correct labeling if one exists, but if the figure is ambiguous it will terminate with at least one vertex that has multiple labels. (Waltz's algorithm actually handled the Huffman-Clowes world extended with shadows, cracks, and separable concave edges.)
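
A minimal sketch of the constraint-propagation (filtering) step at the heart of Waltz's algorithm, assuming the caller supplies the junction catalog and adjacency; the data structures and names here are illustrative, not Waltz's actual implementation. Each junction keeps a set of candidate labelings, and a labeling is discarded when no labeling of a neighboring junction assigns the same label to the shared line.

  from collections import deque

  def waltz_filter(candidates, neighbors, compatible):
      # candidates: dict junction -> set of possible junction labelings
      # neighbors:  dict junction -> list of junctions sharing a line with it
      # compatible: compatible(j1, lab1, j2, lab2) is True when lab1 and lab2
      #             give the same label to the line shared by j1 and j2
      queue = deque(candidates)
      while queue:
          j = queue.popleft()
          for n in neighbors[j]:
              kept = {lab for lab in candidates[j]
                      if any(compatible(j, lab, n, lab2) for lab2 in candidates[n])}
              if kept != candidates[j]:
                  candidates[j] = kept
                  # A change here may invalidate labelings at adjacent junctions.
                  queue.extend(neighbors[j])
      return candidates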

Subsequent work by [Mackworth, 1973] and [Sugihara, 1984] generalized to arbitrary polyhedra and [Malik, 1987] to piecewise smooth curved objects.


History of Early Vision

Human vision is rapid only because a massive amount of wetware is devoted to it; vision systems occupy a major portion of the cortex.

Helmholtz (1821-1894) did the early (and still relevant) work on vision. He claimed that there was a clear separation between low-level (early) and high-level (later) processing of images in the brain. Note that early and later in this context are not temporal terms, but distinguish image processing that takes place between the retina and the cortex from cortical processing.

Ernst Mach separated task analysis from its processing mechanisms.

A rough outline of approaches and emphases in vision:

Era                          Theme             Comments
Very early work (c. 1970)    Sequential        Bottom-up, knowledge-free pattern recognition approaches using edge-detection and other low-level image operators
Early work (c. 1975)         Heterogeneous     Top-down use of ad hoc domain knowledge
Middle period (c. 1980)      Marr's Paradigm   Computational models of low-level vision modules
Recent (c. 1985)             Model-Based       Knowledge-based, 3-D vision, systematic use of models
Present                      Systemic          Vision as part of larger implemented systems



Vision Systems

Roberts' Program -- Blocks World

[Roberts, 1965] The first computer vision system. The universe of discourse was the blocks world. Processing concentrated on extracting line drawings, which inference then matched against objects in a database.

Line drawings were taken to represent a segmentation of the image into meaningful pieces. Edges were detected as points where the value of the intensity function changes rapidly, that is, where its derivative is large.

ACRONYM

[Brooks, 1981] A general vision system. ACRONYM is "complete" in the sense that its interpretation makes use of all available information, including multi-sensor data, knowledge of experts, geometric models of objects, and prediction of expectations.

It uses viewpoint-independent, 3D object models in the form of part/whole graphs, described in terms of generalized cylinder primitives. The predicted image features are shapes represented by ribbons and ellipses.

ACRONYM has weak segmentation; the ribbon finder locates spurious ribbons, and misses small ones. Grouping ribbons is only done in interpretation, not in segmentation. The line finder and ribbon finder both perform badly on textures. The top-down approach works poorly in complex scenes.

ACRONYM has been used as the basis of robot systems simulators, and for automated object grasping, using a rule base to determine which surfaces are accessible in the initial position, which accessible in the final position, and how to grasp with maximum stability.


Sources

Ronny Kohavi, AI Qual Note #15: Vision.

Russell and Norvig, Artificial Intelligence: A Modern Approach, Chapter 24.

Nilsson, Draft of New AI Textbook, Chapter 6.

See the AI Qual Reading List for further details on these sources.


Patrick Doyle pdoyle@cs.stanford.edu