Linear Thinking

Consider a linear system:

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\A}{\mathbf{A}} \newcommand{\u}{\mathbf{u}} \newcommand{\v}{\mathbf{v}} \newcommand{\w}{\mathbf{w}} \newcommand{\vv}[1]{\mathbf{#1}} \newcommand{\MM}[4]{\begin{bmatrix} #1 & #2 \\ #3 & #4 \\ \end{bmatrix} } \newcommand{\VV}[2]{\begin{bmatrix} #1 \\ #2 \\ \end{bmatrix} } \begin{align} 3x + 2y + z &= 39 \\ 2x + 3y + z &= 34 \\ x + 2y + 3z &= 26 \end{align} \]

Recall that in some applications, we must repeatedly solve the system with different values on the right-hand side while the left-hand side stays fixed. Thus it is worth studying the left-hand side of the system in isolation:

\[ \begin{bmatrix} 3 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 2 & 3 \end{bmatrix} \]

rice = [[3,2,1], [2,3,1], [1,2,3]]

Each expression such as \(3x + 2y + z\) is a linear combination of the variables, that is, the sum of scalar multiples of the variables.

This expression is also the dot product of two vectors, namely tuples of reals, where one vector is \((3, 2, 1)\) and the other vector is \((x, y, z)\). A dot product is denoted with a centered dot:

\[ (3, 2, 1) \cdot (x, y, z) = 3x + 2y + z \]

Our example matrix takes three input numbers \(x, y, z\) and yields three output numbers, one for each linear combination. In general, if we are working with real numbers, an \(m\times n\) matrix represents \(m\) linear combinations of \(n\) variables, which we view as a function:

\[ f: \R^n → \R^m \]

To evaluate \(f\) on a vector of \(n\) input reals, we compute \(m\) dot products:

dot xs ys = sum $ zipWith (*) xs ys
matrixFun f vs = dot vs <$> f

matrixFun rice [1000000,1000,1]
matrixFun rice [37/4, 17/4, 11/4]

Some puzzles naturally arise:

  1. Functions represented by \(m\times n\) matrices are a strict subset of all functions from \(\R^n\) to \(\R^m\). For starters, every such function must map a zero vector to a zero vector. What else is special about them?

  2. If we compose functions represented by matrices, can the resulting function also be represented by a matrix?

Matrix Product

The second question is less vague and easier to answer.

Let \( A: \R^n → \R^m \) be the function represented by:

\[ \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \\ \end{bmatrix} \]

Let \(B : \R^d → \R^n\) be the function represented by:

\[ \begin{bmatrix} b_{11} & \cdots & b_{1d} \\ \vdots & \ddots & \vdots \\ b_{n1} & \cdots & b_{nd} \\ \end{bmatrix} \]

Let’s name the above matrices \(\vv{A}\) and \(\vv{B}\).

Let \(C\) be their composition \(A \circ B : \R^d → \R^m \), and let \(\v = (v_1, \dots, v_d)\) be a vector of reals. Then via brute force calculation of \(A(B(\v))\), we find \(C\) can be represented by the matrix \(\vv{C}\) whose entry \(c_{ij}\) at row \(i\) and column \(j\) is:

\[ c_{ij} = \sum_{k=1}^n a_{ik}b_{kj} \]

We call \(\vv{C}\) the matrix product of \(\vv{A}\) and \(\vv{B}\):

\[ \vv{C} = \vv{A} \vv{B} \]

Thus we have answered the second question. The set of functions that can be represented by matrices is closed under composition.

Let’s confirm it by squaring our example matrix:

matrixMul aRows bRows = go <$> aRows where
  go row = dot row <$> transpose bRows

matrixMul rice rice
matrixFun (matrixMul rice rice) [1000000,1000,1]
matrixFun rice $ matrixFun rice [1000000,1000,1]

Function composition is associative, hence matrix multiplication is associative, which is hard to see from the equations defining \(\vv{C}\). A direct proof would need to manipulate linear combinations of linear combinations of linear combinations; it’s magical that we can avoid monstrous calculations simply by thinking with functions.
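
As a sanity check (though of course not a proof), we can compare the two ways of parenthesizing a triple product of our example matrix; both sides should be equal:

matrixMul (matrixMul rice rice) rice == matrixMul rice (matrixMul rice rice)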

Applying a matrix \(\vv{A}\) to an \(n\)-dimensional vector input is equivalent to viewing the vector as an \(n \times 1\) matrix \(\vv{B}\) and computing \(\vv{A}\vv{B}\); thus we call this matrix-vector multiplication.

matrixFun rice [1000000,1000,1]
matrixMul rice $ map (:[]) [1000000,1000,1]

Thus linear systems can be phrased as a matrix equation:

\[ A \v = \w \]

where \(\v\) is a vector of variables, like \((x, y, z)\), and \(\w\) is a vector of constants, like \((39, 34, 26)\).

The formula for matrix products matches rows with columns, which may seem unintuitive. Why aren’t we pairing rows with rows? Or columns with columns?

I came to peace with this by likening function composition to connecting extension cords: to plug one extension cord into another to make a longer extension cord, we must match complementary ends. Thus connecting functions together requires connecting the output of one to the input of the other, which in matrix form becomes connecting the rows of one to the columns of the other.

Linear Maps

To answer the first question, we first define sums and scalar multiples of vectors of \(\R^n\) and \(\R^m\). We simply perform the operation on each of the coordinates:

vectorSum = zipWith (+)
vectorScale = map . (*)

Since linear combinations are by definition sums of scalar multiples, we find that:

\[ \begin{align} f(\v) + f(\w) &= f(\v + \w) \\ f (a \v) &= a f(\v) \end{align} \]

for all \(\v, \w \in \R^n\) and \(a \in \R\). A function \(f : \R^n → \R^m\) satisfying these conditions is a linear transformation or linear map.
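
For instance, a quick numeric check of these two conditions with our example matrix (again, not a proof):

matrixFun rice (vectorSum [1,2,3] [4,5,6]) == vectorSum (matrixFun rice [1,2,3]) (matrixFun rice [4,5,6])
matrixFun rice (vectorScale 7 [1,2,3]) == vectorScale 7 (matrixFun rice [1,2,3])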

When \(m = 1\), that is \(f : \R^n → \R \), we call \(f\) a linear functional.

When \(m = n\), that is \(f : \R^n → \R^n \), we call \(f\) a linear operator.

We leave it as an exercise to show that any linear transformation can be represented with a matrix. (Hint: consider the standard basis vectors \((1,0,\dots,0), \dots, (0,\dots,0,1)\).)
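
To make the hint concrete, here is a sketch of the idea (the helpers basis and matrixOf are our own names): applying a linear map to each standard basis vector recovers a column of its matrix.

-- basis n: the n standard basis vectors of R^n.
basis n = [[if i == j then 1 else 0 | j <- [1..n]] | i <- [1..n]]
-- matrixOf f n: the images of the basis vectors are the columns, so transpose to get rows.
matrixOf f n = transpose $ f <$> basis n
matrixOf (matrixFun rice) 3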

Thus we have answered the first question. Functions that can be represented by matrices are precisely the linear transformations, that is, functions that commute with vector addition and scalar multiplication.

Along the way, we have provided an alternative route to answering the second question. From the definition of linear maps, we can show the composition of linear maps is linear, which then implies their composition can be represented as a matrix. However, we still want an explicit formula for the matrix product, so we prefer the original proof.

Computer Graphics

While some might find the above fascinating, my younger self did not. I was content with solving systems of linear equations; why bother going beyond what the ancients knew? However, my younger self was also obsessed with computer graphics, which turned out to be a gateway to the next level of linear algebra.

Like Fermat and Descartes, programmers represent an \(n\)-dimensional point with an \(n\)-dimensional vector of reals. Thus a two-dimensional picture is some set of pairs of reals:

pic = [[0,0], [3,0], [3,1], [1,1], [1,5], [0,5]]

We plot them with SVG:

draw ps = jsEval_ $ "runme_out.insertAdjacentHTML('beforeend',`" ++ svg ps ++ "`);"
  where
  svg ps = concat
    [ "<svg width=320 height=200"
    , " viewBox='-8 -9 16 10'>"
    , plot ps
    , "</svg>"
    ]
  plot ps = "<polygon points='" ++ unwords (go <$> ps) ++ "' fill='#8c1515' stroke='black' stroke-width='0.1'/>"
  go [x,y] = shows x . (',':) . shows (-y) $ ""

draw pic

We can easily figure out expressions involving the coordinates to perform operations like translation, rotation, and zooming in and out. However, the more operations there are, the more computation we must do on each point.

The previous section suggests an optimization. A linear map is a kind of transformation from an \(n\)-dimensional space to an \(m\)-dimensional space. If we have many of them, instead of applying each of them on each point, we can use matrix multiplication to compute a single transformation that we then apply on each point.

In Haskell, this is like replacing:

map (matrixFun f . matrixFun g . matrixFun h) points

with:

map (matrixFun fgh) points where fgh = f `matrixMul` g `matrixMul` h

This is a good plan if linear maps can represent a wide range of useful geometric operations. But can they? Linearity is a harsh restriction. A straight line is given by:

\[ \vv{v} + t \vv{w} \]

where \(\v, \w\) are constant vectors with \(\w \ne 0\), and where \(t\) ranges over the reals. For example, in 2D, the line that intercepts the \(y\)-axis at 6 with slope 7 is given by:

\[ (0,6) + t (1,7) \]

This implies a linear map must take straight lines to straight lines, the original motivation for the term "linear". Moreover, there are other limitations. A linear map must preserve the origin; a linear map must be continuous, and indeed, preserve proportions: a line interval must map to a line interval, and its midpoint must map to the midpoint of the output interval.

Nonetheless, linear maps suffice for many applications. Look around: any straight line you see remains straight no matter how you position your head. It turns out linear maps include all the transformations needed for rendering 3D scenes as a player moves around in a game world.

Let’s start by playing with simple 2x2 matrices, which represent operations on two-dimensional space.

We can scale horizontally by any given scalar \(a\):

\[ \MM{a}{0}{0}{1} \VV{x}{y} = \VV{ax}{y} \]

draw $ matrixFun [[2,0], [0,1]] <$> pic

The special case \(a = 1\) leaves the input unchanged, thus we have the 2x2 identity matrix which is denoted \(\vv{I}_2\). The subscript can be omitted if it is clear from context.

\[ \vv{I} \VV{x}{y} = \MM{1}{0}{0}{1} \VV{x}{y} = \VV{x}{y} \]

The special case \(a = 0\) is projection to the \(y\)-axis:

\[ \MM{0}{0}{0}{1} \VV{x}{y} = \VV{0}{y} \]

draw $ matrixFun [[0,0], [0,1]] <$> pic

The special case \(a = -1\) reflects across the \(y\)-axis:

\[ \MM{-1}{0}{0}{1} \VV{x}{y} = \VV{-x}{y} \]

draw $ matrixFun [[-1,0], [0,1]] <$> pic

The following shear transformation skews the picture so that vertical lines become slanted. Horizontal lines stay horizontal:

\[ \MM{1}{a}{0}{1} \VV{x}{y} = \VV{x+ay}{y} \]

draw $ matrixFun [[1,0.5], [0,1]] <$> pic

Rotations about the origin commute with vector sums and scalar products, hence are linear maps. We can write down the matrix representing rotation by \(\theta\) by considering the final locations of the standard basis vectors \((1,0), (0,1)\):

\[ \begin{align} (1,0) &\mapsto (\cos\theta,\sin\theta) \\ (0,1) &\mapsto (-\sin\theta,\cos\theta) \\ \end{align} \]

Thus the rotation matrix must be:

\[ \MM{\cos\theta}{-\sin\theta}{\sin\theta}{\cos\theta} \]

-- sin (pi/3) = 0.8660...
draw $ matrixFun [[0.5,-0.866], [0.866,0.5]] <$> pic

It may seem we have only toyed with a few examples: horizontal scaling, shears, and rotations. In fact, thanks to the power of composition via matrix multiplication, these building blocks suffice to generate all possible linear maps from \(\R^2\) to itself.

Understanding why can wait. For now, we focus on the benefits for programmers, just like my childhood self. Matrix multiplication means we can compute a single 2x2 matrix for combining useful transformations, such as a rotation and a zoom ("rotozoom"), which we can apply to every point of our picture in one go.
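
For example, here is a hypothetical rotozoom (the name and numbers are our own): the rotation above followed by a uniform zoom by a factor of 1.5, collapsed into a single matrix that we apply to every point:

rotozoom = matrixMul [[1.5,0], [0,1.5]] [[0.5,-0.866], [0.866,0.5]]
draw $ matrixFun rotozoom <$> pic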

The above generalizes to any number of dimensions.

Lost in Translation

Now to address the elephant in the room: translation. Linear maps must preserve the origin, thus cannot directly express this vital operation. Must we abandon our dream of computing a single matrix to transform all the points?

Imagine flying a kite. If you stand still and hold the string tight, the kite still moves around. If the wind dies down, it falls to the ground. Define yourself to be the origin. You perceive the kite as a moving two-dimensional image, though in reality, the kite is being transformed in three dimensions and the origin never moves.

This inspires the solution. We add an extra dimension, and place our object one unit (one kite string length) along this extra axis. Then suitable shear transformations preserve the origin, yet act as translations on the (hyper)plane exactly one unit along the extra axis.
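
Here is a minimal 2D sketch of the trick (the helper translate2D is our own invention): append a coordinate fixed at 1, apply a shear in the extra dimension, then drop the extra coordinate.

-- Shearing in the extra dimension translates the plane at height 1 by (tx, ty).
translate2D tx ty = [[1,0,tx], [0,1,ty], [0,0,1]]
draw $ init . matrixFun (translate2D 3 2) . (++ [1]) <$> pic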

A kite that is twice as big but twice as far looks the same as the original kite, which suggests a sensible way of interpreting any point, regardless of how deep it is in the extra dimension, a scheme known as homogeneous coordinates. This also allows us to express perspective projection as a linear map.

In short, a single 4x4 matrix can describe an arbitrary composition of three-dimensional linear transformations as well as three-dimensional translations and perspective projection.
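
For instance, a pure three-dimensional translation by \((t_x, t_y, t_z)\) acts on homogeneous coordinates as follows (a textbook example, independent of the code above):

\[ \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ z + t_z \\ 1 \end{bmatrix} \]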

Elemental Magic

Nothing up my sleeve. I have in my hand the \(3\times 3\) identity matrix \(\vv{I}_3\):

\[ \vv{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{bmatrix} \]

Pick an elementary row operation, any elementary row operation. Apply it to \(\vv{I}_3\), and call the result \(\vv{A}\). Such a matrix is called an elementary matrix.

For example, let’s say you chose to swap the second and third rows:

\[ \vv{A} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ \end{bmatrix} \]

Next, pick any three-row matrix \(\vv{B}\), say:

\[ \vv{B} = \begin{bmatrix} 1 & \cdots & 8 \\ 9 & \cdots & 16 \\ 17 & \cdots & 24 \\ \end{bmatrix} \]

Then \(\vv{A}\vv{B}\) is precisely what you get from applying the same elementary row operation to \(\vv{B}\).

matrixMul
  [[1,0,0],[0,0,1],[0,1,0]]
  [[1..8],[9..16],[17..24]]

To show why, we only need to show it works as advertised on \(3\times 1\) matrices, as more columns just repeats the trick in parallel. This is a matter of direct calculation with each of the matrices that correspond to the elementary row operations. The calculations generalize to matrices of all sizes.
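
Here is a single-column instance of the claim (a quick check, not a proof), using the swap matrix from above:

matrixMul [[1,0,0],[0,0,1],[0,1,0]] $ map (:[]) [10,20,30]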

We can change an identity matrix into any given matrix via elementary row operations, provided we also allow scaling a row by zero.

Geometrically:

  • scaling a row (possibly by zero) corresponds to scaling along a certain axis

  • adding a multiple of a row to another corresponds to a shear transformation

  • swapping two rows corresponds to a reflection across a diagonal between two certain axes.

This is how we knew our tiny set of examples above generated all 2x2 linear maps. In fact, we could have replaced all the rotations with just one, rotation by 90 degrees, as that produces the row-swapping matrix when combined with the reflection matrix we gave.
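
To see the reflection claim in action (a quick illustration of the third bullet above), the 2x2 row-swap matrix reflects our picture across the line \(y = x\):

draw $ matrixFun [[0,1], [1,0]] <$> pic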

Other tricks:

  • Right-multiplying a matrix \(A\) by an elementary matrix results in elementary column operations being performed on \(A\).

  • The transpose of a permutation matrix is its inverse.

We leave the proofs as an exercise.
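
As a quick numeric check of the first claim (not a proof), right-multiplying by the swap matrix above exchanges the second and third columns:

matrixMul [[1,2,3],[4,5,6],[7,8,9]] [[1,0,0],[0,0,1],[0,1,0]]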

LU Decomposition

We reproduce our function that records permutations and row multipliers during forward elimination. We also need our matrix pretty-printing routine.

forwardElim' [] = []
forwardElim' xs = p : rest where
  (zs, p@((pivot:pivotTail), _):nzs) = partition ((== 0) . head . fst) xs
  rest = first (0:) <$> forwardElim' (elim <$> (zs ++ nzs))
  elim (h:t, (idx, ks)) = let k = h*(1/pivot) in
    (zipWith (-) t $ map (k*) pivotTail, (idx, k:ks))

pretty = mapM_ ((>> putChar '\n') . mapM_ go) where
  go x = do
    let d = denominator x
    let str = show (numerator x) ++ (if d == 1 then "" else '/':show d)
    putStr $ replicate (8 - length str) ' '
    putStr str

We now know how to represent elementary row operations as matrices, so this time we dress up the caches as matrices so that:

  • the product \(P A\) reproduces the permutations of the rows of \(A\) performed during forward elimination

  • the matrix \(L\) can undo the row operations it represents, that is, \(L U = P A \)

We find that \(L\) is indeed lower triangular, that is, its nonzero entries only appear on the main diagonal or below it. If we had a \(D\) matrix, we would see it is diagonal, that is, its nonzero entries can only appear on the main diagonal.

plu m = [p, l, u] where
  (u, info) = unzip $ forwardElim' $ zip m $ (, []) <$> [0..]
  (perm, lRaw) = unzip info
  n = length u
  p = take n <$> zipWith (++) (map (`replicate` 0) perm) (repeat (1 : repeat 0))
  l = map (take n . (++ repeat 0) . reverse . (1:)) lRaw

We check \(P A = L U\) for our rice matrix:

demo a = do
  putStrLn "A =" *> pretty a
  let [p, l, u] = plu a
  putStrLn "P =" *> pretty p
  putStrLn "L =" *> pretty l
  putStrLn "U =" *> pretty u
  putStrLn "P A =" *> pretty (matrixMul p a)
  putStrLn "L U =" *> pretty (matrixMul l u)

demo rice

We also test the permutation code on a system that requires it. We take the left-hand side of a previous example.

anotherExample = init <$>
  [ [0, 0, 0, 1, 2, 3]
  , [4, 5, 6, 7, 8, 9]
  , [2, 7, 1, 8, 2, 8]
  , [0, 1, 0, 1, 0, 1]
  , [3, 1, 4, 1, 5, 9]
  ]
demo anotherExample

In practice, the permutation matrix \(P\) is just for show, and specialized code reorders the rows.

Invertibility

Consider a linear system of \(n\) equations with \(n\) variables and let \(A\) be the matrix representing the left-hand sides of the equations.

Recall that if the solution is unique, then the reduced echelon form of \(A\) is \(\vv{I}_n\), the identity matrix of size \(n\). Let \(E_1, \dots, E_k\) be the elementary matrices representing the elementary row operations we performed to reduce \(A\) to \(\vv{I}\). Then:

\[ E_k \cdots E_1 A = \vv{I} \]

Write \( A^{-1} = E_k \cdots E_1 \). This is the inverse of \(A\), aptly named since \( A^{-1} A = \vv{I} \).

A linear system asks us to solve the following for \(\v\):

\[ A \v = \w \]

Left-multiplying both sides by \(A^{-1}\) yields:

\[ \v = A^{-1} \w \]

Thus computing \(A^{-1}\) is an alternative to LU decomposition for repeated solves of a linear system, though it only works when a unique solution is guaranteed, that is, when \(A\) is invertible.

Also, left-multiplying the previous equation by \(A\) yields:

\[ \w = A \v = A A^{-1} \w \]

As this holds for all \(\w\), we conclude \(A A^{-1} = \vv{I}\).

We reuse our forward elimination routine to invert a matrix. Instead of using the multipliers to build the \(L\) matrix, we start from the identity matrix and apply the corresponding row operations one after another, resulting in its inverse \(L^{-1}\). That is, if \(A\) denotes the permuted input matrix, then:

\[L^{-1} A = U\]

We implicitly define a diagonal matrix \(D^{-1}\): its main diagonal holds the inverses of the leading entries of \(U\), and its other entries are zero. It’s more efficient to scale rows with specialized code than to call a generic matrix multiplication function with a diagonal matrix.

The entries on the main diagonal of the upper triangular matrix \(D^{-1} U\) are all ones, so backElim applies a few more elementary row operations to reach the identity matrix. As it proceeds, it applies the same operations to \(D^{-1}L^{-1}\), yielding the matrix

\[ (D^{-1} U)^{-1} D^{-1} L^{-1} = U^{-1} L^{-1} = A^{-1} \]

We could have written the code so that forwardElim and backElim share their core logic, as they’re really doing the same thing, but I wanted to reuse the forward elimination code, and there were enough differences (one has arbitrary leading entries, the other has ones; one starts from the identity matrix, the other starts from a given matrix) that it was easiest to repeat myself.

Lastly, we right-multiply by \(P\) to account for the permutation. An alternative is to start from \(P\) instead of \(\vv{I}\) when building the inverse matrix, but I prefer to treat \(P\) as an afterthought.

invert m = matrixMul (backElim (applyD u) (applyD lInv)) p where
  (u, info) = unzip $ forwardElim' $ zip m $ (, []) <$> [0..]
  (perm, lRaw) = unzip info
  n = length u
  p = take n <$> zipWith (++) (map (`replicate` 0) perm) (repeat (1 : repeat 0))
  lInv = reverse $ map (take n . (++ repeat 0)) $ foldl buildLinv [] lRaw
  applyD = zipWith scale $ (1/) <$> leads u

buildLinv acc mults = row:acc where
  idRow = replicate (length mults) 0 ++ [1]
  row = foldl (zipWith (-)) idRow $ (++ repeat 0) <$> scaledRows
  scaledRows = zipWith map (map (*) mults) acc

scale a = map (a*)

leads = \case
  [] -> []
  row:rest -> head row : leads (tail <$> rest)

backElim u inv = revrev $ go (revrev u) (revrev inv) where
  revrev = reverse . map reverse
  go = \cases
    [] [] -> []
    ((1:_):rest) (row:rows) -> row : go (tail <$> rest) rows' where
      rows' = zipWith (zipWith (-)) rows
        $ map (`scale` row) (head <$> rest)

Instant Rice

Let’s test that we get the identity if we left-multiply or right-multiply a matrix by its inverse:

do
  let a = anotherExample
  let b = invert a
  putStrLn "B = inverse of A:"
  pretty b
  putStrLn "B A:"
  pretty $ matrixMul b a
  putStrLn "A B:"
  pretty $ matrixMul a b

Here’s the inverse of the matrix from the rice problem:

unrice = invert rice
pretty unrice

We can solve the linear system with a single matrix-vector multiplication:

matrixMul unrice $ map (:[]) [39, 34, 26]

Ben Lynn blynn@cs.stanford.edu