Graph Visualization

I have a soft spot for graph layout algorithms because as an undergraduate, I signed up for a project that led me to implement what my professor called "the spring algorithm" (Eades 1984). These days, it appears to be called force-directed graph drawing.

By graph, I mean what Sylvester meant when he coined the term in 1878: a set of nodes (or vertices or points) connected by edges (or lines or links). I wish he had called it a "web" instead, because "graph" already has a meaning in mathematics, and anybody who’s surfed the worldwide web already understands nodes connected with links.

Hidden in the physics building, at the end of the hallway on the second floor, there was a computer lab that housed a bevy of Silicon Graphics workstations, each capable of real-time ray-tracing. Behind a glass window in a darkened room loomed a Cray supercomputer, which looked like an altar. Perhaps it really was one, as only the high priest who oversaw the lab could perform rites on it, and ask it for blessings on our behalf.

Like a secret agent, I was issued a nondescript security card that I’d swipe at a nondescript door to gain entry, while the rest of the computer science students were condemned to crummy terminals in the stuffy, crowded, squalid basement of the Madsen building, sharing a single Sun server. With the aid of a futuristic API named OpenGL, I was able to render graphs as beautifully shaded spheres and cylinders in three dimensions. What a time to be alive!

Today, those miracles I witnessed in that lab are commonplace. Mobile phones are more powerful than the vaunted Cray, whose manufacturer is long gone, as are Sun Microsystems, Inc. and Silicon Graphics, Inc. (I later spent years in the former SGI headquarters during my stint at Google; one of my many co-workers was one of SGI’s earliest employees.)

My brain also seems to have gone through similar upgrades somewhere along the way. Back then I painstakingly edited page after page of code over a semester to lay out a graph; today, I can get it done by busting out a few lines.

Spring time

Working in a room in the physics building all those years ago was appropriate, because I was in fact writing a simple physics engine. Given a connected graph, we place the vertices randomly. We imagine all nodes to be positive charges, and all edges to be springs of the same length, then apply the laws of physics.

We copy the pseudo-random number generator from our NetWalk page:

data PCG = PCG Word64 Word64
pcg a b = PCG (a + b') b' where b' = 2*b + 1

next :: PCG -> (Word, PCG)
next (PCG x inc) = (r, PCG x' inc) where
  x' = 6364136223846793005*x + inc
  r = (fromIntegral (x `xor` (x `shiftR` 18) `shiftR` 27) :: Word) `rotateR` (fromIntegral $ x `shiftR` 59)

fromPCG p = map (fromIntegral . fst) $ tail $ iterate (next . snd) (undefined, p)

split :: PCG -> (PCG, PCG)
split p = (PCG (lohi a b) (lohi c d), PCG (lohi e f) (lohi g h)) where
  [a,b,c,d,e,f,g,h] = take 8 $ fromPCG p
  lohi = Word64

chunksOf i ls = map (take i) (go ls) where
  go [] = []
  go l  = l : go (drop i l)

We generate coordinates using a seed that happens to work nicely with our examples:

points42 = chunksOf 2 $ (\n -> doubleFromWord n / 2^25) <$> fromPCG (pcg 42 42)

In our examples we use integers to refer to the nodes, though our code is more general.

We store the nodes and their coordinates in an association list. The edges are represented by a function that returns whether two nodes are connected. For example, the following is \(K_5\), the complete 5-node graph, with an initial random layout:

graphK5 = (\_ _ -> True, zip [1..5] points42)

We display the graphs with SVG elements which we insert into the HTML of this page:

allPairs as = [(x, y)
  | x:t <- takeWhile ((> 1) . length) $ iterate tail as, y <- t]

svgline [x1, y1] [x2, y2] = ("<line x1='"<>)
  . shows x1 . ("' y1='"<>) . shows y1 . ("' x2='"<>)
  . shows x2 . ("' y2='"<>) . shows y2 . ("' stroke='red' />"<>)

svgcircle [x, y] = ("<circle cx='"<>)
  . shows x . ("' cy='"<>)
  . shows y . ("' r='2' />"<>)

svgWrap (isEdge, vs) = ("<svg viewBox='"<>)
  . (foldr (.) id $ intersperse (' ':) $ shows <$> [x0 - mx, y0 - my, dx + 2*mx, dy + 2*my])
  . ("' width=300 height=200>"<>)
  . foldr (.) id [ svgline p q | ((v, p), (w, q)) <- allPairs vs, isEdge v w]
  . foldr (.) id (svgcircle . snd <$> vs)
  $ "</svg>"
  where
  xs = (!!0) . snd <$> vs
  ys = (!!1) . snd <$> vs
  x0 = foldr1 min xs
  x1 = foldr1 max xs
  dx = x1 - x0
  y0 = foldr1 min ys
  y1 = foldr1 max ys
  dy = y1 - y0
  mx = max 2 $ dx * 0.05
  my = max 2 $ dy * 0.05

draw g = jsEval $ "runme_out.insertAdjacentHTML('beforeend',`" ++ svgWrap g ++ "`);"

draw graphK5

Hooked on Coulomb

Let \(d\) be the distance between two nodes. If they’re connected by an edge, the force exerted on each node away from the other node is given by Coulomb’s Law and Hooke’s Law:

\[ F = \frac{k_1}{d^2} + k_2(L - d) \]

where \(L\) is the length of each spring (at equilibrium), and \(k_1, k_2\) are constants. If the nodes have no edge between them, then we drop the second term, as then we only wish to simulate the repulsion of the two charges.

Each frame, for each node, we sum the forces on it due to every other node, multiply the total by a tunable tiny constant, and adjust its position by the result.

This makes sense, because for constant acceleration, the distance traveled over a given time is proportional to the acceleration; of course, the forces continuously vary as the nodes move around, which is why we try to find a duration short enough for a passable approximation, though not so short that we lose our patience waiting for the simulation.

In our regular expressions demo, zero-length springs (\(L = 0\)) happen to work with my example, but it appears to perform poorly for the graphs below. I settled on the constants in the code after some experimentation.

sim (isEdge, vs) = (isEdge, nudge <$> vs) where
  nudge (v, p0) = (v, foldr ($) p0 $ [zipWith (+) $ delta (isEdge v w) $ zipWith (-) p1 p0 | (w, p1) <- vs, v /= w])

delta sprung dp = (f*) <$> dp where
  d = sqrt $ sum $ map (^2) dp
  f = -0.2 * (9000/(d*d) + bool 0 (20 - d) sprung)/d

Let’s look at the first few frames when drawing \(K_5\).

mapM draw $ take 8 $ iterate sim graphK5

The result is already pleasing, though complete graphs only exercise one code path.

Good Will Hunting II

The movie Good Will Hunting features a couple of combinatorics problem sets. We explore the first set elsewhere. Here, we address the second set, as it’s a good excuse to show off our code.

The first problem asks for the number of \(n\)-node labeled trees.

A tree is a connected graph with no cycles, that is, given any two nodes, there is a unique path between them. We label the nodes 1 to \(n\) and consider two trees to be equivalent if two nodes are adjacent on one tree exactly when the identically labeled nodes are adjacent on the other.

Counting the first 5 cases is easy, and provides enough data for a good educated guess at the general formula. I urge you to try before reading on.

If \(n = 1, 2\) there is only one possible unlabeled tree, and only one way to label it uniquely. If \(n = 3\), the only possible unlabeled tree is 3 nodes connected in series, but there are 3 different labelings, each determined by the choice of the label for the node in the middle.

fromEdgeStr s = (edgeP, zip vs $ points42) where
  edgeP v w = [min v w, max v w] `elem`
    chunksOf 2 (fromIntegral . readInteger <$> words s)
  ns = fromIntegral . readInteger <$> words s :: [Int]
  vs = nub ns

nub = foldr go [] where
  go x xs
    | elem x xs = xs
    | otherwise = x:xs

mapM draw
  [ (undefined, [(1, [0, 0])])
  , fromEdgeStr "1 2"
  , (!!50) $ iterate sim $ fromEdgeStr "1 2 2 3"
  ]

If \(n = 4\) then there are two shapes: either all 4 nodes are connected in series, or there is one node of degree 3 that connects to the others. In the latter case, there are 4 labelings, where again each is determined by the choice of the label of the central node. In the former case, there are \(4! = 24\) permutations we can arrange 4 labels in a row, but this counts each labeling twice: rotating the line of nodes by 180 degrees yields the reverse of their permutation. Thus there are a total of \(4 + 24/2 = 16\) trees.

mapM (draw . (!!50) . iterate sim . fromEdgeStr)
  [ "1 2 2 3 3 4"
  , "1 2 1 3 1 4"
  ]

If \(n = 5\), then one of the following occurs:

we have 5 nodes in series, yielding \(5! / 2 = 60\) unique labelings.
we have one node of degree 4 connected to the others, yielding 5 unique labelings.
we have one node of degree 3, one of degree 2, and the other nodes have degree 1, yielding \(5\times 4\times 3 = 60\) unique labelings which are determined by the choices of labels of the degree 2 node and its neighbours.

mapM (draw . (!!50) . iterate sim . fromEdgeStr)
  [ "1 2 2 3 3 4 4 5"
  , "1 2 1 3 1 4 1 5"
  , "1 2 1 3 1 4 2 5"
  ]

This gives a total of 125 labeled trees. So we now have the sequence \(1, 1, 3, 16 = 4^2, 125 = 5^3\), which leads to the guess:

\[ n^{n-2} \]

This guess turns out to be correct, and is known as Cayley’s formula, who proved it in the 19th century (see Cayley, A theorem on trees). Will Hunting writes it on the board with no proof. However, since that boy is wicked smart, he would likely be able to supply one.

Good Arthur Cayley

Many famous mathematicians have tackled Cayley’s formula. Aigner and Ziegler, Proofs from THE BOOK, walk through four distinct proofs of Cayley’s formula, each of which is ingenious and elegant. Because of my background in cryptography, my favourite is the following due to Joyal.

On a labeled tree with \(n\) nodes, choose any node to be the "start", and any node to be the "end"; these may be the same node. Since it’s a tree, there is a unique path from the start node to the end node, which we’ll call the "spine".

Let \(a b … z\) be the labels of the nodes of the spine sorted in ascending order. Then the path from the start node to the end node describes a permutation of \(a b … z\) which we denote \(\pi\).

Define a function \(f\) from the set of naturals \([1..n]\) to itself as follows. If \(k \in [1..n] \) is on the spine, then \(f(k) = \pi(k)\). Otherwise, there is a unique path from \(k\) to the start node (in fact, we can use any node on the spine). Start walking along this path, and define \(f(k)\) to be the first node you reach. Different labelings and choices of start and end nodes lead to different functions.

There are \(n\) choices for the start node, and same for the end node. As there are \(n^n\) different functions from \([1..n]\) to itself, we have shown there are \(n^{n-2}\) different labeled trees…provided we fill in a gap.

To prove we have constructed a bijection, we need the flip side: we must show that any function \(f\) from \([1..n]\) to itself corresponds to a labeled tree. But consider repeatedly applying \(f\) on some \(k \in [1..n]\). Either \(k\) is in a cycle, or we’ll eventually reach a value that is in a cycle.

Thus we can reverse the above process. Together, the cycles describe a permutation \(\pi\) of some of the elements of \([1..n]\), which we use to build a spine. As for the other elements, applying \(f\) describes how to connect them via subtrees rooted to the spine.

In cryptography, Pollard’s rho algorithm repeatedly applies a function from a finite set to itself, eventually causing the values to cycle, which is exploited to factor an integer or find a discrete logarithm.

Robot Will Hunting

The last task is to draw all homeomorphically irreducible trees with 10 nodes. This is easy, so we up the ante with a layer of indirection: our goal is to write a program that draws all homeomorphically irreducible trees with 10 nodes.

Let’s briefly review some definitions. A node is external if it has degree 1, and is internal otherwise. A graph is homeomorphically irreducible or just irreducible if no node has degree 2. This is the same as saying all internal nodes have degree 3 or more.

We define two graphs to be equivalent (isomorphic) if both graphs have \(n\) nodes, and we can label the nodes of each graph with \([1..n]\) such that two nodes are connected in one graph if and only if in the other graph, the nodes with the same labels are connected.

Now that we understand the problem, we simply exhaust all possibilities and feed any trees we find into our code above to draw them.

We first notice that the internal nodes must form a tree. Suppose there are 4 internal nodes. As above, they are either they are connected in series, or one node is connected to each of the other three. We find adding 6 external nodes are barely enough to obtain irreducible trees, thus we have drawn 2 of the desired trees, and have also discovered there can be no more than 4 internal nodes (we’re waving our hands here, but this argument can easily be made more rigorous).

t4_1 = fromEdgeStr "1 2 2 3 3 4 1 5 1 6 2 7 3 8 4 9 4 10"
t4_2 = fromEdgeStr "1 2 1 3 1 4 2 5 2 6 3 7 3 8 4 9 4 10"
mapM (draw . (!!80) . iterate sim ) [t4_1, t4_2]

We continue bashing through cases. If there is 1 internal node, then there’s only one way to attach 9 external nodes to it. If there are 2 internal nodes, they must be adjacent, and we find 3 different ways to attach external nodes to reach 10 nodes total.

t1 = (\v w -> v == 1 || w == 1, zip [1..10] $ points42)
t2_1 = fromEdgeStr "1 2 1 3 1 4 1 5 1 6 1 7 1 8 2 9 2 10"
t2_2 = fromEdgeStr "1 2 1 3 1 4 1 5 1 6 1 7 2 8 2 9 2 10"
t2_3 = fromEdgeStr "1 2 1 3 1 4 1 5 1 6 2 7 2 8 2 9 2 10"
mapM (draw . (!!50) . iterate sim ) [t1, t2_1, t2_2, t2_3]

If there are 3 internal nodes, they must be connected in series, and we find 4 different ways to attach the external nodes: the degrees of the 3 nodes are 5-3-3, 4-4-3, 4-3-4, and 3-5-3.

t3_1 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 1 9 1 10"
t3_2 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 1 9 2 10"
t3_3 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 1 9 3 10"
t3_4 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 2 9 2 10"
mapM (draw . (!!50) . iterate sim ) [t3_1, t3_2, t3_3, t3_4]

In the story, Will Hunting draws only 8 trees before he is interrupted and chased off. We see the board is missing the case with 4 internal nodes in series and the case with 3 internal nodes with degrees 5-3-3. Speaking of which, I applaud the choice of this question, because it looks good on screen. The viewer enjoys intricate and intriguing drawings of trees, and is reminded that mathematics is not numbers plus algebraic gobbledygook.

Here be dragons

While finding the above 10 trees is easy, there are deep, difficult questions lurking only a hair’s breadth away. For starters, what if we generalize and ask for the number of irreducible trees with \(n\) nodes? It seems there is no neat formula, and we must settle for an arduous trek through combinatorics to arrive at a generating function whose coefficients we can compute. When doing so, we discover this is also the case for counting unlabled trees.

What about writing code that determines if two graphs are equivalent? This results in an algorithm for the graph isomorphism problem, which lies in NP because we can succinctly describe how to relabel vertices to prove two graphs are the same. But is there an quick way to find such a relabeling? In other words, is the graph isomorphism problem in P? Nobody knows.

Murkier still, nobody knows if graph isomorphism is NP-complete (roughly: no easier than any other NP problem). Thus graph isomorphism is a rare problem that lies somewhere between P and NP-complete, but no one has proved it’s one or the other. Or strictly in between, assuming P ≠ NP. In other words, graph isomorphism might be a member of the exclusive club known as NP-intermediate.

On the one hand, we can generalize slightly to obtain a known NP-complete problem, the subgraph isomorphism problem, which asks if a given graph is equivalent to some subgraph of another graph. On the other hand, Babai published (then retracted then restored) a paper showing that graph isomorphism can be solved in quasi-polynomial time; the best known algorithms for solving NP-complete problems are exponential-time.

Cryptography mysteriously appears again. Other famous suspected members of NP-intermediate are integer factorization and discrete log. Computer security relies on these problems being difficult to solve. To the dismay of those who build cryptosystems, while nobody can solve them in polynomial time (without a quantum computer that is perpetually ten years away), there exist subexponential-time algorithms.

Ben Lynn blynn@cs.stanford.edu 💡