Import:
= Graph Visualization =

I have a soft spot for graph layout algorithms because as an undergraduate, I
signed up for a project that led me to implement what my professor called "the
spring algorithm" (Eades 1984). These days, it appears to be called
https://en.wikipedia.org/wiki/Force-directed_graph_drawing[force-directed graph
drawing].

By _graph_, I mean what Sylvester meant when he coined the term in 1878: a set
of _nodes_ (or _vertices_ or _points_) connected by _edges_ (or _lines_ or
_links_). I wish he had called it a "web" instead, because "graph" already has
a meaning in mathematics, and anybody who's surfed the worldwide *web* already
understands nodes connected with links.

Hidden in the physics building, at the end of the hallway on the second floor,
there was a computer lab that housed a bevy of Silicon Graphics workstations,
each capable of real-time ray-tracing. Behind a glass window in a darkened room
loomed a Cray supercomputer, which looked like an altar. Perhaps it really was
one, as only the high priest who oversaw the lab could perform rites on it, and
ask it for blessings on our behalf.

Like a secret agent, I was issued a nondescript security card that I'd swipe at
a nondescript door to gain entry, while the rest of the computer science
students were condemned to crummy terminals in the stuffy, crowded, squalid
basement of the Madsen building, sharing a single Sun server. With the aid of a
futuristic API named OpenGL, I was able to render graphs as beautifully shaded
spheres and cylinders in three dimensions. What a time to be alive!

Today, those miracles I witnessed in that lab are commonplace. Mobile phones
are more powerful than the vaunted Cray, whose manufacturer is long gone, as
are Sun Microsystems, Inc. and Silicon Graphics, Inc. (I later spent years in
the former SGI headquarters during my stint at Google; one of my many
co-workers was one of SGI's earliest employees.)

My brain also seems to have gone through similar upgrades somewhere along the
way. Back then I painstakingly edited page after page of code over a semester
to lay out a graph; today, I can get it done by busting out a few lines.

== Spring time ==

Working in a room in the physics building all those years ago was appropriate,
because I was in fact writing a simple physics engine. Given a connected graph,
we place the vertices randomly. We imagine all nodes to be positive charges,
and all edges to be springs of the same length, then apply the laws of physics.

We copy the pseudo-random number generator from link:../play/netwalk.html[our NetWalk page]:

[ ]:
data PCG = PCG Word64 Word64
pcg a b = PCG (a + b') b' where b' = 2*b + 1

next :: PCG -> (Word, PCG)
next (PCG x inc) = (r, PCG x' inc) where
  x' = 6364136223846793005*x + inc
  r = (fromIntegral (x `xor` (x `shiftR` 18) `shiftR` 27) :: Word) `rotateR` (fromIntegral $ x `shiftR` 59)

fromPCG p = map (fromIntegral . fst) $ tail $ iterate (next . snd) (undefined, p)

split :: PCG -> (PCG, PCG)
split p = (PCG (lohi a b) (lohi c d), PCG (lohi e f) (lohi g h)) where
  [a,b,c,d,e,f,g,h] = take 8 $ fromPCG p
  lohi = Word64

chunksOf i ls = map (take i) (go ls) where
  go [] = []
  go l  = l : go (drop i l)
We generate coordinates using a seed that happens to work nicely with our
examples:

[ ]:
points42 = chunksOf 2 $ (\n -> doubleFromWord n / 2^25) <$> fromPCG (pcg 42 42)
In our examples we use integers to refer to the nodes, though our code is more
general.

We store the nodes and their coordinates in an association list. The edges
are represented by a function that returns whether two nodes are connected.
For example, the following is \(K_5\), the complete 5-node graph, with an
initial random layout:

[ ]:
graphK5 = (\_ _ -> True, zip [1..5] points42)
We display the graphs with SVG elements which we insert into the HTML of
this page:

[ ]:
allPairs as = [(x, y)
  | x:t <- takeWhile ((> 1) . length) $ iterate tail as, y <- t]

svgline [x1, y1] [x2, y2] = ("<line x1='"<>)
  . shows x1 . ("' y1='"<>) . shows y1 . ("' x2='"<>)
  . shows x2 . ("' y2='"<>) . shows y2 . ("' stroke='red' />"<>)

svgcircle [x, y] = ("<circle cx='"<>)
  . shows x . ("' cy='"<>)
  . shows y . ("' r='2' />"<>)

svgWrap (isEdge, vs) = ("<svg viewBox='"<>)
  . (foldr (.) id $ intersperse (' ':) $ shows <$> [x0 - mx, y0 - my, dx + 2*mx, dy + 2*my])
  . ("' width=300 height=200>"<>)
  . foldr (.) id [ svgline p q | ((v, p), (w, q)) <- allPairs vs, isEdge v w]
  . foldr (.) id (svgcircle . snd <$> vs)
  $ "</svg>"
  where
  xs = (!!0) . snd <$> vs
  ys = (!!1) . snd <$> vs
  x0 = foldr1 min xs
  x1 = foldr1 max xs
  dx = x1 - x0
  y0 = foldr1 min ys
  y1 = foldr1 max ys
  dy = y1 - y0
  mx = max 2 $ dx * 0.05
  my = max 2 $ dy * 0.05

draw g = jsEval $ "repl.outdiv.insertAdjacentHTML('beforeend',`" ++ svgWrap g ++ "`);"

draw graphK5
== Hooked on Coulomb ==

Let \(d\) be the distance between two nodes. If they're connected by an edge,
the force exerted on each node away from the other node is given by
Coulomb's Law and Hooke's Law:

\[
F = \frac{k_1}{d^2} + k_2(L - d)
\]

where \(L\) is the length of each spring (at equilibrium), and \(k_1, k_2\) are
constants. If the nodes have no edge between them, then we drop the second
term, as then we only wish to simulate the repulsion of the two charges.

Each frame, for each node, we sum the forces on it due to every other node,
multiply the total by a tunable tiny constant, and adjust its position by the
result.

This makes sense, because for constant acceleration, the distance traveled over
a given time is proportional to the acceleration; of course, the forces
continuously vary as the nodes move around, which is why we try to find a
duration short enough for a passable approximation, though not so short that
we lose our patience waiting for the simulation.

In link:re.html[our regular expressions demo], zero-length springs (\(L = 0\))
happen to work with my example, but it appears to perform poorly for the graphs
below. I settled on the constants in the code after some experimentation.

[ ]:
sim (isEdge, vs) = (isEdge, nudge <$> vs) where
  nudge (v, p0) = (v, foldr ($) p0 $ [zipWith (+) $ delta (isEdge v w) $ zipWith (-) p1 p0 | (w, p1) <- vs, v /= w])

delta sprung dp = (f*) <$> dp where
  d = sqrt $ sum $ map (^2) dp
  f = -0.2 * (9000/(d*d) + bool 0 (20 - d) sprung)/d
Let's look at the first few frames when drawing \(K_5\).

[ ]:
mapM draw $ take 8 $ iterate sim graphK5
The result is already pleasing, though complete graphs only exercise one code
path.

== Good Will Hunting II ==

The movie _Good Will Hunting_ features a couple of combinatorics problem sets.
link:goodwill.html[We explore the first set elsewhere]. Here, we address the
second set, as it's a good excuse to show off our code.

The first problem asks for the number of \(n\)-node labeled trees.

A _tree_ is a connected graph with no cycles, that is, given any two nodes,
there is a unique path between them. We label the nodes 1 to \(n\) and consider
two trees to be equivalent if two nodes are adjacent on one tree exactly when
the identically labeled nodes are adjacent on the other.

Counting the first 5 cases is easy, and provides enough data for a good
educated guess at the general formula. I urge you to try before reading on.

If \(n = 1, 2\) there is only one possible unlabeled tree, and only one way
to label it uniquely. If \(n = 3\), the only possible unlabeled tree is
3 nodes connected in series, but there are 3 different labelings, each
determined by the choice of the label for the node in the middle.

[ ]:
fromEdgeStr s = (edgeP, zip vs $ points42) where
  edgeP v w = [min v w, max v w] `elem`
    chunksOf 2 (fromIntegral . readInteger <$> words s)
  ns = fromIntegral . readInteger <$> words s :: [Int]
  vs = nub ns

nub = foldr go [] where
  go x xs
    | elem x xs = xs
    | otherwise = x:xs

mapM draw
  [ (undefined, [(1, [0, 0])])
  , fromEdgeStr "1 2"
  , (!!50) $ iterate sim $ fromEdgeStr "1 2 2 3"
  ]
If \(n = 4\) then there are two shapes: either all 4 nodes are connected in
series, or there is one node of degree 3 that connects to the others.
In the latter case, there are 4 labelings, where again each is determined by
the choice of the label of the central node. In the former case, there are \(4!
= 24\) permutations we can arrange 4 labels in a row, but this counts each
labeling twice: rotating the line of nodes by 180 degrees yields the reverse of
their permutation. Thus there are a total of \(4 + 24/2 = 16\) trees.

[ ]:
mapM (draw . (!!50) . iterate sim . fromEdgeStr)
  [ "1 2 2 3 3 4"
  , "1 2 1 3 1 4"
  ]
If \(n = 5\), then one of the following occurs:

  * we have 5 nodes in series, yielding \(5! / 2 = 60\) unique labelings.

  * we have one node of degree 4 connected to the others, yielding 5 unique
  labelings.

  * we have one node of degree 3, one of degree 2, and the other nodes have degree 1, yielding \(5\times 4\times 3 = 60\) unique labelings which are determined by the choices of labels of the degree 2 node and its neighbours.

[ ]:
mapM (draw . (!!50) . iterate sim . fromEdgeStr)
  [ "1 2 2 3 3 4 4 5"
  , "1 2 1 3 1 4 1 5"
  , "1 2 1 3 1 4 2 5"
  ]
This gives a total of 125 labeled trees. So we now have the sequence
\(1, 1, 3, 16 = 4^2, 125 = 5^3\), which leads to the guess:

\[
n^{n-2}
\]

This guess turns out to be correct, and is known as Cayley's formula, who
proved it in the 19th century
(see
https://rcin.org.pl/impan/dlibra/publication/176630/edition/143722/content[Cayley,
_A theorem on trees_]).
Will Hunting writes it on the board with no proof. However, since that boy is
wicked smart, he would likely be able to supply one.

== Good Arthur Cayley ==

Many famous mathematicians have tackled Cayley's formula. Aigner and Ziegler,
_Proofs from THE BOOK_, walk through four distinct proofs of Cayley's formula,
each of which is ingenious and elegant. Because of my background in
cryptography, my favourite is the following due to Joyal.

On a labeled tree with \(n\) nodes, choose any node to be the "start", and any
node to be the "end"; these may be the same node. Since it's a tree, there is a
unique path from the start node to the end node, which we'll call the "spine".

Let \(a b ... z\) be the labels of the nodes of the spine sorted in ascending
order. Then the path from the start node to the end node describes a
permutation of \(a b ... z\) which we denote \(\pi\).

Define a function \(f\) from the set of naturals \([1..n]\) to itself as
follows. If \(k \in [1..n] \) is on the spine, then \(f(k) = \pi(k)\).
Otherwise, there is a unique path from \(k\) to the start node (in fact, we can
use any node on the spine). Start walking along this path, and define \(f(k)\)
to be the first node you reach. Different labelings and choices of start and
end nodes lead to different functions.

There are \(n\) choices for the start node, and same for the end node. As there
are \(n^n\) different functions from \([1..n]\) to itself, we have shown there
are \(n^{n-2}\) different labeled trees...provided we fill in a gap.

To prove we have constructed a bijection, we need the flip side: we must show
that any function \(f\) from \([1..n]\) to itself corresponds to a labeled
tree. But consider repeatedly applying \(f\) on some \(k \in [1..n]\). Either
\(k\) is in a cycle, or we'll eventually reach a value that is in a cycle.

Thus we can reverse the above process. Together, the cycles describe a
permutation \(\pi\) of some of the elements of \([1..n]\), which we use to
build a spine. As for the other elements, applying \(f\) describes how to
connect them via subtrees rooted to the spine.

In cryptography,
https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm[Pollard's rho
algorithm] repeatedly applies a function from a finite set to itself,
eventually causing the values to cycle, which is exploited to factor an integer
or find a discrete logarithm.

== Robot Will Hunting ==

The last task is to draw all homeomorphically irreducible trees with 10 nodes.
This is easy, so we up the ante with a layer of indirection: our goal is to
write a program that draws all homeomorphically irreducible trees with 10
nodes.

Let's briefly review some definitions. A node is _external_ if it has degree 1,
and is _internal_ otherwise. A graph is _homeomorphically irreducible_ or just
_irreducible_ if no node has degree 2. This is the same as saying all internal
nodes have degree 3 or more.

We define two graphs to be equivalent (_isomorphic_) if both graphs have \(n\)
nodes, and we can label the nodes of each graph with \([1..n]\) such that two
nodes are connected in one graph if and only if in the other graph, the nodes
with the same labels are connected.

Now that we understand the problem, we simply exhaust all possibilities and
feed any trees we find into our code above to draw them.

We first notice that the internal nodes must form a tree. Suppose there are 4
internal nodes. As above, they are either they are connected in series, or one
node is connected to each of the other three. We find adding 6 external nodes
are barely enough to obtain irreducible trees, thus we have drawn 2 of the
desired trees, and have also discovered there can be no more than 4 internal
nodes (we're waving our hands here, but this argument can easily be made more
rigorous).

[ ]:
t4_1 = fromEdgeStr "1 2 2 3 3 4 1 5 1 6 2 7 3 8 4 9 4 10"
t4_2 = fromEdgeStr "1 2 1 3 1 4 2 5 2 6 3 7 3 8 4 9 4 10"
mapM (draw . (!!80) . iterate sim ) [t4_1, t4_2]
We continue bashing through cases. If there is 1 internal node, then there's
only one way to attach 9 external nodes to it. If there are 2 internal nodes,
they must be adjacent, and we find 3 different ways to attach external nodes to
reach 10 nodes total.

[ ]:
t1 = (\v w -> v == 1 || w == 1, zip [1..10] $ points42)
t2_1 = fromEdgeStr "1 2 1 3 1 4 1 5 1 6 1 7 1 8 2 9 2 10"
t2_2 = fromEdgeStr "1 2 1 3 1 4 1 5 1 6 1 7 2 8 2 9 2 10"
t2_3 = fromEdgeStr "1 2 1 3 1 4 1 5 1 6 2 7 2 8 2 9 2 10"
mapM (draw . (!!50) . iterate sim ) [t1, t2_1, t2_2, t2_3]
If there are 3 internal nodes, they must be connected in series, and we find 4
different ways to attach the external nodes: the degrees of the 3 nodes are
5-3-3, 4-4-3, 4-3-4, and 3-5-3.

[ ]:
t3_1 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 1 9 1 10"
t3_2 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 1 9 2 10"
t3_3 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 1 9 3 10"
t3_4 = fromEdgeStr "1 2 2 3 1 4 1 5 2 6 3 7 3 8 2 9 2 10"
mapM (draw . (!!50) . iterate sim ) [t3_1, t3_2, t3_3, t3_4]
In the story, Will Hunting draws only 8 trees before he is interrupted and
chased off. We see the board is missing the case with 4 internal nodes in
series and the case with 3 internal nodes with degrees 5-3-3. Speaking of
which, I applaud the choice of this question, because it looks good on screen.
The viewer enjoys intricate and intriguing drawings of trees, and is reminded
that mathematics is not numbers plus algebraic gobbledygook.

== Here be dragons ==

While finding the above 10 trees is easy, there are deep, difficult questions
lurking only a hair's breadth away. For starters, what if we generalize and
ask for the number of irreducible trees with \(n\) nodes? From a brief search,
it seems there is no neat formula, and we must settle for an arduous trek
through combinatorics to arrive at a generating function whose coefficients can
be readily computed. This is also the case for
https://oeis.org/A000055[counting unlabled trees].

What about writing code that determines if two graphs are equivalent? This
results in an algorithm for the _graph isomorphism problem_, which lies in NP
because we can succinctly describe how to relabel vertices to prove two graphs
are the same. But is there an efficient algorithm? Is the graph isomorphism
problem in P? Nobody knows.

Murkier still, nobody knows if graph isomorphism is NP-complete (roughly: no
easier than any other NP problem). Thus graph isomorphism is a rare problem
that lies somewhere between P and NP-complete, but no one has proved it's one
or the other. Or strictly in between, assuming P &ne; NP. In other words,
graph isomorphism might be a member of the exclusive club known as
https://en.wikipedia.org/wiki/NP-intermediate[NP-intermediate].

On the one hand, we can generalize slightly to obtain a known NP-complete
problem: the _subgraph isomorphism problem_, which asks if a given graph is
equivalent to some subgraph of another graph. On the other hand, Babai
published (then retracted then restored) a paper showing that graph isomorphism
can be solved in quasi-polynomial time; the best known algorithms for solving
NP-complete problems are exponential-time.

Cryptography mysteriously appears again. Other famous suspected members of
NP-intermediate are integer factorization and discrete log. Computer security
relies on these problems being difficult to solve. To the dismay of those who
build cryptosystems, while nobody can solve them in polynomial time (without a
quantum computer that is perpetually ten years away), there exist
subexponential-time algorithms.

Ben Lynn blynn@cs.stanford.edu 💡