Suppose you've got a huge graph - millions of nodes. And you know that it's not connected - so the graph actually consists of some number of pieces (called the connected components of the graph). And there are constantly new vertices and edges being added to the graph, but nothing is ever removed. Some questions you might want to ask about this graph at a particular point in time are:
- How many components are there in the graph?
- Which component is vertex X in?
- Are vertices X and Y in the same component?
- How many components are there?
All of these questions are variants of a classic computer science problem called
union-find, which comes up in an astonishing number of different contexts. The name
comes from the two basic operations in the solution: union and find. Basically, the
division of the graph into components is a partition of the graph's vertices into
disjoint sets: union-find focuses on a particular kind of disjoint-set problem, one
where the sets can be modified over time.
The basic idea of the classic union-find solution is to maintain a collection of sets, where each set contains the vertices of one connected component. Each
set has one member, arbitrarily selected, called the set's representative: as long as a set
remains disjoint (no edges are added which connect it to another component), its representative never
changes; when two sets are merged (an edge is added connecting two components), the representative
of the merged set will be the representative of one of the two sets that were merged. The only way to refer to a set is by its representative.
This gives us a data structure with three real operations. One is add, which adds a set of objects (connected vertices) to the structure. Another is union, which takes two objects (vertices) and merges the sets containing them (adding an edge). The last is find, which takes an object and returns the representative of the set to which that object belongs.
If you think of it as a list of lists, it's pretty obvious how this all works. Adding a set of objects
just links a new list into the list of sets. Union takes one set's list and appends it to another's. Find searches the lists to figure out which set contains a particular element.
type setMember = (value, child, parent)

def AddSet(sets, newSetRepresentative):
    sets.add(newSetRepresentative)

def Union(sets, setOneMember, setTwoMember):
    setOneRepresentative = sets.Find(setOneMember)
    setTwoRepresentative = sets.Find(setTwoMember)
    sets.remove(setTwoRepresentative)
    setOneRepresentative.append(setTwoRepresentative)

def Find(sets, v):
    if v.parent == null:
        return v
    else:
        return Find(sets, v.parent)
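To make this concrete, here's a minimal runnable sketch in Python of the naive parent-chain version the pseudocode describes. The `Node` class and the function names are my own illustrative choices, not from the original.

```python
# Naive union-find: each set is a chain of parent pointers ending at its
# representative, the one node whose parent is None. (Illustrative names.)

class Node:
    def __init__(self, value):
        self.value = value
        self.parent = None  # None means this node is a representative

def find(node):
    # Follow parent links up to the representative: O(N) in the worst case.
    while node.parent is not None:
        node = node.parent
    return node

def union(a, b):
    # Merge the set containing b into the set containing a by linking
    # b's representative under a's representative.
    rep_a, rep_b = find(a), find(b)
    if rep_a is not rep_b:
        rep_b.parent = rep_a

x, y, z = Node("x"), Node("y"), Node("z")
union(x, y)
print(find(y) is find(x))  # True: x and y are now in the same component
print(find(z) is find(x))  # False: z is still a component of its own
```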
In this version, the costs are also easy to figure out. Using a doubly linked list for the sets,
addition costs O(1). Find is O(N): you follow the chain of links from a node toward the representative,
which is the only node with no parent, and that can take up to N steps, where N is the number of objects
in the set of sets. Union is also O(N): it requires two finds, each of which takes O(N), and then finding
the end of one of the lists, where the other set will be linked in, which is again O(N). If you've got
really big sets, and you're going to do a lot of operations,
then this is a really awful running time.
How can we improve that? The easiest way is to take each set, and replace it by a tree. That's called
the disjoint-forest solution. In a disjoint forest, a node can have more than one child, but each
node has either zero parents (if it's a representative) or one. In this representation, the implementation
of find is exactly the same: climb up the parent chain to find a representative, which takes amortized
O(lg n), provided the trees remain shallow and bushy. Union is again two finds (O(lg n) each) followed by
attaching one tree as a child of the other tree's representative (O(1)), so it's also O(lg n).
Now, the question is: how can we say that the forest is really O(lg n)? That's only true if the trees
are shallow and close to balanced. Unbalanced, deep trees could degenerate to exactly the same scenario as
the list implementation. Fortunately, maintaining an approximate, bushy balance is easy. For each tree, we
maintain a value called its rank. A tree of one vertex gets rank 0. When we merge trees,
we always merge into the tree with higher rank (leaving that rank unchanged), unless the trees being merged
have equal rank N. In that case, we pick one arbitrarily, and assign the result rank N+1. That's actually
enough to keep the amortized cost of all operations at O(lg N).
type setMember = (value, child, parent, rank)

def AddSet(sets, newSetRepresentative):
    sets.add(newSetRepresentative)
    newSetRepresentative.rank = 0

def Union(sets, setOneMember, setTwoMember):
    setOneRepresentative = sets.Find(setOneMember)
    setTwoRepresentative = sets.Find(setTwoMember)
    if setOneRepresentative.rank > setTwoRepresentative.rank:
        higherRankRepresentative = setOneRepresentative
        lowerRankRepresentative = setTwoRepresentative
    else:
        higherRankRepresentative = setTwoRepresentative
        lowerRankRepresentative = setOneRepresentative
    if higherRankRepresentative.rank == lowerRankRepresentative.rank:
        higherRankRepresentative.rank = higherRankRepresentative.rank + 1
    higherRankRepresentative.addChild(lowerRankRepresentative)
    sets.remove(lowerRankRepresentative)

def Find(sets, v):
    if v.parent == null:
        return v
    else:
        return Find(sets, v.parent)
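The rank rule can be sketched as runnable Python. A singleton starts at rank 0, as in the text; all the names here are illustrative, not from the original.

```python
# Union by rank: always hang the lower-rank tree under the higher-rank
# representative, so tree depth stays logarithmic. (Illustrative names.)

class Node:
    def __init__(self, value):
        self.value = value
        self.parent = None
        self.rank = 0  # a one-vertex tree has rank 0

def find(node):
    while node.parent is not None:
        node = node.parent
    return node

def union(a, b):
    rep_a, rep_b = find(a), find(b)
    if rep_a is rep_b:
        return
    if rep_a.rank < rep_b.rank:
        rep_a, rep_b = rep_b, rep_a  # rep_a is now the higher-rank root
    if rep_a.rank == rep_b.rank:
        rep_a.rank += 1  # equal ranks N: the merged tree gets rank N+1
    rep_b.parent = rep_a

# Merging pairs, then pairs of pairs: 8 nodes end up in a rank-3 tree.
nodes = [Node(i) for i in range(8)]
for i in range(0, 8, 2):
    union(nodes[i], nodes[i + 1])
union(nodes[0], nodes[2])
union(nodes[4], nodes[6])
union(nodes[0], nodes[4])
print(find(nodes[7]).rank)  # 3, i.e. lg(8): the tree stays shallow
```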
In fact, we can improve this even more in a simple way. Instead of maintaining rank, we can just
munge the tree. Each time we do a union, we just randomly pick one of the two trees, and attach it
as a child of the representative of the other. Each time we do a find operation, we keep a list of each node visited on the path to
the representative, and re-link all of them as direct children of the representative. This way, once in a
while, you'll have to do the expensive operation of re-linking a bunch of nodes after doing a climb
up the tree (with potential cost O(lg n)), but as a result, a bunch of finds that formerly would have taken O(lg n) will now take O(1). This one is easier to see with an example than with pseudocode. Imagine
we had a set of trees like this one, where the representatives are in red:
With this set of trees, we can do a Union(A,J). The representative of the set containing A is O, and J is a representative. So we merge those two sets, and get the following:
Now, suppose we want to do a Find(C). What we'd do is trace the path from C to its representative: C, to A, to O, to J. The representative is J. But we also keep the list of what we visited before reaching J - (C, A, O) - and link them as direct children of J, giving us this:
So doing the find(C) was fairly expensive - it required us to visit three nodes, and re-link two. But now, doing finds on C, A, and O will be fast next time. In fact, this approach does a better job of keeping the trees shallow and bushy than rank, with no extra storage overhead. And the cost of both union and find remains an amortized O(lg n).
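Path compression is easy to sketch in runnable Python. This mirrors the worked example above, where C's chain runs C, to A, to O, to J before the find; the node names come from the example, while the helper structure is my own.

```python
# Find with path compression: remember every node visited on the way up,
# then re-link them all as direct children of the representative.

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None

def find(node):
    visited = []
    while node.parent is not None:
        visited.append(node)
        node = node.parent
    for n in visited:
        n.parent = node  # flatten: future finds on these nodes are O(1)
    return node

# The state after Union(A, J) in the example: C -> A -> O -> J.
J, O, A, C = Node("J"), Node("O"), Node("A"), Node("C")
C.parent, A.parent, O.parent = A, O, J

print(find(C).name)   # J: the representative
print(C.parent.name)  # J: C now hangs directly off J
print(A.parent.name)  # J: so does A
```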
I suppose we're really interested in how many components there are, given that we want to ask it twice.
Ooh, that's really cool. This section of the series is really driving home for me the need for good data structures. I never gave much thought to just how much small details of a data structure could influence the difficulty of a problem.
Just looked at the Wikipedia page: if you do ranks and path compression, the average time per find becomes O(alpha(n)), where alpha is the inverse of the function n -> A(n,n), A the Ackermann function. For all remotely reasonable n, alpha(n) ≤ 5.
Any chance of modifying your CSS to not use absolute font sizes? This font size is too small for me, and changing my browser's font size setting has no effect. Please advise. Thank-you.
Suppose you've got a huge graph - millions of nodes. And you know that it's not connected [...]. And there are constantly new vertices and edges being added to the graph, but nothing is ever removed.
Are you describing Google's database? :-)
Come on, tell us what you're *really* doing!
Seriously: Will you be able to talk about your actual work, one day? Sometimes it's nice to see applications of all these abstract ideas. Especially coming from Google...
No, I'm *not* describing anything about Google's database. I don't even know if google's crawlers use union find.
There are several good reasons that I don't discuss my work on the blog.
One is that the blog is something I do for fun - it's a relief from work; I don't want to draw work into it, or it into work.
Another is that anything involving my work rightfully belongs to Google. Google doesn't mind me blogging, and they don't mind me saying that I work for them, but because of how we work at Google, we're held to a very high standard of confidentiality. If I were to talk about my work in any detail, I'd be likely to slip up and give something away. I like working for Google *way* too much to even take a chance of doing that.
Snap snap, grin grin, wink wink, nudge nudge, say no more. :-)
Seriously (again): I understand completely. Thanks for blogging about interesting things, whether they're linked to your work or not!
If my dim memory serves me right, union-find first became important in FORTRAN, where it's needed for the EQUIVALENCE operation? Or am I misremembering it?
heres how i would write union, is this way good or bad?
a,b=[s.Find(r) for r in (m,n)]
An excellent write-up, and touches on a lot o' problems that I've had to work with. I don't suppose you know of any simple approaches to:
- the fully dynamic case, with edge removals? There are some very complicated algorithms out there; I'd love something easy-to-implement that gave O(log n) find and delete; I'd settle for restrictions such as planarity.
- 2-connected graphs?
I ask because when searching for ways to tackle the above, I was stymied by too-deep-for-me theory and algorithms whose expositions took upwards of 50 pages... So I ended up with an O(n) solution.
In response to David Mercer:
Unfortunately, Internet Explorer 6 has a bug that prevents it from resizing text specified in terms of pixels. If possible, the best solution is simply to upgrade, either to Firefox, Opera, or (as a last resort) Internet Explorer 7.
There are a number of design problems that can't be overcome without specifying at least one font in pixels. For example, when translating a relative size (say 116.667%) into pixels, each browser has to decide how to round the result, and they all do it in subtly different ways. In IE6 in particular, 16 * (10/16) * 80 does not equal 800.
Alas, one base font size is enough to throw IE6 out of whack. I'm not claiming that any of those problems affect this layout, just trying to explain why someone might choose to define the size in pixels when on the surface it seems as if relative sizes would work just as well or better.
If upgrading your browser is out of the question, you might consider forcing the main text font to Verdana, which displays slightly larger than most other fonts at a given size. Trebuchet (which is used by default on this blog) appears smaller than average in most situations, so Verdana could be a significant improvement.