Today, I'm going to talk a bit about two closely related problems in graph theory: the maximal clique detection problem, and the maximal common subgraph problem. The two problems are interesting both on their own as easy-to-understand but hard-to-compute problems; and also because they have so many applications. In particular, the two are used extensively
in bioinformatics and pharmacology for the analysis of complex molecular structure.
As I mentioned back in the definitions post, a clique is a complete graph: a graph where all vertices are adjacent, where any pair of vertices has an edge connecting them.
The clique problem is an optimization problem which asks: given a graph **G**, what's the
largest subgraph of **G** which is a clique? If we convert that to a decision problem, we get:
"Is there a clique of N vertices contained in **G**?"
The maximal common subgraph problem is another optimization problem which asks: Given two graphs **G** and **H**, what is the largest subgraph of G which is isomorphic to a subgraph of H? In its representation as a decision problem, it becomes: is there a subgraph in **G** with N nodes which is isomorphic to a subgraph of H?
Following the usual pattern of these things, for both problems decision problem is NP complete, and the corresponding optimization problem is NP-hard. In both problems, in the worst case, you need to exhaustively consider all possible subgraphs in order to find the solution. In both cases, there are useful heuristics for pruning the search space, but like other NP complete/NP hard problems, either the heuristics sometimes fail to produce the optimal solution, or they sometimes wind up taking exponential time.
But still, heuristics are the key. In general, for applications of the graph clique problem (or the maximal subgraph), you want the optimum, but a solution which is not the ideal optimum, but is very close is nearly as good. Again, this is the common pattern for solutions of NP complete problems. Perfect solutions are hard to come by, but good ones
are often not unknown.
(A great example of this comes from a talk I saw a few years ago by a software engineering researcher named Daniel Jackson. Daniel had built a really cool tool called Alloy, which did some neat analysis. In the course of his talk, he was describing how the analysis actually worked. The way he described it was: "The bad news is, it's reducible to 3-SAT, which is NP complete. The good news is, it's reducible to 3-SAT, which means we can compute the answer really fast.")
Anyway - the basic heuristics for these two problems are very similar; I'll describe the method for the clique problem; and starting from the largest common clique is a good heuristic for a starting point for getting the largest common subgraph.
For the maximum clique, there's a clever heuristic, which will, on average, find the maximal sub-clique. The idea is, find the set of all clique of size 2. Then, given that set, try to find cliques within it that can be merged. When you find two cliques that can be merged into a clique, remove those two, and replace them with their merge. Keep doing that until you can no longer merge any members of the set.
This can be implemented quite efficiently. (In fact, it's just a special case of the good
old classic union-find algorithm, which I'll probably write about one of these days.) It's
*not* a true solution to the problem, because it *doesn't* always produce the maximal
clique. It produces *a* maximal clique over its members, but not necessarily *the* maximal
clique over G.
Let's pause for just a moment, so that I can explain that last line, which sounds quite confusing.
Take a graph G=(V,E), and a set of vertices C={v1,...,vn}⊂V,
such that the induced subgraph over C is a clique. C is *a* maximal clique over the vertices in C if there is no vertex is G that can be added to C producing a larger clique.
C is *the* maximal clique in G if and only if there is *no larger clique* contained in G.
The key difference is that in the first (a maximal clique over a set of vertices), we're
asking for the largest clique containing *that specific set* of vertices; in the second, we're asking for the largest clique formed from *any set* of vertices in G.
For example, in the diagram below, the green clique is the maximal clique over the set of vertices, {F,D,L}. There is no larger possible clique in the graph containing those three
vertices than the green 4-clique. But the maximal clique in G is actually the 6-clique containing the vertices {A, G, H, J, K, M}.
So, how does this became useful for biology?
* In evolutionary development, one of the ways of recognizing many processes is
by looking at gene expression in terms of gene expression networks, which
are graphs. Regulatory systems are often revealed as cliques in the expression
network.
* When studying protein structures, one of the major techniques is to
look at similar proteins, and find common cliques in the graph-based
structural networks of the proteins.
* In pharmacology, the relationship between different chemicals is studied
by finding the largest common subgraphs in the molecular graphs of the different
substances.
* Again in pharmacology, there is a graph based method for searching for a compound that will interact
with a specific biochemical target. One of the ways that this is done is to generate
a large library of graphs describing known molecules. Then, given a particular
target, you generate a *complimentary* graph describing its structure; then you
search for graphs in the library that have large cliques in common with the
complimentary graph of the target. The resulting list of molecules from the library
form a set of likely candidates for a molecule that will produce the desired action
by bonding with the target.
- Log in to post comments
That's pretty awesome. Thanks for the graph posts. I've been wanting to get back into them after my combinatorics class in college. I also had a lot of graph theory dealing with game theory because of an undergraduate grant I got to study the Tree Game. In any case, I have a cool problem that I think I emailed you, but you must have missed:
Given a regular n-gon with all diagonals drawn, how many subregions are contained in the n-gon? Since I'm not sure exactly what the definition of subregion in this case, but I think it is simply: Anything that shares an edge.
If indeed the definition is anything that shares an edge, then I realized that the problem is isomorphic to a cool graph problem. I believe both problems are unsolved, but I was wondering if you could find anything on them. I have been unable to so far.
The isomorphic problem is simply: Given the same n-gon above, at every angle on the n-gon and every place where two diagonals meet, draw a vertex. Count the number of subgraphs contained in this graph which feature no bottlenecks. I'm not sure if I invented the term bottlenecks here, but a bottleneck is simply one vertex where two graphs come together. It's necessary for that to be in there for the above definition of subregion to hold true. What do you think about this problem?
That seems feasible, but I can't see that the linked paper supports that. The examples of functionally related cliques are on energy production and aromatic-acid biosynthesis.
Rather, the paper seems to find that cliques are a common property of gene networks illustrating different evolutionary contexts such as phyletic distribution (naturally some nice and large clusters), chromosomal proximity, or domain fusion.
In any case good illustrations of the technique.
mah
Kyle: "bottlenecks" is, in my opinion, a poor choice for graphs in a biochemical context.
That's because "bottleneck enzyme" has a traditional meaning in metabolisms. Google it to see what I mean.
I happen to dispute the standard interpretation of "bottleneck enzyme" as based on a static analysis of kinetics, while a more complicated dynamic interpreation is justified.
I first published this in graduate school:
Jonathan V. Post, "Is Functional Identity of Products a Necessary Condition for the Selective Neutrality of Structural Gene Allele?", Population Biologists of New England (PBONE), Brown University, Providence, RI, June 1976
and
Jonathan V. Post, "Enzyme Kinetics and Selection of Structural Gene Products
-- A Theoretical Consideration", Society for the Study of Evolution, Ithaca, NY,
June 1977
and then in a series of papers such as:
Jonathan V. Post, "Nonlinear Enzyme Waves, Simulated Metabolism Dynamics, and Protein Nanotechnology", poster session, 2nd Artificial Life Workshop, 5-9
Feb 1990, Santa Fe, NM
and
Jonathan V. Post, "Analysis of Enzyme Waves: Success through Simulation",
Proceedings of the Summer Computer Simulation Conference, Seattle, WA, 25-27
August 1980, pp.691-695, AFIPS Press, 1815 North Lynn Street, Suite 800, Arlington,
VA 22209
In any case, those educated in biochemistry, enzymology, and metabolic systems would assume that you meant "bottleneck" in the old sense.
More confusion than it's worth.
Deleting minimum vertex degrees iteratively from the graph ends up with complete subgraph or vertex disjoint regular graphs. I think this is simplest heuristics for finding maximal complete subgraph of the graph.
Jonathan Vos Post: I didn't even mean to relate it to biology. It was just the only term I could think of that described the graph problem. :) However, there is surely a name that doesn't conflict with biology here either. I can't seem to come up with it though!
I do not disagree with your assessment of this post for use in biological systems.
However, energy economics in the form of various quanta [ATP and related nucleotides, electron and proton transfer, types of bonding, etc] also play a role.
Petrie Nets may be a better graphing tool?
Kyle: keep trying. Inventing a new word is one of the inexpensive paths to immortality.
Nice paper on graphs and Protein Interaction Networks:
Mon, 16 Jul 2007
http://arxiv.org/pdf/0707.2076
Title: Node similarity within subgraphs of protein interaction networks
Authors: Orion Penner, Vishal Sood, Gabe Musso, Kim Baskerville, Peter Grassberger, Maya Paczuski
Comments: 10 pages, 5 figures
Subjects: Molecular Networks (q-bio.MN)
Also, ruined by a press release by someone who has no clue as to what "entropy" means, something cool is described at:
Proteins' Internal Motion Found To Affect Their Function: Implications For Drug Design
Great topic Mark. One thing that may help to clarify your definitions of a maximal clique in G versus a maximal clique is to change your terminology a bit. My advisor got me in the habit of talking about a maximum clique on a graph as the largest one that exists and a maximal clique as the largest clique that exists given that it must contain a given vertex or set of vertices. The maximum clique is always maximal, but a maximal clique is not always the maximum.
This is similar to the idea that a hillclimbing algorithm may find a local maximum (a maximal solution) but not a global maximum (a maximum solution).
Kyle: here's a cool paper by someone I've seen lecture at my alma mater, about subgraphs with vertices in common, in a very interesting context.
Replacements for Fri, 20 Jul 07
arXiv:physics/0602033 (replaced)
[ps, pdf, other]
Title: Community Structure in the United States House of Representatives
Authors: Mason A. Porter, Peter J. Mucha, M. E. J. Newman, A. J. Friend
Comments: 44 pages, 13 figures (some with multiple parts and most in color), 9 tables, to appear in Physica A; new figures and revised discussion (including extra introductory material) for this version
Subjects: Physics and Society (physics.soc-ph); Statistical Mechanics (cond-mat.stat-mech); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO); Data Analysis, Statistics and Probability (physics.data-an)
Not having a CS or good mathematical background, graph theory has only recently popped up on my radar - originally via perusal of the Python Cookbook site. I wonder if you have come across an analogous problem in bioinformatics to the n-tuple 'compression' problem. Specifically, given (say) the following 3-tuples:
(1, 2, 3)
(1, 2, 4)
(1, 3, 3)
(1, 3, 4)
(2, 2, 3)
(2, 2, 4)
(2, 3, 3)
(2, 3, 4)
desired result would be (1+2, 2+3, 3+4) (cartesian product of this -> above tuples). if the (2, *, *) tuples did not exist in the above list, then (1, 2+3, 3+4) would be desired result of the compression/merge process.
The big picture is, given a file with many unique n-tuples, produce a (near) optimal set of 'merged tuples' which can reconstruct the original tuples by cartesian product expansion. Of course there may be some n-tuples which are entirely unmergable.
It has been suggested to me that I create a kind of Hamming graph where nodes are n-tuples and edges join nodes differing in only one tuple field. e.g. in above case (1,2,3) would be adjacent to (1,2,4), (1,3,3) and (2, 2, 3). Drawing this 3-tuple case results in something which looks like a cube.
My question is how do I find such 'features' in a large spaghetti ball graph. If the small data set of 3-tuples given above is a subset of a large data set, then it is reasonable to assume that (say) the graph node of tuple (1, 2, 3) might also be connected to other nodes - for example (9, 2, 3). Which is where we get into questions of maximality.
So I guess I really have two questions: (1) How do I find such cube / hypercube features? and (2) How do I try to ensure that the features I pick for compression result in maximal compression?
I realise that this is probably non-trivial (to say the least), but would appreciate very much if you could offer any immediate insight which might be obvious to you (it certainly won't be to me!:)... in particular if you can point me to a directly analogous problem, it would be a very helpful starting point. I've got the vaguest idea that bioinformatics, computational pharmacology, etc. are the most likely fields where a similar problem might have cropped up and been solved?
Peter Green: see this recent paper in the arXiv from Fri, 20 Jul 2007, whiuch has a great algorithm:
http://arxiv.org/pdf/0707.2890
Title: Constructing level-2 phylogenetic networks from triplets
Authors: Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie
Subjects: Populations and Evolution (q-bio.PE)
Thanks. Had a brief peruse and can't quite see the relatedness given that phylogenetic networks seem to include the notion of directedness.
Just to check my understanding of why the heuristic can fail - at each iteration we try to grow possible cliques, eagerly merging them as opposed to considering two conflicting merges simultaneously. Merging too soon can include a vertex that will block other merges. These may include the one that would generate the maximal clique over G.
Nice article btw.
Greg:
Yes, you've got it exactly. The heuristic is over-eager, and can incorporate blockers into its solution.