Now on ScienceBlogs: Charles Darwin February 12, 1809 - April 19, 1882

ScienceBlogs Book Club: Inside the Outbreaks

Good Math, Bad Math

Finding the fun in good math; Shredding bad math and squashing the crackpots who espouse it.

Search

Profile

markcc.jpg
Mark Chu-Carroll (aka MarkCC) is a PhD Computer Scientist, who works for Google as a Software Engineer. My professional interests center on programming languages and tools, and how to improve the languages and tools that are used for building complex software systems.

Donors Choose

Other Information

Add this blog to my Technorati Favorites!

Recent Posts

Recent Comments

Categories

Blogroll

Old Topic Indices

Great Online Books

« A Laughable Laffer Curve from the WSJ | Main | Order From Chaos Using Graphs: Ramsey Theory »

Cliques, Subgraphs, and a bit of biology

Category: Graph Theory
Posted on: July 15, 2007 6:52 PM, by Mark C. Chu-Carroll

pclique.jpg Today, I'm going to talk a bit about two closely related problems in graph theory: the maximal clique detection problem, and the maximal common subgraph problem. The two problems are interesting both on their own as easy-to-understand but hard-to-compute problems; and also because they have so many applications. In particular, the two are used extensively in bioinformatics and pharmacology for the analysis of complex molecular structure.

As I mentioned back in the definitions post, a clique is a complete graph: a graph where all vertices are adjacent, where any pair of vertices has an edge connecting them.

The clique problem is an optimization problem which asks: given a graph G, what's the largest subgraph of G which is a clique? If we convert that to a decision problem, we get: "Is there a clique of N vertices contained in G?"

The maximal common subgraph problem is another optimization problem which asks: Given two graphs G and H, what is the largest subgraph of G which is isomorphic to a subgraph of H? In its representation as a decision problem, it becomes: is there a subgraph in G with N nodes which is isomorphic to a subgraph of H?

Following the usual pattern of these things, for both problems decision problem is NP complete, and the corresponding optimization problem is NP-hard. In both problems, in the worst case, you need to exhaustively consider all possible subgraphs in order to find the solution. In both cases, there are useful heuristics for pruning the search space, but like other NP complete/NP hard problems, either the heuristics sometimes fail to produce the optimal solution, or they sometimes wind up taking exponential time.

But still, heuristics are the key. In general, for applications of the graph clique problem (or the maximal subgraph), you want the optimum, but a solution which is not the ideal optimum, but is very close is nearly as good. Again, this is the common pattern for solutions of NP complete problems. Perfect solutions are hard to come by, but good ones are often not unknown.

(A great example of this comes from a talk I saw a few years ago by a software engineering researcher named Daniel Jackson. Daniel had built a really cool tool called Alloy, which did some neat analysis. In the course of his talk, he was describing how the analysis actually worked. The way he described it was: "The bad news is, it's reducible to 3-SAT, which is NP complete. The good news is, it's reducible to 3-SAT, which means we can compute the answer really fast.")

Anyway - the basic heuristics for these two problems are very similar; I'll describe the method for the clique problem; and starting from the largest common clique is a good heuristic for a starting point for getting the largest common subgraph.

For the maximum clique, there's a clever heuristic, which will, on average, find the maximal sub-clique. The idea is, find the set of all clique of size 2. Then, given that set, try to find cliques within it that can be merged. When you find two cliques that can be merged into a clique, remove those two, and replace them with their merge. Keep doing that until you can no longer merge any members of the set.

This can be implemented quite efficiently. (In fact, it's just a special case of the good old classic union-find algorithm, which I'll probably write about one of these days.) It's not a true solution to the problem, because it doesn't always produce the maximal clique. It produces a maximal clique over its members, but not necessarily the maximal clique over G.

Let's pause for just a moment, so that I can explain that last line, which sounds quite confusing.

Take a graph G=(V,E), and a set of vertices C={v1,...,vn}⊂V, such that the induced subgraph over C is a clique. C is a maximal clique over the vertices in C if there is no vertex is G that can be added to C producing a larger clique.

C is the maximal clique in G if and only if there is no larger clique contained in G.

The key difference is that in the first (a maximal clique over a set of vertices), we're asking for the largest clique containing that specific set of vertices; in the second, we're asking for the largest clique formed from any set of vertices in G.

For example, in the diagram below, the green clique is the maximal clique over the set of vertices, {F,D,L}. There is no larger possible clique in the graph containing those three vertices than the green 4-clique. But the maximal clique in G is actually the 6-clique containing the vertices {A, G, H, J, K, M}.

maximal-cliques.jpg

So, how does this became useful for biology?

  • In evolutionary development, one of the ways of recognizing many processes is by looking at gene expression in terms of gene expression networks, which are graphs. Regulatory systems are often revealed as cliques in the expression network.
  • When studying protein structures, one of the major techniques is to look at similar proteins, and find common cliques in the graph-based structural networks of the proteins.
  • In pharmacology, the relationship between different chemicals is studied by finding the largest common subgraphs in the molecular graphs of the different substances.
  • Again in pharmacology, there is a graph based method for searching for a compound that will interact with a specific biochemical target. One of the ways that this is done is to generate a large library of graphs describing known molecules. Then, given a particular target, you generate a complimentary graph describing its structure; then you search for graphs in the library that have large cliques in common with the complimentary graph of the target. The resulting list of molecules from the library form a set of likely candidates for a molecule that will produce the desired action by bonding with the target.
Share on Facebook
Share on StumbleUpon
Share on Facebook

Comments

1

That's pretty awesome. Thanks for the graph posts. I've been wanting to get back into them after my combinatorics class in college. I also had a lot of graph theory dealing with game theory because of an undergraduate grant I got to study the Tree Game. In any case, I have a cool problem that I think I emailed you, but you must have missed:

Given a regular n-gon with all diagonals drawn, how many subregions are contained in the n-gon? Since I'm not sure exactly what the definition of subregion in this case, but I think it is simply: Anything that shares an edge.

If indeed the definition is anything that shares an edge, then I realized that the problem is isomorphic to a cool graph problem. I believe both problems are unsolved, but I was wondering if you could find anything on them. I have been unable to so far.

The isomorphic problem is simply: Given the same n-gon above, at every angle on the n-gon and every place where two diagonals meet, draw a vertex. Count the number of subgraphs contained in this graph which feature no bottlenecks. I'm not sure if I invented the term bottlenecks here, but a bottleneck is simply one vertex where two graphs come together. It's necessary for that to be in there for the above definition of subregion to hold true. What do you think about this problem?

Posted by: Kyle | July 15, 2007 7:08 PM

2
Regulatory systems are often revealed as cliques in the expression network.

That seems feasible, but I can't see that the linked paper supports that. The examples of functionally related cliques are on energy production and aromatic-acid biosynthesis.

Rather, the paper seems to find that cliques are a common property of gene networks illustrating different evolutionary contexts such as phyletic distribution (naturally some nice and large clusters), chromosomal proximity, or domain fusion.

In any case good illustrations of the technique.

Posted by: Torbjörn Larsson, OM | July 15, 2007 9:13 PM

3

mah

Posted by: genius | July 16, 2007 1:46 AM

4

Kyle: "bottlenecks" is, in my opinion, a poor choice for graphs in a biochemical context.

That's because "bottleneck enzyme" has a traditional meaning in metabolisms. Google it to see what I mean.

I happen to dispute the standard interpretation of "bottleneck enzyme" as based on a static analysis of kinetics, while a more complicated dynamic interpreation is justified.

I first published this in graduate school:

Jonathan V. Post, "Is Functional Identity of Products a Necessary Condition for the Selective Neutrality of Structural Gene Allele?", Population Biologists of New England (PBONE), Brown University, Providence, RI, June 1976

and

Jonathan V. Post, "Enzyme Kinetics and Selection of Structural Gene Products
-- A Theoretical Consideration", Society for the Study of Evolution, Ithaca, NY,
June 1977

and then in a series of papers such as:

Jonathan V. Post, "Nonlinear Enzyme Waves, Simulated Metabolism Dynamics, and Protein Nanotechnology", poster session, 2nd Artificial Life Workshop, 5-9
Feb 1990, Santa Fe, NM

and

Jonathan V. Post, "Analysis of Enzyme Waves: Success through Simulation",
Proceedings of the Summer Computer Simulation Conference, Seattle, WA, 25-27
August 1980, pp.691-695, AFIPS Press, 1815 North Lynn Street, Suite 800, Arlington,
VA 22209

In any case, those educated in biochemistry, enzymology, and metabolic systems would assume that you meant "bottleneck" in the old sense.

More confusion than it's worth.

Posted by: Jonathan Vos Post | July 16, 2007 9:11 AM

5

Deleting minimum vertex degrees iteratively from the graph ends up with complete subgraph or vertex disjoint regular graphs. I think this is simplest heuristics for finding maximal complete subgraph of the graph.

Posted by: I. Cahit | July 16, 2007 9:54 AM

6

Jonathan Vos Post: I didn't even mean to relate it to biology. It was just the only term I could think of that described the graph problem. :) However, there is surely a name that doesn't conflict with biology here either. I can't seem to come up with it though!

Posted by: Kyle | July 16, 2007 3:35 PM

7

I do not disagree with your assessment of this post for use in biological systems.

However, energy economics in the form of various quanta [ATP and related nucleotides, electron and proton transfer, types of bonding, etc] also play a role.

Petrie Nets may be a better graphing tool?

Posted by: Doug | July 17, 2007 12:03 AM

8

Kyle: keep trying. Inventing a new word is one of the inexpensive paths to immortality.

Nice paper on graphs and Protein Interaction Networks:

Mon, 16 Jul 2007

http://arxiv.org/pdf/0707.2076
Title: Node similarity within subgraphs of protein interaction networks
Authors: Orion Penner, Vishal Sood, Gabe Musso, Kim Baskerville, Peter Grassberger, Maya Paczuski
Comments: 10 pages, 5 figures
Subjects: Molecular Networks (q-bio.MN)

Also, ruined by a press release by someone who has no clue as to what "entropy" means, something cool is described at:

Proteins' Internal Motion Found To Affect Their Function: Implications For Drug Design

Posted by: Jonathan Vos Post | July 20, 2007 1:00 PM

9

Great topic Mark. One thing that may help to clarify your definitions of a maximal clique in G versus a maximal clique is to change your terminology a bit. My advisor got me in the habit of talking about a maximum clique on a graph as the largest one that exists and a maximal clique as the largest clique that exists given that it must contain a given vertex or set of vertices. The maximum clique is always maximal, but a maximal clique is not always the maximum.

This is similar to the idea that a hillclimbing algorithm may find a local maximum (a maximal solution) but not a global maximum (a maximum solution).

Posted by: Hank | July 20, 2007 2:28 PM

10

Kyle: here's a cool paper by someone I've seen lecture at my alma mater, about subgraphs with vertices in common, in a very interesting context.

Replacements for Fri, 20 Jul 07

arXiv:physics/0602033 (replaced)
[ps, pdf, other]
Title: Community Structure in the United States House of Representatives
Authors: Mason A. Porter, Peter J. Mucha, M. E. J. Newman, A. J. Friend
Comments: 44 pages, 13 figures (some with multiple parts and most in color), 9 tables, to appear in Physica A; new figures and revised discussion (including extra introductory material) for this version
Subjects: Physics and Society (physics.soc-ph); Statistical Mechanics (cond-mat.stat-mech); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO); Data Analysis, Statistics and Probability (physics.data-an)

Posted by: Jonathan Vos Post | July 20, 2007 10:07 PM

11

Not having a CS or good mathematical background, graph theory has only recently popped up on my radar - originally via perusal of the Python Cookbook site. I wonder if you have come across an analogous problem in bioinformatics to the n-tuple 'compression' problem. Specifically, given (say) the following 3-tuples:

(1, 2, 3)
(1, 2, 4)
(1, 3, 3)
(1, 3, 4)
(2, 2, 3)
(2, 2, 4)
(2, 3, 3)
(2, 3, 4)

desired result would be (1+2, 2+3, 3+4) (cartesian product of this -> above tuples). if the (2, *, *) tuples did not exist in the above list, then (1, 2+3, 3+4) would be desired result of the compression/merge process.

The big picture is, given a file with many unique n-tuples, produce a (near) optimal set of 'merged tuples' which can reconstruct the original tuples by cartesian product expansion. Of course there may be some n-tuples which are entirely unmergable.

It has been suggested to me that I create a kind of Hamming graph where nodes are n-tuples and edges join nodes differing in only one tuple field. e.g. in above case (1,2,3) would be adjacent to (1,2,4), (1,3,3) and (2, 2, 3). Drawing this 3-tuple case results in something which looks like a cube.

My question is how do I find such 'features' in a large spaghetti ball graph. If the small data set of 3-tuples given above is a subset of a large data set, then it is reasonable to assume that (say) the graph node of tuple (1, 2, 3) might also be connected to other nodes - for example (9, 2, 3). Which is where we get into questions of maximality.

So I guess I really have two questions: (1) How do I find such cube / hypercube features? and (2) How do I try to ensure that the features I pick for compression result in maximal compression?

I realise that this is probably non-trivial (to say the least), but would appreciate very much if you could offer any immediate insight which might be obvious to you (it certainly won't be to me!:)... in particular if you can point me to a directly analogous problem, it would be a very helpful starting point. I've got the vaguest idea that bioinformatics, computational pharmacology, etc. are the most likely fields where a similar problem might have cropped up and been solved?

Posted by: Peter Green | July 24, 2007 1:37 AM

12

Peter Green: see this recent paper in the arXiv from Fri, 20 Jul 2007, whiuch has a great algorithm:

http://arxiv.org/pdf/0707.2890
Title: Constructing level-2 phylogenetic networks from triplets
Authors: Leo van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie
Subjects: Populations and Evolution (q-bio.PE)

Posted by: Jonathan Vos Post | July 24, 2007 9:33 AM

13

Thanks. Had a brief peruse and can't quite see the relatedness given that phylogenetic networks seem to include the notion of directedness.

Posted by: Peter Green | July 24, 2007 1:35 PM

14

Just to check my understanding of why the heuristic can fail - at each iteration we try to grow possible cliques, eagerly merging them as opposed to considering two conflicting merges simultaneously. Merging too soon can include a vertex that will block other merges. These may include the one that would generate the maximal clique over G.

Nice article btw.

Posted by: Greg Bakker | July 26, 2007 10:51 PM

15

Greg:

Yes, you've got it exactly. The heuristic is over-eager, and can incorporate blockers into its solution.

Posted by: Mark C. Chu-Carroll | July 27, 2007 8:46 AM

ScienceBlogs

Search ScienceBlogs:

Go to:

Advertisement
Follow ScienceBlogs on Twitter

© 2006-2011 ScienceBlogs LLC. ScienceBlogs is a registered trademark of ScienceBlogs LLC. All rights reserved.