Knowing the Score on Relationships

What would you say are the three factors most strongly associated with the salaries of major-league baseball players? According to a popular, well-established statistical method, they are walks, intentional walks and runs batted in.

How much does he earn?

But a paper recently published in Science reports on a new data analysis tool, which is able to find interesting relationships and trends in complex data sets - relationships that are invisible to other types of statistical analyses.

This could be a big deal: Large data sets with thousands of variables are increasingly common in fields as diverse as genomics, physics, political science, economics and more, so there is an increasing need for data analysis tools to make sense of such complex data sets.

It all started when Yakir Reshef, who is now a visiting Fulbright Scholar at the Weizmann Institute, was an undergraduate at Harvard University. Together with his older brother, David Reshef, then a master's student at MIT, he became interested in large data sets containing relationships whose type is unknown. In a collaboration that began in North America and crossed the Atlantic Ocean as Yakir moved to Israel, the two developed a new algorithm that could discover unexpected yet important relationships that would otherwise go unnoticed.

The tool the two developed - under the guidance of advisers Michael Mitzenmacher of the Harvard University School of Engineering and Applied Sciences and Pardis Sabeti of the Broad Institute - is named the maximal information coefficient, or MIC, and it scores pairs of variables based on how closely related they are. Researchers can calculate MIC for each pair of variables in their data set, rank the pairs by their scores (the higher the score, the more closely related the pair), and then examine the top-scoring pairs - that is, the pairs that are most strongly associated with each other.
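The score-rank-examine workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `grid_score` is a simplified, hypothetical stand-in that bins both variables into equal-count bins and keeps the best normalized mutual information over a handful of small grids. It is not the researchers' MIC algorithm, and the function names, the grid-size cap and the toy table are all invented for the example.

```python
import math
from itertools import combinations


def equal_frequency_bins(values, n_bins):
    """Give each value a bin index so that bins hold roughly equal counts.

    Ties are broken by input order, which is good enough for a sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information, in bits, between two lists of bin labels."""
    n = len(bx)
    joint, px, py = {}, {}, {}
    for a, b in zip(bx, by):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def grid_score(xs, ys):
    """Simplified stand-in for MIC: best normalized MI over small grids.

    The result lies between 0 and 1. The cap on grid size (n ** 0.6 cells)
    is an assumption made for this sketch.
    """
    n = len(xs)
    max_cells = max(4, int(n ** 0.6))
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        for ny in range(2, max_cells // nx + 1):
            mi = mutual_information(
                equal_frequency_bins(xs, nx), equal_frequency_bins(ys, ny)
            )
            best = max(best, mi / math.log2(min(nx, ny)))
    return best


def rank_pairs(table):
    """Score every pair of columns and return (score, a, b), highest first."""
    scored = [
        (grid_score(table[a], table[b]), a, b)
        for a, b in combinations(sorted(table), 2)
    ]
    return sorted(scored, reverse=True)


# Toy data (made up for illustration): one linear and one curved relationship.
xs = list(range(60))
toy = {
    "x": xs,
    "line": [2 * x + 1 for x in xs],
    "parabola": [(x - 30) ** 2 for x in xs],
}
for score, a, b in rank_pairs(toy):
    print(f"{a} vs {b}: {score:.2f}")
```

On a real data set, the top of the ranked list would be the place to start looking for relationships worth a closer examination.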

Associations between bacterial species in the gut microbiota of "humanized" mice

To test whether the algorithm actually works, Yakir and David worked with Ph.D. student Hilary Finucane, of the Weizmann Institute's Mathematics Department (and while we are on the subject of relationships, Yakir's fiancée). The three applied MIC to both known and novel data sets in global health, gene expression, the human gut microbiota, and - you guessed it - major-league baseball, and compared the results to those of current methods.

In one example, they examined data from the World Health Organization, covering 200 countries and containing 357 data variables per country. One interesting relationship they found was between female obesity and income: In the Pacific Islands, obesity increases monotonically with income, a trend that contrasts with the pattern in other countries. Was this an anomaly? On the contrary - obesity is considered a sign of status in the Pacific Islands. But while most methods would treat this separate trend as noise, MIC is able to identify relationships, such as this one, that include more than one trend.

The researchers explain that two attributes set MIC apart from other data analysis tools: It assigns high scores to a wide variety of relationship types hidden in large data sets, and it gives similar scores to relationships with comparable amounts of noise, whatever their type. In other words, they say, it can find "cool things going on" that are unexpected and therefore difficult to detect with other types of analyses.
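To see why a measure that handles many relationship types matters, here is a small sketch comparing the familiar Pearson correlation with a grid-based mutual information on a noiseless U-shaped relationship. The data, the function names and the hand-picked grid edges are all assumptions made for illustration; MIC itself searches over many grids rather than using one chosen by hand.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def binned_mi(xs, ys, x_edges, y_edges):
    """Mutual information, in bits, after cutting each axis at the given edges."""
    def label(v, edges):
        # Bin index = number of edges at or below the value.
        return sum(v >= e for e in edges)

    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        a, b = label(x, x_edges), label(y, y_edges)
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


# A perfect, noiseless relationship that is not a straight line: y = x squared.
xs = list(range(-20, 21))
ys = [x * x for x in xs]

print("Pearson r:", pearson(xs, ys))
print("Binned MI (bits):", binned_mi(xs, ys, [-9.5, 0, 9.5], [100]))
```

Because the curve falls and then rises symmetrically, the linear correlation comes out essentially zero, while a 4-by-2 grid captures nearly a full bit of shared information between the two variables.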

So what about baseball? MIC's results differ from those of the traditional method: Rather than walks, intentional walks and runs batted in, it identifies hits, total bases and the number of runs a player generates for a team as the factors most strongly associated with salary. So, which of the two methods is correct? The researchers have wisely opted to step aside, leaving it to baseball enthusiasts to decide which factors are - or should be - more strongly tied to salary!

Hilary Finucane and Yakir Reshef

You offered us so many interesting ideas and thoughts; it really does affect the relationship to earnings in the family.

Is the claim that the algorithm was able, without assistance, to classify certain nations as belonging to Polynesia? If so, you really ought to be writing about that. If not, if it just depends on human intervention to assign some relevance to a correlation that is positive in some cases and negative in others, then it's just spotting relationships within arbitrary subsets of a total population while ignoring the rest - in other words, a conspiracy theory generator!

By Ian Kemmish (not verified) on 07 Mar 2012 #permalink

Actually, it found a correlation between obesity and economic status in Polynesia in a large data set on global health. In other words, it identified a true (and known) trend that would probably have been discarded as noise in such a large pile of information using other statistical methods.