Suppose you've got a bunch of data. You believe that there's a linear

relationship between two of the values in that data, and you want to

find out whether that relationship really exists, and if so, what the properties

of that relationship are.

Once again, I'll use an example based on the first example that my

father showed me. He was working on semiconductor manufacturing. One of the

tests they did was to expose an integrated circuit to radiation, to determine

how much radiation you could expect it to be exposed to before it failed. (These were circuits for satellites, which are exposed to a lot of radiation!)

The expectation was that there was a linear relationship between

failure rate and radiation level. Expose a batch of chips to a given level R of radiation, and you'll get an X% failure rate. Expose the chips to double the

amount of radiation, and the failure rate will double. This relationship

held up pretty well - even when the model predicted a failure rate above 100%, they'd simply see all of the chips fail.

So we'd start with a list of data about failure rates. This is a very small

dataset; in reality, we'd have hundreds or thousands of data points, but for simplicity, I'm using a very small set.

Exposure | Failure Rate |
---|---|
10 | 8.2 |
12 | 10.1 |
14 | 11.8 |
16 | 10.3 |
18 | 12.7 |
20 | 12.9 |

There are a couple of things that we'd like to know from this data. First,

can we predict the failure rate for a particular exposure? And if so, how reliable will that prediction be?

To answer that, we first need to have some idea of what kind of relationship we're looking for. In this case, we believe that the relationship should be linear.

(If it's not, then we wouldn't use linear regression, and this would be a different post.) So what we're looking for is a linear equation that captures the relationship in that data. Since it's experimental data, it's not perfect: there are experimental errors, and there are extraneous factors. No line is going to be an exact fit to the data, so we want to find the best possible fit.

If you remember your algebra, a linear equation takes

the form y=mx+b, where "m" and "b" are constants. In this case, "x" is the

radiation level, and "y" is the failure rate. So what we want to do is to take

the set of data and find values for "m" and "b" that provide a *best fit* line for the data. The way that we define best fit is the line that minimizes the mean-square difference between the measured y-values and the y-values predicted by the line.

Mean-square in the above basically means that we take the difference between

the observed and the predicted, square it, and take the mean. That mean-square

value is what we want to minimize. But why square it? The intuitive answer comes

from an example. What's a better fit: one where, for a given pair of points, the (signed) distances are -10 and 15, or one where the distances are 7 and -2? If we use the mean of the signed distances, then the two are the same - the mean distance in each case is 2.5. If we use mean-square, then we've got 162.5 in the first case, and 26.5 in the second. The "7 and -2" fit is better.
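That comparison is easy to check. Here's a tiny Python sketch of the mean-square computation; the numbers are the example's hypothetical distances, not real data:

```python
def mean_square(residuals):
    """Mean of the squared residuals."""
    return sum(e * e for e in residuals) / len(residuals)

print(mean_square([-10, 15]))  # 162.5
print(mean_square([7, -2]))    # 26.5
```

Squaring makes the single large miss (-10, 15) much more costly than the two small ones (7, -2), even though their signed means are equal.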

So how do we minimize the mean-square y-distance? It's actually pretty easy. First we want to compute the slope. Normally, the slope is defined as ΔY/ΔX. We tweak that a bit, making it (ΔY·ΔX)/(ΔX)^{2}, and then sum it over all of the data points, measuring each difference from the mean. So what we end up with is:

m = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)^{2}

where x̄ and ȳ (the bar denotes a mean) are the means of the x and y values.

So, for our example data, x̄ = 15 and ȳ = 11. So we compute the following values for our data:

y_i − ȳ | x_i − x̄ | (x_i − x̄)^{2} | (x_i − x̄)(y_i − ȳ) |
---|---|---|---|
-2.8 | -5 | 25 | 14.0 |
-0.9 | -3 | 9 | 2.7 |
0.8 | -1 | 1 | -0.8 |
-0.7 | 1 | 1 | -0.7 |
1.7 | 3 | 9 | 5.1 |
1.9 | 5 | 25 | 9.5 |

If we sum up (x_i − x̄)(y_i − ȳ), we get 29.8; if we sum up (x_i − x̄)^{2}, we get 70. That gives us a slope of 29.8/70 ≈ 0.426.

Once we have the slope, we need to compute the point where the regression line crosses the y axis, which is simple: b = ȳ − m·x̄. So the intercept is 11 − 0.426×15 ≈ 4.6.
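The whole computation is short enough to verify directly. Here's a minimal Python sketch on the example data (the variable names are mine, and this is a sanity check, not production code):

```python
xs = [10, 12, 14, 16, 18, 20]
ys = [8.2, 10.1, 11.8, 10.3, 12.7, 12.9]

x_bar = sum(xs) / len(xs)   # mean exposure: 15.0
y_bar = sum(ys) / len(ys)   # mean failure rate: 11.0

# Slope: sum of deviation products over sum of squared x-deviations.
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
b = y_bar - m * x_bar       # intercept from b = ybar - m*xbar
print(round(m, 3), round(b, 2))  # 0.426 4.61
```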

The resulting line - the "best fit" line according to our linear regression - is shown in the diagram. You can see it's a reasonably good fit. But how good? There's a standard

measure of the quality of a line-fit in a linear regression, called the *correlation coefficient*. The correlation coefficient describes how well a simple linear relationship matches a set of data (which connects to an interesting point that I'll mention in a moment). The correlation coefficient is unitless, and varies between -1 and +1. If the data is a perfect negative correlation - meaning that as X increases, Y decreases in a perfectly linear fashion, then the correlation coefficient will be -1. If the data is a perfect positive correlation - that is, a perfect relationship where when X increases, Y also increases in a linear fashion, then the correlation coefficient will be +1. If there is absolutely no linear relationship between X and Y, then the correlation coefficient will be zero.

What's a "good" correlation coefficient? There is no single answer. If you're working on

an experiment where you've done a really good job of isolating out any external factors, and

where you've got very precise equipment, the correlation coefficient required to conclude

that you've correctly isolated a relationship could be very high - greater than 0.99. On the other hand, in social science work, anything over 0.5 is often considered good enough to

infer a relationship.

The definition of the correlation coefficient is a bit messy. In fact, there are several

different versions of it used for different purposes. The most common one is the Pearson

correlation, which is what I'm going to describe. The basic idea of the Pearson correlation

coefficient is to compute something called the *covariance* of the two variables in

the equation, and then divide that by the product of the standard deviation of the two

variables. The covariance is something that's usually defined via probability theory, and I

don't want to get into the details in this post, but the basic idea of it is: for each data

point (x,y), take the product of the difference between x and mean x, and the difference

between y and the mean y. Take the mean of that over all of the data points, and that's the *covariance*

of the variables x and y.

For our particular data set, the correlation coefficient comes out to 0.88 - a very

high correlation for such a small data set.
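For the curious, that number is easy to reproduce. Here's a Python sketch of the Pearson coefficient on the example data, written with plain sums (the 1/n factors in the covariance and the standard deviations cancel out of the ratio):

```python
import math

xs = [10, 12, 14, 16, 18, 20]
ys = [8.2, 10.1, 11.8, 10.3, 12.7, 12.9]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

# Covariance divided by the product of the standard deviations,
# with the 1/n factors cancelled from numerator and denominator.
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
r = cov / math.sqrt(sum((x - x_bar) ** 2 for x in xs)
                    * sum((y - y_bar) ** 2 for y in ys))
print(round(r, 2))  # 0.88
```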

The interesting point is, we constantly hear people talk about correlation. You'll frequently hear things about "correlation is not causation", "X correlates with Y", etc. Most people seem to think that correlation implies nothing more than "X is related to Y", or "If X increases, then Y increases". In fact, correlation means something very specific: X correlates with Y *if and only if* there is a linear relationship between X and Y. Nothing more, and nothing less. Correlation means a direct linear relationship.

When you look at data, it's easy to make the assumption that if, every time you change X, Y changes in a predictable linear fashion, then changing X must *cause* Y to change. That's probably the single most common mistake made in the misuse of statistics.

Correlation shows that there is a relationship - but it doesn't show *why*.

Real examples of this show up all the time. For a trivial example, my dog is more

likely to have an accident in the house if it's very cold out. In fact, there's

a strong inverse correlation between the temperature outside, and the likelihood of my

dog having an accident. Does cold weather cause my dog to have an accident? No. Cold

weather causes me to try to avoid taking him out, and because I don't take him out

enough, he has more accidents. The cold doesn't affect my dog's likelihood to

have an accident directly. He'd be equally likely to have an accident if the weather

were a balmy 70 degrees out, if I didn't take him out for a walk. He doesn't care whether

it's 10 or 70; he just needs to go.


Shouldn't that data set be analyzed with logistic or Poisson regression? You point out, after all, that the standard linear regression model predicts > 100% failure rate for sufficiently high levels of radiation (it also predicts < 0% failure rates for very low levels).

Noahpoah:

It's just a fake example for the sake of simplicity. In the real world, they were testing within a very narrow range of failure; their yield was (if I recall correctly) less than 20% success under the best conditions, down to 15ish percent under high stress. In their case, the simple linear regression had very reliable results. But for the purposes of this post, I'm using data that I just invented off the top of my head to illustrate the basic idea.

"Correlation is not causation."

My favorite example is the observation that as more firefighters are sent to a fire, the fire does more damage.

What about if one or both of your variables have uncertainty? I know that if you ignore the uncertainty in the measured variables, the standard deviation of the slope can be used as the uncertainty in the slope, but how do you fold in the uncertainty of the other variables?

I've heard of something called Total Least Squares that apparently achieves this, but I don't really understand it that well (and Mathematica doesn't support it without an additional $500 package called Experimental Data Analyst).

I'll admit that my understanding of linear regression is naive, but wouldn't the expected correlation coefficient tend to decrease with increasing numbers of points? (Or am I thinking of statistical significance levels?) Particularly when you are dealing with multiple regression. For example, R^2 for a data set consisting of two points is guaranteed to be 1, because two points determine a line.

Another remark is that many apparently nonlinear regressions can be reduced to linear regression by a suitable change of variables. In the example you gave, if your set of radiation doses extended to the point where you expect a nearly 100% failure rate, you might model the actual curve as something like y = tanh(ax), which looks linear near x=0 but asymptotically approaches y = 1 for large x. (Which is probably true of your sample, but you would need a rather large sample size to distinguish between this scenario and 100% failure rate.) In this case, taking z = tanh^-1(y) would give you the relation z = ax. Exponentials, logs, and power laws are particularly amenable to this treatment.
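The commenter's change-of-variables idea can be sketched in Python. Here the model y = tanh(a·x) is purely hypothetical (the coefficient a = 0.05 is an assumption, and the data is noise-free for clarity); applying atanh linearizes it to z = a·x:

```python
import math

# Hypothetical saturating model y = tanh(a*x); a is assumed, not from the post.
a = 0.05
xs = [10, 12, 14, 16, 18, 20]
ys = [math.tanh(a * x) for x in xs]    # noise-free synthetic data

zs = [math.atanh(y) for y in ys]       # change of variables: z = a*x
# Least-squares fit of z = m*x through the origin:
m = sum(x * z for x, z in zip(xs, zs)) / sum(x * x for x in xs)
print(m)  # recovers a, up to floating-point error
```

With real (noisy) data the recovered slope would only approximate a, but the linear-regression machinery applies unchanged after the transform.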

What's a better fit: one where for a given pair of points, the distances are -10 and 15, or one where the distances are 7 and -2? If we use mean distance, then the two are the same - the mean distance in each case is 2.5.

I'm not seeing how you get that. Isn't it 12.5 in the first case?

Sorry if this is dumb

Yow, $500!? Wikipedia's article is far less expensive.

This is great! Could you just explain what the bar over the variable refers to? Thanks!

No, it's 2.5

The mean distance between X and Y is (X + Y)/2

-10 + 15 = 5

5 / 2 = 2.5

Thanks, Skemono, missed the minus sign since it got word wrapped here to a different line. (And my eyesight is not too good to begin with.)

If you'll use it, y'should understand it, especially if you don't need or want to program it.

There are many references, including this and that, but IMO the best introduction to TLS and that amazing workhorse, the SVD, is Strang and Borre. See Strang's online course, too.

Try this interesting unpublished manuscript by E.T. Jaynes which has a discussion of the whole total least squares debacle and its history: http://bayes.wustl.edu/etj/articles/leapz.pdf

Correlation and causation:

In children, there is a significant correlation between astrological sign and IQ. This lessens with age, disappearing well before adulthood.

Students who hire tutors do worse than students who do not hire tutors

I recall from my probability and statistics course a few decades ago (how the time flies) that if you ask people to draw a line through a set of data points, most people will draw a pretty good approximation to a least-squares fit. The only real value that I've ever gotten out of this little fact is that if you ever see a line drawn through a set of data points that doesn't look right, someone has screwed up while running the data analysis.

Wow, Jaynes could be incredibly bombastic. It's completely understandable though, when entire fields missed something quite basic and refused to take much interest in Bayesian methods.

I'm a big fan of linear algebraic methods, and it's interesting that this particular problem can be solved with matrix methods. I'm probably going to slaughter the math notation, I'll just use a notation similar to that in the 'maxima' software.

You've got the model: x*m + 1*b = y

For all of your data points, write this as:

[A] . [[m] [b]] = [y]

Where 'A' is an Nx2 matrix, each row of which is [x, 1]. [y] is an Nx1 matrix, each row of which is a 'y' value.

Then, a least-squares fit to the data is given by:

transpose(A) . A . [[m] [b]] = transpose(A) . [y]

transpose(A) . A is a 2x2 matrix. You solve this matrix equation by the usual methods, and you get a least-squares fitting of m and b.

This technique generalizes, and can be used any time you're trying to obtain a least-squares approximation to the linear coefficients of a fitting function.
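That recipe can be written out concretely. Here's a pure-Python sketch that forms and solves the 2×2 normal equations for the post's example data, using Cramer's rule as the commenter's "usual methods":

```python
xs = [10, 12, 14, 16, 18, 20]
ys = [8.2, 10.1, 11.8, 10.3, 12.7, 12.9]
n = len(xs)

# A is Nx2 with rows [x, 1]; the normal equations A^T A [m b]^T = A^T y are:
#   [ sum(x^2)  sum(x) ] [m]   [ sum(x*y) ]
#   [ sum(x)    n      ] [b] = [ sum(y)   ]
sx  = sum(xs)
sx2 = sum(x * x for x in xs)
sy  = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

det = sx2 * n - sx * sx          # 2x2 determinant
m = (n * sxy - sx * sy) / det    # Cramer's rule
b = (sx2 * sy - sx * sxy) / det
print(round(m, 3), round(b, 2))  # 0.426 4.61
```

This agrees with the closed-form slope/offset formulas used in the Perl program further down the thread, which are exactly these normal equations solved symbolically.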

Thanks for this.

I remember having blindly used a set of formulae (cribbed from the manual of a more advanced Casio calculator than mine!) to get best-fit straight lines, but this explains it much better.

Again, like the standard deviation, it's possible, by doing a bit (OK then, a lot) of simplification, to derive the equations for the slope and offset in terms of just the final totals Σx, Σx², Σy, Σxy and N.

Here's a quick program I knocked together to calculate the slope and offset for a dataset. Unfortunately, the blog software has spoilt my nice indentation :(

```perl
#!/usr/bin/perl -w
use strict;

my ($x, $y, $n, $sigmaX, $sigmaX2, $sigmaXY, $sigmaY);
my ($offset, $slope);

while ($_ = <>) {
    tr/\r\n//d;
    if (($x, $y) = /([0-9.]+)[^0-9.]+([0-9.]+)/) {
        $sigmaX  += $x;
        $sigmaY  += $y;
        $sigmaX2 += $x * $x;
        $sigmaXY += $x * $y;
        ++$n;
    }
}

$offset = ($sigmaY * $sigmaX2 - $sigmaX * $sigmaXY)
        / ($n * $sigmaX2 - $sigmaX * $sigmaX);
$slope  = ($n * $sigmaXY - $sigmaX * $sigmaY)
        / ($n * $sigmaX2 - $sigmaX * $sigmaX);

print "Slope is $slope\nOffset is $offset\n";
exit;
```

It expects data on STDIN (so just use the < operator) and is pretty liberal with what it accepts. Basically it reads the first two groups of digits and decimal points, separated by at least one character which is neither a digit nor a point, from each line. A CSV file fits this specification nicely.

By the way, my favourite bogus correlation is between kilos of ice cream sold on a given day and deaths by drowning on that day.

Oops. That should have been

`while ($_ = <>) {`

of course.

The eye is a pretty good judge of nonlinear fits as well. For example, while it is very hard to draw a correct hyperbolic or exponential fit by eye, the eye is quite good at judging its quality. I always tell people never to just accept the parameters from a computer fit--check to see if it "looks right," because if it doesn't, something is likely to be wrong.

Nice post. An important point to note, conveyed symbolically but not in words, is that the least-squares regression line always passes through the bivariate mean.

I use linear regression (and its big brother, Analysis of Covariance) in all my work, usually to scale some measure of whole-animal physiological performance to body mass.

As pointed out, "ordinary" least-squares (OLS) regression minimizes the squared "vertical" distance from the computed line to each data point, these distances parallel to the y axis. This line is strictly appropriate only when the independent (x) variable is measured without error. It's operationally appropriate when x is measured with much greater precision than y (e.g. body mass and metabolic rate) and the point is to use the line to predict y from an x value measured in the future.

In other cases, though, the line is not intended for prediction, x is measured with similar error as is y, and sometimes it is not even clear which variable is independent and which dependent (e.g. growth rate and metabolic rate). I had not previously heard of the "Total Least-Squares" approach, but I guess it's a multivariate version of what we biologists call "reduced major-axis (RMA) regression" or "geometric mean regression." This technique computes a line that minimizes the squared perpendicular distance from each data point to the line. I am no statistician, but it's my understanding that the arithmetic works out such that the RMA slope = the OLS slope divided by r (so the RMA slope is always higher than the OLS slope). The RMA line is also constrained to pass through the bivariate mean, so once the new slope is calculated the formula shown in the OP can be used to find the intercept.
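Taking the stated relationship (RMA slope = OLS slope / r) as given, the post's example data works out as follows (a Python sketch, not a substitute for a proper RMA implementation):

```python
import math

xs = [10, 12, 14, 16, 18, 20]
ys = [8.2, 10.1, 11.8, 10.3, 12.7, 12.9]
x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

ols_slope = sxy / sxx                       # ordinary least squares
r = sxy / math.sqrt(sxx * syy)              # Pearson correlation
rma_slope = ols_slope / r                   # = sqrt(syy/sxx); >= OLS slope
rma_intercept = y_bar - rma_slope * x_bar   # line still passes through the means
print(round(rma_slope, 3))  # 0.482
```

As the commenter notes, the RMA slope (≈0.482) is steeper than the OLS slope (≈0.426), by exactly the factor 1/r.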

One other point: we commonly use the value r^2, the "coefficient of determination," to characterize power of a predictive relationship. R-squared is interpreted as the proportion of variance in the y (dependent) variable that is "accounted for" or "explained by" the variance in the independent variable and the computed regression. It's a way to get quantitatively from correlation to (an inference of) causation, if the direction of causation is already clear a priori.

Heh! While I've never heard of this as a problem (or its solution) I suspected as much from extrapolation.

But also from my personal heuristic to describe what is going on. The idealized regression line can be modeled as a stiff rod hanging from strings, and by the linearity in Hooke's law we get the squares from the minimum of potential energy when the rod hangs free.

So the general problem should be solvable by letting the strings connect to loose instead of fixed x (or y) distance points, and thus minimizing the squared perpendicular distance. To me it seems improbable that this ever was a problem for mathematicians relying on standard measure theory (as AFAIU probabilities are compatible with). But what do I know?

As I remember, the model assumes that the residuals follow a standard normal distribution. It is necessary to study the residuals to assess whether the model is valid for a particular set of data. An F-test for significance of regression can be used to determine if the model can explain the variation in the data.

To Eduardo

You're right about having to study the residuals, but they don't have to be from a STANDARD normal; any normal will do. The assumptions are that the residuals are independently and identically distributed, with a normal distribution with mean 0 and constant variance.

Daniel Hawkins (@4) - if your explanatory variable has error, then the estimate of the slope is biased towards zero. There's no really good solution, unless you know something about the error structure (e.g. the variance of the uncertainty).

Aaron (@13) - in fairness to the other fields, Bayesian methods weren't much use for most problems until the computational methods developed. Quite frankly, they were right to ignore them.

Eduardo (@20) - the residuals don't have to be normally distributed. The regression is the result of trying to minimise the square of the residuals. Where normality makes a difference is in significance tests and confidence intervals, which are easiest to construct assuming normality.

Hi, maybe I missed it in the comments, but what you should explain for beginners is why you square the distance instead of taking the absolute value; say, why don't we use the sum of the absolute differences?

Is it only because it makes the computation easier, or is there a deeper reason?

/threadjack

If everyone had agreed it was right to ignore Bayesian methods, then no one would have bothered to develop and study the necessary computational methods.

Conversely, if more people had been interested in Bayesian methods, the Metropolis algorithm might have made the jump from the physics literature to the statistics literature in the '50s instead of the '90s.

/end threadjack

My favorite example of curve-fitting abuse:

http://www.talkorigins.org/faqs/c-decay.html

Yes, the mean distance calculation is wrong, since distance is usually taken as the absolute value. I.e. distances 15 and -10 are really 15 and 10, and 7 and -2 are really 7 and 2, with means of 12.5 and 4.5, respectively.

The squares are needed to get the mean to want to be in the middle. In other words, there is a difference between differences of 11 and -1 versus 6 and -6: in the latter case, the line is going exactly in the middle between the data points, and in the first case it is off to one side; yet the mean absolute distance in both cases is 6. If you square the distances, the first case looks much worse, with (121+1)/2 against (36+36)/2.

-mendel

Hope somebody--other than mendel, whose comment I cannot parse--will answer mickkelly's question about squaring instead of using absolute values. I've often wondered the same, while obediently computing my sums of squares.

It was discussed in the earlier post.

I might add that AFAIU the deeper reason for the square popping up so often is that it is just an expression for energy (as my heuristic shows). It has all sorts of connections to noise power in stochastic processes and other statistical subjects. Or maybe that is just my physics presumptions.

Nice work here (research stats happens to be my Briar Patch in the math realm); and -- for most purposes, as people tend to be leery of relationships more complicated than linear or exponential -- your explanation of "correlation" is a good one.

I would ask, "can't we have non-linear correlations?", but only for the more advanced reader.

Thanks again. I enjoy perusing your posts.

mickkelly - That is done, technically it's called the L1 norm. Historically, least squares was developed first (by a couple of hundred years or so), and is more tractable mathematically. There are theoretical reasons why it might work well, and the extensions are well developed.

It also works well in practice: some things can screw it up, but they can be spotted, and dealt with.

I guess the absolute distance isn't used because least squares works well, so there's no point in developing the theory as far (i.e. it's a historical reason: least squares got there first). I know there has been work on more general forms of estimation, but I haven't followed that literature at all.

Um, no. The key innovations were faster computers, and the Gibbs sampler, which was devised by physicists in the 1980s (Geman & Geman, 1984 IIRC).

Of course, there were people working with Bayesian methods, but the ideas never really got out of the stats department, because they weren't generally useful. The key statistical innovation was in graphical models, which allowed the Gibbs sampler to be applied to complex models by breaking them down into parts. The M-H algorithm would be an absolute bitch to tune on the whole model, so you need a method to decompose it into smaller pieces. Hence the graphical modelling.

Thanks, Torbjörn and Bob.

I ran my first linear regressions as part of my undergrad thesis in 1982 (back in the BPC epoch; I don't even think there were pocket statistical calculators yet--I remember thinking my HP-15 was sent from heaven when I got one in '85 or so). I did the calculations on a huge, SUV-sized mechanical calculator, with amazing levers and gears and turning wheels of digits. It was an amazing machine, and I haven't seen one since.

very informative post. maybe somebody could help me to understand reduced major axis regression, comparing slopes, basic GLM - where could I find info on really good on-line courses for this subject matter, preferably for spss application. any thoughts?