The Terminator said:
> Excluding the United States and Switzerland would make this worse.
> Further, do you have any justification for excluding them?
> Eliminating data points simply because they don’t “fit” isn’t very
> scientific.
Because with least squares estimation, outlying values bias the
results. In this case they make the correlation higher, giving a
value that would be quoted by a politician, not a statistician.
> I made the same mistake on an oral presentation a semester ago.
> I was computing a linear regression for data taken in the Millikan
> Oil Drop Experiment. I “threw out” outlying points because I thought
> that since they were much farther away than the rest of the clustered
> points, they were erroneous. Henry Kendall, a Nobel laureate, who was
> the instructor for Junior Lab, turned a shade of crimson, and explained
> to me in no uncertain terms that throwing out those points was UNDESIRABLE.
When you discarded outlying points, it was wrong. When I did, it was
right. The difference is that I knew what I was doing.
Here is a scatter plot of handguns vs homicide rate.
[ASCII scatter plot: homicide rate per 100,000 (vertical axis, roughly
1 to 9) against per cent of households with handguns (horizontal axis,
roughly 1 to 29). The US (U) sits alone at the upper right; the other
countries cluster in the lower left corner.]
11 of the countries form a cluster in the bottom left hand corner of
the plot. Linear regression effectively fits a line connecting this
cluster with the US in the upper right corner of the plot. This gives
a reasonably large correlation coefficient since all the cluster
points are close to the line. It is reasonable to do another
regression just within the cluster.
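To see how a single extreme point can dominate a least-squares fit,
here is a small Python sketch with invented numbers (not the actual
country data): a loose cluster of eleven points, plus one far-away
point playing the role of the US.

```python
# Illustrative only: the numbers below are invented, not real
# handgun/homicide data.

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Eleven clustered countries: low handgun ownership, low homicide rate,
# and only a weak relationship between the two within the cluster.
cluster_x = [1, 2, 3, 4, 5, 6, 7, 5, 3, 6, 4]                        # % households
cluster_y = [1.2, 0.8, 1.5, 1.1, 2.0, 1.3, 1.7, 0.9, 1.6, 1.0, 1.4] # homicides

r_cluster = pearson_r(cluster_x, cluster_y)          # small
r_with_outlier = pearson_r(cluster_x + [29],         # one extreme point
                           cluster_y + [9.0])        # makes r close to 1

print(f"r within cluster:    {r_cluster:+.2f}")
print(f"r including outlier: {r_with_outlier:+.2f}")
```

The regression line is effectively forced through the lone outlier, so
the correlation looks impressive regardless of what the cluster itself
is doing.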
It is also a good idea to think of reasons why the other countries do
not fit in the cluster: NI presumably because of political violence
not present in the other countries; Switzerland because half of the
handguns were army weapons, something not occurring in other countries;
and the US because (fill in your own reason here).
> He asked me, “Many years ago, it was taught that you should take three
> measurements when weighing a compound, and throw out the maverick point.
> Do you know what’s wrong with this?” I did. “Say that you have two
> low measurements and one high, so you toss out the high. Your computed
> average would be too low.” He concurred.
Let’s see, we weigh our compound three times and get weights of 26, 27
and 106 grams. One thing we can be sure of is that the true weight is
NOT 53 grams (the mean). If any single measurement has a 20% chance of
being bogus, then there is roughly a 50% chance (1 - 0.8^3 ~ 0.49) that
at least one of the three measurements, and hence the mean, will be
bogus. In this situation, a robust statistic such as the median is
called for.
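The arithmetic above, sketched with the Python standard library's
statistics module:

```python
from statistics import mean, median

weights = [26, 27, 106]   # grams; one measurement is clearly bogus

# The mean is 53 grams, far from any plausible true weight; the median
# is 27 grams, unaffected by the single bad value.
print(mean(weights))
print(median(weights))

# With a 20% chance that any one measurement is bogus, the chance that
# at least one of three (and hence the mean) is contaminated:
p_bad_mean = 1 - 0.8 ** 3   # about 0.49, the "50% chance" quoted above
print(p_bad_mean)
```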
I append an extract from comp.risks that may be relevant.
----------------------

Date: Mon 17 Mar 86 11:43:53-PST
From: JAGAN@SRI-CSL.ARPA
Subject: A Stitch in Time
To: Neumann@SRI-CSL.ARPA

This is the probable sequence of events that led us back in time on CSLA:

1. A power glitch (late night SUNDAY) caused the F4 to hard boot.
2. During a hard boot, the TIME is retrieved from eleven independent
   sources (which are assumed to be correct!).
3. One of these sources had the incorrect time of some warm day in 1972,
   causing the average to be wrongly computed, resulting in Dec 6th/1985.

Suggestion:

1. Change the statistical measure from MEAN to something less sensitive
   to one or two abnormal times; for example, the average of the 5th,
   6th, and 7th largest times.

  [IT IS ABSOLUTELY INCREDIBLE THAT UNSAFE ALGORITHMS continue to be
  used. This problem is as old as the hills. Statisticians routinely
  throw out the absurd values before computing the mean. Dorothy Denning
  pointed out the pun in their terminology (applicable to Byzantine
  agreement algorithms, where you don't trust anyone): the OUT-LIERS are
  really the OUT-LIARS. EVEN WORSE, Jagan points out that if the clock
  had been accidentally set INTO THE FUTURE, things could also get very
  sticky. We also have a problem of nonunique clock readings during the
  hour at 2AM when Daylight Savings Time ends. A good time to be
  asleep. PGN]
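The suggested fix (average the 5th, 6th, and 7th of the eleven sorted
readings rather than the raw mean) can be sketched as follows; the
eleven-source setup and the numbers are illustrative, not SRI's actual
code:

```python
def robust_time(readings):
    """Average the middle three of eleven sorted time readings, so one
    wildly wrong clock cannot drag the result back to 1972, or forward
    into the future."""
    assert len(readings) == 11, "the scheme described assumes eleven sources"
    middle = sorted(readings)[4:7]   # the 5th, 6th, and 7th in sorted order
    return sum(middle) / len(middle)

# Ten sane clocks (seconds, arbitrary epoch) plus one stuck in the past:
clocks = [1000, 1001, 998, 1003, 1001, 1000, 999, 1002, 1000, 1001,
          -430_000_000]

print(robust_time(clocks))   # close to 1000; the bad source has no effect
```

The raw mean of the same eleven readings would be dragged tens of
millions of seconds into the past by the single bad source.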