As I recently reported, there is an order of magnitude difference between the market share of Linux “out there” in the world, and the market share of Linux on Scienceblogs.com and on this very blog. Subsequently, I was trolled by my very own brother “… so, when is Luniux going to reach 1% market share?….” and this item has come out on ZDNet (which we all know is essentially funded by Microsoft, right?): Linux – Still chasing that elusive 1% market share.

Suddenly, it dawned on me that something is wrong with this picture.
Maybe.
Is it necessary to assume that the readers of Sb are really that different? An order of magnitude different? Isn’t it possible that the sample of users hitting pages at Scienceblogs.com is a perfectly good sample of internet use, and thus reflects the underlying distribution of systems (since most systems are ultimately used to access the internet, yes?)? And that this other data is bogus?
So I went and looked. Here is the description of the database used by the Market Share service that everyone seems to rely on:
We collect data from the browsers of site visitors to our exclusive on-demand network of live stats customers. The data is compiled from approximately 160 million visitors per month. The information published is an aggregate of the data from this network of hosted website statistics. The site unique visitor and referral information is summarized on a monthly basis.
WTF?
Is this supposed to be some kind of unbiased sample? But wait, there’s more:
In addition, we classify 430+ referral sources identified as search engines. Aggregate traffic referrals from these engines are summarized and reported monthly. The statistics for search engines include both organic and sponsored referrals. The websites in our population represent dozens of countries in regions including North America, South America, Western Europe, Australia / Pacific Rim and Parts of Asia.
Well, that means more data, but does it mean less bias? Or more bias? Here’s some additional information; a summary of features of the sampled population:
OK. We are asking the question: How many people are running Linux? Maybe we are asking how many computers are running Linux. These are not the same question. But the data we have comes from people using the internet to access sites, and there are two data sets: mine and theirs. Theirs is as summarized above, and mine is visitors to Scienceblogs, and they are different.
This profile … 76 percent in pay per click programs (i.e., buy Google AdSense space), just under half as commerce sites, and so on … this is the profile of sites that are being visited, the visitors counted, and the visitors’ OS (and other data) recorded.
So what is the taphonomy of this process … the steps of randomizing the data, or introducing bias, from the number of computers running each operating system to the clicks on this particular ‘demographic’ spread of sites?
The complexity of this problem is actually rather large. But I can tell you one thing: If you were my graduate student and you came to me with this sampling strategy, I’d send you back to kindergarten. (If I had that power.)
So, initially, I just thought that members of the Sb community were more Mac- and Linux-oriented than the rest of the drones out there. And that still might be true. But now, it seems that the number that many seem to rely on for “market share” is potentially biased, or at least, I’m not sure how one would demonstrate that it is not.
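To make the worry concrete, here is a toy sketch in Python of how two sets of sites could report very different "market share" numbers for exactly the same population of computers. Everything in it is an assumption for illustration: the "true" shares and the relative visit rates are numbers I made up, not measurements from Scienceblogs or from the Market Share service, and the function name observed_share is just something I invented for this example.

```python
def observed_share(true_share, visit_rate):
    """Share of each OS among recorded visits to a set of sites.

    true_share: fraction of all computers running each OS.
    visit_rate: relative tendency of users of each OS to visit these sites.
    """
    # Weight each OS by how often its users actually show up in the logs.
    weighted = {name: true_share[name] * visit_rate[name] for name in true_share}
    total = sum(weighted.values())
    # Normalize so the observed shares sum to 1.
    return {name: value / total for name, value in weighted.items()}


# Hypothetical "true" distribution of operating systems (made-up numbers).
true_share = {"Windows": 0.85, "Mac": 0.10, "Linux": 0.05}

# Hypothetical visit tendencies for two kinds of sites (also made up).
commerce_rate = {"Windows": 1.0, "Mac": 0.8, "Linux": 0.2}
science_rate = {"Windows": 0.5, "Mac": 1.0, "Linux": 1.5}

commerce = observed_share(true_share, commerce_rate)
science = observed_share(true_share, science_rate)

for label, result in (("commerce-heavy sites", commerce), ("science blog", science)):
    print(label, {name: round(value, 3) for name, value in result.items()})
```

With these invented inputs, the commerce-heavy sample sees far fewer Linux visitors than the "true" share and the science blog sample sees far more. Neither sample recovers the underlying number, and nothing in the logs alone tells you which one is closer.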




