Is Linux Getting the Shaft?

As I recently reported, there is an order of magnitude difference between the market share of Linux "out there" in the world and the market share of Linux on Scienceblogs.com and on this very blog. Subsequently, I was trolled by my very own brother ("... so, when is Luniux going to reach 1% market share?....") and this item came out on ZDNet (which we all know is essentially funded by Microsoft, right?): Linux - Still chasing that elusive 1% market share.

Suddenly, it dawned on me that something is wrong with this picture.

Maybe.

Is it necessary to assume that the readers of Sb are really that different? An order of magnitude different? Isn't it possible that the sample of users hitting pages at Scienceblogs.com is a perfectly good sample of internet use, and thus reflects the underlying distribution of systems (since most systems are ultimately used to access the internet, yes?)? And that this other data is bogus?

So I went and looked. Here is the description of the database used by the Market Share service that everyone seems to rely on:

We collect data from the browsers of site visitors to our exclusive on-demand network of live stats customers. The data is compiled from approximately 160 million visitors per month. The information published is an aggregate of the data from this network of hosted website statistics. The site unique visitor and referral information is summarized on a monthly basis.

WTF?

Is this supposed to be some kind of unbiased sample? But wait, there's more:

In addition, we classify 430+ referral sources identified as search engines. Aggregate traffic referrals from these engines are summarized and reported monthly. The statistics for search engines include both organic and sponsored referrals. The websites in our population represent dozens of countries in regions including North America, South America, Western Europe, Australia / Pacific Rim and Parts of Asia.

Well, that means more data, but does it mean less bias? Or more bias? Here's some additional information; a summary of features of the sampled population:

  • 76% participate in pay per click programs to drive traffic to their sites.
  • 43% are commerce sites
  • 18% are corporate sites
  • 10% are content sites
  • 29% classify themselves as other (includes gov, org, search engine marketers etc..)

    OK. We are asking the question: How many people are running Linux? Maybe we are asking how many computers are running Linux. These are not the same question. But the data we have comes from people using the internet to access sites, and there are two data sets: mine and theirs. Theirs is as summarized above, and mine is visitors to Scienceblogs, and they are different.

    This profile ... 76 percent in pay per click programs (i.e., buying Google AdSense space), just under half commerce sites, and so on ... this is the profile of the sites that are being visited, the visitors counted, and the visitors' OS (and other data) recorded.

    So what is the taphonomy of this process ... the steps of randomizing the data, or introducing bias, from the number of computers running each operating system to the clicks on this particular 'demographic' spread of sites?

    The complexity of this problem is actually rather large. But I can tell you one thing: If you were my graduate student and you came to me with this sampling strategy, I'd send you back to kindergarten. (If I had that power.)

    So, initially, I just thought that members of the Sb community were more Mac- and Linux-oriented than the rest of the drones out there. And that still might be true. But now, it seems that the number that many seem to rely on for "market share" is potentially biased, or at least, I'm not sure how one would demonstrate that it is not.
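
    One way to see why the mix of sampled sites matters is to treat the reported figure as a weighted average across site types. The sketch below is in Python. The site-type percentages are the ones quoted above, but the per-type Linux shares and the "content-heavy" mix are invented for illustration, and weighting by number of sites rather than by visitors is a simplifying assumption.

```python
def aggregate_share(site_mix, linux_share_by_type):
    """Weighted average of per-site-type Linux share.

    site_mix: dict of site type -> weight (e.g. percent of sampled sites)
    linux_share_by_type: dict of site type -> fraction of visitors on Linux
    """
    total_weight = sum(site_mix.values())
    weighted = sum(w * linux_share_by_type[t] for t, w in site_mix.items())
    return weighted / total_weight


# Site-type mix quoted in the service's own description.
reported_mix = {"commerce": 43, "corporate": 18, "content": 10, "other": 29}

# HYPOTHETICAL Linux share among visitors to each type of site.
linux_share = {"commerce": 0.005, "corporate": 0.005, "content": 0.05, "other": 0.01}

# HYPOTHETICAL mix for a sample dominated by content sites (blogs and the like).
content_heavy_mix = {"commerce": 10, "corporate": 10, "content": 70, "other": 10}

print(f"reported mix:      {aggregate_share(reported_mix, linux_share):.1%}")
print(f"content-heavy mix: {aggregate_share(content_heavy_mix, linux_share):.1%}")
```

    With these made-up inputs the same underlying visitors yield roughly 1.1% in one sample and 3.7% in the other. Nothing here says which number is right, only that the headline figure depends heavily on which sites are doing the counting.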


    This is all very typical of the 'research' used to support most marketing decisions.

    Market down to experience, I guess....

    Perhaps it would be better to say that the marketers are still chasing those elusive Linux and Macintosh users, the ones who don't buy things just because they're told to.

    IIRC, one can set some browsers to say that they're IE, if the website gives one sh*t about not using IE. There was a controversy about browser market share (on some blog, some time ago) where this was considered (by somebody) to be significant.

    First: When did the iPhone become separate from Mac OS? That's what it runs...

    Second: Are we asking about how much market share Linux has only in the desktop market or are we going to include the server market as well?

    A quick look at scienceblogs.com indicates that they are running Apache on RedHat. That's pretty much the way the rest of the world rolls too.

    One thing that can introduce bias: the method the service uses to turn raw hits into unique visitors. One of the most common methods is browser cookies. Problem is, Mozilla and related browsers have a setting to control acceptance of third-party cookies, and its default setting seems to be more aggressive about blocking third-party cookies than IE's is. When cookies are blocked, the service will discard those hits since it won't be able to do the unique-visitor determination for them. I suspect Mozilla/Firefox users in general are more interested in security and privacy and more likely to prefer tighter controls over third-party cookies than IE users.

    Disclosure: I used to work for WebSideStory (about 4 years ago) writing the cookie-handling and unique-visitor code for the Hitbox service. The effects of third-party cookie blocking were getting so severe that, before I left, we'd had to start moving our servers into our customers' domains. I don't imagine things have gotten any better since then. And I don't imagine the management at other services was any more willing to believe it was having an effect on the data than my management was.
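
    To make the cookie effect concrete, here is a minimal sketch in Python of how discarding cookie-less hits could shift reported shares. The visitor counts and blocking rates are invented for illustration and are not measurements from Hitbox or any other service.

```python
def observed_share(true_visitors, block_rate):
    """Return each browser's share after hits without a cookie are discarded.

    true_visitors: dict of browser -> real number of visitors (hypothetical)
    block_rate: dict of browser -> fraction of those visitors who block
                third-party cookies (hypothetical)
    """
    counted = {
        browser: n * (1.0 - block_rate.get(browser, 0.0))
        for browser, n in true_visitors.items()
    }
    total = sum(counted.values())
    if total == 0:
        # Every hit was discarded, so no share can be reported.
        return {browser: 0.0 for browser in counted}
    return {browser: kept / total for browser, kept in counted.items()}


# Invented numbers: 20% of real visitors use Firefox and half of them block
# third-party cookies, while 5% of IE users do.
true_visitors = {"IE": 800, "Firefox": 200}
block_rate = {"IE": 0.05, "Firefox": 0.50}

shares = observed_share(true_visitors, block_rate)
print({browser: round(share, 3) for browser, share in shares.items()})
```

    Under these assumptions Firefox's true 20% share would be reported as a little under 12%. The same arithmetic would apply to any operating system whose users lean toward the stricter cookie settings.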

    maybe i missed it ... but what are your stats then? that article has been widely circulated and quoted, some other real numbers could be useful.

    widely circulated, but is it correct? ... I mean, my stats (see link) are based on good sample sizes. What if I simply decide to believe them instead of these widely circulated stats? MS is behind this, of course.

    I agree that these "market share" figures are likely to be biased, but it seems to me that the biggest bias is based on two simple facts that have nothing to do with web statistics, cookies, or whatever.

    First of all, if one wants to buy a PC in most places in the world, it is either a Mac or comes with Windows installed. Very little of the installed base of Linux comes in the form of a box bought off the shelf at this point. Of the desktop and laptop I use to run Linux, one was bought with no OS (not easy, but Tigerdirect sells a few), and the other was bought used, so I could install Linux on it. Most desktop purchasers are less interested in OS choice than having something they can turn on and use, which means Windows and Mac, because they tie their OSes directly to hardware. The growing sales of UL PCs with Linux on them might change that, but that part of the market has just gotten started.

    The second fact is that if you believe what you read on various forums, quite a few switchers use Linux in a dual or even multiple environment that uses both Windows and Linux. I do, as it's simpler than undoing 20+ years of macro-manipulated data sets based on things like specialized spreadsheets (Quattro Pro mostly) that either won't run or won't run reliably in a Linux environment. Even though I will mostly access the Internet using Linux, in any given day I may access it just about as much on Windows, depending on what I need to do. Windows is run almost entirely on VMs at this point, but I doubt the Market Share stats are going to show that. I may boost the Linux stats on the one hand, but at the same time I boost the Windows stats, so the effect I would have based on web statistics in regard to Linux is effectively canceled out, even though my machines, including the Windows VMs, run on Linux.

    So I have no doubt that the Linux numbers are higher, but I also have no doubt that they're not going to look higher using website access figures unless and until there's a serious uptick in the number of Linux-only boxes or laptops sold out there.

    By lostinspace (not verified) on 09 Jul 2008

    lostinspace: That's a good point. It's difficult to purchase a machine without Windows installed. Dell offered FreeDOS and Red Hat options on their server-class hardware a few years ago, and the Wall*Mart PC ran a version of Linux. Mostly it's only a do-it-yourself option.

    On my floor at work, I know of three Macs and three Linux boxen; the other couple hundred people are using Windows, which is our default when purchasing new workstations. While that's technically not statistical bias, it is certainly a contributing factor.

    I think, in order to get substantially neutral statistics, someone like Google needs to publish them.

    Mr Almost:

    The other problem, added to all this, is that major institutional and corporate IT departments seem to be able to handle diversity at the user end of things.

    Am I right?

    By the way, how can you tell Sb is running Red Hat Linux?

    lostinspace has a very good point. All things being equal, if I had the choice on the store floor when I purchased this machine for instance, I'd have purchased one with Linux pre-installed.

    It's the change after-the-fact that scares me.

    It simply wasn't an option within the parameters of my shopping experience. (price, availability of a demo model I could try out, etc.) I think it would only help Linux to increase the share of off-the-shelf machines in brick and mortar stores that run on its operating system right out of the box.

    As any good statistics student or professional knows, there is no such thing as "neutral statistics." Statistics are used to prove something you want to prove. Even if the stats prove you wrong, stats are applied in a bubble environment trying to display some characteristic or trait which is distorted from reality.

    Having said that, there is still tons of good information you can glean from stats and that shouldn't prevent us from using them.

    Our local library's website, for example, has ~3,500 hits a month. Hardly something to brag about, but not bad. Of those hits, something like 22% are from Linux computers. Normally you would think 'now we're getting at something' until you realize all the public access computers at the library have the library's website set as home page, and all 20 computers are Fedora computers. Of course, not all the Linux hits are only from those computers (wish I had those stats in front of me), but it is certainly higher than the 1-3% touted by stat vendors.
    Linux is out there on the desktop, make no bones about it.

    ,ValentineS

    By ValentineS (not verified) on 10 Jul 2008

    Mister Laden:

    I agree. Major institutions are completely able to handle this diversity. Most of our tech people at least have access to Macs and Linux, one of them even uses a Mac Mini as his primary workstation now. Getting support for Linux is a little harder, but it's all about knowledge sharing anyway, right?

    As for the second question, I looked at the HTTP response header from the web server. It can be done with telnet, but Firefox has a great plug-in for that type of thing.

    If I want to add a piece of software that would make me a more productive citizen of the community I have to fill out a form, get it approved by one person who knows nothing about computers (and that person always says 'yes') which is then passed on to another person who says no.

    Then I stop letting myself have good ideas for the greater benefit of society for a while.

    With all these confounding factors that have been well described in the article and comments, it's clear that it's simply impossible to gather any meaningful numbers with these approaches.

    The only way we'll really know is to have a door to door or phone survey. And even then, the results will vary greatly by region.

    Maybe the Linux community can all chip in like they did for the Firefox ads and hire Nielsen or someone like that to do a proper poll.

    To find out what Sb is running on, you could just ask 'What's that site running?' - a very popular tool. It'll tell you that Sb is hosted on: Apache/2.0.52 (Red Hat) at Rackspace.com, Ltd.

    $ curl -I scienceblogs.com
    HTTP/1.1 200 OK
    Date: Tue, 16 Dec 2008 22:09:48 GMT
    Server: Apache/2.0.52 (Red Hat)
    X-Powered-By: PHP/4.3.9
    Connection: close
    Content-Type: text/html
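
    The same check can be scripted. The following is a rough Python sketch: it parses a header block like the one above and pulls out the Server line. The function names are made up for this example, and the parenthesised part of a Server header is only a hint the server chooses to send, not proof of the underlying OS.

```python
import http.client


def parse_server_header(raw_headers):
    """Return (product, os_hint) from the Server line of raw HTTP headers."""
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "server":
            product, _, rest = value.strip().partition(" ")
            os_hint = rest.strip("() ") or None
            return product, os_hint
    return None, None


def fetch_server_header(host, timeout=10):
    """Send a HEAD request to http://host/ and return the Server header, if any."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return conn.getresponse().getheader("Server")
    finally:
        conn.close()


# The response shown above, pasted in as sample input (no network needed).
SAMPLE = """HTTP/1.1 200 OK
Date: Tue, 16 Dec 2008 22:09:48 GMT
Server: Apache/2.0.52 (Red Hat)
X-Powered-By: PHP/4.3.9
Connection: close
Content-Type: text/html"""

print(parse_server_header(SAMPLE))  # ('Apache/2.0.52', 'Red Hat')
```

    Calling fetch_server_header("scienceblogs.com") would query the live site, which may well report something different today or omit the header entirely.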