Good Math, Bad Math

The Z2K9 Problem

I’ve been getting a lot of emails asking about the so-called “Z2K9” problem.

For those who haven’t heard, the software on a particular model of Microsoft’s Zune music player froze up on New Year’s Eve because of a bug. Apparently, it didn’t
handle the fact that a leap year has 366 days, so on the 366th day of 2008, the players froze up for the day and couldn’t even finish booting.

Lots of people want to know why on earth the player would freeze up over something like this. There was no problem with the date February 29th. There was nothing wrong with the date December 31st 2008. Why would they even be counting the days of the year, much less be so sensitive to them that they could crash the entire device for a full day?

The answer is: I don’t have a damned clue. For the life of me, I can’t figure out
why they would do that. It makes absolutely no sense.

I’ve seen a couple of different date formats in use in software. There’s
the so-called “epoch time”, which represents dates and times in terms of
seconds since January 1st, 1970 UTC. There’s also some standard data structures
that represent dates and times as month/day/year/hour/minute/second/fraction.
I’ve even seen some data structures that use “day of year”, but every implementation I’ve seen explicitly includes code for handling 366-day years.
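To make those formats concrete, here is a minimal sketch in C that takes an epoch time and asks the standard library for the broken-down date, including the day of the year. The function names day_of_year_utc and year_utc are made up for illustration; gmtime and struct tm are the standard C facilities. This is only meant to show how the representations relate, not what the Zune code did.

```c
#include <time.h>

/* Convert seconds since the Unix epoch to a 1-based day of the year
   (1..366) using the standard library instead of hand-written calendar
   logic. Returns -1 if the conversion fails.
   (Hypothetical helper, named for this example only.) */
int day_of_year_utc(time_t seconds_since_epoch)
{
    const struct tm *utc = gmtime(&seconds_since_epoch);
    if (utc == NULL)
        return -1;
    return utc->tm_yday + 1;    /* tm_yday counts from 0 on January 1st */
}

/* Same idea for the calendar year; tm_year counts years since 1900. */
int year_utc(time_t seconds_since_epoch)
{
    const struct tm *utc = gmtime(&seconds_since_epoch);
    if (utc == NULL)
        return -1;
    return utc->tm_year + 1900;
}
```

For the epoch time 1230681600, which is midnight UTC on December 31st, 2008, day_of_year_utc returns 366 and year_utc returns 2008. The library already knows about leap years, so the caller never has to.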

There’s no reasonable explanation for this failure, short of utter incompetence on the part of some engineer somewhere in Microsoft. Some engineer foolishly chose to roll his own date implementation, instead of using one of the dozens of well-tested date/time implementations that exist, and whoever that person was wasn’t
very careful.

The most likely specific cause of the freeze is what we call an out of bounds error. What that means is that there’s a data structure somewhere which is indexed by a number. For example, you could have a table of the days of
the week, indexed by number:

Index Value
0 Sunday
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday

With a table like that, you can store the day of the week as an integer, rather
than as a string of characters. When you want to print out the day of the week,
you look it up by number. The classic error for this kind of structure is assuming
that you’ll never try to look up something that isn’t in the table. For example, if
the “days of the week” index was 7, then you’d be looking for a value that was past the end of the table. Depending on some specific things about your code,
you could either just get an immediate crash, or you could wind up with random data (which would likely cause a slightly delayed crash).
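Here is a minimal sketch in C of that kind of table, with the bounds check that the classic error leaves out. The names DAY_NAMES, DAY_COUNT, and day_name are made up for this example.

```c
#include <stddef.h>

/* The days-of-the-week table, indexed 0 (Sunday) through 6 (Saturday). */
static const char *const DAY_NAMES[] = {
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday"
};

/* Number of entries in the table, computed rather than hard-coded. */
#define DAY_COUNT (sizeof DAY_NAMES / sizeof DAY_NAMES[0])

/* Return the name for a day index, or NULL when the index is outside
   the table. Without this check, day_name(7) would read past the end
   of the array, which is undefined behavior in C. */
const char *day_name(int index)
{
    if (index < 0 || (size_t)index >= DAY_COUNT)
        return NULL;
    return DAY_NAMES[index];
}
```

With the check in place, an index of 7 gives the caller a NULL to deal with, instead of a crash or random data.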

In general, when you hear about a crash caused by something like not
getting the number of days in the year correct, you guess that it’s probably an out-of-bounds error. But I can’t for the life of me think of why they would be doing a lookup based on the day-of-year number. It just doesn’t make any sense.

I’ve gotten some clarifications from email. Apparently, it’s a driver problem, and it’s not a driver that originated at Microsoft. And it’s not an indexing error, as I suspected. It’s something even dumber. It’s a good old-fashioned infinite loop: on the last day of a leap year, its year calculation will just loop forever. Really dumb. Here’s the relevant code segment, in a function named “ConvertDays”:

    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

It’s trying to compute a date given the number of days since an epoch mark.
So it computes the year by looping over the years since the epoch mark: for each year, it subtracts the number of days in that year, until less than a full year’s worth of days remains. On each pass, it checks whether there’s more than a year’s worth of days left by comparing the count to the number of days in the current year. If it’s a leap year, it checks if there are more than 366 days left. If there are, then it subtracts 366 and increments the year. Otherwise, it assumes that the day fits inside the year.

But on the last day of a leap year, that calculation fouls up, because the number of days left isn’t greater than 366: it’s exactly 366. The loop condition (days > 365) is still true, but since the inner if has no else branch, there’s no code to handle the case where the count equals 366. So it does nothing for that year: it doesn’t subtract any days, and it doesn’t change the year. It just goes back through the loop. Over and over again.
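For comparison, here is one way the loop could be written so that it always terminates. This is my own sketch, not Freescale’s actual fix: the names year_from_days and is_leap_year are made up, the starting year is passed in as a parameter rather than taken from the driver, and I’m assuming the day count is positive and 1-based (day 1 is January 1st of the starting year).

```c
#include <stdbool.h>

/* Gregorian leap-year rule; stands in for the driver's IsLeapYear. */
static bool is_leap_year(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/* Given a 1-based day count starting from January 1st of start_year,
   return the year that day falls in, and store the 1-based day of
   that year in *day_of_year. Assumes days >= 1. */
int year_from_days(int days, int start_year, int *day_of_year)
{
    int year = start_year;

    for (;;) {
        int days_in_year = is_leap_year(year) ? 366 : 365;
        if (days <= days_in_year)
            break;              /* the day fits in this year, day 366 included */
        days -= days_in_year;   /* every pass either exits or shrinks days */
        year += 1;
    }

    *day_of_year = days;
    return year;
}
```

The point of the restructuring is that there is a single comparison against the length of the current year, so there’s no path through the loop that leaves both days and year unchanged. If you count from January 1st, 1980, then December 31st, 2008 is day 10593, and this version returns 2008 with a day of year of 366.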

So it’s still a damned stupid programmer error, which really should have been caught in testing. But the screwup wasn’t at Microsoft, but at Freescale, who wrote the driver.

Comments

  1. #1 jivemasta
    January 2, 2009

    The only thing I could think of is for time based DRM they use. But even then, there are better ways to figure that kind of thing out that are well tested, like just subtracting the date structures.

  2. #2 David
    January 2, 2009

    As far as I know, the buggy code lay with some driver that the first-gen Zunes were using, and not with software that MS had written themselves

  3. #3 Aaron Bergman
    January 2, 2009

    I don’t know if this is legit or not, but here’s the code that supposedly did it. Read it and weep.

  4. #4 Chris Quackenbush
    January 2, 2009

    As the link above shows, it wasn’t an out of bounds error. It was a regular old infinite loop.

    At first glance it seems like a counter is being decremented every iteration, but there is one path through the loop (corresponding to the last day of a leap year), when the counter is not decremented and the terminating condition for the loop is never met.

    Here is the whole file: http://pastie.org/349916 (Look at the function ConvertDays)

  5. #5 Blaise Pascal
    January 2, 2009

    The critical part of the internal date representation they were using was days since an epoch. It computed the current year by repeatedly subtracting 365 (or 366) days from the day count until the day count was less than 365.

    The problem was when it was deciding to subtract 365 or 366 was that the logic was:

    While there's more than 365 days left,
      If it is a leap year, then
         if there's more than 366 days left, subtract 366 days.
      If it isn't a leap year, subtract 365.
    

    On 12/31/2008, there were exactly 366 days left, so the while loop ran, but neither subtraction happened (it was a leap year, but there weren’t more than 366 days left).

    So the Zune, on boot, went into an infinite loop.

  6. #6 Adam
    January 2, 2009

    This is truly baffling that such a small thing could cause such havoc… but at least its good opportunity for satire about the situation. like this little zune re-slogan competition http://zune.cheddrmedia.com

  7. #7 Tobias
    January 2, 2009

    It happened to my Zune as well, I was not happy (but I was glad it wasn’t my Zune’s fault).

  8. #8 Lettuce
    January 2, 2009

    How is this not your “Zune’s fault”?

    The driver may have been written for them, but they still used the driver and it scotched a whole bunch of Zunes… And it is THEIR fault.

    Jiminy.

  9. #9 Emory K.
    January 3, 2009

    Ah, the pleasures of codenfreude.

  10. #10 Rabe
    January 3, 2009

    This programmer never heard of a math concept called division? Incredible.

  11. #11 Arno
    January 3, 2009

    This kind of error you see VERY often. For some reason a lot of programmers do not need to have an understanding of basic logic (anymore). And so they mess up like this with some very sloppy code.
    If you would just (really) think this one through you’d notice there are 4 possible cases. This programmer got the checks (ifs) in the wrong order and therefore fails to correctly handle the fourth case (in which it is a leap-year and the day-number is 366).
    I found this in just about 20 seconds, before I looked at the explanation. But then again, I’m used to doing code-checks of peer programmers and that is more than just checking whether it “looks nice”.

  12. #12 Nick B
    January 3, 2009

    I think this goes a little deeper than programmer incompetence. As Mark said, it should have been caught in testing. It’s a trivial boundary error, and the only explanation for it slipping through is that Freescale do not routinely write unit tests. Aside from the dubious decision to yet again reinvent the wheel, it’s their methodology which is at fault, more than the individual programmer.

  13. #13 Chris
    January 3, 2009

    Division only gives you an approximation. In the old days you would only have integer math on embedded devices, but even if you have floating point math it’s a pain to figure out how to handle the fractions correctly.

    That said, I’ve had to write date code and I always use a /4 year block/ of 1461 days and an epoch starting on March 1st of a leap year. (You have to add a constant to the standard date value, but that’s trivial.) Integer divide by 1461 to get a 4-year window, then integer divide by 365 to get a 3-year window, then what’s left is a modified Julian date. (The day of the year, not the days since 4000 BCE.) From there it’s easy to do a lookup to get the month, date and flag whether you need to adjust the year.

    However, this code invariably leads to discussions with coworkers who don’t understand that the key to writing good code is understanding the world, not just computers. The calendar originally started in March, the names of our months are based on that (e.g., October = 8th month of the year, not 10th), the leap day is on February 29th since that was the last day of the year. A LOT of date calculations are easier if you adjust to a March 1st New Years, do your calculations, and then adjust back to a January 1st New Years.

  14. #14 C. Chu
    January 3, 2009

    That’s embarrassing.

  15. #15 sohbet chat
    January 4, 2009

    The driver may have been written for them, but they still used the driver and it scotched a whole bunch of Zunes

  16. #16 itchy
    January 5, 2009

    codenfreude

    I love, love, love that!

  17. #17 William Wallace
    January 5, 2009

    Some engineer foolishly chose to roll his own date implementation, instead of using one of the dozens of well-tested date/time implementations that exist…

    In my experience, this problem happens in hardware design and software engineering all the time.

    “Serial communications protocol? How hard could that be. Let’s roll our own.”

    This always always leads to disaster (in terms of project schedule).

  18. #18 Salem
    January 6, 2009

    brilliant… wow. I want to see a MS response to this post.

  19. #19 Homer
    January 7, 2009

    Chris, unless you are omitting a layer in your algorithm description, it looks like you are not correctly handling century years. Century years are not leap years unless the year is divisible by 400. Thus, 2000 was a leap year, 1900 was not, and 2100 will not be a leap year.

  20. #20 Uncle Al
    January 7, 2009

    Microcrap Korporate Kulture demands any release better than beta is cost inefficient, incremental sales vs. incremental development costs. (“Klingons do not ‘release’ software. it escapes, leaving a bloody trail of design engineers and quality assurance kuvekhestat in its wake.”)

  21. #21 Rabe
    January 8, 2009

    Homer, 1900 does not matter, we started at 1st of march. Let’s plan for a code revision in 2099 to solve that nasty Y2K1 problem.

  22. #22 KeithB
    January 8, 2009

    Rabe:
    The person who *does* have to fix the “Y2K1” problem will be able to Google (or the equivalent) this page and curse you for it.
    8^)
