I’ve been getting a lot of emails asking about the so-called “Z2K9” problem.
For those who haven’t heard, the software on a particular model of Microsoft’s Zune music player froze up on New Year’s Eve because of a bug. Apparently, the software didn’t
handle the fact that a leap year has 366 days, so on the 366th day of 2008, the players froze up for the day and couldn’t even finish booting.
Lots of people want to know why on earth the player would freeze up over something like this. There was no problem with the date February 29th. There was nothing wrong with the date December 31st, 2008. Why would they even be counting the days of the year, much less be so sensitive to them that they could crash the entire device for a full day?
The answer is: I don’t have a damned clue. For the life of me, I can’t figure out
why they would do that. It makes absolutely no sense.
I’ve seen a couple of different date formats in use in software. There’s
the so-called “epoch time”, which represents dates and times in terms of
seconds since January 1st, 1970 UTC. There are also some standard data structures
that represent dates and times as month/day/year/hour/minute/second/fraction.
I’ve even seen some data structures that use “day of year”, but every implementation I’ve seen explicitly includes code for handling 366-day years.
There’s no reasonable explanation for this failure, short of utter incompetence on the part of some engineer somewhere at Microsoft. Some engineer foolishly chose to roll his own date implementation, instead of using one of the dozens of well-tested date/time implementations that exist, and whoever that person was wasn’t
very careful.
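To give a rough idea of what leaning on an existing implementation looks like, here’s a minimal sketch using the standard C library’s `gmtime`, which converts epoch seconds into a year/month/day structure and already knows about 366-day years. The helper name `day_of_year_utc` is made up for illustration, and I’m not claiming anything about which libraries were actually available on the device.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper (not from any real firmware): given seconds since
   January 1st, 1970 UTC, stores the calendar year in *year_out and returns
   the 1-based day of the year (1..366), or -1 if the conversion fails. */
int day_of_year_utc(time_t seconds_since_epoch, int *year_out)
{
    assert(year_out != NULL);

    /* gmtime does the leap-year bookkeeping for us. */
    struct tm *utc = gmtime(&seconds_since_epoch);
    if (utc == NULL)
        return -1;

    *year_out = utc->tm_year + 1900;  /* tm_year counts years since 1900 */
    return utc->tm_yday + 1;          /* tm_yday is 0-based */
}
```

Feeding it 1230681600 (midnight UTC on December 31st, 2008) gives day 366 of 2008, with no special-case code on the caller’s side.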
The most likely specific cause of the freeze is what we call an out-of-bounds error. What that means is that there’s a data structure somewhere which is indexed by a number. For example, you could have a table of the days of
the week, indexed by number:
| Index | Value |
|---|---|
| 0 | Sunday |
| 1 | Monday |
| 2 | Tuesday |
| 3 | Wednesday |
| 4 | Thursday |
| 5 | Friday |
| 6 | Saturday |
With a table like that, you can store the day of the week as an integer, rather
than as a string of characters. When you want to print out the day of the week,
you look it up by number. The classic error for this kind of structure is assuming
that you’ll never try to look up something that isn’t in the table. For example, if
the “days of the week” index were 7, you’d be looking for a value past the end of the table. Depending on the specifics of your code,
you could either get an immediate crash, or you could wind up with random data (which would likely cause a slightly delayed crash).
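As a sketch of how the days-of-the-week table above might look in C, along with two common ways of guarding the lookup: the names `DAY_NAMES`, `day_name`, and `day_name_checked` are hypothetical, chosen just for this example.

```c
#include <assert.h>
#include <stddef.h>

/* The lookup table: index 0 is Sunday, index 6 is Saturday. */
static const char *const DAY_NAMES[] = {
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday"
};
#define DAY_COUNT (sizeof DAY_NAMES / sizeof DAY_NAMES[0])

/* Defensive lookup: returns NULL instead of reading past the table. */
const char *day_name(int index)
{
    if (index < 0 || (size_t)index >= DAY_COUNT)
        return NULL;
    return DAY_NAMES[index];
}

/* Alternative for callers that promise a valid index: a bad value aborts
   loudly when assertions are enabled (that is, when NDEBUG is not defined)
   rather than silently returning whatever sits after the table. */
const char *day_name_checked(int index)
{
    assert(index >= 0 && (size_t)index < DAY_COUNT);
    return DAY_NAMES[index];
}
```

Without either check, `DAY_NAMES[7]` is undefined behavior in C, which is exactly the "immediate crash or random data" situation described above.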
In general, when you hear about a crash caused by something like not
getting the number of days in the year correct, you guess that it’s probably an out-of-bounds error. But I can’t for the life of me think of why they would be doing a lookup based on the day-of-year number. It just doesn’t make any sense.
(I’ve gotten some clarifications by email. Apparently, it’s a driver problem, and it’s not a driver that originated at Microsoft. And it’s not an indexing error, as I suspected. It’s something even dumber. It’s a good old-fashioned infinite loop: on the last day of a leap year, its year calculation will just loop forever. Really dumb. Here’s the relevant code segment, in a function named “ConvertDays”:

    while (days > 365)
    {
        if (IsLeapYear(year))
        {
            if (days > 366)
            {
                days -= 366;
                year += 1;
            }
        }
        else
        {
            days -= 365;
            year += 1;
        }
    }

It’s trying to compute a date given the number of days since an epoch mark.
So it computes the year by looping over years since the epoch mark: for each year, it subtracts the number of days in that year, until what remains fits inside a single year. On each pass, it checks whether there’s more than a year’s worth of days left by comparing the remaining count to the number of days in the current year. If it’s a leap year, it checks if there are more than 366 days left. If there are, it subtracts 366 and increments the year. Otherwise, it assumes that the day fits inside the year.
But on the last day of a leap year, that calculation fouls up, because the number of days left isn’t greater than 366: it’s exactly 366. Since there’s no else branch, and there’s no code to handle the case where it equals 366, it does nothing for that year, and it doesn’t change the year. But 366 is still greater than 365, so the loop condition stays true and it just goes back through the loop. Over and over again.
So it’s still a damned stupid programmer error, which really should have been caught in testing. The screwup wasn’t at Microsoft, though; it was at Freescale, who wrote the driver.)