Finding Patterns in Data: Fourier Series

Over at Faraday's Cage, Cherish has a very nice post on Fourier series, following on an earlier post on Fourier transforms in the Transformers movie. She gives a nice definition of the process in the earlier post:

A Fourier Transform takes a signal and looks at the waves and then shows us the frequencies of all the waves. If we only have a single sine wave, like above, we will have a frequency that is zero everywhere except for the frequency of that sine wave. More complicated signals will be made up of several of these different frequencies and thus will have several peaks. The idea is that you could move back and forth between the period of the wave and the frequency by using the Fourier Transform. If you only have the frequency information, for example, you can use that to figure out which waves you need. Add them together, and you have your original signal back.

Of course, you might not see the point in this, as expressed that way-- it sounds sort of like a game, just flipping back and forth between different representations of the same thing. This is actually an extremely powerful analysis tool, though, because it lets you pick patterns out of what otherwise can look like random noise.
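To make that "flipping back and forth" concrete, here's a minimal sketch in Python with numpy (my illustration, not anything from Cherish's posts) showing that transforming and then inverse-transforming hands back the original signal:

import numpy as np

t = np.arange(256)
signal = np.sin(2 * np.pi * t / 7.0)    # a pure sine wave with a 7-sample period

spectrum = np.fft.fft(signal)           # time domain -> frequency domain
recovered = np.fft.ifft(spectrum).real  # frequency domain -> back to time domain

print(np.allclose(signal, recovered))   # True: nothing is lost in either direction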

In order to demonstrate, I will do what may be the dorkiest thing I've ever done here, namely taking the Fourier transform of the traffic on this blog. Here's a graph showing the number of pageviews per day for the last 1024 days of Uncertain Principles (for technical reasons, it needs to be a power of 2-- Excel's FFT tool demands it):

[Figure: pageviews per day for the last 1024 days of blog traffic]

Looking at that, you might be saying "This is just noise-- a bunch of low-level hash with a handful of big spikes up above it." Which is true-- the really big spike corresponds to the Many Worlds, Many Treats post hitting Boing Boing, and the smaller, more recent one is this year's Speed of God post. But is there some hidden pattern here that we're not seeing?

The next graph is the Fourier transform of the traffic data:

[Figure: Fourier transform of the traffic data]

I've zoomed in on the vertical scale, because there's always an enormous spike at a frequency of zero-- the zero-frequency component is just a measure of the average value of the signal, which is huge for data like page counts that are never negative. What you see here is a measure of the "power" at a given frequency, with the frequency measured in units of "per day." This "power" tells you how much of a sine wave at that frequency you would need to put in to recreate the original graph of blog traffic.
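If you want to try this on your own data, a minimal sketch in Python with numpy follows. The file name and the one-number-per-line format are assumptions for illustration, not the actual setup behind these plots:

import numpy as np

# Hypothetical input file: one pageview count per line, one line per day.
traffic = np.loadtxt("traffic.txt")[-1024:]   # keep the last 1024 days

spectrum = np.fft.rfft(traffic)               # FFT of a real-valued signal
freqs = np.fft.rfftfreq(traffic.size, d=1.0)  # frequencies in day^-1 (one sample per day)
power = np.abs(spectrum) ** 2                 # "power" at each frequency

# The zero-frequency bin is the sum of all the samples (the average, up to a
# factor of N), which dwarfs everything else, so drop it before plotting.
freqs, power = freqs[1:], power[1:]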

This still looks like a bunch of crap with two spikes rising above the noise: one at around 0.14 day^-1, the other at around 0.28 day^-1. What does that mean for our data? It tells us that there's a big concentration of power at a couple of frequencies, which indicates that there are strong recurring patterns in the traffic signal at those frequencies.
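Picking those two spikes out programmatically is a one-liner on top of the same arrays (again my illustration, with the same hypothetical data file):

import numpy as np

traffic = np.loadtxt("traffic.txt")[-1024:]
freqs = np.fft.rfftfreq(traffic.size, d=1.0)[1:]  # skip the zero-frequency bin
power = np.abs(np.fft.rfft(traffic))[1:] ** 2
print(freqs[np.argsort(power)[-2:]])              # the two strongest peaks: ~0.14 and ~0.28 day^-1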

So what is it? It's maybe a little easier to understand if I plot the power as a function of one over the frequency, which is to say, the period in days:

[Figure: power as a function of period in days]

OK, in order to see anything, you need to plot it on a log scale (which is why this isn't usually done), but I've taken the liberty of labelling the peaks. The biggest peak in the signal is at a period of 7.014 days, meaning that there is a pattern in the data that repeats every 7 days. What's that? Well, it's the days of the week. If we zoom in on the original traffic graph a bit, you can see it:

[Figure: a zoomed-in detail of the raw traffic data]

Traffic is generally lower on weekends, so every seven days, you see a dip in the numbers. This is just visible in the raw data, but really pops out in the Fourier transform.
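For anyone reproducing this at home, converting to the period plot is just a matter of inverting the frequency axis; a sketch with the same hypothetical data file, matplotlib assumed:

import numpy as np
import matplotlib.pyplot as plt

traffic = np.loadtxt("traffic.txt")[-1024:]
freqs = np.fft.rfftfreq(traffic.size, d=1.0)[1:]  # drop the zero-frequency bin
power = np.abs(np.fft.rfft(traffic))[1:] ** 2

period = 1.0 / freqs                              # period in days
plt.semilogx(period, power)                       # the log scale mentioned above
plt.xlabel("period (days)")
plt.ylabel("power")
print(period[np.argmax(power)])                   # the biggest peak should land near 7 days
plt.show()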

What's with the 3.5 day peak? Well, the weekly variation isn't really sinusoidal-- it's kind of flat for five days, then drops down for two. That sort of pattern requires additional frequencies that are harmonics of the fundamental pattern, as discussed in Cherish's posts. 3.5 days is half a week, so this is twice the frequency of the main signal, reinforcing the idea of a weekly pattern. The next harmonic would be at a frequency of around 0.43 day^-1, or a period of around 2.33 days, but it would also be small enough that it would get lost in the other stuff.
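You can see where those harmonics come from by transforming an idealized week-- flat for five days, dipped for two. A synthetic sketch (made-up numbers, not the real traffic):

import numpy as np

week = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5])  # five busy days, two slow ones
signal = np.tile(week, 146)                            # ~1022 days of perfectly regular weeks

freqs = np.fft.rfftfreq(signal.size, d=1.0)
power = np.abs(np.fft.rfft(signal)) ** 2
power[0] = 0.0                                         # discard the zero-frequency term

# The three strongest peaks sit at 1/7, 2/7, and 3/7 day^-1: the weekly
# fundamental and its harmonics, just as in the traffic spectrum.
print(np.sort(freqs[np.argsort(power)[-3:]]))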

So that's why scientists use Fourier analysis, and why it's important enough to merit a (clunky and essentially undocumented) tool in Excel's Analysis ToolPak. Fourier analysis lets you find recurring patterns in large data sets, even when those patterns aren't obvious in the raw data. The result is kind of trivial in this case-- I had thought there might be something at a period of around a year, because traffic usually dips significantly in December, but there isn't enough data to really see it (there might be a peak at about 170 days, if you squint, but I'm not sure that's real).

(Back when I was tracking baby feeding times and making colorful graphs, I had intended to follow up with a Fourier analysis of the feeding data, but things got busy at work, and I never found the time. That probably would've been more interesting, given that I didn't see as clear a pattern in the raw data, but I no longer have SigmaPlot at home, so I can't get at that data.)


I would have thought that there would also be a visible daily pattern as regular readers would have a routine of trawling through sites at a similar time every day. It's interesting that there is nothing there. I wonder what your ratio of regular readers to casual readers is like.

Excel makes you use 2^n data sets for FFTs, but Mathematica and some others will do it for arbitrary input. Probably not worth the trouble to work between two programs in this case, but it can be done in general if any readers are curious.
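For instance, numpy's FFT (like Mathematica's Fourier) is happy with any length; a trivial sketch:

import numpy as np

data = np.random.rand(1000)  # 1000 is not a power of 2
spectrum = np.fft.fft(data)  # works directly; no padding or truncation required
print(spectrum.shape)        # (1000,)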

Matt #1: The reason you don't see the daily pattern is that the data Chad is sampling has a time resolution of one day. That means you cannot see a period of less than two days, since obviously the fastest signal you can track will be up on one sample, down on the next, and up again on the third. Higher frequencies than that will be aliased, meaning that if they are present in the underlying signal they will appear at a lower frequency. This may be part of the reason why the third harmonic (0.43/day, or a period of 2.33 days) does not appear here--the fourth harmonic (0.57/day, or a period of 1.75 days) will be aliased to the same frequency, and these two harmonics might be destructively interfering with one another. (Alternatively, there may be an anti-aliasing filter which is suppressing that frequency range precisely to avoid this kind of confusion.)
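A quick numpy illustration of that aliasing (synthetic data, not the blog traffic): sample a 0.57 cycle/day sine once per day, and its power lands at 0.43 cycles/day.

import numpy as np

days = np.arange(1024)                    # one sample per day, like the traffic data
signal = np.sin(2 * np.pi * 0.57 * days)  # true frequency 0.57/day, above the 0.5/day Nyquist limit

freqs = np.fft.rfftfreq(days.size, d=1.0)
power = np.abs(np.fft.rfft(signal)) ** 2
print(freqs[np.argmax(power)])            # ~0.43/day: the alias of the 0.57/day signal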

Matt #2: Yes, algorithms for arbitrary length FFTs do exist, but FFTs of length 2^n are the fastest. In general, you want to choose interval lengths such that the prime factors are small; otherwise you may find yourself effectively doing a DFT, which takes O(N^2) computations, instead of an FFT, which takes O(N log N) computations. That distinction makes a big difference when you are working with large data sets. More often (at least in my line of work), you are computing a spectrogram in order to look at how the frequency content varies in time, and there is no harm in choosing subintervals of length 2^n to analyze.
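In Python, for example, scipy will tell you the nearest "fast" length and you can zero-pad up to it; a sketch (timings vary by machine, and zero-padding does change the frequency grid slightly):

import numpy as np
from scipy import fft

n = 1000003                              # a large prime: the worst case for FFT sizes
fast_n = fft.next_fast_len(n)            # nearest length >= n with only small prime factors
print(n, fast_n)

data = np.random.rand(n)
padded = np.pad(data, (0, fast_n - n))   # zero-pad up to the fast length
spectrum = fft.fft(padded)               # O(N log N) on the padded length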

By Eric Lund (not verified) on 03 Jan 2010

Flipping back and forth is not just an analysis tool. It is also a very useful compression tool.

Back in the early '80s, a (now extinct) company I was at achieved toll quality voice* over an 8 kbps channel. Compare that against 64 kbps for standard T1 voice muxing, or against the higher bandwidth of H.264, and one just has to wonder why. The how? FFT the voice in 20 msec chunks, apply a little magic sauce to the resultant set of frequency terms, and ship the list over the wire. At the other end, perform the reverse FFT, taking the signal component states at the end of the previous chunk as starting points, and play back the result.

*Toll quality voice has a specific meaning in telecoms: it is a subjective measure of how well speakers can be recognized from their voice as reproduced in the listener's telephone. A panel of expert listeners is used to rate the performance of a voice system. I would not call present day Skype or cell phone voice toll quality.
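The "magic sauce" was presumably proprietary, but the skeleton of that kind of transform coding is easy to sketch. A toy Python version that just keeps the largest coefficients in each 20 msec chunk-- my simplification, not the actual codec, and it ignores the carrying of state across chunk boundaries described above:

import numpy as np

def compress_chunk(chunk, keep=16):
    """Keep only the `keep` largest-magnitude frequency components."""
    spectrum = np.fft.rfft(chunk)
    small = np.argsort(np.abs(spectrum))[:-keep]  # indices of everything else
    spectrum[small] = 0.0                         # only the survivors get shipped
    return spectrum

def decompress_chunk(spectrum, n):
    return np.fft.irfft(spectrum, n)              # reverse FFT back to audio samples

rate = 8000                                              # 8 kHz telephone sampling
chunk = np.sin(2 * np.pi * 440 * np.arange(160) / rate)  # one 20 msec chunk of a 440 Hz tone
restored = decompress_chunk(compress_chunk(chunk), chunk.size)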

By GrayGaffer (not verified) on 03 Jan 2010

There are fast algorithms for FTs of any length, even prime. Yes, it gets a little hairier for large prime factors, but no one should be writing their own FFT routine anyway. Use a decent library, or a decent analysis program, and just call its routine on whatever length of data you have.

For truly huge FTs (say, > 1 GB) you're likely I/O bound anyway*. And for anything under 100 MB or so, who cares if it takes an extra tenth of a second?

*The biggest FT I've personally run was a 4.5 GB data set. It took about an hour to run.

Yeah, my current research involves propagating broadband femtosecond pulses through materials with sharp dispersion, and all our simulations are basically straight-up Fourier transforms. Currently I'm doing them as numerical integrals just out of Mathematica programming laziness, but it's getting to the point where I'm going to have to rewrite the continuous integration as a sampled FFT. A 5-minute numerical integration isn't bad until you want to do it for a few hundred different parameters.
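The standard trick for the linear part of that problem is to FFT the pulse into the frequency domain, multiply by the propagation phase exp(i k(w) z), and transform back. A schematic numpy version with a made-up dispersion relation-- the details of the actual simulations are not shown here:

import numpy as np

n, dt = 4096, 1e-15                     # time grid: 4096 points at 1 fs spacing
t = (np.arange(n) - n // 2) * dt
pulse = np.exp(-(t / 20e-15) ** 2) * np.cos(2 * np.pi * 3.75e14 * t)  # ~30 fs pulse on an 800 nm carrier

omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)  # angular frequencies of the grid
c, z = 3e8, 1e-3                             # speed of light; 1 mm of material
n_index = 1.5 + 1e-33 * omega ** 2           # toy dispersive index, not a real material
k = n_index * omega / c                      # dispersion relation k(omega)

spectrum = np.fft.fft(pulse)
propagated = np.fft.ifft(spectrum * np.exp(1j * k * z)).real  # the pulse after 1 mm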

Well, at least in my visit stats you'd have to be blind not to see the 7 day frequency, and you don't need a Fourier transform for that. Nice post anyway :-)

The Fourier composition/decomposition process is often presented as very straightforward, but many forget about loose ends like the Gibbs phenomenon: e.g. left-over spikes at the edges of square waves. It is related to some processing issues too, but it derives from the math per se and not from imperfections of physical response. True, with a literally infinite series I guess you can ignore the "line" spike at the ends, but it seems ugly that successive approximations reduce only the width, not the spike amplitude as well (if I gather correctly).
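That recollection is right: the overshoot settles at about 9% of the jump no matter how many terms you keep, and only its width shrinks. A quick check in Python:

import numpy as np

x = np.linspace(0, 2 * np.pi, 100001)
for n_terms in (10, 100, 1000):
    # Partial Fourier sum of a unit square wave: odd sine harmonics only.
    partial = (4 / np.pi) * sum(np.sin(k * x) / k for k in range(1, 2 * n_terms, 2))
    print(n_terms, partial.max())  # stays near 1.179 (the Gibbs overshoot), never -> 1.0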

Another issue about Fourier composition is the idea that an ideal filter or frequency detector should find any composing frequencies we look for that are present in the signal. Well, e.g. a finite chain of sine waves (flat, then a wave section, then flat again) "contains" all sorts of other harmonics needed to create that pulse. However, suppose I was trying to detect some higher harmonics over a legitimate-seeming interval of several of their periods. My detector "doesn't know" that the rest of the wave does not continue, but I can't see it "detecting" harmonics while it is operating within the pure sine-wave regime. The pulse and the infinite pure wave are literally identical between the bounds. This is not about the physical issue, but about the pure math of imagining a perfect detector of your choice. Any thoughts?

Neil: Fourier decomposition is not the only method for analyzing the frequency content of a signal. There is an entire class of functions called wavelets which do a better job with certain applications (and, like the Fourier series, a bad job if your choice is ill suited to the chosen task).

The simplest example would be to imagine taking each pair of points in your time series and computing the average and difference. Now take those averages and compute their average and difference, and iterate until you are taking the average of the entire series. This procedure amounts to decomposition in an orthonormal basis called the Haar wavelet (original reference in German: A. Haar, Zur Theorie der orthogonalen Funktionensysteme, Mathematische Annalen 69, 331, 1910). Haar showed that decomposition in this basis does not produce Gibbs overshoot, so it may be a reasonable choice for your first application. However, it sucks as a frequency detector; there are other (not necessarily orthonormal) wavelet bases which do a better job.
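Here's what that average-and-difference iteration looks like in code-- a minimal unnormalized sketch (real libraries handle the scaling conventions and boundary details):

import numpy as np

def haar_transform(signal):
    """Iterated pairwise averages and differences (unnormalized Haar).

    Assumes a power-of-2 length, like the FFT examples above."""
    data = np.asarray(signal, dtype=float)
    coeffs = []
    while data.size > 1:
        pairs = data.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / 2)  # detail: the differences
        data = pairs.mean(axis=1)                       # next level: the averages
    coeffs.append(data)                                 # the overall average
    return coeffs

print(haar_transform([4.0, 6.0, 10.0, 12.0]))
# differences [-1, -1], then [-3], then the overall average [8]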

By Eric Lund (not verified) on 05 Jan 2010