In many instances, a well thought-out regular expression can convince most non-technical people in the room that you’re a computer genius who’s brain possesses more synapses, forming more bridges and firing more rapidly than anyone’s ever should.
Oh this is so true….
The other day I was working on cleaning up some data with a colleague. We had two simple but common problems with our database, which had a few thousand records, and forty or so variables.
1) We needed to get a subset of data that included means for all numeric values based on a single factor (a factor = a categorical variable used to divide up or group the other data… in this case, species name); and
2) We needed to make sure that blank spaces in this grid were filled with “NA” in order for the statistical software to recognize these blanks as missing data.
Chances are, you’ve had this problem too. Let me tell you how we solved them.
The subtotal problem seems easy. There are two obvious fast ways to do this. One is to use brute force … put the data in a spreadsheet and sort by the factor (in this case ‘species’). Insert a row after each set of species data, and put in formulas to calculate the averages. This option is unacceptable because it is hard work, laborious, error prone, boring, and every time we ‘fix’ the data (finding errors, etc.) we have to recalculate, so any method should be fast and accurate.
The other way is to use the subtotals function on the spreadsheet. Say “we want subtotals of these variables … but make the totals averages … and get a subtotal for each different species….” All commonly used spreadsheets have a subtotal calculation function. Just press the subtotal calculation button and you are done.
This works, but it does not work. You get the subtotals, and you can hide away the raw data and see only the subtotals, but you can’t copy and paste these subtotals, you can’t save the file as only the subtotals, you can’t do anything with these subtotals without the raw data coming along for the ride.
A third way, suggested by my friend Stephanie, is to use lookup expressions in the spreadsheet. As Stephanie notes, however, “It’s better than the brute force option, because I can do it in such a way that it updates when the data does. But it is not elegant. It does, however, inspire some awe.”
Here is how I solved the problem: I created the subtotals and saved the file … with the subtotals … as a comma-separated file (any kind of text file would work). Then I got up a shell, navigated to the directory holding the text file, and typed something like:
grep Result filename.csv > summarydata.txt
The subtotal lines all have the string “Result” in them. So, this one liner which is not even a regular expression (but could be if I wanted it to be) creates a file with only the summary data. You have to go and manually copy and paste the headings from your original data file.
Then, without skipping a beat, I addressed the problem of the missing data. I typed:
sed ‘s/,,/,NA,/g’ summarydata.txt | sed “s/,,/,NA,/g” > summarydatafixed.txt
The data were totally fixed, averages only and all the missing data represented at NA. The output was so perfect and elegant, I’m pretty sure my colleague actually stopped breathing for a moment.
In the above command, I ran the file through the sed command (which changes “,,” to “,NA,”) twice because three commas in a row (representing two missing data items in a row) would change to “,NA,,” because the second comma is eaten by the regular expression and thus not used farther down the line. But twice gets them all.
There is another way to do this without running the data twice, using “lookahead” capacities that are in Perl. The quote above, which I think I will print out and frame and hang on my wall, comes from a blog post by Mike that I think may explain this … but my solution was quicker and remained efficient. I swear to you, everyone in the room thought I was brilliant, and this was a room crammed with scientists, mostly cladists. It is pretty rare to get cladists thinking anyone else is smart.