# Win friends and fix up your data with regular expressions

In many instances, a well thought-out regular expression can convince most non-technical people in the room that you’re a computer genius whose brain possesses more synapses, forming more bridges and firing more rapidly, than anyone’s ever should.

Oh this is so true….


The other day I was working on cleaning up some data with a colleague. We had two simple but common problems with our database, which had a few thousand records, and forty or so variables.

1) We needed to get a subset of data that included means for all numeric values based on a single factor (a factor = a categorical variable used to divide up or group the other data… in this case, species name); and

2) We needed to make sure that blank spaces in this grid were filled with “NA” in order for the statistical software to recognize these blanks as missing data.

Chances are, you’ve had this problem too. Let me tell you how we solved them.

The subtotal problem seems easy. There are two obvious fast ways to do this. One is brute force: put the data in a spreadsheet and sort by the factor (in this case ‘species’), insert a row after each set of species data, and put in formulas to calculate the averages. This option is unacceptable because it is laborious, error prone, and boring, and every time we ‘fix’ the data (finding errors, etc.) we have to recalculate; any method should be fast and accurate.

The other way is to use the subtotals function on the spreadsheet. Say “we want subtotals of these variables … but make the totals averages … and get a subtotal for each different species….” All commonly used spreadsheets have a subtotal calculation function. Just press the subtotal calculation button and you are done.

This works, but it does not work. You get the subtotals, and you can hide away the raw data and see only the subtotals, but you can’t copy and paste these subtotals, you can’t save the file as only the subtotals, you can’t do anything with these subtotals without the raw data coming along for the ride.

A third way, suggested by my friend Stephanie, is to use lookup expressions in the spreadsheet. As Stephanie notes, however, “It’s better than the brute force option, because I can do it in such a way that it updates when the data does. But it is not elegant. It does, however, inspire some awe.”

Here is how I solved the problem: I created the subtotals and saved the file … with the subtotals … as a comma-separated file (any kind of text file would work). Then I opened a shell, navigated to the directory holding the text file, and typed something like:

grep Result filename.csv > summarydata.txt

The subtotal lines all have the string “Result” in them. So this one-liner, which is not even a regular expression (but could be if I wanted it to be), creates a file with only the summary data. You still have to go and manually copy and paste the headings from your original data file.
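To see what that step does, here is a toy version of it. The species names, values, and file contents are invented for illustration; only the “Result” label and the filenames mimic the ones above:

```shell
# Build a miniature CSV: two raw data rows plus a subtotal row
# that the spreadsheet labeled with the string "Result".
printf '%s\n' \
  'species,mass,length' \
  'P. paniscus,39,73' \
  'P. paniscus,45,79' \
  'P. paniscus Result,42,76' > filename.csv

# Keep only the lines containing "Result" -- the subtotal rows.
grep Result filename.csv > summarydata.txt

cat summarydata.txt
# prints: P. paniscus Result,42,76
```

Note that the header line (`species,mass,length`) is dropped along with the raw data, which is why it has to be pasted back in by hand.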

Then, without skipping a beat, I addressed the problem of the missing data. I typed:

sed 's/,,/,NA,/g' summarydata.txt | sed 's/,,/,NA,/g' > summarydatafixed.txt

The data were totally fixed: averages only, and all the missing data represented as NA. The output was so perfect and elegant, I’m pretty sure my colleague actually stopped breathing for a moment.

In the above command, I ran the file through the sed command (which changes “,,” to “,NA,”) twice, because three commas in a row (representing two missing data items in a row) would change to “,NA,,”: the second comma is eaten by the first substitution and thus not available to start the next match. But twice gets them all.
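The need for the second pass can be seen on a made-up line with two adjacent gaps:

```shell
# ",,," hides two adjacent missing values. One pass only fills the
# first gap: the trailing comma of the inserted ",NA," has already
# been consumed, so sed cannot use it to start the next ",," match.
echo 'a,,,b' | sed 's/,,/,NA,/g'
# prints: a,NA,,b   (one gap still empty)

# A second pass catches the leftover gap.
echo 'a,,,b' | sed 's/,,/,NA,/g' | sed 's/,,/,NA,/g'
# prints: a,NA,NA,b
```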

There is another way to do this without running the data twice, using “lookahead” capacities that are in Perl. The quote above, which I think I will print out and frame and hang on my wall, comes from a blog post by Mike that I think may explain this … but my solution was quicker and remained efficient. I swear to you, everyone in the room thought I was brilliant, and this was a room crammed with scientists, mostly cladists. It is pretty rare to get cladists thinking anyone else is smart.

1. #1 Shawn Smith
September 4, 2009

But there aren’t any non-trivial regular expressions here. It’s more a description of the grep and sed commands. I feel cheated.

2. #2 Pierre
September 4, 2009
3. #3 peter
September 4, 2009

I suppose the Perl expression would be to find any comma that is followed by another comma, then replace the first comma with a comma followed by NA.

perl -p -e 's/,(?=,)/,NA/g' filename.txt

September 4, 2009

When you call my regular expressions trivial it makes me feel bad.

5. #5 Alcari
September 4, 2009

“but you can’t copy and paste these subtotals, (…) you can’t do anything with these subtotals without the raw data coming along for the ride.”

Uhm… yes you can. At least, Office 2007 has a function they call “special paste”, where you can choose which parts of the copied object to paste. “Values” is one of them, which just pastes the number of the copied bit, not the formula.

You can reach it either through Ctrl-Alt-V, or with the little arrow below the Paste button, if you’re more of a mouse person. Special Paste is one of the very few redeeming features of Office ’07 over OpenOffice.

“you can’t save the file as only the subtotals,”
True, you’d have to manually paste it in a new file, using the above method.

The ultimate solution would be to use a piece of statistical software that can be properly set up to handle all kinds of data like SPSS, but that tends to cause great pains in your wallet. Still, you could just fire it up, configure it and in the future, hit a single button to do all your work. It saved my life in uni.

6. #6 Alcari
September 4, 2009

Of course, the real question/solution lies in the question what format the data was originally in. What causes the missing data and can’t it be fixed to show ‘NA’ itself?

7. #7 lylebot
September 4, 2009

Use R, man; it’s open source, powerful (whatever statistical analysis you are doing, there’s a way to do it in R), and it will even fill in your NAs as needed. It’s got a fairly steep learning curve, but that shouldn’t be an obstacle to any self-proclaimed Linux guru (perhaps it’s even a feature).

September 4, 2009

Uhm… yes you can. At least, Office 2007 had a function they call “special paste”, where you can chose which parts of the copied object to paste. “Values” is one of them, which just pastes the number of the copied bit, not the formula.

Not the subtotals by themselves. Use the data-subtotals command to make subtotals using some field to break on, and you get subtotals after each change in the break field. Those subtotals are for your eyes only. You can copy the entire thing, paste as values, and then sort them out somehow, but that is a slightly clunkier way to do what I’ve done here.

These data are going into R. There remain, however, good reasons to start out in a spreadsheet that have to do with data collection. The database management features of R exist, but suck.

9. #9 Andrew
September 4, 2009

I made a program to rename files by Regular Expressions. I love casually using it because what I have to type in it looks so much like arcane electronic incantations and then it magically does whatever we wanted. It’s as close to having magical powers as I’ve ever been.

10. #10 lylebot
September 4, 2009

tapply(X$value, X$species, mean, na.rm=T)

Boom! R!

September 4, 2009

Lylebot, nice. But, really, just assume that I need the data fixed as per above for numerous reasons, which I do. But yes, that works great. R is rad. I love it.

12. #12 itzac
September 4, 2009

Just a quick nit to pick about pasting the headers into the text file after grep. Why not head your file into the new file, then append the output from grep to it?

head -n 2 (or whatever) filename.csv > summarydata.txt
grep Result filename.csv >> summarydata.txt

13. #13 travc
September 4, 2009

A great benefit of doing stuff on the command line (or simple scripts) is documenting and reusing it. If your data changes (say you add a few more species or notice a typo), re-run the commands (which you did put into your notes, didn’t you?) and *blam* done.

I tend to work with crap-loads of data, and develop my analysis methods on a subset (so I’m not waiting overnight to see if some new method is doing what I think it should do). Writing up everything as scripts (including generating figures in R and Matlab) means that re-running on new data (or the full data set) is totally trivial.

14. #14 mike
September 5, 2009

Keep these Linux posts going. I appreciate them.

15. #15 David
September 5, 2009

pivot tables.

September 5, 2009

David: Without hearing your tone of voice, I can’t tell if you are saying “pivot tables” in a bad way or a good way.

Anyway, they don’t exist. Or at least not in the Linux world (unless you are running Excel under Wine, which works great, by the way). We call them something else (Data Pilot in OO).

17. #17 David
September 6, 2009

Greg, pivot tables are a generic name for that aspect of spreadsheets. As you point out, they do exist in OpenOffice. And they’re perfect for the task you described. If you had a higher level of nesting (say, if one factor is color and the second factor is size), you could easily get totals for “large red”.

September 6, 2009

Greg, pivot tables are a generic name for that aspect of spreadsheets.

Not that it matters, but I tend to doubt that. This is a guess, but I would think that if it were true OO would not have avoided the name, as OpenSource software tends to use generic names where possible. I suspect MS has at least a patent or trademark or whatever applied for…

19. #19 Shawn Smith
September 6, 2009

When you call my regular expressions trivial it makes me feel bad.

ROFLMAO!

If you don’t want me to call your regular expressions (more like search strings and replacement strings, if you ask me) trivial, then don’t make them trivial. Have an example that will benefit from asterisks or question marks or parentheses or even brackets. Have them get put as arguments to sed and grep so that you can also describe shell quoting rules. But if it’s something that MS Notepad can search with, then you probably shouldn’t be using the term “Regular Expressions” in your post title.

🙂

20. #20 David
September 6, 2009

From wikipedia: “The term pivot table is a generic phrase used by multiple vendors. However, the specific form PivotTable is a trademark of the Microsoft Corporation.”

A rose by any other name. The point remains: Regular expressions are complex, time consuming, and hard to validate. Pivot tables are easy, powerful data analysis tools and it’s pretty straightforward to know, at the end of the process, whether you’ve done it right. The pivot table gives you the results in human-readable form. And it’s such a snap that nobody would think of writing up the accomplishment. The right tool for the job.

The “NA” problem is also easily solved by inserting a column and copying in a pretty simple formula. If your data is in column A and you insert column B, use =IF(ISNUMBER(A1), A1, "NA")

21. #21 Ray Ingles
September 8, 2009

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” – Jamie Zawinski

:->

22. #22 Walter
June 4, 2011

Redundant use of sed.

sed -i 's/,\{3\}/,NA,/g' summarydata.txt

replaces three commas with “,NA,” inline.

December 10, 2011

Gosh, Dave, your regular expressions look swell! Thank you for taking time to describe your techniques. I appreciate that.
I use and recommend Gnumeric for spreadsheet data analysis. Runs great under KDE. Excel compatible.

24. #24 Dottie Davis
January 17, 2012

“but you can’t copy and paste these subtotals” again
Select the subtotaled data you want, click on Edit..Go To..Special… and select Visible cells only.
Copy and paste where you like, regular paste here will do.

I’m not familiar with the other software above, usually stick to Access or SQL for db stuff. If an average of pivoted data is needed, with numbers in some fields and nulls in others, Monarch is my choice.