Win friends and fix up your data with regular expressions

In many instances, a well thought-out regular expression can convince most non-technical people in the room that you're a computer genius whose brain possesses more synapses, forming more bridges and firing more rapidly than anyone's ever should.

Oh, this is so true. The other day I was working on cleaning up some data with a colleague. We had two simple but common problems with our database, which had a few thousand records and forty or so variables.

1) We needed a summary of the data: the mean of every numeric variable, grouped by a single factor (a factor = a categorical variable used to divide up or group the other data ... in this case, species name); and

2) We needed to make sure that blank cells in this grid were filled with "NA" so that the statistical software would recognize these blanks as missing data.

Chances are, you've had these problems too. Let me tell you how we solved them.

The subtotal problem seems easy. There are two obvious ways to do this. One is to use brute force ... put the data in a spreadsheet and sort by the factor (in this case 'species'), insert a row after each set of species data, and put in formulas to calculate the averages. This option is unacceptable because it is laborious, error prone, and boring, and because every time we 'fix' the data (finding errors, etc.) we have to recalculate; any method should be fast and accurate.

The other way is to use the subtotals function on the spreadsheet. Say "we want subtotals of these variables ... but make the totals averages ... and get a subtotal for each different species...." All commonly used spreadsheets have a subtotal calculation function. Just press the subtotal calculation button and you are done.

This works, but it does not work. You get the subtotals, and you can hide away the raw data and see only the subtotals, but you can't copy and paste these subtotals, you can't save the file as only the subtotals, you can't do anything with these subtotals without the raw data coming along for the ride.

Or at least, I have no idea how to do this. If you do, tell me!

Here is how I solved the problem: I created the subtotals and saved the file ... with the subtotals ... as a comma-separated file (any kind of text file would work). Then I opened up a shell, navigated to the directory holding the text file, and typed something like:

grep Result filename.csv > summarydata.txt

The subtotal lines all have the string "Result" in them. So this one-liner, which is not even a regular expression (but could be if I wanted it to be), creates a file with only the summary data. You do have to go and manually copy and paste the headings from your original data file.

Then, without skipping a beat, I addressed the problem of the missing data. I typed:

sed 's/,,/,NA,/g' summarydata.txt | sed 's/,,/,NA,/g' > summarydatafixed.txt

The data were totally fixed: averages only, and all the missing data represented as NA. The output was so perfect and elegant, I'm pretty sure my colleague actually stopped breathing for a moment.

In the above command, I ran the file through the sed command (which changes ",," to ",NA,") twice because three commas in a row (representing two missing data items in a row) would only change to ",NA,," on the first pass: the second comma is eaten by the match and is thus not available farther down the line. But twice gets them all.
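
For instance, using a throwaway line of data rather than the real file:

echo 'x,,,y' | sed 's/,,/,NA,/g'                      # gives x,NA,,y
echo 'x,,,y' | sed 's/,,/,NA,/g' | sed 's/,,/,NA,/g'  # gives x,NA,NA,y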

There is another way to do this without running the data twice, using "lookahead" capacities that are in Perl. The quote above, which I think I will print out and frame and hang on my wall, comes from a blog post by Mike that I think may explain this ... but my solution was quicker and remained efficient. I swear to you, everyone in the room thought I was brilliant, and this was a room crammed with scientists, mostly cladists. It is pretty rare to get cladists thinking anyone else is smart.
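
For what it's worth, if I understand the lookahead idea correctly, something along these lines would do it in a single pass (a sketch, not the exact command from Mike's post):

perl -pe 's/,(?=,)/,NA/g' summarydata.txt > summarydatafixed.txt

The (?=,) peeks at the second comma without consuming it, so runs of empty fields all get filled. Like the sed version, it still leaves a blank field at the very start or end of a line alone.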

I can do all this in spreadsheets, using conditional links to the output of pivot tables, because I've had to. It's better than the brute force option, because I can do it in such a way that it updates when the data does. But it is not elegant. It does, however, inspire some awe.

I'm sorry, but I feel that this post is just not complete without a reference to this XKCD comic. ;-)

Luckily you didn't have any text-qualified (quoted) fields in that csv file.

What statistical software are you using?

By AnonymousCoward (not verified) on 12 Sep 2008 #permalink

With the spreadsheet, can't you just hide the raw data and then copy only the visible cells to a new spreadsheet?
Sometimes you have to write a simple formula to propagate the titles of the averages to a blank cell on the same row as the subtotal.

I am pretty sure this is feasible with Excel; not sure about other spreadsheet software.

By Rohan Smith (not verified) on 12 Sep 2008 #permalink

Rohan: No! Amazingly enough, I have never gotten this to work with Excel. Let me know if you can actually do it. It might well be possible, but there is a point at which trying to figure something out becomes far more time consuming than banging out a quick sed script.

Coward: R

Stephanie: Yes, you can also use database functions. The problem is you still have to redo or adjust things when the number of factors (or their levels) changes. The automatic "make subtotals" function works best to get the average (or other calculation), but then you can't get the results out.

You can copy and paste a pivot table, though. That tends to work where it works.

Greg, you will never see me copying and pasting a pivot table. They are some of the ugliest things on Earth. Even if I'm the only person who will ever see it, presentation counts.

It's possible to set up the summary table to make adding a factor as easy as adding a line and entering the factor. A checksum will tell you whether you need to. Adding a level is a nightmare unless you're so well-versed in the syntax of pivot table calls that you can do it with search and replace.

What Rohan is talking about should work in Excel as long as you use the filter function to get just the rows you want. Just don't use undo after pasting and expect to have the same contents in your clipboard. Blech.

You can't hide the raw data and then copy the set of visible cells. You need to copy every line (of summary data) one at a time.

I have not tried imposing a filter on top of the summary data. That might work, but until I see it happen I'm not buying it. So far my method is still better.

Actually, what would be even better, and what I'll eventually do, is to write an awk script that eats a data file and spits out a summary-by-factor file. How hard can that be? It can probably be done in fewer than ten lines.
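
Something like this might do it, though I haven't run it against the real data. Assume species is in the first column, every other column is numeric, the first row is a header, and data.csv and summary.csv are made-up file names:

awk -F, '
  NR == 1 { print; next }                 # pass the header row straight through
  {
    count[$1]++                           # rows seen for this species
    for (i = 2; i <= NF; i++)
      sum[$1, i] += $i                    # running total for each numeric column
  }
  END {
    for (f in count) {                    # each species, in arbitrary order
      line = f
      for (i = 2; i <= NF; i++)           # NF still holds the width of the last row
        line = line "," sum[f, i] / count[f]
      print line
    }
  }
' data.csv > summary.csv

Not quite under ten lines once it is made readable, but close.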

Oh, by all the useless gods, yes, your way is better. No question. Hands down. Much, much better. I just don't get to use it at work. :p

I think it says somewhere in the comments that you are using R.

This all could have been done within R very simply (if I understand what you are doing). One of my favorite and most widely used R functions is aggregate. You can use it to generate a summary of a data.frame, performing some function on groups of data sorted by factors. For instance

aggregate(data.frame$var, list(data.frame$factor), FUN = mean)

will give you the mean of var for each level of factor. It should also automatically put the NAs in there if it can't calculate the means for some reason.

Hope it helps.

Is there any reason you have to use a spreadsheet rather than a relational database? If you can go the relational database route, then a little SQL will sort you out.

Most database management systems allow easy import from Excel.

By Matt Penfold (not verified) on 12 Sep 2008 #permalink

You can avoid the problem of the triple comma if you use Perl and split the line on the comma. Say I have a file foo.csv with the following contents:

bacon,eggs,spam,,foobar,spam,,,foo

I can use the following little Perl program to substitute the NAs (which I hope doesn't come out mangled):

use warnings;
use strict;

while (<>) {                   # read the file line by line
    my @line = split /,/;      # split on commas (trailing empty fields are dropped)

    for (my $i = 0; $i < @line; $i++) {
        $line[$i] = 'NA' if $line[$i] eq '';   # empty field becomes NA
    }
    print join(' ', @line);    # re-join with spaces; the last field keeps its newline
}

Fields that are empty strings are replaced by "NA", and the output is

bacon eggs spam NA foobar spam NA NA foo

By J. J. Ramsey (not verified) on 13 Sep 2008 #permalink

I love this. This is why I posted this: to get so many different perspectives.

Aggregate in R is the solution for Phase II. What we have done so far is to prototype the whole process using the spreadsheet data, with the intention of eventually having the whole analysis done on a text file that gets eaten by R and processed entirely with R functions. However, while R is a pretty good math environment and a very good stats environment, it simply is not a very good data manipulation environment.

Why a spreadsheet instead of SQL? Every morning I wake up and ask myself that. But seriously ... there are technical reasons, having to do with data collection, for the data to have originally been put in a spreadsheet. Then I seriously considered moving it to a database, but did not, because my co-author, who collected all the data and knows the most about its specific quirks and issues, is not using a database, and while she could easily learn one, she had not previously done so, etc. etc. So we are going straight from spreadsheet to text file to R.

jj: Excellent Perl solution, and probably better than the equivalent bash solution using tr, etc.

What I would like to see happen, possibly as a nice open-ended open source project, is a port of all of the functionality of R to the bash command line as utilities/commands that make certain assumptions about the data structures (and read SQL as well as text), and this sort of Perl program would be a perfect utility. What do you call it? csv2na? csv-na?

Fields that are empty strings are replaced by "NA"

Care is needed here.

A blank string is not the same thing as a NULL value. A blank string is a string that does not have any characters in it. A NULL value means there is no value assigned to that field. Searching on blank strings may not return those records with a NULL value in the field.

By Matt Penfold (not verified) on 13 Sep 2008 #permalink

Muse: you would think. The behavior is to change the formulas in those rows to values (which "save as CSV" does as well, by the way), but the raw data does not go away.

BTW, you CAN paste as values (and get the de-formula'd lines as well as the raw data) to a different sheet, then sort to get all the lines with "Result" in them in one place, then delete the copied raw data ... essentially the same thing that I did with grep. But it is still a kludge.

Matt, clearly we need to field test this. At least in Perl, blank and null should be handled in similar ways across platforms, yes?

Greg,

It is rather dependent on both application(s) and platform. I raised the point because it has caught me out in the past. But yes, my experience with Perl leads me to believe blanks and nulls are handled the same. SQL of course is another matter, and you should never assume SQL will treat them the same. It really is one of those things you need to test before using in anger.

Actually, there are valid reasons for treating blank strings differently from nulls.

By Matt Penfold (not verified) on 13 Sep 2008 #permalink

Matt Penfold: "A blank string is not the same thing as a NULL value."

True, but the "split" command, IIRC, treats an empty string between two delimiters as, well, an empty string, so I wrote the program accordingly. If I were dealing with something other than a CSV file, I'd do it differently.

By J. J. Ramsey (not verified) on 13 Sep 2008 #permalink

Greg Laden,

J. J. Ramsey's perl solution, although simple to understand, isn't all that good. Although it works for very simple input, as soon as you hand it data where some of the cells have commas in them, it CHANGES that perfectly valid data. Can you REALLY say that's what you want? Your sed solution is better for that reason alone. Plus, 5+ lines of perl replacing one line of sed? Bah. Check the -e option on how to make it a single sed command. Of course, as AnonymousCoward pointed out, you both are not handling data where consecutive commas actually appear in your data. Finally, there seems to be no concern over how to handle error conditions. This could be so much better.

By Shawn Smith (not verified) on 13 Sep 2008 #permalink

Matt: I'm sure there is an interesting story as to why NA is "missing data"... I've done work with SPSS versions in which missing data was -9999. Which is totally dumb, of course, because that could be a measure. But a blank and a null, since they often look and act the same but sometimes not, are obviously bad choices.

Shawn: Well, you can't handle error conditions too easily and stick with the one-liner, but yes, my sed line was the best solution at the time because I wanted a one-liner. That was not a program but, rather, a command issued under the condition that I knew what the data looked like.

In the command-line statistical package (I think I'll call it statmagick) there would be utilities to fix up and verify the text-based or SQL-based data sets.

Shawn Smith: "Although it works for very simple input, as soon as you hand it data where some of the cells have commas in them, it CHANGES that perfectly valid data."

But then, if you are dealing with commas in the data, then CSV format is not what you want. You'd be better off using a data format where there wasn't an issue of delimiters being confused for data. If you are collecting and storing data, then it should be stored in a format that makes processing it fairly trivial, either because the format is simple (e.g. CSV, tab-delimited) or because there are libraries to handle the bulk of the parsing and whatnot so that you don't have to do it yourself (e.g. XML, certain binary formats).

Shawn Smith: "Plus, 5+ lines of perl replacing one line of sed? Bah."

I'll take the 5+ lines of clear, readable Perl over a sed one-liner that has an apparent redundancy that requires an explanation. Further, if you want to handle error conditions, then it is not too difficult to modify and extend the Perl script, which isn't true for the sed one-liner.

By J. J. Ramsey (not verified) on 13 Sep 2008 #permalink

BTW: why did I use CSV instead of tab-delimited? Two reasons: One, I can see a comma. Two, a comma is a single keystroke but entering a tab (in an RE) is two. Otherwise, no reason, really. Either one is happily eaten by R.

Initially, all the utilities in statmagick will be programmed in various scripting languages such as Perl, Python, even bash, and of course awk. The question is, do we leave them that way or port them to C eventually?

J. J. Ramsey,

But then, if you are dealing with commas in the data, then CSV format is not what you want. You'd be better off using a data format where there wasn't an issue of delimiters being confused for data.

Granted, CSV is probably not ideal, but it is one of the simpler ones that could include any characters in the data--simply quote all data items and double any quotes that are part of the data. It's unlikely you can come up with a format that will avoid the issue of delimiters being confused for data, in the general case, especially if you want to internationalize your code, and still be as simple as CSV. And thanks to CPAN for perl, jakarta for Java, and equivalent repositories for other languages, processing almost any format is quick and easy. In terms of developer time, anyway.
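
For instance, a made-up row with an embedded comma and an embedded quote would be encoded as:

"Homo sapiens","the so-called ""wise"" ape","42"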

If you are collecting and storing data, then it should be stored in a format that makes processing it fairly trivial,

That depends. It is certainly true when "processing" (whatever the hell that is) requires the bulk of the work. But when ease of display or low space usage is a priority, there are good reasons to store it in a way that makes "processing" it more difficult.

I'll take the 5+ lines of clear, readable Perl over a sed one-liner that has an apparent redundancy that requires an explanation.

That is probably one of the few times the phrase "readable Perl" does not evoke peals of laughter. It's easy for me to believe there are perl programmers (people who use perl on a day-to-day basis for years in CGI and database programming) who don't know what "while (<>)" does, or that you can use an array in a scalar context to get its length, but that really does say more about those programmers than perl.

Further, if you want to handle error conditions, then it is not too difficult to modify and extend the Perl script, which isn't true for the sed one-liner.

Very true.

To get back on topic, though, I would be concerned by the idea that really smart people are impressed with someone who knows how to deal with regular expressions (REs). REs really aren't that hard to understand, especially by people who are good at memorizing long lists (most biologists). If I did see someone who is impressed, I would have no problem letting them know that it really is no big deal, and do what I could to explain it to them so that they could use REs themselves. I certainly wouldn't want them to think there's anything difficult about REs, and that knowledge of how to use them implies expertise in areas not related to computers.

Isn't that one of the problems we have with so many engineers and computer programmers (I am one) being ID advocates (I am not)? They know they are smart and they are told by biologists they are smart because they can process their data so easily using regular expressions. Those programmers, who know how simple REs are to understand, then think they are smarter than biologists but they don't get the whole evolution thing. Therefore, evolution is wrong and ID must be right. bah.

It seems better to let as many people as possible know how to use REs (it shouldn't take more than a couple hours) so that we can remove one more example of magical thinking from the population.

By Shawn Smith (not verified) on 13 Sep 2008 #permalink

Oops--that was "while (<>)". Stupid, stupid mistake on my part.

By Shawn Smith (not verified) on 13 Sep 2008 #permalink

That is probably one of the few times the phrase "readable Perl" does not evoke peals of laughter.

I have to admit that I did chuckle at this.

I would be concerned by the idea that really smart people are impressed with someone who knows how to deal with regular expressions (REs). REs really aren't that hard to understand,

Stop! Stop! You are ruining it for everyone!!!

Shawn Smith: "It's unlikely you can come up with a format that will avoid the issue of delimiters being confused for data, in the general case, especially if you want to internationalize your code, and still be as simple as CSV."

"General case" is the operative phrase here. We aren't dealing with text processing code designed to be fully general, but rather something just suited to the purposes at hand. If you try to get too general, you risk introducing bugs that happen because you are trying to be too clever, and you are doing a lot of work for little benefit because you are designing for unrealistic scenarios that are unlikely to happen.

Of course, if you are writing what amounts to library code, that is, code that is meant to be used by lots of people for several somewhat related but different purposes, then the effort expended in being clever or more general is less likely to be wasted.

In short, the narrower the purpose of the code and the fewer the users, the more simplifying assumptions that you can make--and vice versa.

That said, it is also wise to make even narrow-purpose code straightforward and readable, so that if it does need to be extended, it is more feasible.

By J. J. Ramsey (not verified) on 14 Sep 2008 #permalink

JJ you are dead on here. Statmagick is for specific kinds of data. Most of the columns will be numbers. Rules can be imposed on the non-numeric columns.

Of course, one problem with internationalization and comma-delimited data is those crazy Europeans who use commas as decimal points.

My own programming style, for my own purposes, is: Everything is a macro. But of course StatMagick can't have that philosophy.