To Automate or Not To Automate?

The Female Science Professor has a nice post about high and low tech data acquisition:

An MS student has repeatedly questioned why he/she has to use a low-tech method to acquire, somewhat tediously, some data that could be acquired more rapidly with a higher-tech method. I say 'more rapidly' because the actual acquisition time once the machine is on and ready for analysis can be fairly rapid, but this technique becomes much less rapid when the substantial (and tedious) preparation time is considered. In any case, with the low-tech method, you can get data any time you want, and the amount of data one gets is limited only by your time.

Without knowing more about her research field, it's sort of tough to develop a specific opinion, but this is a pretty universal question. When I was at NIST, I had several lunchtime conversations with a guy down the hall who maintained that it was always a good idea to spend a day or two automating everything in the data collection system when you first got an experimental signal. I was generally more of the "Woo-hoo! Data!" school, and tended to just plunge ahead using tedious, non-automated methods until that became completely intolerable.

In one experiment, this took the form of the world's most expensive laser stabilization system: we set up a spectrum analyzer to measure the frequency of the laser we were trying to control, and then somebody stood next to the laser control box and tweaked the frequency if it started to drift too much. We referred to it as the "biological lock," and as somebody else in the lab pointed out, we were using an NRC post-doc to do the job of a fifteen-dollar box of electronics. It saved us a few days of building and debugging an actual lock circuit, though, and this was supposed to be a one-afternoon experiment (which eventually took three months, but, hey, we were getting data all that time...).

The advantage of automation, of course, is that it allows you to easily collect vast amounts of data. At NIST, we had a couple of three-ring binders full of graphs of data that we took using an automated system (we set up a LabView program to scan the laser frequency over a wide range and record the signals we were looking for). We never did explain the phenomenon we were investigating in one of those sets of experiments-- the theory turned out to be extremely difficult-- but we were able to exhaustively explore variations of all the parameters, in a way that wouldn't have been possible if we hadn't automated the data collection.

On the other hand, non-automated methods have their advantages, as well. The showpiece graph for the first paper I was an author on (visible on this page that I made back in 1997) was from the very first day we got a useful signal, when the other grad student on the project and I sat in the lab watching the LED display on a digital counter, and writing numbers down on paper. We got tons of other similar graphs later on, after we automated the system, and in some respects the experimental conditions for those later runs were better, but in the end, the data taken by hand made for a cleaner graph, because we were able to exercise some judgement about anomalous points while they were being acquired, whereas the automated system just took down everything, and produced noisier data.

It's kind of a tough call, though, and the trade-off is really between preparation time and acquisition time. On balance, I suspect my colleague from NIST is probably right: spending time automating things at the beginning of an experiment is probably a net win, in terms of the total amount of time spent on data collection. Psychologically, though, the concentrated nature of the up-front prep work seems much more unpleasant than the more spread-out time spent on taking data, meaning that it can be awfully attractive to just plunge ahead into data collection even though some more prep time would pay dividends in the end.


This is an exact analogue, I think, to the question for programmers and systems admins: do you do something by hand (collect data by hand), write a one-shot script just to speed it up this one time (automate the specific experiment), or write a more general tool meant to be reused (set up infrastructure to semi-automate a whole series of investigations, prepared for follow-ups and changes in focus)?

And in my experience, doing it by hand is worthwhile - if nothing else it makes you intimately familiar with what you're actually trying to do; and building a general tool is worthwhile, as you'll thank yourself for having it many times in the future. One-shot scripts, though, are usually not worth it - you'll just end up having to (re-)tweak the script over and over again as the one-time use inevitably crops up, in different variations, over and over again.

Your colleague is almost certainly right. It's always a trade-off, and the main trade-off, as you say, is prep time vs acquisition time. So for anything that is more than a one-off or two-off, you're invariably better off putting in the prep time up front. It's not universal, but it is a strong rule.

There are some more benefits that you don't mention.

First, a LabView or similar set-up can serve as a supplement to your notebook. A save of your program and a snapshot of your set-up provides an unambiguous record of exactly what it was you were doing, and how you were doing it. I've settled several arguments this way. ("No, we did it this way, dammit. Just like I said-- it's in my notebook, and here's the damned program. Now go fix your own stuff!")

Second, software is re-usable. If you've got any kind of cleverness in your set-up, you shouldn't be designing these programs from scratch every time. You should be developing libraries. So after you've got three or four set-ups under your belt, you should find that your prep time goes way down. You also get a kind of bootstrap effect in that, if you do this right, improving a sub-program for one experiment gives you the option of including that refinement in old experiments if you repeat them.

Third, particularly for you, if you have undergrad students who are destined for engineering rather than True Physics, this is a great thing for them to do. They're probably already being taught the basics in their own lab classes, and having actual experience is a fantastic thing for their resume. For scaling purposes, I had a co-op for 20 hours a week, half a year, and he put together four pretty high level system test software suites built out of pre-existing lower level software objects. And he came in with no microwave experience, no test experience, and very little software experience.

Fourth, Dr. Millikan, I'm dubious on the whole "throw out the obviously bad data" approach unless you got a Nobel Prize for it. If you're throwing out bad data left and right, you ought to be fixing your set-up, and using the software as a tool for doing that. Or, at the very least, setting up some objective criteria for the filter and letting the software do it for you.
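The "objective criteria" suggestion can be made concrete: instead of a human deciding on the fly which points look anomalous, you write the cut down as code and apply it uniformly, keeping a record of every discard. Here's a minimal sketch in Python (my own illustration, not anything from the thread), using a median-absolute-deviation cut; the threshold of 3.5 is an arbitrary choice:

```python
from statistics import median

def mad_filter(samples, cutoff=3.5):
    """Flag points whose distance from the median exceeds
    `cutoff` times the median absolute deviation (MAD).
    Returns (kept, rejected) so every discard is on record."""
    med = median(samples)
    mad = median(abs(x - med) for x in samples)
    if mad == 0:  # all points (nearly) identical: keep everything
        return list(samples), []
    kept, rejected = [], []
    for x in samples:
        (kept if abs(x - med) / mad <= cutoff else rejected).append(x)
    return kept, rejected

# Example: one wild point in an otherwise quiet signal
kept, rejected = mad_filter([10.1, 9.9, 10.0, 10.2, 47.3, 9.8])
# rejected is [47.3]; the five quiet points survive the cut
```

The point is not this particular statistic; it's that the criterion is written down once, applied the same way to every run, and auditable after the fact, which is exactly what the by-hand "that point looks funny" judgement is not.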

(Disclaimer: I have a huge bias here. My test set-ups are probably bigger than yours, and my data collection is probably more intense, so my trade-off is automatically on the side of automation. And I just got done schooling half the organization on just exactly how this stuff is done. Literally, yesterday.)

And finally, if you really just hate LabView, try MatLab. Someone took me to school on that one the day before yesterday, and if I hadn't had pre-existing LabView stuff to work with, I probably would have wanted to use that instead. It has the advantage that Oh my God is that a better language for data processing.

By John Novak (not verified) on 24 May 2007 #permalink

Sometimes it's not a binary decision. I work with software - and sometimes electronic data. In any case, often the time-consuming part of writing/testing an automated tool is getting the rare end-cases right. Sometimes you can just create/run a half-baked tool, which alerts the "biological monitor" whenever it suspects an end-case.

In my case, data taking must be automated. Our experiments involve space flight hardware, so once it's launched no hands-on intervention is possible. Of course, that means we test the heck out of our data taking algorithms.

For data analysis, it boils down to whether you can tell a computer how to find what you are looking for--which is usually harder to do than you think it is (I have abandoned at least one project because I wasn't a good enough computer scientist to tell the computer how to recognize a data feature I can identify in my sleep). We have software libraries which are building blocks for doing whatever you want to do. So in practice I automate what I can and do the rest (often the most important parts) by hand.

Also keep in mind that even if you think you can and should automate it, *always* do a test run. The computer has a nasty habit of doing what you tell it to do instead of what you want it to do.

By Eric Lund (not verified) on 24 May 2007 #permalink

If there is something I learned that suits both research and industry, it is that minimizing feedback time is the primary goal. Anything that cuts the time until you see the errors, or until you get a large enough result to continue with, should be prioritized in both cases.

So I will always try the quick-and-dirty first. And if automation would eventually speed up reuse, it would be done with the benefits Janne lists, but preferably in an organic way of continuous improvement with minimized feedback cycles. Hmm. "Quick-and-tweak" may be a descriptor.

if you really just hate LabView, try MatLab

I find them somewhat orthogonal in price and application. LabView handles basic stuff well for a feasible cost, while MatLab excels in high-end applications with or without costly packages.

If by the above description you suspect that at times I have ended up with LabView as front end and MatLab as back end, you are correct. If you aren't time-critical, it can be quite a nice solution. Since both have GUI builders, you can also create software packages (even unified ones by way of one of several nice general installation scripts) that you can hand off to others - to save their time, of course. :-)

By Torbjörn Lars… (not verified) on 24 May 2007 #permalink

Torbjorn Larsson is right---it's all about feedback time. If the "failure mode" of your experiment is only evident with large statistics, by all means automate it. If the failure mode is evident by drawing the first three data-points and eyeballing the linearity, by all means do it by hand.

It's not just important to acquire data by hand; it's important to plot it by hand and interpret as fast as you can write. We always kept stashes of (increasingly hard-to-find) log-log and semilog paper around for impromptu graphs to be taped into notebooks.

Our other incredibly useful lab custom was that we covered all of our lab benches and desks with butcher paper. Need to write something down in a hurry? Write it on the bench. (Don't forget to transfer to a lab notebook later---the paper gets replaced periodically, and every once in a while someone will lose a few numbers in the cleanup.)

There is another consideration besides cost: the reliability of the data. Sometimes this is better for hand-taken data; sometimes hand data is subject to clerical error. How costly might these errors be? In my professional area, writing/modifying computer programs, the cost of a typographical error can be quite high. Given a choice between spending ten minutes generating a repetitive block of code by hand with a text editor, or writing a program to generate it automatically, I'll often opt for automatic generation. Even if it would take ten minutes to do by hand and, say, thirty minutes to write the code to automate it, it may still pay to automate, and thus avoid a costly debugging effort. Probably half of all program bugs come down to a single mistyped character.
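The code-generation trade-off described here can be sketched in a few lines. This is a hypothetical illustration (the field names, the `Settings` struct, and the template are all made up): a short Python script emits a repetitive block of C-style accessor functions from one template, so a typo can only be made once, in the template, rather than once per hand-typed copy:

```python
# Sketch of the "write a program to write the code" approach.
# All names here are invented for illustration.
FIELDS = ["frequency", "amplitude", "phase", "offset"]

TEMPLATE = """double get_{name}(const Settings *s) {{
    return s->{name};
}}
"""

def generate_accessors(fields):
    """Emit one C getter per field from a single template,
    so any typo lives in exactly one place."""
    return "\n".join(TEMPLATE.format(name=f) for f in fields)

print(generate_accessors(FIELDS))
```

Thirty minutes spent on a generator like this also pays again the next time the field list changes: regenerate instead of re-editing every copy.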

Torbjörn Larsson:

If by the above description you suspect that at times I have ended up with LabView as front end and MatLab as back end, you are correct.

You say that like there's something wrong with the notion. To the contrary, I did the same thing on a recent project. My co-op student wrote LabView because it was simple and he had building blocks. His software collected reams of data over several months.

Then I beat it to death with Matlab for the various analyses, because regardless of LabView's virtues, elegant numeric coding isn't among them.

Ben M:

If the failure mode is evident by drawing the first three data-points and eyeballing the linearity, by all means do it by hand.

...Or get your software to plot to the screen in-process, instead of only at the end. Mine does.

By John Novak (not verified) on 24 May 2007 #permalink

I guess my mind was semantically primed by the "shark virgin birth" story, but I could've sworn the title of this post was referring to mating with oneself...

I work in law publishing, and we actually run into similar problems all the time. Updating a book from year to year involves a lot of tedious changes (updating cross-reference page numbers, making sure that our copies of statutes are identical to the state's), and there's always the question of whether we should spend our time doing a cheap, low-tech process, or invest money and skilled labor up front on an automated solution. There are endless heated philosophical arguments about this.

Of course, business has its own version of grad students - interns - and there is no computing problem that can't be solved by cheap labor.

I tend to automate too much on the computer, and on a few occasions have spent more time writing an analysis script than it would have taken to chug manually through my data files. (One egregious example was writing a script to re-format the March Meeting program into a more compact format, and to count exactly how many parallel sessions there were.)

In the lab, though, I tend towards the manual. Maybe that's because I like turning knobs and writing stuff in a logbook. I had to automate my dissertation work, because I needed several days of continuous data taking. The system I used there was to write Tcl scripts on the old Linux Lab Project GPIB interface, which required a now-outdated version of the Linux Kernel. Then in my current lab, everything was automated with Quickbasic on Windows, which I had no interest in learning. So maybe I should sit down and try to learn Labview.

Re: #2, "Fourth, Dr. Millikan, I'm dubious on the whole "throw out the obviously bad data," approach unless you got a Nobel Prize for it.":

Richard Feynman, "Cargo Cult Science", about the difficulty of doing science well, and the temptation to take shortcuts and engage in things that look like science, but that don't advance the body of scientific knowledge.

"We have learned a lot from experience about how to handle some of the ways we fool ourselves. One example: Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It's a little bit off because he had the incorrect value for the viscosity of air. It's interesting to look at the history of measurements of the charge of an electron, after Millikan. If you plot them as a function of time, you find that one is a little bit bigger than Millikan's, and the next one's a little bit bigger than that, and the next one's a little bit bigger than that, until finally they settle down to a number which is higher."

"Why didn't they discover the new number was higher right away? It's a thing that scientists are ashamed of--this history--because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong--and they would look for and find a reason why something might be wrong. When they got a number close to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that. We've learned those tricks nowadays, and now we don't have that kind of a disease."

You say that like there's something wrong with the notion.

Not inherently wrong.

But have you tried to support a mixed package on the market? Too many constraints. :-(

The problem with "quick-and-tweak" is when the solution has evolved into something that is untenable in a changed situation. But that is perhaps another question than the one discussed here.

We've learned those tricks nowadays, and now we don't have that kind of a disease.

But it is also true that by taking time series to exploit short-time stability, and by throwing out some obvious noise events, you can improve data series when you are measuring near the performance limit of the system.

You sacrifice in repeatability and, sometimes, by introducing subjective systematic errors. It depends on the purpose of the experiment if you can take that.

By Torbjörn Lars… (not verified) on 25 May 2007 #permalink