APS 2008: Can we learn from errors? What if we're running a nuclear power plant?

Just a few quick notes about Michael Frese's talk, "Learning from Errors by Individuals and Organizations."

Frese gives a rule: "You make about 3-4 errors per hour no matter what you're doing."

If errors are so ubiquitous, maybe it makes more sense to train people to deal with errors rather than to try to stamp out every possible error. Frese and others have studied this question in the lab. They found that error management training actually led to improved performance on computer training tasks: if you are trained to expect errors and deal with them, you do better on the task. There are limits to this approach, but in general, the more complex the task, the more important it is to focus on managing errors rather than just avoiding them.

They also found that feedback is important: if you have clear feedback, learning from errors works better than error-avoidant training. If clear feedback isn't provided, then learning from errors isn't as effective.

Frese is an organizational psychologist, and for him the key is results. Small businesses do better -- in real, financial terms -- when their owners say they adapt well to mistakes. The killer stat from his talk: about 20 percent of the variability in corporate profitability is accounted for by error management culture. A company that focuses on managing errors, rather than simply trying to avoid them, is significantly more likely to be profitable.

An interesting question from the audience: What if you're running a nuclear power plant? If an error is catastrophic, how can you ever learn from it? Frese responded that he has actually consulted with power companies running nuclear plants, and that errors inevitably occur even in these situations. A company can take two approaches when they do -- sweep them under the rug while repairing the damage, or attempt to learn from them and work on ways to handle such errors better in the future. For him, the latter approach is still preferable, even when errors can literally mean the difference between life and death.

Comments

Power plants usually (always, I hope) have multiple levels of safety features, so that a catastrophic failure requires a whole series of errors. So error-management plans should be just as useful.

I think Rosie has it right. Most catastrophic failures (e.g., Three Mile Island, Chernobyl) are the result of a series of small errors and failures. If the initial error or failure can be quickly identified and appropriate corrective action taken, the chance of a disaster is reduced.

As a software developer, I consider error handling an obvious part of my job description. I find myself in a constant battle with management over the importance of error detection. Too many of the systems we develop make no attempt to detect errors, especially data errors. Fortunately I don't work in the nuclear industry.
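To make "detecting data errors" concrete, here is a minimal sketch of checking input at the boundary instead of letting bad data propagate silently. The record fields, limits, and function name are invented for illustration, not taken from the commenter's systems.

```python
# Minimal sketch of defensive data validation: check inputs at the boundary
# and report every problem found, instead of silently passing bad data along.
# Field names and limits are illustrative only.

def validate_reading(record):
    """Return a list of problems found in a sensor reading; an empty list means OK."""
    problems = []
    if not str(record.get("sensor_id", "")).strip():
        problems.append("missing sensor_id")
    temp = record.get("temperature_c")
    if not isinstance(temp, (int, float)):
        problems.append("temperature_c is not numeric")
    elif not (-50.0 <= temp <= 150.0):  # plausible range for this hypothetical sensor
        problems.append(f"temperature_c out of range: {temp}")
    return problems

reading = {"sensor_id": "A-17", "temperature_c": 842.0}  # a corrupted value
errors = validate_reading(reading)
if errors:
    # Detected early, the error can be logged and handled instead of ignored.
    print("rejected reading:", "; ".join(errors))
```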

By NoAstronomer (not verified) on 23 May 2008 #permalink

I worked in a nuclear plant for years. We constantly ran casualty drills to simulate emergencies and see how people react. Then, errors would purposely be worked into the scenario. For example, if certain switches or valves are supposed to be operated in an emergency, then halfway through the scenario we would be told that the wrong valve was operated because the operator freaked out. So part of the emergency has gotten worse. Now what do you do? We trained to account for errors and respond.
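The software analogue of working errors into a drill is fault injection: deliberately making a step go wrong during a test and checking that the rest of the procedure still reaches a safe outcome. A rough sketch of the idea, with made-up names and steps (this is not how any real plant procedure works):

```python
# Rough sketch of fault injection in a test: deliberately make a step go wrong
# and check that the procedure still reaches a safe end state.
# All names and steps are made up for illustration.
import random

def run_shutdown_drill(operate_valve):
    """Toy shutdown procedure; operate_valve may act on the wrong valve."""
    state = {"valve_a": "open", "valve_b": "open", "pump": "on"}
    for intended in ("valve_a", "valve_b"):
        operate_valve(state, intended)
        if state[intended] != "closed":   # detect the error...
            state[intended] = "closed"    # ...and take corrective action
    state["pump"] = "off"
    return state

def flaky_operator(state, intended_valve):
    # Injected fault: half the time the "operator" closes the wrong valve.
    wrong = "valve_b" if intended_valve == "valve_a" else "valve_a"
    state[intended_valve if random.random() < 0.5 else wrong] = "closed"

final = run_shutdown_drill(flaky_operator)
assert final == {"valve_a": "closed", "valve_b": "closed", "pump": "off"}
print("drill ended safely despite injected operator errors:", final)
```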

This turned into a long post. Feel free to skip to the last paragraph.

Another software developer here. While it is without a doubt important to code for all input sequences, so as to render the unwanted ones impotent, those are not the only sources of error. If Frese's proposition is true, which I think it is, we also make continuous errors while writing the code, while learning the application domain, while fitting the program to the domain, and so on. I would also distinguish between errors of rote learning (nuclear plant operation, piloting, and so on are examples: one trains for the job and does not, and cannot, learn on the job how to deal with things that go wrong) and creative errors. I focus on the latter.

There appears to be what I call a 'cognitive loop' involved in the act of (code) creation: idea -> design -> implement -> test -> idea. That is, testing a creation reveals errors, and those errors feed back into refining the idea and therefore the subsequent steps of the next pass around the loop. But the process is limited by available cognitive resources and by the time it takes to complete one circuit: if the loop demands too many cognitive resources, or takes longer than short-term memory can hold, the ability to notice and learn from the mistakes is significantly impaired.
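One turn of that loop in miniature -- this is my own illustration, not the commenter's code: an idea is implemented, a test exposes the error, and the failure feeds back into a refined version.

```python
# One turn of the idea -> design -> implement -> test -> idea loop, in miniature.
# The example (computing an average) is illustrative only.

def average_v1(values):
    # First idea: sum divided by count.
    return sum(values) / len(values)

# Test step: an edge case reveals the error (empty input crashes).
try:
    average_v1([])
except ZeroDivisionError:
    pass  # the failure is the feedback that refines the idea

def average_v2(values):
    # Refined idea from that feedback: decide explicitly what "no data" means.
    return sum(values) / len(values) if values else None

assert average_v2([]) is None
assert average_v2([2, 4, 6]) == 4
```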

Traditional waterfall models of software development have a serious problem: their cognitive loop is corporate, not personal, and its duration is the whole product cycle, so the process is incredibly sensitive to getting the requirements and specification right up front. And unless the project is one of automating an existing, well-known process (banking, a chemical plant, etc.), it is fundamentally impossible to know in advance what the requirements or specifications really are. Plus, on a personal note, my whole reason for being in this field is the joy of exploration, of doing what by definition has not been done before, of continuous learning on the job.

Which brings me back to the small stuff: the everyday experience of a programmer. How much code should you write before syntax checking, test harness runs, and operational tests? These decisions are modulated by the cost of doing each part, and that cost is implicit in, among other things, the tool set available. The traditional model is edit - compile - test - edit, which can be quite a long loop -- long enough to get a cup of coffee during the compile step (maybe).

I have found that (for me) an exploratory model is much more productive: continuous course correction, continuously running and demo-able code, small increments. The biggest factor has been developing a programming environment that eliminated the compile step -- suddenly the cognitive loop is closed at the edit stage, and course-corrected motion toward the final product becomes more like swimming than run-wait-fear jerky progress. The environment also proved accessible to non-programmers; our marketing VP said "I like VNOS because it makes me feel smart", and he wrote himself a weekly alarm applet that played an mp3 at happy hour on Fridays.
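One everyday way to get a similarly short loop in a common toolchain -- my example, not the commenter's VNOS environment -- is to keep the test right next to the code so that simply re-running the file checks it, with no separate compile or test-harness step:

```python
# A small way to shorten the edit-test loop in an interpreted language:
# keep the examples next to the code (doctest) so every save-and-rerun of the
# file immediately checks them.

def clamp(value, low, high):
    """Clamp value into the closed interval [low, high].

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(42, 0, 10)
    10
    """
    return max(low, min(high, value))

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # re-running the file closes the loop
```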

Summing up: the shorter the cognitive loop, the better the learning and the surer the corrections. A lesson is learned only if the mistake is visible -- what is wrong with the result, why it went wrong, and how to fix it and advance. Don't choke the student.

By Gray Gaffer (not verified) on 23 May 2008 #permalink

An entire field of human factors research is devoted to this problem.
Anyone who flies should be glad to know that planes are safer than cars, and safer than ever before. Why do we have the oldest fleet of jumbos the world has ever seen, and yet so few crashes?
1. All humans err; accept it, and build in layers of defenses. In aviation (and nuclear power) this is the "Swiss cheese" model: anyone can err, be it the guy checking the bolts in the engine or the pilot who has flown for 10 hours, but for every conceivable KNOWN error there is a check, or a built-in buffer of safety. For an accident (or a less serious incident) to occur, the holes in the slices all have to line up. (A rough numerical sketch of this follows these points.)
2. Learn from mistakes and errors. Not just how to fix, accommodate, or mitigate them -- those are cover-ups -- but learn how to recognize and respond to the errors that will occur. This is a safety culture. It has enabled airlines to get to a point where 80-90% of all incidents are traced to human error, most of them involving some intent to cut corners... But that was another assignment.
Simplistically put...
The Three Mile Island incident less so, but Chernobyl certainly was a poorly executed safety culture, one that attributed blame to individuals instead of finding solutions; so when mistakes were made, and many were, they were ignored or covered up. That was in conjunction with some design flaws which have since been superseded by this process of constant learning (for example, failed pumps can no longer result in a lack of coolant, but rather in too much, and electrical failures within a reactor result in the reaction-halting control rods being dropped IN, not stuck OUT, and so on)...
3. Always focus on safety. This means looking for things to improve and looking for errors or failures, and pretty soon you get really good at spotting anomalies and at finding ways to engineer, or train, to avoid them.
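To make the "holes lining up" picture concrete, here is a back-of-the-envelope sketch. The layer names and per-layer miss probabilities are invented for illustration; the point is only that independent layers of defense multiply down the chance of every one failing at once.

```python
# Back-of-the-envelope Swiss cheese arithmetic: an accident requires every
# independent layer of defense to fail at once, so the probabilities multiply.
# The per-layer miss probabilities below are invented for illustration.
from math import prod

layer_miss_probability = {
    "maintenance inspection": 0.01,  # chance this check misses the fault
    "pre-flight checklist":   0.02,
    "cockpit cross-check":    0.05,
    "automated warning":      0.01,
}

p_all_layers_miss = prod(layer_miss_probability.values())
print(f"P(all layers miss, assuming independence) = {p_all_layers_miss:.2e}")
# ~1e-7 here, versus 0.01-0.05 for any single layer on its own.
# Real layers are never fully independent, which is why the holes can still line up.
```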

There are others not relevant here, but any system is enhanced with an attitude of open eyes and acceptance.

Sorry if I make no sense; it is late here, and it was a long day finishing an assignment...

By Pat Kershaw (not verified) on 25 May 2008 #permalink