This is a sort of parable about infrastructure, risk, and why certain things are hidden from view until suddenly they’re not. It’s also a warning about how repeated success, organizational incentives, and role-based cognition can conspire to make crucial infrastructure—and its risks—invisible until it fails catastrophically.
Here’s how it starts, for me: I was sitting in a classroom watching an astounding technical and scientific marvel on live T.V. when it exploded. According to my mother, I came home from school that day and told her I didn’t want to be an astronaut anymore. I don’t remember thinking all that much about the Challenger disaster in the intervening years, but when I was in graduate school, O-ring failure probabilities came up in some tutorial or book on Bayesian analyses. I worked the example with a sort of reverence and discovered it made me tear up. I guess it still does, actually.
This event is newly relevant to me. This issue of the Developer Science Review is about infrastructure, and the Challenger disaster is an example of infrastructure failing catastrophically. There are deeper connections to the experiences of infrastructure workers in software engineering, though. The question that has been on my mind lately is something like, “What sociocultural forces contribute to the invisibility of infrastructure and its cultivators?” One possible piece of an answer is that it is in the interest of high-level decision makers (e.g., bosses, managers, the C-suite) to keep infrastructure in the background as much as possible, minimizing the resources spent on it so long as everything is going fine with the shiny stuff (this comes up in the target article of this comment). In the Challenger disaster, there is good evidence that the engineers at NASA and its contractors had flagged the risks and were met by managers who read that evidence differently.
The article is not an account of who was right and who was wrong, though. I don’t think any respectable account would reduce the complexity to such an extent. The insights about infrastructure here are about what is visible and invisible to people based on ways of thinking that may sometimes be motivated by their role.
The authors talk about this mismatch in terms of different mental models or cognitive theories of how our understanding of risk in the present should be influenced by past performance. In their telling, engineers had one model in mind (that past success is irrelevant for the current moment) and managers had another (past success indicated that failure probabilities were actually lower than they had originally thought).
The authors actually lay out three theories in total. The authors’ theory 1: Independent Trials (Engineer view). This is how we usually think about repeated random trials in statistics. A past success doesn’t change today’s probability of failure; it only offers a little information about what the underlying probability distribution actually is. This is what led engineers to voice concern when they noticed deterioration of crucial O-rings in past successful launches. Sure, it went fine last time, but that doesn’t mean it’s not going to fail next time.
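To make the independent-trials view concrete, here is a minimal Python sketch; the 1-in-50 per-launch failure probability and the 24-launch streak are made-up numbers chosen for illustration, not figures from the article.

```python
# Independent-trials view (Theory 1), as a toy illustration.
# The failure probability and streak length below are assumptions, not data.
p_fail = 1 / 50      # assumed fixed per-launch failure probability
n = 24               # length of a hypothetical streak of successful launches

# A long streak of successes is entirely consistent with a nonzero p_fail...
p_streak = (1 - p_fail) ** n
print(f"P({n} successes in a row) = {p_streak:.2f}")    # ~0.62

# ...and under independence, the streak does not lower the risk of the
# next launch at all: it is still p_fail.
print(f"P(failure on the next launch) = {p_fail:.2f}")  # 0.02
```

On this view, the streak is informative only about what p_fail might be; it says nothing about the next launch being safer than the last.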
The authors’ theory 3: Bayesian Optimism (Manager view). Past success predicts future success. When you start a series of risky, repeated endeavors, no one really knows, not for certain, what the probability of failure is. Someone does a lot of math and testing and hopefully gets close to the real number. After one has endeavored multiple times and been successful, flawless even, one starts to think, perhaps reasonably, that the kinks have been ironed out and the probability of failure is lower than the initial estimates suggested. This can certainly be a reasonable way of thinking, and it’s what makes Bayesian updating so powerful. If you flipped heads 25 times in a row, you’d likely conclude the coin isn’t fair—regardless of what you believed going in. What’s crucial about this frame of mind, though, is that your model has to be right, and you have to have enough data. A sociotechnical system like the space shuttle is vastly more complicated than a two-sided probability machine.
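To see Theory 3 in miniature, here is a small Beta-Binomial sketch; the Beta(1, 49) prior and the 24 flawless launches are assumptions I have chosen for illustration, not the authors’ numbers.

```python
# Beta-Binomial updating (Theory 3), a toy sketch with assumed numbers.
def posterior_mean(prior_a, prior_b, events, non_events):
    """Posterior mean of an event's probability under a Beta(prior_a, prior_b) prior."""
    return (prior_a + events) / (prior_a + prior_b + events + non_events)

# Manager-style updating on launch failures, starting from an assumed
# Beta(1, 49) prior whose mean is 1/50.
print(posterior_mean(1, 49, events=0, non_events=0))    # 0.020  initial estimate
print(posterior_mean(1, 49, events=0, non_events=24))   # ~0.014 after 24 flawless launches

# The coin from the text: a flat Beta(1, 1) prior, 25 heads, 0 tails.
print(posterior_mean(1, 1, events=25, non_events=0))    # ~0.96  P(heads)
```

The arithmetic is fine; the trap, as the text notes, is when the model being updated is too simple for the system and the data are too few.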
The authors’ theory 2: Complacency Drift (What really happens). This is the authors’ account of what tends to happen in practice, and of what happened in this case to disastrous effect. Each success breeds some amount of complacency, along with fine-tuning that favors efficiency or some other goal at the cost of safety. The article documents how, after repeated successes, many of which showed apparent O-ring degradation, managers shifted their beliefs about how failure-prone these parts were: even when degraded, none had failed. The repeated deviation from safety standards amid these successes normalized that deviance[1]. Indeed, this adjustment of standards made failure more likely.
Mulloy later reflected: ‘Since the risk of O-ring erosion was accepted and indeed expected, it was no longer considered an anomaly to be resolved before the next flight…. I concluded that we're taking a risk every time. We all signed up for that risk. And the conclusion was, there was no significant difference in risk from previous launches. We'd be taking essentially the same risk on Jan. 28 that we have been ever since we first saw O-ring erosion’ (Bell and Esch, 1987: 43, 47).
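One way to caricature the drift the authors describe is a toy simulation; everything in it is assumed for illustration (the prior, the 10% per-success relaxation of standards, the starting failure probability), but it shows the shape of the problem: the estimated risk falls with each success while the true risk, nudged upward by relaxed standards, quietly rises.

```python
# Toy caricature of complacency drift (Theory 2); every parameter is assumed.
a, b = 1.0, 49.0   # assumed Beta prior on the failure probability (mean 1/50)
true_p = 0.02      # assumed true per-launch failure probability at the start
drift = 1.10       # assumed 10% increase in true risk each time standards relax

for launch in range(1, 25):
    estimated_p = a / (a + b + launch)  # posterior mean after `launch` successes
    true_p *= drift                     # standards relax a little after each success
    if launch % 8 == 0:
        print(f"launch {launch:2d}: estimated {estimated_p:.3f}, true {true_p:.3f}")

# The two numbers move in opposite directions: the model being updated
# is no longer the system being operated.
```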
A few days before the launch, even after the O-ring problems had been deemed closed, engineers met to hypothesize about what was going wrong with the rings. One hypothesis was that cold temperatures made them sluggish, so they did not seal the way they should. Those engineers recommended that no launch take place below a particular temperature threshold, based on the lowest temperature of any previous successful launch. The recommendation was ignored.
I see an echo of this in incident reports like Hazel Weakly’s The Mother of all Outages. An organization steaming along under the blissful power of Theory 3 underinvests in the people who make infrastructure work, until critical knowledge walks out the door. Engineers inside those systems, operating from Theory 1—or at best a skeptical Theory 3—see the risk but lack the resources, influence, or incentives to correct course. This is probably a familiar story to many of you. What I think this article helps us do is understand one mechanism that might contribute, in some places, sometimes, to this phenomenon. It helps us begin to make sense of the thought processes and motivations involved.
Understanding the human processes, thoughts, motivations, behaviors, and social contexts that cause things to be a certain way is what I consider my job as a psychological scientist investigating the work-life of technical communities. There is a lot of pretty good information out there from scientists and practitioners alike that brings visibility to how things are—e.g., personal blogs, industry reports—but determining why things are this way requires a whole set of epistemological machinery for teasing out a veridical causal story. And it is knowing the why that offers us insight into how to change things or ensure stability. The target article uses a careful reading of historical data to offer generative insights into this why.
To wrap up, this paper foregrounds how the concerns and expertise of different roles can lead to diverging mental models of risk and evidence, models that end up producing conflicting decision signals. I’d like to ask you, the reader, to read the article looking for how the whole project of the space shuttle, and its success, presses for its crucial infrastructure to remain in the background, and then for how that motive may lead one to favor a particular model of how past performance informs risk assessment in the present. It’s a mechanism I hope gets further exploration in future work on how software infrastructure workers are seen and heard (or not) by their managers and coworkers.