Mathias Verraes and Rebecca Wirfs-Brock describe a design breakthrough in "Design and Reality: Reframing the problem through design". Rebecca was invited to consult for a company that makes hardware and software for oil rigs. The initial model assumed alarms are directly connected to emergency conditions in the world. The system’s perception of the world is distorted: when engineers turn off the alarm, the system believes the emergency is over. But it’s not; turning an alarm off doesn’t change the emergency condition in the world. ... the team discovered a distinction between the alarm sounding, and the state of alertness. ... The new model decoupled the emergency from the sounding of the alarm by introducing "alert conditions" in addition to "alerts".
Drilling causes a lot of friction, and “drilling mud” (a proprietary chemical substance) is used as a lubricant. Equipment monitors the drilling mud pressure, and by changing the composition of the mud during drilling, you can control that pressure. Too much pressure is a really bad thing.
And then an oil rig in the gulf exploded.
As the news stories were coming out, the team found out that the rig was using a competitor’s equipment. Whew! The team started speculating about what could have happened, and were thinking about how something like that could happen with their own systems. Was it faulty equipment, sensors, the telemetry, communications between various components, the software?
When in doubt, look for examples. The team ran through scenarios. What happens when a catastrophic condition occurs? How do people react? When something fails, it’s a noisy environment for the oil rig engineers: sirens blaring, alarms going off, … We discovered that when a problem couldn’t be fixed immediately, the engineers, in order to concentrate, would turn off the alarms after a while. When a failure is easy to fix, the control system logs reflect that the alarm went on and was turned off a few minutes later.
The team explored scenarios, specifically the weird ones, the awkward edge cases where nobody really knows how the system behaves, or even how it should behave. One such scenario is when two separate sensor measurements raise alarms at the same time. The alarm sounds, an engineer turns it off, but what happens to the second alarm? Should the alarm still sound or not? Should turning off one turn off the other? If it didn’t turn off, would the engineers think the off switch didn’t work and just push it again?
By working through these scenarios, the team figured out there was a distinction between the alarm sounding, and the state of alertness. Now, in this new model, when measurements from the sensors exceed certain thresholds or exhibit certain patterns, the system doesn’t sound the alarm directly anymore. Instead, it raises an alert condition, which is also logged. It’s this alert condition that is associated with the actual problem. The new alert concept is now responsible for sounding the alarm (or not). The alarm can still be turned off, but the alert condition remains. Two alert conditions with different causes can coexist without being confused by the single alarm. This model decouples the emergency from the sounding of the alarm.
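To make the shape of that decoupling concrete, here is a minimal TypeScript sketch. The article shows no code, so every name here (AlertCondition, Alarm, MonitoringSystem) is invented for illustration; it only demonstrates the key move of letting an alert condition, rather than a sensor, own the decision to sound the alarm.

```typescript
// Hypothetical sketch only: none of these names come from the system in the
// article; they illustrate the shape of the decoupled model.

type Severity = "warning" | "critical";

// An alert condition records the system's belief that something is wrong.
// It is raised when measurements cross a threshold and stays active until
// the underlying problem is resolved -- not when the alarm is silenced.
class AlertCondition {
  private resolved = false;

  constructor(
    readonly cause: string,       // e.g. "mud pressure above threshold"
    readonly severity: Severity,
    readonly raisedAt: Date,
  ) {}

  resolve(): void {
    this.resolved = true;         // the real-world problem has been fixed
  }

  isActive(): boolean {
    return !this.resolved;
  }
}

// The alarm is only the noise-making device. Silencing it says nothing
// about whether the emergency is over.
class Alarm {
  private sounding = false;
  sound(): void { this.sounding = true; }
  silence(): void { this.sounding = false; }
  isSounding(): boolean { return this.sounding; }
}

// The monitoring system turns measurements into alert conditions, and the
// alert condition (not the sensor) is what sounds the alarm.
class MonitoringSystem {
  private conditions: AlertCondition[] = [];

  constructor(private alarm: Alarm, private log: string[] = []) {}

  recordMeasurement(sensor: string, value: number, threshold: number): void {
    if (value > threshold) {
      const condition = new AlertCondition(
        `${sensor} exceeded ${threshold}`, "critical", new Date());
      this.conditions.push(condition);
      this.log.push(`alert condition raised: ${condition.cause}`);
      this.alarm.sound();
    }
  }

  silenceAlarm(): void {
    this.alarm.silence();         // engineers can concentrate again...
    // ...but every active alert condition remains, and remains logged.
  }

  activeConditions(): AlertCondition[] {
    return this.conditions.filter(c => c.isActive());
  }
}
```

With this split, turning the alarm off while two problems are live leaves both of them visible in activeConditions(), which is exactly the distinction the old sensor-to-alarm model could not express.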
Is it a better model?

- Different measurements can be associated with the same alert.
- We can define alarm behaviors for simultaneous alert conditions, for example by spacing the alarms or picking different sound patterns (see the sketch after this list).
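One way to picture the second option is a small policy that looks at how many alert conditions are active. This is again a hypothetical TypeScript sketch building on the classes above; the pattern names and cut-offs are invented.

```typescript
// Invented policy: the set of active alert conditions, not any single sensor,
// decides how the alarm behaves when several problems overlap.

type SoundPattern = "steady" | "alternating" | "intermittent";

function chooseSoundPattern(activeConditionCount: number): SoundPattern {
  if (activeConditionCount <= 1) return "steady";
  if (activeConditionCount === 2) return "alternating"; // two distinct problems stay audible
  return "intermittent";                                // space the alarms out beyond that
}
```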
These new options are relevant, and likely to bring value. Yet another sign we’d hit on a better model is that we had new conversations with the domain experts. A lot of failure scenarios became easier to detect and respond to. We started asking, what other alert conditions could exist? What risks aren’t we mitigating yet? How should we react?
In a world-centric view of design, only the sensors and the alarms existed in the real world, and the old software model reflected that accurately. Therefore it was an accurate model. The new model that includes alerts isn’t more “accurate” than the old one, it doesn’t come from the real world, it’s not more realistic, and it isn’t more “domain-ish”. But it is more useful. Sensors and alarms are objective, compared to alert conditions. Something is an alert condition because in this environment, we believe it should be an alert condition, and that’s subjective.
These creative introductions of novel concepts into the model are rarely discussed in the literature about modelling. Software design books talk about turning concepts into types and data structures, but what if the concept isn’t there yet? Forming distinctions, not just abstractions, can help clarify a model, and these distinctions create opportunities.
Ward Cunningham tells a story of discovering a method object in WyCash. It captures a similar insight: programmers pushing against code for real applications can discover insights that the domain experts could not articulate. See Story of Advancers.