Human Performance in Systems

How Your Systems Keep Running Day After Day. John Allspaw at DevOps Enterprise Summit 2017. youtube

YOUTUBE xA5U85LSk0M Published on Nov 15, 2017.

Social Issues In Postmortems. This is an excerpt from the Stella Report, which is the result of the first project “Coping With Complexity” performed by the Resilience Engineering In Business-Critical Software Consortium, SNAFUcatchers. medium

Doctors David Woods and Richard Cook present the report of the Stella meeting and the first year of SNAFUcatchers work to the attendees of the Velocity Meeting in New York City. post report

The System from The STELLA Report source One of the most important diagrams on the Internet.

All businesses are software businesses, or quickly becoming software businesses. Some of them know it. All software is quickly growing into something shaped like this diagram of The System from the STELLA report.


# Mental models youtube 6:27

Sections of The System

_The areas given to the different sections of the visualization of the whole system are interesting: 20% service, 30% infrastructure, and 50% above the "line of representation": people, their conversations, mental models, tasks, and intentions._

You never actually see or touch the stuff below the line. It doesn't exist. These are the questions we can ask above the line:

"How does our software work really, versus how it's described in the wiki and in documentation and in the diagrams?" We know that those aren't comprehensively accurate. "How does our software break really, versus how we thought it would break when we designed safeguards and circuit breakers and guardrails?" "What do we do to keep it all working?"

# Incidents _(outages, degradations, breaches, accidents, near-misses, "glitches," basically untoward or unexpected events)_ youtube 12:29

We can learn from the study of other high-tempo, high-consequence domains, and if we do, we can see that we can study incidents.

The obvious reasons incidents are interesting are loss of revenue and damage to reputation.

Deeper learning from incidents today can drive our architecture tomorrow.

Incidents give birth to new forms of regulation, policies, norms, compliance, auditing, and constraints. For example, SEC Regulation SCI was a reaction to the flash crash of 2010, Knight Capital, the BATS IPO, and the Facebook IPO. PCI DSS came about when Visa and MasterCard compared notes and discovered $750 million lost to fraud over 10 years.

Incidents influence downstream staffing, budgets, planning, roadmaps.

Incidents are encoded messages from below the line, drawing attention to parts of the system you thought you understood but you didn't.

Incidents could lead to new training, new tooling, new organizational structures, new funding dynamics, and insights your competitors don't have.

# Common Questions During Incidents youtube 17:23

- What's it doing?
- Why is it doing that?
- What will it do next?
- How did it get into this state?
- WTF is happening?
- If we do Y, will it help us figure out what to do?
- Is it getting worse?
- It looks like it's fixed, but is it?
- If we do X, will it prevent it from getting worse, or will it make it worse?
- Who else should we call that can help us?
- Is this our issue, or are we being attacked?

# What We Can Learn youtube 19:14

Incidents provide calibration about:

- how decisions are _**focused**_
- how attention is _**focused**_
- how coordination is _**focused**_
- how escalation is _**focused**_
- the _**impact**_ of time pressure
- the _**impact**_ of uncertainty
- the _**impact**_ of ambiguity
- the _**consequences**_ of consequences


Gustav Fechner constructed experiments to measure the boundaries of the "subliminal": exactly where do phenomena and experience fall below the line of human perception. The thresholds of human perception are explored deeply from this completely different angle in Risk of Serious Inquiry.

Peter Alvaro offers some strategies for approaching the mess that sits below the line described in Allspaw's talk. See Twilight of the Experts