John Allspaw, CTO at Adaptive Capacity Labs, presented _How Your Systems Keep Running Day After Day_ at DevOps Enterprise Summit 2017. He brings vital attention to the complexity of the systems we build and the essential role of human performance.
The cloud in the lower right, below and connected to "monitoring tools," is The System as Peter Alvaro described: a graph of connected systems and a combinatorial search space. John Allspaw and SNAFUcatchers expand The System to include all the infrastructure and tooling and monitoring used to manage the artifacts of the system.
All of these computational systems exist below the line of representation. We cannot see our systems directly. We can only see and manipulate them through this line of representation.
More importantly, all of the real work done in the system happens above the line, where the humans are. After an incident, when the system has surprised us, we conduct retros and come up with to-do lists to prevent the same incident from happening again. These interventions are completely necessary. And they are still interventions below the line.
These two experts in resilience will tell you, all those technologists (us included)?
It's easy to dig in the wrong place. Remember, it's a combinatorial space. As I quoted Peter Alvaro earlier:

> We're going to have to be smart about how we select our experiments.

John Allspaw tells us that incidents are the light to guide us.