Nobody likes a fragile design; give it the tiniest excuse to fail, and it will. Everybody likes robust systems. Robustness can be defined in many ways, but I think of it as the ability to perform as intended in the presence of a wide range of both expected and unexpected conditions. Thus, a robust system is relatively imperturbable.
As engineers, how do we achieve this robustness? Two general approaches are passive and active robustness. Passive robustness uses design margin to reduce the magnitude and likelihood of the undesired response. Active robustness uses feedback loops to counteract the perturbation.
Consider the problem of preventing a sailboat from capsizing in unpredictable gusts of wind. One passive approach is to add ballast to lower the center of gravity. This increases the righting moment by increasing the horizontal distance that opens up between the center of buoyancy and the center of gravity as the boat heels. As a result, the boat will not heel over as easily. Unfortunately, this robustness comes at the expense of performance; since the boat is heavier, it will not sail as fast. Importantly, we pay the cost of passive robustness in both strong and light winds, when we need it and when we don’t.
In contrast, active robustness uses feedback loops to respond to changes in the wind. For example, we might ease off our heading in the presence of a strong gust. We might have the crew hike out on the windward side of the boat to counterbalance the heel. An active approach allows us to expose more sail area and sail the boat faster in the presence of variable winds. We are safe because our feedback loops prevent dangerous conditions from developing. With active robustness we sacrifice less performance than we do with passive robustness, since our feedback loops only dampen response when this is necessary.
However, although active robustness is normally superior to passive robustness, it has a dark side. The feedback loops can mask the progressive deterioration of performance and set us up for sudden and catastrophic failure. Ironically, the more effective our active feedback loops, the more vulnerable we are to such a catastrophic failure.
Let’s illustrate this phenomenon using the exquisitely well-designed feedback loops of the human body. The body uses many feedback systems to achieve robust performance in the presence of trauma. The medical term for this is homeostasis: maintaining a stable state. For example, when an accident victim bleeds severely, they can experience what is called hypovolemic shock. Their blood loss can prevent sufficient oxygenated blood from reaching vital organs such as the brain. This can reduce brain function, endangering survival. The body responds to this blood loss by activating feedback loops to maintain the flow of oxygenated blood to the brain. Heart rate, breathing rate, and heart stroke volume increase; vasoconstriction decreases blood flow to non-critical organs. This stage of shock is called “compensated shock,” because the body successfully maintains blood pressure by compensating for the loss of blood volume.
But what happens if the victim continues to lose blood? Eventually, the feedback loops have done everything they can to maintain blood pressure, and it is still inadequate. The body transitions to “uncompensated shock.” In this stage it is unable to maintain blood flow to the brain, heart, and lungs; the availability of oxygenated blood drops rapidly; and, sadly, the victim dies. Uncompensated shock leads quickly to irreversible and catastrophic deterioration.
During the initial stage of hypovolemic shock, the compensating mechanisms maintain blood pressure and prevent the most serious consequence of blood loss: reduced perfusion of the brain. System deterioration is occurring, but it does not appear as a drop in blood pressure. Let’s think of blood pressure as a Key Performance Indicator (KPI), since it is one. In fact, it is so important that our body’s feedback loops manage it very effectively. Unfortunately, this effectiveness can mask the victim’s deteriorating physical condition. In fact, the more successful the feedback loop, the more it hides the real deterioration. The KPI does not tell us that system performance is deteriorating. Specifically, it does not tell us that we are losing our margin of safety, that is, the size of the perturbation that the system can still cope with.
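The masking effect can be seen in a toy simulation. This is only a sketch under stated assumptions: the function name `simulate`, the linear disturbance, and every number in it are hypothetical and are not a physiological model. A disturbance grows each step, and a feedback loop cancels it exactly until the loop's capacity runs out.

```python
def simulate(loss_per_step, steps, max_compensation=40.0, baseline=100.0):
    """Toy model (illustrative assumption, not physiology).

    A disturbance grows linearly. A feedback loop cancels it exactly,
    but only up to max_compensation. Returns (step, kpi, margin) tuples.
    """
    history = []
    for t in range(steps + 1):
        disturbance = loss_per_step * t
        # The loop can offset the disturbance only up to its capacity.
        compensation = min(disturbance, max_compensation)
        kpi = baseline - disturbance + compensation
        # Remaining safety margin: compensation capacity not yet used.
        margin = max_compensation - compensation
        history.append((t, kpi, margin))
    return history


for t, kpi, margin in simulate(loss_per_step=10.0, steps=6):
    print(f"t={t}  kpi={kpi:6.1f}  remaining_margin={margin:5.1f}")
```

With these made-up numbers the KPI reads 100.0 from step 0 through step 4 while the remaining margin falls from 40.0 to 0.0. Only at step 5, after the margin is gone, does the KPI begin to drop. Anyone watching the KPI alone would have had no warning.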
Of course, similar problems occur in product development. We can focus on the big three KPIs (performance, cost, and schedule) and fail to pay attention to our safety margin. Suppose an overaggressive schedule causes the team to fall behind on their work. They compensate by heroically working longer hours, and no milestones are missed: performance, schedule, and cost are on track. Should we worry? Yes, we are losing our safety margin and our project is becoming riskier.
What should we do? Monitor the safety margin in your feedback loops in addition to your KPIs. In the case of compensated hypovolemic shock, even when blood pressure was not dropping, the signs of trouble were clearly present. Heart rate and respiration rate were up; vasoconstriction was making the skin pale and clammy. The signs of decreasing safety margin are always present, if you know what to look for. Do you think you have no worries because you haven’t missed a milestone? Check to see if your team’s average work week has increased from 50 hours to 100. If so, you are living on borrowed time.
There is also a psychological trap when we focus only on KPIs like schedule, cost, and performance in a system with effective feedback loops. We will see a project repeatedly absorb perturbations with no effect on our KPIs. This conditions us to believe, incorrectly, that similar future perturbations will be absorbed equally gracefully. If you don’t monitor the feedback loops, you will be unable to see the real consequences of the absorbed perturbations.
What should you do? If you just want simple solutions, restrict yourself to passive robustness. If you need higher performance, use active robustness, but be fully aware of its risks. You can manage these risks if you know how your important feedback loops work. What compensation mechanism is maintaining the performance of your KPIs? How much capacity does your system have to compensate? How much of this capacity have you already consumed? You will be much less vulnerable when you know exactly what is standing between you and catastrophe.
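The last two questions above can be turned into a simple number to track alongside the KPIs. The following is a minimal sketch, assuming you can name a baseline level and a hard limit for each compensation mechanism. The class `CompensationLoop`, its fields, and the hour figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CompensationLoop:
    """One compensating mechanism (hypothetical structure for illustration)."""
    name: str
    baseline: float  # level when nothing is being compensated
    current: float   # level observed now
    limit: float     # level beyond which the mechanism cannot go

    def consumed_fraction(self) -> float:
        """Share of the compensation capacity already used, clamped to 0..1."""
        capacity = self.limit - self.baseline
        if capacity <= 0:
            raise ValueError("limit must exceed baseline")
        used = (self.current - self.baseline) / capacity
        return min(max(used, 0.0), 1.0)


# Assumed figures: a 40-hour baseline week and a 60-hour sustainable ceiling.
loop = CompensationLoop("average work week (hours)", baseline=40, current=55, limit=60)
print(f"{loop.name}: {loop.consumed_fraction():.0%} of compensation capacity consumed")
```

With these assumed figures the report shows 75% of the capacity consumed, even though no milestone has been missed. The limit is the hard part to estimate; a rough guess that is tracked is still more useful than a precise KPI that hides the trend.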