The odds would say that they’re [Black Swans, catastrophic system failures] extremely rare ... overlapping failures – one part of the economy shuts down which pulls another with it. - John Wilder
Mr Wilder had to skip lightly across the tops of the waves because he had a larger point he wanted to make. Exploring some of his more provocative premises would have diluted that point.
I am not under that constraint.
Never go down alone
What Mr Wilder described is that the simultaneous, gross failure of multiple, independent systems should have astronomical odds against it.
And yet those failures happen with distressing frequency.
Is the math behind the statistics flawed, or are the premises?
In fact, neither the math nor the premises made at the start of the venture are flawed. The flaw is to assume that the system does not change: that the interactions between the sub-systems are static, rather than dynamically evolving as each sub-system competes to exploit the larger system for its own purposes.
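To put rough numbers on that, here is a minimal sketch in Python. Every figure in it is an assumption chosen for illustration: four safety layers, a 5% miss rate on a normal day, a 50% miss rate when everybody is rushed, and a rushed day one time in ten. The function names are hypothetical too. It compares the odds you would quote if the layers were independent with the odds you get when one shared condition drags every layer down at once, holding each layer's average miss rate the same in both cases.

```python
def independent_miss(p_miss: float, layers: int) -> float:
    """Chance that every layer misses the defect if the layers fail independently."""
    return p_miss ** layers


def coupled_miss(p_normal: float, p_stressed: float, p_stress: float, layers: int) -> float:
    """Chance that every layer misses when a shared condition (a rushed
    restart, say) raises every layer's miss rate at the same time."""
    return p_stress * p_stressed ** layers + (1 - p_stress) * p_normal ** layers


# All numbers below are illustrative assumptions, not measurements.
LAYERS = 4          # inspector, two sets of electricians, supervisors
P_NORMAL = 0.05     # miss rate per layer on an ordinary day
P_STRESSED = 0.50   # miss rate per layer when everybody is rushed
P_STRESS = 0.10     # fraction of days that are rushed

# Average miss rate of a single layer; identical in both models.
p_marginal = P_STRESS * P_STRESSED + (1 - P_STRESS) * P_NORMAL

naive = independent_miss(p_marginal, LAYERS)
actual = coupled_miss(P_NORMAL, P_STRESSED, P_STRESS, LAYERS)

print(f"per-layer miss rate:          {p_marginal:.3f}")
print(f"all layers miss, independent: {naive:.6f}")
print(f"all layers miss, coupled:     {actual:.6f}")
print(f"coupled / independent:        {actual / naive:.0f}x")
```

Each layer looks exactly as reliable as before when audited on its own. The only thing that changed is that the bad days line up.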
It is easier to describe than to explain.
Suppose you check welds in an industrial plant. You have been given a plan to check welds every hour, and the plan specifies that every combination of parts welded by each robot be checked in that hour.
Suppose you, the quality inspector, had a brain-fart, or came back from vacation and showed off the pictures on your phone instead of checking parts, or perhaps some part combinations are very rare and you just happened to miss them.
Suppose the robot failed and the inspector missed the failure for eight consecutive checks.
You got fired, right?
All modern systems have multiple, redundant safety interlocks. Everybody who walked within fifteen feet of that robot is implicated because they all failed. If they fire you, the quality inspector, they also have to fire the four electricians who worked on the robot between third and first shifts as well as the maintenance and quality supervisors.
The electricians failed to "validate" their work. The electricians on the following shift failed, at a minimum, to notice the uncharacteristic wear pattern on the weld caps when they changed them. Both are written standards in their job descriptions.
The supervisors failed to follow up on maintenance work that was done too quickly. The emphasis was on getting the equipment back into service rather than on ensuring that every "I" was dotted and every "T" was crossed.
A member of the US military was pulling a trailer through a crowded Southeast Asian city. The hitch separated from the truck and the trailer careened across the crowded sidewalk and, miraculously, killed only three civilians.
The serviceman (Air Force, incidentally) was supposed to have performed a comprehensive, pre-operation safety inspection before starting the vehicle. He put check-marks in every box but failed to observe that the trailer hitch welds were severely corroded.
He got court-martialed, right?
Wrong. He skated.
He pointed out that corrosion does not happen overnight. Every serviceman who drove that truck in the past six months should have D-Xed the truck and none of them did.
Also, the truck had frequent maintenance intervals. Say what you will, but the US Military really does try to make the equipment last for decades. Every mechanic who signed off on the truck maintenance should have caught the failing trailer hitch.
Faced with the prospect of losing the majority of his motor-pool drivers and a goodly chunk of his skilled mechanics, the base commander "disappeared" the event. It reflected badly on his subordinates and by inference, on his leadership.
Man, the rational actor
In a slightly different vein.
Suppose a factory has a typical absenteeism rate (including vacations) of 12%. Also suppose that the factory is composed of six-person teams.
To deal with absenteeism, the factory management decides that each team gets an extra worker. If every worker on the team shows up, the extra worker is assigned a "continuous improvement" project.
A bright-eyed and bushy-tailed (BE&BT) college graduate makes the case that the law of large numbers suggests that variation smooths out as sample sizes get larger.
By way of example, suppose you have an urn with white marbles and black marbles. If you only draw out one marble, you are guaranteed that it will either be 100% white or 100% black. If you pull out a larger number of marbles, it is highly likely that the sample will look much more like the true population than a single draw.
Our BE&BT college graduate suggests that all of the absentee replacement workers be put into a common pool and "flowed" to where they are needed. His back-of-envelope calculations suggest that one-third of the absentee replacement workers can be laid off.
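His envelope might have looked something like the sketch below. It is a reconstruction under stated assumptions, not his actual arithmetic: the factory is assumed to have 20 teams (a hypothetical number), absences are assumed to be independent of one another, and replacement workers are assumed to always show up. The helper name `expected_unfilled` is made up for the example. It computes the expected number of empty positions per day from the binomial distribution.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability that exactly k of n workers are absent."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def expected_unfilled(n_workers: int, p_absent: float, n_spares: int) -> float:
    """Expected absences left uncovered when n_spares replacements are on hand."""
    return sum(
        (k - n_spares) * binom_pmf(k, n_workers, p_absent)
        for k in range(n_spares + 1, n_workers + 1)
    )


TEAM_SIZE = 6      # from the example
P_ABSENT = 0.12    # from the example
N_TEAMS = 20       # hypothetical; the example does not give a factory size

workforce = N_TEAMS * TEAM_SIZE
pool_after_layoff = round(N_TEAMS * 2 / 3)  # one-third of replacements laid off

# One dedicated replacement per team, no sharing between teams.
dedicated = N_TEAMS * expected_unfilled(TEAM_SIZE, P_ABSENT, 1)
# Same head count of replacements, shared across the whole factory.
pooled_full = expected_unfilled(workforce, P_ABSENT, N_TEAMS)
# Shared pool after laying off one-third of the replacements.
pooled_cut = expected_unfilled(workforce, P_ABSENT, pool_after_layoff)

print(f"dedicated, {N_TEAMS} replacements:      {dedicated:.2f} empty positions/day")
print(f"pooled, {N_TEAMS} replacements:         {pooled_full:.2f} empty positions/day")
print(f"pooled, {pool_after_layoff} replacements:         {pooled_cut:.2f} empty positions/day")
```

Under those assumptions the smaller pool still leaves fewer positions empty than the dedicated replacements did. The whole result rests on the word "independent".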
In effect, the BE&BT created a first-order coupling with the intention of making the system more robust, and then exploited the increased robustness to reduce costs.
What could go wrong?
Assume for a moment that the factory is composed of three shops. Let's call them Body, Paint and General Assembly. Also assume that the relative populations of the shops are 2-1-5.
Body Shop is heavy work, is not air conditioned and requires heavy personal-protective-equipment.
Paint Shop is light work, is air conditioned and requires certification.
General Assembly is busy-but-light work, is not air conditioned and has minimal personal-protective-equipment.
What could go wrong?
On the first really hot day of summer, absenteeism hits 25%.
Everybody who has scheduled vacation takes their vacation.
Everybody who was feeling sickly calls in sick.
All of the absentee replacements know that they will be sent to the seventh level of hell, the Body Shop. They all call in sick.
All of the "special assignment" workers, the ones working on future products and running the library and the public tours, call in sick too. They are not stupid. They are not work hardened. Working on the line will make them hobble for a week. They are out of practice and they will be held accountable for any errors they make.
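The hot day can be run through the same kind of arithmetic. The sketch below is illustrative only: the 120-person workforce and the 13-person replacement pool are hypothetical numbers, and only the 12% and 25% absenteeism rates come from the story. It compares an ordinary day, a hot day on which the replacements still show up, and a hot day on which every replacement calls in sick.

```python
from math import comb


def expected_unfilled(n_workers: int, p_absent: float, spares_present: int) -> float:
    """Expected absences left uncovered, given how many replacements actually show up."""
    return sum(
        (k - spares_present) * comb(n_workers, k) * p_absent**k * (1 - p_absent) ** (n_workers - k)
        for k in range(spares_present + 1, n_workers + 1)
    )


WORKFORCE = 120   # hypothetical factory size
POOL = 13         # hypothetical replacement pool after the layoff

scenarios = {
    "ordinary day, pool shows up": (0.12, POOL),
    "hot day, pool shows up": (0.25, POOL),
    "hot day, pool calls in sick": (0.25, 0),
}

# Expected empty positions per day for each scenario.
results = {
    name: expected_unfilled(WORKFORCE, p_absent, spares)
    for name, (p_absent, spares) in scenarios.items()
}

for name, empty in results.items():
    print(f"{name:30s} {empty:5.1f} empty positions")
```

The pool was sized for the ordinary day. The same heat that pushes up absenteeism on the line also empties the pool, so the backup disappears on exactly the day it was hired for.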
- Initial estimates of process robustness assume that processes are truly independent.
- The only way to keep processes truly independent is to employ superhuman effort, like double-blind-placebo experiments, to maintain that independence.
- The longer a system runs, the more its formerly independent processes become coupled, and the more its "robustness" invisibly erodes.
- Processes put into place to track the erosion of "robustness" rarely work.
- Processes put into place to forestall the erosion of "robustness" perversely nourish collusion among the "sheep dogs" assigned to catch that erosion, which makes the system more brittle.