Most Breakdowns Are Management Failures, Not Machine Failures
The evidence is clear. The excuses are comfortable. The cost of avoiding this conversation is paid in downtime, incidents, and wasted careers.
Every time a crane fails unexpectedly, the investigation finds something wrong with the equipment. Rarely does it ask what was wrong with the decisions made — and not made — in the months before.
Let me make the argument plainly, and then spend the rest of this piece defending it with evidence: the majority of unplanned equipment breakdowns in heavy industrial facilities — steel plants, mines, process plants, power stations — are not primarily caused by machines malfunctioning. They are caused by management systems that failed to prevent machines from reaching the state where malfunction was inevitable.
This is not a comfortable position. It implicates decisions made by real people in positions of responsibility. It challenges the narrative comfort of "the machine just failed", a framing that places the blame on an inanimate object and closes the conversation without threatening anyone's career. But it is the position that the evidence supports, and it is the position that, when taken seriously, actually produces improvement rather than repetition.
When a bearing fails, we fix the bearing.
When a bearing fails because the lubrication interval was halved to meet a production target, we need to fix something else.
Plant maintenance perspective — the distinction that most post-breakdown investigations never reach
The pattern in the table above is not an accident. It reflects a consistent finding across multiple decades of reliability engineering and incident investigation research: most breakdowns have identifiable precursors, and those precursors were either not detected (because monitoring systems were inadequate), detected but not acted on (because the action cost money or stopped production), or acted on too slowly (because the priority framework put production metrics above maintenance risk).
The Maintenance Budget Is a Management Decision
Every piece of equipment in an industrial facility has an engineered maintenance requirement — specified by the manufacturer and refined by operating experience. Lubrication intervals. Inspection frequencies. Component replacement lives. These are not suggestions. They are the maintenance programme that keeps the machine operating within its designed failure envelope.
The decision about whether to fund and execute that maintenance programme is made by management. When the budget is cut, maintenance intervals are stretched. When production pressure is high, planned outages are deferred. When headcount is reduced, PM tasks are abbreviated or skipped. These decisions are made by people with authority over resources and schedules. They are management decisions. And when the deferred maintenance eventually produces a failure, tracing the breakdown to its actual origin leads directly back to those decisions — not to the machine.
The steel industry operates in commodity price cycles that create enormous pressure on cost management during market downturns. Maintenance budgets are among the most visible and most heavily scrutinised cost lines, and they compress accordingly. The problem is that maintenance cost compression in Year 1 is paid for with compound interest in Years 2 and 3, in the form of accelerated asset degradation, increased breakdown frequency, and the much higher costs of emergency repair versus planned maintenance. This is one of the best-documented cost patterns in reliability engineering, and it continues to be repeated because the people making the Year 1 cost decision are rarely the people who will manage the Year 3 consequences.
Production Pressure Overrides Safety Signals
One of the most consistent findings in post-incident investigations across manufacturing industries is the presence of warning signals — mechanical or operational — that preceded the failure and were known to at least some of the people in the system. The question is not why these signals weren't detected. Often, they were. The question is why they didn't produce a response that prevented the failure.
The answer almost always involves production pressure. The operator mentions the unusual vibration to the supervisor at the start of the shift. The supervisor acknowledges it and says they'll monitor it — because stopping the crane now means calling the shift manager and explaining why the bay is down. The shift manager would have to call production planning. Production planning has a heat scheduled in three hours. The vibration feels manageable. Nobody makes a decision to accept the risk — the decision is made by not making a decision, by allowing the crane to keep running while the developing fault deepens.
Diane Vaughan's analysis of the Challenger Space Shuttle disaster identified this pattern with remarkable precision — she called it the normalisation of deviance. The rules say one thing. The pressures of the system push in a different direction. Over time, the deviation from the rule becomes the norm. And the norm persists until the consequence arrives. This dynamic is not unique to NASA. It operates in every industrial facility where production metrics and maintenance risk assessments are evaluated by the same person under competing pressures.
What the Research Consistently Shows
James Reason's Swiss Cheese Model (1990): major incidents rarely have a single cause. Multiple latent failures in the management system align to allow an active failure to produce a consequence. The "cheese" is the management system. The holes are management decisions.
Heinrich's Triangle (revised understanding): the original 1:29:300 ratio of major injuries to minor injuries to no-injury incidents pointed toward high-frequency, low-consequence events as precursors. The management implication: eliminating near-misses and minor incidents requires addressing the organisational conditions that generate them, not just counting them.
Reliability engineering literature broadly supports the finding that planned maintenance deferral is the single largest contributor to unplanned equipment failure in heavy industry — typically accounting for a large share of all unplanned stoppages in facilities with inadequate PM compliance.
World Steel Association Safety Reports consistently identify management system failures — inadequate hazard identification, poor permit-to-work systems, inadequate supervision — as contributing factors in the majority of serious incidents reviewed across member organisations.
The Machine vs. Management Breakdown
Framing breakdowns as machine failures versus management failures is not an exercise in blame. It is an exercise in problem definition. The solutions available to you look completely different depending on which category you place the root cause in.
Preventable through management action
- PM execution rate below schedule
- Warning signals not escalated
- Budget cut to maintenance programme
- Spare parts unavailable at critical time
- Skilled technician unavailable / untrained
- Bypass or workaround not resolved
- Work order backlog allowed to grow
- No accountability for PM compliance
Random / early life / undetectable
- Material defect in new component
- Manufacturing flaw in bearing or gear
- Early-life (infant mortality) failure
- Random failure at end of statistical life
- Design flaw not identified in commissioning
- Wearout beyond all reasonable prediction
The proportions in the visual above are illustrative and derive from patterns in reliability engineering literature rather than a single definitive study. They will vary by facility, by maintenance programme maturity, and by asset class. But the fundamental asymmetry (most failures attributable to management-origin causes, a minority attributable to random machine failure) is consistent across the research base. The implication is significant: the majority of breakdowns are preventable, and a facility with strong maintenance management prevents them. Where they are not prevented, it is because management systems are inadequate, not because machines are unpredictable.
The Six Management Failure Modes Behind Most Breakdowns
If most breakdowns have management origins, it helps to be specific about which management conditions most commonly appear in the causal chain. The following six are the recurring findings in post-breakdown investigations in heavy industrial settings.
PM Schedule Executed Incompletely
The most common antecedent. A bearing fails. The PM records show the last lubrication was seven weeks ago. The schedule calls for four weeks. The difference — three weeks of accumulated service without lubrication — was a management decision, whether deliberate or through neglect.
Investigation question: "Was the PM executed on schedule?" If not — this is a management failure. The machine did exactly what an under-maintained machine does.
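The investigation question above reduces to a date comparison. The following is a minimal sketch, assuming the PM records give a last-completed date and a scheduled interval; the function name `pm_overdue_days` and the dates are hypothetical, chosen to match the seven-week versus four-week example.

```python
from datetime import date, timedelta

def pm_overdue_days(last_done: date, interval: timedelta, failure_date: date) -> int:
    """Days by which a PM task was overdue when the failure occurred (0 if on schedule)."""
    due = last_done + interval
    return max((failure_date - due).days, 0)

# Hypothetical dates: last lubrication seven weeks before the failure,
# against a four-week scheduled interval.
failure = date(2024, 3, 1)
last_lube = failure - timedelta(weeks=7)
overdue = pm_overdue_days(last_lube, timedelta(weeks=4), failure)
print(overdue)  # 21 days, i.e. three weeks past due
```

A non-zero result does not explain why the task was missed; it only establishes that the investigation should move from the component to the schedule.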
Warning Signal Known but Not Acted On
The operator noted the change in brake feel on Tuesday. It's now Saturday. The crane has run four shifts since then. There is no work order. No one called maintenance. Nobody made the decision to stop — they made the decision to continue, without formally acknowledging the risk.
The signal was present. The management system — reporting culture, escalation pathway, authority to stop — failed to convert the signal into action.
Spare Parts Unavailability at the Critical Moment
The bearing is gone. The replacement is not in stock. Lead time is ten days. The crane is down for ten days while a production crisis unfolds. The spare parts inventory decision was made months earlier — the reorder point was set too low, or stock was rationalised out of the system to reduce inventory value. That was a management decision.
Spare parts management is maintenance management. An empty shelf when a critical component fails is not bad luck — it is the consequence of a parts strategy that underweighted availability risk.
Work Order Backlog Allowed to Grow Unaddressed
Maintenance notifications accumulate in the CMMS. Some of them are relatively minor — a noisy gearbox, a limit switch that's feeling stiff, a panel heater that's failed. None individually seems urgent enough to stop production for. The backlog grows over weeks. Eventually, the noisy gearbox seizes and the "relatively minor" item becomes a major unplanned stoppage.
Work order backlog growth is a management indicator. When it is rising, the maintenance system is completing less work than is being raised. Without management intervention, the backlog accumulates risk.
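Whether the backlog is rising can be read directly from weekly counts of work orders raised and closed. The sketch below assumes those two counts can be exported from the CMMS; the function names and the figures are hypothetical and only illustrate the calculation.

```python
def backlog_trend(opening_backlog: int, raised: list[int], closed: list[int]) -> list[int]:
    """Backlog at the end of each week, given work orders raised and closed per week."""
    if len(raised) != len(closed):
        raise ValueError("raised and closed must cover the same weeks")
    backlog = opening_backlog
    history = []
    for r, c in zip(raised, closed):
        backlog = max(backlog + r - c, 0)  # a backlog cannot go negative
        history.append(backlog)
    return history

def is_rising(history: list[int]) -> bool:
    """True if the backlog ended the window higher than it started."""
    return len(history) >= 2 and history[-1] > history[0]

# Hypothetical four-week window
weekly_raised = [30, 28, 35, 32]
weekly_closed = [25, 27, 26, 24]
trend = backlog_trend(120, weekly_raised, weekly_closed)
print(trend)             # [125, 126, 135, 143]
print(is_rising(trend))  # True
```

A simple count treats a failed panel heater and a noisy gearbox as equal. A facility would probably want to weight open items by criticality, which this sketch does not attempt.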
Inadequate Competency for the Maintenance Required
The crane drive has been modified to a newer VFD type. The maintenance team's training covers the original relay-logic system. When the VFD develops a fault, the technician's diagnostic approach is based on the wrong mental model — and an inappropriate intervention makes the fault worse. The training gap was known. It was on the training plan. The training hadn't been delivered yet.
Competency management is maintenance management. An undertrained technician working on equipment they don't fully understand is a management failure, not a technical one.
Breakdown Investigation That Stops at the Component
This is both a management failure and the mechanism by which management failures perpetuate themselves. The investigation concludes "bearing failed due to inadequate lubrication." The bearing is replaced. The lubrication interval remains the same. Three months later, the bearing fails again. The investigation that stops at the component never asks: why was the lubrication interval inadequate? Was it set correctly? Was it executed on schedule? The systemic issue is never addressed.
Root cause analysis that terminates at the component is not root cause analysis. It is component replacement with documentation.
What Management Responsibility Actually Looks Like
Identifying management failures as the primary breakdown drivers is not an exercise in assigning personal blame. It is an exercise in identifying where the leverage for improvement actually sits. If breakdowns are management failures, then improving the maintenance performance of a facility requires changing the management conditions that generate them — not simply working harder on the technical side of maintenance.
Make PM Compliance a Managed KPI at Leadership Level
PM completion rate — the percentage of planned maintenance tasks executed on schedule — should be reviewed at the same frequency and with the same seriousness as production output, quality metrics, and safety performance. When it drops below target, the question is not "why didn't maintenance do their job?" It is "what conditions prevented the work from being done?" — and the answer almost always involves resource allocation, access, or prioritisation decisions that management can address.
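As a rough illustration of the KPI, the sketch below computes PM completion rate as the fraction of tasks due in a period that were completed on or before their due date. The `PMTask` structure, the `grace_days` parameter, and the task data are assumptions made for the example, not a standard definition.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PMTask:
    task_id: str
    due: date
    completed: Optional[date]  # None if the task has not been done

def pm_compliance_rate(tasks: list[PMTask], grace_days: int = 0) -> float:
    """Fraction of PM tasks completed no later than the due date plus any grace period."""
    if not tasks:
        raise ValueError("no tasks due in the period")
    on_time = sum(
        1 for t in tasks
        if t.completed is not None and (t.completed - t.due).days <= grace_days
    )
    return on_time / len(tasks)

# Hypothetical tasks due in one reporting period
tasks = [
    PMTask("LUBE-01", date(2024, 3, 4), date(2024, 3, 4)),    # on time
    PMTask("INSP-07", date(2024, 3, 6), date(2024, 3, 12)),   # six days late
    PMTask("BRAKE-02", date(2024, 3, 8), None),               # not done
    PMTask("LUBE-02", date(2024, 3, 11), date(2024, 3, 10)),  # done early
]
rate = pm_compliance_rate(tasks)
print(f"{rate:.0%}")  # 50%
```

Counting late and missed tasks as equally non-compliant is a deliberate simplification here. The number only starts the conversation about what prevented the work from being done.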
Build Escalation Pathways That Work Under Production Pressure
A warning signal that a crane operator cannot escalate without triggering a production crisis is a warning signal that won't be escalated. The management system needs a defined pathway, a structured route from "operator notices something concerning" to "decision made about whether to continue or stop", that can be navigated quickly and without requiring the operator to be the person who calls a halt. That pathway depends on clear authority, clear communication channels, and explicit support for the person who raises the concern.
Require That Investigations Ask "Why Was Management Unprepared?"
Mandate that post-breakdown investigations include a systemic root cause question that cannot be closed by identifying the failed component. For every equipment failure, the investigation must address: Was the PM executed on schedule? Were there warning signals, and if so, what was done? Was the relevant spare part available? Was the maintenance team trained for this failure mode? If any answer reveals a management gap, that gap becomes an action item — not a footnote.
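One way to make this mandate enforceable is to build it into the investigation record, so that a report cannot be closed while a systemic question is unanswered or a revealed gap has no action item. The sketch below is hypothetical: the class, the question keys, and the closing rule are assumptions used to illustrate the idea, not a feature of any particular CMMS.

```python
from dataclasses import dataclass, field

# Keys standing in for the four systemic questions in the text.
SYSTEMIC_QUESTIONS = (
    "pm_executed_on_schedule",
    "warning_signals_acted_on",
    "spare_part_available",
    "team_trained_for_failure_mode",
)

@dataclass
class Investigation:
    failed_component: str
    # question -> True if no management gap was found, False if one was revealed
    answers: dict[str, bool] = field(default_factory=dict)
    # question -> description of the corrective action raised for that gap
    action_items: dict[str, str] = field(default_factory=dict)

    def gaps(self) -> list[str]:
        return [q for q in SYSTEMIC_QUESTIONS if self.answers.get(q) is False]

    def can_close(self) -> bool:
        all_answered = all(q in self.answers for q in SYSTEMIC_QUESTIONS)
        gaps_actioned = all(q in self.action_items for q in self.gaps())
        return all_answered and gaps_actioned

inv = Investigation("hoist bearing")
print(inv.can_close())  # False: naming the component answers none of the questions

inv.answers = {q: True for q in SYSTEMIC_QUESTIONS}
inv.answers["pm_executed_on_schedule"] = False
print(inv.can_close())  # False: a gap was revealed but has no action item

inv.action_items["pm_executed_on_schedule"] = "Review lubrication route staffing"
print(inv.can_close())  # True
```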
Protect Maintenance Budgets From In-Year Cuts
The most damaging pattern in maintenance budgeting is the in-year cut that happens when a production shortfall creates pressure on discretionary spending. Maintenance is cut; production pressure persists; the deferred maintenance accumulates; a breakdown occurs that costs five to ten times the savings from the budget cut. Protecting maintenance budgets requires leadership that understands the long-term cost structure of maintenance deferral — and that can articulate this to financial decision-makers before the cut is made, not after the breakdown investigation.
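The arithmetic behind the five-to-ten-times figure can be laid out for a financial audience. The sketch below is a simplification that assumes a single breakdown attributable to the cut; the function names and the saving of 100,000 are hypothetical, and only the multiplier range comes from the text above.

```python
def deferral_net_cost(budget_cut: float, breakdown_multiplier: float) -> float:
    """Net cost if the cut leads to one breakdown costing `breakdown_multiplier` times the saving."""
    breakdown_cost = budget_cut * breakdown_multiplier
    return breakdown_cost - budget_cut

def break_even_probability(breakdown_multiplier: float) -> float:
    """Breakdown probability above which the cut loses money in expectation.

    Expected loss p * multiplier * cut equals the saving when p = 1 / multiplier.
    """
    return 1.0 / breakdown_multiplier

cut = 100_000.0  # hypothetical in-year saving, in any currency
low = deferral_net_cost(cut, 5)
high = deferral_net_cost(cut, 10)
print(low, high)  # 400000.0 900000.0

print(break_even_probability(5))   # 0.2
print(break_even_probability(10))  # 0.1
```

Read this way, at the stated multipliers the cut only pays off if the chance of it causing a breakdown is below roughly 10 to 20 percent, which is the kind of figure that can be put to decision-makers before the cut is made.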
Create Psychological Safety for Maintenance Concerns
The culture that allows warning signals to be silenced — because raising a concern means being the person who stopped production — is a management-created culture, and it can be management-changed. It changes when the people who raise concerns are visibly supported and when decisions to continue in the face of known concerns are the ones that require formal justification. This is a leadership behaviour change before it is a system change.
Hold Maintenance and Operations Leaders Jointly Accountable for Reliability
When maintenance reliability is solely the maintenance department's accountability, the structural conflict between production scheduling and maintenance access never gets resolved at the right level. Reliability performance — breakdown frequency, PM compliance, backlog trend — should be a joint accountability of the production manager and the maintenance manager. This creates the alignment between access planning and maintenance execution that individual functional accountability cannot produce.
The Conversation We Need to Stop Avoiding
Every post-breakdown debrief that concludes with "the bearing failed" and goes no further is a management system choosing comfort over learning. The bearing did fail. But the bearing failed because it wasn't lubricated on schedule, because the lubrication schedule was stretched when headcount was cut, because the headcount cut was made in Q3 when the commodity price fell and the maintenance budget was the available lever, because the person who could have pushed back on that cut didn't have the data to argue with confidence that the cost of the cut would exceed the saving.
That chain of decisions — from commodity price to bearing condition — is a management story. It is not a machine story. And every industrial leader who prefers the machine story is buying temporary comfort at the cost of the next breakdown, and the one after that.
The machines are not failing us. We are failing the machines. The distinction matters — because only one of those framings leads to improvement.
Sources & References
- Reason, J. (1990). Human Error. Cambridge University Press. [Swiss Cheese Model — latent and active failures in management systems]
- Reason, J. (1997). Managing the Risks of Organizational Accidents. Ashgate Publishing. [Organisational accident model; management system failures]
- Vaughan, D. (1996). The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. University of Chicago Press. [Normalisation of deviance — production pressure overriding safety signals]
- Dekker, S. (2011). Drift into Failure: From Hunting Broken Components to Understanding Complex Systems. Ashgate Publishing.
- Dekker, S. (2014). The Field Guide to Understanding Human Error. 3rd ed. Ashgate Publishing.
- Moubray, J. (1997). Reliability-Centred Maintenance. 2nd ed. Butterworth-Heinemann. [Failure mode distributions — management-preventable vs random failures]
- Smith, A.M. & Hinchcliffe, G.R. (2004). RCM — Gateway to World Class Maintenance. Elsevier Butterworth-Heinemann.
- Hopkins, A. (2008). Failure to Learn: The BP Texas City Refinery Disaster. CCH Australia. [Management system failure as primary cause of major incident]
- Heinrich, H.W. (1931, rev. Petersen & Roos, 1980). Industrial Accident Prevention. McGraw-Hill. [Incident triangle and precursor theory]
- World Steel Association. (2023). Safety and Health Report. worldsteel.org [Management system factors in steel industry serious incidents]
- Bureau of Indian Standards. IS 807:2006 — Design, Erection and Testing of Cranes and Hoists. BIS, New Delhi. [Maintenance obligation context]
- Health and Safety Executive (UK). (2019). Causes of Major Incidents in the Manufacturing Sector: Systematic Review. HSE Research Programme. hse.gov.uk