MAINTENANCE INTELLIGENCE REPORT · FAILURE MODE ANALYSIS · ISSUE 07
When Predictive Maintenance Predicts Nothing Useful
The sensors are installed. The algorithms are running. The dashboards are live. Failures keep happening. This is the honest post-mortem that the vendor presentations never show you.
Let me describe a situation that more steel plant and crane maintenance professionals have experienced than would care to admit publicly. A predictive maintenance program is installed — sensors, edge compute, cloud analytics, a dashboard with health scores and remaining useful life estimates. Management is pleased. The vendor is satisfied. The maintenance team is cautiously optimistic.
Then, six months later, a hoist motor fails without warning. The PdM system had flagged nothing. Three weeks after that, a gearbox that the system identified as "high risk" is opened for inspection and found to be perfectly healthy. The maintenance team spends two shifts disassembling and reassembling an asset that didn't need intervention. Nobody says it out loud in the post-project review, but everyone in the room is thinking the same thing: we are not getting what we were promised.
This piece is a systematic post-mortem of why PdM programs underperform — not from a vendor's perspective, and not from a researcher's theoretical framework, but from the plant floor. Every failure mode described here has a real operational basis. None of them are solved by buying better sensors.
Failure Mode · Sensor & Coverage
You're monitoring what's convenient, not what fails
The most fundamental problem in most PdM deployments is the gap between what is monitored and what actually causes unplanned downtime. Sensor placement decisions are typically made on the basis of three factors: what is technically easy to instrument, what the vendor's standard package includes, and what seems important at initial scoping. The result is a sensor coverage map that reflects convenience more than failure probability.
In an overhead crane environment, a typical sensor package might cover hoist gearbox vibration, motor temperatures, drive current, and load cell readings. What's typically not monitored: runway rail surface condition and joint gaps, rope equaliser pulley bearings, crane bridge structural fatigue at welded joints, pendant cable insulation condition along its drag path, conductor bar section connections on long runways, and the mechanical condition of the brake assembly beyond the electromagnetic engagement signal.
The practical consequence: the failures that weren't predicted weren't predicted because they weren't being looked at. The PdM system performs well on the components it monitors. It offers zero predictive value on the much larger number of components it doesn't. And because the monitored components — motors, main gearboxes — are often the most robust, the highest-frequency failure modes may live entirely outside the sensor coverage.
Why this happens: Sensor coverage decisions are made at procurement time, without systematic FMEA to map failure modes against monitoring requirements. The most dangerous failure modes and the most monitored components are rarely the same list.
What to do: Before expanding sensor coverage, complete a Failure Mode and Effects Analysis (FMEA) for the specific equipment. Rank failure modes by criticality and frequency. Map existing sensor coverage against those ranked failure modes. The gap between the two lists is where your next unplanned failure will come from.
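As a concrete illustration of that gap analysis, here is a minimal sketch in Python. The failure modes, their severity, occurrence, and detection scores, and the monitored set are invented placeholders; substitute your own FMEA output and instrument register.

```python
# Minimal sketch: rank failure modes by RPN (severity x occurrence x detection)
# and expose the coverage gap. All data below is illustrative, not a real FMEA.

failure_modes = [
    # (failure mode, severity 1-10, occurrence 1-10, detection 1-10)
    ("hoist gearbox bearing wear",        7, 4, 3),
    ("runway rail joint gap growth",      8, 6, 8),
    ("rope equaliser pulley bearing",     9, 5, 9),
    ("brake lining mechanical wear",      9, 6, 7),
    ("pendant cable insulation abrasion", 6, 7, 8),
]

monitored = {"hoist gearbox bearing wear"}  # what the current sensor package covers

ranked = sorted(failure_modes, key=lambda fm: fm[1] * fm[2] * fm[3], reverse=True)
print("Rank  RPN  Covered  Failure mode")
for rank, (mode, s, o, d) in enumerate(ranked, start=1):
    covered = "yes" if mode in monitored else "NO"
    print(f"{rank:>4}  {s * o * d:>3}  {covered:>7}  {mode}")
# Every high-RPN row marked 'NO' is a candidate source of the next unplanned failure.
```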
Failure Mode · Data Quality
The algorithm is only as good as the data feeding it
Every PdM algorithm — whether it's a simple threshold alarm, a statistical anomaly detector, or a machine learning model — is trained on and evaluated against the data its sensors produce. If that data is unreliable, the algorithm's outputs will be unreliable in proportion. This is the garbage-in-garbage-out principle applied to industrial analytics, and it is dramatically underappreciated in most PdM implementations.
Data quality problems in steel plant crane sensor installations are both common and diverse. Sensor mounting degradation — vibration sensors whose mounting bolts have loosened, changing the coupling and therefore the frequency response. Thermal sensor contact corrosion — accelerated in steel plant atmospheres, causing progressive measurement drift. Cable shielding failures — causing electromagnetic interference from induction furnaces and VFDs to appear as spurious high-frequency vibration. Wireless sensor battery depletion — producing intermittent data that fills with default values or zeroes during the gap periods.
The dangerous outcome of data quality problems is not just missed predictions — it is false alarms. When noisy or drifting sensor data triggers anomaly detection algorithms, the resulting false positives erode team confidence in the system. An alert that is investigated and found to reflect sensor noise rather than a real fault condition is an alert that will be deprioritised the next time it fires. This is the opening chapter of alert fatigue.
Why this happens: Sensor commissioning is treated as a one-time activity. Data quality review is not part of the ongoing maintenance program. No one is systematically checking whether the sensors are still producing reliable data six months after installation.
What to do: Include sensor health checks in the standard periodic maintenance schedule — not just reading from sensors, but verifying mounting condition, cable integrity, calibration status, and signal quality indicators. Trend the trend data itself: a bearing vibration signal that has been suspiciously stable for three months despite varying load conditions is probably not measuring a stable bearing — it's measuring a stuck or saturated sensor.
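One way to implement that "trend the trend" check is a rolling-variance screen: a channel whose variance collapses despite varying load is more likely a stuck or saturated sensor than a stable bearing. A minimal sketch, assuming a pandas time series of RMS vibration readings; the column name, window, and variance floor are placeholders to tune per channel.

```python
# Sketch: flag channels whose rolling variance has collapsed ("suspiciously stable").
import pandas as pd

def flag_suspiciously_flat(df: pd.DataFrame, value_col: str = "vibration_rms",
                           window: str = "7D", min_std: float = 1e-3) -> pd.Series:
    """Boolean series marking periods where the signal barely moves at all."""
    rolling_std = df[value_col].rolling(window).std()
    return rolling_std < min_std

# Assumed usage, with a DatetimeIndex and hourly samples:
# df = pd.read_csv("hoist_gearbox_channel.csv", parse_dates=["timestamp"],
#                  index_col="timestamp")
# flat = flag_suspiciously_flat(df)
# if flat.tail(90 * 24).all():  # flat for roughly three months
#     print("inspect the sensor mounting and wiring, not the bearing")
```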
Failure Mode · Model Training
The model was trained somewhere else on something else
Most commercially available PdM software packages come with pre-trained models — machine learning or statistical models trained on vibration, temperature, or current data from some reference dataset, often from a different industry, different equipment type, and different operating environment. These models may perform well on the reference dataset. Their performance on your specific crane, in your specific steel plant bay, under your specific load cycle and ambient conditions, is an empirical question that is rarely adequately tested before the system goes live.
The specificity problem is severe in crane environments. The vibration signature of a healthy 50-tonne hoist gearbox on a ladle crane working over liquid steel, where radiant heat drives local ambient temperatures well above 50°C, is different from the vibration signature of an identical gearbox operating in a workshop bay at 30°C. The thermal expansion characteristics are different. The lubrication film behaviour is different. The load cycle is different. A model trained on one cannot be reliably transferred to the other without significant site-specific calibration.
The training data volume problem compounds this. Machine learning models for equipment health monitoring require substantial fault data — data from equipment that has actually progressed through various failure stages — to learn the signatures of developing faults. Most operators don't have this data in digital form. The vendor's model fills this gap with general reference data, but general reference data is by definition not specific to your equipment history, your failure modes, or your operating conditions.
Why this happens: Model validation against local conditions is time-consuming and delays commercial deployment. Vendors present their reference dataset performance as a reasonable proxy for local performance. It often isn't — and the difference isn't visible until the first missed prediction.
What to do: Demand site-specific model validation before sign-off. Run the system in shadow mode for a minimum period — receiving alerts but not acting on them — while experienced technicians continue physical inspections. Compare what the system predicts against what physical inspection finds. This produces a local validation dataset and recalibration opportunity before the system drives real maintenance decisions.
The Alert Fatigue Spiral — Illustrated
One of the most damaging dynamics in a failing PdM program is the progressive erosion of team response to system notifications. It develops gradually, it's almost always preventable, and by the time management notices it, the damage to program credibility is substantial.
[Chart: Illustrative alert response pattern, Month 1 to Month 6 (composite example). Alerts ignored vs alerts acted upon: as the false-alarm rate increases, the acted-upon rate collapses. Figures are illustrative; the pattern reflects published research on alert fatigue in industrial monitoring environments.]
Failure Mode · Human Factors
Alert fatigue makes the system invisible
Alert fatigue is the terminal stage of a PdM program that has produced too many false positives. It develops predictably: early in the program, every alert is investigated conscientiously. As false alarm rates accumulate — driven by data quality problems, model miscalibration, or threshold settings that weren't tuned to local conditions — the effort of investigating alerts that repeatedly reveal nothing eventually exceeds the perceived value of investigation. Teams begin triaging informally: known noisy sensors get mentally filtered out. Amber alerts get deprioritised. Eventually, even red alerts are met with "let's see if it goes away" rather than immediate response.
The danger arrives when a real fault generates an alert that looks identical to the dozens of previous false positives. The team's informed prior — this sensor tends to false-alarm — is actually correct based on their experience. But this time, the alert is real. And the response pattern that has been conditioned by months of false positives does not differentiate.
Alert fatigue is not a technology problem. It cannot be fixed by better sensors or smarter algorithms if the root cause is that the threshold settings were wrong or the model wasn't calibrated for local conditions. It is a human factors problem that develops as a consequence of an earlier technical problem — and it must be addressed at both levels.
Why this happens: Alert thresholds are set conservatively at deployment (to avoid missing faults) without a structured process for tuning them based on observed false-alarm rates. Nobody owns the ongoing responsibility of reducing false positives as the program matures.
What to do: Track alert quality metrics — not just alert volume, but the percentage of alerts that lead to actual findings when investigated. If the finding rate falls below a defined threshold (many organisations target 40–60% as a sustainable minimum), the program requires immediate threshold recalibration. Give a specific named person responsibility for alert quality management, with authority to suppress known noisy channels and retune thresholds.
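A minimal sketch of that finding-rate check, assuming a simple export of investigated alerts; the channel names, records, and 40% floor are placeholders.

```python
# Sketch: per-channel finding rate, flagging channels below a sustainability floor.
from collections import defaultdict

alerts = [
    # (alert channel, was it investigated, was a real fault found) - illustrative
    ("hoist_gearbox_vib",  True, True),
    ("hoist_gearbox_vib",  True, False),
    ("trolley_motor_temp", True, False),
    ("trolley_motor_temp", True, False),
    ("trolley_motor_temp", True, True),
]

FLOOR = 0.40  # below this finding rate, the channel needs threshold recalibration

stats = defaultdict(lambda: [0, 0])  # channel -> [investigated count, findings]
for channel, investigated, found in alerts:
    if investigated:
        stats[channel][0] += 1
        stats[channel][1] += int(found)

for channel, (n, found) in sorted(stats.items()):
    rate = found / n
    verdict = "OK" if rate >= FLOOR else "RECALIBRATE"
    print(f"{channel}: {found}/{n} findings ({rate:.0%}) -> {verdict}")
```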
PdM Programs That Work vs Programs That Don't
The difference between PdM programs that deliver measurable value and those that deliver dashboard activity without operational impact is usually not the technology. It is almost always the implementation maturity and the human systems surrounding the technology.
Programs that don't deliver:

- Sensor placement driven by vendor standard package
- No site-specific FMEA before deployment
- Pre-trained models, no local validation period
- Alert thresholds set once and never reviewed
- No defined ownership for alert response
- Physical inspections discontinued after "going digital"
- Sensor health checked only when system flags an issue
- Technicians not involved in system tuning
- Success measured by alerts generated, not faults found
- No integration between PdM alerts and CMMS workflow

Programs that deliver:

- Sensor map derived from ranked FMEA failure modes
- FMEA completed before sensor specification
- Shadow mode validation period with physical inspection comparison
- Quarterly threshold review based on finding rate data
- Named role: alert quality owner with authority to tune
- Physical inspections continue as parallel intelligence layer
- Sensor health included in periodic maintenance schedule
- Experienced technicians involved in model recalibration
- Success measured by fault find rate and unplanned downtime reduction
- PdM alerts automatically generate CMMS work orders
Failure Mode · Operating Context
The system doesn't know what the crane is doing
Vibration monitoring algorithms interpret sensor readings against a model of "normal." The definition of normal for a crane gearbox is inherently load-dependent: the vibration signature at 10% of rated load is different from the signature at 80% of rated load. If the monitoring system doesn't know what load the crane was handling when a given reading was taken, it cannot correctly interpret whether a high vibration reading represents a fault or simply a high-load operating condition.
This context deprivation problem affects multiple sensor types. Motor temperature readings mean different things at different ambient temperatures and duty cycles. Current monitoring alerts mean different things during acceleration phases versus steady-state running. Rope tension readings vary with thermal expansion of the rope under different ambient conditions in the bay. Without contextual metadata — load, speed, ambient temperature, operating mode — the algorithm is pattern-matching against an underdetermined model.
In steel plant environments, the operating context variability is particularly pronounced. A ladle crane in the primary steelmaking bay operates in ambient temperatures ranging from under 30°C during cold weather to over 55°C near furnace mouths during a heat. The baseline vibration signature of the same healthy gearbox differs measurably between these conditions — and a system that doesn't account for this will generate thermally-correlated false positives every hot season.
Why this happens: Sensor systems are often deployed as standalone packages without integration into the crane's PLC or process data historian. The load cell data and the vibration sensor data exist in different systems that never talk to each other.
What to do: Insist on contextual data integration as part of the PdM implementation. The vibration monitoring system should receive load, speed, ambient temperature, and operating mode data from the crane's PLC or SCADA system in real time. This enables load-normalised analysis — comparing vibration readings at equivalent load conditions rather than absolute readings across varying operating states.
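A minimal sketch of what load-normalised analysis means in practice, assuming the PLC load signal has already been joined onto the vibration history; column names and the bin count are placeholders.

```python
# Sketch: score each vibration reading against the baseline for its own load bin,
# rather than against a single global threshold.
import pandas as pd

def load_normalised_zscore(df: pd.DataFrame, n_bins: int = 5) -> pd.Series:
    """z-score of vibration_rms within its load_pct bin (load from the crane PLC)."""
    bins = pd.cut(df["load_pct"], bins=n_bins)
    grouped = df.groupby(bins, observed=True)["vibration_rms"]
    return (df["vibration_rms"] - grouped.transform("mean")) / grouped.transform("std")

# Assumed columns: timestamp, vibration_rms, load_pct (0-100).
# df["z"] = load_normalised_zscore(df)
# anomalous = df["z"].abs() > 3  # unusual *for that load condition*, not in absolute terms
```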

Failure Mode · Scope & Expectations
PdM was sold as a replacement for maintenance expertise
Perhaps the most damaging failure mode of all is a cultural one that originates in the sales process and the management expectations it creates. PdM technology is sometimes positioned — by vendors and by enthusiastic internal advocates — as a means of reducing dependence on experienced maintenance personnel. The implicit promise: automate the detection work, reduce your skilled headcount requirement, and achieve equivalent or better reliability at lower cost.
This promise is almost universally wrong in practice, and where it has been acted upon — where experienced maintenance teams have been reduced because "the system will tell us when something's wrong" — the outcomes have been poor. The system tells you when its sensors detect an anomaly that its algorithm classifies above a threshold. It does not tell you about the informal knowledge, the operator-observed behavioural changes, the accumulated contextual understanding of specific machine quirks that experienced technicians carry. When that knowledge base is reduced, the effective sensor coverage of the total machine population falls — even if the instrument count stays the same.
Why this happens: Procurement decisions for PdM are sometimes made by finance and IT departments rather than maintenance leadership. The ROI model shows savings from reduced headcount. The value of tacit maintenance knowledge does not appear in a spreadsheet.
What to do: Reframe the business case for PdM as extending the capability of experienced maintenance personnel — not replacing them. The target state is a maintenance team that knows more, earlier, about more failure modes. Experienced technicians should be primary users and co-developers of PdM insights, not passengers being replaced by the system.
A Structured Path to PdM Programs That Actually Work
None of the failure modes described above are fatal if caught and addressed before they become entrenched. Here is the sequence that turns a struggling PdM implementation into one that delivers real predictive value — tested against operational reality, not vendor benchmarks.
FMEA First — Before Any Sensor Discussion
Complete a Failure Mode and Effects Analysis for every asset in scope. Rank failure modes by criticality (severity × occurrence × detection — the classic RPN approach described in ISO 31010 and the AIAG FMEA methodology). Use this ranking to define monitoring requirements. Only then specify sensors. This sequence reverses the typical vendor-led approach and grounds coverage decisions in failure reality rather than product catalogues.
Shadow Mode Validation — Minimum 90 Days
Run the PdM system in parallel with existing inspection practices for at least 90 days before allowing it to drive maintenance decisions. During this period, log every PdM alert alongside the outcome of physical investigation. This produces your local precision (what percentage of alerts are real findings?) and recall (what percentage of actual faults did the system flag?). Do not proceed without this data — it is your baseline for all subsequent tuning.
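Once every alert and every inspection outcome is logged, the precision and recall arithmetic is trivial. A toy sketch, assuming one record per monitored component and period; the numbers are invented.

```python
# Sketch: precision and recall from a shadow-mode validation log.
records = [
    # (system raised an alert, inspection confirmed a real fault)
    (True, True), (True, False), (True, False),
    (False, False), (False, True), (False, False),
]

tp = sum(a and f for a, f in records)        # alerted, fault confirmed
fp = sum(a and not f for a, f in records)    # alerted, nothing found
fn = sum(f and not a for a, f in records)    # fault found, never alerted

precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of alerts that were real
recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of real faults flagged
print(f"precision={precision:.0%}  recall={recall:.0%}")
# Low precision forecasts alert fatigue; low recall forecasts missed failures.
```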
Alert Quality Management — Assign Ownership
Name a specific person as Alert Quality Owner. Their responsibility: track the finding rate of every alert channel monthly, escalate channels below the target finding rate for threshold review, and suppress or quarantine known noisy channels pending recalibration. This role is not an analyst role — it requires enough maintenance domain knowledge to distinguish a true fault signature from instrumentation noise in discussion with the technical team.
Integration — PdM Feeds CMMS, Not a Separate Dashboard
PdM alerts that require maintenance action should automatically generate work orders in the CMMS — not sit in a separate vendor dashboard that the maintenance planner never checks. This integration is the single most impactful workflow change available. It ensures alerts are routed to the people who can act on them, tracked through completion, and outcomes recorded in a way that enables later analysis of whether the alert was justified.
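The plumbing itself is modest. The sketch below shows the shape of such an integration only: the endpoint path, payload fields, and bearer-token auth are hypothetical, since every CMMS exposes its own API.

```python
# Illustrative glue: turn a PdM alert into a CMMS work order via a REST call.
# Treat this as a shape, not a spec; adapt it to your CMMS vendor's actual API
# (SAP PM, Maximo, Fiix, and others all differ).
import json
import urllib.request

def create_work_order(alert: dict, cmms_url: str, token: str) -> None:
    payload = {
        "title": f"PdM alert {alert['alert_id']}: {alert['channel']}",
        "asset_id": alert["asset_id"],
        "priority": alert["severity"],
        "source_alert_id": alert["alert_id"],  # key field for closing the loop later
    }
    request = urllib.request.Request(
        f"{cmms_url}/api/work-orders",  # hypothetical endpoint
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        print("work order created, HTTP", response.status)
```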
Sensor Maintenance — Included in Periodic Schedule
Treat sensor infrastructure as maintained equipment, not set-and-forget instrumentation. Vibration sensor mounting check: quarterly at minimum. Thermocouple contact condition: every six months. Calibration verification for load cells and current transducers: annual. Signal quality review (noise floor, dropout rate, out-of-range readings): monthly, from the analytics platform. The sensors are the foundation — if they degrade silently, everything downstream degrades with them.
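Much of that monthly signal quality review can be automated. A sketch, assuming one pandas series per channel; the expected sample count and range limits are per-channel assumptions that belong in a channel register.

```python
# Sketch: dropout rate, out-of-range fraction, and a crude noise-floor estimate
# for one sensor channel over a review period.
import pandas as pd

def signal_quality(series: pd.Series, expected_samples: int,
                   lo: float, hi: float) -> dict:
    valid = series.dropna()
    return {
        "dropout_rate": 1 - len(valid) / expected_samples,     # missing samples
        "out_of_range": ((valid < lo) | (valid > hi)).mean(),  # stuck at rail, miswired
        "noise_floor":  valid.diff().abs().median(),           # sample-to-sample noise
    }

# Assumed usage for a channel sampled once per minute over a 30-day month:
# monthly = pd.read_parquet("hoist_vib_march.parquet")["vibration_rms"]
# print(signal_quality(monthly, expected_samples=30 * 24 * 60, lo=0.0, hi=50.0))
```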
Failure Mode · Workflow Integration
The alert exists in one system. The work order is in another. Nobody connects them.
This failure mode deserves its own entry because it is so common and so damaging, even in programs that have resolved most of the technical problems. The PdM platform generates an alert. The alert is visible on the vendor's dashboard. The maintenance planner creates a work order in the CMMS — a different system. The work order is completed and closed. The outcome of the investigation — what was found, what action was taken, whether the alert was justified — is never fed back into the PdM system. The loop never closes.
Without closed-loop feedback, PdM programs cannot learn from their own outcomes. The algorithm that generated an alert does not receive information about whether the alert was correct. The threshold that was set at commissioning is never updated based on the actual relationship between alert level and real fault severity. The precision and recall of the system are not tracked, because the outcome data required to calculate them never leaves the CMMS to be integrated with the alert history.
This is purely an implementation problem, not a technology problem. The integration between PdM alert management and CMMS work order management is technically straightforward. It is simply not configured, because it wasn't specified in the project scope, or the two systems are from different vendors who don't communicate, or nobody identified it as a priority during implementation.
Why this happens: PdM implementations are often scoped as technology projects — sensors, analytics, dashboard. The workflow integration between alert and action is a process design project that typically falls between IT, operations, and maintenance responsibilities without clear ownership.
What to do: Make CMMS integration a mandatory deliverable in the PdM project scope. Define the specific fields that must be captured in the work order to enable feedback: the alert ID that triggered the work order, the finding at inspection (fault found / no fault found / sensor issue), the severity of the finding, and the action taken. This data is the raw material for ongoing program improvement — and it costs little to configure if specified upfront.
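As an illustration, the minimum closed-loop record might look like the sketch below; the field and enum names simply mirror the list above and are not any particular CMMS schema.

```python
# Sketch: the minimum feedback record linking a closed work order to its alert.
from dataclasses import dataclass
from enum import Enum

class Finding(Enum):
    FAULT_FOUND = "fault_found"
    NO_FAULT_FOUND = "no_fault_found"
    SENSOR_ISSUE = "sensor_issue"

@dataclass
class WorkOrderFeedback:
    work_order_id: str
    source_alert_id: str   # links the outcome back to the originating PdM alert
    finding: Finding
    severity: int          # site-defined scale, e.g. 1 (cosmetic) to 5 (imminent)
    action_taken: str

# With one such record per closed work order, per-channel precision becomes a
# one-line aggregation instead of a forensic reconstruction.
```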
The Prescription — Making PdM Earn Its Budget
Rx — Structured Improvement Plan
Audit your current sensor coverage against your FMEA
Map every active sensor against your equipment's known failure modes. The gap is your blind spot. Prioritise filling gaps by failure mode criticality, not by ease of installation.
Calculate your system's actual finding rate
Pull your last 12 months of PdM alerts. For each alert, what did physical investigation find? If you can't answer this because you don't have the outcome data, that itself is the first problem to solve.
Review sensor health, not just sensor readings
Schedule a physical check of every sensor installation. Mounting torque, cable condition, contact integrity, calibration status. You may find that a significant proportion of your sensor network is not producing reliable data.
Restore physical inspections if they were discontinued
PdM and physical inspection are complementary. If your program replaced physical inspections rather than supplementing them, restore the inspection regime. The combined intelligence is always more reliable than either alone.
Connect your PdM alerts to your CMMS workflow
Ensure every PdM alert that requires investigation generates a CMMS work order automatically, and that work order outcome is captured and linked back to the originating alert.
Involve your experienced technicians in model tuning
The people who know the machines are your most important calibration resource. Their judgment about which alerts are meaningful and which aren't is data that should inform threshold adjustments. This is not a software problem — it is a knowledge integration problem.
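One way to turn those technician verdicts into tuning data: sweep candidate thresholds over past labelled alerts and keep the one that best separates confirmed faults from noise. A toy sketch with invented readings and a simple F1 criterion; real channels deserve more data and more care.

```python
# Sketch: retune one channel's alarm threshold from technician-labelled alerts.
def best_threshold(labelled):
    """labelled: list of (reading at alert time, technician confirmed a fault)."""
    candidates = sorted({reading for reading, _ in labelled})

    def f1(threshold):
        tp = sum(1 for r, ok in labelled if r >= threshold and ok)
        fp = sum(1 for r, ok in labelled if r >= threshold and not ok)
        fn = sum(1 for r, ok in labelled if r < threshold and ok)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(candidates, key=f1)

history = [(4.2, False), (5.1, False), (6.8, True),
           (7.4, True), (5.9, False), (8.0, True)]
print("retuned threshold:", best_threshold(history))  # -> 6.8 on this toy data
```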
The Honest Summary
Predictive maintenance technology works. Under the right conditions — appropriate sensor coverage for the actual failure modes, reliable data quality, site-specific model calibration, integrated alert management, preserved physical inspection capability, and experienced human interpretation — it delivers measurable value: earlier fault detection, fewer unplanned outages, better resource allocation for maintenance activities.
Under typical implementation conditions — vendor standard sensor package, pre-trained models applied without local validation, alert thresholds set once and never reviewed, CMMS integration deferred to a future phase, physical inspections gradually discontinued as the "digital system" is trusted more — it delivers something that looks like predictive maintenance from the management dashboard and performs like reactive maintenance on the plant floor.
The gap between these two outcomes is entirely determined by implementation quality and ongoing program management discipline. Neither the sensors nor the algorithms are the limiting factor. The limiting factor is almost always the organisational commitment to do the unglamorous work: the FMEA that precedes sensor placement, the shadow mode validation that takes three months before anything changes, the monthly alert quality review that nobody wants to own, the sensor maintenance that gets deprioritised because the sensors "seem fine."
If your PdM program is not predicting anything useful, the technology is almost certainly not the problem. Work backwards from the failure modes described here. You will find the root cause in one of them — usually in several.
Predictive maintenance doesn't fail because the physics of condition monitoring is wrong. It fails because implementing it correctly is harder than the vendor presentations suggest — and the failure modes are almost all human and organisational, not technical.
Observation from post-mortem analysis of PdM implementations in industrial facilities.

Sources & References
- ISO 13381-1:2015. Condition Monitoring and Diagnostics of Machines — Prognostics — Part 1: General Guidelines. ISO. [PdM methodology and limitations]
- ISO 13379-1:2012. Condition Monitoring and Diagnostics of Machines — Data Interpretation and Diagnostics Techniques. ISO.
- ISO 13373-1:2002. Condition Monitoring and Diagnostics of Machines — Vibration Condition Monitoring. ISO.
- AIAG & VDA. (2019). Failure Mode and Effects Analysis (FMEA) — FMEA Handbook. 1st ed. Automotive Industry Action Group. [FMEA RPN methodology]
- ISO 10816-3:2009. Mechanical Vibration — Evaluation of Machine Vibration by Measurements on Non-Rotating Parts. ISO. [Vibration severity zone limitations]
- Jardine, A.K.S., Lin, D. & Banjevic, D. (2006). "A Review on Machinery Diagnostics and Prognostics Implementing Condition-Based Maintenance." Mechanical Systems and Signal Processing, 20(7), 1483–1510.
- Hashemian, H.M. (2011). "State-of-the-Art Predictive Maintenance Techniques." IEEE Transactions on Instrumentation and Measurement, 60(1), 226–236.
- Deloitte Insights. (2017). Predictive Maintenance and the Smart Factory. Deloitte. deloitte.com
- Gartner. (2022). Market Guide for Asset Performance Management Solutions. Gartner Research. [PdM program maturity and failure patterns]
- Bureau of Indian Standards. IS 3177:1999 — Code of Practice for Electric Overhead Travelling Cranes. BIS, New Delhi.
- World Steel Association. (2023). Digitalisation in Steel: Technology Adoption and Operational Outcomes. worldsteel.org
- Lee, J., Bagheri, B. & Kao, H.A. (2015). "A Cyber-Physical Systems Architecture for Industry 4.0-Based Manufacturing Systems." Manufacturing Letters, 3, 18–23. [IIoT architecture and condition monitoring limitations]