RCM Framework
Chapter 9 — RCM Framework
From Static Spreadsheet to Living System
Traditional Reliability Centered Maintenance is a static exercise. A team of engineers spends weeks in a conference room filling out worksheets: functions, functional failures, failure modes, consequences, maintenance strategies. The output is a binder that sits on a shelf, consulted occasionally, updated rarely, and eventually forgotten as plant conditions change and equipment is modified.
RAPID AI transforms RCM from a static artifact into a living system. Sensor data continuously updates failure probabilities. Risk Priority Numbers recalculate in real time. Maintenance strategies adjust automatically as conditions change. The RCM workbook becomes a dynamic decision engine rather than a historical record.
The Function-First Approach
Dibyendu De’s RCM method inverts the traditional starting point. Classical RCM starts with components: “This pump has a bearing. What can go wrong with the bearing?” De’s method starts with function: “This pump must deliver 200 cubic meters per hour of cooling water at 35 meters head. How can that function fail?”
This inversion is critical because it ensures that maintenance effort is always directed at preserving function, not just replacing parts. A reliability engineer focused on components might replace a bearing that shows early damage. A reliability engineer focused on function might instead adjust the operating point to reduce the load that is damaging the bearing, preserving the function while extending the component life.
The RCM Hierarchy
The hierarchy flows from function to action:
Function (what the machine must do) -> Functional Failure (how can the function fail?) -> Failure Mode (what specific mechanism causes the functional failure?) -> Failure Cause (what initiates the failure mode?) -> Consequence (what happens when it fails?) -> Detection Method (can we see it coming?) -> Maintenance Strategy (what should we do?) -> Maintenance Task (specific action to perform)
Example: Cooling Water Pump P-101
| Level | Content |
|---|---|
| Function | Deliver 200 m3/hr at 35m head |
| Functional Failure | Reduced flow below 150 m3/hr |
| Failure Mode | Cavitation (FM-003) |
| Failure Cause | Insufficient NPSH / suction blockage |
| Consequence | Operational — process derate (severity 3) |
| Detection | Online — pressure + vibration + flow (effectiveness 5/5) |
| Strategy | Condition Based Maintenance |
| Task | NPSH verification + suction line inspection |
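The hierarchy in the table above can be held as one record per failure mode. The following is a minimal sketch, assuming a flat structure; the class name `RcmRecord` and its field names are illustrative and do not come from the RAPID AI codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RcmRecord:
    """One pass through the function-to-task hierarchy (hypothetical field names)."""
    function: str
    functional_failure: str
    failure_mode: str
    failure_cause: str
    consequence: str
    severity: int  # 1-5, as in the Consequence row
    detection: str
    strategy: str
    task: str


# Values copied from the P-101 example table.
p101_cavitation = RcmRecord(
    function="Deliver 200 m3/hr at 35 m head",
    functional_failure="Reduced flow below 150 m3/hr",
    failure_mode="Cavitation (FM-003)",
    failure_cause="Insufficient NPSH / suction blockage",
    consequence="Operational: process derate",
    severity=3,
    detection="Online: pressure + vibration + flow",
    strategy="Condition Based Maintenance",
    task="NPSH verification + suction line inspection",
)
```

Reading the fields top to bottom reproduces the function-first order: the record starts from what the pump must deliver and ends with the task that preserves it.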
Risk Priority Number
Every failure mode is scored using the Risk Priority Number:
RPN = Severity x Probability x Detectability
Each factor is ranked 1-5, producing an RPN range of 1-125.
| Factor | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Severity | Negligible | Minor | Moderate | Major | Catastrophic |
| Probability | Rare (<0.01/yr) | Unlikely (0.01-0.05) | Possible (0.05-0.10) | Likely (0.10-0.20) | Frequent (>0.20) |
| Detectability | Almost certain | High | Moderate | Low | Almost undetectable |
Note that detectability is inversely scored: a rank of 1 means the failure is almost certainly detectable (good), while 5 means it is almost undetectable (bad).
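The scoring can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, and because the Probability bands in the table share their edge values, the sketch assigns a shared edge (0.05, 0.10) to the lower rank.

```python
def probability_rank(annual_rate: float) -> int:
    """Map an annual failure rate (failures/yr) to the 1-5 Probability rank."""
    if annual_rate < 0.01:
        return 1  # Rare
    if annual_rate <= 0.05:
        return 2  # Unlikely
    if annual_rate <= 0.10:
        return 3  # Possible
    if annual_rate <= 0.20:
        return 4  # Likely
    return 5      # Frequent


def rpn(severity: int, probability: int, detectability: int) -> int:
    """RPN = Severity x Probability x Detectability, each ranked 1-5."""
    for name, rank in (("severity", severity),
                       ("probability", probability),
                       ("detectability", detectability)):
        if rank not in range(1, 6):
            raise ValueError(f"{name} rank must be 1-5, got {rank}")
    return severity * probability * detectability


# Severity 3, annual rate 0.08/yr (rank 3), almost certain detection (rank 1).
example_rpn = rpn(3, probability_rank(0.08), 1)
```

Because detectability is inversely scored, a well-instrumented failure mode (rank 1) keeps the RPN low even when severity is moderate.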
RPN Action Mapping
| RPN Range | Risk Level | Response |
|---|---|---|
| 1-20 | Low | Routine monitoring |
| 21-50 | Moderate | Predictive maintenance |
| 51-100 | High | Proactive intervention |
| > 100 | Critical | Immediate action or redesign |
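The mapping table translates directly into a lookup. A minimal sketch, with `risk_level` as a hypothetical helper name:

```python
def risk_level(rpn: int) -> tuple[str, str]:
    """Return (risk level, response) for an RPN in the 1-125 range."""
    if not 1 <= rpn <= 125:
        raise ValueError(f"RPN must be between 1 and 125, got {rpn}")
    if rpn <= 20:
        return ("Low", "Routine monitoring")
    if rpn <= 50:
        return ("Moderate", "Predictive maintenance")
    if rpn <= 100:
        return ("High", "Proactive intervention")
    return ("Critical", "Immediate action or redesign")
```

For example, `risk_level(36)` falls in the 21-50 band and returns the Moderate level with a predictive maintenance response.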
Dynamic RPN
In traditional RCM, the RPN is calculated once and filed. In RAPID AI, the Probability factor is continuously updated from sensor data:
- Module B confidence scores adjust the probability based on detected fault evidence
- Module B.2 trend severity adjusts the probability based on degradation rate
- Module F Weibull P_30 (30-day failure probability) provides a statistically grounded probability estimate
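This means an RPN that was 36 (moderate) at commissioning, for example severity 4, probability 3, and detectability 3, can escalate to 60 (high) when sensor data reveals accelerating degradation and the Probability rank rises to 5. The escalation automatically triggers a strategy change from predictive maintenance to proactive intervention.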
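A 30-day failure probability of the kind Module F provides can be illustrated with the standard two-parameter Weibull conditional probability. This is a sketch, not Module F's implementation: the chapter does not specify how P_30 is computed or how it is mapped onto the 1-5 Probability rank, and the function and parameter names here are assumptions.

```python
import math


def weibull_p30(age_days: float, beta: float, eta_days: float,
                horizon_days: float = 30.0) -> float:
    """Probability of failure within the horizon, given survival to age_days.

    Uses R(t) = exp(-(t / eta)^beta) and P = 1 - R(t + h) / R(t).
    """
    def reliability(t: float) -> float:
        return math.exp(-((t / eta_days) ** beta))

    return 1.0 - reliability(age_days + horizon_days) / reliability(age_days)


# Wear-out behaviour (beta > 1): the same 30-day window gets riskier with age.
p_new = weibull_p30(age_days=0, beta=2.0, eta_days=300)
p_aged = weibull_p30(age_days=200, beta=2.0, eta_days=300)
```

With a shape parameter of 1 the result does not depend on age, which is the purely random case; with a shape above 1 the probability climbs as the component ages, which is what drives the Probability factor upward.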
The Six Strategy Tiers
The RCM decision algorithm evaluates detected failure modes through a priority cascade. The first matching condition determines the strategy:
Tier 1 — Immediate Action / Escalation
Condition: consequence_category in [Safety, Environmental] AND severity >= 4
These failures threaten human safety or environmental compliance. Response is escalation to engineering and operations, with possible emergency shutdown. No diagnostic confidence threshold is applied — the consequence alone drives the decision.
Example: Motor insulation failure (FM-014) with severity 5 and safety consequence. Even a moderate confidence detection triggers escalation because the potential consequence (fire, electrical hazard) is unacceptable.
Tier 2 — Condition Based Maintenance
Condition: detectable_online = true AND confidence >= 0.70
The failure progression is visible in sensor trends, and RAPID AI’s diagnostic confidence is high enough to act on. This is the primary operating mode: it covers the 89% of failures that Nowlan and Heap showed are NOT age-related (patterns D, E, and F).
The 0.70 confidence threshold is the RCM activation boundary defined in the confidence scoring standard. Below 0.70, the evidence is too uncertain for maintenance action; the system continues monitoring.
Example: Bearing outer race defect (FM-008) detected by envelope spectrum with confidence 0.90. Strategy: trend vibration and replace bearing in planned window.
Tier 3 — Scheduled Inspection
Condition: detectable_offline = true
The failure cannot be continuously monitored but can be found through periodic inspection — visual checks, thermography routes, alignment audits, lubrication sampling, NDT.
Example: Impeller fatigue crack (FM-002) detectable only by dye penetrant inspection at shutdown. Strategy: annual NDT inspection at planned outage.
Tier 4 — Planned Replacement
Condition: estimated_rul_days <= 30
The remaining useful life from Module D/F is known and short enough that replacement should be scheduled in the next available shutdown window.
Example: Mechanical seal (FM-006) with leak rate trending upward and RUL estimate of 25 days. Strategy: planned replacement at next available 8-hour window.
Tier 5 — Time-Based Replacement
Condition: failure_mode is age_related AND not reliably detectable
These are the minority of failures (Nowlan and Heap patterns A and B, roughly 6% of cases) where life is predictable and monitoring cannot catch degradation.
Example: Gasket leakage (FM-007) from creep/aging. Not reliably detectable online. Strategy: replace gaskets at overhaul (every 48 months).
Tier 6 — Run to Failure
Condition: low consequence AND low repair cost AND spare available
When failure has no safety impact, replacement is cheap, and downtime is acceptable, the most economical strategy is to let it run until it breaks.
Example: Cooling fan for motor enclosure — consequence is minor (motor runs slightly warmer), repair cost is low ($200), spare is in stock. Strategy: replace on failure.
Fallthrough — Engineering Review
If no tier matches, the algorithm falls through to Engineering Review, a signal that the data is conflicting, the failure mode is poorly understood, or a design issue may exist. This may trigger Module G (Contradiction Driven Engineering).
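The priority cascade can be sketched as a chain of early returns in which the first matching tier wins. This is an illustration only: the `FailureModeState` class is hypothetical, the three boolean flags for Tier 6 stand in for criteria the chapter states qualitatively, and "not reliably detectable" is interpreted here as neither online nor offline detectable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureModeState:
    consequence_category: str            # e.g. "Safety", "Environmental", "Operational"
    severity: int                        # 1-5
    detectable_online: bool = False
    detectable_offline: bool = False
    confidence: float = 0.0              # diagnostic confidence, 0.0-1.0
    estimated_rul_days: Optional[float] = None
    age_related: bool = False
    low_consequence: bool = False        # hypothetical flags for the Tier 6 criteria
    low_repair_cost: bool = False
    spare_available: bool = False


def select_strategy(fm: FailureModeState) -> str:
    """Evaluate the tiers in order; the first matching condition decides."""
    if fm.consequence_category in ("Safety", "Environmental") and fm.severity >= 4:
        return "Tier 1: Immediate Action / Escalation"  # no confidence threshold
    if fm.detectable_online and fm.confidence >= 0.70:
        return "Tier 2: Condition Based Maintenance"
    if fm.detectable_offline:
        return "Tier 3: Scheduled Inspection"
    if fm.estimated_rul_days is not None and fm.estimated_rul_days <= 30:
        return "Tier 4: Planned Replacement"
    # Assumption: "not reliably detectable" means no online or offline method.
    if fm.age_related and not (fm.detectable_online or fm.detectable_offline):
        return "Tier 5: Time-Based Replacement"
    if fm.low_consequence and fm.low_repair_cost and fm.spare_available:
        return "Tier 6: Run to Failure"
    return "Engineering Review"


# FM-014: safety consequence, severity 5, only moderate confidence.
insulation = FailureModeState("Safety", 5, detectable_online=True, confidence=0.5)
# FM-008: bearing outer race defect detected online with confidence 0.90.
bearing = FailureModeState("Operational", 3, detectable_online=True, confidence=0.90)
```

Here `select_strategy(insulation)` returns Tier 1 despite the low confidence, because the consequence check runs first, while `select_strategy(bearing)` returns Tier 2.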
The Lambda-Driven Lubrication Protocol
Lubrication decisions deserve special treatment because lubrication failure is the single most common initiator of bearing damage. RAPID AI uses the lambda ratio, the relationship between minimum oil film thickness and composite surface roughness:
lambda = h_min / sqrt(Ra_1^2 + Ra_2^2)
| Lambda | Regime | Action |
|---|---|---|
| > 3.0 | Full EHL film | Target zone — surfaces fully separated |
| 2.0-3.0 | Mixed lubrication | Monitor closely, consider relubrication |
| < 2.0 | Boundary regime | Immediate action — surfaces in contact |
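The formula and regime table can be sketched as two small functions. The function names and the micrometre units are assumptions; the thresholds are the ones in the table above, with the edge values 2.0 and 3.0 treated as mixed lubrication.

```python
import math


def lambda_ratio(h_min_um: float, ra1_um: float, ra2_um: float) -> float:
    """lambda = h_min / sqrt(Ra_1^2 + Ra_2^2), all lengths in the same unit."""
    return h_min_um / math.hypot(ra1_um, ra2_um)


def lubrication_regime(lam: float) -> str:
    """Classify a lambda value using the chapter's regime table."""
    if lam > 3.0:
        return "Full EHL film"
    if lam >= 2.0:
        return "Mixed lubrication"
    return "Boundary regime"


# A 0.5 um film over surfaces with Ra 0.3 um and 0.4 um: composite roughness 0.5 um.
lam = lambda_ratio(0.5, 0.3, 0.4)
regime = lubrication_regime(lam)
```

In this example the film is only as thick as the composite roughness, so lambda is about 1 and the regime is boundary, which calls for immediate action.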
Ultrasound Protocol
Ultrasound (UE) provides real-time lambda estimation without oil sampling:
| UE Change (dB over baseline) | Action |
|---|---|
| +8 dB | Trigger inspection |
| +12 dB | Micro-relubrication (controlled grease addition) |
| +16 dB | Full corrective action (purge and replace grease) |
This protocol connects to FRETTLSM factor L011 (film breakdown) and AFB rules AFB03/AFB04 (lubrication starvation and wrong viscosity).
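The decibel thresholds can be expressed as a simple ladder. This sketch assumes each threshold applies at or above the listed value and that a rise below +8 dB needs no action; neither point is stated explicitly in the table, and `ultrasound_action` is a hypothetical name.

```python
def ultrasound_action(db_over_baseline: float) -> str:
    """Map an ultrasound rise over baseline (dB) to the protocol action."""
    if db_over_baseline >= 16:
        return "Full corrective action (purge and replace grease)"
    if db_over_baseline >= 12:
        return "Micro-relubrication (controlled grease addition)"
    if db_over_baseline >= 8:
        return "Trigger inspection"
    return "No action"  # assumption: below the first threshold
```

Checking the highest threshold first keeps the ladder unambiguous: a +13 dB reading resolves to micro-relubrication rather than inspection.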
The Nowlan and Heap Justification
The fundamental statistical justification for RAPID AI’s CBM-first approach comes from the landmark 1978 study by F. Stanley Nowlan and Howard F. Heap for United Airlines. Their research, which became the foundation of modern RCM, revealed six failure patterns:
| Pattern | Shape | Prevalence | Description |
|---|---|---|---|
| A | Bathtub curve | 4% | Infant mortality, then random, then wear-out |
| B | Wear-out | 2% | Roughly constant failure rate, then a pronounced wear-out zone |
| C | Gradual aging | 5% | Steadily increasing failure rate with no distinct wear-out age |
| D | Random with initial rise | 7% | Brief break-in, then constant random rate |
| E | Purely random | 14% | Constant failure rate, no age dependence |
| F | Infant mortality then random | 68% | High early failure rate, then constant |
Key finding: 89% of failures (patterns D + E + F) are NOT age-related. Time-based replacement cannot catch random failures. Only condition monitoring — measuring what the machine is actually doing right now — detects the symptoms before functional failure.
This is why Tier 2 (CBM) is the primary strategy in RAPID AI’s RCM framework, and why Tier 5 (time-based replacement) applies to only 6% of failure modes.
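The two percentages quoted above follow directly from the prevalence column of the table. A short arithmetic check, using only the table values:

```python
# Prevalence (%) of each Nowlan and Heap pattern, from the table above.
prevalence = {"A": 4, "B": 2, "C": 5, "D": 7, "E": 14, "F": 68}

not_age_related = sum(prevalence[p] for p in "DEF")  # patterns D + E + F
tier5_share = prevalence["A"] + prevalence["B"]      # patterns A + B
total = sum(prevalence.values())
```

Here `not_age_related` is 89, `tier5_share` is 6, and the six patterns together account for 100% of the cases.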
The Waddington Effect
The Nowlan and Heap data also explains why time-based replacement can make things worse. Pattern F (68% of failures) shows high infant mortality followed by a constant random rate. Replacing a component introduces a new infant mortality period. If the replacement interval is shorter than necessary, the organization spends more time in the infant mortality zone than in the stable random zone, so maintenance itself increases the failure rate.
This is the Waddington Effect, named after the World War II operational research that first quantified it. RAPID AI’s condition-based approach avoids this trap by replacing components only when the condition evidence warrants it, not when a calendar says so.
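The effect can be illustrated numerically. The sketch below assumes a Weibull hazard with shape below 1 as a simplified stand-in for the infant-mortality portion of pattern F (it does not model the later flattening to a constant rate), and it compares the average hazard over replacement intervals of different lengths. The parameter values are illustrative, not measured data.

```python
def mean_failure_rate(interval_days: float, beta: float, eta_days: float) -> float:
    """Average Weibull hazard over [0, T]: H(T) / T, with H(T) = (T / eta)^beta."""
    return (interval_days / eta_days) ** beta / interval_days


# Decreasing hazard (beta < 1): every replacement restarts the high-risk early period.
quarterly = mean_failure_rate(90, beta=0.5, eta_days=1000)
yearly = mean_failure_rate(360, beta=0.5, eta_days=1000)
```

Under these assumptions the 90-day replacement policy has twice the average failure rate of the 360-day policy. With a shape of exactly 1 the interval makes no difference, which is why scheduled replacement gains nothing against purely random failures.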
RCM Decision Logic (SAE JA1011)
RAPID AI implements the RCM decision algorithm per SAE JA1011/JA1012 evaluation criteria.
The Seven Questions of RCM
For each function/functional failure/failure mode:
- What are the functions? (What does the equipment do?)
- What are the functional failures? (How can each function fail?)
- What are the failure modes? (What causes each functional failure?)
- What are the failure effects? (What happens when the failure occurs?)
- What are the failure consequences? (Does it matter?)
- What can be done to prevent/predict? (Proactive tasks)
- What if no proactive task is applicable? (Default actions)
Consequence Categories (Decision Tree)
    Failure Mode Identified
    |
    +-- Hidden failure? (not evident to operating crew)
    |   +-- Safety consequence? -> Redesign mandatory
    |   +-- No safety? -> Scheduled failure-finding task
    |
    +-- Evident failure?
        +-- Safety/environmental consequence?
        |   +-- Task MUST reduce risk to acceptable level
        |   +-- Condition-based task (CBM) -> RAPID AI primary path
        |   +-- Scheduled restoration
        |   +-- Scheduled discard
        |   +-- None effective? -> Redesign mandatory
        |
        +-- Operational consequence? (affects output/quality/service)
        |   +-- Task must be cost-effective (cost < consequences)
        |   +-- CBM preferred -> RAPID AI
        |   +-- Scheduled restoration
        |   +-- Scheduled discard
        |   +-- None effective? -> Accept, redesign, or change procedures
        |
        +-- Non-operational consequence? (only repair cost)
            +-- Task must cost less than repair
            +-- CBM if justified -> RAPID AI
            +-- Scheduled tasks
            +-- None effective? -> Run to failure (acceptable)
RAPID AI’s RCM Mapping
| RCM Concept | RAPID AI Module | Implementation |
|---|---|---|
| Functional failure | IMS Column: failure_modes | 320 cataloged modes |
| Failure mode | Module B rules | 275+ fault detection rules |
| Failure effect | Module E assessment | Severity scoring |
| Consequence analysis | Module G CDE | Design-out recommendations |
| CBM task | Modules A-C | Automated monitoring |
| Task interval | Module F RUL | Dynamic P-F interval |
| Task effectiveness | Confidence scoring | 0.0-1.0 with propagation |
P-F Interval and Inspection Frequency
The P-F interval is the time between first detectable evidence of failure (P) and functional failure (F):
    P ---------------------- F
    |                        |
    | <-- P-F Interval -->   |
    |                        |
    | Inspection interval    |
    | must be <= P-F / 2     |
    | (gives 2 chances       |
    | to catch the fault)    |
Typical P-F Intervals by Technology:
| Technology | Typical P-F | RAPID AI Module | Recommended Interval |
|---|---|---|---|
| Vibration | 1-9 months | A -> B -> B.2 -> B.3 | Monthly -> weekly as condition degrades |
| Oil analysis | 2-6 months | External input | Monthly |
| Thermography | 1-3 months | Temperature rules | Monthly |
| Visual inspection | 1 week - 3 months | Manual input | Weekly |
| Performance monitoring | 1-6 months | Process classification | Continuous |
RAPID AI dynamically adjusts monitoring frequency based on health state:
- Normal: Standard interval (monthly route)
- Watch: Double frequency (every two weeks)
- Alert: 4x frequency (weekly)
- Alarm: Continuous online monitoring
- Critical: Operations notified, shutdown planning
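The health-state schedule and the P-F / 2 rule can be combined in one helper. This is a sketch under assumptions: a 28-day base route stands in for "monthly", the Critical state is assumed to keep continuous monitoring, and the function name is hypothetical.

```python
from typing import Optional

# Frequency multipliers for the route-based states listed above.
FREQUENCY_MULTIPLIER = {"Normal": 1, "Watch": 2, "Alert": 4}


def inspection_interval_days(health_state: str, pf_interval_days: float,
                             base_interval_days: float = 28.0) -> Optional[float]:
    """Days between inspections, or None for continuous online monitoring."""
    if health_state in ("Alarm", "Critical"):
        return None  # assumption: Critical stays on continuous monitoring
    interval = base_interval_days / FREQUENCY_MULTIPLIER[health_state]
    # Never inspect less often than half the P-F interval (two chances to catch the fault).
    return min(interval, pf_interval_days / 2)
```

For a fault with a 10-day P-F interval in the Alert state, the weekly route (7 days) is still too sparse, so the helper returns 5 days.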
EN 13306 Maintenance Terminology
RAPID AI aligns with EN 13306:2017 maintenance terminology:
| EN 13306 Term | RAPID AI Equivalent |
|---|---|
| Corrective maintenance | Module E ACT actions (after fault) |
| Preventive maintenance | Module F scheduled tasks |
| Condition-based maintenance | Modules A-C monitoring pipeline |
| Predetermined maintenance | Time-based rules in RCM workbook |
| Predictive maintenance | Module F RUL estimation |
| Reliability centered maintenance | Complete RCM decision algorithm |
| Failure mode | IMS failure_modes column (320 modes) |
| Fault | Module B detected condition |
| Degraded state | SSI Watch/Alert states |
| Critical failure | SSI Critical state |
Dynamic RCM Workbooks
The RCM framework operates through 10 structured worksheets (CSVs in the implementation layer):
| Sheet | Purpose | Key Fields |
|---|---|---|
| 1. Asset Hierarchy | Plant -> System -> Equipment -> Component | asset_id, type, criticality, location |
| 2. Functional Failures | Function -> Performance standard -> How it fails | function, performance_standard, functional_failure |
| 3. Failure Modes | Mechanism, cause, annual probability, detectability | failure_mode, failure_cause, p_of_f, detectable |
| 4. Consequences | Category, severity, safety/env risk, production loss, cost | consequence_category, severity, est_repair_cost |
| 5. Detection Methods | Sensor type, technique, effectiveness, online/offline | detection_method, effectiveness, online |
| 6. Decision Logic | Strategy selection based on consequence + detectability + confidence | strategy, justification |
| 7. Maintenance Tasks | Specific actions, skills, duration, tools | task_description, skill_required, est_duration_hrs |
| 8. Maintenance Plan | Frequency, trigger, responsible team, window | frequency, trigger_type, responsible_team |
| 9. Spares | Required parts, lead time, minimum stock | spare_part, lead_time, min_stock |
| 10. Review History | Audit trail of RCM decisions and revisions | review_date, reviewer, change_description |
In production, these worksheets are populated per-asset from the IMS defaults and then customized with plant-specific data. The sensor data pipeline continuously updates sheets 3 (probability), 5 (detection effectiveness), and 6 (strategy selection), making the workbook a living document rather than a static artifact.
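A probability update to sheet 3 can be sketched with the standard library `csv` module. Everything specific here is hypothetical: the `asset_id` column on this sheet, the row values, and the helper names are illustrations rather than the production schema.

```python
import csv
import io

# Hypothetical Sheet 3 extract; the p_of_f values and failure causes are illustrative.
SHEET3 = """asset_id,failure_mode,failure_cause,p_of_f,detectable
P-101,FM-003,Insufficient NPSH / suction blockage,0.08,true
P-101,FM-006,Seal face wear,0.04,true
"""


def load_failure_modes(text: str) -> list[dict]:
    """Parse the sheet and convert p_of_f and detectable to typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["p_of_f"] = float(row["p_of_f"])
        row["detectable"] = row["detectable"].lower() == "true"
        rows.append(row)
    return rows


def update_probability(rows: list[dict], asset_id: str,
                       failure_mode: str, new_p: float) -> int:
    """Overwrite p_of_f for matching rows; return how many rows changed."""
    updated = 0
    for row in rows:
        if row["asset_id"] == asset_id and row["failure_mode"] == failure_mode:
            row["p_of_f"] = new_p
            updated += 1
    return updated


rows = load_failure_modes(SHEET3)
changed = update_probability(rows, "P-101", "FM-003", 0.15)
```

After the update, the FM-003 row carries the new annual probability while the FM-006 row is untouched.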
Connecting RCM to the Pipeline
The RCM framework connects to every layer of the RAPID AI pipeline:
| Pipeline Module | RCM Connection |
|---|---|
| Module A (GUARD) | Data quality affects detection effectiveness scores |
| Module B (SENSE) | Matched rules map to specific failure modes in the RCM workbook |
| Module B.2 (TREND) | Trend severity updates failure probability |
| Module C (FUSE) | SSI drives health stage which drives strategy selection |
| Module D (PROGNOSE) | RUL estimate determines Tier 4 activation |
| Module E (ACT) | Priority score selects from the maintenance task catalog |
| Module F (WEIBULL) | 30-day failure probability updates RPN in real time |
| Module G (CDE) | Contradiction detection triggers Redesign strategy |
This end-to-end connection means that every sensor reading potentially affects the RCM strategy. A bearing showing accelerating degradation (Module B.2) with declining stability (Module B.3) and high system severity (Module C) will automatically escalate from routine monitoring to urgent intervention — without any human analyst having to manually recalculate the RPN.
Standards Alignment
| Standard | Relevance to This Chapter |
|---|---|
| SAE JA1011/JA1012 — RCM evaluation criteria | The six-tier RCM decision algorithm directly implements SAE JA1011’s seven questions and JA1012’s detailed evaluation criteria, including consequence classification (safety, environmental, operational, hidden) and strategy selection logic. |
| ISO 14224 — Reliability and maintenance data | The 10-sheet RCM workbook structure (asset hierarchy, functional failures, failure modes, consequences, detection, decision logic, tasks, plan, spares, review) implements ISO 14224’s standardized reliability data collection framework. |
| ISO 13381-1 — Prognostics | The dynamic RPN computation (real-time probability updates from Modules B, B.2, and F) extends traditional RCM with ISO 13381-1-compliant prognostic information that continuously adjusts maintenance strategy. |
| ISO 55000/55001 — Asset management | The function-first RCM approach and the maintenance maturity progression align with ISO 55000’s asset management system requirements for risk-based, value-driven maintenance decision-making. |
| EN 13306 — Maintenance terminology | The six strategy tiers (immediate action, condition-based maintenance, scheduled inspection, planned replacement, time-based replacement, run to failure) use EN 13306-compliant maintenance type definitions. |
| IEC 61649 — Weibull analysis | The Module F Weibull P_30 estimate and the Nowlan-Heap failure pattern analysis reference the statistical failure modeling formalized in IEC 61649. |
Changelog
| Version | Date | Author | Changes |
|---|---|---|---|
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |