Chapter 9 — RCM Framework

From Static Spreadsheet to Living System

Traditional Reliability Centered Maintenance is a static exercise. A team of engineers spends weeks in a conference room filling out worksheets: functions, functional failures, failure modes, consequences, maintenance strategies. The output is a binder that sits on a shelf, consulted occasionally, updated rarely, and eventually forgotten as plant conditions change and equipment is modified.

RAPID AI transforms RCM from a static artifact into a living system. Sensor data continuously updates failure probabilities. Risk Priority Numbers recalculate in real time. Maintenance strategies adjust automatically as conditions change. The RCM workbook becomes a dynamic decision engine rather than a historical record.

The Function-First Approach

Dibyendu De’s RCM method inverts the traditional starting point. Classical RCM starts with components: “This pump has a bearing. What can go wrong with the bearing?” De’s method starts with function: “This pump must deliver 200 cubic meters per hour of cooling water at 35 meters head. How can that function fail?”

This inversion is critical because it ensures that maintenance effort is always directed at preserving function, not just replacing parts. A reliability engineer focused on components might replace a bearing that shows early damage. A reliability engineer focused on function might instead adjust the operating point to reduce the load that is damaging the bearing, preserving the function while extending the component life.

The RCM Hierarchy

The hierarchy flows from function to action:

Function (what the machine must do)
  -> Functional Failure (how can the function fail?)
    -> Failure Mode (what specific mechanism causes the functional failure?)
      -> Failure Cause (what initiates the failure mode?)
        -> Consequence (what happens when it fails?)
          -> Detection Method (can we see it coming?)
            -> Maintenance Strategy (what should we do?)
              -> Maintenance Task (specific action to perform)

Example: Cooling Water Pump P-101

Level	Content
Function	Deliver 200 m3/hr at 35m head
Functional Failure	Reduced flow below 150 m3/hr
Failure Mode	Cavitation (FM-003)
Failure Cause	Insufficient NPSH / suction blockage
Consequence	Operational — process derate (severity 3)
Detection	Online — pressure + vibration + flow (effectiveness 5/5)
Strategy	Condition Based Maintenance
Task	NPSH verification + suction line inspection
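
One way to hold a single path through the hierarchy in code is a flat record with one field per level. The sketch below is illustrative only: the class name `RcmRecord` and its field names are assumptions, not part of the RAPID AI codebase, and the values are copied from the P-101 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RcmRecord:
    """One path through the hierarchy, from function down to task (hypothetical type)."""
    function: str
    functional_failure: str
    failure_mode: str
    failure_cause: str
    consequence: str
    detection_method: str
    strategy: str
    task: str


# Values taken from the Cooling Water Pump P-101 example.
p101_cavitation = RcmRecord(
    function="Deliver 200 m3/hr at 35 m head",
    functional_failure="Reduced flow below 150 m3/hr",
    failure_mode="Cavitation (FM-003)",
    failure_cause="Insufficient NPSH / suction blockage",
    consequence="Operational: process derate (severity 3)",
    detection_method="Online: pressure + vibration + flow (effectiveness 5/5)",
    strategy="Condition Based Maintenance",
    task="NPSH verification + suction line inspection",
)
```

Keeping the record immutable means a strategy change produces a new record rather than silently overwriting the old one, which suits an audit trail.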

Risk Priority Number

Every failure mode is scored using the Risk Priority Number:

RPN = Severity x Probability x Detectability

Each factor is ranked 1-5, producing an RPN range of 1-125.

Factor	1	2	3	4	5
Severity	Negligible	Minor	Moderate	Major	Catastrophic
Probability	Rare (<0.01/yr)	Unlikely (0.01-0.05)	Possible (0.05-0.10)	Likely (0.10-0.20)	Frequent (>0.20)
Detectability	Almost certain	High	Moderate	Low	Almost undetectable

Note that detectability is inversely scored: a rank of 1 means the failure is almost certainly detectable (good), while 5 means it is almost undetectable (bad).
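
The scoring can be sketched in a few lines. The function names are hypothetical, and the handling of a rate that falls exactly on a band edge (assigned here to the higher rank) is an assumption, since the table does not say which side an edge belongs to.

```python
from bisect import bisect_right

# Upper edges of probability ranks 1-4, in failures per year, from the table above.
PROBABILITY_EDGES = [0.01, 0.05, 0.10, 0.20]


def probability_rank(annual_rate: float) -> int:
    """Map an annual failure rate to a 1-5 rank.

    Assumption: a rate exactly on an edge is assigned to the higher rank.
    """
    if annual_rate < 0:
        raise ValueError("annual_rate must be non-negative")
    return bisect_right(PROBABILITY_EDGES, annual_rate) + 1


def rpn(severity: int, probability: int, detectability: int) -> int:
    """Risk Priority Number: the product of three 1-5 ranks."""
    for name, value in (("severity", severity),
                        ("probability", probability),
                        ("detectability", detectability)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return severity * probability * detectability


# A moderate-severity mode (3), 0.15 failures/yr (rank 4), moderate detectability (3).
example = rpn(3, probability_rank(0.15), 3)
```

Here `example` evaluates to 3 x 4 x 3 = 36.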

RPN Action Mapping

RPN Range	Risk Level	Response
1-20	Low	Routine monitoring
21-50	Moderate	Predictive maintenance
51-100	High	Proactive intervention
> 100	Critical	Immediate action or redesign
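
The mapping above amounts to a simple threshold lookup. A minimal sketch, with a hypothetical function name:

```python
def risk_level(rpn: int) -> tuple[str, str]:
    """Return (risk level, response) for an RPN, following the action mapping table."""
    if not 1 <= rpn <= 125:
        raise ValueError("RPN must be between 1 and 125")
    if rpn <= 20:
        return ("Low", "Routine monitoring")
    if rpn <= 50:
        return ("Moderate", "Predictive maintenance")
    if rpn <= 100:
        return ("High", "Proactive intervention")
    return ("Critical", "Immediate action or redesign")
```

Because each factor is an integer from 1 to 5, the only product above 100 is 5 x 5 x 5 = 125, so under this scale the Critical band is reached only when all three factors are at their maximum.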

Dynamic RPN

In traditional RCM, the RPN is calculated once and filed. In RAPID AI, the Probability factor is continuously updated from sensor data:

Module B confidence scores adjust the probability based on detected fault evidence
Module B.2 trend severity adjusts the probability based on degradation rate
Module F Weibull P_30 (30-day failure probability) provides a statistically grounded probability estimate

This means an RPN that was 36 (moderate) at commissioning, for example severity 4 x probability 3 x detectability 3, can escalate to 60 (high) when sensor data reveals accelerating degradation and the probability rank rises to 5, automatically triggering a strategy change from predictive maintenance to proactive intervention.
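
The re-ranking step can be sketched as follows. How RAPID AI actually combines the Module B, B.2, and F inputs is not described here, so this example uses only the Module F 30-day probability and makes two loudly labeled assumptions: the 30-day value is annualized by assuming a constant hazard over the year, and the annualized value is compared directly against the per-year bands in the Probability row. All function names are hypothetical.

```python
from bisect import bisect_right

# Upper edges of probability ranks 1-4 (per year) from the RPN table.
PROBABILITY_EDGES = [0.01, 0.05, 0.10, 0.20]


def annual_probability(p_30: float) -> float:
    """Annualize a 30-day failure probability (assumption: constant hazard)."""
    if not 0.0 <= p_30 <= 1.0:
        raise ValueError("p_30 must be between 0 and 1")
    return 1.0 - (1.0 - p_30) ** (365.0 / 30.0)


def dynamic_rpn(severity: int, detectability: int, p_30: float) -> int:
    """Recompute the RPN with the probability rank derived from P_30."""
    rank = bisect_right(PROBABILITY_EDGES, annual_probability(p_30)) + 1
    return severity * rank * detectability


# Severity 4 and detectability 3 are held fixed; only P_30 changes.
commissioning = dynamic_rpn(4, 3, p_30=0.005)  # annualized ~0.06 -> rank 3
degraded = dynamic_rpn(4, 3, p_30=0.03)        # annualized ~0.31 -> rank 5
```

With these inputs the RPN moves from 36 to 60, crossing from the moderate band into the high band without any change to severity or detectability.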

The Six Strategy Tiers

The RCM decision algorithm evaluates detected failure modes through a priority cascade. The first matching condition determines the strategy:

Tier 1 — Immediate Action / Escalation

Condition: consequence_category in [Safety, Environmental] AND severity >= 4

These failures threaten human safety or environmental compliance. Response is escalation to engineering and operations, with possible emergency shutdown. No diagnostic confidence threshold is applied — the consequence alone drives the decision.

Example: Motor insulation failure (FM-014) with severity 5 and safety consequence. Even a moderate confidence detection triggers escalation because the potential consequence (fire, electrical hazard) is unacceptable.

Tier 2 — Condition Based Maintenance

Condition: detectable_online = true AND confidence >= 0.70

The failure progression is visible in sensor trends, and RAPID AI’s diagnostic confidence is high enough to act on. This is the primary operating mode: it covers the 89% of failures that Nowlan and Heap showed are NOT age-related (patterns D, E, and F).

The 0.70 confidence threshold is the RCM activation boundary defined in the confidence scoring standard. Below 0.70, the evidence is too uncertain for maintenance action; the system continues monitoring.

Example: Bearing outer race defect (FM-008) detected by envelope spectrum with confidence 0.90. Strategy: trend vibration and replace bearing in planned window.

Tier 3 — Scheduled Inspection

Condition: detectable_offline = true

The failure cannot be continuously monitored but can be found through periodic inspection — visual checks, thermography routes, alignment audits, lubrication sampling, NDT.

Example: Impeller fatigue crack (FM-002) detectable only by dye penetrant inspection at shutdown. Strategy: annual NDT inspection at planned outage.

Tier 4 — Planned Replacement

Condition: estimated_rul_days <= 30

The remaining useful life from Module D/F is known and short enough that replacement should be scheduled in the next available shutdown window.

Example: Mechanical seal (FM-006) with leak rate trending upward and RUL estimate of 25 days. Strategy: planned replacement at next available 8-hour window.

Tier 5 — Time-Based Replacement

Condition: failure_mode is age_related AND not reliably detectable

These are the minority of failures (Nowlan and Heap patterns A and B, roughly 6% of cases) where life is predictable and monitoring cannot catch degradation.

Example: Gasket leakage (FM-007) from creep/aging. Not reliably detectable online. Strategy: replace gaskets at overhaul (every 48 months).

Tier 6 — Run to Failure

Condition: low consequence AND low repair cost AND spare available

When failure has no safety impact, replacement is cheap, and downtime is acceptable, the most economical strategy is to let it run until it breaks.

Example: Cooling fan for motor enclosure — consequence is minor (motor runs slightly warmer), repair cost is low ($200), spare is in stock. Strategy: replace on failure.

Fallthrough — Engineering Review

If no tier matches, the algorithm falls through to Engineering Review — a signal that the data is conflicting, the failure mode is poorly understood, or a design issue may exist. This may trigger Module G (Contradiction Driven Engineering).
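
The cascade described above can be sketched as a chain of early returns, where the first matching condition wins. The type and field names below are hypothetical. Two interpretations are assumptions: "not reliably detectable" is read as neither online nor offline detection being available, and the three Tier 6 conditions are passed in as pre-evaluated booleans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureModeState:
    """Inputs to the strategy cascade (hypothetical structure)."""
    consequence_category: str            # e.g. "Safety", "Environmental", "Operational"
    severity: int                        # 1-5
    detectable_online: bool = False
    detectable_offline: bool = False
    confidence: float = 0.0              # diagnostic confidence, 0.0-1.0
    estimated_rul_days: Optional[float] = None
    age_related: bool = False
    low_consequence: bool = False
    low_repair_cost: bool = False
    spare_available: bool = False


def select_strategy(fm: FailureModeState) -> str:
    """Evaluate the tiers in order and return the first strategy that matches."""
    if fm.consequence_category in ("Safety", "Environmental") and fm.severity >= 4:
        return "Immediate Action / Escalation"      # Tier 1: no confidence threshold
    if fm.detectable_online and fm.confidence >= 0.70:
        return "Condition Based Maintenance"        # Tier 2
    if fm.detectable_offline:
        return "Scheduled Inspection"               # Tier 3
    if fm.estimated_rul_days is not None and fm.estimated_rul_days <= 30:
        return "Planned Replacement"                # Tier 4
    # Assumption: "not reliably detectable" means no online or offline detection.
    if fm.age_related and not (fm.detectable_online or fm.detectable_offline):
        return "Time-Based Replacement"             # Tier 5
    if fm.low_consequence and fm.low_repair_cost and fm.spare_available:
        return "Run to Failure"                     # Tier 6
    return "Engineering Review"                     # Fallthrough


insulation = FailureModeState("Safety", severity=5, detectable_online=True, confidence=0.5)
bearing = FailureModeState("Operational", severity=3, detectable_online=True, confidence=0.90)
```

In this sketch `insulation` resolves to Tier 1 despite its 0.5 confidence, while `bearing` resolves to Condition Based Maintenance. The ordering matters: a mode that is detectable online but below the 0.70 confidence threshold skips Tier 2 and is tested against the later tiers.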

The Lambda-Driven Lubrication Protocol

Lubrication decisions deserve special treatment because lubrication failure is the single most common initiator of bearing damage. RAPID AI uses the lambda ratio — the relationship between minimum oil film thickness and composite surface roughness:

lambda = h_min / sqrt(Ra_1^2 + Ra_2^2)

Lambda	Regime	Action
> 3.0	Full EHL film	Target zone — surfaces fully separated
2.0-3.0	Mixed lubrication	Monitor closely, consider relubrication
< 2.0	Boundary regime	Immediate action — surfaces in contact
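
As a worked illustration of the formula and the regime table, the sketch below computes lambda from a film thickness and two roughness values and classifies the result. The function names are hypothetical, the input values are made up for the example, and assigning a lambda of exactly 2.0 or 3.0 to the mixed band is an assumption.

```python
import math


def lambda_ratio(h_min: float, ra_1: float, ra_2: float) -> float:
    """lambda = h_min / sqrt(Ra_1^2 + Ra_2^2); all inputs in the same length unit."""
    composite_roughness = math.hypot(ra_1, ra_2)
    if composite_roughness <= 0:
        raise ValueError("composite roughness must be positive")
    return h_min / composite_roughness


def lubrication_regime(lam: float) -> str:
    """Classify lambda using the thresholds in the table above."""
    if lam > 3.0:
        return "Full EHL film"
    if lam >= 2.0:
        return "Mixed lubrication"
    return "Boundary regime"


# Illustrative values in micrometres: composite roughness = sqrt(0.3^2 + 0.4^2) = 0.5
lam = lambda_ratio(h_min=1.2, ra_1=0.3, ra_2=0.4)   # about 2.4
regime = lubrication_regime(lam)
```

With these inputs lambda is about 2.4, which falls in the mixed lubrication band.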

Ultrasound Protocol

Ultrasound (UE) provides real-time lambda estimation without oil sampling:

UE Change (dB over baseline)	Action
+8 dB	Trigger inspection
+12 dB	Micro-relubrication (controlled grease addition)
+16 dB	Full corrective action (purge and replace grease)

This protocol connects to FRETTLSM factor L011 (film breakdown) and AFB rules AFB03/AFB04 (lubrication starvation and wrong viscosity).
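
The ultrasound thresholds can be expressed as a descending lookup. In this sketch the function name is hypothetical, each threshold is treated as "at or above" (an assumption), and the "Continue monitoring" result below +8 dB is also an assumption, since the table does not list an action for that range.

```python
# (threshold in dB over baseline, action), checked from highest to lowest.
UE_ACTIONS = [
    (16.0, "Full corrective action (purge and replace grease)"),
    (12.0, "Micro-relubrication (controlled grease addition)"),
    (8.0, "Trigger inspection"),
]


def ue_action(db_over_baseline: float) -> str:
    """Return the action for a given ultrasound rise over baseline."""
    for threshold, action in UE_ACTIONS:
        if db_over_baseline >= threshold:
            return action
    return "Continue monitoring"  # assumption: no action is listed below +8 dB
```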

The Nowlan and Heap Justification

The fundamental statistical justification for RAPID AI’s CBM-first approach comes from the landmark 1978 study by F. Stanley Nowlan and Howard F. Heap for United Airlines. Their research, which became the foundation of modern RCM, revealed six failure patterns:

Pattern	Shape	Prevalence	Description
A	Bathtub curve	4%	Infant mortality, then random, then wear-out
B	Wear-out	2%	Constant or slowly rising failure rate, then a distinct wear-out zone
C	Gradual aging	5%	Steadily increasing failure rate with no distinct wear-out zone
D	Random with initial rise	7%	Low failure rate when new, rising quickly to a constant random rate
E	Purely random	14%	Constant failure rate, no age dependence
F	Infant mortality then random	68%	High early failure rate, then constant

Key finding: 89% of failures (patterns D + E + F) are NOT age-related. Time-based replacement cannot catch random failures. Only condition monitoring — measuring what the machine is actually doing right now — detects the symptoms before functional failure.

This is why Tier 2 (CBM) is the primary strategy in RAPID AI’s RCM framework, and why Tier 5 (time-based replacement) applies to only 6% of failure modes.
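
The headline percentages follow directly from the prevalence column, as this short check shows (the variable names are illustrative):

```python
# Prevalence (%) of each Nowlan and Heap pattern, from the table above.
PREVALENCE = {"A": 4, "B": 2, "C": 5, "D": 7, "E": 14, "F": 68}

total = sum(PREVALENCE.values())                       # 100
not_age_related = sum(PREVALENCE[p] for p in "DEF")    # 7 + 14 + 68 = 89
time_based_candidates = PREVALENCE["A"] + PREVALENCE["B"]  # 4 + 2 = 6
```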

The Waddington Effect

The Nowlan and Heap data also explains why time-based replacement can make things worse. Pattern F (68% of failures) shows high infant mortality followed by a constant random rate. Replacing a component introduces a new infant mortality period. If the replacement interval is shorter than necessary, the organization spends more time in the infant mortality zone than in the stable random zone — actually increasing the failure rate through maintenance.

This is the Waddington Effect, named after C. H. Waddington, whose World War II operational research on RAF Coastal Command aircraft first quantified it. RAPID AI’s condition-based approach avoids this trap by replacing components only when the condition evidence warrants it, not when a calendar says so.

RCM Decision Logic (SAE JA1011)

RAPID AI implements the RCM decision algorithm per SAE JA1011/JA1012 evaluation criteria.

The Seven Questions of RCM

For each function/functional failure/failure mode:

What are the functions? (What does the equipment do?)
What are the functional failures? (How can each function fail?)
What are the failure modes? (What causes each functional failure?)
What are the failure effects? (What happens when the failure occurs?)
What are the failure consequences? (Does it matter?)
What can be done to prevent/predict? (Proactive tasks)
What if no proactive task is applicable? (Default actions)

Consequence Categories (Decision Tree)

Failure Mode Identified
    |
    +-- Hidden failure? (not evident to operating crew)
    |   +-- Safety consequence? -> Redesign mandatory
    |   +-- No safety? -> Scheduled failure-finding task
    |
    +-- Evident failure?
        +-- Safety/environmental consequence?
        |   +-- Task MUST reduce risk to acceptable level
        |       +-- Condition-based task (CBM) -> RAPID AI primary path
        |       +-- Scheduled restoration
        |       +-- Scheduled discard
        |       +-- None effective? -> Redesign mandatory
        |
        +-- Operational consequence? (affects output/quality/service)
        |   +-- Task must be cost-effective (cost < consequences)
        |       +-- CBM preferred -> RAPID AI
        |       +-- Scheduled restoration
        |       +-- Scheduled discard
        |       +-- None effective? -> Accept, redesign, or change procedures
        |
        +-- Non-operational consequence? (only repair cost)
            +-- Task must cost less than repair
                +-- CBM if justified -> RAPID AI
                +-- Scheduled tasks
                +-- None effective? -> Run to failure (acceptable)
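
The tree above can be sketched as a small function. The names and category strings are hypothetical. Two simplifications are assumptions: the same task preference order (condition-based, then restoration, then discard) is applied to every evident branch, including the non-operational branch where the tree only says "Scheduled tasks", and the caller is expected to have already filtered `effective_tasks` by the branch's own test (risk reduction, cost-effectiveness, or cost below repair).

```python
from typing import Iterable

# Preference order for proactive tasks on evident failures.
TASK_PREFERENCE = ("Condition-based task", "Scheduled restoration", "Scheduled discard")

# Default action per consequence category when no proactive task is effective.
NO_TASK_DEFAULT = {
    "safety_environmental": "Redesign mandatory",
    "operational": "Accept, redesign, or change procedures",
    "non_operational": "Run to failure",
}


def select_task(hidden: bool, consequence: str,
                effective_tasks: Iterable[str] = ()) -> str:
    """Walk the consequence decision tree and return the selected task or default."""
    if consequence not in NO_TASK_DEFAULT:
        raise ValueError(f"unknown consequence category: {consequence}")
    if hidden:
        # Hidden failures: safety drives redesign, otherwise failure-finding.
        if consequence == "safety_environmental":
            return "Redesign mandatory"
        return "Scheduled failure-finding task"
    available = set(effective_tasks)
    for task in TASK_PREFERENCE:
        if task in available:
            return task
    return NO_TASK_DEFAULT[consequence]
```

For example, `select_task(False, "operational", ["Scheduled discard", "Condition-based task"])` returns the condition-based task because it ranks first in the preference order.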

RAPID AI’s RCM Mapping

RCM Concept	RAPID AI Module	Implementation
Functional failure	IMS Column: failure_modes	320 cataloged modes
Failure mode	Module B rules	275+ fault detection rules
Failure effect	Module E assessment	Severity scoring
Consequence analysis	Module G CDE	Design-out recommendations
CBM task	Modules A-C	Automated monitoring
Task interval	Module F RUL	Dynamic P-F interval
Task effectiveness	Confidence scoring	0.0-1.0 with propagation

P-F Interval and Inspection Frequency

The P-F interval is the time between first detectable evidence of failure (P) and functional failure (F):

P ---------------------- F
|                        |
| <-- P-F Interval -->   |
|                        |
| Inspection interval    |
| must be <= P-F / 2     |
| (gives 2 chances       |
|  to catch the fault)   |
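
The rule in the diagram is a single division. In the sketch below the function name is hypothetical, and the `chances` parameter generalizes the "two chances" rule as an assumption; the source only states the P-F / 2 case.

```python
def max_inspection_interval(pf_interval_days: float, chances: int = 2) -> float:
    """Longest inspection interval that still gives `chances` inspections inside
    the P-F window. The default of 2 matches the P-F / 2 rule above."""
    if pf_interval_days <= 0:
        raise ValueError("pf_interval_days must be positive")
    if chances < 1:
        raise ValueError("chances must be at least 1")
    return pf_interval_days / chances


# A 90-day P-F interval allows inspections at most 45 days apart.
interval = max_inspection_interval(90)
```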

Typical P-F Intervals by Technology:

Technology	Typical P-F	RAPID AI Module	Recommended Interval
Vibration	1-9 months	A -> B -> B.2 -> B.3	Monthly -> weekly as condition degrades
Oil analysis	2-6 months	External input	Monthly
Thermography	1-3 months	Temperature rules	Monthly
Visual inspection	1 week - 3 months	Manual input	Weekly
Performance monitoring	1-6 months	Process classification	Continuous

RAPID AI dynamically adjusts monitoring frequency based on health state:

Normal: Standard interval (monthly route)
Watch: Double frequency (every two weeks)
Alert: 4x frequency (weekly)
Alarm: Continuous online monitoring
Critical: Operations notified, shutdown planning
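
The schedule above can be sketched as a lookup from health state to inspection interval. The names are hypothetical. The 28-day base interval is an assumption standing in for "monthly", and treating Critical as continuous monitoring is also an assumption, since the list only says operations are notified. A return value of `None` means continuous online monitoring.

```python
from typing import Optional

# Frequency multipliers relative to the standard route interval.
FREQUENCY_MULTIPLIER = {"Normal": 1, "Watch": 2, "Alert": 4}

# States monitored continuously (Critical is an assumption; see lead-in).
CONTINUOUS_STATES = {"Alarm", "Critical"}


def monitoring_interval_days(state: str,
                             base_interval_days: float = 28.0) -> Optional[float]:
    """Return the inspection interval in days, or None for continuous monitoring."""
    if state in CONTINUOUS_STATES:
        return None
    if state not in FREQUENCY_MULTIPLIER:
        raise ValueError(f"unknown health state: {state}")
    return base_interval_days / FREQUENCY_MULTIPLIER[state]
```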

EN 13306 Maintenance Terminology

RAPID AI aligns with EN 13306:2017 maintenance terminology:

EN 13306 Term	RAPID AI Equivalent
Corrective maintenance	Module E ACT actions (after fault)
Preventive maintenance	Module F scheduled tasks
Condition-based maintenance	Modules A-C monitoring pipeline
Predetermined maintenance	Time-based rules in RCM workbook
Predictive maintenance	Module F RUL estimation
Reliability centered maintenance	Complete RCM decision algorithm
Failure mode	IMS failure_modes column (320 modes)
Fault	Module B detected condition
Degraded state	SSI Watch/Alert states
Critical failure	SSI Critical state

Dynamic RCM Workbooks

The RCM framework operates through 10 structured worksheets (CSVs in the implementation layer):

Sheet	Purpose	Key Fields
1. Asset Hierarchy	Plant -> System -> Equipment -> Component	asset_id, type, criticality, location
2. Functional Failures	Function -> Performance standard -> How it fails	function, performance_standard, functional_failure
3. Failure Modes	Mechanism, cause, annual probability, detectability	failure_mode, failure_cause, p_of_f, detectable
4. Consequences	Category, severity, safety/env risk, production loss, cost	consequence_category, severity, est_repair_cost
5. Detection Methods	Sensor type, technique, effectiveness, online/offline	detection_method, effectiveness, online
6. Decision Logic	Strategy selection based on consequence + detectability + confidence	strategy, justification
7. Maintenance Tasks	Specific actions, skills, duration, tools	task_description, skill_required, est_duration_hrs
8. Maintenance Plan	Frequency, trigger, responsible team, window	frequency, trigger_type, responsible_team
9. Spares	Required parts, lead time, minimum stock	spare_part, lead_time, min_stock
10. Review History	Audit trail of RCM decisions and revisions	review_date, reviewer, change_description

In production, these worksheets are populated per-asset from the IMS defaults and then customized with plant-specific data. The sensor data pipeline continuously updates sheets 3 (probability), 5 (detection effectiveness), and 6 (strategy selection), making the workbook a living document rather than a static artifact.
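
As a rough illustration of a pipeline update to sheet 3, the sketch below reads a small CSV using the key fields listed for that sheet and replaces one failure mode's probability. The function name is hypothetical, and every data value in the sample (probabilities, detectability labels) is a placeholder rather than real IMS content.

```python
import csv
import io

# Placeholder sheet 3 content; column names follow the Key Fields column above.
SHEET_3 = """failure_mode,failure_cause,p_of_f,detectable
FM-003,Insufficient NPSH / suction blockage,0.05,online
FM-007,Creep/aging,0.02,offline
"""


def update_probability(rows: list[dict], failure_mode: str,
                       new_p_of_f: float) -> list[dict]:
    """Return a copy of the sheet with one failure mode's p_of_f replaced."""
    return [
        {**row, "p_of_f": f"{new_p_of_f:.2f}"}
        if row["failure_mode"] == failure_mode else row
        for row in rows
    ]


rows = list(csv.DictReader(io.StringIO(SHEET_3)))
updated = update_probability(rows, "FM-003", 0.15)
```

Returning a new list instead of editing rows in place leaves the previous values available for the review history sheet.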

Connecting RCM to the Pipeline

The RCM framework connects to every layer of the RAPID AI pipeline:

Pipeline Module	RCM Connection
Module A (GUARD)	Data quality affects detection effectiveness scores
Module B (SENSE)	Matched rules map to specific failure modes in the RCM workbook
Module B.2 (TREND)	Trend severity updates failure probability
Module C (FUSE)	SSI drives health stage which drives strategy selection
Module D (PROGNOSE)	RUL estimate determines Tier 4 activation
Module E (ACT)	Priority score selects from the maintenance task catalog
Module F (WEIBULL)	30-day failure probability updates RPN in real time
Module G (CDE)	Contradiction detection triggers Redesign strategy

This end-to-end connection means that every sensor reading potentially affects the RCM strategy. A bearing showing accelerating degradation (Module B.2) with declining stability (Module B.3) and high system severity (Module C) will automatically escalate from routine monitoring to urgent intervention — without any human analyst having to manually recalculate the RPN.

Standards Alignment

Standard	Relevance to This Chapter
SAE JA1011/JA1012 — RCM evaluation criteria	The six-tier RCM decision algorithm directly implements SAE JA1011’s seven questions and JA1012’s detailed evaluation criteria, including consequence classification (hidden, safety/environmental, operational, non-operational) and strategy selection logic.
ISO 14224 — Reliability and maintenance data	The 10-sheet RCM workbook structure (asset hierarchy, functional failures, failure modes, consequences, detection, decision logic, tasks, plan, spares, review) implements ISO 14224’s standardized reliability data collection framework.
ISO 13381-1 — Prognostics	The dynamic RPN computation (real-time probability updates from Modules B, B.2, and F) extends traditional RCM with ISO 13381-1-compliant prognostic information that continuously adjusts maintenance strategy.
ISO 55000/55001 — Asset management	The function-first RCM approach and the maintenance maturity progression align with ISO 55000’s asset management system requirements for risk-based, value-driven maintenance decision-making.
EN 13306 — Maintenance terminology	The six strategy tiers (immediate action/escalation, condition-based maintenance, scheduled inspection, planned replacement, time-based replacement, run-to-failure) use EN 13306-compliant maintenance type definitions.
IEC 61649 — Weibull analysis	Module F’s Weibull-based 30-day failure probability (P_30) and the Nowlan-Heap failure pattern analysis reference the statistical failure modeling formalized in IEC 61649.

Changelog

Version	Date	Author	Changes
2.1.0	2026-03-17	Rick D	Added standards alignment, living doc metadata, changelog
2.0.0	2026-03-17	Rick D	Enriched with production codebase content
1.0.0	2026-03-17	Rick D	Initial chapter creation
