
RCM Framework

Traditional Reliability Centered Maintenance is a static exercise. A team of engineers spends weeks in a conference room filling out worksheets: functions, functional failures, failure modes, consequences, maintenance strategies. The output is a binder that sits on a shelf, consulted occasionally, updated rarely, and eventually forgotten as plant conditions change and equipment is modified.

RAPID AI transforms RCM from a static artifact into a living system. Sensor data continuously updates failure probabilities. Risk Priority Numbers recalculate in real time. Maintenance strategies adjust automatically as conditions change. The RCM workbook becomes a dynamic decision engine rather than a historical record.


Dibyendu De’s RCM method inverts the traditional starting point. Classical RCM starts with components: “This pump has a bearing. What can go wrong with the bearing?” De’s method starts with function: “This pump must deliver 200 cubic meters per hour of cooling water at 35 meters head. How can that function fail?”

This inversion is critical because it ensures that maintenance effort is always directed at preserving function, not just replacing parts. A reliability engineer focused on components might replace a bearing that shows early damage. A reliability engineer focused on function might instead adjust the operating point to reduce the load that is damaging the bearing, preserving the function while extending the component life.

The hierarchy flows from function to action:

Function (what the machine must do)
-> Functional Failure (how can the function fail?)
-> Failure Mode (what specific mechanism causes the functional failure?)
-> Failure Cause (what initiates the failure mode?)
-> Consequence (what happens when it fails?)
-> Detection Method (can we see it coming?)
-> Maintenance Strategy (what should we do?)
-> Maintenance Task (specific action to perform)

| Level | Content |
| --- | --- |
| Function | Deliver 200 m3/hr at 35 m head |
| Functional Failure | Reduced flow below 150 m3/hr |
| Failure Mode | Cavitation (FM-003) |
| Failure Cause | Insufficient NPSH / suction blockage |
| Consequence | Operational: process derate (severity 3) |
| Detection | Online: pressure + vibration + flow (effectiveness 5/5) |
| Strategy | Condition Based Maintenance |
| Task | NPSH verification + suction line inspection |
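
The function-to-task chain can be held as a single record per failure mode. The sketch below is one possible representation, not the RAPID AI schema: the class name `RcmChain`, its field names, and the helper `as_outline` are hypothetical, and the values are taken from the cavitation example above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RcmChain:
    """One pass through the hierarchy, from function down to task (hypothetical names)."""
    function: str
    functional_failure: str
    failure_mode: str
    failure_cause: str
    consequence: str
    detection: str
    strategy: str
    task: str


pump_cavitation = RcmChain(
    function="Deliver 200 m3/hr at 35 m head",
    functional_failure="Reduced flow below 150 m3/hr",
    failure_mode="Cavitation (FM-003)",
    failure_cause="Insufficient NPSH / suction blockage",
    consequence="Operational: process derate (severity 3)",
    detection="Online: pressure + vibration + flow (effectiveness 5/5)",
    strategy="Condition Based Maintenance",
    task="NPSH verification + suction line inspection",
)


def as_outline(chain: RcmChain) -> str:
    """Render the chain in the same top-down order as the hierarchy."""
    lines = []
    for depth, field in enumerate(fields(chain)):
        prefix = "" if depth == 0 else "-> "
        lines.append(f"{prefix}{field.name}: {getattr(chain, field.name)}")
    return "\n".join(lines)


print(as_outline(pump_cavitation))
```

Declaring the fields in hierarchy order keeps the function first, which mirrors the function-first starting point of the method.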

Every failure mode is scored using the Risk Priority Number:

RPN = Severity x Probability x Detectability

Each factor is ranked 1-5, producing an RPN range of 1-125.

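| Factor | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Severity | Negligible | Minor | Moderate | Major | Catastrophic |
| Probability | Rare (<0.01/yr) | Unlikely (0.01-0.05/yr) | Possible (0.05-0.10/yr) | Likely (0.10-0.20/yr) | Frequent (>0.20/yr) |
| Detectability | Almost certain | High | Moderate | Low | Almost undetectable |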

Note that detectability is inversely scored: a rank of 1 means the failure is almost certainly detectable (good), while 5 means it is almost undetectable (bad).

| RPN Range | Risk Level | Response |
| --- | --- | --- |
| 1-20 | Low | Routine monitoring |
| 21-50 | Moderate | Predictive maintenance |
| 51-100 | High | Proactive intervention |
| > 100 | Critical | Immediate action or redesign |
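
The scoring and banding rules above translate into a few lines of code. This is a minimal sketch: the function names `rpn` and `risk_level` are illustrative, and the probability and detectability ranks in the example are hypothetical values, not figures from the workbook.

```python
def rpn(severity: int, probability: int, detectability: int) -> int:
    """Risk Priority Number from three ranks, each in 1..5."""
    ranks = {"severity": severity, "probability": probability, "detectability": detectability}
    for name, rank in ranks.items():
        if not 1 <= rank <= 5:
            raise ValueError(f"{name} rank must be in 1..5, got {rank}")
    return severity * probability * detectability


def risk_level(value: int) -> tuple:
    """Map an RPN to (risk level, response) using the bands in the table above."""
    if value <= 20:
        return ("Low", "Routine monitoring")
    if value <= 50:
        return ("Moderate", "Predictive maintenance")
    if value <= 100:
        return ("High", "Proactive intervention")
    return ("Critical", "Immediate action or redesign")


# Hypothetical ranks: severity 3, probability 3, detectability 4.
score = rpn(3, 3, 4)
print(score, risk_level(score))  # 36 ('Moderate', 'Predictive maintenance')
```

With integer ranks of 1-5, the only achievable product above 100 is 125 (5 x 5 x 5), so the Critical band is reached only when all three factors are at their worst rank.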

In traditional RCM, the RPN is calculated once and filed. In RAPID AI, the Probability factor is continuously updated from sensor data:

  • Module B confidence scores adjust the probability based on detected fault evidence
  • Module B.2 trend severity adjusts the probability based on degradation rate
  • Module F Weibull P_30 (30-day failure probability) provides a statistically grounded probability estimate

This means an RPN that was 36 (moderate) at commissioning, for example severity 3 x probability 3 x detectability 4, can escalate to 60 (high) when sensor data reveals accelerating degradation and raises the probability rank to 5. The escalation automatically triggers a strategy change from predictive maintenance to proactive intervention.
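
One way a 30-day failure probability could feed the Probability rank is sketched below. How RAPID AI actually combines the Module B, B.2, and F inputs is not specified here, so this sketch uses only P_30 and makes two labeled assumptions: a constant hazard rate when annualizing P_30, and the boundary handling in `probability_rank`. All function names are hypothetical.

```python
import math


def annualized_rate(p_30: float) -> float:
    """Convert a 30-day failure probability into failures per year.

    Assumes a constant hazard rate over the window (an assumption of this sketch).
    """
    if not 0.0 <= p_30 < 1.0:
        raise ValueError("p_30 must be in [0, 1)")
    return -math.log(1.0 - p_30) * 365.0 / 30.0


def probability_rank(rate_per_year: float) -> int:
    """Rank 1..5 from the Probability row of the ranking table.

    Shared boundary values (0.01, 0.05, 0.10) go to the higher rank; this is an assumption.
    """
    if rate_per_year < 0.01:
        return 1
    if rate_per_year < 0.05:
        return 2
    if rate_per_year < 0.10:
        return 3
    if rate_per_year <= 0.20:
        return 4
    return 5


def live_rpn(severity: int, detectability: int, p_30: float) -> int:
    """Recompute the RPN with the probability rank derived from P_30."""
    return severity * probability_rank(annualized_rate(p_30)) * detectability


# Hypothetical asset: severity 3, detectability 4.
at_commissioning = 3 * probability_rank(0.07) * 4   # rank 3 -> RPN 36
after_degradation = live_rpn(3, 4, p_30=0.03)       # about 0.37 failures/yr -> rank 5 -> RPN 60
print(at_commissioning, after_degradation)
```

Only the Probability factor moves in this sketch; severity and detectability stay at their worksheet values, which is why the RPN changes in steps rather than continuously.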


The RCM decision algorithm evaluates detected failure modes through a priority cascade. The first matching condition determines the strategy:

Tier 1 condition: consequence_category in [Safety, Environmental] AND severity >= 4

These failures threaten human safety or environmental compliance. Response is escalation to engineering and operations, with possible emergency shutdown. No diagnostic confidence threshold is applied — the consequence alone drives the decision.

Example: Motor insulation failure (FM-014) with severity 5 and safety consequence. Even a moderate confidence detection triggers escalation because the potential consequence (fire, electrical hazard) is unacceptable.

Tier 2 condition: detectable_online = true AND confidence >= 0.70

The failure progression is visible in sensor trends, and RAPID AI’s diagnostic confidence is high enough to act on. This is the primary operating mode: it covers the 89% of failures that Nowlan and Heap showed are NOT age-related (patterns D, E, and F).

The 0.70 confidence threshold is the RCM activation boundary defined in the confidence scoring standard. Below 0.70, the evidence is too uncertain for maintenance action; the system continues monitoring.

Example: Bearing outer race defect (FM-008) detected by envelope spectrum with confidence 0.90. Strategy: trend vibration and replace bearing in planned window.

Tier 3 condition: detectable_offline = true

The failure cannot be continuously monitored but can be found through periodic inspection — visual checks, thermography routes, alignment audits, lubrication sampling, NDT.

Example: Impeller fatigue crack (FM-002) detectable only by dye penetrant inspection at shutdown. Strategy: annual NDT inspection at planned outage.

Tier 4 condition: estimated_rul_days <= 30

The remaining useful life from Module D/F is known and short enough that replacement should be scheduled in the next available shutdown window.

Example: Mechanical seal (FM-006) with leak rate trending upward and an RUL estimate inside the 30-day threshold (for example, 25 days). Strategy: planned replacement at next available 8-hour window.

Tier 5 condition: failure_mode is age_related AND not reliably detectable

These are the minority of failures (Nowlan and Heap patterns A and B, roughly 6% of cases) where life is predictable and monitoring cannot catch degradation.

Example: Gasket leakage (FM-007) from creep/aging. Not reliably detectable online. Strategy: replace gaskets at overhaul (every 48 months).

Tier 6 condition: low consequence AND low repair cost AND spare available

When failure has no safety impact, replacement is cheap, and downtime is acceptable, the most economical strategy is to let it run until it breaks.

Example: Cooling fan for motor enclosure — consequence is minor (motor runs slightly warmer), repair cost is low ($200), spare is in stock. Strategy: replace on failure.

If no tier matches, the algorithm falls through to Engineering Review — a signal that the data is conflicting, the failure mode is poorly understood, or a design issue may exist. This may trigger Module G (Contradiction Driven Engineering).
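
The cascade can be read as a first-match rule list. The sketch below follows the conditions in the order they are listed; the class `FailureModeState`, its field names, and the strategy labels are hypothetical, and "not reliably detectable" is approximated as neither online nor offline detectable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureModeState:
    """Inputs to the cascade for one detected failure mode (hypothetical field names)."""
    consequence_category: str               # e.g. "Safety", "Environmental", "Operational"
    severity: int                           # 1..5
    confidence: float = 0.0                 # diagnostic confidence, 0.0..1.0
    detectable_online: bool = False
    detectable_offline: bool = False
    estimated_rul_days: Optional[float] = None
    age_related: bool = False
    low_consequence: bool = False
    low_repair_cost: bool = False
    spare_available: bool = False


def select_strategy(fm: FailureModeState) -> str:
    """Return the strategy of the first matching tier, else Engineering Review."""
    if fm.consequence_category in ("Safety", "Environmental") and fm.severity >= 4:
        return "Tier 1: escalate to engineering and operations"
    if fm.detectable_online and fm.confidence >= 0.70:
        return "Tier 2: condition-based maintenance"
    if fm.detectable_offline:
        return "Tier 3: periodic inspection"
    if fm.estimated_rul_days is not None and fm.estimated_rul_days <= 30:
        return "Tier 4: planned replacement at next shutdown window"
    # Assumption: "not reliably detectable" means no online or offline detection method.
    if fm.age_related and not (fm.detectable_online or fm.detectable_offline):
        return "Tier 5: time-based replacement"
    if fm.low_consequence and fm.low_repair_cost and fm.spare_available:
        return "Tier 6: run to failure"
    return "Engineering Review"


bearing = FailureModeState("Operational", 3, confidence=0.90, detectable_online=True)
print(select_strategy(bearing))  # Tier 2: condition-based maintenance
```

Because the first match wins, a safety-critical failure mode with severity 4 or higher is escalated regardless of its confidence value, which matches the statement that no confidence threshold applies to that tier.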


Lubrication decisions deserve special treatment because lubrication failure is the single most common initiator of bearing damage. RAPID AI uses the lambda ratio — the relationship between minimum oil film thickness and composite surface roughness:

lambda = h_min / sqrt(Ra_1^2 + Ra_2^2)

| Lambda | Regime | Action |
| --- | --- | --- |
| > 3.0 | Full EHL film | Target zone: surfaces fully separated |
| 2.0-3.0 | Mixed lubrication | Monitor closely, consider relubrication |
| < 2.0 | Boundary regime | Immediate action: surfaces in contact |

Ultrasound (UE) provides real-time lambda estimation without oil sampling:

| UE Change (dB over baseline) | Action |
| --- | --- |
| +8 dB | Trigger inspection |
| +12 dB | Micro-relubrication (controlled grease addition) |
| +16 dB | Full corrective action (purge and replace grease) |

This protocol connects to FRETTLSM factor L011 (film breakdown) and AFB rules AFB03/AFB04 (lubrication starvation and wrong viscosity).
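
As a worked illustration of the lambda formula and the two threshold tables, consider the sketch below. The function names are hypothetical, the film thickness and roughness inputs are invented example values in micrometres, and treating each dB figure as an "at or above" threshold is an assumption.

```python
import math


def lambda_ratio(h_min_um: float, ra1_um: float, ra2_um: float) -> float:
    """lambda = h_min / sqrt(Ra_1^2 + Ra_2^2), all lengths in the same unit."""
    composite = math.sqrt(ra1_um ** 2 + ra2_um ** 2)
    if composite <= 0.0:
        raise ValueError("composite roughness must be positive")
    return h_min_um / composite


def lubrication_regime(lam: float) -> str:
    """Classify lambda using the regime boundaries from the table above."""
    if lam > 3.0:
        return "Full EHL film"
    if lam >= 2.0:
        return "Mixed lubrication"
    return "Boundary regime"


def ultrasound_action(db_over_baseline: float) -> str:
    """Action for an ultrasound rise over baseline (thresholds read as 'at or above')."""
    if db_over_baseline >= 16.0:
        return "Full corrective action"
    if db_over_baseline >= 12.0:
        return "Micro-relubrication"
    if db_over_baseline >= 8.0:
        return "Trigger inspection"
    return "No action"


# Invented example: 0.5 um film over two surfaces of 0.1 um roughness each.
lam = lambda_ratio(0.5, 0.1, 0.1)
print(round(lam, 2), lubrication_regime(lam))  # 3.54 Full EHL film
print(ultrasound_action(13.0))                 # Micro-relubrication
```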


The fundamental statistical justification for RAPID AI’s CBM-first approach comes from the landmark 1978 study by F. Stanley Nowlan and Howard F. Heap for United Airlines. Their research, which became the foundation of modern RCM, revealed six failure patterns:

| Pattern | Shape | Prevalence | Description |
| --- | --- | --- | --- |
| A | Bathtub curve | 4% | Infant mortality, then random, then wear-out |
| B | Slow aging | 2% | Gradually increasing failure rate |
| C | Slow aging (no infant) | 5% | Constant-rate increase from new |
| D | Random with initial rise | 7% | Brief break-in, then constant random rate |
| E | Purely random | 14% | Constant failure rate, no age dependence |
| F | Infant mortality then random | 68% | High early failure rate, then constant |

Key finding: 89% of failures (patterns D + E + F) are NOT age-related. Time-based replacement cannot catch random failures. Only condition monitoring — measuring what the machine is actually doing right now — detects the symptoms before functional failure.

This is why Tier 2 (CBM) is the primary strategy in RAPID AI’s RCM framework, and why Tier 5 (time-based replacement) applies to only 6% of failure modes.
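
The two headline percentages follow directly from the prevalence column. A short check, using a hypothetical dictionary name:

```python
# Prevalence (%) of each Nowlan and Heap failure pattern, from the table above.
PATTERN_PREVALENCE = {"A": 4, "B": 2, "C": 5, "D": 7, "E": 14, "F": 68}

not_age_related = sum(PATTERN_PREVALENCE[p] for p in "DEF")   # patterns D + E + F
time_based_share = sum(PATTERN_PREVALENCE[p] for p in "AB")   # patterns A + B

print(not_age_related, time_based_share)  # 89 6
```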

The Nowlan and Heap data also explains why time-based replacement can make things worse. Pattern F (68% of failures) shows high infant mortality followed by a constant random rate. Replacing a component introduces a new infant mortality period. If the replacement interval is shorter than necessary, the organization spends more time in the infant mortality zone than in the stable random zone — actually increasing the failure rate through maintenance.

This is the Waddington Effect, named after C. H. Waddington, whose World War II operational research on RAF Coastal Command aircraft first quantified it. RAPID AI’s condition-based approach avoids this trap by replacing components only when the condition evidence warrants it, not when a calendar says so.


RAPID AI implements the RCM decision algorithm per SAE JA1011/JA1012 evaluation criteria.

For each function/functional failure/failure mode:

  1. What are the functions? (What does the equipment do?)
  2. What are the functional failures? (How can each function fail?)
  3. What are the failure modes? (What causes each functional failure?)
  4. What are the failure effects? (What happens when the failure occurs?)
  5. What are the failure consequences? (Does it matter?)
  6. What can be done to prevent/predict? (Proactive tasks)
  7. What if no proactive task is applicable? (Default actions)

Failure Mode Identified
|
+-- Hidden failure? (not evident to operating crew)
|   +-- Safety consequence? -> Redesign mandatory
|   +-- No safety? -> Scheduled failure-finding task
|
+-- Evident failure?
    +-- Safety/environmental consequence?
    |   +-- Task MUST reduce risk to acceptable level
    |   +-- Condition-based task (CBM) -> RAPID AI primary path
    |   +-- Scheduled restoration
    |   +-- Scheduled discard
    |   +-- None effective? -> Redesign mandatory
    |
    +-- Operational consequence? (affects output/quality/service)
    |   +-- Task must be cost-effective (cost < consequences)
    |   +-- CBM preferred -> RAPID AI
    |   +-- Scheduled restoration
    |   +-- Scheduled discard
    |   +-- None effective? -> Accept, redesign, or change procedures
    |
    +-- Non-operational consequence? (only repair cost)
        +-- Task must cost less than repair
        +-- CBM if justified -> RAPID AI
        +-- Scheduled tasks
        +-- None effective? -> Run to failure (acceptable)

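
The logic tree can be sketched as a small function. This is an illustration of the branching only, not the RAPID AI implementation: the function `select_task`, the consequence labels, and the idea of passing in a set of tasks that an analyst has already judged effective (risk-reducing or cost-effective, depending on the branch) are assumptions.

```python
# Proactive tasks in the order the tree lists them.
TASK_PREFERENCE = ("Condition-based task", "Scheduled restoration", "Scheduled discard")

# Default action per consequence branch when no proactive task is effective.
FALLBACK = {
    "safety": "Redesign mandatory",
    "environmental": "Redesign mandatory",
    "operational": "Accept, redesign, or change procedures",
    "non-operational": "Run to failure",
}


def select_task(hidden: bool, consequence: str, effective_tasks: set) -> str:
    """Walk the tree for one failure mode and return the selected task or default action."""
    if consequence not in FALLBACK:
        raise ValueError(f"unknown consequence category: {consequence}")
    if hidden:
        # Hidden failures: redesign if safety is involved, otherwise failure-finding.
        return "Redesign mandatory" if consequence == "safety" else "Scheduled failure-finding task"
    for task in TASK_PREFERENCE:
        if task in effective_tasks:
            return task
    return FALLBACK[consequence]


print(select_task(False, "operational", {"Condition-based task", "Scheduled discard"}))
print(select_task(False, "non-operational", set()))
```

The first call returns the condition-based task because it is checked before the scheduled tasks; the second falls through to run to failure because no task was judged worth its cost.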

| RCM Concept | RAPID AI Module | Implementation |
| --- | --- | --- |
| Functional failure | IMS Column: failure_modes | 320 cataloged modes |
| Failure mode | Module B rules | 275+ fault detection rules |
| Failure effect | Module E assessment | Severity scoring |
| Consequence analysis | Module G CDE | Design-out recommendations |
| CBM task | Modules A-C | Automated monitoring |
| Task interval | Module F RUL | Dynamic P-F interval |
| Task effectiveness | Confidence scoring | 0.0-1.0 with propagation |

The P-F interval is the time between first detectable evidence of failure (P) and functional failure (F):

P ------------------------------ F
|                                |
| <------- P-F interval -------> |
|                                |
|  Inspection interval must be   |
|  <= P-F / 2 (gives 2 chances   |
|  to catch the fault)           |
Typical P-F Intervals by Technology:

| Technology | Typical P-F | RAPID AI Module | Recommended Interval |
| --- | --- | --- | --- |
| Vibration | 1-9 months | A -> B -> B.2 -> B.3 | Monthly -> weekly as condition degrades |
| Oil analysis | 2-6 months | External input | Monthly |
| Thermography | 1-3 months | Temperature rules | Monthly |
| Visual inspection | 1 week - 3 months | Manual input | Weekly |
| Performance monitoring | 1-6 months | Process classification | Continuous |

RAPID AI dynamically adjusts monitoring frequency based on health state:

  • Normal: Standard interval (monthly route)
  • Watch: Double frequency (bi-weekly)
  • Alert: 4x frequency (weekly)
  • Alarm: Continuous online monitoring
  • Critical: Operations notified, shutdown planning
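
The health-state schedule and the P-F / 2 rule can be combined into one interval calculation. The sketch below assumes a 30-day standard route, treats both Alarm and Critical as continuous monitoring (the text states this only for Alarm), and uses hypothetical names throughout.

```python
from typing import Optional

# Frequency multipliers relative to the standard route interval.
FREQUENCY_MULTIPLIER = {"Normal": 1, "Watch": 2, "Alert": 4}

# States handled by continuous online monitoring (Critical is an assumption here).
CONTINUOUS_STATES = {"Alarm", "Critical"}


def inspection_interval_days(health_state: str,
                             pf_interval_days: float,
                             standard_interval_days: float = 30.0) -> Optional[float]:
    """Days between inspections, or None when monitoring is continuous.

    The result never exceeds half the P-F interval, so at least two
    inspections fall between first detectability and functional failure.
    """
    if health_state in CONTINUOUS_STATES:
        return None
    if health_state not in FREQUENCY_MULTIPLIER:
        raise ValueError(f"unknown health state: {health_state}")
    scaled = standard_interval_days / FREQUENCY_MULTIPLIER[health_state]
    return min(scaled, pf_interval_days / 2.0)


for state in ("Normal", "Watch", "Alert", "Alarm"):
    print(state, inspection_interval_days(state, pf_interval_days=90.0))
```

For a 90-day P-F interval this yields 30, 15, and 7.5 days for Normal, Watch, and Alert. With a 30-day P-F interval the Normal route would be capped at 15 days, because a monthly route would leave only one chance to catch the fault.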

RAPID AI aligns with EN 13306:2017 maintenance terminology:

| EN 13306 Term | RAPID AI Equivalent |
| --- | --- |
| Corrective maintenance | Module E ACT actions (after fault) |
| Preventive maintenance | Module F scheduled tasks |
| Condition-based maintenance | Modules A-C monitoring pipeline |
| Predetermined maintenance | Time-based rules in RCM workbook |
| Predictive maintenance | Module F RUL estimation |
| Reliability centered maintenance | Complete RCM decision algorithm |
| Failure mode | IMS failure_modes column (320 modes) |
| Fault | Module B detected condition |
| Degraded state | SSI Watch/Alert states |
| Critical failure | SSI Critical state |

The RCM framework operates through 10 structured worksheets (CSVs in the implementation layer):

| Sheet | Purpose | Key Fields |
| --- | --- | --- |
| 1. Asset Hierarchy | Plant -> System -> Equipment -> Component | asset_id, type, criticality, location |
| 2. Functional Failures | Function -> Performance standard -> How it fails | function, performance_standard, functional_failure |
| 3. Failure Modes | Mechanism, cause, annual probability, detectability | failure_mode, failure_cause, p_of_f, detectable |
| 4. Consequences | Category, severity, safety/env risk, production loss, cost | consequence_category, severity, est_repair_cost |
| 5. Detection Methods | Sensor type, technique, effectiveness, online/offline | detection_method, effectiveness, online |
| 6. Decision Logic | Strategy selection based on consequence + detectability + confidence | strategy, justification |
| 7. Maintenance Tasks | Specific actions, skills, duration, tools | task_description, skill_required, est_duration_hrs |
| 8. Maintenance Plan | Frequency, trigger, responsible team, window | frequency, trigger_type, responsible_team |
| 9. Spares | Required parts, lead time, minimum stock | spare_part, lead_time, min_stock |
| 10. Review History | Audit trail of RCM decisions and revisions | review_date, reviewer, change_description |

In production, these worksheets are populated per-asset from the IMS defaults and then customized with plant-specific data. The sensor data pipeline continuously updates sheets 3 (probability), 5 (detection effectiveness), and 6 (strategy selection), making the workbook a living document rather than a static artifact.
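
To make the "living worksheet" idea concrete, the sketch below loads a Failure Modes sheet from CSV and overwrites one probability, as a pipeline update might. The column names come from the Key Fields listed for sheet 3; the sample rows, the p_of_f values, the true/false encoding of `detectable`, and all function and class names are assumptions.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class FailureModeRow:
    """One row of sheet 3 (Failure Modes)."""
    failure_mode: str
    failure_cause: str
    p_of_f: float        # annual probability of failure
    detectable: bool


# Invented sample content; real sheets are populated per asset.
SAMPLE_CSV = """failure_mode,failure_cause,p_of_f,detectable
FM-003 Cavitation,Insufficient NPSH / suction blockage,0.07,true
FM-007 Gasket leakage,Creep/aging,0.04,false
"""


def load_failure_modes(text: str) -> list:
    """Parse CSV text into typed rows."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        FailureModeRow(
            failure_mode=row["failure_mode"],
            failure_cause=row["failure_cause"],
            p_of_f=float(row["p_of_f"]),
            detectable=row["detectable"].strip().lower() == "true",
        )
        for row in reader
    ]


def update_probability(rows: list, failure_mode: str, new_p_of_f: float) -> bool:
    """Overwrite p_of_f for the named failure mode; return False if it is not found."""
    for row in rows:
        if row.failure_mode == failure_mode:
            row.p_of_f = new_p_of_f
            return True
    return False


rows = load_failure_modes(SAMPLE_CSV)
update_probability(rows, "FM-003 Cavitation", 0.25)
print(rows[0])
```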


The RCM framework connects to every layer of the RAPID AI pipeline:

| Pipeline Module | RCM Connection |
| --- | --- |
| Module A (GUARD) | Data quality affects detection effectiveness scores |
| Module B (SENSE) | Matched rules map to specific failure modes in the RCM workbook |
| Module B.2 (TREND) | Trend severity updates failure probability |
| Module C (FUSE) | SSI drives health stage which drives strategy selection |
| Module D (PROGNOSE) | RUL estimate determines Tier 4 activation |
| Module E (ACT) | Priority score selects from the maintenance task catalog |
| Module F (WEIBULL) | 30-day failure probability updates RPN in real time |
| Module G (CDE) | Contradiction detection triggers Redesign strategy |

This end-to-end connection means that every sensor reading potentially affects the RCM strategy. A bearing showing accelerating degradation (Module B.2) with declining stability (Module B.3) and high system severity (Module C) will automatically escalate from routine monitoring to urgent intervention — without any human analyst having to manually recalculate the RPN.


| Standard | Relevance to This Chapter |
| --- | --- |
| SAE JA1011/JA1012 (RCM evaluation criteria) | The six-tier RCM decision algorithm directly implements SAE JA1011’s seven questions and JA1012’s detailed evaluation criteria, including consequence classification (safety, environmental, operational, hidden) and strategy selection logic. |
| ISO 14224 (Reliability and maintenance data) | The 10-sheet RCM workbook structure (asset hierarchy, functional failures, failure modes, consequences, detection, decision logic, tasks, plan, spares, review) implements ISO 14224’s standardized reliability data collection framework. |
| ISO 13381-1 (Prognostics) | The dynamic RPN computation (real-time probability updates from Modules B, B.2, and F) extends traditional RCM with ISO 13381-1-compliant prognostic information that continuously adjusts maintenance strategy. |
| ISO 55000/55001 (Asset management) | The function-first RCM approach and the maintenance maturity progression align with ISO 55000’s asset management system requirements for risk-based, value-driven maintenance decision-making. |
| EN 13306 (Maintenance terminology) | The six strategy tiers (run-to-failure, time-based, condition-based, predictive, redesign, operational change) use EN 13306-compliant maintenance type definitions. |
| IEC 61649 (Weibull analysis) | The lambda-driven lubrication protocol and Nowlan-Heap failure pattern analysis reference the statistical failure modeling formalized in IEC 61649. |

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |
