Reliability Mathematics

Chapter 25: Reliability Mathematics

Reliability engineering is applied probability theory. Every maintenance decision — when to inspect, when to replace, when to accept risk — rests on mathematical models of how and when things fail. This chapter presents the complete mathematical foundation that RAPID AI uses to transform raw sensor data and operational history into actionable reliability intelligence.

The equations here are not academic curiosities. They are the production formulas implemented in rul_engine.py, weibull_model_engine.py, and the reliability analysis services. Every variable name maps to a field in the Module F schema or the reliability database tables. Every threshold has a physical justification.

25.1 Probability Distributions for Failure

Three distributions dominate reliability engineering. Each models a different physical failure mechanism. Selecting the wrong one produces dangerously incorrect predictions.

Exponential Distribution — Constant Failure Rate (Memoryless)

The exponential distribution assumes that the probability of failure in the next instant is the same regardless of how long the component has been running. This is the simplest model and the only continuous distribution with the memoryless property.

Probability Density Function:

f(t) = lambda * exp(-lambda * t)

Reliability Function:

R(t) = exp(-lambda * t)

Cumulative Distribution Function (unreliability):

F(t) = 1 - exp(-lambda * t)

Mean Time To Failure:

MTTF = 1 / lambda

When to use:

Random failures caused by external shocks (lightning, operator error, foreign object damage)
Electronic components after burn-in screening
Systems where wear-out is prevented by scheduled replacement
Any scenario where the Weibull shape parameter beta is approximately 1.0

When NOT to use:

Mechanical wear-out (bearings, seals, gears) — these have increasing failure rates
Infant mortality screening — decreasing failure rate requires Weibull with beta < 1

RAPID AI application: The RUL engine (rul_engine.py) uses the exponential hazard model for 30-day failure probability computation. Given an estimated RUL, the hazard rate lambda = 1/RUL is assumed constant over the short prediction window:

P(failure in 30 days) = 1 - exp(-lambda * 30)

This is a deliberate simplification. Over a 30-day window, the constant-rate assumption introduces negligible error compared to the uncertainty in the RUL estimate itself.

Weibull Distribution — The Workhorse of Reliability Engineering

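The Weibull distribution is the most versatile life distribution in reliability engineering. With two parameters, it can model a decreasing, constant, or increasing failure rate, so a separate fit can describe each region of the bathtub curve. A single two-parameter Weibull cannot reproduce the whole bathtub shape at once, because its hazard rate is monotonic.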

Parameters:

beta (shape parameter, dimensionless): governs how the failure rate changes with time
eta (scale parameter, same units as time): the characteristic life

Probability Density Function:

f(t) = (beta / eta) * (t / eta)^(beta - 1) * exp(-(t / eta)^beta)

Reliability Function:

R(t) = exp(-(t / eta)^beta)

Cumulative Distribution Function:

F(t) = 1 - exp(-(t / eta)^beta)

Hazard Rate (instantaneous failure rate):

h(t) = (beta / eta) * (t / eta)^(beta - 1)

Mean Life (MTTF for non-repairable items):

mu = eta * Gamma(1 + 1/beta)

where Gamma is the gamma function.

Bxx Life (time by which xx% of the population has failed):

t_p = eta * (-ln(1 - p))^(1/beta)

The B10 life (10th percentile, also called L10 in bearing standards):

B10 = eta * (-ln(0.9))^(1/beta) = eta * (0.10536)^(1/beta)

Characteristic Life Interpretation:

eta is the 63.2nd percentile — 63.2% of the population has failed by t = eta
eta is NOT the mean life (unless beta = 1)
eta is NOT the “expected life” in the colloquial sense
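The Weibull formulas above can be evaluated directly with the standard library. The following is a minimal sketch, not the code in weibull_model_engine.py; the function names and the example parameters (beta = 2.0, eta = 40,000 hours) are illustrative assumptions.

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """h(t) = (beta/eta) * (t/eta)^(beta-1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def weibull_mean(beta, eta):
    """Mean life mu = eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

def weibull_bxx(p, beta, eta):
    """Time by which a fraction p of the population has failed."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

# Illustrative parameters only (not fitted from any real asset)
beta, eta = 2.0, 40000.0
b10 = weibull_bxx(0.10, beta, eta)
mean_life = weibull_mean(beta, eta)
unreliability_at_eta = 1.0 - weibull_reliability(eta, beta, eta)
print(f"B10 = {b10:.0f} h, mean = {mean_life:.0f} h, F(eta) = {unreliability_at_eta:.3f}")
```

With these example values the mean life is well below eta and the B10 life is far below both, which is the practical reason the characteristic life should not be read as an expected life.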

Shape Parameter Interpretation:

beta Range	Failure Rate	Physical Meaning	Maintenance Implication
beta < 0.5	Steeply decreasing	Severe infant mortality; manufacturing or installation defects	Burn-in, commissioning checks, installation audit
0.5 <= beta < 1.0	Decreasing	Moderate infant mortality; early quality issues	Improve commissioning, PM after break-in
beta = 1.0 (0.95-1.05)	Constant	Exponential distribution; random external causes	CBM or run-to-failure if consequence is low
1.0 < beta < 2.0	Mildly increasing	Early wear-out; friction and fatigue beginning	Age-based PM; start CBM monitoring
beta = 2.0	Linearly increasing (Rayleigh)	Progressive wear with constant wear rate	Time-based replacement at B10 or B20
2.0 < beta < 4.0	Increasing	Definite wear-out; design life visible	Scheduled replacement before B10
beta = 3.44	Near-normal (bell-shaped)	Tight life distribution; predictable failure	Optimal for fixed-interval replacement
beta > 4.0	Steeply increasing	Rapid wear-out; brittle fracture, fatigue	Aggressive CBM; inspect before B5

How RAPID AI uses Weibull: The Reliability Intelligence Layer fits Weibull parameters from failure history using either Median Rank Regression (MRR, for 3-20 failures) or Maximum Likelihood Estimation (MLE, for 5+ failures with censored data). Module F then uses the fitted beta to classify hazard behaviour and select the appropriate prognostic model.

Lognormal Distribution — Fatigue and Repair Time

The lognormal distribution arises when the logarithm of the life data is normally distributed. It models multiplicative degradation processes where many small factors combine.

Probability Density Function:

f(t) = (1 / (t * sigma * sqrt(2*pi))) * exp(-(ln(t) - mu)^2 / (2 * sigma^2))

Parameters:

mu = mean of ln(t) (location parameter in log space)
sigma = standard deviation of ln(t) (scale parameter in log space)

Key statistics:

Median = exp(mu)
Mean   = exp(mu + 0.5 * sigma^2)
Mmax (95th percentile) = exp(mu + 1.6449 * sigma)

When to use:

Fatigue life (crack initiation and propagation)
Repair time modelling (MTTR distributions)
Corrosion rates
Any process governed by multiplicative random shocks

RAPID AI application: Repair times in the maintainability analysis use lognormal fitting. The median repair time, mean repair time, and Mmax (95th percentile) are computed for maintenance workload planning. This is why MTTR alone is insufficient: repair-time distributions are right-skewed, and Mmax gives the duration within which 95% of repairs are completed, a practical planning bound for long repairs.
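As a small illustration of the key statistics above, the sketch below computes median, mean, and Mmax from lognormal parameters. The function name and the example values (median repair of 4 hours, sigma = 0.8) are assumptions chosen for illustration, not fitted data.

```python
import math
from statistics import NormalDist

def lognormal_repair_stats(mu, sigma, percentile=0.95):
    """Median, mean and upper-percentile repair time for a lognormal model.

    mu and sigma are the mean and standard deviation of ln(repair time).
    """
    z = NormalDist().inv_cdf(percentile)  # about 1.6449 for the 95th percentile
    return {
        "median": math.exp(mu),
        "mean": math.exp(mu + 0.5 * sigma ** 2),
        "mmax": math.exp(mu + z * sigma),
    }

# Illustrative parameters: median repair of 4 hours, sigma of 0.8 in log space
stats = lognormal_repair_stats(mu=math.log(4.0), sigma=0.8)
print({name: round(value, 2) for name, value in stats.items()})
```

The ordering median < mean < Mmax always holds for sigma > 0, which is why a single MTTR figure understates the time needed for long repairs.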

25.2 The Bathtub Curve

The bathtub curve is the composite failure rate function h(t) for a population of components. It is not a single distribution but the superposition of three competing failure mechanisms, each dominant in a different life phase.

Phase I: Infant Mortality (beta < 1)

The failure rate is decreasing. Failures are caused by:

Manufacturing defects that escaped quality control
Installation errors (misalignment, wrong torque, contamination)
Commissioning faults (incorrect settings, missing lubrication)
The Waddington Effect: maintenance itself introduces defects

Duration: Typically the first 5-10% of design life, or the first few hundred operating hours for rotating machinery.

Maintenance strategy: Burn-in screening, acceptance testing, commissioning procedures. Time-based replacement makes things WORSE in this region because every replacement resets the clock to the high-failure-rate zone.

Phase II: Useful Life (beta = 1)

The failure rate is approximately constant. Failures are:

Random external events (power surges, foreign object damage, operator errors)
Stress exceedances beyond design envelope
Hidden defects that manifest at random times

Duration: The majority of service life for well-designed equipment.

Maintenance strategy: Condition-based monitoring (CBM) is optimal. Time-based replacement provides no benefit because the failure rate is constant — the item is equally likely to fail immediately after replacement as before.

Phase III: Wear-Out (beta > 1)

The failure rate is increasing. Failures are caused by:

Mechanical wear (bearings, seals, impellers)
Fatigue (shafts, blades, structural members)
Corrosion (piping, heat exchanger tubes)
Material degradation (insulation, elastomers, lubricants)

Duration: Onset varies by component. Bearings may enter wear-out after 30,000 hours; corrosion may take decades.

Maintenance strategy: This is the ONLY region where time-based replacement works. Schedule replacement before the B10 life. CBM can extend intervals by monitoring actual condition rather than relying on population statistics.

The Nowlan and Heap Finding

The landmark 1978 study by Nowlan and Heap (two United Airlines engineers, in a report sponsored by the US Department of Defense) analysed the age-reliability patterns of aircraft components and found:

4% follow the classic bathtub curve (infant mortality + constant + wear-out)
2% follow a wear-out pattern only (constant, then a pronounced wear-out region)
5% follow a slowly increasing pattern
7% follow a low failure rate when new, rising quickly to a constant level
14% follow a constant failure rate (flat)
68% follow an infant mortality pattern (decreasing, then constant)

Total: 82% of items follow a constant or infant-mortality pattern (14% + 68%), and only 11% (4% + 2% + 5%) show the age-related wear-out that time-based replacement addresses. Counting every pattern with any rising segment, time-based replacement can be effective for at most 18% of items. The remaining 82% require condition-based monitoring or redesign, which is exactly what RAPID AI provides.

25.3 Key Reliability Metrics

MTBF — Mean Time Between Failures

MTBF = Total Operating Time / Number of Failures

For repairable systems only. MTTF (Mean Time To Failure) is used for non-repairable items.

Complete data (all failures observed):

MTBF = operating_hours_total / n_failures

Censored data (some units still running):

Total_time = sum of ALL operating intervals (failed + surviving)
MTBF = Total_time / n_failures    # only actual failures in denominator

Chi-squared confidence interval (assuming exponential distribution):

MTBF_lower = 2*T / chi2(1 - alpha/2, 2*n + 2)
MTBF_upper = 2*T / chi2(alpha/2, 2*n)

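where T = total accumulated time, n = number of failures, alpha = significance level, and chi2(p, k) denotes the p-quantile of the chi-squared distribution with k degrees of freedom. The 2*n + 2 degrees of freedom in the lower bound correspond to a time-truncated observation period.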

Common misuse: MTBF is NOT expected life. A pump with MTBF = 50,000 hours does not “last” 50,000 hours. MTBF is the reciprocal of failure rate, valid only when the failure rate is constant (beta = 1). For wear-out failures (beta > 1), the B10 or B50 life from Weibull analysis is the correct planning metric.

MTTR — Mean Time To Repair

MTTR = Total Repair Time / Number of Repairs

Components of MTTR:

Diagnosis time (fault identification)
Parts procurement time (logistics delay)
Repair execution time (wrench time)
Testing and recommissioning time

Lognormal model (preferred for planning):

Median repair time = exp(mu)
Mean repair time   = exp(mu + 0.5 * sigma^2)
Mmax (95th pctile) = exp(mu + 1.6449 * sigma)

The Mmax is critical for spare parts and crew planning: it is the duration within which 95% of repairs are completed, so it bounds all but the longest 5% of repairs.

Maintenance workload:

Monthly_workload = total_labour_hours / observation_months
Man_hours = Monthly_workload * crew_size

Availability

Inherent availability (excludes logistics delay):

A_i = MTBF / (MTBF + MTTR)

Operational availability (includes everything):

A_o = MTBM / (MTBM + MDT)

where MTBM = Mean Time Between Maintenance (all events), MDT = Mean Down Time (repair + waiting + logistics + admin).

Direct method (preferred when runtime is logged):

A = Operating_Hours / Calendar_Hours

Targets for critical rotating equipment: 97-99.5% availability. A single percentage point of availability for a critical pump can represent millions in production value.

Failure Rate lambda(t)

The instantaneous failure rate (hazard rate) at time t:

lambda(t) = f(t) / R(t)

For Weibull:

lambda(t) = (beta / eta) * (t / eta)^(beta - 1)

Interpretation:

lambda(t) is a rate (per unit time), not a probability
For constant failure rate: lambda = 1/MTBF
The cumulative hazard H(t) = integral of lambda from 0 to t
R(t) = exp(-H(t))

25.4 The P-F Curve

The P-F curve is the conceptual foundation of condition-based maintenance. It describes how a failure develops over time, from undetectable to catastrophic.

Point P (Potential Failure): The earliest point at which incipient failure can be detected by a condition monitoring technique. The machine is still functional but degradation is measurable.

Point F (Functional Failure): The point at which the machine can no longer perform its required function to the required standard.

P-F Interval: The elapsed time between P and F. This is the window of opportunity for planned intervention.

The Half-Interval Rule

The inspection interval must be no greater than half the P-F interval:

Inspection_interval <= P-F / 2

This ensures that at least one inspection detects the potential failure with at least half the P-F interval still remaining, and that a second inspection falls inside the P-F window to confirm the finding and plan the response.

Detection Methods and Typical P-F Intervals

Detection Method	Typical P-F Interval	What It Detects
Vibration analysis	1-12 months	Bearing defects, misalignment, unbalance, looseness
Oil analysis	1-6 months	Wear debris, contamination, lubricant degradation
Thermography	1-4 weeks	Hot bearings, electrical faults, insulation breakdown
Ultrasound	1-6 months	Bearing lubrication, steam/gas leaks, electrical arcing
Performance monitoring	Days to weeks	Efficiency loss, fouling, internal wear
Audible noise	1-4 weeks	Late-stage bearing failure, cavitation, mechanical looseness
Visual inspection	Days to weeks	Leaks, corrosion, looseness, crack propagation

How RAPID AI Uses the P-F Curve

Module F estimates where on the P-F curve the asset currently sits using three inputs:

Current severity score (S): How far the measured parameters deviate from baseline. Higher S means closer to point F.
Degradation slope (log_slope): The rate of change in log-domain. Steeper slope means shorter time to F.
Instability index (NLI): How erratic the degradation signal is. High NLI means the asset may be near the “knee” of the P-F curve where degradation accelerates.

The RUL estimate is essentially the predicted distance from the current position to point F, expressed in days.

25.5 Risk Priority Number (RPN)

The Risk Priority Number comes from Failure Mode and Effects Analysis (FMEA). It quantifies the risk of each failure mode as the product of three factors.

RPN = Severity * Occurrence * Detection

Each factor is scored on a scale of 1 to 10:

Severity (S): How bad is the consequence if this failure mode occurs?

1 = No effect
5 = Moderate production impact, no safety concern
8 = Major production loss or environmental impact
10 = Safety-critical, potential for injury or death

Occurrence (O): How likely is this failure mode?

1 = Nearly impossible (< 1 in 100,000)
5 = Moderate (1 in 1,000)
8 = High (1 in 20)
10 = Near certain (> 1 in 2)

Detection (D): How likely is the failure to be detected before it reaches the customer or causes harm?

1 = Certain detection (automatic monitoring with proven track record)
5 = Moderate (periodic inspection catches most occurrences)
8 = Low (failure rarely detected before consequence)
10 = No detection method exists

RPN range: 1 to 1,000.

How RAPID AI Computes RPN Dynamically

Traditional FMEA assigns static RPN values during a design review. RAPID AI makes this dynamic by computing live Occurrence scores from sensor data:

Severity is pre-assigned per failure mode in the IMS (Integrated Master Schema), scored by domain experts based on consequence category
Occurrence is updated from real-time sensor evidence — the confidence score from the rule evaluator maps to probability of the failure mode being active
Detection is adjusted based on which sensors are installed and reporting. More active sensor categories = lower detection score (better detection)

Risk Index (Module F simplified form):

RI = 100 * severity_S * criticality_K

where severity_S is the normalized severity (0-1) and criticality_K is the asset criticality factor (0-1). Range: 0 to 100.

Limitations of RPN

RPN has well-documented weaknesses:

Non-linear masking: S=9, O=2, D=2 gives RPN=36, while S=3, O=6, D=6 gives RPN=108. The safety-critical failure ranks lower.
Ordinal scale treated as ratio: The multiplication assumes equal intervals between scores, which is not justified.
Clustering in the mid-range: Many failure modes cluster around RPN 100-200, making prioritization difficult.
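The masking effect in the first limitation is easy to reproduce. This is a minimal sketch of the classic RPN product only; the function name is illustrative and it does not represent the dynamic scoring in RAPID AI.

```python
def rpn(severity, occurrence, detection):
    """Classic FMEA Risk Priority Number; each factor is scored 1 to 10."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("each factor must be in the range 1..10")
    return severity * occurrence * detection

safety_critical = rpn(9, 2, 2)  # 36
nuisance = rpn(3, 6, 6)         # 108

# The failure mode with the far worse consequence ranks lower on RPN alone
print(safety_critical < nuisance)
```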

RAPID AI supplements RPN with:

Confidence scores (0.0-1.0) from the diagnostic engine
Consequence categories (safety, environmental, production, economic) for direct severity comparison
Risk Index (0-100) as a continuous, non-discretized alternative

25.6 System Reliability

Individual component reliability tells you about one machine. System reliability tells you about the production line.

Series Systems

All components must function for the system to function. The system fails when ANY component fails.

R_system = R_1 * R_2 * ... * R_n

Key insight: Series reliability is never higher than that of the least reliable component, and it is strictly lower whenever any other component is less than perfectly reliable. A line of 10 machines, each at 99% reliability, gives system reliability of only 90.4%.

Series availability:

A_system = A_1 * A_2 * ... * A_n

Parallel Systems (Active Redundancy)

The system fails only if ALL components fail simultaneously.

R_system = 1 - (1 - R_1) * (1 - R_2) * ... * (1 - R_n)

Key insight: Parallel redundancy dramatically improves reliability. Two pumps at 95% each give system reliability of 99.75%.

k-out-of-n Systems

The system functions if at least k of n identical components function.

R_system = SUM from i=k to n of [C(n,i) * R^i * (1-R)^(n-i)]

where C(n,i) is the binomial coefficient.

Common application: 2-out-of-3 pump configurations.

R_2of3(R=0.95) = 3*(0.95)^2*(0.05) + (0.95)^3 = 0.9928
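The series, parallel, and k-out-of-n formulas can be checked against the worked figures in this section. The sketch below assumes independent components and uses illustrative function names.

```python
import math

def series(reliabilities):
    """All components must work: product of the reliabilities."""
    return math.prod(reliabilities)

def parallel(reliabilities):
    """System fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

def k_out_of_n(k, n, r):
    """At least k of n identical, independent components must work."""
    return sum(math.comb(n, i) * r ** i * (1.0 - r) ** (n - i)
               for i in range(k, n + 1))

print(round(series([0.99] * 10), 4))      # ten machines in series
print(round(parallel([0.95, 0.95]), 4))   # two redundant pumps
print(round(k_out_of_n(2, 3, 0.95), 5))   # 2-out-of-3 pumps
```

Note that k = n reduces to the series case and k = 1 to the parallel case, so the k-out-of-n form covers both.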

Standby Redundancy

Cold standby: Backup has zero failure rate while idle. Switchover delay is the risk.
Warm standby: Backup runs at reduced duty; lower failure rate than active.
Hot standby: Backup runs at full duty; same failure rate as active (this IS active parallel redundancy).

Common Cause Failure (Beta-Factor Model)

Redundancy assumes independent failures. In reality, common causes (power loss, flooding, design defects) can fail multiple components simultaneously.

The beta-factor model separates the failure rate into independent and common-cause components:

lambda_total = lambda_independent + lambda_common
beta_ccf = lambda_common / lambda_total

For a two-unit parallel system with common cause:

R_system = exp(-lambda_independent * t) * (2 - exp(-lambda_independent * t))
           * exp(-lambda_common * t)

Typical beta_ccf values: 0.01-0.10 for well-separated redundant systems.
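To see how quickly common cause erodes the benefit of redundancy, the two-unit formula can be evaluated for a few beta_ccf values. This is a sketch under the beta-factor assumptions stated above; the function name and the example failure rate are illustrative.

```python
import math

def parallel_pair_with_ccf(lambda_total, beta_ccf, t):
    """Reliability of two redundant units under the beta-factor model."""
    lambda_common = beta_ccf * lambda_total
    lambda_independent = lambda_total - lambda_common
    r_ind = math.exp(-lambda_independent * t)
    # Independent part behaves as a 1-out-of-2 system; the common-cause
    # part acts like a single element in series with the pair.
    return r_ind * (2.0 - r_ind) * math.exp(-lambda_common * t)

# Illustrative numbers: 1e-4 failures per hour, one year of operation
for beta_ccf in (0.0, 0.05, 0.10):
    r = parallel_pair_with_ccf(1e-4, beta_ccf, 8760.0)
    print(f"beta_ccf = {beta_ccf:.2f}: R = {r:.4f}")
```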

RAPID AI Dependency-Weighted Risk Propagation

RAPID AI models plant topology as a weighted directed graph. Each edge carries dependency_strength (how tightly coupled are the assets) and production_impact_weight (how much does upstream failure affect downstream output).

Propagated risk for a target node:

propagated_risk(target) = SUM over upstream edges of:
    source_risk * dependency_strength * production_impact_weight

Dependency-adjusted risk:

adjusted_risk(target) = max(own_risk, propagated_risk)

The graph is traversed in topological order (Kahn’s algorithm) so that adjusted risks propagate through multi-level chains. This identifies critical path equipment and reliability bottlenecks that would be invisible from single-asset analysis.
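The propagation rule can be sketched with a plain dictionary of risks and a list of edges. This is an illustration of the formulas above, not the production graph code: the function name, the tuple layout of an edge, and the choice to propagate the upstream node's adjusted risk (rather than its own risk) are assumptions.

```python
from collections import defaultdict, deque

def propagate_risk(own_risk, edges):
    """Dependency-adjusted risk for every node of a directed acyclic graph.

    own_risk: {node: risk}
    edges: [(source, target, dependency_strength, production_impact_weight)]
    Every node named in an edge must also appear in own_risk.
    """
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    indegree = {node: 0 for node in own_risk}
    for source, target, strength, weight in edges:
        incoming[target].append((source, strength, weight))
        outgoing[source].append(target)
        indegree[target] += 1

    # Kahn's algorithm: process a node only after all of its upstream nodes
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    adjusted = {}
    while queue:
        node = queue.popleft()
        propagated = sum(adjusted[src] * s * w for src, s, w in incoming[node])
        adjusted[node] = max(own_risk[node], propagated)
        for downstream in outgoing[node]:
            indegree[downstream] -= 1
            if indegree[downstream] == 0:
                queue.append(downstream)

    if len(adjusted) != len(own_risk):
        raise ValueError("dependency graph contains a cycle")
    return adjusted

# Hypothetical three-asset chain: pump -> exchanger -> compressor
risks = {"pump": 0.8, "exchanger": 0.1, "compressor": 0.2}
links = [("pump", "exchanger", 0.9, 0.5), ("exchanger", "compressor", 1.0, 0.5)]
result = propagate_risk(risks, links)
print(result)
```

In this example the exchanger inherits most of its risk from the pump, while the compressor keeps its own risk because the propagated value is smaller.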

25.7 Condition-Adjusted RUL (Module F Mathematics)

Module F is the prognostic heart of RAPID AI. It estimates Remaining Useful Life from the current condition state using three models, selected automatically based on the degradation characteristics.

Input Contract

Field	Type	Range	Source
severity_score_S	float	0-1	Modules A/B/C
confidence_C	float	0-1	Confidence engine
log_slope	float	any	Trend analysis (log-domain slope per day)
slope_change	float	any	Second derivative of log trend
instability_index_NLI	float	0-1	Non-linearity index
current_value	float	> 0	Current measurement amplitude
failure_threshold	float	> 0	Defined failure threshold
criticality_K	float	0-1	Asset criticality factor

Model F001: Linear Degradation

Selection condition: slope_change < 0.01 AND NLI < 0.3

The degradation is growing at a steady exponential rate in the original domain (linear in the log domain). The time to reach the failure threshold is:

RUL = (ln(failure_threshold) - ln(current_value)) / slope_log

Physical interpretation: If current vibration is 5.0 mm/s, the failure threshold is 8.0 mm/s, and the log-domain slope is 0.05 per day:

RUL = (ln(8.0) - ln(5.0)) / 0.05
    = (2.0794 - 1.6094) / 0.05
    = 0.4700 / 0.05
    = 9.4 days

Guard conditions: If slope_log = 0, there is no degradation trend and RUL is returned as 0 (indicating insufficient data, not imminent failure). If current_value or failure_threshold is non-positive, the logarithm is undefined and RUL returns 0.

Model F002: Accelerating Degradation

Selection condition: slope_change >= 0.01

The degradation rate itself is increasing (the second derivative of the log trend is positive). This typically occurs when a bearing defect reaches the spalling stage, or when a crack enters the rapid propagation phase.

RUL = ln(failure_threshold / current_value) / (slope_log + slope_change)

Why the effective slope is slope_log + slope_change: This is a first-order approximation. The true trajectory is nonlinear, but over the remaining life window, using the sum of the current slope and its rate of change provides a conservative (shorter) RUL estimate — which is the safe direction for maintenance planning.

Model F003: High Instability

Selection condition: NLI >= 0.6

The degradation signal is highly erratic — the non-linearity index (NLI) indicates that the log-trend residuals are large relative to the trend itself. This occurs when the failure mechanism is chaotic (e.g., intermittent rub, loose bearing race) or when the asset is near the “knee” of the P-F curve.

Base_RUL = F001 formula (linear estimate)
RUL = Base_RUL * (1 - NLI)

Physical interpretation: If the linear model estimates 50 days but NLI = 0.7, the adjusted RUL is 50 * (1 - 0.7) = 15 days. The instability itself is treated as evidence that failure is closer than the trend line suggests.

At NLI = 1.0: RUL = 0 (total instability implies imminent failure).

Model Selection Priority

The models are evaluated in specification order, and the first match wins:

Check F001 conditions first (slope_change < 0.01 AND NLI < 0.3)
If slope_change >= 0.01, use F002 (acceleration takes priority over instability, even when NLI >= 0.6)
Otherwise, if NLI >= 0.6, use F003 (instability with no significant acceleration)
Fallback: F001 (for the gap where 0.3 <= NLI < 0.6 with low slope_change)
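The three models and the selection order can be combined into one short routine. The following is a sketch of the formulas in this section, not the code in rul_engine.py. The function names are assumptions, log_slope is the input-contract field written as slope_log in the formulas, and applying the zero-slope guard to the effective F002 slope is an assumption made to avoid division by zero.

```python
import math

def select_model(slope_change, nli):
    """Return the model code using the evaluation order described above."""
    if slope_change < 0.01 and nli < 0.3:
        return "F001"
    if slope_change >= 0.01:
        return "F002"
    if nli >= 0.6:
        return "F003"
    return "F001"  # gap: 0.3 <= NLI < 0.6 with low slope_change

def estimate_rul(current_value, failure_threshold, log_slope, slope_change, nli):
    """Remaining useful life in days; 0.0 when a guard condition applies."""
    if current_value <= 0 or failure_threshold <= 0:
        return 0.0  # logarithm undefined
    model = select_model(slope_change, nli)
    effective_slope = log_slope + slope_change if model == "F002" else log_slope
    if effective_slope == 0:
        return 0.0  # no trend (assumed to apply to the F002 slope as well)
    # A negative slope yields a negative RUL; the text does not specify that case.
    rul = math.log(failure_threshold / current_value) / effective_slope
    if model == "F003":
        rul *= (1.0 - nli)
    return rul

# Worked example from the F001 section: 5.0 -> 8.0 mm/s at 0.05 per day
print(round(estimate_rul(5.0, 8.0, 0.05, 0.0, 0.1), 1))
```

Because the zero-slope guard returns 0.0 to signal insufficient data, a caller would need to track that case separately from a genuinely exhausted RUL.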

Failure Probability (30-Day Horizon)

Given the RUL estimate, the 30-day failure probability uses the exponential hazard model:

lambda = 1 / RUL_days                           (hazard rate)
P_raw = 1 - exp(-lambda * 30)                   (raw 30-day probability)
P_adjusted = P_raw * confidence_C               (confidence-adjusted)

Confidence adjustment rationale: If the upstream diagnostic confidence is only 0.6, the failure probability should be discounted accordingly. A highly uncertain diagnosis should not trigger the same urgency as a confident one.

When RUL <= 0: P_raw = 1.0 (certain failure). The confidence adjustment still applies: if confidence is 0.5, the adjusted probability is 0.5 — reflecting that we are uncertain about both the diagnosis and the imminence.
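The probability calculation, including the RUL <= 0 case, fits in a few lines. This is a minimal sketch with an assumed function name and an assumed horizon_days parameter; the 30-day default follows the text.

```python
import math

def failure_probability(rul_days, confidence_c, horizon_days=30.0):
    """Return (P_raw, P_adjusted) for the given horizon."""
    if rul_days <= 0:
        p_raw = 1.0  # RUL exhausted: treated as certain failure
    else:
        hazard = 1.0 / rul_days  # constant hazard over the window
        p_raw = 1.0 - math.exp(-hazard * horizon_days)
    return p_raw, p_raw * confidence_c

# Illustrative inputs: RUL of 60 days, diagnostic confidence of 0.6
p_raw, p_adjusted = failure_probability(60.0, 0.6)
print(f"P_raw = {p_raw:.3f}, P_adjusted = {p_adjusted:.3f}")
```

One consequence of setting the hazard to 1/RUL is that an asset with RUL equal to the horizon has a raw probability of about 63%, not 100%.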

Risk Index

RI = 100 * severity_S * criticality_K

Range: 0 to 100. Combines “how bad” (severity) with “how important” (criticality) to produce a single prioritization score. Used to rank assets for maintenance scheduling.

Recommended Intervention Window

RUL (days)	Recommendation
<= 7	Immediate
<= 30	7-day window
<= 90	30-day window
> 90	Planned (next scheduled outage)

25.8 Shannon Entropy (Module B.3 Mathematics)

RAPID AI uses information-theoretic entropy to quantify how “disordered” or “concentrated” the energy in a vibration signal is. A healthy machine distributes energy predictably across known frequencies. A deteriorating machine scatters energy in unexpected ways.

Shannon Entropy

For a discrete probability distribution P = {p(x_1), p(x_2), …, p(x_N)}:

H(X) = -SUM from i=1 to N of [p(x_i) * log2(p(x_i))]

Units: bits (when using log base 2).

Range: 0 (all energy in one bin — perfect certainty) to log2(N) (uniform distribution — maximum uncertainty).

Normalized entropy:

H_norm = H(X) / log2(N)

Range: 0 to 1. This allows comparison across signals with different numbers of frequency bins.
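Both entropy definitions translate directly into code. The sketch below accepts any list of non-negative weights (PSD bins, per-direction RMS values) and normalizes them first; the function names are illustrative and zero-weight bins are skipped using the usual convention that 0 * log2(0) = 0.

```python
import math

def shannon_entropy(weights):
    """Shannon entropy in bits of non-negative weights, normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probabilities = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probabilities)

def normalized_entropy(weights):
    """Entropy divided by log2(N), giving a value between 0 and 1."""
    n = len(weights)
    if n < 2:
        return 0.0
    return shannon_entropy(weights) / math.log2(n)

print(shannon_entropy([1.0, 0.0, 0.0, 0.0]) == 0)  # all energy in one bin
print(shannon_entropy([1.0, 1.0, 1.0, 1.0]))       # uniform over 4 bins
print(round(normalized_entropy([0.7, 0.2, 0.1]), 3))
```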

Spectral Entropy (SE)

Measures how uniformly energy is distributed across frequency bands in a single measurement direction.

Computation:

Compute the power spectral density (PSD) of the vibration signal
Normalize the PSD to a probability distribution: p(f_i) = PSD(f_i) / SUM(PSD)
Compute Shannon entropy of the normalized PSD

Interpretation:

Low SE (< 0.3): Energy concentrated in a few dominant frequencies. The machine has clear tonal signatures — either healthy (1x, blade pass) or a specific defect (bearing frequencies, gear mesh).
High SE (> 0.7): Energy spread across many frequencies. Broadband energy indicates turbulence, looseness, or multiple competing fault mechanisms.

Temporal Entropy (TE)

Measures signal consistency over successive time windows. The signal is divided into overlapping windows and the same feature (e.g., RMS amplitude) is computed in each.

Computation:

Divide the time-domain signal into W overlapping windows
Compute a scalar feature (RMS, peak, kurtosis) in each window
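Normalize the feature values to a probability distribution. For the interpretation below to hold, this should be the distribution of feature values across windows (for example, a histogram); normalizing the per-window values directly, p_w = x_w / SUM(x), would give a perfectly steady signal the maximum entropy, not the minimum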
Compute Shannon entropy

Interpretation:

Low TE: The signal is consistent over time — steady-state operation.
High TE: The signal varies significantly between windows — transient events, intermittent contact, or load fluctuations.

Directional Entropy (DE)

Measures how energy is distributed across measurement directions (axial, horizontal, vertical).

Computation:

Compute RMS amplitude in each direction: A, H, V
Normalize to a probability distribution: p(d) = RMS(d) / (A + H + V)
Compute Shannon entropy of the three-element distribution

Interpretation:

Low DE: Energy dominated by one direction — consistent with specific fault types (axial = misalignment, radial = unbalance).
High DE (approaching log2(3) = 1.58 bits, or H_norm approaching 1.0): Energy equally distributed — structural looseness, multiple faults, or healthy baseline.

Stability Index

The Stability Index (SI) combines the three entropy measures into a single health indicator:

SI = 1 - (0.5 * SE + 0.3 * TE + 0.2 * DE)

Weights rationale:

Spectral Entropy (0.5): The dominant contributor because spectral shape is the most informative feature for fault detection.
Temporal Entropy (0.3): Captures non-stationarity, which indicates developing faults.
Directional Entropy (0.2): Supplements with directional information but is the least discriminating on its own.

Interpretation:

SI close to 1.0: Low entropy across all dimensions. Energy is flowing through the machine in a predictable, organized pattern. The machine is healthy.
SI close to 0.0: High entropy everywhere. Energy is scattered, trapped, or misdirected. The machine is degraded or has multiple active fault mechanisms.

Physical analogy: A healthy pump converts electrical energy into hydraulic energy along a defined path (motor -> coupling -> shaft -> impeller -> fluid). Entropy measures how much energy leaks off this path into vibration, heat, and noise. Low leakage = high SI = healthy. High leakage = low SI = degraded.
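The Stability Index formula is a weighted sum and can be sketched as follows. The function name is illustrative, and the sketch assumes all three inputs are normalized entropies in the range 0 to 1, which the formula needs for SI to stay between 0 and 1.

```python
def stability_index(se, te, de):
    """SI = 1 - (0.5*SE + 0.3*TE + 0.2*DE) for normalized entropies in [0, 1]."""
    for name, value in (("se", se), ("te", te), ("de", de)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a normalized entropy in [0, 1]")
    return 1.0 - (0.5 * se + 0.3 * te + 0.2 * de)

# Illustrative readings: tonal spectrum, steady signal, moderate directional spread
si = stability_index(se=0.2, te=0.1, de=0.5)
print(round(si, 2))
```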

25.9 SSI Fusion Mathematics (Module C)

The System Severity Index (SSI) is Module C’s composite health score. It fuses evidence from multiple diagnostic blocks (vibration, thermal, electrical, process, inspection) into a single 0-1 severity score per asset.

Block Scores

Each diagnostic block produces a block score B_k from the component-level evidence within that block:

B_k = weighted combination of component-level severity indicators

The exact weighting depends on the block type. For vibration, the components might be 1x amplitude, bearing defect frequencies, broadband energy, and axial/radial ratio. Each is weighted by its diagnostic significance for the asset type.

Profile-Weighted Fusion

Different asset types weight the blocks differently. A motor weights electrical evidence more heavily than a pump does. A heat exchanger weights thermal evidence more heavily than a compressor does.

The SSI is the profile-weighted mean of block scores:

SSI = SUM(w_k * B_k) / SUM(w_k)

where w_k is the weight assigned to block k by the asset profile.

Block-Score-Range (BSR) Override

When any single block score exceeds a critical threshold, the SSI is overridden regardless of the weighted average. This prevents a catastrophic vibration reading from being diluted by normal temperature and electrical readings.

BSR override rules:

If any B_k > 0.85: SSI = max(SSI_calculated, B_k * 0.95)
If any B_k > 0.95: SSI is forced to at least 0.90

Normalization for Missing Blocks

When sensor data is unavailable for one or more blocks, the SSI adjusts:

SSI = SUM(w_k * B_k for available blocks) / SUM(w_k for available blocks)

The confidence score is reduced proportionally to reflect the incomplete evidence.
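The fusion, override, and missing-block rules can be combined in one function. This is a sketch under assumptions: the function name, the use of None to mark an unavailable block, and applying the BSR override after the missing-block normalization are illustrative choices, not confirmed details of Module C.

```python
def fuse_ssi(block_scores, weights):
    """Profile-weighted SSI with BSR override; None marks a missing block.

    block_scores: {block_name: score in [0, 1] or None}
    weights: {block_name: profile weight w_k}
    Returns None when no block has data.
    """
    available = {k: b for k, b in block_scores.items() if b is not None}
    total_weight = sum(weights[k] for k in available)
    if not available or total_weight <= 0:
        return None
    ssi = sum(weights[k] * b for k, b in available.items()) / total_weight

    worst = max(available.values())
    if worst > 0.85:
        ssi = max(ssi, worst * 0.95)  # keep a critical block from being diluted
    if worst > 0.95:
        ssi = max(ssi, 0.90)          # floor for near-saturated blocks
    return ssi

# Hypothetical profile: electrical data unavailable, vibration critical
weights = {"vibration": 0.5, "thermal": 0.3, "electrical": 0.2}
scores = {"vibration": 0.9, "thermal": 0.1, "electrical": None}
print(round(fuse_ssi(scores, weights), 3))
```

Here the weighted mean of the two available blocks is 0.6, and the override lifts the result to 0.9 * 0.95 = 0.855.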

Module C Confidence

The confidence of the SSI is computed from two factors:

C_module_c = 0.6 * SSI_quality + 0.4 * SEI

where SSI_quality measures the completeness and consistency of the block scores, and SEI (Sensor Evidence Index) measures how many sensor types contributed to the assessment.

25.10 Weibull Fitting Algorithms

RAPID AI implements two methods for fitting Weibull parameters from field data.

Median Rank Regression (MRR)

Used when data is sparse (3-20 failures). The algorithm:

Sort failure times ascending. Separate uncensored failures from censored (suspended) items.
Assign median ranks using Bernard’s approximation:
```
F(i) = (i - 0.3) / (n + 0.4)
```
For censored data, use the adjusted rank method with increments:
```
increment = (n + 1 - prev_rank) / (remaining + 1)
```

Linearize to Weibull probability paper:

X_i = ln(t_i)
Y_i = ln(ln(1 / (1 - F_i)))

Ordinary Least Squares regression on (X, Y) to get slope and intercept.

Recover parameters:

beta = slope
eta = exp(-intercept / beta)

Goodness of fit: R-squared >= 0.85 for “fitted” status; >= 0.70 for “fitted_low_confidence”; below 0.70 is “poor_fit”.
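For complete (uncensored) data, the MRR steps reduce to a short least-squares fit. The sketch below is an illustration, not the code in weibull_model_engine.py; the function name is an assumption and the adjusted-rank handling for censored items is deliberately left out.

```python
import math

def fit_weibull_mrr(failure_times):
    """Median rank regression for complete data; returns (beta, eta, r_squared)."""
    times = sorted(failure_times)
    n = len(times)
    if n < 3:
        raise ValueError("need at least 3 failures")
    xs = [math.log(t) for t in times]
    ys = []
    for i in range(1, n + 1):
        f = (i - 0.3) / (n + 0.4)  # Bernard's approximation
        ys.append(math.log(math.log(1.0 / (1.0 - f))))

    # Ordinary least squares of Y on X
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    syy = sum((y - y_mean) ** 2 for y in ys)
    if sxx == 0:
        raise ValueError("failure times must not all be identical")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    beta = slope
    eta = math.exp(-intercept / beta)
    r_squared = (sxy * sxy) / (sxx * syy)
    return beta, eta, r_squared

# Illustrative failure times in operating hours (not real field data)
beta, eta, r2 = fit_weibull_mrr([410.0, 780.0, 1020.0, 1350.0, 1900.0])
print(f"beta = {beta:.2f}, eta = {eta:.0f} h, R^2 = {r2:.3f}")
```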

Maximum Likelihood Estimation (MLE)

Used for large samples (20 or more failures) or when censored data is present. Statistically optimal for larger datasets.

Log-likelihood for mixed data:

L(beta, eta) = SUM_failed  [ln(beta) - beta*ln(eta) + (beta-1)*ln(t_i) - (t_i/eta)^beta]
             + SUM_censored [-(t_j/eta)^beta]

Score equation for beta (set dL/d(beta) = 0 and eliminate eta using the closed form given next):

n_f/beta + SUM_f(ln(t_i)) - n_f * [SUM_all(t_i^beta * ln(t_i))] / [SUM_all(t_i^beta)] = 0

Closed-form eta given beta:

eta = (SUM_all(t_i^beta) / n_f)^(1/beta)

Solve numerically using Brent’s method (bracket beta in [0.01, 200]).
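The MLE procedure can be sketched with the standard library. Note the substitutions: this sketch uses plain bisection instead of Brent's method to avoid a third-party dependency, and it scales every time by the largest observation so that t^beta does not overflow near the upper end of the bracket. The function name is an assumption. The score function is the profile form of dL/d(beta) = 0, in which the ratio of sums carries a factor n_f.

```python
import math

def fit_weibull_mle(failures, censored=()):
    """Two-parameter Weibull MLE with right-censored data; returns (beta, eta)."""
    if len(failures) < 2:
        raise ValueError("need at least 2 failures")
    all_times = list(failures) + list(censored)
    n_f = len(failures)
    t_max = max(all_times)
    sum_ln_failed = sum(math.log(t) for t in failures)

    def score(beta):
        # Weights (t/t_max)^beta leave the ratio of sums unchanged
        w = [(t / t_max) ** beta for t in all_times]
        weighted_ln = sum(wi * math.log(t) for wi, t in zip(w, all_times))
        return n_f / beta + sum_ln_failed - n_f * weighted_ln / sum(w)

    lo, hi = 0.01, 200.0
    if score(lo) <= 0 or score(hi) >= 0:
        raise ValueError("no root for beta inside the bracket")
    for _ in range(200):  # the score is decreasing in beta, so bisection is safe
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)

    scaled_sum = sum((t / t_max) ** beta for t in all_times)
    eta = t_max * (scaled_sum / n_f) ** (1.0 / beta)
    return beta, eta

# Illustrative data: five failures and two units still running at 1000 hours
beta, eta = fit_weibull_mle([120.0, 340.0, 560.0, 610.0, 900.0], [1000.0, 1000.0])
print(f"beta = {beta:.3f}, eta = {eta:.1f} h")
```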

Method Selection

Condition	Method	Status
n_failures < 3	None	insufficient_data
3 <= n_failures < 5	MRR	low_sample_warning
5 <= n_failures < 20, no censoring	MRR	ok
n_failures >= 20 or censored data	MLE	ok

25.11 Reliability Growth (Duane Model)

The Duane model tracks whether reliability is improving or deteriorating over time. Used in the Reliability Intelligence Layer to identify assets where maintenance interventions are (or are not) having the desired effect.

Cumulative MTBF:

MTBF_cum(T) = T / N(T)

where N(T) is the cumulative number of failures by time T.

Duane plot: Log(MTBF_cum) vs Log(T). If the relationship is linear:

ln(MTBF_cum) = a + alpha * ln(T)

Duane slope alpha:

alpha > 0: Reliability is improving (MTBF is increasing over time)
alpha = 0: Reliability is constant
alpha < 0: Reliability is deteriorating

Typical targets: alpha = 0.3-0.5 indicates healthy reliability growth during a commissioning or improvement program.
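The Duane slope is the least-squares slope of ln(MTBF_cum) against ln(T). A minimal sketch, assuming the input is the cumulative operating time at each failure; the function name and the example data are illustrative.

```python
import math

def duane_slope(failure_times):
    """Fit ln(MTBF_cum) = a + alpha * ln(T); returns (alpha, a).

    failure_times: cumulative operating time at each successive failure.
    """
    times = sorted(failure_times)
    if len(times) < 2:
        raise ValueError("need at least 2 failures")
    xs, ys = [], []
    for count, t in enumerate(times, start=1):
        xs.append(math.log(t))
        ys.append(math.log(t / count))  # MTBF_cum(T) = T / N(T)

    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("failure times must not all be identical")
    alpha = sxy / sxx
    a = y_mean - alpha * x_mean
    return alpha, a

# Illustrative history: gaps between failures grow, so alpha should be positive
alpha, a = duane_slope([100.0, 250.0, 480.0, 800.0, 1250.0, 1900.0])
print(f"alpha = {alpha:.2f}")
```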

Summary

The mathematics in this chapter is not a textbook exercise. Every equation maps to a production code path in RAPID AI:

Mathematics	Implementation	Purpose
Weibull distribution	weibull_model_engine.py	Failure pattern classification
Exponential hazard	rul_engine.py	30-day failure probability
F001/F002/F003 models	rul_engine.py	Remaining Useful Life estimation
Shannon entropy	Module B.3 scoring	Signal disorder quantification
SSI fusion	Module C scoring	Multi-source health assessment
Series/parallel reliability	Dependency graph analysis	System-level availability
Duane model	Reliability growth analysis	Improvement tracking
P-F interval	Inspection scheduling	Monitoring frequency selection

The goal is not mathematical elegance but engineering utility: transform sensor readings and failure history into the three numbers a maintenance engineer needs — how bad is it (severity), how sure are we (confidence), and how long do we have (RUL).

Standards Alignment

Standard	Relevance to This Chapter
IEC 61649 — Weibull analysis	This chapter implements IEC 61649’s two-parameter Weibull model (shape beta, scale eta) as the foundation for reliability prediction, extended with condition-based adjustments that bridge population statistics with real-time sensor evidence.
ISO 13381-1 — Prognostics	The RUL estimation formulas (linear, accelerating, Weibull-adjusted) implement ISO 13381-1’s prognostic methodology, providing mathematically grounded remaining useful life predictions with quantified uncertainty.
ISO 14224 — Reliability and maintenance data	The reliability metrics (MTBF, MTTR, availability, hazard functions) use ISO 14224-compliant data structures and calculation methods for equipment reliability assessment in petroleum, petrochemical, and natural gas industries.

Changelog

Version	Date	Author	Changes
2.1.0	2026-03-17	Rick D	Added standards alignment, living doc metadata, changelog
2.0.0	2026-03-17	Rick D	Enriched with production codebase content
1.0.0	2026-03-17	Rick D	Initial chapter creation