Chapter 25: Reliability Mathematics
Reliability engineering is applied probability theory. Every maintenance decision (when to inspect, when to replace, when to accept risk) rests on mathematical models of how and when things fail. This chapter presents the complete mathematical foundation that RAPID AI uses to transform raw sensor data and operational history into actionable reliability intelligence.
The equations here are not academic curiosities. They are the production formulas
implemented in rul_engine.py, weibull_model_engine.py, and the reliability analysis
services. Every variable name maps to a field in the Module F schema or the reliability
database tables. Every threshold has a physical justification.
25.1 Probability Distributions for Failure
Three distributions dominate reliability engineering. Each models a different physical failure mechanism. Selecting the wrong one produces dangerously incorrect predictions.
Exponential Distribution — Constant Failure Rate (Memoryless)
The exponential distribution assumes that the probability of failure in the next instant is the same regardless of how long the component has been running. This is the simplest model and the only continuous distribution with the memoryless property.
Probability Density Function:
    f(t) = lambda * exp(-lambda * t)
Reliability Function:
    R(t) = exp(-lambda * t)
Cumulative Distribution Function (unreliability):
    F(t) = 1 - exp(-lambda * t)
Mean Time To Failure:
    MTTF = 1 / lambda
When to use:
- Random failures caused by external shocks (lightning, operator error, foreign object damage)
- Electronic components after burn-in screening
- Systems where wear-out is prevented by scheduled replacement
- Any scenario where the Weibull shape parameter beta is approximately 1.0
When NOT to use:
- Mechanical wear-out (bearings, seals, gears) — these have increasing failure rates
- Infant mortality screening — decreasing failure rate requires Weibull with beta < 1
RAPID AI application: The RUL engine (rul_engine.py) uses the exponential hazard
model for 30-day failure probability computation. Given an estimated RUL, the hazard
rate lambda = 1/RUL is assumed constant over the short prediction window:
    P(failure in 30 days) = 1 - exp(-lambda * 30)
This is a deliberate simplification. Over a 30-day window, the constant-rate assumption introduces negligible error compared to the uncertainty in the RUL estimate itself.
Weibull Distribution — The Workhorse of Reliability Engineering
The Weibull distribution is the most versatile life distribution in reliability engineering. With two parameters, it can model decreasing, constant, or increasing failure rates, so one family covers each region of the bathtub curve.
Parameters:
- beta (shape parameter, dimensionless): governs how the failure rate changes with time
- eta (scale parameter, same units as time): the characteristic life
Probability Density Function:
    f(t) = (beta / eta) * (t / eta)^(beta - 1) * exp(-(t / eta)^beta)
Reliability Function:
    R(t) = exp(-(t / eta)^beta)
Cumulative Distribution Function:
    F(t) = 1 - exp(-(t / eta)^beta)
Hazard Rate (instantaneous failure rate):
    h(t) = (beta / eta) * (t / eta)^(beta - 1)
Mean Life (MTTF for non-repairable items):
    mu = eta * Gamma(1 + 1/beta)
where Gamma is the gamma function.
Bxx Life (time by which xx% of the population has failed), with p = xx / 100:
    t_p = eta * (-ln(1 - p))^(1/beta)
The B10 life (10th percentile, also called L10 in bearing standards):
    B10 = eta * (-ln(0.9))^(1/beta) = eta * (0.10536)^(1/beta)
Characteristic Life Interpretation:
- eta is the 63.2nd percentile — 63.2% of the population has failed by t = eta
- eta is NOT the mean life (unless beta = 1)
- eta is NOT the “expected life” in the colloquial sense
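The Weibull formulas above can be checked numerically with a few lines of standard-library Python. This is a minimal sketch, not the contents of weibull_model_engine.py; the function names and the example parameters (beta = 2.0, eta = 10,000 hours) are illustrative assumptions.

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """h(t) = (beta/eta) * (t/eta)^(beta-1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def weibull_mean(beta, eta):
    """Mean life: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

def weibull_bxx(p, beta, eta):
    """Time by which a fraction p (0 < p < 1) of the population has failed."""
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)

# Hypothetical parameters, chosen only for illustration.
beta, eta = 2.0, 10_000.0
b10 = weibull_bxx(0.10, beta, eta)
mean_life = weibull_mean(beta, eta)
frac_failed_at_eta = 1.0 - weibull_reliability(eta, beta, eta)

print(f"B10 = {b10:.0f} h, mean life = {mean_life:.0f} h")
print(f"Fraction failed at t = eta: {frac_failed_at_eta:.3f}")
```

With these inputs the mean life comes out below eta and the fraction failed at t = eta is about 0.632, which illustrates why eta should not be read as the expected life.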
Shape Parameter Interpretation:
| beta Range | Failure Rate | Physical Meaning | Maintenance Implication |
|---|---|---|---|
| beta < 0.5 | Steeply decreasing | Severe infant mortality; manufacturing or installation defects | Burn-in, commissioning checks, installation audit |
| 0.5 <= beta < 1.0 | Decreasing | Moderate infant mortality; early quality issues | Improve commissioning, PM after break-in |
| beta = 1.0 (0.95-1.05) | Constant | Exponential distribution; random external causes | CBM or run-to-failure if consequence is low |
| 1.0 < beta < 2.0 | Mildly increasing | Early wear-out; friction and fatigue beginning | Age-based PM; start CBM monitoring |
| beta = 2.0 | Linearly increasing (Rayleigh) | Progressive wear with constant wear rate | Time-based replacement at B10 or B20 |
| 2.0 < beta < 4.0 | Increasing | Definite wear-out; design life visible | Scheduled replacement before B10 |
| beta = 3.44 | Near-normal (bell-shaped) | Tight life distribution; predictable failure | Optimal for fixed-interval replacement |
| beta > 4.0 | Steeply increasing | Rapid wear-out; brittle fracture, fatigue | Aggressive CBM; inspect before B5 |
How RAPID AI uses Weibull: The Reliability Intelligence Layer fits Weibull parameters from failure history using either Median Rank Regression (MRR, for 3-20 failures) or Maximum Likelihood Estimation (MLE, for 5+ failures with censored data). Module F then uses the fitted beta to classify hazard behaviour and select the appropriate prognostic model.
Lognormal Distribution — Fatigue and Repair Time
The lognormal distribution arises when the logarithm of the life data is normally distributed. It models multiplicative degradation processes where many small factors combine.
Probability Density Function:
    f(t) = (1 / (t * sigma * sqrt(2*pi))) * exp(-(ln(t) - mu)^2 / (2 * sigma^2))
Parameters:
- mu = mean of ln(t) (location parameter in log space)
- sigma = standard deviation of ln(t) (scale parameter in log space)
Key statistics:
    Median = exp(mu)
    Mean = exp(mu + 0.5 * sigma^2)
    Mmax (95th percentile) = exp(mu + 1.6449 * sigma)
When to use:
- Fatigue life (crack initiation and propagation)
- Repair time modelling (MTTR distributions)
- Corrosion rates
- Any process governed by multiplicative random shocks
RAPID AI application: Repair times in the maintainability analysis use lognormal fitting. The median repair time, mean repair time, and Mmax (95th percentile) are computed for maintenance workload planning. This is why MTTR alone is insufficient: Mmax gives the duration within which 95% of repairs are completed, which is the figure to plan around for long repairs.
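As a rough illustration of the lognormal repair-time statistics, the sketch below fits mu and sigma from logged repair durations and evaluates the three key statistics. The function names and the sample repair times are hypothetical; how the production maintainability service actually fits the distribution is not specified here.

```python
import math
import statistics

Z_95 = 1.6449  # standard normal quantile for the 95th percentile, as in the text

def fit_lognormal(repair_hours):
    """Estimate mu and sigma as the mean and sample std. dev. of ln(t)."""
    logs = [math.log(h) for h in repair_hours]
    return statistics.mean(logs), statistics.stdev(logs)

def lognormal_repair_stats(mu, sigma):
    """Median, mean, and Mmax (95th percentile) of a lognormal repair time."""
    return {
        "median": math.exp(mu),
        "mean": math.exp(mu + 0.5 * sigma ** 2),
        "mmax_95": math.exp(mu + Z_95 * sigma),
    }

# Hypothetical repair durations in hours.
mu_hat, sigma_hat = fit_lognormal([1.5, 2.0, 3.0, 4.5, 9.0])
fitted = lognormal_repair_stats(mu_hat, sigma_hat)
print({name: round(value, 2) for name, value in fitted.items()})
```

Because the distribution is right-skewed, the mean exceeds the median and Mmax sits well above both, which is the gap that a single MTTR figure hides.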
25.2 The Bathtub Curve
The bathtub curve is the composite failure rate function h(t) for a population of components. It is not a single distribution but the superposition of three competing failure mechanisms, each dominant in a different life phase.
Phase I: Infant Mortality (beta < 1)
The failure rate is decreasing. Failures are caused by:
- Manufacturing defects that escaped quality control
- Installation errors (misalignment, wrong torque, contamination)
- Commissioning faults (incorrect settings, missing lubrication)
- The Waddington Effect: maintenance itself introduces defects
Duration: Typically the first 5-10% of design life, or the first few hundred operating hours for rotating machinery.
Maintenance strategy: Burn-in screening, acceptance testing, commissioning procedures. Time-based replacement makes things WORSE in this region because every replacement resets the clock to the high-failure-rate zone.
Phase II: Useful Life (beta = 1)
The failure rate is approximately constant. Failures are:
- Random external events (power surges, foreign object damage, operator errors)
- Stress exceedances beyond design envelope
- Hidden defects that manifest at random times
Duration: The majority of service life for well-designed equipment.
Maintenance strategy: Condition-based monitoring (CBM) is optimal. Time-based replacement provides no benefit because the failure rate is constant — the item is equally likely to fail immediately after replacement as before.
Phase III: Wear-Out (beta > 1)
The failure rate is increasing. Failures are caused by:
- Mechanical wear (bearings, seals, impellers)
- Fatigue (shafts, blades, structural members)
- Corrosion (piping, heat exchanger tubes)
- Material degradation (insulation, elastomers, lubricants)
Duration: Onset varies by component. Bearings may enter wear-out after 30,000 hours; corrosion may take decades.
Maintenance strategy: This is the ONLY region where time-based replacement works. Schedule replacement before the B10 life. CBM can extend intervals by monitoring actual condition rather than relying on population statistics.
The Nowlan and Heap Finding
The landmark 1978 study by Nowlan and Heap of United Airlines (sponsored by the U.S. Department of Defense) analysed the failure patterns of aircraft components and found six patterns:
- 4% follow the classic bathtub curve (infant mortality + constant + wear-out)
- 2% follow a constant or slowly increasing rate that ends in a pronounced wear-out region
- 5% follow a slowly increasing pattern with no distinct wear-out age
- 7% start with a low failure rate when new, then rise quickly to a constant rate
- 14% follow a constant failure rate (flat)
- 68% follow an infant mortality pattern (decreasing, then constant or very slowly increasing)
Total: only 11% of items (4% + 2% + 5%) show age-related behaviour, so 89% of failures are NOT wear-out. They follow random or infant-mortality patterns. This means time-based replacement is effective for only 11% of failure modes. The remaining 89% require condition-based monitoring or redesign, which is exactly what RAPID AI provides.
25.3 Key Reliability Metrics
MTBF — Mean Time Between Failures
    MTBF = Total Operating Time / Number of Failures
For repairable systems only. MTTF (Mean Time To Failure) is used for non-repairable items.
Complete data (all failures observed):
    MTBF = operating_hours_total / n_failures
Censored data (some units still running):
    Total_time = sum of ALL operating intervals (failed + surviving)
    MTBF = Total_time / n_failures    # only actual failures in denominator
Chi-squared confidence interval (assuming exponential distribution):
    MTBF_lower = 2*T / chi2(1 - alpha/2, 2*n + 2)
    MTBF_upper = 2*T / chi2(alpha/2, 2*n)
where T = total accumulated time, n = number of failures, alpha = significance level, and chi2(p, k) denotes the p-quantile of the chi-squared distribution with k degrees of freedom.
Common misuse: MTBF is NOT expected life. A pump with MTBF = 50,000 hours does not “last” 50,000 hours. MTBF is the reciprocal of failure rate, valid only when the failure rate is constant (beta = 1). For wear-out failures (beta > 1), the B10 or B50 life from Weibull analysis is the correct planning metric.
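The censored-data rule for MTBF (all operating time in the numerator, only observed failures in the denominator) can be sketched as follows. The function name and the interval values are illustrative assumptions, and the chi-squared confidence bounds are left out because they need a quantile function that the standard library does not provide.

```python
def mtbf_with_censoring(failed_intervals_h, surviving_intervals_h):
    """Point estimate of MTBF from failed and still-running operating intervals.

    failed_intervals_h: operating hours of intervals that ended in a failure.
    surviving_intervals_h: operating hours accumulated by units still running.
    """
    n_failures = len(failed_intervals_h)
    if n_failures == 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    total_time = sum(failed_intervals_h) + sum(surviving_intervals_h)
    return total_time / n_failures

# Hypothetical fleet: three failures, two units still running.
mtbf_h = mtbf_with_censoring([1200.0, 800.0, 1500.0], [2000.0, 500.0])
print(f"MTBF = {mtbf_h:.0f} h")
```

Dropping the surviving units from the numerator would give 1,167 hours instead of 2,000, which shows how ignoring censored time biases the estimate low.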
MTTR — Mean Time To Repair
    MTTR = Total Repair Time / Number of Repairs
Components of MTTR:
- Diagnosis time (fault identification)
- Parts procurement time (logistics delay)
- Repair execution time (wrench time)
- Testing and recommissioning time
Lognormal model (preferred for planning):
    Median repair time = exp(mu)
    Mean repair time = exp(mu + 0.5 * sigma^2)
    Mmax (95th pctile) = exp(mu + 1.6449 * sigma)
Mmax is critical for spare parts and crew planning: it is the repair duration that 95% of repairs do not exceed, so it bounds all but the longest repairs.
Maintenance workload:
    Monthly_workload = total_labour_hours / observation_months
    Man_hours = Monthly_workload * crew_size
Availability
Inherent availability (excludes logistics delay):
    A_i = MTBF / (MTBF + MTTR)
Operational availability (includes everything):
    A_o = MTBM / (MTBM + MDT)
where MTBM = Mean Time Between Maintenance (all events), MDT = Mean Down Time (repair + waiting + logistics + admin).
Direct method (preferred when runtime is logged):
    A = Operating_Hours / Calendar_Hours
Targets for critical rotating equipment: 97-99.5% availability. A single percentage point of availability for a critical pump can represent millions in production value.
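The three availability definitions differ only in which times enter the ratio, which a short sketch makes concrete. Function names and input figures are assumptions for illustration.

```python
def inherent_availability(mtbf_h, mttr_h):
    """A_i = MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)

def operational_availability(mtbm_h, mdt_h):
    """A_o = MTBM / (MTBM + MDT)."""
    return mtbm_h / (mtbm_h + mdt_h)

def direct_availability(operating_hours, calendar_hours):
    """A = Operating_Hours / Calendar_Hours."""
    return operating_hours / calendar_hours

# Hypothetical pump: same asset, three views of availability.
a_i = inherent_availability(mtbf_h=2000.0, mttr_h=8.0)
a_o = operational_availability(mtbm_h=1000.0, mdt_h=24.0)
a_direct = direct_availability(operating_hours=8500.0, calendar_hours=8760.0)
print(f"A_i = {a_i:.4f}, A_o = {a_o:.4f}, A = {a_direct:.4f}")
```

In this example A_o is lower than A_i because MTBM counts every maintenance event and MDT includes waiting and logistics time.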
Failure Rate lambda(t)
The instantaneous failure rate (hazard rate) at time t:
    lambda(t) = f(t) / R(t)
For Weibull:
    lambda(t) = (beta / eta) * (t / eta)^(beta - 1)
Interpretation:
- lambda(t) is a rate (per unit time), not a probability
- For constant failure rate: lambda = 1/MTBF
- The cumulative hazard H(t) = integral of lambda from 0 to t
- R(t) = exp(-H(t))
25.4 The P-F Curve
The P-F curve is the conceptual foundation of condition-based maintenance. It describes how a failure develops over time, from undetectable to catastrophic.
Point P (Potential Failure): The earliest point at which incipient failure can be detected by a condition monitoring technique. The machine is still functional but degradation is measurable.
Point F (Functional Failure): The point at which the machine can no longer perform its required function to the required standard.
P-F Interval: The elapsed time between P and F. This is the window of opportunity for planned intervention.
The Half-Interval Rule
The inspection interval must be no greater than half the P-F interval:
    Inspection_interval <= (P-F interval) / 2
This ensures that at least two inspections fall between P and F: one to detect the developing failure and one to confirm it and plan the response.
Detection Methods and Typical P-F Intervals
| Detection Method | Typical P-F Interval | What It Detects |
|---|---|---|
| Vibration analysis | 1-12 months | Bearing defects, misalignment, unbalance, looseness |
| Oil analysis | 1-6 months | Wear debris, contamination, lubricant degradation |
| Thermography | 1-4 weeks | Hot bearings, electrical faults, insulation breakdown |
| Ultrasound | 1-6 months | Bearing lubrication, steam/gas leaks, electrical arcing |
| Performance monitoring | Days to weeks | Efficiency loss, fouling, internal wear |
| Audible noise | 1-4 weeks | Late-stage bearing failure, cavitation, mechanical looseness |
| Visual inspection | Days to weeks | Leaks, corrosion, looseness, crack propagation |
How RAPID AI Uses the P-F Curve
Module F estimates where on the P-F curve the asset currently sits using three inputs:
- Current severity score (S): How far the measured parameters deviate from baseline. Higher S means closer to point F.
- Degradation slope (log_slope): The rate of change in log-domain. Steeper slope means shorter time to F.
- Instability index (NLI): How erratic the degradation signal is. High NLI means the asset may be near the “knee” of the P-F curve where degradation accelerates.
The RUL estimate is essentially the predicted distance from the current position to point F, expressed in days.
25.5 Risk Priority Number (RPN)
The Risk Priority Number comes from Failure Mode and Effects Analysis (FMEA). It quantifies the risk of each failure mode as the product of three factors.
    RPN = Severity * Occurrence * Detection
Each factor is scored on a scale of 1 to 10:
Severity (S): How bad is the consequence if this failure mode occurs?
- 1 = No effect
- 5 = Moderate production impact, no safety concern
- 8 = Major production loss or environmental impact
- 10 = Safety-critical, potential for injury or death
Occurrence (O): How likely is this failure mode?
- 1 = Nearly impossible (< 1 in 100,000)
- 5 = Moderate (1 in 1,000)
- 8 = High (1 in 20)
- 10 = Near certain (> 1 in 2)
Detection (D): How likely is the failure to be detected before it reaches the customer or causes harm?
- 1 = Certain detection (automatic monitoring with proven track record)
- 5 = Moderate (periodic inspection catches most occurrences)
- 8 = Low (failure rarely detected before consequence)
- 10 = No detection method exists
RPN range: 1 to 1,000.
How RAPID AI Computes RPN Dynamically
Traditional FMEA assigns static RPN values during a design review. RAPID AI makes this dynamic by computing live Occurrence scores from sensor data:
- Severity is pre-assigned per failure mode in the IMS (Integrated Master Schema), scored by domain experts based on consequence category
- Occurrence is updated from real-time sensor evidence — the confidence score from the rule evaluator maps to probability of the failure mode being active
- Detection is adjusted based on which sensors are installed and reporting. More active sensor categories = lower detection score (better detection)
Risk Index (Module F simplified form):
    RI = 100 * severity_S * criticality_K
where severity_S is the normalized severity (0-1) and criticality_K is the asset criticality factor (0-1). Range: 0 to 100.
Limitations of RPN
RPN has well-documented weaknesses:
- Non-linear masking: S=9, O=2, D=2 gives RPN=36, while S=3, O=6, D=6 gives RPN=108. The safety-critical failure ranks lower.
- Ordinal scale treated as ratio: The multiplication assumes equal intervals between scores, which is not justified.
- Poor discrimination in the mid-range: Many failure modes cluster around RPN 100-200, making prioritization difficult.
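The masking weakness is easy to reproduce. The sketch below computes the classic RPN product and replays the two score combinations from the list above; the function name is an assumption and this is not the dynamic RAPID AI scoring.

```python
def rpn(severity, occurrence, detection):
    """Classic FMEA Risk Priority Number: product of three 1-10 scores."""
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection

safety_critical = rpn(severity=9, occurrence=2, detection=2)
minor_but_frequent = rpn(severity=3, occurrence=6, detection=6)
print(safety_critical, minor_but_frequent)
```

Sorting by RPN alone would place the severity-9 failure mode below the severity-3 one, which is why severity is usually reviewed separately from the product.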
RAPID AI supplements RPN with:
- Confidence scores (0.0-1.0) from the diagnostic engine
- Consequence categories (safety, environmental, production, economic) for direct severity comparison
- Risk Index (0-100) as a continuous, non-discretized alternative
25.6 System Reliability
Individual component reliability tells you about one machine. System reliability tells you about the production line.
Series Systems
All components must function for the system to function. The system fails when ANY component fails.
    R_system = R_1 * R_2 * ... * R_n
Key insight: Series reliability can never exceed that of the least reliable component, and is strictly lower whenever any other component is less than perfectly reliable. A line of 10 machines, each at 99% reliability, gives system reliability of only 90.4%.
Series availability:
    A_system = A_1 * A_2 * ... * A_n
Parallel Systems (Active Redundancy)
The system fails only if ALL components fail simultaneously.
    R_system = 1 - (1 - R_1) * (1 - R_2) * ... * (1 - R_n)
Key insight: Parallel redundancy dramatically improves reliability. Two pumps at 95% each give system reliability of 99.75%.
k-out-of-n Systems
The system functions if at least k of n identical components function.
    R_system = SUM from i=k to n of [C(n,i) * R^i * (1-R)^(n-i)]
where C(n,i) is the binomial coefficient.
Common application: 2-out-of-3 pump configurations.
    R_2of3(R=0.95) = 3*(0.95)^2*(0.05) + (0.95)^3 = 0.9928
Standby Redundancy
- Cold standby: Backup has zero failure rate while idle. Switchover delay is the risk.
- Warm standby: Backup runs at reduced duty; lower failure rate than active.
- Hot standby: Backup runs at full duty; same failure rate as active (this IS active parallel redundancy).
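The series, parallel, and k-out-of-n formulas from the preceding subsections can be reproduced with a few standard-library calls. This is a sketch assuming independent components (and identical ones for k-out-of-n); the function names are illustrative.

```python
import math

def series_reliability(reliabilities):
    """All components must work: product of the individual reliabilities."""
    return math.prod(reliabilities)

def parallel_reliability(reliabilities):
    """At least one component must work (active redundancy)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

def k_out_of_n_reliability(k, n, r):
    """At least k of n identical, independent components must work."""
    return sum(math.comb(n, i) * r ** i * (1.0 - r) ** (n - i)
               for i in range(k, n + 1))

line_of_ten = series_reliability([0.99] * 10)
two_pumps = parallel_reliability([0.95, 0.95])
two_of_three = k_out_of_n_reliability(2, 3, 0.95)
print(f"{line_of_ten:.4f} {two_pumps:.4f} {two_of_three:.5f}")
```

Series and parallel are the two limiting cases of k-out-of-n: k = n reproduces the series result and k = 1 the parallel result for identical components.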
Common Cause Failure (Beta-Factor Model)
Redundancy assumes independent failures. In reality, common causes (power loss, flooding, design defects) can fail multiple components simultaneously.
The beta-factor model separates the failure rate into independent and common-cause components:
    lambda_total = lambda_independent + lambda_common
    beta_ccf = lambda_common / lambda_total
For a two-unit parallel system with common cause:
    R_system = exp(-lambda_independent * t) * (2 - exp(-lambda_independent * t)) * exp(-lambda_common * t)
Typical beta_ccf values: 0.01-0.10 for well-separated redundant systems.
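To see how much a common-cause fraction erodes redundancy, the sketch below evaluates the formula above for a redundant pair of identical units. The function name, failure rate, and mission time are assumed example values.

```python
import math

def parallel_pair_reliability_ccf(lambda_total, beta_ccf, t):
    """Reliability of two identical parallel units under the beta-factor model."""
    lambda_common = beta_ccf * lambda_total
    lambda_independent = lambda_total - lambda_common
    r_ind = math.exp(-lambda_independent * t)
    # Pair survives independent failures AND the shared common-cause event.
    return r_ind * (2.0 - r_ind) * math.exp(-lambda_common * t)

# Hypothetical unit failure rate of 1e-4 per hour over a 1,000-hour mission.
r_no_ccf = parallel_pair_reliability_ccf(1e-4, 0.0, 1000.0)
r_with_ccf = parallel_pair_reliability_ccf(1e-4, 0.10, 1000.0)
r_single = math.exp(-1e-4 * 1000.0)
print(f"single: {r_single:.4f}  pair: {r_no_ccf:.4f}  pair with CCF: {r_with_ccf:.4f}")
```

With beta_ccf = 0 the result matches the ideal parallel formula, and with beta_ccf = 1 it collapses to the reliability of a single unit, so the factor interpolates between full and no benefit from redundancy.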
RAPID AI Dependency-Weighted Risk Propagation
RAPID AI models plant topology as a weighted directed graph. Each edge carries
dependency_strength (how tightly coupled are the assets) and production_impact_weight
(how much does upstream failure affect downstream output).
Propagated risk for a target node:
    propagated_risk(target) = SUM over upstream edges of: source_risk * dependency_strength * production_impact_weight
Dependency-adjusted risk:
    adjusted_risk(target) = max(own_risk, propagated_risk)
The graph is traversed in topological order (Kahn’s algorithm) so that adjusted risks propagate through multi-level chains. This identifies critical path equipment and reliability bottlenecks that would be invisible from single-asset analysis.
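One way to realise this traversal is sketched below. The data layout (a dict of own risks and a list of edge tuples), the function name, and the asset names are assumptions made for illustration; the sketch also assumes that source_risk means the already-adjusted risk of the upstream asset, which is what lets risk carry through multi-level chains.

```python
from collections import defaultdict, deque

def propagate_risk(own_risk, edges):
    """Dependency-adjusted risk per asset, visited in topological order (Kahn).

    own_risk: {asset: risk}; every asset named in an edge must appear here.
    edges: iterable of (source, target, dependency_strength, production_impact_weight).
    """
    incoming = defaultdict(list)
    downstream = defaultdict(list)
    indegree = {asset: 0 for asset in own_risk}
    for source, target, strength, impact in edges:
        incoming[target].append((source, strength, impact))
        downstream[source].append(target)
        indegree[target] += 1

    ready = deque(asset for asset, degree in indegree.items() if degree == 0)
    adjusted = {}
    while ready:
        asset = ready.popleft()
        # All upstream assets are already adjusted when this asset is reached.
        propagated = sum(adjusted[src] * strength * impact
                         for src, strength, impact in incoming[asset])
        adjusted[asset] = max(own_risk[asset], propagated)
        for target in downstream[asset]:
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(adjusted) != len(own_risk):
        raise ValueError("dependency graph contains a cycle")
    return adjusted

# Hypothetical three-asset chain.
own = {"feed_pump": 80.0, "heat_exchanger": 10.0, "compressor": 20.0}
links = [("feed_pump", "heat_exchanger", 0.9, 0.5),
         ("heat_exchanger", "compressor", 1.0, 1.0)]
adjusted_risk = propagate_risk(own, links)
print(adjusted_risk)
```

In the example the compressor inherits risk from a pump two levels upstream, even though its own risk is low.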
25.7 Condition-Adjusted RUL (Module F Mathematics)
Module F is the prognostic heart of RAPID AI. It estimates Remaining Useful Life from the current condition state using three models, selected automatically based on the degradation characteristics.
Input Contract
| Field | Type | Range | Source |
|---|---|---|---|
| severity_score_S | float | 0-1 | Modules A/B/C |
| confidence_C | float | 0-1 | Confidence engine |
| log_slope | float | any | Trend analysis (log-domain slope per day) |
| slope_change | float | any | Second derivative of log trend |
| instability_index_NLI | float | 0-1 | Non-linearity index |
| current_value | float | > 0 | Current measurement amplitude |
| failure_threshold | float | > 0 | Defined failure threshold |
| criticality_K | float | 0-1 | Asset criticality factor |
Model F001: Linear Degradation
Selection condition: slope_change < 0.01 AND NLI < 0.3
The degradation is growing at a steady exponential rate in the original domain (linear in the log domain). The time to reach the failure threshold is:
    RUL = (ln(failure_threshold) - ln(current_value)) / slope_log
where slope_log is the log-domain slope per day (the log_slope input field).
Physical interpretation: If current vibration is 5.0 mm/s, the failure threshold is 8.0 mm/s, and the log-domain slope is 0.05 per day:
    RUL = (ln(8.0) - ln(5.0)) / 0.05 = (2.0794 - 1.6094) / 0.05 = 0.4700 / 0.05 = 9.4 days
Guard conditions: If slope_log = 0, there is no degradation trend and RUL is returned as 0 (indicating insufficient data, not imminent failure). If current_value or failure_threshold is non-positive, the logarithm is undefined and RUL returns 0.
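The F001 formula and its guard conditions translate directly into a small function. This is a sketch based only on the description above, not the code in rul_engine.py; the function name is an assumption, and a negative slope (an improving trend) is not given special treatment because the text does not say how it is handled.

```python
import math

def rul_linear_days(current_value, failure_threshold, slope_log):
    """F001: days until the log-linear trend reaches the failure threshold."""
    # Guard: the logarithm is undefined for non-positive amplitudes.
    if current_value <= 0 or failure_threshold <= 0:
        return 0.0
    # Guard: no trend means no estimate; 0 is a sentinel, not "failing now".
    if slope_log == 0:
        return 0.0
    return (math.log(failure_threshold) - math.log(current_value)) / slope_log

# Worked example from the text: 5.0 mm/s now, 8.0 mm/s threshold, 0.05 per day.
example_rul = rul_linear_days(5.0, 8.0, 0.05)
print(f"RUL = {example_rul:.1f} days")
```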
Model F002: Accelerating Degradation
Selection condition: slope_change >= 0.01
The degradation rate itself is increasing (the second derivative of the log trend is positive). This typically occurs when a bearing defect reaches the spalling stage, or when a crack enters the rapid propagation phase.
    RUL = ln(failure_threshold / current_value) / (slope_log + slope_change)
Why the effective slope is slope_log + slope_change: This is a first-order approximation. The true trajectory is nonlinear, but over the remaining life window, using the sum of the current slope and its rate of change provides a conservative (shorter) RUL estimate, which is the safe direction for maintenance planning.
Model F003: High Instability
Selection condition: NLI >= 0.6
The degradation signal is highly erratic: the non-linearity index (NLI) indicates that the log-trend residuals are large relative to the trend itself. This occurs when the failure mechanism is chaotic (e.g., intermittent rub, loose bearing race) or when the asset is near the “knee” of the P-F curve.
    Base_RUL = F001 formula (linear estimate)
    RUL = Base_RUL * (1 - NLI)
Physical interpretation: If the linear model estimates 50 days but NLI = 0.7, the adjusted RUL is 50 * (1 - 0.7) = 15 days. The instability itself is treated as evidence that failure is closer than the trend line suggests.
At NLI = 1.0: RUL = 0 (total instability implies imminent failure).
Model Selection Priority
The models are evaluated in specification order:
- Check F001 conditions first (slope_change < 0.01 AND NLI < 0.3)
- If slope_change >= 0.01, use F002 (acceleration takes priority)
- If NLI >= 0.6, use F003 (instability takes priority)
- Fallback: F001 (for the gap between NLI 0.3-0.6 with low slope_change)
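The selection order and the three RUL formulas can be combined in one sketch. The list above does not say which model wins when both slope_change >= 0.01 and NLI >= 0.6 hold; the code below assumes the listed order is applied top to bottom, so F002 wins in that case. Function names are illustrative and may not match rul_engine.py.

```python
import math

def select_model(slope_change, nli):
    """Pick F001, F002, or F003 by walking the conditions in listed order."""
    if slope_change < 0.01 and nli < 0.3:
        return "F001"
    if slope_change >= 0.01:
        return "F002"  # assumption: checked before F003 when both apply
    if nli >= 0.6:
        return "F003"
    return "F001"      # fallback for 0.3 <= NLI < 0.6 with low slope_change

def estimate_rul(current_value, failure_threshold, slope_log, slope_change, nli):
    """Return (model_id, RUL in days) using the formulas for F001 to F003."""
    model = select_model(slope_change, nli)
    if current_value <= 0 or failure_threshold <= 0:
        return model, 0.0
    effective_slope = slope_log + slope_change if model == "F002" else slope_log
    if effective_slope == 0:
        return model, 0.0
    rul = math.log(failure_threshold / current_value) / effective_slope
    if model == "F003":
        rul *= 1.0 - nli  # instability shortens the linear estimate
    return model, rul

print(estimate_rul(5.0, 8.0, 0.05, 0.0, 0.1))    # steady trend
print(estimate_rul(5.0, 8.0, 0.05, 0.02, 0.1))   # accelerating trend
print(estimate_rul(5.0, 8.0, 0.0094, 0.0, 0.7))  # unstable signal
```

For the same amplitudes, the accelerating and unstable cases both return a shorter RUL than the plain linear estimate, which matches the conservative intent described for F002 and F003.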
Failure Probability (30-Day Horizon)
Given the RUL estimate, the 30-day failure probability uses the exponential hazard model:
    lambda = 1 / RUL_days                  (hazard rate)
    P_raw = 1 - exp(-lambda * 30)          (raw 30-day probability)
    P_adjusted = P_raw * confidence_C      (confidence-adjusted)
Confidence adjustment rationale: If the upstream diagnostic confidence is only 0.6, the failure probability should be discounted accordingly. A highly uncertain diagnosis should not trigger the same urgency as a confident one.
When RUL <= 0: P_raw = 1.0 (certain failure). The confidence adjustment still applies: if confidence is 0.5, the adjusted probability is 0.5 — reflecting that we are uncertain about both the diagnosis and the imminence.
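A compact sketch of the 30-day probability, including the RUL <= 0 case, is shown below. The function name and the horizon_days parameter are assumptions added for illustration.

```python
import math

def failure_probability(rul_days, confidence_c, horizon_days=30.0):
    """Return (P_raw, P_adjusted) under a constant hazard of 1 / RUL."""
    if rul_days <= 0:
        p_raw = 1.0  # the estimate says the threshold is already reached
    else:
        hazard_rate = 1.0 / rul_days
        p_raw = 1.0 - math.exp(-hazard_rate * horizon_days)
    return p_raw, p_raw * confidence_c

p_raw, p_adj = failure_probability(rul_days=30.0, confidence_c=0.6)
print(f"P_raw = {p_raw:.3f}, P_adjusted = {p_adj:.3f}")
```

Note that an RUL equal to the horizon gives a raw probability of about 0.63, not 1.0, because the exponential model treats RUL as a mean life rather than a deadline.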
Risk Index
    RI = 100 * severity_S * criticality_K
Range: 0 to 100. Combines “how bad” (severity) with “how important” (criticality) to produce a single prioritization score. Used to rank assets for maintenance scheduling.
Recommended Intervention Window
| RUL (days) | Recommendation |
|---|---|
| <= 7 | Immediate |
| <= 30 | 7-day window |
| <= 90 | 30-day window |
| > 90 | Planned (next scheduled outage) |
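The Risk Index and the intervention table reduce to two small functions. The sketch below mirrors the thresholds in the table; the function names are assumptions.

```python
def risk_index(severity_s, criticality_k):
    """RI = 100 * severity_S * criticality_K, with both inputs in [0, 1]."""
    return 100.0 * severity_s * criticality_k

def intervention_window(rul_days):
    """Map an RUL estimate in days to the recommended intervention window."""
    if rul_days <= 7:
        return "Immediate"
    if rul_days <= 30:
        return "7-day window"
    if rul_days <= 90:
        return "30-day window"
    return "Planned (next scheduled outage)"

print(risk_index(0.8, 0.5), intervention_window(9.4))
```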
25.8 Shannon Entropy (Module B.3 Mathematics)
RAPID AI uses information-theoretic entropy to quantify how “disordered” or “concentrated” the energy in a vibration signal is. A healthy machine distributes energy predictably across known frequencies. A deteriorating machine scatters energy in unexpected ways.
Shannon Entropy
For a discrete probability distribution P = {p(x_1), p(x_2), …, p(x_N)}:
    H(X) = -SUM from i=1 to N of [p(x_i) * log2(p(x_i))]
with the convention 0 * log2(0) = 0. Units: bits (when using log base 2).
Range: 0 (all energy in one bin: perfect certainty) to log2(N) (uniform distribution: maximum uncertainty).
Normalized entropy:
    H_norm = H(X) / log2(N)
Range: 0 to 1. This allows comparison across signals with different numbers of frequency bins.
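A minimal sketch of both entropy definitions follows. The function names are illustrative, and zero-probability bins are skipped so that they contribute nothing to the sum.

```python
import math

def shannon_entropy_bits(probs):
    """H(X) in bits; zero-probability bins contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def normalized_entropy(probs):
    """H(X) / log2(N), so signals with different bin counts are comparable."""
    n_bins = len(probs)
    if n_bins < 2:
        return 0.0  # a single bin carries no uncertainty
    return shannon_entropy_bits(probs) / math.log2(n_bins)

concentrated = normalized_entropy([0.97, 0.01, 0.01, 0.01])
uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])
print(f"concentrated: {concentrated:.3f}, uniform: {uniform:.3f}")
```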
Spectral Entropy (SE)
Measures how uniformly energy is distributed across frequency bands in a single measurement direction.
Computation:
- Compute the power spectral density (PSD) of the vibration signal
- Normalize the PSD to a probability distribution:
    p(f_i) = PSD(f_i) / SUM(PSD)
- Compute Shannon entropy of the normalized PSD
Interpretation:
- Low SE (< 0.3): Energy concentrated in a few dominant frequencies. The machine has clear tonal signatures — either healthy (1x, blade pass) or a specific defect (bearing frequencies, gear mesh).
- High SE (> 0.7): Energy spread across many frequencies. Broadband energy indicates turbulence, looseness, or multiple competing fault mechanisms.
Temporal Entropy (TE)
Measures signal consistency over successive time windows. The signal is divided into overlapping windows and the same feature (e.g., RMS amplitude) is computed in each.
Computation:
- Divide the time-domain signal into W overlapping windows
- Compute a scalar feature (RMS, peak, kurtosis) in each window
- Normalize the feature vector to a probability distribution
- Compute Shannon entropy
Interpretation:
- Low TE: The signal is consistent over time — steady-state operation.
- High TE: The signal varies significantly between windows — transient events, intermittent contact, or load fluctuations.
Directional Entropy (DE)
Measures how energy is distributed across measurement directions (axial, horizontal, vertical).
Computation:
- Compute RMS amplitude in each direction: A, H, V
- Normalize to a probability distribution:
    p(d) = RMS(d) / (A + H + V)
- Compute Shannon entropy of the three-element distribution
Interpretation:
- Low DE: Energy dominated by one direction — consistent with specific fault types (axial = misalignment, radial = unbalance).
- High DE (approaching log2(3) = 1.58 bits, or H_norm approaching 1.0): Energy equally distributed — structural looseness, multiple faults, or healthy baseline.
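Directional entropy needs only the three RMS amplitudes, so it makes a compact worked example. In the sketch below the function name is an assumption, as is returning zero when all three amplitudes are zero.

```python
import math

def directional_entropy(axial_rms, horizontal_rms, vertical_rms):
    """Return (DE in bits, DE normalized by log2(3)) from three RMS amplitudes."""
    amplitudes = (axial_rms, horizontal_rms, vertical_rms)
    total = sum(amplitudes)
    if total <= 0:
        return 0.0, 0.0  # assumption: no signal is reported as zero entropy
    probs = [a / total for a in amplitudes]
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return entropy_bits, entropy_bits / math.log2(3)

balanced = directional_entropy(2.0, 2.0, 2.0)
axial_dominant = directional_entropy(6.0, 0.5, 0.5)
print(balanced, axial_dominant)
```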
Stability Index
The Stability Index (SI) combines the three entropy measures into a single health indicator:
    SI = 1 - (0.5 * SE + 0.3 * TE + 0.2 * DE)
Weights rationale:
- Spectral Entropy (0.5): The dominant contributor because spectral shape is the most informative feature for fault detection.
- Temporal Entropy (0.3): Captures non-stationarity, which indicates developing faults.
- Directional Entropy (0.2): Supplements with directional information but is the least discriminating on its own.
Interpretation:
- SI close to 1.0: Low entropy across all dimensions. Energy is flowing through the machine in a predictable, organized pattern. The machine is healthy.
- SI close to 0.0: High entropy everywhere. Energy is scattered, trapped, or misdirected. The machine is degraded or has multiple active fault mechanisms.
Physical analogy: A healthy pump converts electrical energy into hydraulic energy along a defined path (motor -> coupling -> shaft -> impeller -> fluid). Entropy measures how much energy leaks off this path into vibration, heat, and noise. Low leakage = high SI = healthy. High leakage = low SI = degraded.
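The Stability Index is a weighted sum, sketched below. The code assumes that SE, TE, and DE are the normalized (0 to 1) entropies, since that is what keeps SI inside 0 to 1; the function name and example inputs are illustrative.

```python
SI_WEIGHTS = {"spectral": 0.5, "temporal": 0.3, "directional": 0.2}

def stability_index(spectral_entropy, temporal_entropy, directional_entropy):
    """SI = 1 - (0.5 * SE + 0.3 * TE + 0.2 * DE), all inputs normalized to [0, 1]."""
    entropies = {
        "spectral": spectral_entropy,
        "temporal": temporal_entropy,
        "directional": directional_entropy,
    }
    for name, value in entropies.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} entropy must be normalized to [0, 1]")
    weighted = sum(SI_WEIGHTS[name] * value for name, value in entropies.items())
    return 1.0 - weighted

healthy_si = stability_index(0.2, 0.1, 0.5)
degraded_si = stability_index(0.9, 0.8, 0.9)
print(f"healthy: {healthy_si:.2f}, degraded: {degraded_si:.2f}")
```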
25.9 SSI Fusion Mathematics (Module C)
The System Severity Index (SSI) is Module C’s composite health score. It fuses evidence from multiple diagnostic blocks (vibration, thermal, electrical, process, inspection) into a single 0-1 severity score per asset.
Block Scores
Each diagnostic block produces a block score B_k from the component-level evidence within that block:
    B_k = weighted combination of component-level severity indicators
The exact weighting depends on the block type. For vibration, the components might be 1x amplitude, bearing defect frequencies, broadband energy, and axial/radial ratio. Each is weighted by its diagnostic significance for the asset type.
Profile-Weighted Fusion
Different asset types weight the blocks differently. A motor weights electrical evidence more heavily than a pump does. A heat exchanger weights thermal evidence more heavily than a compressor does.
The SSI is the profile-weighted mean of block scores:
    SSI = SUM(w_k * B_k) / SUM(w_k)
where w_k is the weight assigned to block k by the asset profile.
Block-Score-Range (BSR) Override
When any single block score exceeds a critical threshold, the SSI is overridden regardless of the weighted average. This prevents a catastrophic vibration reading from being diluted by normal temperature and electrical readings.
BSR override rules:
- If any B_k > 0.85: SSI = max(SSI_calculated, B_k * 0.95)
- If any B_k > 0.95: SSI is forced to at least 0.90
Normalization for Missing Blocks
When sensor data is unavailable for one or more blocks, the SSI adjusts:
    SSI = SUM(w_k * B_k for available blocks) / SUM(w_k for available blocks)
The confidence score is reduced proportionally to reflect the incomplete evidence.
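The weighted mean, the missing-block normalization, and the BSR override can be combined as in the sketch below. The dictionary layout, the use of None for a missing block, the block names, and applying the override after normalization are all assumptions made for illustration.

```python
def fuse_ssi(block_scores, weights):
    """Profile-weighted SSI over available blocks, followed by the BSR override.

    block_scores: {block: score in [0, 1], or None when no data is available}.
    weights: {block: profile weight w_k}.
    Returns None when no block has data.
    """
    available = {k: b for k, b in block_scores.items() if b is not None}
    total_weight = sum(weights[k] for k in available)
    if not available or total_weight <= 0:
        return None
    ssi = sum(weights[k] * b for k, b in available.items()) / total_weight

    worst_block = max(available.values())
    if worst_block > 0.85:
        ssi = max(ssi, worst_block * 0.95)  # keep a critical block from being diluted
    if worst_block > 0.95:
        ssi = max(ssi, 0.90)                # floor for an extreme block score
    return ssi

# Hypothetical pump profile with the electrical block missing.
profile = {"vibration": 0.5, "thermal": 0.3, "electrical": 0.2}
ssi_override = fuse_ssi({"vibration": 0.90, "thermal": 0.10, "electrical": None}, profile)
ssi_plain = fuse_ssi({"vibration": 0.40, "thermal": 0.20, "electrical": None}, profile)
print(f"with override: {ssi_override:.3f}, without: {ssi_plain:.3f}")
```

In the first case the weighted mean alone would be 0.60, and the override lifts it to 0.855 because the vibration block exceeds 0.85.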
Module C Confidence
The confidence of the SSI is computed from two factors:
    C_module_c = 0.6 * SSI_quality + 0.4 * SEI
where SSI_quality measures the completeness and consistency of the block scores, and SEI (Sensor Evidence Index) measures how many sensor types contributed to the assessment.
25.10 Weibull Fitting Algorithms
RAPID AI implements two methods for fitting Weibull parameters from field data.
Median Rank Regression (MRR)
Section titled “Median Rank Regression (MRR)”Used when data is sparse (3-20 failures). The algorithm:
1. Sort failure times ascending. Separate uncensored failures from censored (suspended) items.

2. Assign median ranks using Bernard’s approximation:

   F(i) = (i - 0.3) / (n + 0.4)

   For censored data, use the adjusted rank method with increments:

   increment = (n + 1 - prev_rank) / (remaining + 1)

3. Linearize to Weibull probability paper:

   X_i = ln(t_i)

   Y_i = ln(ln(1 / (1 - F_i)))

4. Ordinary Least Squares regression on (X, Y) to get slope and intercept.

5. Recover parameters:

   beta = slope

   eta = exp(-intercept / beta)

6. Goodness of fit: R-squared >= 0.85 for “fitted” status; >= 0.70 for “fitted_low_confidence”; below 0.70 is “poor_fit”.
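The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the production code: the function names are hypothetical, `prev_rank` is read as the adjusted rank of the previous failure (0 before the first), `remaining` is read as the number of items still on test including the current one, and the adjusted rank replaces i in Bernard's formula.

```python
import math


def median_ranks(times, failed):
    """Return (failure_times, F) using Bernard's approximation.

    times  : observed time for each item
    failed : True if the item failed, False if it was censored (suspended)
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    prev_rank = 0.0
    t_out, f_out = [], []
    for position, i in enumerate(order):
        if not failed[i]:
            continue  # suspensions only influence later rank increments
        remaining = n - position  # items still on test, including this one
        prev_rank += (n + 1 - prev_rank) / (remaining + 1)
        t_out.append(times[i])
        f_out.append((prev_rank - 0.3) / (n + 0.4))
    return t_out, f_out


def fit_weibull_mrr(times, failed):
    """Return (beta, eta, r_squared, status) from regression of Y on X."""
    t, f = median_ranks(times, failed)
    if len(t) < 3:
        raise ValueError("at least 3 failures are required")
    x = [math.log(ti) for ti in t]
    y = [math.log(math.log(1.0 / (1.0 - fi))) for fi in f]
    x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0.0:
        raise ValueError("failure times must not all be identical")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = (sxy * sxy) / (sxx * syy)
    beta = slope
    eta = math.exp(-intercept / beta)
    if r_squared >= 0.85:
        status = "fitted"
    elif r_squared >= 0.70:
        status = "fitted_low_confidence"
    else:
        status = "poor_fit"
    return beta, eta, r_squared, status


# Hypothetical field data: five failures and two suspensions (hours).
hours = [105.0, 180.0, 200.0, 240.0, 310.0, 400.0, 420.0]
is_failure = [True, True, False, True, True, False, True]
beta, eta, r2, status = fit_weibull_mrr(hours, is_failure)
print(f"beta={beta:.2f} eta={eta:.0f} R^2={r2:.3f} status={status}")
```

When no item is censored, every increment equals 1 and the adjusted ranks reduce to 1, 2, ..., n, so the same routine covers both cases.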
Maximum Likelihood Estimation (MLE)
Used for 5+ failures or when censored data is present. Statistically optimal for larger datasets.
Log-likelihood for mixed data:

L(beta, eta) = SUM_failed [ln(beta) - beta*ln(eta) + (beta-1)*ln(t_i) - (t_i/eta)^beta] + SUM_censored [-(t_j/eta)^beta]

Score equation for beta (set dL/d(beta) = 0 after substituting the closed-form eta):

n_f/beta + SUM_f(ln(t_i)) - n_f * [SUM_all(t_i^beta * ln(t_i))] / [SUM_all(t_i^beta)] = 0

where n_f is the number of failures, SUM_f runs over failures only, and SUM_all runs over failures and censored items.

Closed-form eta given beta:

eta = (SUM_all(t_i^beta) / n_f)^(1/beta)

Solve the score equation numerically using Brent’s method (bracket beta in [0.01, 200]).
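The following sketch shows one way to solve for beta and eta. To stay dependency-free it uses plain bisection on the bracket [0.01, 200] rather than Brent's method, which is a deliberate substitution. The score function is the derivative of the log-likelihood with the closed-form eta substituted in. Times are divided by the largest observation before exponentiation to avoid overflow at large beta, which is an implementation choice of this sketch and does not change the root. Function names and data are hypothetical.

```python
import math


def log_likelihood(beta, eta, failures, censored=()):
    """Weibull log-likelihood for failures plus right-censored items."""
    ll = 0.0
    for t in failures:
        ll += (math.log(beta) - beta * math.log(eta)
               + (beta - 1.0) * math.log(t) - (t / eta) ** beta)
    for t in censored:
        ll -= (t / eta) ** beta
    return ll


def fit_weibull_mle(failures, censored=(), lo=0.01, hi=200.0, tol=1e-10):
    """Return (beta, eta) by bisection on the profile score equation."""
    n_f = len(failures)
    if n_f < 2:
        raise ValueError("at least 2 failures are required")
    all_times = list(failures) + list(censored)
    t_max = max(all_times)
    scaled = [t / t_max for t in all_times]  # every value is in (0, 1]
    sum_ln_failed = sum(math.log(t / t_max) for t in failures)

    def score(beta):
        powers = [s ** beta for s in scaled]
        weighted = sum(p * math.log(s) for p, s in zip(powers, scaled))
        return n_f / beta + sum_ln_failed - n_f * weighted / sum(powers)

    if score(lo) <= 0.0 or score(hi) >= 0.0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:  # score decreases with beta; root is above mid
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    eta = t_max * (sum(s ** beta for s in scaled) / n_f) ** (1.0 / beta)
    return beta, eta


# Hypothetical data: five failures and two units suspended at 500 hours.
failed_hours = [105.0, 180.0, 240.0, 310.0, 420.0]
suspended_hours = [500.0, 500.0]
beta, eta = fit_weibull_mle(failed_hours, suspended_hours)
print(f"beta={beta:.3f} eta={eta:.1f}")
```

Where SciPy is available, `scipy.optimize.brentq(score, 0.01, 200)` would be a closer match to the method named in the text.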
Method Selection
| Condition | Method | Status |
|---|---|---|
| n_failures < 3 | None | insufficient_data |
| 3 <= n_failures < 5 | MRR | low_sample_warning |
| 5 <= n_failures < 20, no censoring | MRR | ok |
| n_failures >= 20 or censored data | MLE | ok |
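The table can be read as a small decision function. The sketch below assumes the rows are evaluated top to bottom with the first match winning, so a dataset with 3 or 4 failures stays on MRR even when censored items are present. That ordering is an assumption, because the table does not say how overlapping rows are resolved. The function name `select_fit_method` is hypothetical.

```python
def select_fit_method(n_failures, has_censored):
    """Return (method, status) following the selection table, top to bottom."""
    if n_failures < 3:
        return None, "insufficient_data"
    if n_failures < 5:
        # Assumption: this row wins over the censoring rule for tiny samples.
        return "MRR", "low_sample_warning"
    if n_failures >= 20 or has_censored:
        return "MLE", "ok"
    return "MRR", "ok"


for n, censored in [(2, False), (4, True), (12, False), (12, True), (25, False)]:
    print(n, censored, select_fit_method(n, censored))
```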
25.11 Reliability Growth (Duane Model)
The Duane model tracks whether reliability is improving or deteriorating over time. It is used in the Reliability Intelligence Layer to identify assets where maintenance interventions are (or are not) having the desired effect.
Cumulative MTBF:
MTBF_cum(T) = T / N(T)

where N(T) is the cumulative number of failures by time T.

Duane plot: Log(MTBF_cum) vs Log(T). If the relationship is linear:

ln(MTBF_cum) = a + alpha * ln(T)

Duane slope alpha:
- alpha > 0: Reliability is improving (MTBF is increasing over time)
- alpha = 0: Reliability is constant
- alpha < 0: Reliability is deteriorating
Typical targets: alpha = 0.3-0.5 indicates healthy reliability growth during a commissioning or improvement program.
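A minimal sketch of estimating the Duane slope by least squares is shown below. It assumes the input is the cumulative operating time at each failure and evaluates MTBF_cum at those failure times. The function names and the sample data are hypothetical.

```python
import math


def duane_fit(failure_times):
    """Return (a, alpha) from OLS on ln(MTBF_cum) = a + alpha * ln(T).

    failure_times: cumulative operating time at each failure (assumption).
    """
    times = sorted(failure_times)
    if len(times) < 3:
        raise ValueError("at least 3 failures are required")
    x = [math.log(t) for t in times]
    # At the i-th failure (1-based), N(T) = i, so MTBF_cum = T / i.
    y = [math.log(t / (i + 1)) for i, t in enumerate(times)]
    x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    a = y_mean - alpha * x_mean
    return a, alpha


def classify_trend(alpha):
    """Map the sign of the Duane slope to a trend label."""
    if alpha > 0:
        return "improving"
    if alpha < 0:
        return "deteriorating"
    return "constant"


# Hypothetical asset whose failures become less frequent over time (hours).
failure_hours = [120.0, 310.0, 620.0, 1100.0, 1750.0, 2600.0]
a, alpha = duane_fit(failure_hours)
print(f"alpha={alpha:.2f} ({classify_trend(alpha)})")
```

In practice a tolerance band around zero may be preferable to an exact sign test, since a fitted slope is rarely exactly 0.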
Summary
The mathematics in this chapter is not a textbook exercise. Every equation maps to a production code path in RAPID AI:
| Mathematics | Implementation | Purpose |
|---|---|---|
| Weibull distribution | weibull_model_engine.py | Failure pattern classification |
| Exponential hazard | rul_engine.py | 30-day failure probability |
| F001/F002/F003 models | rul_engine.py | Remaining Useful Life estimation |
| Shannon entropy | Module B.3 scoring | Signal disorder quantification |
| SSI fusion | Module C scoring | Multi-source health assessment |
| Series/parallel reliability | Dependency graph analysis | System-level availability |
| Duane model | Reliability growth analysis | Improvement tracking |
| P-F interval | Inspection scheduling | Monitoring frequency selection |
The goal is not mathematical elegance but engineering utility: transform sensor readings and failure history into the three numbers a maintenance engineer needs — how bad is it (severity), how sure are we (confidence), and how long do we have (RUL).
Standards Alignment
| Standard | Relevance to This Chapter |
|---|---|
| IEC 61649 — Weibull analysis | This chapter implements IEC 61649’s two-parameter Weibull model (shape beta, scale eta) as the foundation for reliability prediction, extended with condition-based adjustments that bridge population statistics with real-time sensor evidence. |
| ISO 13381-1 — Prognostics | The RUL estimation formulas (linear, accelerating, Weibull-adjusted) implement ISO 13381-1’s prognostic methodology, providing mathematically grounded remaining useful life predictions with quantified uncertainty. |
| ISO 14224 — Reliability and maintenance data | The reliability metrics (MTBF, MTTR, availability, hazard functions) use ISO 14224-compliant data structures and calculation methods for equipment reliability assessment in petroleum, petrochemical, and natural gas industries. |
Changelog
| Version | Date | Author | Changes |
|---|---|---|---|
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |