Testing Strategy

RAPID AI is a safety-adjacent system. When it tells a reliability engineer that a pump bearing has outer race spalling with 85% confidence, that engineer may shut down a process line, order a $40,000 replacement, and mobilize a maintenance crew. A false positive wastes money. A false negative risks catastrophic failure. The testing strategy exists to make the diagnostic engines trustworthy — not through hope, but through systematic verification at every layer.


Every module in the codebase must maintain 80% or higher line coverage. The diagnostic engines (Modules A through E) must exceed 90%. Coverage is measured by pytest-cov for Python and vitest for SvelteKit, and enforced in CI — a PR that drops coverage below threshold cannot merge.
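
A single `--cov-fail-under` value enforces only one overall threshold, so the stricter floor for the diagnostic engines needs its own check. The following is a minimal sketch of such a check, assuming a Cobertura-style `coverage.xml` as written by `--cov-report=xml`. The path prefix `app/engines/`, the `THRESHOLDS` table, and both function names are illustrative assumptions, not taken from the codebase.

```python
import xml.etree.ElementTree as ET

# Hypothetical thresholds: path prefix -> minimum line coverage.
# The longest matching prefix wins; "" is the default for every other module.
THRESHOLDS = {"": 0.80, "app/engines/": 0.90}


def required_coverage(filename: str) -> float:
    """Return the threshold of the longest prefix that matches filename."""
    best = max((p for p in THRESHOLDS if filename.startswith(p)), key=len)
    return THRESHOLDS[best]


def coverage_violations(xml_text: str) -> list[str]:
    """List files whose line-rate in a Cobertura-style report is below threshold."""
    root = ET.fromstring(xml_text)
    failures = []
    for cls in root.iter("class"):
        filename = cls.get("filename", "")
        rate = float(cls.get("line-rate", "0"))
        needed = required_coverage(filename)
        if rate < needed:
            failures.append(f"{filename}: {rate:.0%} < {needed:.0%}")
    return failures


# Two files at 85%: only the (hypothetical) engine module violates its floor.
sample = """<coverage><packages><package><classes>
  <class filename="app/engines/module_a.py" line-rate="0.85"/>
  <class filename="app/api/routes.py" line-rate="0.85"/>
</classes></package></packages></coverage>"""
print(coverage_violations(sample))
```

A CI step could run this over the generated report and fail the build when the returned list is non-empty.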

All new diagnostic logic follows test-driven development:

  1. Red: Write a test that exercises the expected behavior. Run it. Watch it fail. This confirms the test is actually testing something.
  2. Green: Write the minimum code to make the test pass. No cleverness, no optimization — just make it work.
  3. Refactor: Clean up the implementation without changing behavior. The tests guard against regressions during refactoring.

RAPID AI contains hundreds of rules: 16 guard rules (DG001-DG019), 119 component failure mode rules, 50 signal feature rules (SF001-SF051), 22 block scoring rules (BSR001-BSR022), 10 health stage rules (HSR001-HSR010), and 10 priority window rules (PWR001-PWR010). Testing each rule individually would create a brittle, unmaintainable test suite that breaks every time Dibyendu adds a new failure mode.

Instead, the strategy is: test the engine, not individual rules. If the rule evaluator correctly handles every operator type (>, >=, <, <=, ==, !=, BETWEEN, IN, LIKE), correctly parses parenthetical grouping, and correctly treats semicolons as AND conjunctions, then any well-formed rule will be evaluated exactly as written. Engine correctness guarantees that rules are applied correctly; it does not by itself show that a rule's thresholds are diagnostically right.

The exception is regression tests for specific known-good diagnostic scenarios (Section 24.5), which validate the combined behavior of rules plus engine.
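
To make the "test the engine" idea concrete, here is a deliberately small sketch: a toy evaluator plus a table of cases that covers operators and edge conditions rather than individual rules. It is not the RAPID AI evaluator. It supports only numeric comparisons, BETWEEN, and semicolons as AND, and it assumes that BETWEEN is inclusive and that a missing or null value makes a clause false. The names `evaluate_sketch` and `evaluate_clause` are hypothetical.

```python
import operator
import re

NUMBER = r"-?\d+(?:\.\d+)?"
COMPARE = re.compile(rf"^(\w+)\s*(>=|<=|!=|==|>|<)\s*({NUMBER})$")
BETWEEN = re.compile(rf"^(\w+)\s+BETWEEN\s+({NUMBER})\s+AND\s+({NUMBER})$")
OPERATORS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}


def evaluate_clause(clause: str, data: dict) -> bool:
    """Evaluate one clause; a missing or null value makes it False."""
    clause = clause.strip()
    match = BETWEEN.match(clause)
    if match:
        name, low, high = match.groups()
        value = data.get(name)
        # Assumption: BETWEEN is inclusive on both ends.
        return value is not None and float(low) <= value <= float(high)
    match = COMPARE.match(clause)
    if match is None:
        raise ValueError(f"unsupported clause: {clause!r}")
    name, op, threshold = match.groups()
    value = data.get(name)
    return value is not None and OPERATORS[op](value, float(threshold))


def evaluate_sketch(expr: str, data: dict) -> bool:
    """Semicolons act as AND: every non-empty clause must hold."""
    clauses = [c for c in expr.split(";") if c.strip()]
    return all(evaluate_clause(c, data) for c in clauses)


# One table exercises the engine; adding a rule never requires a new test here.
CASES = [
    ("rms > 4.5", {"rms": 5.0}, True),
    ("rms > 4.5", {"rms": 4.5}, False),
    ("rms >= 4.5", {"rms": 4.5}, True),
    ("rms != 4.5", {"rms": 4.5}, False),
    ("kurtosis BETWEEN 3.0 AND 6.0", {"kurtosis": 4.5}, True),
    ("rms > 4.5; crest_factor > 3.0", {"rms": 5.0, "crest_factor": 2.0}, False),
    ("rms > 4.5", {}, False),
    ("rms > 4.5", {"rms": None}, False),
]
for expr, data, expected in CASES:
    assert evaluate_sketch(expr, data) is expected, expr
```

In a pytest suite the same table would typically feed `pytest.mark.parametrize`, so each row is reported as its own test.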


Unit tests verify individual functions and modules in isolation, with no database, no network, and no filesystem access. All external dependencies are mocked or stubbed.

  • Python: pytest with fixtures, parametrize for data-driven tests, unittest.mock for stubs
  • TypeScript: vitest with SvelteKit testing utilities

The rule evaluator parses conditional expressions from the IMS and evaluates them against sensor data. It must handle:

# Test every comparison operator
def test_greater_than():
    assert evaluate("rms > 4.5", {"rms": 5.0}) is True
    assert evaluate("rms > 4.5", {"rms": 4.5}) is False


def test_between():
    assert evaluate("kurtosis BETWEEN 3.0 AND 6.0", {"kurtosis": 4.5}) is True
    assert evaluate("kurtosis BETWEEN 3.0 AND 6.0", {"kurtosis": 7.0}) is False


# Test semicolons as AND
def test_semicolon_conjunction():
    expr = "rms > 4.5; crest_factor > 3.0; kurtosis > 5.0"
    data = {"rms": 5.0, "crest_factor": 3.5, "kurtosis": 6.0}
    assert evaluate(expr, data) is True


# Test parenthetical grouping
def test_parentheses():
    expr = "(rms > 4.5 AND peak > 10.0) OR kurtosis > 8.0"
    data = {"rms": 3.0, "peak": 5.0, "kurtosis": 9.0}
    assert evaluate(expr, data) is True  # kurtosis branch


# Edge cases
def test_missing_sensor_returns_false():
    assert evaluate("rms > 4.5", {}) is False


def test_null_value_returns_false():
    assert evaluate("rms > 4.5", {"rms": None}) is False

The Spectral Entropy + Distribution Lag engine computes three entropy components (SE, TE, DE) and classifies signal stability state:

import numpy as np
import pytest


@pytest.fixture
def uniform_signal():
    """All FFT bins equal -- maximum entropy."""
    return np.ones(512) / 512


@pytest.fixture
def spike_signal():
    """Single dominant bin -- minimum entropy."""
    signal = np.zeros(512)
    signal[42] = 1.0
    return signal


def test_spectral_entropy_uniform(uniform_signal):
    se = compute_spectral_entropy(uniform_signal)
    assert se == pytest.approx(1.0, abs=0.01)


def test_spectral_entropy_spike(spike_signal):
    se = compute_spectral_entropy(spike_signal)
    assert se == pytest.approx(0.0, abs=0.01)


def test_all_zeros_returns_zero_entropy():
    se = compute_spectral_entropy(np.zeros(512))
    assert se == 0.0  # guard against log(0)


def test_sedl_state_classification():
    # Low SE + low TE + low DE = stable
    assert classify_sedl_state(0.1, 0.1, 0.05) == "stable"
    # High SE + high TE = chaotic
    assert classify_sedl_state(0.9, 0.8, 0.3) == "chaotic"
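
The three entropy tests above pin down a normalized Shannon entropy: 1.0 for a flat spectrum, 0.0 for a single spike, and 0.0 for an all-zero input. A minimal pure-Python sketch that satisfies those properties is shown below. It is an assumption about what `compute_spectral_entropy` computes, not the production implementation, and `normalized_spectral_entropy` is a hypothetical name.

```python
import math


def normalized_spectral_entropy(power_bins):
    """Shannon entropy of the normalized spectrum, scaled to [0, 1]."""
    total = sum(power_bins)
    n = len(power_bins)
    # Guard: an all-zero (flatline) spectrum has no distribution to measure,
    # and a single bin cannot be normalized by log(1) = 0.
    if total <= 0 or n < 2:
        return 0.0
    entropy = 0.0
    for power in power_bins:
        if power > 0:  # skip empty bins so log(0) is never evaluated
            q = power / total
            entropy -= q * math.log(q)
    return entropy / math.log(n)  # divide by the maximum possible entropy


uniform = [1.0 / 512] * 512
spike = [0.0] * 512
spike[42] = 1.0
print(normalized_spectral_entropy(uniform))   # close to 1.0
print(normalized_spectral_entropy(spike))
print(normalized_spectral_entropy([0.0] * 512))
```

Dividing by `log(n)` is what makes the result comparable across FFT sizes; without it the uniform case would return `log(512)` rather than 1.0.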

Tests cover system profile loading, block score computation (BSR001-BSR022 3-pass evaluation), SSI weighted aggregation, and override logic:

import pytest


def test_block_score_all_pass():
    """When all evidence blocks pass, SSI should be high."""
    block_results = {f"BSR{i:03d}": 1.0 for i in range(1, 23)}
    ssi = compute_ssi(block_results, profile="centrifugal_pump")
    assert ssi > 0.85


def test_block_score_critical_failure():
    """A single critical block failure should dominate SSI."""
    block_results = {f"BSR{i:03d}": 1.0 for i in range(1, 23)}
    block_results["BSR001"] = 0.0  # bearing health block fails
    ssi = compute_ssi(block_results, profile="centrifugal_pump")
    assert ssi < 0.50


def test_profile_weights_sum_to_one():
    """System profile weights must sum to 1.0 for valid SSI."""
    profile = load_profile("centrifugal_pump")
    total = sum(profile.weights.values())
    assert total == pytest.approx(1.0, abs=0.001)


def test_override_logic():
    """Manual override should replace computed SSI."""
    result = compute_ssi_with_override(
        block_results={...},
        profile="centrifugal_pump",
        override={"ssi": 0.3, "reason": "Known defect, awaiting parts"},
    )
    assert result.ssi == 0.3
    assert result.override_active is True
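
The critical-failure test above cannot be met by a plain weighted average unless one block carries most of the weight: with 22 equally weighted blocks, a single zero still leaves the average near 0.95. The sketch below shows one way the "dominate" behavior could be obtained, by capping the score when a critical block fails. The cap mechanism, the equal weights, the 0.4 cap, and the name `weighted_ssi` are all assumptions for illustration; the source does not say how the real engine does it.

```python
import math


def weighted_ssi(block_scores, weights, critical_blocks=frozenset(), critical_cap=0.4):
    """Weighted average of block scores, capped when a critical block fails."""
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-3):
        raise ValueError("profile weights must sum to 1.0")
    ssi = sum(weights[block] * block_scores[block] for block in weights)
    # Hypothetical dominance rule: any failed critical block caps the score.
    if any(block_scores[block] == 0.0 for block in critical_blocks):
        ssi = min(ssi, critical_cap)
    return ssi


blocks = [f"BSR{i:03d}" for i in range(1, 23)]
weights = {block: 1 / len(blocks) for block in blocks}  # assumed equal weights
scores = {block: 1.0 for block in blocks}
all_pass = weighted_ssi(scores, weights)

scores["BSR001"] = 0.0
plain_average = weighted_ssi(scores, weights)
with_cap = weighted_ssi(scores, weights, critical_blocks={"BSR001"})
print(all_pass, plain_average, with_cap)
```

Here `plain_average` stays well above 0.50 while `with_cap` drops to the cap, which is the difference the critical-failure test is designed to detect.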

Tests cover all three RUL models (F001 linear degradation, F002 exponential, F003 Weibull), boundary conditions, and zero-slope guards:

def test_rul_f001_linear_degradation():
    trend = [0.5, 0.6, 0.7, 0.8, 0.9]  # linear increase
    rul = estimate_rul(trend, model="F001", threshold=2.0)
    assert 8 < rul < 15  # reasonable remaining intervals


def test_rul_zero_slope_guard():
    """Flat trend should not predict imminent failure."""
    trend = [1.0, 1.0, 1.0, 1.0, 1.0]
    rul = estimate_rul(trend, model="F001", threshold=2.0)
    assert rul == float("inf")  # or sentinel value for "no degradation"


def test_rul_f003_weibull_shape_parameter():
    """Weibull beta > 1 = wear-out, beta < 1 = infant mortality."""
    result = fit_weibull([100, 200, 150, 180, 220])
    assert result.beta > 1.0  # wear-out pattern


# CDE trigger, causal ranking, and confidence labels
def test_cde_trigger_evaluation():
    """CDE trigger fires when SSI drops below threshold."""
    assert cde_should_trigger(ssi=0.35, threshold=0.40) is True
    assert cde_should_trigger(ssi=0.55, threshold=0.40) is False


def test_causal_keyword_matching():
    """Causal engine ranks causes by keyword relevance."""
    causes = rank_causes(
        failure_mode="bearing outer race spalling",
        evidence=["high_frequency_vibration", "elevated_temperature"],
    )
    assert causes[0].cause == "lubrication failure"  # most relevant
    assert causes[0].confidence > 0.6


def test_confidence_label_mapping():
    assert confidence_label(0.95) == "very_high"
    assert confidence_label(0.75) == "high"
    assert confidence_label(0.55) == "moderate"
    assert confidence_label(0.35) == "low"
    assert confidence_label(0.15) == "very_low"
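
The linear-degradation and zero-slope tests for the RUL models can be read as a specification for a straight-line extrapolation. The sketch below fits a least-squares slope to the trend and extrapolates from the last observed value to the threshold. It is an assumed reading of model F001, not the production `estimate_rul`, and `linear_rul` is a hypothetical name.

```python
import math


def linear_rul(trend, threshold):
    """Remaining intervals until a linear trend reaches threshold."""
    n = len(trend)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    if trend[-1] >= threshold:
        return 0.0  # already at or past the limit
    mean_x = (n - 1) / 2
    mean_y = sum(trend) / n
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(trend))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    slope = sxy / sxx
    # Zero-slope guard: a flat or improving trend never reaches the threshold,
    # and dividing by a near-zero slope would produce a meaningless number.
    if slope <= 1e-12:
        return math.inf
    return (threshold - trend[-1]) / slope


print(linear_rul([0.5, 0.6, 0.7, 0.8, 0.9], threshold=2.0))  # about 11 intervals
print(linear_rul([1.0, 1.0, 1.0, 1.0, 1.0], threshold=2.0))  # inf
```

With a slope of 0.1 per interval and 1.1 units left to the threshold, the first call lands inside the 8 to 15 window that the F001 test asserts.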

Integration tests verify that modules work together with real infrastructure — a test PostgreSQL database, actual SQL queries, and the full diagnostic pipeline.

The most critical integration test runs the complete diagnostic chain: sensor data in, diagnostic result out.

import pytest

# Async tests assume pytest-asyncio with asyncio_mode = "auto";
# otherwise mark each async test with @pytest.mark.asyncio.


@pytest.fixture(scope="session")
def test_db():
    """Spin up a test database, apply migrations, load seed data."""
    # Use testcontainers or a dedicated test database
    engine = create_test_engine()
    apply_migrations(engine)
    load_seed_data(engine, "platform/data/00_run_all_seed_inserts.sql")
    yield engine
    engine.dispose()


async def test_full_diagnostic_pipeline(test_db):
    """End-to-end: sensor payload -> all 5 modules -> diagnostic result."""
    payload = load_fixture("ims_scenario_001_bearing_spalling.json")
    result = await run_full_pipeline(
        asset_id="AST-PUMP-001",
        sensor_data=payload,
        db=test_db,
    )
    assert result.module_a.quality_score > 0.8
    assert len(result.module_b.failure_modes) > 0
    assert 0.0 <= result.module_c.ssi <= 1.0
    assert result.module_d.health_stage in ["normal", "watch", "alert", "critical"]
    assert len(result.module_e.maintenance_tasks) > 0

Use the 100 IMS rows as integration test fixtures. For each row, construct a sensor payload that should trigger that specific failure mode, run it through the pipeline, and verify the output matches the expected diagnostic chain:

import pytest


@pytest.mark.parametrize("ims_id", [f"IMS{i:03d}" for i in range(1, 101)])
async def test_ims_scenario(test_db, ims_id):
    scenario = load_ims_scenario(ims_id)
    result = await run_full_pipeline(
        asset_id=scenario.asset_id,
        sensor_data=scenario.sensor_payload,
        db=test_db,
    )
    assert scenario.expected_failure_mode in [
        fm.failure_mode for fm in result.module_b.failure_modes
    ]


# API contract tests
async def test_diagnose_endpoint_contract(client):
    response = await client.post(
        "/rapid-ai/v1/assets/AST-PUMP-001/diagnose",
        json={"sensor_data": {...}, "timestamp": "2026-03-17T14:00:00Z"},
    )
    assert response.status_code == 200
    body = response.json()
    assert "ssi_score" in body
    assert "failure_modes" in body
    assert "health_stage" in body
    assert "confidence" in body


async def test_diagnose_invalid_asset_returns_404(client):
    response = await client.post(
        "/rapid-ai/v1/assets/NONEXISTENT/diagnose",
        json={"sensor_data": {...}},
    )
    assert response.status_code == 404


async def test_diagnose_missing_payload_returns_422(client):
    response = await client.post(
        "/rapid-ai/v1/assets/AST-PUMP-001/diagnose",
        json={},
    )
    assert response.status_code == 422


# Authentication and authorization tests
async def test_unauthenticated_request_returns_401(client):
    response = await client.get("/rapid-ai/v1/assets/AST-PUMP-001")
    assert response.status_code == 401


async def test_operator_cannot_access_admin_routes(operator_client):
    response = await operator_client.post("/rapid-ai/v1/admin/schema/reload")
    assert response.status_code == 403


async def test_admin_can_reload_schema(admin_client):
    response = await admin_client.post("/rapid-ai/v1/admin/schema/reload")
    assert response.status_code == 200

E2E tests drive the actual browser through the SvelteKit frontend, verifying that the full stack works from the user’s perspective.

Playwright with TypeScript. Tests run against a fully deployed stack (backend + frontend + database with seed data).

import { test, expect } from "@playwright/test";

// Flow 1: Dashboard to Diagnostic Result
test("operator views asset health and runs diagnostic", async ({ page }) => {
  await page.goto("/login");
  await page.fill('[name="email"]', "operator@plant.com");
  await page.fill('[name="password"]', "test-password");
  await page.click('button[type="submit"]');

  // Dashboard loads
  await expect(page.locator("h1")).toContainText("Plant Overview");

  // Navigate to asset
  await page.click('text=AST-PUMP-001');
  await expect(page.locator(".health-card")).toBeVisible();

  // Run diagnostic
  await page.click('button:text("Run Diagnostic")');
  await expect(page.locator(".diagnostic-result")).toBeVisible({ timeout: 10000 });
  await expect(page.locator(".ssi-score")).not.toBeEmpty();
  await expect(page.locator(".failure-modes")).toBeVisible();
});

// Flow 2: RCM Workbook Update
test("engineer updates RCM task and verifies RPN recalculation", async ({ page }) => {
  await loginAsEngineer(page);
  await page.goto("/rcm/AST-PUMP-001");

  // Edit a maintenance task
  await page.click('tr:has-text("bearing inspection") >> button:text("Edit")');
  await page.fill('[name="severity"]', "8");
  await page.click('button:text("Save")');

  // Verify RPN recalculated
  const rpn = page.locator('tr:has-text("bearing inspection") >> .rpn-value');
  await expect(rpn).not.toHaveText("0");
});

// Flow 3: Copilot Interaction
test("operator asks copilot a diagnostic question", async ({ page }) => {
  await loginAsOperator(page);
  await page.goto("/copilot");
  await page.fill('[name="question"]', "What causes high vibration in centrifugal pumps?");
  await page.click('button:text("Ask")');

  // Response streams in
  const response = page.locator(".copilot-response");
  await expect(response).toBeVisible({ timeout: 15000 });

  // Response should cite rules
  await expect(response).toContainText(/[A-Z]{2,3}\d{3}/); // rule ID pattern
});

Capture screenshots of key pages and compare against baselines:

import assert from "node:assert/strict";
import { test, expect } from "@playwright/test";

test("dashboard visual regression", async ({ page }) => {
  await loginAsOperator(page);
  await page.goto("/dashboard");
  await page.waitForLoadState("networkidle");
  await expect(page).toHaveScreenshot("dashboard.png", { maxDiffPixels: 100 });
});

// Load test: 50 concurrent diagnose requests
test("diagnostic endpoint handles concurrent load", async () => {
  const payload = loadFixture("ims_scenario_001.json");
  const requests = Array.from({ length: 50 }, () =>
    fetch("http://localhost:8000/rapid-ai/v1/assets/AST-PUMP-001/diagnose", {
      method: "POST",
      headers: { "Content-Type": "application/json", "Authorization": "Bearer ..." },
      body: JSON.stringify(payload),
    })
  );
  const responses = await Promise.all(requests);
  const allOk = responses.every((r) => r.status === 200);
  assert(allOk, "All concurrent requests should succeed");

  // p95 latency check, read from the x-response-time header (a missing header counts as 0)
  const durations = responses.map((r) => parseInt(r.headers.get("x-response-time") || "0", 10));
  const p95 = durations.sort((a, b) => a - b)[Math.floor(durations.length * 0.95)];
  assert(p95 < 500, `p95 latency ${p95}ms exceeds 500ms target`);
});

This is the most important testing layer. It validates that the system’s diagnostic outputs are correct — not just structurally valid, but factually right.

Curate a ground truth dataset from Dibyendu’s 4,000+ validated diagnostic cases. Each entry contains:

  • Asset type and configuration
  • Raw sensor readings (or synthetic equivalents)
  • Known failure mode (confirmed by inspection, teardown, or field validation)
  • Expected confidence range
  • Expected health stage

Start with the 100 IMS scenarios as the initial ground truth set. Expand to 500+ as field data accumulates.
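
One way to represent a ground truth entry with the fields listed above is a small dataclass. This is a sketch only: the class name `GroundTruthCase`, its field names, and the example values are assumptions for illustration, not the format used in the repository.

```python
from dataclasses import dataclass


@dataclass
class GroundTruthCase:
    """One validated diagnostic case (hypothetical schema)."""

    case_id: str
    asset_type: str
    configuration: dict
    sensor_readings: dict            # raw readings or synthetic equivalents
    known_failure_mode: str          # confirmed by inspection, teardown, or field validation
    confidence_range: tuple[float, float]
    expected_health_stage: str
    synthetic: bool = False

    def is_satisfied_by(self, failure_mode: str, confidence: float, health_stage: str) -> bool:
        """True when a diagnostic result matches every expectation of this case."""
        low, high = self.confidence_range
        return (
            failure_mode == self.known_failure_mode
            and low <= confidence <= high
            and health_stage == self.expected_health_stage
        )


# Illustrative values only; they are not taken from the real dataset.
case = GroundTruthCase(
    case_id="IMS001",
    asset_type="centrifugal_pump",
    configuration={"bearing": "example"},
    sensor_readings={"rms": 5.2},
    known_failure_mode="bearing outer race spalling",
    confidence_range=(0.70, 0.95),
    expected_health_stage="alert",
)
print(case.is_satisfied_by("bearing outer race spalling", 0.85, "alert"))
print(case.is_satisfied_by("bearing inner race spalling", 0.85, "alert"))
```

Keeping the expected confidence as a range rather than a single number lets the same case remain valid when rule weights are tuned slightly.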

def test_diagnostic_accuracy():
    results = run_all_ground_truth_scenarios()
    # Per-failure-mode metrics
    for failure_mode in unique_failure_modes:
        tp = count_true_positives(results, failure_mode)
        fp = count_false_positives(results, failure_mode)
        fn = count_false_negatives(results, failure_mode)
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
        assert precision > 0.80, f"{failure_mode}: precision {precision:.2f} < 0.80"
        assert recall > 0.75, f"{failure_mode}: recall {recall:.2f} < 0.75"

Generate a confusion matrix showing which failure modes get misdiagnosed as which. Common confusions to watch for:

  • Bearing outer race vs. inner race (similar vibration signatures at different frequencies)
  • Misalignment vs. unbalance (both produce 1x RPM vibration)
  • Cavitation vs. recirculation (both cause broadband noise in pumps)

The confusion matrix is regenerated on every CI run and stored as a test artifact. Any new off-diagonal entry above 5% triggers a review.
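
The review trigger described above can be sketched as a comparison between the current confusion rates and a stored baseline. The sketch assumes that "above 5%" means 5% of the cases for the expected failure mode (a row-normalized rate) and that "new" means the entry was at or below the limit in the baseline. The function names and the sample counts are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_matrix(pairs):
    """Build {expected: Counter({predicted: count})} from (expected, predicted) pairs."""
    matrix = defaultdict(Counter)
    for expected, predicted in pairs:
        matrix[expected][predicted] += 1
    return matrix


def off_diagonal_rates(matrix):
    """Row-normalized misdiagnosis rates keyed by (expected, predicted)."""
    rates = {}
    for expected, row in matrix.items():
        total = sum(row.values())
        for predicted, count in row.items():
            if predicted != expected:
                rates[(expected, predicted)] = count / total
    return rates


def new_confusions(current, baseline, limit=0.05):
    """Entries above the limit now that were not above it in the baseline."""
    return sorted(
        key for key, rate in current.items()
        if rate > limit and baseline.get(key, 0.0) <= limit
    )


# Made-up counts: 2 of 20 outer race cases misread as inner race (10%),
# 1 of 25 misalignment cases misread as unbalance (4%).
pairs = (
    [("outer_race", "outer_race")] * 18
    + [("outer_race", "inner_race")] * 2
    + [("misalignment", "misalignment")] * 24
    + [("misalignment", "unbalance")] * 1
)
rates = off_diagonal_rates(confusion_matrix(pairs))
print(new_confusions(rates, baseline={}))
```

Only the outer race to inner race entry is reported, because the misalignment confusion stays under the limit.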

def test_confidence_calibration():
    """Verify that stated confidence matches actual accuracy."""
    results = run_all_ground_truth_scenarios()
    # Bucket results by confidence range
    buckets = {
        "0.8-1.0": [], "0.6-0.8": [], "0.4-0.6": [],
        "0.2-0.4": [], "0.0-0.2": [],
    }
    for r in results:
        bucket = get_bucket(r.confidence)
        buckets[bucket].append(r.is_correct)
    # Predictions in the 80-100% confidence bucket should be correct
    # between 70% and 95% of the time (the bounds asserted below)
    high_conf = buckets["0.8-1.0"]
    if len(high_conf) > 10:  # need sufficient sample size
        actual_accuracy = sum(high_conf) / len(high_conf)
        assert 0.70 < actual_accuracy < 0.95, (
            f"High confidence calibration off: {actual_accuracy:.2f}"
        )
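
The calibration test relies on a `get_bucket` helper that is not shown. A minimal sketch of one possible definition follows, assuming that each bucket includes its lower edge, that 1.0 falls in the top bucket, and that values outside [0, 1] are rejected.

```python
# Lower edge of each bucket, highest first, with the label used by the test.
BUCKET_EDGES = [
    (0.8, "0.8-1.0"),
    (0.6, "0.6-0.8"),
    (0.4, "0.4-0.6"),
    (0.2, "0.2-0.4"),
    (0.0, "0.0-0.2"),
]


def get_bucket(confidence: float) -> str:
    """Map a confidence in [0, 1] to its bucket label (lower edge inclusive)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    for lower, label in BUCKET_EDGES:
        if confidence >= lower:
            return label
    raise AssertionError("unreachable: 0.0 matches the last bucket")


print(get_bucket(0.85), get_bucket(0.8), get_bucket(0.0))
```

Rejecting out-of-range values, including NaN, keeps a malformed confidence from being counted silently in the lowest bucket.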

Every time a new rule is added or an engine is modified, the full ground truth suite runs. The test fails if any previously passing scenario now fails:

import pytest


def test_no_accuracy_regression():
    current = run_accuracy_suite()
    baseline = load_baseline("accuracy_baseline.json")
    for scenario_id in baseline:
        if baseline[scenario_id].passed and not current[scenario_id].passed:
            pytest.fail(
                f"Regression: {scenario_id} was correct, now incorrect. "
                f"Expected: {baseline[scenario_id].failure_mode}, "
                f"Got: {current[scenario_id].failure_mode}"
            )

Each asset type has a data generator that produces realistic sensor payloads:

def generate_pump_sensor_data(
    failure_mode: str | None = None,
    severity: float = 0.5,
    noise_level: float = 0.1,
    sample_rate: int = 25600,
    duration_seconds: float = 1.0,
) -> SensorPayload:
    """Generate synthetic vibration data for a centrifugal pump.

    If failure_mode is specified, inject the corresponding spectral
    signature (e.g., BPFO harmonics for outer race spalling).
    """
    ...

Generators exist for: centrifugal pump, electric motor, gearbox, compressor, fan, turbine, conveyor, agitator, cooling tower fan, and all other 19 asset types in the IMS.
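
As an illustration of what such a generator might do internally, the sketch below builds a time-domain signal from a shaft-speed sinusoid, optional fault harmonics, and Gaussian noise. It is not the project's generator: the function name `synth_vibration`, the 25 Hz shaft frequency, the 89 Hz fault frequency (a placeholder, not a real BPFO for any bearing), and the 1/k harmonic decay are all assumptions.

```python
import math
import random


def synth_vibration(
    fault_freq_hz=None,
    severity=0.5,
    noise_level=0.1,
    shaft_hz=25.0,
    sample_rate=25600,
    duration_seconds=1.0,
    seed=0,
):
    """Return a list of samples: 1x shaft tone + optional fault harmonics + noise."""
    rng = random.Random(seed)  # seeded so fixtures are reproducible
    n = int(sample_rate * duration_seconds)
    samples = []
    for i in range(n):
        t = i / sample_rate
        x = math.sin(2 * math.pi * shaft_hz * t)
        if fault_freq_hz is not None:
            # Inject the first three harmonics of the fault frequency,
            # with amplitude scaled by severity and decaying as 1/k.
            for k in (1, 2, 3):
                x += (severity / k) * math.sin(2 * math.pi * k * fault_freq_hz * t)
        x += rng.gauss(0.0, noise_level)
        samples.append(x)
    return samples


def rms(samples):
    """Root mean square of a sample list."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


healthy = synth_vibration(noise_level=0.0)
faulty = synth_vibration(fault_freq_hz=89.0, noise_level=0.0)
print(len(healthy), rms(healthy) < rms(faulty))
```

Seeding the noise source matters for fixtures: the same arguments always produce the same payload, so a failing scenario can be replayed exactly.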

Stored as JSON files in tests/fixtures/scenarios/:

tests/fixtures/scenarios/
├── IMS001_pump_bearing_spalling.json
├── IMS002_motor_stator_winding.json
├── ...
├── IMS100_conveyor_belt_tracking.json
├── edge_case_missing_sensors.json
├── edge_case_extreme_values.json
├── edge_case_contradictory_evidence.json
└── manifest.json # maps scenario -> expected result

Each fixture includes the sensor payload, expected failure modes, expected SSI range, expected health stage, and expected confidence range.

| Case | Description | Expected Behavior |
| --- | --- | --- |
| Missing sensors | Only 2 of 5 expected sensor channels present | Reduced confidence, partial diagnosis |
| Extreme values | RMS = 999.9 mm/s (sensor malfunction) | Guard rule DG003 blocks or penalizes |
| All zeros | Flatline signal on all channels | Guard rule DG001 blocks with “flatline” |
| Contradictory evidence | Temperature normal but vibration critical | Both reported, confidence reduced, CDE flags contradiction |
| All equal FFT bins | Uniform spectrum (white noise) | Maximum spectral entropy, classified as “chaotic” |

Track diagnostic time as a function of rule count:

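import pytest


@pytest.mark.benchmark
@pytest.mark.parametrize("rule_count", [50, 100, 200, 500])
def test_diagnostic_performance_scaling(benchmark, rule_count):
    """Diagnostic time should stay within a linear budget of 0.5 ms per rule."""
    rules = generate_synthetic_rules(rule_count)
    # pytest-benchmark allows the benchmark fixture to be called once per test,
    # so each rule count runs as its own parametrized case instead of a loop.
    result = benchmark(run_diagnostic, rules=rules, sensor_data=fixture)
    assert result.duration_ms < rule_count * 0.5  # 0.5ms per rule max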

commit → pre-commit → PR checks → merge → staging → production

Run on every commit, locally and in CI:

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: ruff-lint
        name: Python linting (ruff)
        entry: ruff check --fix
        language: system  # local hooks must declare a language
        types: [python]
      - id: ruff-format
        name: Python formatting (ruff)
        entry: ruff format --check
        language: system
        types: [python]
      - id: pyright
        name: Python type checking (pyright)
        entry: pyright
        language: system
        types: [python]
      - id: svelte-check
        name: SvelteKit type checking
        entry: bun run check
        language: system
        types_or: [ts, svelte]

# GitHub Actions (abridged: runs-on, checkout, and setup steps omitted)
pr-checks:
  # Service containers are declared at job level, not inside a step
  services:
    postgres:
      image: pgvector/pgvector:pg17
  steps:
    - name: Python unit tests
      run: pytest tests/unit/ -v --cov=app --cov-report=xml --cov-fail-under=80
    - name: Python integration tests
      run: pytest tests/integration/ -v --timeout=120
    - name: Frontend unit tests
      run: bun run test:unit -- --coverage
    - name: Frontend type check
      run: bun run check

merge-checks:
  steps:
    - name: Full test suite
      run: pytest tests/ -v --cov=app --cov-report=xml
    - name: Diagnostic accuracy regression
      run: pytest tests/accuracy/ -v --tb=long
      # Fails if any previously passing ground truth scenario now fails
    - name: Confusion matrix generation
      run: python scripts/generate_confusion_matrix.py
      # Uploads matrix as build artifact

staging-deploy:
  needs: merge-checks
  steps:
    - name: Deploy to staging
      run: ./scripts/deploy-staging.sh
    - name: Playwright E2E tests
      run: bun run test:e2e
      env:
        BASE_URL: https://staging.rapidai.example.com
    - name: Performance smoke test
      run: python scripts/load_test.py --target staging --concurrent 20 --duration 60

production-deploy:
  needs: staging-deploy
  steps:
    - name: Deploy canary (10% traffic)
      run: ./scripts/deploy-canary.sh
    - name: Smoke tests against canary
      run: |
        curl -f https://rapidai.example.com/rapid-ai/v1/health
        python scripts/smoke_test.py --target production
    - name: Monitor error rate (5 minutes)
      run: python scripts/check_error_rate.py --threshold 0.01 --window 300
    - name: Promote to full deployment
      run: ./scripts/promote-canary.sh

All test runs produce artifacts stored for 90 days:

  • Coverage reports (XML and HTML)
  • Confusion matrices (PNG and CSV)
  • Performance benchmark history (JSON, graphed in CI dashboard)
  • Playwright screenshots and traces (for failed E2E tests)
  • Accuracy baseline snapshots (updated on each merge to main)

A test that was green yesterday and red today is a regression, not a flaky test. Investigate immediately.


| Standard | Relevance to This Chapter |
| --- | --- |
| ISO 13374: Condition monitoring and diagnostics of machines | The testing strategy validates that each ISO 13374 processing level (L2 through L6) produces correct outputs for known diagnostic scenarios, with regression tests covering the full processing chain. |
| ISO 17359: General guidelines for condition monitoring | The diagnostic accuracy benchmarking (known-good scenarios with expected outputs) implements ISO 17359’s requirement for validated, repeatable condition monitoring system performance. |

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |