Chapter 24 — Testing Strategy

RAPID AI is a safety-adjacent system. When it tells a reliability engineer that a pump bearing has outer race spalling with 85% confidence, that engineer may shut down a process line, order a $40,000 replacement, and mobilize a maintenance crew. A false positive wastes money. A false negative risks catastrophic failure. The testing strategy exists to make the diagnostic engines trustworthy — not through hope, but through systematic verification at every layer.

24.1 Testing Philosophy

Coverage Target: 80% Minimum

Every module in the codebase must maintain 80% or higher line coverage. The diagnostic engines (Modules A through E) must exceed 90%. Coverage is measured by pytest-cov for Python and vitest for SvelteKit, and enforced in CI — a PR that drops coverage below threshold cannot merge.

TDD Cycle: Red, Green, Refactor

All new diagnostic logic follows test-driven development:

Red: Write a test that exercises the expected behavior. Run it. Watch it fail. This confirms the test is actually testing something.
Green: Write the minimum code to make the test pass. No cleverness, no optimization — just make it work.
Refactor: Clean up the implementation without changing behavior. The tests guard against regressions during refactoring.

Rules Are Data; Test the Engine

RAPID AI contains hundreds of rules: 16 guard rules (DG001-DG019), 119 component failure mode rules, 50 signal feature rules (SF001-SF051), 22 block scoring rules (BSR001-BSR022), 10 health stage rules (HSR001-HSR010), and 10 priority window rules (PWR001-PWR010). Testing each rule individually would create a brittle, unmaintainable test suite that breaks every time Dibyendu adds a new failure mode.

Instead, the strategy is: test the engine, not individual rules. If the rule evaluator correctly handles every operator type (>, >=, <, <=, ==, !=, BETWEEN, IN, LIKE), correctly parses parenthetical grouping, and correctly treats semicolons as AND conjunctions, then any well-formed rule will be evaluated exactly as written. Engine correctness guarantees faithful evaluation of a rule; it does not guarantee that the rule's thresholds are right.

The exception is regression tests for specific known-good diagnostic scenarios (Section 24.5), which validate the combined behavior of rules plus engine.

24.2 Unit Testing

Unit tests verify individual functions and modules in isolation, with no database, no network, and no filesystem access. All external dependencies are mocked or stubbed.

Framework

Python: pytest with fixtures, parametrize for data-driven tests, unittest.mock for stubs
TypeScript: vitest with SvelteKit testing utilities

Rule Evaluator

The rule evaluator parses conditional expressions from the IMS and evaluates them against sensor data. It must handle:

# Test every comparison operator
def test_greater_than():
    assert evaluate("rms > 4.5", {"rms": 5.0})
    assert not evaluate("rms > 4.5", {"rms": 4.5})  # boundary: strictly greater

def test_between():
    assert evaluate("kurtosis BETWEEN 3.0 AND 6.0", {"kurtosis": 4.5})
    assert not evaluate("kurtosis BETWEEN 3.0 AND 6.0", {"kurtosis": 7.0})

# Test semicolons as AND
def test_semicolon_conjunction():
    expr = "rms > 4.5; crest_factor > 3.0; kurtosis > 5.0"
    data = {"rms": 5.0, "crest_factor": 3.5, "kurtosis": 6.0}
    assert evaluate(expr, data)

# Test parenthetical grouping
def test_parentheses():
    expr = "(rms > 4.5 AND peak > 10.0) OR kurtosis > 8.0"
    data = {"rms": 3.0, "peak": 5.0, "kurtosis": 9.0}
    assert evaluate(expr, data)  # kurtosis branch

# Edge cases
def test_missing_sensor_returns_false():
    assert not evaluate("rms > 4.5", {})

def test_null_value_returns_false():
    assert not evaluate("rms > 4.5", {"rms": None})
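
The simpler cases above can be met by a very small evaluator. What follows is a minimal sketch, not the production rule evaluator: `evaluate_simple` is a hypothetical name, it covers only numeric comparisons, BETWEEN, and semicolon conjunctions, and it assumes BETWEEN is inclusive at both ends. Parenthetical grouping, OR, IN, and LIKE are deliberately left out.

```python
import operator
import re

# Comparison operators covered by this sketch.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}
_NUMBER = r"-?\d+(?:\.\d+)?"
# Two-character operators come first in the alternation so ">=" is not read as ">".
_COMPARE = re.compile(rf"^(\w+)\s*(>=|<=|==|!=|>|<)\s*({_NUMBER})$")
_BETWEEN = re.compile(rf"^(\w+)\s+BETWEEN\s+({_NUMBER})\s+AND\s+({_NUMBER})$")


def _evaluate_clause(clause, data):
    """Evaluate one clause; a missing or None sensor value yields False."""
    clause = clause.strip()
    match = _BETWEEN.match(clause)
    if match:
        value = data.get(match.group(1))
        if value is None:
            return False
        # Assumption: BETWEEN includes both bounds.
        return float(match.group(2)) <= value <= float(match.group(3))
    match = _COMPARE.match(clause)
    if match:
        value = data.get(match.group(1))
        if value is None:
            return False
        return _OPS[match.group(2)](value, float(match.group(3)))
    raise ValueError(f"Unsupported clause: {clause!r}")


def evaluate_simple(expr, data):
    """Treat semicolons as AND: every non-empty clause must hold."""
    clauses = [c for c in expr.split(";") if c.strip()]
    return all(_evaluate_clause(c, data) for c in clauses)
```

Rejecting unknown syntax with an exception, rather than returning False, keeps a malformed rule from silently never firing. Whether the real engine makes the same choice is not stated here.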

SEDL Entropy Engine

The Spectral Entropy + Distribution Lag engine computes three entropy components (SE, TE, DE) and classifies signal stability state:

import numpy as np
import pytest

@pytest.fixture
def uniform_signal():
    """All FFT bins equal -- maximum entropy."""
    return np.ones(512) / 512

@pytest.fixture
def spike_signal():
    """Single dominant bin -- minimum entropy."""
    signal = np.zeros(512)
    signal[42] = 1.0
    return signal

def test_spectral_entropy_uniform(uniform_signal):
    se = compute_spectral_entropy(uniform_signal)
    assert se == pytest.approx(1.0, abs=0.01)

def test_spectral_entropy_spike(spike_signal):
    se = compute_spectral_entropy(spike_signal)
    assert se == pytest.approx(0.0, abs=0.01)

def test_all_zeros_returns_zero_entropy():
    se = compute_spectral_entropy(np.zeros(512))
    assert se == 0.0  # guard against log(0)

def test_sedl_state_classification():
    # Low SE + low TE + low DE = stable
    assert classify_sedl_state(0.1, 0.1, 0.05) == "stable"
    # High SE + high TE = chaotic
    assert classify_sedl_state(0.9, 0.8, 0.3) == "chaotic"
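
The entropy tests above pin three behaviors: a uniform spectrum scores 1.0, a single spike scores 0.0, and an all-zero input returns 0.0 instead of hitting log(0). One way to satisfy all three is normalized Shannon entropy, H = -Σ p_i ln p_i / ln N. The sketch below assumes that definition; `spectral_entropy` is a hypothetical stand-in for `compute_spectral_entropy`, written in pure Python so it runs without NumPy.

```python
import math


def spectral_entropy(spectrum):
    """Normalized Shannon entropy of a non-negative spectrum, in [0, 1]."""
    bins = [float(x) for x in spectrum]
    total = sum(bins)
    # Guard: an empty, single-bin, or all-zero spectrum has no distribution
    # to measure, so report zero entropy instead of evaluating log(0).
    if len(bins) < 2 or total <= 0.0:
        return 0.0
    entropy = 0.0
    for x in bins:
        if x > 0.0:  # bins with zero energy contribute nothing
            p = x / total
            entropy -= p * math.log(p)
    # Dividing by ln(N) maps the uniform case to exactly the maximum, 1.0.
    return entropy / math.log(len(bins))
```

Normalizing by ln(N) makes the value independent of FFT length, which is what lets a fixed threshold classify signals captured at different resolutions.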

Fusion Engine (Module C)

Tests cover system profile loading, block score computation (BSR001-BSR022 3-pass evaluation), SSI weighted aggregation, and override logic:

def test_block_score_all_pass():
    """When all evidence blocks pass, SSI should be high."""
    block_results = {f"BSR{i:03d}": 1.0 for i in range(1, 23)}
    ssi = compute_ssi(block_results, profile="centrifugal_pump")
    assert ssi > 0.85

def test_block_score_critical_failure():
    """A single critical block failure should dominate SSI."""
    block_results = {f"BSR{i:03d}": 1.0 for i in range(1, 23)}
    block_results["BSR001"] = 0.0  # bearing health block fails
    ssi = compute_ssi(block_results, profile="centrifugal_pump")
    assert ssi < 0.50

def test_profile_weights_sum_to_one():
    """System profile weights must sum to 1.0 for valid SSI."""
    profile = load_profile("centrifugal_pump")
    total = sum(profile.weights.values())
    assert total == pytest.approx(1.0, abs=0.001)

def test_override_logic():
    """Manual override should replace computed SSI."""
    result = compute_ssi_with_override(
        block_results={...},
        profile="centrifugal_pump",
        override={"ssi": 0.3, "reason": "Known defect, awaiting parts"}
    )
    assert result.ssi == 0.3
    assert result.override_active
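
The tests above imply two properties at once: SSI is a weighted aggregate of block scores, yet a single critical block failure must pull it below 0.50 even when every other block passes. A plain weighted mean cannot do that unless one block carries more than half the weight, so some extra mechanism is needed. The sketch below assumes a cap applied when a critical block scores zero. The cap, the `critical_blocks` parameter, and the name `weighted_ssi` are all hypothetical; the source does not say how Module C achieves this.

```python
def weighted_ssi(block_scores, weights, critical_blocks=(), critical_cap=0.4):
    """Weighted mean of block scores, capped when a critical block fails.

    block_scores and weights map block IDs (e.g. "BSR001") to floats.
    critical_cap is an illustrative value, not a documented constant.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-3:
        raise ValueError("profile weights must sum to 1.0")
    ssi = sum(weights[block] * block_scores[block] for block in weights)
    # Assumption: a zero score on any critical block limits the overall SSI.
    if any(block_scores[block] == 0.0 for block in critical_blocks):
        ssi = min(ssi, critical_cap)
    return ssi
```

With weights of 0.5, 0.25, and 0.25, zeroing the first block gives a weighted mean of exactly 0.5; the cap is what moves the result under the 0.50 line that `test_block_score_critical_failure` checks.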

RUL Engine (Module D)

Tests cover all three RUL models (F001 linear degradation, F002 exponential, F003 Weibull), boundary conditions, and zero-slope guards:

def test_rul_f001_linear_degradation():
    trend = [0.5, 0.6, 0.7, 0.8, 0.9]  # linear increase
    rul = estimate_rul(trend, model="F001", threshold=2.0)
    assert 8 < rul < 15  # slope 0.1 per interval from 0.9 to 2.0 is about 11 intervals

def test_rul_zero_slope_guard():
    """Flat trend should not predict imminent failure."""
    trend = [1.0, 1.0, 1.0, 1.0, 1.0]
    rul = estimate_rul(trend, model="F001", threshold=2.0)
    assert rul == float('inf')  # or sentinel value for "no degradation"

def test_rul_f003_weibull_shape_parameter():
    """Weibull beta > 1 = wear-out, beta < 1 = infant mortality."""
    result = fit_weibull([100, 200, 150, 180, 220])
    assert result.beta > 1.0  # wear-out pattern
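
The linear and zero-slope cases above can be checked by hand with a least-squares line through the trend. The sketch below is an illustration under that assumption, not the F001 implementation: `linear_rul` is a hypothetical name, samples are assumed to be evenly spaced, and remaining life is measured from the last observed value.

```python
import math


def linear_rul(trend, threshold):
    """Intervals until a least-squares line through trend reaches threshold."""
    n = len(trend)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(trend) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(trend))
    slope = sxy / sxx
    # Zero-slope guard: a flat or improving trend never reaches the threshold.
    if slope <= 1e-12:
        return math.inf
    # Clamp at zero in case the last sample is already past the threshold.
    return max(0.0, (threshold - trend[-1]) / slope)
```

For the trend `[0.5, 0.6, 0.7, 0.8, 0.9]` and a threshold of 2.0, the slope is 0.1 per interval and the gap is 1.1, giving roughly 11 intervals, which sits inside the 8 to 15 window asserted in `test_rul_f001_linear_degradation`.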

CDE and Causal Engines

def test_cde_trigger_evaluation():
    """CDE trigger fires when SSI drops below threshold."""
    assert cde_should_trigger(ssi=0.35, threshold=0.40)
    assert not cde_should_trigger(ssi=0.55, threshold=0.40)

def test_causal_keyword_matching():
    """Causal engine ranks causes by keyword relevance."""
    causes = rank_causes(
        failure_mode="bearing outer race spalling",
        evidence=["high_frequency_vibration", "elevated_temperature"]
    )
    assert causes[0].cause == "lubrication failure"  # most relevant
    assert causes[0].confidence > 0.6

def test_confidence_label_mapping():
    assert confidence_label(0.95) == "very_high"
    assert confidence_label(0.75) == "high"
    assert confidence_label(0.55) == "moderate"
    assert confidence_label(0.35) == "low"
    assert confidence_label(0.15) == "very_low"

24.3 Integration Testing

Integration tests verify that modules work together with real infrastructure — a test PostgreSQL database, actual SQL queries, and the full diagnostic pipeline.

Full Pipeline Test

The most critical integration test runs the complete diagnostic chain: sensor data in, diagnostic result out.

@pytest.fixture(scope="session")
def test_db():
    """Spin up a test database, apply migrations, load seed data."""
    # Use testcontainers or a dedicated test database
    engine = create_test_engine()
    apply_migrations(engine)
    load_seed_data(engine, "platform/data/00_run_all_seed_inserts.sql")
    yield engine
    engine.dispose()

# Async tests assume an asyncio-aware pytest plugin (for example pytest-asyncio in auto mode).
async def test_full_diagnostic_pipeline(test_db):
    """End-to-end: sensor payload -> all 5 modules -> diagnostic result."""
    payload = load_fixture("ims_scenario_001_bearing_spalling.json")

    result = await run_full_pipeline(
        asset_id="AST-PUMP-001",
        sensor_data=payload,
        db=test_db
    )

    assert result.module_a.quality_score > 0.8
    assert len(result.module_b.failure_modes) > 0
    assert 0.0 <= result.module_c.ssi <= 1.0
    assert result.module_d.health_stage in ["normal", "watch", "alert", "critical"]
    assert len(result.module_e.maintenance_tasks) > 0

IMS Ground Truth Validation

Use the 100 IMS rows as integration test fixtures. For each row, construct a sensor payload that should trigger that specific failure mode, run it through the pipeline, and verify the output matches the expected diagnostic chain:

@pytest.mark.parametrize("ims_id", [f"IMS{i:03d}" for i in range(1, 101)])
async def test_ims_scenario(test_db, ims_id):
    scenario = load_ims_scenario(ims_id)
    result = await run_full_pipeline(
        asset_id=scenario.asset_id,
        sensor_data=scenario.sensor_payload,
        db=test_db
    )
    assert scenario.expected_failure_mode in [
        fm.failure_mode for fm in result.module_b.failure_modes
    ]

API Contract Testing

async def test_diagnose_endpoint_contract(client):
    response = await client.post(
        "/rapid-ai/v1/assets/AST-PUMP-001/diagnose",
        json={"sensor_data": {...}, "timestamp": "2026-03-17T14:00:00Z"}
    )
    assert response.status_code == 200
    body = response.json()
    assert "ssi_score" in body
    assert "failure_modes" in body
    assert "health_stage" in body
    assert "confidence" in body

async def test_diagnose_invalid_asset_returns_404(client):
    response = await client.post(
        "/rapid-ai/v1/assets/NONEXISTENT/diagnose",
        json={"sensor_data": {...}}
    )
    assert response.status_code == 404

async def test_diagnose_missing_payload_returns_422(client):
    response = await client.post(
        "/rapid-ai/v1/assets/AST-PUMP-001/diagnose",
        json={}
    )
    assert response.status_code == 422

Auth Integration

async def test_unauthenticated_request_returns_401(client):
    response = await client.get("/rapid-ai/v1/assets/AST-PUMP-001")
    assert response.status_code == 401

async def test_operator_cannot_access_admin_routes(operator_client):
    response = await operator_client.post("/rapid-ai/v1/admin/schema/reload")
    assert response.status_code == 403

async def test_admin_can_reload_schema(admin_client):
    response = await admin_client.post("/rapid-ai/v1/admin/schema/reload")
    assert response.status_code == 200

24.4 End-to-End Testing

E2E tests drive the actual browser through the SvelteKit frontend, verifying that the full stack works from the user’s perspective.

Framework

Playwright with TypeScript. Tests run against a fully deployed stack (backend + frontend + database with seed data).

Critical User Flows

import { test, expect } from "@playwright/test";

// Flow 1: Dashboard to Diagnostic Result
test("operator views asset health and runs diagnostic", async ({ page }) => {
  await page.goto("/login");
  await page.fill('[name="email"]', "operator@plant.com");
  await page.fill('[name="password"]', "test-password");
  await page.click('button[type="submit"]');

  // Dashboard loads
  await expect(page.locator("h1")).toContainText("Plant Overview");

  // Navigate to asset
  await page.click('text=AST-PUMP-001');
  await expect(page.locator(".health-card")).toBeVisible();

  // Run diagnostic
  await page.click('button:text("Run Diagnostic")');
  await expect(page.locator(".diagnostic-result")).toBeVisible({ timeout: 10000 });
  await expect(page.locator(".ssi-score")).not.toBeEmpty();
  await expect(page.locator(".failure-modes")).toBeVisible();
});

// Flow 2: RCM Workbook Update
test("engineer updates RCM task and verifies RPN recalculation", async ({ page }) => {
  await loginAsEngineer(page);
  await page.goto("/rcm/AST-PUMP-001");

  // Edit a maintenance task
  await page.click('tr:has-text("bearing inspection") >> button:text("Edit")');
  await page.fill('[name="severity"]', "8");
  await page.click('button:text("Save")');

  // Verify RPN recalculated
  const rpn = page.locator('tr:has-text("bearing inspection") >> .rpn-value');
  await expect(rpn).not.toHaveText("0");
});

// Flow 3: Copilot Interaction
test("operator asks copilot a diagnostic question", async ({ page }) => {
  await loginAsOperator(page);
  await page.goto("/copilot");

  await page.fill('[name="question"]', "What causes high vibration in centrifugal pumps?");
  await page.click('button:text("Ask")');

  // Response streams in
  const response = page.locator(".copilot-response");
  await expect(response).toBeVisible({ timeout: 15000 });
  // Response should cite rules
  await expect(response).toContainText(/[A-Z]{2,3}\d{3}/);  // rule ID pattern
});

Visual Regression Testing

Capture screenshots of key pages and compare against baselines:

test("dashboard visual regression", async ({ page }) => {
  await loginAsOperator(page);
  await page.goto("/dashboard");
  await page.waitForLoadState("networkidle");
  await expect(page).toHaveScreenshot("dashboard.png", { maxDiffPixels: 100 });
});

Performance Testing

test("diagnostic endpoint handles concurrent load", async () => {
  const payload = loadFixture("ims_scenario_001.json");
  const requests = Array.from({ length: 50 }, () =>
    fetch("http://localhost:8000/rapid-ai/v1/assets/AST-PUMP-001/diagnose", {
      method: "POST",
      headers: { "Content-Type": "application/json", "Authorization": "Bearer ..." },
      body: JSON.stringify(payload),
    })
  );

  const responses = await Promise.all(requests);
  const allOk = responses.every((r) => r.status === 200);
  expect(allOk, "All concurrent requests should succeed").toBe(true);

  // p95 latency check. Relies on the backend reporting milliseconds in an
  // x-response-time header; a missing header is read as 0 and would pass silently.
  const durations = responses
    .map((r) => Number.parseInt(r.headers.get("x-response-time") || "0", 10))
    .sort((a, b) => a - b);
  const p95 = durations[Math.floor(durations.length * 0.95)];
  expect(p95, `p95 latency ${p95}ms exceeds 500ms target`).toBeLessThan(500);
});

24.5 Diagnostic Accuracy Testing

This is the most important testing layer. It validates that the system’s diagnostic outputs are correct — not just structurally valid, but factually right.

Ground Truth Dataset

Curate a ground truth dataset from Dibyendu’s 4,000+ validated diagnostic cases. Each entry contains:

Asset type and configuration
Raw sensor readings (or synthetic equivalents)
Known failure mode (confirmed by inspection, teardown, or field validation)
Expected confidence range
Expected health stage

Start with the 100 IMS scenarios as the initial ground truth set. Expand to 500+ as field data accumulates.

Accuracy Metrics

def test_diagnostic_accuracy():
    results = run_all_ground_truth_scenarios()

    # Per-failure-mode metrics
    for failure_mode in unique_failure_modes:
        tp = count_true_positives(results, failure_mode)
        fp = count_false_positives(results, failure_mode)
        fn = count_false_negatives(results, failure_mode)

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        assert precision > 0.80, f"{failure_mode}: precision {precision:.2f} < 0.80"
        assert recall > 0.75, f"{failure_mode}: recall {recall:.2f} < 0.75"
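
The counting helpers in the test above are not defined in this chapter. As a rough illustration of what they would compute, the sketch below derives per-failure-mode precision, recall, and F1 from (expected, predicted) pairs. It assumes each scenario has exactly one expected mode and that only the top-ranked prediction is scored; `per_mode_metrics` is a hypothetical name.

```python
from collections import Counter


def per_mode_metrics(pairs):
    """Map each failure mode to (precision, recall, f1).

    pairs is an iterable of (expected_mode, predicted_mode) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fn[expected] += 1   # the true mode was missed
            fp[predicted] += 1  # the predicted mode was a false alarm
    metrics = {}
    for mode in set(tp) | set(fp) | set(fn):
        predicted_total = tp[mode] + fp[mode]
        actual_total = tp[mode] + fn[mode]
        precision = tp[mode] / predicted_total if predicted_total else 0.0
        recall = tp[mode] / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[mode] = (precision, recall, f1)
    return metrics
```

Note that a mode which appears only as a wrong prediction still gets an entry, with recall 0.0, so a spurious failure mode shows up in the report instead of being dropped.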

Confusion Matrix

Generate a confusion matrix showing which failure modes get misdiagnosed as which. Common confusions to watch for:

Bearing outer race vs. inner race (similar vibration signatures at different frequencies)
Misalignment vs. unbalance (both produce 1x RPM vibration)
Cavitation vs. recirculation (both cause broadband noise in pumps)

The confusion matrix is regenerated on every CI run and stored as a test artifact. Any new off-diagonal entry above 5% triggers a review.
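
The 5% review trigger can be sketched as follows. This is an illustration only: it assumes the percentage is row-normalized (the share of scenarios with a given true mode that were assigned a particular wrong mode), which the text does not specify, and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_rates(pairs):
    """Row-normalized confusion matrix from (expected, predicted) pairs."""
    counts = defaultdict(Counter)
    for expected, predicted in pairs:
        counts[expected][predicted] += 1
    return {
        expected: {pred: c / sum(row.values()) for pred, c in row.items()}
        for expected, row in counts.items()
    }


def off_diagonal_above(rates, limit=0.05):
    """List (expected, predicted, rate) for misdiagnoses above the limit."""
    return sorted(
        (expected, predicted, rate)
        for expected, row in rates.items()
        for predicted, rate in row.items()
        if expected != predicted and rate > limit
    )
```

Detecting a new entry, as opposed to any entry, would additionally require comparing this list against the one stored from the previous run.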

Confidence Calibration

def test_confidence_calibration():
    """Verify that stated confidence matches actual accuracy."""
    results = run_all_ground_truth_scenarios()

    # Bucket results by confidence range
    buckets = {
        "0.8-1.0": [], "0.6-0.8": [], "0.4-0.6": [],
        "0.2-0.4": [], "0.0-0.2": []
    }
    for r in results:
        bucket = get_bucket(r.confidence)
        buckets[bucket].append(r.is_correct)

    # Predictions at 80-100% confidence should be correct 70-95% of the time
    high_conf = buckets["0.8-1.0"]
    if len(high_conf) > 10:  # need sufficient sample size
        actual_accuracy = sum(high_conf) / len(high_conf)
        assert 0.70 < actual_accuracy < 0.95, \
            f"High confidence calibration off: {actual_accuracy:.2f}"
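
The calibration test above depends on a `get_bucket` helper that is not shown. A minimal sketch follows, assuming each bucket includes its lower bound and that 1.0 falls in the top bucket; `bucket_accuracy` is a hypothetical addition that extends the same check to every bucket.

```python
# Bucket lower bounds, highest first; labels match the buckets dict above.
_BUCKETS = [
    (0.8, "0.8-1.0"),
    (0.6, "0.6-0.8"),
    (0.4, "0.4-0.6"),
    (0.2, "0.2-0.4"),
    (0.0, "0.0-0.2"),
]


def get_bucket(confidence):
    """Return the label of the bucket containing confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    for lower, label in _BUCKETS:
        if confidence >= lower:
            return label
    raise AssertionError("unreachable: 0.0 is the lowest bound")


def bucket_accuracy(results):
    """Observed accuracy per bucket, or None for an empty bucket.

    results is an iterable of (confidence, is_correct) pairs.
    """
    hits = {label: [] for _, label in _BUCKETS}
    for confidence, is_correct in results:
        hits[get_bucket(confidence)].append(bool(is_correct))
    return {
        label: (sum(flags) / len(flags) if flags else None)
        for label, flags in hits.items()
    }
```

Comparing thresholds directly, instead of computing an index such as `int(confidence * 5)`, avoids floating-point surprises at the bucket edges.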

Regression Guard

Every time a new rule is added or an engine is modified, the full ground truth suite runs. The test fails if any previously passing scenario now fails:

def test_no_accuracy_regression():
    current = run_accuracy_suite()
    baseline = load_baseline("accuracy_baseline.json")

    for scenario_id in baseline:
        if baseline[scenario_id].passed and not current[scenario_id].passed:
            pytest.fail(
                f"Regression: {scenario_id} was correct, now incorrect. "
                f"Expected: {baseline[scenario_id].failure_mode}, "
                f"Got: {current[scenario_id].failure_mode}"
            )
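
The test above stops at the first regression it finds. When several scenarios break in one change, it can help to report them together. The sketch below is one hedged way to do that; it simplifies the baseline and current results to plain scenario-ID to pass/fail mappings, and `find_regressions` is a hypothetical name.

```python
def find_regressions(baseline, current):
    """Return sorted IDs of scenarios that passed before but not now.

    Both arguments map scenario ID -> bool (passed). A scenario that is
    missing from current is treated as a regression, not ignored.
    """
    return sorted(
        scenario_id
        for scenario_id, passed in baseline.items()
        if passed and not current.get(scenario_id, False)
    )
```

Treating a missing scenario as a failure means a deleted or renamed fixture cannot quietly shrink the ground truth suite.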

24.6 Test Data Management

Synthetic Sensor Data Generators

Each asset type has a data generator that produces realistic sensor payloads:

def generate_pump_sensor_data(
    failure_mode: str | None = None,
    severity: float = 0.5,
    noise_level: float = 0.1,
    sample_rate: int = 25600,
    duration_seconds: float = 1.0,
) -> SensorPayload:
    """Generate synthetic vibration data for a centrifugal pump.

    If failure_mode is specified, inject the corresponding spectral
    signature (e.g., BPFO harmonics for outer race spalling).
    """
    ...
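
To make the idea of injecting a spectral signature concrete, here is a simplified sketch. It is not the project's generator: it returns a plain list instead of a `SensorPayload`, the names `synth_vibration`, `tone_amplitude`, `shaft_hz`, `fault_hz`, `harmonics`, and `seed` are hypothetical, and the fault frequency must be supplied by the caller because the real BPFO depends on bearing geometry.

```python
import math
import random


def synth_vibration(shaft_hz=30.0, fault_hz=None, severity=0.5,
                    noise_level=0.1, sample_rate=25600,
                    duration_seconds=1.0, harmonics=3, seed=0):
    """Shaft tone plus optional fault harmonics plus Gaussian noise."""
    rng = random.Random(seed)  # seeded so fixtures are reproducible
    n = int(sample_rate * duration_seconds)
    samples = []
    for i in range(n):
        t = i / sample_rate
        x = math.sin(2 * math.pi * shaft_hz * t)
        if fault_hz is not None:
            # Inject the fault tone and its harmonics, decaying as 1/k.
            for k in range(1, harmonics + 1):
                x += (severity / k) * math.sin(2 * math.pi * k * fault_hz * t)
        x += rng.gauss(0.0, noise_level)
        samples.append(x)
    return samples


def tone_amplitude(samples, freq_hz, sample_rate):
    """Estimate the amplitude of one frequency by direct correlation."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n
```

A test can then generate one healthy and one faulted signal with the same seed and check that `tone_amplitude` at the fault frequency is near `severity` only for the faulted one. Fixing the seed keeps the noise identical between runs, so a failing scenario can be reproduced exactly.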

Generators exist for: centrifugal pump, electric motor, gearbox, compressor, fan, turbine, conveyor, agitator, cooling tower fan, and all other 19 asset types in the IMS.

Known-Good Scenario Fixtures

Stored as JSON files in tests/fixtures/scenarios/:

tests/fixtures/scenarios/
├── IMS001_pump_bearing_spalling.json
├── IMS002_motor_stator_winding.json
├── ...
├── IMS100_conveyor_belt_tracking.json
├── edge_case_missing_sensors.json
├── edge_case_extreme_values.json
├── edge_case_contradictory_evidence.json
└── manifest.json                        # maps scenario -> expected result

Each fixture includes the sensor payload, expected failure modes, expected SSI range, expected health stage, and expected confidence range.

Edge Cases

Case	Description	Expected Behavior
Missing sensors	Only 2 of 5 expected sensor channels present	Reduced confidence, partial diagnosis
Extreme values	RMS = 999.9 mm/s (sensor malfunction)	Guard rule DG003 blocks or penalizes
All zeros	Flatline signal on all channels	Guard rule DG001 blocks with “flatline”
Contradictory evidence	Temperature normal but vibration critical	Both reported, confidence reduced, CDE flags contradiction
All equal FFT bins	Uniform spectrum (white noise)	Maximum spectral entropy, classified as “chaotic”

Performance Benchmarks

Track diagnostic time as a function of rule count:

@pytest.mark.benchmark
@pytest.mark.parametrize("rule_count", [50, 100, 200, 500])
def test_diagnostic_performance_scaling(benchmark, rule_count):
    """Diagnostic time should scale linearly with rule count."""
    # The benchmark fixture can be invoked only once per test, so each
    # rule count runs as its own parametrized case instead of a loop.
    rules = generate_synthetic_rules(rule_count)
    result = benchmark(run_diagnostic, rules=rules, sensor_data=fixture)
    assert result.duration_ms < rule_count * 0.5  # 0.5ms per rule max

24.7 CI/CD Integration

Pipeline Stages

commit → pre-commit → PR checks → merge → staging → production

Pre-Commit Hooks

Run on every commit, locally and in CI:

repos:
  - repo: local
    hooks:
      - id: ruff-lint
        name: Python linting (ruff)
        entry: ruff check --fix
        language: system
        types: [python]
      - id: ruff-format
        name: Python formatting (ruff)
        entry: ruff format --check
        language: system
        types: [python]
      - id: pyright
        name: Python type checking (pyright)
        entry: pyright
        language: system
        types: [python]
      - id: svelte-check
        name: SvelteKit type checking
        entry: bun run check
        language: system
        types_or: [ts, svelte]

PR Checks (Required to Merge)

# GitHub Actions
pr-checks:
  # Service containers are declared on the job, not on an individual step
  services:
    postgres:
      image: pgvector/pgvector:pg17
  steps:
    - name: Python unit tests
      run: pytest tests/unit/ -v --cov=app --cov-report=xml --cov-fail-under=80

    - name: Python integration tests
      run: pytest tests/integration/ -v --timeout=120

    - name: Frontend unit tests
      run: bun run test:unit -- --coverage

    - name: Frontend type check
      run: bun run check

Merge to Main: Full Suite

merge-checks:
  steps:
    - name: Full test suite
      run: pytest tests/ -v --cov=app --cov-report=xml

    - name: Diagnostic accuracy regression
      run: pytest tests/accuracy/ -v --tb=long
      # Fails if any previously passing ground truth scenario now fails

    - name: Confusion matrix generation
      run: python scripts/generate_confusion_matrix.py
      # Uploads matrix as build artifact

Deploy to Staging: E2E

staging-deploy:
  needs: merge-checks
  steps:
    - name: Deploy to staging
      run: ./scripts/deploy-staging.sh

    - name: Playwright E2E tests
      run: bun run test:e2e
      env:
        BASE_URL: https://staging.rapidai.example.com

    - name: Performance smoke test
      run: python scripts/load_test.py --target staging --concurrent 20 --duration 60

Deploy to Production: Smoke + Canary

production-deploy:
  needs: staging-deploy
  steps:
    - name: Deploy canary (10% traffic)
      run: ./scripts/deploy-canary.sh

    - name: Smoke tests against canary
      run: |
        curl -f https://rapidai.example.com/rapid-ai/v1/health
        python scripts/smoke_test.py --target production

    - name: Monitor error rate (5 minutes)
      run: python scripts/check_error_rate.py --threshold 0.01 --window 300

    - name: Promote to full deployment
      run: ./scripts/promote-canary.sh

Test Result Tracking

All test runs produce artifacts stored for 90 days:

Coverage reports (XML and HTML)
Confusion matrices (PNG and CSV)
Performance benchmark history (JSON, graphed in CI dashboard)
Playwright screenshots and traces (for failed E2E tests)
Accuracy baseline snapshots (updated on each merge to main)

A test that was green yesterday and red today is a regression, not a flaky test. Investigate immediately.

Standards Alignment

Standard	Relevance to This Chapter
ISO 13374 — Condition monitoring and diagnostics of machines	The testing strategy validates that each ISO 13374 processing level (L2 through L6) produces correct outputs for known diagnostic scenarios, with regression tests covering the full processing chain.
ISO 17359 — General guidelines for condition monitoring	The diagnostic accuracy benchmarking (known-good scenarios with expected outputs) implements ISO 17359’s requirement for validated, repeatable condition monitoring system performance.

Changelog

Version	Date	Author	Changes
2.1.0	2026-03-17	Rick D	Added standards alignment, living doc metadata, changelog
2.0.0	2026-03-17	Rick D	Enriched with production codebase content
1.0.0	2026-03-17	Rick D	Initial chapter creation