Chapter 24 — Testing Strategy
RAPID AI is a safety-adjacent system. When it tells a reliability engineer that a pump bearing has outer race spalling with 85% confidence, that engineer may shut down a process line, order a $40,000 replacement, and mobilize a maintenance crew. A false positive wastes money. A false negative risks catastrophic failure. The testing strategy exists to make the diagnostic engines trustworthy, not through hope but through systematic verification at every layer.
24.1 Testing Philosophy
Coverage Target: 80% Minimum
Every module in the codebase must maintain 80% or higher line coverage. The diagnostic engines (Modules A through E) must exceed 90%. Coverage is measured by pytest-cov for Python and vitest for SvelteKit, and enforced in CI: a PR that drops coverage below the threshold cannot merge.
TDD Cycle: Red, Green, Refactor
All new diagnostic logic follows test-driven development:
- Red: Write a test that exercises the expected behavior. Run it. Watch it fail. This confirms the test is actually testing something.
- Green: Write the minimum code to make the test pass. No cleverness, no optimization — just make it work.
- Refactor: Clean up the implementation without changing behavior. The tests guard against regressions during refactoring.
Rules Are Data; Test the Engine
RAPID AI contains hundreds of rules: 16 guard rules (DG001-DG019), 119 component failure mode rules, 50 signal feature rules (SF001-SF051), 22 block scoring rules (BSR001-BSR022), 10 health stage rules (HSR001-HSR010), and 10 priority window rules (PWR001-PWR010). Testing each rule individually would create a brittle, unmaintainable test suite that breaks every time Dibyendu adds a new failure mode.
Instead, the strategy is: test the engine, not individual rules. If the rule evaluator correctly handles every operator type (>, >=, <, <=, ==, !=, BETWEEN, IN, LIKE), correctly parses parenthetical grouping, and correctly treats semicolons as AND conjunctions, then any well-formed rule will evaluate as written. Engine correctness guarantees that every rule is applied as written; it cannot show that a rule's thresholds are diagnostically right.
The exception is regression tests for specific known-good diagnostic scenarios (Section 24.5), which validate the combined behavior of rules plus engine.
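To make the "test the engine" idea concrete, here is a minimal sketch of an evaluator for a small subset of the grammar: binary comparisons against numeric thresholds, joined by semicolons. The name `evaluate_simple`, the regular expression, and the choice to raise on malformed clauses are illustrative assumptions, not the production evaluator; BETWEEN, IN, LIKE, and parenthetical grouping are deliberately left out.

```python
import operator
import re

# Hypothetical operator table for this sketch only.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators come first in the alternation so ">=" is not read as ">".
_CLAUSE = re.compile(r"^\s*(\w+)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def evaluate_simple(expr: str, data: dict) -> bool:
    """Evaluate 'name op number' clauses joined by semicolons (treated as AND)."""
    for clause in expr.split(";"):
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"Malformed clause: {clause!r}")
        name, op, threshold = match.groups()
        value = data.get(name)
        if value is None:  # missing or null sensor: the clause is False
            return False
        if not _OPS[op](value, float(threshold)):
            return False
    return True
```

Because rules are data, a single parametrized test over the operator table can exercise every operator without naming any specific rule.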
24.2 Unit Testing
Unit tests verify individual functions and modules in isolation, with no database, no network, and no filesystem access. All external dependencies are mocked or stubbed.
Framework
- Python: pytest with fixtures, parametrize for data-driven tests, unittest.mock for stubs
- TypeScript: vitest with SvelteKit testing utilities
Rule Evaluator
The rule evaluator parses conditional expressions from the IMS and evaluates them against sensor data. The tests cover comparison operators, semicolon conjunction, parenthetical grouping, and missing or null inputs:
```python
# Comparison operators (one test per operator; two are shown here)
def test_greater_than():
    assert evaluate("rms > 4.5", {"rms": 5.0}) is True
    assert evaluate("rms > 4.5", {"rms": 4.5}) is False


def test_between():
    assert evaluate("kurtosis BETWEEN 3.0 AND 6.0", {"kurtosis": 4.5}) is True
    assert evaluate("kurtosis BETWEEN 3.0 AND 6.0", {"kurtosis": 7.0}) is False


# Test semicolons as AND
def test_semicolon_conjunction():
    expr = "rms > 4.5; crest_factor > 3.0; kurtosis > 5.0"
    data = {"rms": 5.0, "crest_factor": 3.5, "kurtosis": 6.0}
    assert evaluate(expr, data) is True


# Test parenthetical grouping
def test_parentheses():
    expr = "(rms > 4.5 AND peak > 10.0) OR kurtosis > 8.0"
    data = {"rms": 3.0, "peak": 5.0, "kurtosis": 9.0}
    assert evaluate(expr, data) is True  # kurtosis branch


# Edge cases
def test_missing_sensor_returns_false():
    assert evaluate("rms > 4.5", {}) is False


def test_null_value_returns_false():
    assert evaluate("rms > 4.5", {"rms": None}) is False
```
SEDL Entropy Engine
The Spectral Entropy + Distribution Lag engine computes three entropy components (SE, TE, DE) and classifies the signal stability state. The tests below pin the boundary cases.
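As a reference point for the entropy tests, a normalized spectral entropy can be sketched as follows. This is a minimal sketch, assuming SE is the Shannon entropy of the spectrum divided by its maximum, `log(N)`, and that inputs are non-negative magnitudes; the function name `spectral_entropy` is a stand-in, and the real `compute_spectral_entropy` may differ.

```python
import numpy as np


def spectral_entropy(magnitudes) -> float:
    """Shannon entropy of a non-negative spectrum, normalized to [0, 1]."""
    magnitudes = np.asarray(magnitudes, dtype=float)
    total = magnitudes.sum()
    if magnitudes.size < 2 or total <= 0.0:
        return 0.0  # flatline or empty input: avoid 0/0 and log(0)
    p = magnitudes / total
    nonzero = p[p > 0.0]  # 0 * log(0) is taken as 0, so empty bins are skipped
    entropy = -np.sum(nonzero * np.log(nonzero))
    # Clamp to absorb floating-point rounding at both ends of the range.
    return float(min(1.0, max(0.0, entropy / np.log(magnitudes.size))))
```

Under these assumptions a uniform spectrum gives 1.0, a single-bin spike gives 0.0, and an all-zero input takes the guard branch instead of evaluating `log(0)`.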
```python
import numpy as np
import pytest


@pytest.fixture
def uniform_signal():
    """All FFT bins equal -- maximum entropy."""
    return np.ones(512) / 512


@pytest.fixture
def spike_signal():
    """Single dominant bin -- minimum entropy."""
    signal = np.zeros(512)
    signal[42] = 1.0
    return signal


def test_spectral_entropy_uniform(uniform_signal):
    se = compute_spectral_entropy(uniform_signal)
    assert se == pytest.approx(1.0, abs=0.01)


def test_spectral_entropy_spike(spike_signal):
    se = compute_spectral_entropy(spike_signal)
    assert se == pytest.approx(0.0, abs=0.01)


def test_all_zeros_returns_zero_entropy():
    se = compute_spectral_entropy(np.zeros(512))
    assert se == 0.0  # guard against log(0)


def test_sedl_state_classification():
    # Low SE + low TE + low DE = stable
    assert classify_sedl_state(0.1, 0.1, 0.05) == "stable"
    # High SE + high TE = chaotic
    assert classify_sedl_state(0.9, 0.8, 0.3) == "chaotic"
```
Fusion Engine (Module C)
Tests cover system profile loading, block score computation (BSR001-BSR022 3-pass evaluation), SSI weighted aggregation, and override logic.
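The weighted aggregation step can be illustrated with a plain weighted mean. This is a sketch under stated assumptions: the name `weighted_ssi`, its signature, and the validation rules are hypothetical, and the production `compute_ssi` takes a profile name and applies the 3-pass evaluation and overrides that this sketch does not model.

```python
import math


def weighted_ssi(block_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of block scores; weights must cover every block and sum to 1."""
    if set(block_scores) != set(weights):
        raise ValueError("block scores and weights must use the same block IDs")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-3):
        raise ValueError("profile weights must sum to 1.0")
    return sum(block_scores[block] * weights[block] for block in block_scores)
```

With 22 equally weighted blocks, one failed block yields 21/22, roughly 0.955. A plain weighted mean therefore only drops below 0.50 on a single failure if that block carries more than half the total weight, so the critical-failure behavior tested in this section has to come from the profile weights or from logic beyond the mean.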
```python
import pytest


def test_block_score_all_pass():
    """When all evidence blocks pass, SSI should be high."""
    block_results = {f"BSR{i:03d}": 1.0 for i in range(1, 23)}
    ssi = compute_ssi(block_results, profile="centrifugal_pump")
    assert ssi > 0.85


def test_block_score_critical_failure():
    """A single critical block failure should dominate SSI."""
    block_results = {f"BSR{i:03d}": 1.0 for i in range(1, 23)}
    block_results["BSR001"] = 0.0  # bearing health block fails
    ssi = compute_ssi(block_results, profile="centrifugal_pump")
    assert ssi < 0.50


def test_profile_weights_sum_to_one():
    """System profile weights must sum to 1.0 for valid SSI."""
    profile = load_profile("centrifugal_pump")
    total = sum(profile.weights.values())
    assert total == pytest.approx(1.0, abs=0.001)


def test_override_logic():
    """Manual override should replace computed SSI."""
    result = compute_ssi_with_override(
        block_results={...},
        profile="centrifugal_pump",
        override={"ssi": 0.3, "reason": "Known defect, awaiting parts"},
    )
    assert result.ssi == 0.3
    assert result.override_active is True
```
RUL Engine (Module D)
Tests cover all three RUL models (F001 linear degradation, F002 exponential, F003 Weibull), boundary conditions, and zero-slope guards.
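For the linear case, the arithmetic behind the expected range is simple enough to sketch. The function below is a minimal illustration, assuming F001 amounts to a least-squares line extrapolated to the threshold; the name `linear_rul` and the slope tolerance are assumptions, not the production `estimate_rul`.

```python
import math

import numpy as np


def linear_rul(trend: list[float], threshold: float) -> float:
    """Intervals until a least-squares line through `trend` reaches `threshold`."""
    if len(trend) < 2:
        raise ValueError("need at least two points to fit a slope")
    x = np.arange(len(trend), dtype=float)
    slope, intercept = np.polyfit(x, np.asarray(trend, dtype=float), 1)
    if slope <= 1e-12:  # flat or improving trend: nothing to extrapolate
        return math.inf
    fitted_now = slope * x[-1] + intercept
    return float(max(0.0, (threshold - fitted_now) / slope))
```

For the trend `[0.5, 0.6, 0.7, 0.8, 0.9]` the slope is 0.1 per interval and the latest value is 0.9, so reaching a threshold of 2.0 takes (2.0 - 0.9) / 0.1 = 11 intervals, which sits inside the 8 to 15 window asserted in the F001 test. A flat trend returns infinity rather than dividing by zero.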
```python
def test_rul_f001_linear_degradation():
    trend = [0.5, 0.6, 0.7, 0.8, 0.9]  # linear increase
    rul = estimate_rul(trend, model="F001", threshold=2.0)
    assert 8 < rul < 15  # reasonable remaining intervals


def test_rul_zero_slope_guard():
    """Flat trend should not predict imminent failure."""
    trend = [1.0, 1.0, 1.0, 1.0, 1.0]
    rul = estimate_rul(trend, model="F001", threshold=2.0)
    assert rul == float("inf")  # or sentinel value for "no degradation"


def test_rul_f003_weibull_shape_parameter():
    """Weibull beta > 1 = wear-out, beta < 1 = infant mortality."""
    result = fit_weibull([100, 200, 150, 180, 220])
    assert result.beta > 1.0  # wear-out pattern
```
CDE and Causal Engines
```python
def test_cde_trigger_evaluation():
    """CDE trigger fires when SSI drops below threshold."""
    assert cde_should_trigger(ssi=0.35, threshold=0.40) is True
    assert cde_should_trigger(ssi=0.55, threshold=0.40) is False


def test_causal_keyword_matching():
    """Causal engine ranks causes by keyword relevance."""
    causes = rank_causes(
        failure_mode="bearing outer race spalling",
        evidence=["high_frequency_vibration", "elevated_temperature"],
    )
    assert causes[0].cause == "lubrication failure"  # most relevant
    assert causes[0].confidence > 0.6


def test_confidence_label_mapping():
    assert confidence_label(0.95) == "very_high"
    assert confidence_label(0.75) == "high"
    assert confidence_label(0.55) == "moderate"
    assert confidence_label(0.35) == "low"
    assert confidence_label(0.15) == "very_low"
```
24.3 Integration Testing
Integration tests verify that modules work together with real infrastructure: a test PostgreSQL database, actual SQL queries, and the full diagnostic pipeline.
Full Pipeline Test
The most critical integration test runs the complete diagnostic chain: sensor data in, diagnostic result out.
```python
import pytest

# The async tests assume an async pytest plugin is configured
# (for example pytest-asyncio in auto mode).


@pytest.fixture(scope="session")
def test_db():
    """Spin up a test database, apply migrations, load seed data."""
    # Use testcontainers or a dedicated test database
    engine = create_test_engine()
    apply_migrations(engine)
    load_seed_data(engine, "platform/data/00_run_all_seed_inserts.sql")
    yield engine
    engine.dispose()


async def test_full_diagnostic_pipeline(test_db):
    """End-to-end: sensor payload -> all 5 modules -> diagnostic result."""
    payload = load_fixture("ims_scenario_001_bearing_spalling.json")

    result = await run_full_pipeline(
        asset_id="AST-PUMP-001",
        sensor_data=payload,
        db=test_db,
    )

    assert result.module_a.quality_score > 0.8
    assert len(result.module_b.failure_modes) > 0
    assert 0.0 <= result.module_c.ssi <= 1.0
    assert result.module_d.health_stage in ["normal", "watch", "alert", "critical"]
    assert len(result.module_e.maintenance_tasks) > 0
```
IMS Ground Truth Validation
Use the 100 IMS rows as integration test fixtures. For each row, construct a sensor payload that should trigger that specific failure mode, run it through the pipeline, and verify the output matches the expected diagnostic chain:
```python
@pytest.mark.parametrize("ims_id", [f"IMS{i:03d}" for i in range(1, 101)])
async def test_ims_scenario(test_db, ims_id):
    scenario = load_ims_scenario(ims_id)
    result = await run_full_pipeline(
        asset_id=scenario.asset_id,
        sensor_data=scenario.sensor_payload,
        db=test_db,
    )
    assert scenario.expected_failure_mode in [
        fm.failure_mode for fm in result.module_b.failure_modes
    ]
```
API Contract Testing
```python
async def test_diagnose_endpoint_contract(client):
    response = await client.post(
        "/rapid-ai/v1/assets/AST-PUMP-001/diagnose",
        json={"sensor_data": {...}, "timestamp": "2026-03-17T14:00:00Z"},
    )
    assert response.status_code == 200
    body = response.json()
    assert "ssi_score" in body
    assert "failure_modes" in body
    assert "health_stage" in body
    assert "confidence" in body


async def test_diagnose_invalid_asset_returns_404(client):
    response = await client.post(
        "/rapid-ai/v1/assets/NONEXISTENT/diagnose",
        json={"sensor_data": {...}},
    )
    assert response.status_code == 404


async def test_diagnose_missing_payload_returns_422(client):
    response = await client.post(
        "/rapid-ai/v1/assets/AST-PUMP-001/diagnose",
        json={},
    )
    assert response.status_code == 422
```
Auth Integration
```python
async def test_unauthenticated_request_returns_401(client):
    response = await client.get("/rapid-ai/v1/assets/AST-PUMP-001")
    assert response.status_code == 401


async def test_operator_cannot_access_admin_routes(operator_client):
    response = await operator_client.post("/rapid-ai/v1/admin/schema/reload")
    assert response.status_code == 403


async def test_admin_can_reload_schema(admin_client):
    response = await admin_client.post("/rapid-ai/v1/admin/schema/reload")
    assert response.status_code == 200
```
24.4 End-to-End Testing
E2E tests drive the actual browser through the SvelteKit frontend, verifying that the full stack works from the user’s perspective.
Framework
Playwright with TypeScript. Tests run against a fully deployed stack (backend + frontend + database with seed data).
Critical User Flows
```typescript
import { test, expect } from "@playwright/test";

// Flow 1: Dashboard to Diagnostic Result
test("operator views asset health and runs diagnostic", async ({ page }) => {
  await page.goto("/login");
  await page.fill('[name="email"]', "operator@plant.com");
  await page.fill('[name="password"]', "test-password");
  await page.click('button[type="submit"]');

  // Dashboard loads
  await expect(page.locator("h1")).toContainText("Plant Overview");

  // Navigate to asset
  await page.click('text=AST-PUMP-001');
  await expect(page.locator(".health-card")).toBeVisible();

  // Run diagnostic
  await page.click('button:text("Run Diagnostic")');
  await expect(page.locator(".diagnostic-result")).toBeVisible({ timeout: 10000 });
  await expect(page.locator(".ssi-score")).not.toBeEmpty();
  await expect(page.locator(".failure-modes")).toBeVisible();
});

// Flow 2: RCM Workbook Update
test("engineer updates RCM task and verifies RPN recalculation", async ({ page }) => {
  await loginAsEngineer(page);
  await page.goto("/rcm/AST-PUMP-001");

  // Edit a maintenance task
  await page.click('tr:has-text("bearing inspection") >> button:text("Edit")');
  await page.fill('[name="severity"]', "8");
  await page.click('button:text("Save")');

  // Verify RPN recalculated
  const rpn = page.locator('tr:has-text("bearing inspection") >> .rpn-value');
  await expect(rpn).not.toHaveText("0");
});

// Flow 3: Copilot Interaction
test("operator asks copilot a diagnostic question", async ({ page }) => {
  await loginAsOperator(page);
  await page.goto("/copilot");

  await page.fill('[name="question"]', "What causes high vibration in centrifugal pumps?");
  await page.click('button:text("Ask")');

  // Response streams in
  const response = page.locator(".copilot-response");
  await expect(response).toBeVisible({ timeout: 15000 });
  // Response should cite rules
  await expect(response).toContainText(/[A-Z]{2,3}\d{3}/); // rule ID pattern
});
```
Visual Regression Testing
Capture screenshots of key pages and compare against baselines:
```typescript
test("dashboard visual regression", async ({ page }) => {
  await loginAsOperator(page);
  await page.goto("/dashboard");
  await page.waitForLoadState("networkidle");
  await expect(page).toHaveScreenshot("dashboard.png", { maxDiffPixels: 100 });
});
```
Performance Testing
```typescript
import { test, expect } from "@playwright/test";

test("diagnostic endpoint handles concurrent load", async () => {
  const payload = loadFixture("ims_scenario_001.json");
  const requests = Array.from({ length: 50 }, () =>
    fetch("http://localhost:8000/rapid-ai/v1/assets/AST-PUMP-001/diagnose", {
      method: "POST",
      headers: { "Content-Type": "application/json", "Authorization": "Bearer ..." },
      body: JSON.stringify(payload),
    })
  );

  const responses = await Promise.all(requests);
  const allOk = responses.every((r) => r.status === 200);
  expect(allOk, "All concurrent requests should succeed").toBe(true);

  // p95 latency check; relies on the server reporting x-response-time in milliseconds
  const durations = responses
    .map((r) => Number.parseInt(r.headers.get("x-response-time") ?? "0", 10))
    .sort((a, b) => a - b);
  const p95 = durations[Math.floor(durations.length * 0.95)];
  expect(p95, `p95 latency ${p95}ms exceeds 500ms target`).toBeLessThan(500);
});
```
24.5 Diagnostic Accuracy Testing
This is the most important testing layer. It validates that the system’s diagnostic outputs are correct: not just structurally valid, but factually right.
Ground Truth Dataset
Curate a ground truth dataset from Dibyendu’s 4,000+ validated diagnostic cases. Each entry contains:
- Asset type and configuration
- Raw sensor readings (or synthetic equivalents)
- Known failure mode (confirmed by inspection, teardown, or field validation)
- Expected confidence range
- Expected health stage
Start with the 100 IMS scenarios as the initial ground truth set. Expand to 500+ as field data accumulates.
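One way to hold an entry with the fields listed above is a small dataclass. This is a sketch only: the class name, field names, and the `accepts` helper are illustrative assumptions rather than the project's schema, and the health stage values are the four that the full pipeline test in Section 24.3 accepts.

```python
from dataclasses import dataclass

# Stage names taken from the full pipeline test in Section 24.3.
HEALTH_STAGES = ("normal", "watch", "alert", "critical")


@dataclass(frozen=True)
class GroundTruthCase:
    """Hypothetical container for one validated diagnostic case."""

    case_id: str
    asset_type: str
    configuration: dict
    sensor_readings: dict
    failure_mode: str  # confirmed by inspection, teardown, or field validation
    confidence_range: tuple[float, float]
    health_stage: str
    synthetic: bool = False  # True when readings are synthetic equivalents

    def __post_init__(self) -> None:
        low, high = self.confidence_range
        if not 0.0 <= low <= high <= 1.0:
            raise ValueError(f"invalid confidence range: {self.confidence_range}")
        if self.health_stage not in HEALTH_STAGES:
            raise ValueError(f"unknown health stage: {self.health_stage!r}")

    def accepts(self, failure_mode: str, confidence: float, health_stage: str) -> bool:
        """True when a diagnostic result matches this case's expectations."""
        low, high = self.confidence_range
        return (
            failure_mode == self.failure_mode
            and low <= confidence <= high
            and health_stage == self.health_stage
        )
```

Validating the ranges at construction time means a malformed ground truth entry fails when the dataset loads, not midway through an accuracy run.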
Accuracy Metrics
```python
def test_diagnostic_accuracy():
    results = run_all_ground_truth_scenarios()

    # Per-failure-mode metrics
    for failure_mode in unique_failure_modes:
        tp = count_true_positives(results, failure_mode)
        fp = count_false_positives(results, failure_mode)
        fn = count_false_negatives(results, failure_mode)

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

        assert precision > 0.80, f"{failure_mode}: precision {precision:.2f} < 0.80"
        assert recall > 0.75, f"{failure_mode}: recall {recall:.2f} < 0.75"
```
Confusion Matrix
Generate a confusion matrix showing which failure modes get misdiagnosed as which. Common confusions to watch for:
- Bearing outer race vs. inner race (similar vibration signatures at different frequencies)
- Misalignment vs. unbalance (both produce 1x RPM vibration)
- Cavitation vs. recirculation (both cause broadband noise in pumps)
The confusion matrix is regenerated on every CI run and stored as a test artifact. Any new off-diagonal entry above 5% triggers a review.
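The 5% check can be sketched with the standard library alone. The helper names and the row-normalized definition of "rate" (share of a true failure mode's cases that received a given prediction) are assumptions for illustration; the actual `generate_confusion_matrix.py` script may compute and store this differently.

```python
from collections import Counter, defaultdict


def confusion_rates(pairs):
    """Map each true label to {predicted label: share of that true label's cases}."""
    counts = defaultdict(Counter)
    for true_label, predicted_label in pairs:
        counts[true_label][predicted_label] += 1
    return {
        true_label: {pred: n / sum(row.values()) for pred, n in row.items()}
        for true_label, row in counts.items()
    }


def off_diagonal_above(rates, limit=0.05):
    """List (true, predicted, rate) for misdiagnoses whose rate exceeds `limit`."""
    return sorted(
        (true_label, pred, rate)
        for true_label, row in rates.items()
        for pred, rate in row.items()
        if pred != true_label and rate > limit
    )
```

Comparing this list against the one stored from the previous run isolates the new entries that need review.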
Confidence Calibration
```python
def test_confidence_calibration():
    """Verify that stated confidence matches actual accuracy."""
    results = run_all_ground_truth_scenarios()

    # Bucket results by confidence range
    buckets = {
        "0.8-1.0": [], "0.6-0.8": [], "0.4-0.6": [], "0.2-0.4": [], "0.0-0.2": []
    }
    for r in results:
        bucket = get_bucket(r.confidence)
        buckets[bucket].append(r.is_correct)

    # Results in the 0.8-1.0 bucket should be correct 70-95% of the time
    high_conf = buckets["0.8-1.0"]
    if len(high_conf) > 10:  # need sufficient sample size
        actual_accuracy = sum(high_conf) / len(high_conf)
        assert 0.70 < actual_accuracy < 0.95, (
            f"High confidence calibration off: {actual_accuracy:.2f}"
        )
```
Regression Guard
Every time a new rule is added or an engine is modified, the full ground truth suite runs. The test fails if any previously passing scenario now fails:
```python
import pytest


def test_no_accuracy_regression():
    current = run_accuracy_suite()
    baseline = load_baseline("accuracy_baseline.json")

    for scenario_id in baseline:
        if baseline[scenario_id].passed and not current[scenario_id].passed:
            pytest.fail(
                f"Regression: {scenario_id} was correct, now incorrect. "
                f"Expected: {baseline[scenario_id].failure_mode}, "
                f"Got: {current[scenario_id].failure_mode}"
            )
```
24.6 Test Data Management
Synthetic Sensor Data Generators
Each asset type has a data generator that produces realistic sensor payloads:
```python
def generate_pump_sensor_data(
    failure_mode: str | None = None,
    severity: float = 0.5,
    noise_level: float = 0.1,
    sample_rate: int = 25600,
    duration_seconds: float = 1.0,
) -> SensorPayload:
    """Generate synthetic vibration data for a centrifugal pump.

    If failure_mode is specified, inject the corresponding spectral
    signature (e.g., BPFO harmonics for outer race spalling).
    """
    ...
```
Generators exist for: centrifugal pump, electric motor, gearbox, compressor, fan, turbine, conveyor, agitator, cooling tower fan, and all other 19 asset types in the IMS.
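Signature injection of the kind the docstring describes can be sketched as a shaft tone plus decaying harmonics of a fault frequency plus Gaussian noise. Everything here is an illustrative assumption: the function name `synth_vibration`, the 30 Hz shaft speed, the three-harmonic model, and passing the fault frequency directly instead of deriving it from a failure mode and bearing geometry.

```python
import numpy as np


def synth_vibration(
    fault_hz=None,
    severity=0.5,
    noise_level=0.1,
    sample_rate=25600,
    duration_seconds=1.0,
    shaft_hz=30.0,
    seed=0,
):
    """Return a 1-D signal: shaft tone + optional fault harmonics + Gaussian noise."""
    rng = np.random.default_rng(seed)  # seeded so fixtures are reproducible
    n = int(sample_rate * duration_seconds)
    t = np.arange(n) / sample_rate
    signal = np.sin(2 * np.pi * shaft_hz * t)
    if fault_hz is not None:
        for harmonic in (1, 2, 3):  # first three harmonics, decaying amplitude
            signal += (severity / harmonic) * np.sin(2 * np.pi * harmonic * fault_hz * t)
    return signal + noise_level * rng.standard_normal(n)
```

A fixed seed keeps the generated fixtures stable between runs, which matters when the same payloads feed the regression baseline.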
Known-Good Scenario Fixtures
Stored as JSON files in tests/fixtures/scenarios/:
```text
tests/fixtures/scenarios/
├── IMS001_pump_bearing_spalling.json
├── IMS002_motor_stator_winding.json
├── ...
├── IMS100_conveyor_belt_tracking.json
├── edge_case_missing_sensors.json
├── edge_case_extreme_values.json
├── edge_case_contradictory_evidence.json
└── manifest.json    # maps scenario -> expected result
```
Each fixture includes the sensor payload, expected failure modes, expected SSI range, expected health stage, and expected confidence range.
Edge Cases
| Case | Description | Expected Behavior |
|---|---|---|
| Missing sensors | Only 2 of 5 expected sensor channels present | Reduced confidence, partial diagnosis |
| Extreme values | RMS = 999.9 mm/s (sensor malfunction) | Guard rule DG003 blocks or penalizes |
| All zeros | Flatline signal on all channels | Guard rule DG001 blocks with “flatline” |
| Contradictory evidence | Temperature normal but vibration critical | Both reported, confidence reduced, CDE flags contradiction |
| All equal FFT bins | Uniform spectrum (white noise) | Maximum spectral entropy, classified as “chaotic” |
Performance Benchmarks
Track diagnostic time as a function of rule count:
```python
import pytest


@pytest.mark.benchmark
@pytest.mark.parametrize("rule_count", [50, 100, 200, 500])
def test_diagnostic_performance_scaling(benchmark, rule_count):
    """Diagnostic time should scale linearly with rule count."""
    rules = generate_synthetic_rules(rule_count)
    # pytest-benchmark allows one benchmark() call per test, so each
    # rule count runs as its own parametrized case instead of a loop.
    result = benchmark(run_diagnostic, rules=rules, sensor_data=fixture)
    assert result.duration_ms < rule_count * 0.5  # 0.5ms per rule max
```
24.7 CI/CD Integration
Pipeline Stages
```text
commit → pre-commit → PR checks → merge → staging → production
```
Pre-Commit Hooks
Run on every commit, locally and in CI:
```yaml
repos:
  - repo: local
    hooks:
      - id: ruff-lint
        name: Python linting (ruff)
        entry: ruff check --fix
        language: system  # required for local hooks
        types: [python]
      - id: ruff-format
        name: Python formatting (ruff)
        entry: ruff format --check
        language: system
        types: [python]
      - id: pyright
        name: Python type checking (pyright)
        entry: pyright
        language: system
        types: [python]
      - id: svelte-check
        name: SvelteKit type checking
        entry: bun run check
        language: system
        types_or: [ts, svelte]
```
PR Checks (Required to Merge)
```yaml
# GitHub Actions
pr-checks:
  # Service containers are declared per job, not per step.
  services:
    postgres:
      image: pgvector/pgvector:pg17
  steps:
    - name: Python unit tests
      run: pytest tests/unit/ -v --cov=app --cov-report=xml --cov-fail-under=80

    - name: Python integration tests
      run: pytest tests/integration/ -v --timeout=120

    - name: Frontend unit tests
      run: bun run test:unit -- --coverage

    - name: Frontend type check
      run: bun run check
```
Merge to Main: Full Suite
```yaml
merge-checks:
  steps:
    - name: Full test suite
      run: pytest tests/ -v --cov=app --cov-report=xml

    - name: Diagnostic accuracy regression
      run: pytest tests/accuracy/ -v --tb=long
      # Fails if any previously passing ground truth scenario now fails

    - name: Confusion matrix generation
      run: python scripts/generate_confusion_matrix.py
      # Uploads matrix as build artifact
```
Deploy to Staging: E2E
```yaml
staging-deploy:
  needs: merge-checks
  steps:
    - name: Deploy to staging
      run: ./scripts/deploy-staging.sh

    - name: Playwright E2E tests
      run: bun run test:e2e
      env:
        BASE_URL: https://staging.rapidai.example.com

    - name: Performance smoke test
      run: python scripts/load_test.py --target staging --concurrent 20 --duration 60
```
Deploy to Production: Smoke + Canary
```yaml
production-deploy:
  needs: staging-deploy
  steps:
    - name: Deploy canary (10% traffic)
      run: ./scripts/deploy-canary.sh

    - name: Smoke tests against canary
      run: |
        curl -f https://rapidai.example.com/rapid-ai/v1/health
        python scripts/smoke_test.py --target production

    - name: Monitor error rate (5 minutes)
      run: python scripts/check_error_rate.py --threshold 0.01 --window 300

    - name: Promote to full deployment
      run: ./scripts/promote-canary.sh
```
Test Result Tracking
All test runs produce artifacts stored for 90 days:
- Coverage reports (XML and HTML)
- Confusion matrices (PNG and CSV)
- Performance benchmark history (JSON, graphed in CI dashboard)
- Playwright screenshots and traces (for failed E2E tests)
- Accuracy baseline snapshots (updated on each merge to main)
A test that was green yesterday and red today is a regression, not a flaky test. Investigate immediately.
Standards Alignment
| Standard | Relevance to This Chapter |
|---|---|
| ISO 13374 — Condition monitoring and diagnostics of machines | The testing strategy validates that each ISO 13374 processing level (L2 through L6) produces correct outputs for known diagnostic scenarios, with regression tests covering the full processing chain. |
| ISO 17359 — General guidelines for condition monitoring | The diagnostic accuracy benchmarking (known-good scenarios with expected outputs) implements ISO 17359’s requirement for validated, repeatable condition monitoring system performance. |
Changelog
| Version | Date | Author | Changes |
|---|---|---|---|
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |