Chapter 14 — System Architecture

System Overview

RAPID AI is built as a monolith-first, API-first engineering intelligence platform. The decision to avoid a microservices architecture at this stage is deliberate: microservices impose network boundaries, deployment orchestration, and distributed debugging overhead that a young product cannot afford. The domain logic of industrial diagnostics is complex enough without scattering it across a dozen independently failing processes.

Instead, RAPID AI adopts a modular monolith. The codebase is organized into concentric rings with strict dependency direction — inner rings know nothing about outer rings — so that services can be extracted later if scale demands it, but today they communicate through function calls, not network hops.

The platform consists of three runtime services:

FastAPI Engine (Python 3.13, port 8000). This is the physics pipeline, the diagnostic intelligence, the swarm coordination, the knowledge search, and every computation that touches RAPID AI’s intellectual property. It runs on uvicorn, uses Pydantic for schema validation, and connects to PostgreSQL via SQLAlchemy Core for read/write operations. The engine owns Ring 0 (pure physics — no I/O) and Ring 1 (orchestration), with Ring 2 (infrastructure) handling all external boundaries: database, AI providers, and agent tools.

SvelteKit Explorer (Bun 1.3, port 5173). This is the consolidated frontend application — previously five separate apps (Explorer, Invent, Monitor, Suite, Workshop), now unified into a single SvelteKit application with route groups for isolation. It serves as both the user-facing dashboard and the backend-for-frontend (BFF) layer that proxies API calls, manages authentication via better-auth, and owns the PostgreSQL schema through Drizzle ORM.

PostgreSQL 17 with pgvector (port 5432). A single database instance serves as the single source of truth (SSOT) for asset hierarchy, analysis results, vector embeddings, sensor readings, work orders, failure history, alerts, and audit logs. pgvector enables semantic similarity search over 768-dimensional embeddings without requiring a separate vector database.

Every capability in the system is exposed as a REST endpoint. The API prefix is /rapid-ai/v1/. This API-first discipline ensures that the SvelteKit frontend, future mobile clients, CMMS integrations, and third-party consumers all access the same capabilities through the same contracts.

Schema ownership follows a strict rule: Drizzle owns the DDL. The SvelteKit application manages all table creation, migrations, and schema changes through drizzle-kit. Python mirrors the schema through SQLAlchemy Table objects that are used for type-safe queries but never create or alter tables. This prevents two ORMs from fighting over the same database.

Architecture Diagram

+------------------------------------------------------------------+
|                        CLIENT LAYER                               |
|                                                                   |
|   +------------------+    +------------------+    +----------+    |
|   | SvelteKit        |    | API Consumers    |    | CMMS     |    |
|   | Explorer         |    | (REST clients,   |    | Export   |    |
|   | Dashboard        |    |  mobile, CLI)    |    | Targets  |    |
|   +--------+---------+    +--------+---------+    +----+-----+    |
|            |                       |                    |         |
+------------|-----------------------|--------------------|---------+
             |                       |                    |
             v                       v                    v
+------------------------------------------------------------------+
|                   API GATEWAY LAYER (FastAPI)                      |
|                   Prefix: /rapid-ai/v1/                           |
|                                                                   |
|   Auth (better-auth)  |  Rate Limiting  |  API Versioning         |
|   Feature Flags       |  CORS           |  Request Validation     |
|                                                                   |
|   +-----------+  +-----------+  +-----------+  +-------------+   |
|   | /evaluate |  | /diagnose |  | /assets/* |  | /swarm/*    |   |
|   | /moduleA  |  | /copilot  |  | /health   |  | /knowledge  |   |
|   | /moduleB  |  |           |  |           |  | /stream     |   |
|   | /moduleC  |  |           |  |           |  |             |   |
|   | /moduleD  |  |           |  |           |  |             |   |
|   | /moduleE  |  |           |  |           |  |             |   |
|   +-----------+  +-----------+  +-----------+  +-------------+   |
+------------------------------------------------------------------+
             |
             v
+------------------------------------------------------------------+
|                      SERVICE LAYER                                |
|                                                                   |
|  +-------------------------------------------------------------+ |
|  |                  AnalysisService (Pipeline)                  | |
|  |                                                              | |
|  |  mA: analyze_signal()                                        | |
|  |    Guard rules (DG001-DG019) --> Signal features             | |
|  |    (RMS, peak, crest, kurtosis) --> ISO zone classification  | |
|  |                    |                                         | |
|  |                    v                                         | |
|  |  mB: detect_faults()                                        | |
|  |    Component rules (119 rules, 12 types)                    | |
|  |    Signal feature rules (SF001-SF051)                       | |
|  |    Trend analysis (Step/Chaotic/Accel/Drift/Stable)         | |
|  |    SEDL entropy (SE + TE + DE)                              | |
|  |                    |                                         | |
|  |                    v                                         | |
|  |  mC: fuse_ssi()                                             | |
|  |    BSR001-BSR022 block scoring (3-pass)                     | |
|  |    SSI weighted aggregation + system profile weights (YAML) | |
|  |                    |                                         | |
|  |                    v                                         | |
|  |  mD: predict_prognostics()                                  | |
|  |    HSR001-HSR010 health stage rules                         | |
|  |    RUL estimation (Weibull-adjusted log-slope)              | |
|  |                    |                                         | |
|  |                    v                                         | |
|  |  mE: plan_maintenance()                                     | |
|  |    PWR001-PWR010 priority windows (2-pass)                  | |
|  |    Priority = 100 x (0.45S + 0.25C + 0.20K + 0.10U)        | |
|  |    Action ranking with boost                                | |
|  +-------------------------------------------------------------+ |
|                                                                   |
|  +-----------------+  +-----------------+  +------------------+  |
|  | Swarm Engine    |  | AI Diagnostician|  | Knowledge (RAG)  |  |
|  | EngineSentinel  |  | Agent loop      |  | Rule embeddings  |  |
|  | TaskPlanner     |  | 5 tools, max 5  |  | Analysis vectors |  |
|  | 4 Workers:      |  | iterations      |  | Cosine search    |  |
|  |  Analyst        |  | FRETTLSM-aware  |  |                  |  |
|  |  Diagnostician  |  |                 |  |                  |  |
|  |  BriefWriter    |  |                 |  |                  |  |
|  |  Knowledge      |  |                 |  |                  |  |
|  +-----------------+  +-----------------+  +------------------+  |
|                                                                   |
|  +------------------+  +-------------------+                     |
|  | Confidence       |  | Provider Registry |                     |
|  | Scoring Module   |  | Gemini > OpenAI > |                     |
|  | 0.0-1.0 range    |  | Cloudflare >      |                     |
|  | canonical std    |  | Template fallback |                     |
|  +------------------+  +-------------------+                     |
+------------------------------------------------------------------+
             |
             v
+------------------------------------------------------------------+
|                        DATA LAYER                                 |
|                                                                   |
|  +-------------------------+  +------------------------------+   |
|  | PostgreSQL 17 + pgvector|  | Object Storage (future)      |   |
|  |                         |  | Waveforms, attachments,      |   |
|  | Asset hierarchy:        |  | inspection images            |   |
|  |   organization          |  +------------------------------+   |
|  |   locations             |                                     |
|  |   areas                 |  +------------------------------+   |
|  |   equipment             |  | Rule Store (YAML)            |   |
|  |   sub_assemblies        |  |   actions.yaml               |   |
|  |   measurement_points    |  |   profiles.yaml              |   |
|  |   spares                |  |   block_scores.yaml          |   |
|  |                         |  |   fusion.yaml                |   |
|  | Intelligence:           |  +------------------------------+   |
|  |   analysis_results      |                                     |
|  |   analysis_vectors      |  +------------------------------+   |
|  |   rule_vectors          |  | Seed Data                    |   |
|  |   sensor_readings       |  |   IMS failure modes (119)    |   |
|  |                         |  |   Signal feature rules (50)  |   |
|  | Maintenance:            |  |   Guard rules (16)           |   |
|  |   work_orders           |  |   System profiles            |   |
|  |   pm_schedules          |  +------------------------------+   |
|  |   failure_history       |                                     |
|  |                         |                                     |
|  | Observability:          |                                     |
|  |   alerts                |                                     |
|  |   audit_log             |                                     |
|  +-------------------------+                                     |
+------------------------------------------------------------------+
             |
             v
+------------------------------------------------------------------+
|                   EXTERNAL INTEGRATIONS                           |
|                                                                   |
|  +-------------------+  +------------------+  +---------------+  |
|  | Sensor Data       |  | CMMS Export      |  | Notification  |  |
|  | Ingestion         |  | Work orders,     |  | Services      |  |
|  | POST /evaluate    |  | spare requests,  |  | SSE stream,   |  |
|  | POST /data/sensor |  | maintenance logs |  | alerts,       |  |
|  |                   |  |                  |  | webhooks      |  |
|  +-------------------+  +------------------+  +---------------+  |
|                                                                   |
|  +-------------------+  +------------------+                     |
|  | AI Providers      |  | Embedding API    |                     |
|  | Gemini, OpenAI,   |  | 768-dim vectors  |                     |
|  | Cloudflare, local |  | for RAG search   |                     |
|  +-------------------+  +------------------+                     |
+------------------------------------------------------------------+

Database Architecture

The database is a single PostgreSQL 17 instance with the pgvector extension. This simplicity is deliberate: one database to back up, one connection string to configure, one set of migrations to manage.

Schema Design

Drizzle ORM defines the schema in TypeScript. The tables are organized into four functional groups:

Asset Hierarchy. The plant structure follows a strict parent-child chain: organization (owned by better-auth) > locations > areas > equipment > sub_assemblies, which then branch into measurement_points and spares. Each level carries its own metadata: locations have geographic coordinates, equipment has machine type and criticality scores, measurement points have signal type and direction, spares have stock quantities and lead times.

Analysis and Intelligence. analysis_results stores the complete pipeline output for every evaluation: the original request, the full response (as JSONB), the computed SSI, and the severity level. sensor_readings holds raw sensor data keyed by measurement point and timestamp, with values stored as JSONB to accommodate different sensor types without schema changes.

Maintenance. work_orders track maintenance tasks through their lifecycle (status, priority, assignee, parts used). pm_schedules define preventive maintenance intervals with both time-based and condition-based triggers. failure_history stores FMEA records with severity, occurrence, and detection ratings.

Observability. alerts capture system alerts with type, severity, and acknowledgment status. audit_log records every significant user action with the entity type, action taken, and a JSONB details payload for forensic analysis.

pgvector for Semantic Search

Two vector tables store 768-dimensional embeddings:

rule_vectors — Synchronized on startup when the RAG_RULES feature flag is enabled. Each of the 119 component fault rules is embedded as a document containing: the diagnosis, the underlying physics, the FRETTLSM root cause category, the severity progression from early to late stage, and the recommended corrective actions. Content-hash deduplication ensures zero redundant API calls after the initial sync.

analysis_vectors — Embedded as a background task after every successful POST /evaluate. The document includes: asset identity, health stage, SSI score, top faults detected, top recommended actions, and remaining useful life estimate.
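
The content-hash deduplication described for rule_vectors can be sketched as follows. This is a minimal illustration, not the engine's actual sync code: the function names, the choice of SHA-256, and the rule IDs are assumptions made for the example.

```python
import hashlib


def content_hash(document: str) -> str:
    """Return a stable SHA-256 hex digest of the document text."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()


def rules_needing_embedding(documents: dict, stored_hashes: dict) -> list:
    """Return rule IDs whose text is new or changed since the last sync.

    documents:     rule ID -> current document text
    stored_hashes: rule ID -> hash saved alongside the stored vector
    """
    return [
        rule_id
        for rule_id, text in documents.items()
        if stored_hashes.get(rule_id) != content_hash(text)
    ]


# Hypothetical rule documents; IDs and text are placeholders.
docs = {"RULE_A": "outer race defect", "RULE_B": "shaft misalignment"}

# First startup: nothing stored yet, so every rule must be embedded.
first_sync = rules_needing_embedding(docs, stored_hashes={})

# Later startup: hashes match, so no embedding API calls are needed.
stored = {rule_id: content_hash(text) for rule_id, text in docs.items()}
second_sync = rules_needing_embedding(docs, stored)
```

Only rules whose text actually changed are sent to the embedding API, which is what keeps repeated startups free of redundant calls.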

Search is performed via GET /rapid-ai/v1/knowledge/search?q=...&top_k=5, which runs cosine similarity queries across both tables, merges the results, and returns them sorted by relevance. This creates a knowledge growth loop: every analysis enriches the search corpus, making future diagnoses more informed.
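
To make the merge step concrete, here is a small in-memory sketch of cosine search across two corpora. In the real system the similarity is computed inside PostgreSQL by pgvector; the function names and the tiny 3-dimensional vectors below are assumptions chosen to keep the example readable (the platform uses 768 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, rule_rows, analysis_rows, top_k=5):
    """Score both corpora, merge them, and return the top_k by similarity.

    Each row is a (doc_id, vector) pair. Results are
    (similarity, source, doc_id) tuples, best match first.
    """
    scored = [
        (cosine_similarity(query_vec, vec), source, doc_id)
        for source, rows in (("rule", rule_rows), ("analysis", analysis_rows))
        for doc_id, vec in rows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical rows standing in for rule_vectors and analysis_vectors.
rules = [("r1", [1.0, 0.0, 0.0]), ("r2", [0.0, 1.0, 0.0])]
analyses = [("a1", [1.0, 1.0, 0.0])]
hits = search([1.0, 0.0, 0.0], rules, analyses, top_k=2)
```

Because past analyses and rules are ranked in one list, a highly similar historical case can outrank a loosely related rule.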

Key Tables (MVP Set)

Table	PK Type	Key Columns	Purpose
`organization`	text	name, slug	Top-level tenant
`locations`	uuid	organisation_id, country, geo_lat/lon	Plant sites
`areas`	uuid	location_id, area_type	Functional areas
`equipment`	uuid	area_id, machine_type, criticality	Machines
`sub_assemblies`	uuid	equipment_id, component_type, position	Components
`measurement_points`	uuid	sub_assembly_id, direction, signal_type	Sensor locations
`spares`	uuid	sub_assembly_id, part_number, quantity_on_hand	Inventory
`analysis_results`	uuid	request/response JSONB, ssi, severity	Pipeline outputs
`sensor_readings`	uuid	measurement_point_id, timestamp, values JSONB	Raw data
`work_orders`	uuid	status, priority, assignee, parts_used	Maintenance tasks
`failure_history`	uuid	failure_mode, cause, severity/occurrence/detection	FMEA records

Indexing Strategy

High-priority indexes target the most common query patterns:

sensor_readings(measurement_point_id, timestamp) — Time-series lookups by sensor, ordered by time. This is the most critical index in the system; every trend query, every baseline comparison, every dashboard chart hits it.
analysis_results(equipment_id, created_at) — Retrieving analysis history for a specific machine.
equipment(area_id, status) — Listing active machines in a plant area.
work_orders(equipment_id, status) — Finding open work orders for a machine.
rule_vectors and analysis_vectors use IVFFlat indexes on their vector columns for approximate nearest-neighbor search.

For time-series data at scale, PostgreSQL’s native table partitioning (by month on timestamp) is the planned approach before considering a dedicated time-series database. The principle: exhaust PostgreSQL before adding infrastructure.

API Architecture

Endpoint Design

All endpoints are prefixed /rapid-ai/v1/ and follow pragmatic REST conventions. The API surface is organized into five functional groups:

Physics Pipeline — The core diagnostic capability.

Method	Path	Purpose
`POST`	`/evaluate`	Full 5-module pipeline (A > B > C > D > E)
`POST`	`/moduleA`	Signal analysis only
`POST`	`/moduleB`	Fault detection only
`POST`	`/moduleC`	SSI fusion only
`POST`	`/moduleD`	Prognostics only
`POST`	`/moduleE`	Maintenance planning only
`POST`	`/generate-signal`	Synthetic signal generation
`POST`	`/diagnose`	AI diagnostician (feature-gated)

Asset Hierarchy — CRUD operations on plant structure.

Method	Path	Purpose
`GET`	`/assets/organisations`	List all organizations
`GET`	`/assets/locations/{org_id}`	Locations in an organization
`GET`	`/assets/areas/{location_id}`	Areas in a location
`GET`	`/assets/equipment/{area_id}`	Equipment in an area
`GET`	`/assets/equipment/{id}/context`	Equipment with full context
`GET`	`/assets/sub-assemblies/{equipment_id}`	Sub-assemblies
`GET`	`/assets/measurement-points/{sub_id}`	Measurement points
`GET/PATCH`	`/assets/spares/*`	Spare parts and stock management

Swarm (Agent Coordination) — Multi-agent diagnostic intelligence.

Method	Path	Purpose
`POST`	`/swarm/task`	Submit a pre-built AgentTask
`POST`	`/swarm/dispatch`	Intent-based dispatch (plans then executes)
`GET`	`/swarm/capabilities`	List all worker capabilities
`GET`	`/swarm/status`	Worker count, active tasks
`GET`	`/swarm/stream`	SSE stream (heartbeats + events)

Knowledge (RAG) — Semantic search across rules and past analyses.

Method	Path	Purpose
`GET`	`/knowledge/search?q=...&top_k=5`	Cosine similarity search

Health — System status and feature flag reporting.

Method	Path	Purpose
`GET`	`/health`	Status, version, feature flags, providers

Request/Response Contracts

All request bodies are validated by Pydantic models in the engine’s domain layer. A typical pipeline evaluation request:

{
  "equipment_id": "uuid-of-equipment",
  "system_type": "centrifugal_pump",
  "operating_speed_hz": 29.5,
  "signal": {
    "waveform": [0.12, -0.08, 0.15, ...],
    "sampling_rate_hz": 8192,
    "signal_type": "acceleration"
  }
}

The response returns the full pipeline output: Module A features, Module B fault detections, Module C SSI score, Module D health stage and RUL, Module E maintenance actions — all in a single FullAnalysisResponse object. After the response is sent, a background task embeds the result into pgvector for future RAG search.

Versioning Strategy

The current version prefix /rapid-ai/v1/ is baked into all routes. When breaking changes are introduced, a /v2/ prefix will run alongside /v1/ with a deprecation window. Non-breaking additions (new fields, new endpoints) are added to the current version without incrementing.

Service Layer Architecture

The service layer is the intellectual core of RAPID AI. It lives in engine/domain/ (Ring 0, pure physics, no I/O) and engine/services/ (Ring 1, orchestration). Here is what each engine does and how.

Rule Evaluator: Safe Recursive-Descent Parser

RAPID AI contains 119 component fault rules, 50 signal feature rules, and 16 data guard rules. These rules are not executed via eval() or any dynamic code interpretation. They are expressed as structured condition/action pairs — either as Python dataclasses or YAML configuration — and evaluated by a safe recursive-descent evaluator that walks condition trees without ever executing arbitrary code.

The guard rules (DG001-DG019) run first, checking data quality before any diagnostic computation begins. If the signal fails data validation (missing samples, unreasonable amplitudes, or a sampling rate below the Nyquist rate for the frequencies of interest), the pipeline stops early with an explanation, not a crash.
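
A minimal sketch of evaluating a structured condition tree without eval() is shown below. It assumes conditions arrive as nested dictionaries (for example, loaded from YAML) with "all", "any", "not", and leaf comparison nodes; that node vocabulary, the feature names, and the thresholds are illustrative assumptions, not the engine's actual rule schema.

```python
import operator

# Whitelisted comparison operators; anything else is rejected.
_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


def evaluate(condition: dict, features: dict) -> bool:
    """Recursively evaluate a condition tree against extracted features."""
    if "all" in condition:
        return all(evaluate(child, features) for child in condition["all"])
    if "any" in condition:
        return any(evaluate(child, features) for child in condition["any"])
    if "not" in condition:
        return not evaluate(condition["not"], features)
    # Leaf node: {"feature": name, "op": symbol, "value": threshold}
    compare = _COMPARATORS.get(condition["op"])
    if compare is None:
        raise ValueError("unsupported operator: %r" % (condition["op"],))
    if condition["feature"] not in features:
        return False  # a missing feature never satisfies a leaf
    return compare(features[condition["feature"]], condition["value"])


# Hypothetical rule: high kurtosis AND (high crest factor OR high RMS).
rule = {
    "all": [
        {"feature": "kurtosis", "op": ">", "value": 4.0},
        {"any": [
            {"feature": "crest_factor", "op": ">=", "value": 5.0},
            {"feature": "rms", "op": ">", "value": 7.1},
        ]},
    ]
}
matched = evaluate(rule, {"kurtosis": 5.2, "crest_factor": 3.0, "rms": 8.0})
```

The only operations that can ever run are the five whitelisted comparisons and the three boolean combinators, so rule data cannot trigger arbitrary code.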

Diagnostic Engine: IMS-Driven, Data-Not-Code

The Integrated Master Schema (IMS) is a database of 119 failure mode signatures across 12 component types (antifriction bearings, gears, journal bearings, motors, pumps, fans, compressors, couplings, shafts, seals, structures, and belts). Each entry maps a failure mechanism to its expected spectral signature, FRETTLSM root cause category, severity progression, and recommended corrective actions.

The diagnostic engine does not hardcode diagnostic logic. It reads IMS entries and compares them against extracted signal features. When sensor features match an IMS pattern — for example, elevated BPFO harmonics with increasing kurtosis — the engine scores the match and ranks it by confidence. The intelligence is in the data, not the code. Adding a new failure mode means adding an IMS row, not writing new software.
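
The data-not-code idea can be illustrated with a short sketch in which failure modes are plain records and one generic function ranks them. The ImsEntry fields, the indicator names, and the fraction-of-indicators scoring are assumptions for illustration; the real IMS schema and confidence scoring are not shown in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImsEntry:
    failure_mode: str
    component_type: str
    indicators: tuple  # names of signal indicators expected for this mode


def rank_matches(entries, observed):
    """Score each entry by the fraction of its indicators that were observed.

    Returns (score, failure_mode) pairs, best match first. Entries with
    no matching indicator are omitted.
    """
    ranked = []
    for entry in entries:
        hits = sum(1 for name in entry.indicators if name in observed)
        if hits:
            ranked.append((hits / len(entry.indicators), entry.failure_mode))
    return sorted(ranked, reverse=True)


# Hypothetical IMS rows; adding a failure mode means adding a row here.
IMS = [
    ImsEntry("outer race defect", "antifriction_bearing",
             ("bpfo_harmonics", "rising_kurtosis")),
    ImsEntry("unbalance", "shaft", ("high_1x", "stable_phase")),
]

ranking = rank_matches(IMS, {"bpfo_harmonics", "rising_kurtosis", "high_1x"})
```

Note that rank_matches contains no knowledge of bearings or shafts; all diagnostic content lives in the IMS list.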

SEDL Engine: Shannon Entropy Across Three Domains

The Spectral-Entropy-Directional-Lens (SEDL) applies information-theoretic analysis to detect stability degradation before threshold-based alarms fire. It computes entropy across three domains:

Spectral Entropy (SE): Measures disorder in the frequency spectrum. A healthy machine concentrates energy in narrow bands; a degrading machine spreads energy across wider bands, increasing spectral entropy.
Temporal Entropy (TE): Measures disorder in the time-domain signal. Erratic amplitude variation indicates instability.
Directional Entropy (DE): Measures disorder across measurement axes. When vibration energy shifts unpredictably between horizontal, vertical, and axial directions, it signals mechanical instability.

SEDL runs as part of Module B (fault detection) and feeds into Module C (fusion).
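
As a rough sketch of the underlying calculation, the function below computes a normalized Shannon entropy over any non-negative distribution. Applying the same function to spectral band energies (SE), an amplitude histogram (TE), and per-axis energies (DE) is an assumption made for illustration; the engine's exact binning and normalization are not specified here.

```python
import math


def normalized_entropy(weights):
    """Shannon entropy of a non-negative distribution, scaled to 0..1.

    0 means all energy sits in one bin; 1 means it is spread evenly.
    """
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(len(weights))


# Hypothetical band energies: a healthy machine concentrates energy ...
healthy_spectrum = [0.0, 0.0, 10.0, 0.0]
# ... while a degrading machine spreads it across bands.
degraded_spectrum = [1.0, 1.0, 1.0, 1.0]

se_healthy = normalized_entropy(healthy_spectrum)
se_degraded = normalized_entropy(degraded_spectrum)

# Hypothetical per-axis energies (horizontal, vertical, axial).
de_dominant_axis = normalized_entropy([9.0, 0.5, 0.5])
de_spread_axes = normalized_entropy([3.0, 3.0, 4.0])
```

Scaling by the logarithm of the bin count keeps values comparable between distributions with different numbers of bins.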

Fusion Engine: Profile-Weighted Block Aggregation

Module C (fuse_ssi) computes the System Stability Index through a three-pass block scoring process using rules BSR001 through BSR022. Each block represents a diagnostic dimension — spectral health, trend stability, entropy state, component-specific risk. Blocks are scored individually, then aggregated using weights defined in profiles.yaml that vary by system type. A centrifugal pump has different weight profiles than a gearbox.

The SSI is not a simple average. It is a weighted aggregation that emphasizes the most diagnostically significant dimensions for each machine type.

Downstream, Module E computes the final priority score with the following formula (see Chapter 26 for the authoritative version):

P = 100 x (0.45 x Severity + 0.25 x Criticality + 0.20 x Kurtosis_factor + 0.10 x Urgency)
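
A worked example of the priority formula, assuming each of the four inputs is already normalized to the range 0 to 1 (the chapter does not state the input ranges, so that normalization and the sample values are assumptions):

```python
def priority_score(severity, criticality, kurtosis_factor, urgency):
    """P = 100 x (0.45 S + 0.25 C + 0.20 K + 0.10 U).

    Inputs are assumed to lie in 0..1, which bounds P to 0..100.
    """
    inputs = {
        "severity": severity,
        "criticality": criticality,
        "kurtosis_factor": kurtosis_factor,
        "urgency": urgency,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError("%s must be within 0..1, got %r" % (name, value))
    return 100.0 * (
        0.45 * severity
        + 0.25 * criticality
        + 0.20 * kurtosis_factor
        + 0.10 * urgency
    )


# 100 x (0.36 + 0.15 + 0.10 + 0.04) = 65
example = priority_score(0.8, 0.6, 0.5, 0.4)
```

Because the four weights sum to 1.0, severity alone can contribute at most 45 points, so a maximum-severity finding on a low-criticality asset still ranks below a moderately severe finding on a critical one.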

RUL Engine: Three Weibull/Log-Slope Models

Module D estimates Remaining Useful Life through three parallel approaches:

Weibull-adjusted decay — Maps the current health stage (from HSR001-HSR010 health stage rules) to a Weibull survival curve. The rul_multiplier from the matched health stage rule adjusts the baseline estimate.
Log-slope trend projection — Extrapolates the rate of degradation from trend analysis to estimate when a threshold will be crossed.
Envelope estimation — Provides optimistic and pessimistic bounds based on confidence intervals.

The three estimates are combined with weights that favor the trend-based projection when sufficient historical data exists.
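
The log-slope projection can be sketched as a least-squares fit on the logarithm of the trended amplitude. This is an illustrative reading of the approach, assuming exponential growth and strictly positive amplitudes; the function name and the sample trend are hypothetical, and the Weibull adjustment and envelope bounds are not modeled.

```python
import math


def log_slope_rul(times, amplitudes, threshold):
    """Estimate time remaining until `threshold` is crossed.

    Fits ln(amplitude) = intercept + slope * t by ordinary least squares
    and extrapolates from the last sample time. Amplitudes and threshold
    must be positive. Returns None when fewer than two samples exist or
    the trend is flat or improving (no crossing is projected).
    """
    n = len(times)
    if n < 2:
        return None
    logs = [math.log(a) for a in amplitudes]
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    if sxx == 0:
        return None
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_t
    t_cross = (math.log(threshold) - intercept) / slope
    return max(0.0, t_cross - times[-1])


# Hypothetical trend: amplitude doubles every time unit (1, 2, 4, 8).
# Two more doublings reach 32, so the projected RUL is 2 time units.
rul = log_slope_rul([0, 1, 2, 3], [1.0, 2.0, 4.0, 8.0], threshold=32.0)
```

The result is in the same time unit as the `times` argument. A trend with few samples gives an unstable slope, which is one reason to weight this estimate more heavily only when sufficient history exists.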

CDE Engine: Two-Phase Trigger-Then-Evaluate

The Contradiction-Driven Engineering engine identifies situations where improving one engineering parameter necessarily worsens another — the kind of trade-off that separates root cause treatment from symptom management.

Phase 1 (Trigger): The engine scans the diagnostic output for contradiction triggers — cases where a recommended action would create a new problem. For example, increasing bearing clearance to reduce thermal preload worsens the machine’s tolerance to unbalance forces.

Phase 2 (Evaluate): Once a contradiction is triggered, the engine retrieves the relevant contradiction template, evaluates the trade-off severity, and generates resolution alternatives ranked by engineering impact.

Causal Engine: FRETTLSM Keyword-Matching

The causal engine classifies root causes using the FRETTLSM taxonomy developed by Dibyendu De:

Letter	Category	Examples
F	Force	Preload, misalignment, unbalance
R	Reactive	Resonance, structural looseness
E	Environment	Contamination, corrosion
T	Temperature	Thermal expansion, overheating
T	Tribology	Surface fatigue, pitting
L	Lubrication	Starvation, wrong viscosity
S	Surface	Spalling, brinelling, erosion
M	Man	Wrong clearance, installation error

Each of the 119 component fault rules is classified by FRETTLSM category. The causal engine matches diagnostic findings to their FRETTLSM classification and builds cause chains: trigger (the initiator) > accelerator (what made it worse) > retarder (what could have slowed it).
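
A minimal keyword-matching sketch follows. The keyword lists are taken from the Examples column of the table above and are far from complete; the function name and the substring-matching strategy are assumptions for illustration, not the causal engine's actual implementation.

```python
# Keywords drawn from the Examples column of the FRETTLSM table.
FRETTLSM_KEYWORDS = {
    "Force": ("preload", "misalignment", "unbalance"),
    "Reactive": ("resonance", "looseness"),
    "Environment": ("contamination", "corrosion"),
    "Temperature": ("thermal expansion", "overheating"),
    "Tribology": ("surface fatigue", "pitting"),
    "Lubrication": ("starvation", "viscosity"),
    "Surface": ("spalling", "brinelling", "erosion"),
    "Man": ("clearance", "installation error"),
}


def classify(finding: str) -> list:
    """Return every FRETTLSM category whose keywords appear in the text.

    Matching is case-insensitive substring search; categories are
    returned in FRETTLSM order.
    """
    text = finding.lower()
    return [
        category
        for category, keywords in FRETTLSM_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


# Hypothetical diagnostic finding text.
categories = classify("Lubricant starvation with early pitting on outer race")
```

A finding can legitimately match several categories, which is what allows a cause chain to span, for example, a Lubrication trigger and a Tribology consequence.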

Confidence Module: Canonical Scoring Standard

All confidence scores across the system follow a single canonical standard:

Format: Numeric float, range 0.0 to 1.0
Field name: confidence_score everywhere

Qualitative labels are mapped to fixed numeric values:

Label	Value	Application
High	0.85	Direct sensor confirmation
Medium-high	0.75	Strong single-source evidence
Medium	0.60	Ambiguous/single source
Low	0.40	Weak signal/noisy data
Insufficient	0.00	Contradictory evidence

Decision thresholds: RCM activation requires >= 0.70. Safety escalation requires >= 0.80. Dashboard display suppresses anything below 0.50 to prevent noise from reaching operators.
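
The label mapping and decision thresholds above can be captured in a few constants. The values come directly from this section; the names of the constants, the function, and the returned keys are assumptions for illustration.

```python
# Fixed numeric values for qualitative labels (from the table above).
CONFIDENCE_BY_LABEL = {
    "high": 0.85,
    "medium-high": 0.75,
    "medium": 0.60,
    "low": 0.40,
    "insufficient": 0.00,
}

# Decision thresholds stated in this section.
DASHBOARD_DISPLAY_MIN = 0.50
RCM_ACTIVATION_MIN = 0.70
SAFETY_ESCALATION_MIN = 0.80


def decision_gates(confidence_score: float) -> dict:
    """Report which downstream actions a confidence score permits."""
    if not 0.0 <= confidence_score <= 1.0:
        raise ValueError("confidence_score must be within 0.0..1.0")
    return {
        "show_on_dashboard": confidence_score >= DASHBOARD_DISPLAY_MIN,
        "activate_rcm": confidence_score >= RCM_ACTIVATION_MIN,
        "escalate_safety": confidence_score >= SAFETY_ESCALATION_MIN,
    }


gates = decision_gates(CONFIDENCE_BY_LABEL["medium-high"])
```

With these values, a Medium-high finding is shown and can activate RCM but cannot trigger safety escalation, while a Low finding is suppressed from the dashboard entirely.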

Data Flow Architecture

A complete diagnostic request traces the following path through the system:

1. Sensor data arrives via POST /rapid-ai/v1/evaluate
   Request includes: equipment_id, system_type, operating_speed_hz,
   signal waveform, sampling_rate_hz, signal_type.

2. _resolve_asset_context() enriches the request with machine_type
   and criticality from the database.

3. Module A: analyze_signal()
   - Guard rules (DG001-DG019) validate data quality.
   - Signal features extracted: RMS, peak, crest factor, kurtosis.
   - ISO zone classification applied.
   - If data fails guard rules, pipeline stops with explanation.

4. Module B: detect_faults()  [sequential evaluation]
   - 119 component fault rules matched against features.
   - 50 signal feature rules (SF001-SF051) evaluated.
   - Trend analysis classifies pattern: Step/Chaotic/Accel/Drift/Stable.
   - SEDL entropy computed: SE + TE + DE.
   - Matched faults ranked by confidence score.

5. Module C: fuse_ssi()
   - BSR001-BSR022 block scoring in 3 passes.
   - System profile weights loaded from profiles.yaml.
   - SSI (System Stability Index) computed as weighted aggregation.

6. Module D: predict_prognostics()
   - HSR001-HSR010 health stage rules determine current stage.
   - RUL estimated via Weibull-adjusted log-slope model.
   - rul_multiplier from matched health stage adjusts baseline.

7. Module E: plan_maintenance()
   - PWR001-PWR010 priority windows evaluated in 2 passes.
   - Priority score computed: P = 100 x (0.45S + 0.25C + 0.20K + 0.10U).
   - Actions ranked with boost factors for urgent conditions.

8. FullAnalysisResponse returned to client.

9. Background task: embed_analysis() generates a 768-dim vector
   and stores it in analysis_vectors table (pgvector).

10. If SSI exceeds alert thresholds, an alert record is created.
    SSE stream pushes event to connected Explorer clients.

11. Explorer dashboard fetches /assets/equipment/{id}/context for
    updated analysis history and displays trend charts.
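
The signal features extracted in step 3 can be sketched with the standard library alone. This assumes the textbook definitions (crest factor as peak over RMS, and non-excess Pearson kurtosis, for which a Gaussian signal gives 3); the engine's exact definitions, windowing, and units are not stated in this chapter.

```python
import math


def signal_features(waveform):
    """Compute RMS, peak, crest factor, and kurtosis of a waveform."""
    n = len(waveform)
    if n == 0:
        raise ValueError("waveform must contain at least one sample")
    mean = sum(waveform) / n
    rms = math.sqrt(sum(x * x for x in waveform) / n)
    peak = max(abs(x) for x in waveform)
    variance = sum((x - mean) ** 2 for x in waveform) / n
    fourth_moment = sum((x - mean) ** 4 for x in waveform) / n
    return {
        "rms": rms,
        "peak": peak,
        # Crest factor: how far the largest excursion stands above RMS.
        "crest_factor": peak / rms if rms > 0 else 0.0,
        # Pearson kurtosis: fourth central moment over squared variance.
        "kurtosis": fourth_moment / variance ** 2 if variance > 0 else 0.0,
    }


# A unit square wave has RMS = peak = 1, so crest factor and kurtosis are 1.
features = signal_features([1.0, -1.0, 1.0, -1.0])
```

Impulsive faults such as early bearing damage raise the peak and the fourth moment faster than the RMS, which is why crest factor and kurtosis react before overall vibration level does.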

SSOT Chain

The type contract chain keeps schemas consistent from the Python engine to the browser:

Python Pydantic (Ring 0)  -->  REST API (JSON)  -->  @rapidai/contracts (TypeScript)  -->  App imports
        |                                                        |
  settings.py (thresholds)                              hierarchy.ts, analysis.ts
  hierarchy.py (schemas)                                base.ts, rules.ts
  swarm.py (protocol)                                   diagnostics-types.ts

Pydantic models are the source of truth. TypeScript contracts in @rapidai/contracts mirror the Pydantic shapes. The Explorer app imports only from the contracts package, never from the engine directly. This ensures that if a schema changes in Python, the TypeScript build breaks immediately — before a runtime error can reach a user.

Scaling Considerations

RAPID AI is designed to run on a single machine today and scale horizontally when demand requires it.

Stateless Services

Both the FastAPI engine and SvelteKit Explorer are stateless. No request depends on server-local state. Session data lives in PostgreSQL (via better-auth), analysis results live in PostgreSQL, and configuration lives in YAML files baked into the container image. This means multiple engine instances can run behind a load balancer with no session affinity required.

Database Scaling

Read replicas: Dashboard queries (trend charts, asset lists, analysis history) are read-heavy and can be directed to a PostgreSQL read replica.
Connection pooling: SQLAlchemy async with asyncpg uses pool_size=10 and max_overflow=20. For higher concurrency, PgBouncer can be added as a connection multiplexer.
Partitioning: sensor_readings and analysis_results are candidates for time-based partitioning when row counts reach the tens of millions.
Graceful degradation: If DATABASE_URL is not set, the engine runs in pure-compute mode — the pipeline works, persistence does not. This allows the physics engine to function independently during database maintenance.
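
The graceful-degradation rule can be sketched as a guard around persistence. Only the DATABASE_URL variable name comes from the text; the function names, the injectable `env` mapping, and the `save` callback are hypothetical, introduced so the example runs without a database.

```python
import os


def persistence_enabled(env=None) -> bool:
    """True when a non-empty DATABASE_URL is configured."""
    env = os.environ if env is None else env
    return bool(env.get("DATABASE_URL", "").strip())


def store_result(result: dict, env=None, save=None) -> bool:
    """Persist `result` when a database is configured.

    Returns False in pure-compute mode so callers can still return the
    computed pipeline output to the client without raising.
    """
    if not persistence_enabled(env):
        return False  # pure-compute mode: skip persistence quietly
    if save is not None:
        save(result)
    return True


# Hypothetical usage with an in-memory stand-in for the database writer.
saved = []
skipped = store_result({"ssi": 0.4}, env={}, save=saved.append)
stored = store_result(
    {"ssi": 0.4},
    env={"DATABASE_URL": "postgresql://localhost:5432/rapid"},
    save=saved.append,
)
```

Keeping the check in one place means the pipeline modules themselves never need to know whether a database is attached.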

Caching

Rule store: The 119 component rules, 50 signal feature rules, and YAML configuration files (profiles, block scores, fusion weights, actions) are loaded into memory on startup. They change infrequently and are small enough to hold entirely in RAM.
IMS cache: Failure mode signatures are loaded once and cached for the lifetime of the process. A restart picks up any changes.
Vector embeddings: Rule vectors are synced on startup with content-hash deduplication. After the initial sync, subsequent startups require zero embedding API calls.

Asynchronous Processing

Trend analysis (Module B.2) and SEDL entropy (Module B.3) are evaluated sequentially within Module B. A previous ThreadPoolExecutor(max_workers=4) implementation was removed because the GIL prevents true parallelism for CPU-bound Python work.
Vector embedding after analysis is already a background task — the response returns to the client before the embedding is complete.
Swarm tasks: The swarm engine dispatches agent tasks asynchronously. Long-running diagnostic reasoning (up to 5 LLM iterations) runs without blocking the HTTP response thread.
SSE streaming: Server-sent events provide real-time updates to the dashboard without polling. The Explorer’s /api/stream route proxies the engine’s SSE stream and merges it with local event channels (alerts, sensors, health, swarm).
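
For readers unfamiliar with the wire format, the sketch below builds server-sent event frames as plain strings. The framing rules (field lines, a blank line terminating each event, and comment lines starting with a colon) follow the SSE format; the function names, the "alert" event name, and the payload fields are assumptions for illustration.

```python
import json


def sse_frame(event: str, data: dict, event_id=None) -> str:
    """Format one server-sent event as a text frame.

    Each field goes on its own line and a blank line ends the event.
    """
    payload = json.dumps(data, separators=(",", ":"))
    lines = []
    if event_id is not None:
        lines.append("id: %s" % event_id)
    lines.append("event: %s" % event)
    lines.append("data: %s" % payload)
    return "\n".join(lines) + "\n\n"


def heartbeat() -> str:
    """A comment line keeps the connection alive without firing an event."""
    return ": heartbeat\n\n"


# Hypothetical alert event pushed to connected dashboard clients.
frame = sse_frame("alert", {"equipment_id": "pump-1", "ssi": 0.42}, event_id=7)
```

Since json.dumps never emits raw newlines, the payload always fits on a single data line, which keeps the frame valid.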

Security Architecture

Authentication

better-auth manages user authentication with admin and organization plugins. It creates and manages its own tables (user, session, account, verification, member, invitation). The Explorer’s auth.ts serves as the BFF bridge — the SvelteKit server validates sessions before proxying requests to the FastAPI engine.

Session tokens are signed with BETTER_AUTH_SECRET. In production, this must be a cryptographically random string, not the development default.

Authorization

Role-based access control is enforced at the SvelteKit BFF layer:

Role	Permissions
Viewer	Read dashboards, view analysis history
Engineer	Run analyses, create work orders, manage spares
Admin	Manage users, organizations, system configuration
Approver	Approve engineering rule changes (governance)

The FastAPI engine currently trusts requests from the SvelteKit server (internal network boundary). API key authentication for external consumers is planned for the multi-tenant phase.

Data Protection

TLS in transit: All external communication runs over HTTPS. Internal service-to-service communication (SvelteKit to FastAPI) uses plain HTTP within the Docker network; in production, TLS terminates at the load balancer.
Encryption at rest: Managed PostgreSQL providers (Supabase, Neon) encrypt storage by default. Self-hosted deployments should enable PostgreSQL’s native encryption or use encrypted volumes.
Secret management: API keys (GEMINI_API_KEY, OPENAI_API_KEY, CLOUDFLARE_API_TOKEN, BETTER_AUTH_SECRET) are injected via environment variables, never committed to source control.

Input Validation and Safety

Pydantic models: Every API input is validated by a Pydantic model before reaching the service layer. Invalid requests are rejected with structured error responses.
Parameterized SQL: All database queries use SQLAlchemy’s parameterized query builder. No string interpolation in SQL.
No eval(): Rule evaluation uses a safe recursive-descent parser. No user-supplied input is ever executed as code.
IP protection: Engineering rules are executed server-side only. The API returns diagnostic results and explanations, never the raw rules or their internal logic. Explainability summaries replace full logic disclosure.
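
To make the "no eval()" point concrete, here is a minimal recursive-descent evaluator for boolean threshold expressions. The grammar (comparisons joined by and, or, not, and parentheses) and the names evaluate and tokenize are assumptions made for illustration; the real rule syntax is not specified in this chapter.

```python
import operator
import re
from typing import Mapping

COMPARATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
               "<=": operator.le, "==": operator.eq, "!=": operator.ne}
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(>=|<=|==|!=|>|<|\(|\)))")


def tokenize(text: str) -> list:
    """Split a rule into (kind, value) tokens; reject anything unknown."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        number, name, op = match.groups()
        if number is not None:
            tokens.append(("num", float(number)))
        elif name in ("and", "or", "not"):
            tokens.append(("kw", name))
        elif name is not None:
            tokens.append(("name", name))
        else:
            tokens.append(("op", op))
        pos = match.end()
    return tokens


class _Parser:
    def __init__(self, tokens: list, values: Mapping[str, float]) -> None:
        self.tokens, self.values, self.i = tokens, values, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("end", None)

    def take(self):
        token = self.peek()
        self.i += 1
        return token

    def parse(self) -> bool:
        result = self.or_expr()
        if self.peek()[0] != "end":
            raise ValueError("unexpected trailing tokens")
        return result

    def or_expr(self) -> bool:          # or_expr := and_expr ("or" and_expr)*
        left = self.and_expr()
        while self.peek() == ("kw", "or"):
            self.take()
            right = self.and_expr()
            left = left or right
        return left

    def and_expr(self) -> bool:         # and_expr := unary ("and" unary)*
        left = self.unary()
        while self.peek() == ("kw", "and"):
            self.take()
            right = self.unary()
            left = left and right
        return left

    def unary(self) -> bool:            # unary := "not" unary | "(" or_expr ")" | comparison
        if self.peek() == ("kw", "not"):
            self.take()
            return not self.unary()
        if self.peek() == ("op", "("):
            self.take()
            inner = self.or_expr()
            if self.take() != ("op", ")"):
                raise ValueError("expected ')'")
            return inner
        return self.comparison()

    def comparison(self) -> bool:       # comparison := operand COMPARATOR operand
        left = self.operand()
        kind, op = self.take()
        if kind != "op" or op not in COMPARATORS:
            raise ValueError("expected a comparison operator")
        return COMPARATORS[op](left, self.operand())

    def operand(self) -> float:
        kind, value = self.take()
        if kind == "num":
            return value
        if kind == "name" and value in self.values:
            return float(self.values[value])
        raise ValueError(f"unknown operand: {value!r}")


def evaluate(rule: str, values: Mapping[str, float]) -> bool:
    """Evaluate a rule against named measurements without executing code."""
    return _Parser(tokenize(rule), values).parse()
```

Because the tokenizer accepts only numbers, identifiers, comparison operators, and parentheses, an input such as __import__('os') fails at tokenization with a ValueError instead of reaching any execution path.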

Deployment Topology

Development: Local Docker Compose

Three services defined in infra/docker-compose.yml:

services:
  db:      pgvector/pgvector:pg17        :5432
  engine:  Dockerfile.api (Python 3.13)  :8000  (depends: db)
  app:     Dockerfile.app (SvelteKit)    :5173  (depends: db + engine)

The Dockerfile.api uses a three-stage build on the python:3.13-slim base: install dependencies via uv, copy the application code, and run uvicorn as a non-root user. The Dockerfile.app builds on oven/bun:alpine.

Quick start for local development:

make dev-db       # Start PostgreSQL in Docker
make dev-api      # Start FastAPI engine (uvicorn at :8000)
make dev-web      # Start SvelteKit app (Bun at :5173)
# Or all at once:
make dev          # API + Explorer (DB must be running)

Database management:

make db-push      # Push Drizzle schema to DB
make db-generate  # Generate migration files
make db-migrate   # Run pending migrations
make db-studio    # Open Drizzle Studio for visual inspection

Production: Containerized Deployment

Option A: Simple container hosting — Docker Compose on a single VPS with nginx as reverse proxy and TLS terminator. Suitable for early customers and proof-of-concept deployments.

Option B: Container orchestration — Kubernetes or Docker Swarm for horizontal scaling. The stateless services scale horizontally; the database runs as a managed service (Supabase, Neon, or AWS RDS with pgvector).

Option C: Edge-compatible — For plants with restricted connectivity, the FastAPI engine can run locally with periodic sync to a central database. The system degrades gracefully: without DATABASE_URL, the physics pipeline still functions in pure-compute mode.
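
The graceful degradation in Option C can be sketched as a persistence step that is skipped when DATABASE_URL is absent. The function names, the RMS computation used as a stand-in for the physics pipeline, and the injected save callable are all hypothetical.

```python
import math
import os
from typing import Callable, Mapping, Optional, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square of a signal; a stand-in for the physics pipeline."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def analyze(
    samples: Sequence[float],
    env: Mapping[str, str] = os.environ,
    save: Optional[Callable[[dict], None]] = None,
) -> dict:
    """Compute a result; persist it only when a database is configured."""
    result = {"rms": rms(samples)}
    persisted = False
    if env.get("DATABASE_URL") and save is not None:
        save(result)  # hypothetical persistence hook
        persisted = True
    return {**result, "persisted": persisted}
```

With an empty environment the function still returns the computed value and reports persisted as False, which mirrors the pure-compute mode described above.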

Frontend Hosting

The SvelteKit Explorer can be deployed in three configurations:

Self-hosted: Node.js/Bun process behind a reverse proxy. Required when the BFF layer needs direct PostgreSQL access.
Vercel/Cloudflare Pages: SvelteKit’s adapter system supports edge deployment. The BFF functions run as serverless functions.
Static export: For dashboard-only deployments that consume the API without server-side rendering.

Health Checks

Service	Method	Interval
PostgreSQL	`pg_isready`	5s
Engine	`GET /health` (urllib, no external deps)	10s
Explorer	`curl http://localhost:5173/`	15s

The /health endpoint returns the engine version, all feature flag states, and available AI provider capabilities, giving operators a single endpoint to confirm that the engine is up and to see which capabilities are currently active.
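
A dependency-free probe in the spirit of the urllib-based engine check might look like the following. This is a sketch under assumptions: the function name and timeout are illustrative, and only the HTTP status is inspected because the response schema is not specified here.

```python
import urllib.request

# Bypass any proxy configured in the environment; a container-local
# health probe should talk to the service directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_health(url: str, timeout: float = 2.0) -> bool:
    """Return True when the endpoint answers with HTTP 200."""
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError both derive from OSError, so refused
        # connections, timeouts, and non-2xx statuses all land here.
        return False
```

A container health check would typically call this with the engine's /health URL and translate the boolean into an exit code.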

Environment Variables

Engine (Python):

Variable	Required	Purpose
`DATABASE_URL`	For persistence	PostgreSQL connection string
`GEMINI_API_KEY`	For AI features	Google Gemini API key
`OPENAI_API_KEY`	Fallback	OpenAI API key
`RAPID_AI_PROVIDER_CHAIN`	No (default: `gemini,openai,cloudflare,template`)	AI provider priority
`RAPID_FEATURE_*`	No (all default ON)	Feature flags (set `0` to disable)
`LOG_LEVEL`	No (default: `info`)	Structured logging level
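
The provider priority in RAPID_AI_PROVIDER_CHAIN suggests a try-in-order fallback. The sketch below assumes a comma-separated value and a registry of callables keyed by provider name; the function names and the registry shape are hypothetical, and real provider clients would replace the callables.

```python
from typing import Callable, Mapping

DEFAULT_CHAIN = "gemini,openai,cloudflare,template"


def provider_chain(env: Mapping[str, str]) -> list[str]:
    """Parse the priority list, falling back to the documented default."""
    raw = env.get("RAPID_AI_PROVIDER_CHAIN") or DEFAULT_CHAIN
    return [name.strip().lower() for name in raw.split(",") if name.strip()]


def first_success(
    chain: list[str],
    providers: Mapping[str, Callable[[str], str]],
    prompt: str,
) -> tuple[str, str]:
    """Try each provider in order; return (name, output) of the first that works."""
    for name in chain:
        provider = providers.get(name)
        if provider is None:
            continue  # not configured, e.g. a missing API key
        try:
            return name, provider(prompt)
        except Exception:
            continue  # provider failed; fall through to the next one
    raise RuntimeError("no provider in the chain produced a result")
```

Placing a deterministic template provider last, as the default chain does, means the loop can still return something when every external provider is unavailable.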

Explorer (SvelteKit):

Variable	Required	Purpose
`DATABASE_URL`	Yes	PostgreSQL for Drizzle + better-auth
`BETTER_AUTH_SECRET`	Yes	Session signing secret
`ORIGIN`	Yes	SvelteKit origin URL
`RAPID_AI_ENGINE_URL`	Yes	Python API URL for BFF proxy

Architectural Decisions and Trade-offs

Several deliberate trade-offs shape this architecture:

Monolith over microservices. The diagnostic domain is deeply interconnected — Module B’s faults feed Module C’s fusion, which feeds Module D’s prognostics. Splitting these across network boundaries would add latency, complexity, and failure modes without adding value at current scale. The ring architecture ensures that extraction to services is possible later without rewriting domain logic.

PostgreSQL over specialized stores. A single PostgreSQL instance with pgvector handles relational data, vector search, and (via JSONB) semi-structured data. This eliminates the operational burden of managing a separate vector database, time-series database, or document store. When any single concern outgrows PostgreSQL, it can be extracted independently.

Drizzle over SQLAlchemy for schema ownership. The frontend team (SvelteKit) owns the schema because they are closest to the user-facing data model. Python mirrors the schema for queries but never modifies structure. This prevents migration conflicts between two ORMs.

YAML over database for rule configuration. The 119 fault rules, system profiles, and scoring criteria live in YAML files within the codebase. This makes them version-controlled, diff-able, and deployable alongside the code that interprets them. When rule governance requires runtime editing, a database-backed rule store can layer on top.

Feature flags over feature branches. Six feature flags (AI_BRIEF, AI_DIAGNOSTICIAN, RAG_RULES, V2_PIPELINE, SWARM_ENGINE, SWARM_EXPLORER) allow capabilities to be toggled at runtime via environment variables. This enables gradual rollout, A/B testing, and graceful degradation when external services (AI providers, database) are unavailable.
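
A flag reader consistent with the "default ON, set 0 to disable" convention could be sketched as follows. The exact variable names are an assumption: the sketch combines the documented RAPID_FEATURE_ prefix with each flag name, which this chapter does not spell out.

```python
import os
from typing import Mapping

FLAGS = (
    "AI_BRIEF",
    "AI_DIAGNOSTICIAN",
    "RAG_RULES",
    "V2_PIPELINE",
    "SWARM_ENGINE",
    "SWARM_EXPLORER",
)


def flag_enabled(name: str, env: Mapping[str, str] = os.environ) -> bool:
    """A flag is on unless its variable is explicitly set to "0"."""
    return env.get(f"RAPID_FEATURE_{name}", "1").strip() != "0"


def flag_states(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Snapshot of all flags, e.g. for inclusion in a health response."""
    return {name: flag_enabled(name, env) for name in FLAGS}
```

Treating only the literal "0" as off keeps an unset or mistyped variable from silently disabling a capability.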

These decisions optimize for the current stage of the product: a small team building a deep product, where developer velocity and debuggability matter more than theoretical scalability. The architecture is designed so that every decision can be revisited without a rewrite.

Standards Alignment

Standard	Relevance to This Chapter
ISO 13374 — Condition monitoring and diagnostics of machines	The three-service architecture (FastAPI engine, SvelteKit Explorer, PostgreSQL) implements ISO 13374’s processing chain as a production system, with strict separation between data acquisition, processing, and presentation layers.
MIMOSA OSA-CBM — Open System Architecture for CBM	The API-first design with REST endpoints under /rapid-ai/v1/ follows OSA-CBM’s open architecture principles, enabling interoperability with existing plant historians, CMMS, and SCADA systems.
IEC 62443 — Industrial cybersecurity	The concentric ring architecture (Ring 0 pure physics, Ring 1 orchestration, Ring 2 infrastructure) implements IEC 62443’s defense-in-depth model with strict dependency direction and network boundary separation.
OWASP Top 10 — Web application security	The schema ownership rule (Drizzle owns DDL) and the BFF pattern prevent common web application vulnerabilities by isolating the diagnostic engine from direct public internet exposure.

Changelog

Version	Date	Author	Changes
2.1.0	2026-03-17	Rick D	Added standards alignment, living doc metadata, changelog
2.0.0	2026-03-17	Rick D	Enriched with production codebase content
1.0.0	2026-03-17	Rick D	Initial chapter creation