System Architecture
Chapter 14 — System Architecture
System Overview
RAPID AI is built as a monolith-first, API-first engineering intelligence platform. The decision to avoid a microservices architecture at this stage is deliberate: microservices impose network boundaries, deployment orchestration, and distributed debugging overhead that a young product cannot afford. The domain logic of industrial diagnostics is complex enough without scattering it across a dozen independently failing processes.
Instead, RAPID AI adopts a modular monolith. The codebase is organized into concentric rings with strict dependency direction — inner rings know nothing about outer rings — so that services can be extracted later if scale demands it, but today they communicate through function calls, not network hops.
The platform consists of three runtime services:
FastAPI Engine (Python 3.13, port 8000). This is the physics pipeline, the diagnostic intelligence, the swarm coordination, the knowledge search, and every computation that touches RAPID AI’s intellectual property. It runs on uvicorn, uses Pydantic for schema validation, and connects to PostgreSQL via SQLAlchemy Core for read/write operations. The engine owns Ring 0 (pure physics — no I/O) and Ring 1 (orchestration), with Ring 2 (infrastructure) handling all external boundaries: database, AI providers, and agent tools.
SvelteKit Explorer (Bun 1.3, port 5173). This is the consolidated frontend application — previously five separate apps (Explorer, Invent, Monitor, Suite, Workshop), now unified into a single SvelteKit application with route groups for isolation. It serves as both the user-facing dashboard and the backend-for-frontend (BFF) layer that proxies API calls, manages authentication via better-auth, and owns the PostgreSQL schema through Drizzle ORM.
PostgreSQL 17 with pgvector (port 5432). A single database instance serves as the single source of truth (SSOT) for asset hierarchy, analysis results, vector embeddings, sensor readings, work orders, failure history, alerts, and audit logs. pgvector enables semantic similarity search over 768-dimensional embeddings without requiring a separate vector database.
Every capability in the system is exposed as a REST endpoint. The API prefix is /rapid-ai/v1/. This API-first discipline ensures that the SvelteKit frontend, future mobile clients, CMMS integrations, and third-party consumers all access the same capabilities through the same contracts.
Schema ownership follows a strict rule: Drizzle owns the DDL. The SvelteKit application manages all table creation, migrations, and schema changes through drizzle-kit. Python mirrors the schema through SQLAlchemy Table objects that are used for type-safe queries but never create or alter tables. This prevents two ORMs from fighting over the same database.
Architecture Diagram
```
+------------------------------------------------------------------+
|                           CLIENT LAYER                           |
|                                                                  |
|  +------------------+  +------------------+  +----------+        |
|  | SvelteKit        |  | API Consumers    |  | CMMS     |        |
|  | Explorer         |  | (REST clients,   |  | Export   |        |
|  | Dashboard        |  |  mobile, CLI)    |  | Targets  |        |
|  +--------+---------+  +--------+---------+  +----+-----+        |
|           |                     |                 |              |
+-----------|---------------------|-----------------|--------------+
            |                     |                 |
            v                     v                 v
+------------------------------------------------------------------+
|                   API GATEWAY LAYER (FastAPI)                    |
|                      Prefix: /rapid-ai/v1/                       |
|                                                                  |
|  Auth (better-auth) | Rate Limiting | API Versioning             |
|  Feature Flags      | CORS          | Request Validation         |
|                                                                  |
|  +-----------+  +-----------+  +-----------+  +-------------+    |
|  | /evaluate |  | /diagnose |  | /assets/* |  | /swarm/*    |    |
|  | /moduleA  |  | /copilot  |  | /health   |  | /knowledge  |    |
|  | /moduleB  |  |           |  |           |  | /stream     |    |
|  | /moduleC  |  |           |  |           |  |             |    |
|  | /moduleD  |  |           |  |           |  |             |    |
|  | /moduleE  |  |           |  |           |  |             |    |
|  +-----------+  +-----------+  +-----------+  +-------------+    |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                          SERVICE LAYER                           |
|                                                                  |
|  +-------------------------------------------------------------+ |
|  |                 AnalysisService (Pipeline)                  | |
|  |                                                             | |
|  |  mA: analyze_signal()                                       | |
|  |    Guard rules (DG001-DG019) --> Signal features            | |
|  |    (RMS, peak, crest, kurtosis) --> ISO zone classification | |
|  |         |                                                   | |
|  |         v                                                   | |
|  |  mB: detect_faults()                                        | |
|  |    Component rules (121 rules, 12 types)                    | |
|  |    Signal feature rules (SF001-SF051)                       | |
|  |    Trend analysis (Step/Chaotic/Accel/Drift/Stable)         | |
|  |    SEDL entropy (SE + TE + DE)                              | |
|  |         |                                                   | |
|  |         v                                                   | |
|  |  mC: fuse_ssi()                                             | |
|  |    BSR001-BSR022 block scoring (3-pass)                     | |
|  |    SSI weighted aggregation + system profile weights (YAML) | |
|  |         |                                                   | |
|  |         v                                                   | |
|  |  mD: predict_prognostics()                                  | |
|  |    HSR001-HSR010 health stage rules                         | |
|  |    RUL estimation (Weibull-adjusted log-slope)              | |
|  |         |                                                   | |
|  |         v                                                   | |
|  |  mE: plan_maintenance()                                     | |
|  |    PWR001-PWR010 priority windows (2-pass)                  | |
|  |    Priority = 100 x (0.45S + 0.25C + 0.20K + 0.10U)         | |
|  |    Action ranking with boost                                | |
|  +-------------------------------------------------------------+ |
|                                                                  |
|  +-----------------+  +-----------------+  +------------------+  |
|  | Swarm Engine    |  | AI Diagnostician|  | Knowledge (RAG)  |  |
|  | EngineSentinel  |  | Agent loop      |  | Rule embeddings  |  |
|  | TaskPlanner     |  | 5 tools, max 5  |  | Analysis vectors |  |
|  | 4 Workers:      |  | iterations      |  | Cosine search    |  |
|  |   Analyst       |  | FRETTLSM-aware  |  |                  |  |
|  |   Diagnostician |  |                 |  |                  |  |
|  |   BriefWriter   |  |                 |  |                  |  |
|  |   Knowledge     |  |                 |  |                  |  |
|  +-----------------+  +-----------------+  +------------------+  |
|                                                                  |
|  +------------------+  +-------------------+                     |
|  | Confidence       |  | Provider Registry |                     |
|  | Scoring Module   |  | Gemini > OpenAI > |                     |
|  | 0.0-1.0 range    |  | Cloudflare >      |                     |
|  | canonical std    |  | Template fallback |                     |
|  +------------------+  +-------------------+                     |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                            DATA LAYER                            |
|                                                                  |
|  +-------------------------+  +------------------------------+   |
|  | PostgreSQL 17 + pgvector|  | Object Storage (future)      |   |
|  |                         |  | Waveforms, attachments,      |   |
|  | Asset hierarchy:        |  | inspection images            |   |
|  |   organization          |  +------------------------------+   |
|  |   locations             |                                     |
|  |   areas                 |  +------------------------------+   |
|  |   equipment             |  | Rule Store (YAML)            |   |
|  |   sub_assemblies        |  | actions.yaml                 |   |
|  |   measurement_points    |  | profiles.yaml                |   |
|  |   spares                |  | block_scores.yaml            |   |
|  |                         |  | fusion.yaml                  |   |
|  | Intelligence:           |  +------------------------------+   |
|  |   analysis_results      |                                     |
|  |   analysis_vectors      |  +------------------------------+   |
|  |   rule_vectors          |  | Seed Data                    |   |
|  |   sensor_readings       |  | IMS failure modes (119)      |   |
|  |                         |  | Signal feature rules (50)    |   |
|  | Maintenance:            |  | Guard rules (16)             |   |
|  |   work_orders           |  | System profiles              |   |
|  |   pm_schedules          |  +------------------------------+   |
|  |   failure_history       |                                     |
|  |                         |                                     |
|  | Observability:          |                                     |
|  |   alerts                |                                     |
|  |   audit_log             |                                     |
|  +-------------------------+                                     |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                      EXTERNAL INTEGRATIONS                       |
|                                                                  |
|  +-------------------+  +------------------+  +---------------+  |
|  | Sensor Data       |  | CMMS Export      |  | Notification  |  |
|  | Ingestion         |  | Work orders,     |  | Services      |  |
|  | POST /evaluate    |  | spare requests,  |  | SSE stream,   |  |
|  | POST /data/sensor |  | maintenance logs |  | alerts,       |  |
|  |                   |  |                  |  | webhooks      |  |
|  +-------------------+  +------------------+  +---------------+  |
|                                                                  |
|  +-------------------+  +------------------+                     |
|  | AI Providers      |  | Embedding API    |                     |
|  | Gemini, OpenAI,   |  | 768-dim vectors  |                     |
|  | Cloudflare, local |  | for RAG search   |                     |
|  +-------------------+  +------------------+                     |
+------------------------------------------------------------------+
```
Database Architecture
The database is a single PostgreSQL 17 instance with the pgvector extension. The simplicity is deliberate: one database to back up, one connection string to configure, one set of migrations to manage.
Schema Design
Drizzle ORM defines the schema in TypeScript. The tables are organized into four functional groups:
Asset Hierarchy. The plant structure follows a strict parent-child chain: organization (owned by better-auth) > locations > areas > equipment > sub_assemblies, which then branch into measurement_points and spares. Each level carries its own metadata: locations have geographic coordinates, equipment has machine type and criticality scores, measurement points have signal type and direction, spares have stock quantities and lead times.
Analysis and Intelligence. analysis_results stores the complete pipeline output for every evaluation: the original request, the full response (as JSONB), the computed SSI, and the severity level. sensor_readings holds raw sensor data keyed by measurement point and timestamp, with values stored as JSONB to accommodate different sensor types without schema changes.
Maintenance. work_orders track maintenance tasks through their lifecycle (status, priority, assignee, parts used). pm_schedules define preventive maintenance intervals with both time-based and condition-based triggers. failure_history stores FMEA records with severity, occurrence, and detection ratings.
Observability. alerts capture system alerts with type, severity, and acknowledgment status. audit_log records every significant user action with the entity type, action taken, and a JSONB details payload for forensic analysis.
pgvector for Semantic Search
Two vector tables store 768-dimensional embeddings:
rule_vectors — Synchronized on startup when the RAG_RULES feature flag is enabled. Each of the 119 component fault rules is embedded as a document containing: the diagnosis, the underlying physics, the FRETTLSM root cause category, the severity progression from early to late stage, and the recommended corrective actions. Content-hash deduplication ensures zero redundant API calls after the initial sync.
analysis_vectors — Embedded as a background task after every successful POST /evaluate. The document includes: asset identity, health stage, SSI score, top faults detected, top recommended actions, and remaining useful life estimate.
Search is performed via GET /rapid-ai/v1/knowledge/search?q=...&top_k=5, which runs cosine similarity queries across both tables, merges the results, and returns them sorted by relevance. This creates a knowledge growth loop: every analysis enriches the search corpus, making future diagnoses more informed.
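The merge step described above can be illustrated with a minimal, dependency-free sketch. It assumes each table yields `(doc_id, vector)` pairs; the function names, the row shape, and the toy 3-dimensional vectors are hypothetical stand-ins. In the running system the similarity would more plausibly be computed inside PostgreSQL with pgvector's cosine-distance operator (`<=>`) rather than in Python.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def search(query_vec, rule_rows, analysis_rows, top_k=5):
    """Score rows from both tables, merge them, and keep the top_k best."""
    scored = []
    for source, rows in (("rule", rule_rows), ("analysis", analysis_rows)):
        for doc_id, vec in rows:
            scored.append((cosine_similarity(query_vec, vec), source, doc_id))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors stand in for the 768-dimensional embeddings.
rules = [("rule-a", [1.0, 0.0, 0.0]), ("rule-b", [0.0, 1.0, 0.0])]
analyses = [("analysis-1", [0.9, 0.1, 0.0])]
results = search([1.0, 0.0, 0.0], rules, analyses, top_k=2)
```

Tagging each hit with its source table keeps the merged list interpretable: a consumer can tell whether a result came from a rule or from a past analysis.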
Key Tables (MVP Set)
| Table | PK Type | Key Columns | Purpose |
|---|---|---|---|
| organization | text | name, slug | Top-level tenant |
| locations | uuid | organisation_id, country, geo_lat/lon | Plant sites |
| areas | uuid | location_id, area_type | Functional areas |
| equipment | uuid | area_id, machine_type, criticality | Machines |
| sub_assemblies | uuid | equipment_id, component_type, position | Components |
| measurement_points | uuid | sub_assembly_id, direction, signal_type | Sensor locations |
| spares | uuid | sub_assembly_id, part_number, quantity_on_hand | Inventory |
| analysis_results | uuid | request/response JSONB, ssi, severity | Pipeline outputs |
| sensor_readings | uuid | measurement_point_id, timestamp, values JSONB | Raw data |
| work_orders | uuid | status, priority, assignee, parts_used | Maintenance tasks |
| failure_history | uuid | failure_mode, cause, severity/occurrence/detection | FMEA records |
Indexing Strategy
High-priority indexes target the most common query patterns:
- sensor_readings(measurement_point_id, timestamp): Time-series lookups by sensor, ordered by time. This is the most critical index in the system; every trend query, every baseline comparison, every dashboard chart hits it.
- analysis_results(equipment_id, created_at): Retrieving analysis history for a specific machine.
- equipment(area_id, status): Listing active machines in a plant area.
- work_orders(equipment_id, status): Finding open work orders for a machine.
- rule_vectors and analysis_vectors use IVFFlat indexes on their vector columns for approximate nearest-neighbor search.
For time-series data at scale, PostgreSQL’s native table partitioning (by month on timestamp) is the planned approach before considering a dedicated time-series database. The principle: exhaust PostgreSQL before adding infrastructure.
API Architecture
Endpoint Design
All endpoints are prefixed /rapid-ai/v1/ and follow pragmatic REST conventions. The API surface is organized into five functional groups:
Physics Pipeline — The core diagnostic capability.
| Method | Path | Purpose |
|---|---|---|
| POST | /evaluate | Full 5-module pipeline (A > B > C > D > E) |
| POST | /moduleA | Signal analysis only |
| POST | /moduleB | Fault detection only |
| POST | /moduleC | SSI fusion only |
| POST | /moduleD | Prognostics only |
| POST | /moduleE | Maintenance planning only |
| POST | /generate-signal | Synthetic signal generation |
| POST | /diagnose | AI diagnostician (feature-gated) |
Asset Hierarchy — CRUD operations on plant structure.
| Method | Path | Purpose |
|---|---|---|
| GET | /assets/organisations | List all organizations |
| GET | /assets/locations/{org_id} | Locations in an organization |
| GET | /assets/areas/{location_id} | Areas in a location |
| GET | /assets/equipment/{area_id} | Equipment in an area |
| GET | /assets/equipment/{id}/context | Equipment with full context |
| GET | /assets/sub-assemblies/{equipment_id} | Sub-assemblies |
| GET | /assets/measurement-points/{sub_id} | Measurement points |
| GET/PATCH | /assets/spares/* | Spare parts and stock management |
Swarm (Agent Coordination) — Multi-agent diagnostic intelligence.
| Method | Path | Purpose |
|---|---|---|
| POST | /swarm/task | Submit a pre-built AgentTask |
| POST | /swarm/dispatch | Intent-based dispatch (plans then executes) |
| GET | /swarm/capabilities | List all worker capabilities |
| GET | /swarm/status | Worker count, active tasks |
| GET | /swarm/stream | SSE stream (heartbeats + events) |
Knowledge (RAG) — Semantic search across rules and past analyses.
| Method | Path | Purpose |
|---|---|---|
| GET | /knowledge/search?q=...&top_k=5 | Cosine similarity search |
Health — System status and feature flag reporting.
| Method | Path | Purpose |
|---|---|---|
| GET | /health | Status, version, feature flags, providers |
Request/Response Contracts
All request bodies are validated by Pydantic models in the engine’s domain layer. A typical pipeline evaluation request:
```json
{
  "equipment_id": "uuid-of-equipment",
  "system_type": "centrifugal_pump",
  "operating_speed_hz": 29.5,
  "signal": {
    "waveform": [0.12, -0.08, 0.15, ...],
    "sampling_rate_hz": 8192,
    "signal_type": "acceleration"
  }
}
```
The response returns the full pipeline output in a single FullAnalysisResponse object: Module A features, Module B fault detections, Module C SSI score, Module D health stage and RUL, and Module E maintenance actions. After the response is sent, a background task embeds the result into pgvector for future RAG search.
Versioning Strategy
The current version prefix /rapid-ai/v1/ is baked into all routes. When breaking changes are introduced, a /v2/ prefix will run alongside /v1/ with a deprecation window. Non-breaking additions (new fields, new endpoints) are added to the current version without incrementing.
Service Layer Architecture
The service layer is the intellectual core of RAPID AI. It lives in engine/domain/ (Ring 0, pure physics, no I/O) and engine/services/ (Ring 1, orchestration). Here is what each engine does and how.
Rule Evaluator: Safe Recursive-Descent Parser
RAPID AI contains 119 component fault rules, 50 signal feature rules, and 16 data guard rules. These rules are not executed via eval() or any dynamic code interpretation. They are expressed as structured condition/action pairs, either as Python dataclasses or YAML configuration, and evaluated by a safe recursive-descent evaluator that walks condition trees without ever executing arbitrary code.
The guard rules (DG001-DG019) run first, checking data quality before any diagnostic computation begins. If the signal fails data validation (missing samples, unreasonable amplitudes, or a sampling rate too low to satisfy the Nyquist criterion for the frequencies of interest), the pipeline stops early with an explanation instead of crashing.
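To make the idea of walking a condition tree concrete, here is a minimal sketch. The rule shape (`all` / `any` / `not` nodes with `feature`, `op`, `value` leaves), the feature names, and the thresholds are all hypothetical; the actual RAPID AI rule schema is not shown in this chapter. The point is only that a whitelist of operators plus plain recursion is enough to evaluate rules without `eval()`.

```python
import operator

# Whitelisted comparison operators; anything else is rejected outright.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}

def evaluate(node, features):
    """Recursively evaluate a condition tree against extracted features."""
    if "all" in node:
        return all(evaluate(child, features) for child in node["all"])
    if "any" in node:
        return any(evaluate(child, features) for child in node["any"])
    if "not" in node:
        return not evaluate(node["not"], features)
    compare = COMPARATORS.get(node["op"])
    if compare is None:
        raise ValueError(f"unsupported operator: {node['op']!r}")
    if node["feature"] not in features:
        return False  # a missing feature never satisfies a leaf condition
    return compare(features[node["feature"]], node["value"])

# Hypothetical rule: high kurtosis AND (high crest factor OR high RMS).
rule = {"all": [
    {"feature": "kurtosis", "op": ">", "value": 4.0},
    {"any": [
        {"feature": "crest_factor", "op": ">=", "value": 5.0},
        {"feature": "rms", "op": ">", "value": 7.1},
    ]},
]}
features = {"kurtosis": 5.2, "crest_factor": 3.0, "rms": 8.0}
matched = evaluate(rule, features)
```

Because the evaluator only ever looks up keys and calls functions from a fixed table, a malformed or malicious rule can at worst raise an error; it cannot run code.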
Diagnostic Engine: IMS-Driven, Data-Not-Code
The Integrated Master Schema (IMS) is a database of 119 failure mode signatures across 12 component types (antifriction bearings, gears, journal bearings, motors, pumps, fans, compressors, couplings, shafts, seals, structures, and belts). Each entry maps a failure mechanism to its expected spectral signature, FRETTLSM root cause category, severity progression, and recommended corrective actions.
The diagnostic engine does not hardcode diagnostic logic. It reads IMS entries and compares them against extracted signal features. When sensor features match an IMS pattern — for example, elevated BPFO harmonics with increasing kurtosis — the engine scores the match and ranks it by confidence. The intelligence is in the data, not the code. Adding a new failure mode means adding an IMS row, not writing new software.
SEDL Engine: Shannon Entropy Across Three Domains
The Spectral-Entropy-Directional-Lens (SEDL) applies information-theoretic analysis to detect stability degradation before threshold-based alarms fire. It computes entropy across three domains:
- Spectral Entropy (SE): Measures disorder in the frequency spectrum. A healthy machine concentrates energy in narrow bands; a degrading machine spreads energy across wider bands, increasing spectral entropy.
- Temporal Entropy (TE): Measures disorder in the time-domain signal. Erratic amplitude variation indicates instability.
- Directional Entropy (DE): Measures disorder across measurement axes. When vibration energy shifts unpredictably between horizontal, vertical, and axial directions, it signals mechanical instability.
SEDL runs as part of Module B (fault detection) and feeds into Module C (fusion).
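The three entropy measures share one underlying quantity, Shannon entropy over a normalized distribution. The sketch below shows that quantity under stated assumptions: the input is any list of non-negative weights (spectral power per frequency bin for SE, counts per amplitude bin for TE, energy per axis for DE), and the result is scaled to the range 0 to 1. The function name and the scaling choice are illustrative; the chapter does not specify how SEDL normalizes or combines its terms.

```python
import math

def normalized_entropy(weights):
    """Shannon entropy of a non-negative weight list, scaled to [0, 1].

    0 means all energy sits in one bin; 1 means it is spread evenly.
    """
    total = sum(weights)
    if total <= 0 or len(weights) < 2:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    entropy = sum(-p * math.log2(p) for p in probs)
    return entropy / math.log2(len(weights))

# Hypothetical 4-bin power spectra.
narrow_band = [0.0, 0.0, 10.0, 0.0]   # energy concentrated in one band
broad_band = [2.5, 2.5, 2.5, 2.5]     # energy spread across all bands
se_healthy = normalized_entropy(narrow_band)
se_degraded = normalized_entropy(broad_band)
```

This matches the intuition in the list above: the concentrated spectrum scores at the bottom of the scale and the evenly spread spectrum at the top.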
Fusion Engine: Profile-Weighted Block Aggregation
Module C (fuse_ssi) computes the System Stability Index through a three-pass block scoring process using rules BSR001 through BSR022. Each block represents a diagnostic dimension: spectral health, trend stability, entropy state, component-specific risk. Blocks are scored individually, then aggregated using weights defined in profiles.yaml that vary by system type. A centrifugal pump has different weight profiles than a gearbox.
The SSI is not an average. It is a weighted aggregation that emphasizes the most diagnostically significant dimensions for each machine type. The formula for the final priority score in Module E is:
Canonical reference: See Chapter 26 for the authoritative priority formula.
```
P = 100 x (0.45 x Severity + 0.25 x Criticality + 0.20 x Kurtosis_factor + 0.10 x Urgency)
```
RUL Engine: Three Weibull/Log-Slope Models
Module D estimates Remaining Useful Life through three parallel approaches:
- Weibull-adjusted decay: Maps the current health stage (from HSR001-HSR010 health stage rules) to a Weibull survival curve. The rul_multiplier from the matched health stage rule adjusts the baseline estimate.
- Log-slope trend projection: Extrapolates the rate of degradation from trend analysis to estimate when a threshold will be crossed.
- Envelope estimation: Provides optimistic and pessimistic bounds based on confidence intervals.
The three estimates are combined with weights that favor the trend-based projection when sufficient historical data exists.
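As an illustration of the log-slope idea only, the sketch below fits a straight line to the logarithm of a trended amplitude and extrapolates to a threshold. It assumes exponential growth of the trended quantity, strictly positive amplitudes, and a caller-supplied threshold; the function name, the sample data, and the threshold are hypothetical, and the Weibull adjustment and the weighting between the three estimates are not modeled here.

```python
import math

def log_slope_rul(times, amplitudes, threshold):
    """Estimate time remaining until `threshold`, assuming exponential growth.

    Fits ln(amplitude) = intercept + slope * t by ordinary least squares and
    extrapolates to the threshold. Returns math.inf when no upward trend exists.
    """
    if len(times) != len(amplitudes) or len(times) < 2:
        raise ValueError("need at least two (time, amplitude) pairs")
    logs = [math.log(a) for a in amplitudes]  # amplitudes must be > 0
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("time stamps must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    if slope <= 0:
        return math.inf  # flat or improving trend: no crossing predicted
    intercept = mean_y - slope * mean_t
    t_cross = (math.log(threshold) - intercept) / slope
    return max(0.0, t_cross - times[-1])

# Hypothetical trend: amplitude doubles every day; alarm threshold at 32.
days = [0.0, 1.0, 2.0, 3.0]
rms = [1.0, 2.0, 4.0, 8.0]
remaining_days = log_slope_rul(days, rms, threshold=32.0)
```

With this doubling series the fitted line crosses the threshold at day 5, so the estimate is about two days beyond the last sample. A trend-based estimate like this is only as good as the history behind it, which is consistent with favoring it only when sufficient historical data exists.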
CDE Engine: Two-Phase Trigger-Then-Evaluate
The Contradiction-Driven Engineering engine identifies situations where improving one engineering parameter necessarily worsens another: the kind of trade-off that separates root cause treatment from symptom management.
Phase 1 (Trigger): The engine scans the diagnostic output for contradiction triggers — cases where a recommended action would create a new problem. For example, increasing bearing clearance to reduce thermal preload worsens the machine’s tolerance to unbalance forces.
Phase 2 (Evaluate): Once a contradiction is triggered, the engine retrieves the relevant contradiction template, evaluates the trade-off severity, and generates resolution alternatives ranked by engineering impact.
Causal Engine: FRETTLSM Keyword-Matching
The causal engine classifies root causes using the FRETTLSM taxonomy developed by Dibyendu De:
| Letter | Category | Examples |
|---|---|---|
| F | Force | Preload, misalignment, unbalance |
| R | Reactive | Resonance, structural looseness |
| E | Environment | Contamination, corrosion |
| T | Temperature | Thermal expansion, overheating |
| T | Tribology | Surface fatigue, pitting |
| L | Lubrication | Starvation, wrong viscosity |
| S | Surface | Spalling, brinelling, erosion |
| M | Man | Wrong clearance, installation error |
Each component fault rule is classified by FRETTLSM category. The causal engine matches diagnostic findings to their FRETTLSM classification and builds cause chains: trigger (the initiator) > accelerator (what made it worse) > retarder (what could have slowed it).
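A keyword-matching classifier of the kind the section title suggests can be sketched in a few lines. The keyword lists below are taken from the Examples column of the table above and are far from exhaustive; the dictionary name, the function name, and the substring-matching strategy are assumptions made for illustration.

```python
# Keywords drawn from the Examples column; a real vocabulary would be larger.
FRETTLSM_KEYWORDS = {
    "Force": ("preload", "misalignment", "unbalance"),
    "Reactive": ("resonance", "looseness"),
    "Environment": ("contamination", "corrosion"),
    "Temperature": ("thermal expansion", "overheating"),
    "Tribology": ("surface fatigue", "pitting"),
    "Lubrication": ("starvation", "viscosity"),
    "Surface": ("spalling", "brinelling", "erosion"),
    "Man": ("clearance", "installation error"),
}

def classify(finding):
    """Return every FRETTLSM category whose keywords appear in the finding."""
    text = finding.lower()
    return [
        category
        for category, keywords in FRETTLSM_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]

categories = classify("Outer race spalling with lubricant starvation")
```

A finding can legitimately land in more than one category, which is what makes it possible to assemble a chain such as trigger and accelerator from a single diagnosis.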
Confidence Module: Canonical Scoring Standard
All confidence scores across the system follow a single canonical standard:
- Format: Numeric float, range 0.0 to 1.0
- Field name: confidence_score everywhere
Qualitative labels are mapped to fixed numeric values:
| Label | Value | Application |
|---|---|---|
| High | 0.85 | Direct sensor confirmation |
| Medium-high | 0.75 | Strong single-source evidence |
| Medium | 0.60 | Ambiguous/single source |
| Low | 0.40 | Weak signal/noisy data |
| Insufficient | 0.00 | Contradictory evidence |
Decision thresholds: RCM activation requires >= 0.70. Safety escalation requires >= 0.80. Dashboard display suppresses anything below 0.50 to prevent noise from reaching operators.
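The label mapping and the three thresholds above translate directly into a small gating helper. The numeric values come from this section; the constant names, the function name, and the returned dictionary shape are hypothetical.

```python
# Values from the label table above.
LABEL_TO_SCORE = {
    "high": 0.85,
    "medium-high": 0.75,
    "medium": 0.60,
    "low": 0.40,
    "insufficient": 0.00,
}

# Decision thresholds stated in this section.
DISPLAY_THRESHOLD = 0.50
RCM_THRESHOLD = 0.70
SAFETY_THRESHOLD = 0.80

def gate(confidence_score):
    """Decide which downstream actions a confidence score unlocks."""
    if not 0.0 <= confidence_score <= 1.0:
        raise ValueError("confidence_score must be within 0.0 and 1.0")
    return {
        "show_on_dashboard": confidence_score >= DISPLAY_THRESHOLD,
        "activate_rcm": confidence_score >= RCM_THRESHOLD,
        "escalate_safety": confidence_score >= SAFETY_THRESHOLD,
    }

decision = gate(LABEL_TO_SCORE["medium-high"])
```

Note the consequence of the fixed mapping: a "Medium" finding (0.60) is displayed but does not activate RCM, and a "Low" finding (0.40) never reaches the dashboard.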
Data Flow Architecture
A complete diagnostic request traces the following path through the system:
1. Sensor data arrives via POST /rapid-ai/v1/evaluate. The request includes: equipment_id, system_type, operating_speed_hz, signal waveform, sampling_rate_hz, signal_type.
2. _resolve_asset_context() enriches the request with machine_type and criticality from the database.
3. Module A: analyze_signal()
   - Guard rules (DG001-DG019) validate data quality.
   - Signal features extracted: RMS, peak, crest factor, kurtosis.
   - ISO zone classification applied.
   - If data fails guard rules, pipeline stops with explanation.
4. Module B: detect_faults() [sequential evaluation]
   - 119 component fault rules matched against features.
   - 50 signal feature rules (SF001-SF051) evaluated.
   - Trend analysis classifies pattern: Step/Chaotic/Accel/Drift/Stable.
   - SEDL entropy computed: SE + TE + DE.
   - Matched faults ranked by confidence score.
5. Module C: fuse_ssi()
   - BSR001-BSR022 block scoring in 3 passes.
   - System profile weights loaded from profiles.yaml.
   - SSI (System Stability Index) computed as weighted aggregation.
6. Module D: predict_prognostics()
   - HSR001-HSR010 health stage rules determine current stage.
   - RUL estimated via Weibull-adjusted log-slope model.
   - rul_multiplier from matched health stage adjusts baseline.
7. Module E: plan_maintenance()
   - PWR001-PWR010 priority windows evaluated in 2 passes.
   - Priority score computed: P = 100 x (0.45S + 0.25C + 0.20K + 0.10U).
   - Actions ranked with boost factors for urgent conditions.
8. FullAnalysisResponse returned to client.
9. Background task: embed_analysis() generates a 768-dim vector and stores it in the analysis_vectors table (pgvector).
10. If SSI exceeds alert thresholds, an alert record is created. The SSE stream pushes the event to connected Explorer clients.
11. The Explorer dashboard fetches /assets/equipment/{id}/context for updated analysis history and displays trend charts.
SSOT Chain
The type contract chain ensures consistency from database to browser:
```
Python Pydantic (Ring 0) --> REST API (JSON) --> @rapidai/contracts (TypeScript) --> App imports
          |                                                  |
  settings.py (thresholds)                       hierarchy.ts, analysis.ts
  hierarchy.py (schemas)                         base.ts, rules.ts
  swarm.py (protocol)                            diagnostics-types.ts
```
Pydantic models are the source of truth. TypeScript contracts in @rapidai/contracts mirror the Pydantic shapes. The Explorer app imports only from the contracts package, never from the engine directly. This ensures that if a schema changes in Python, the TypeScript build breaks immediately, before a runtime error can reach a user.
Scaling Considerations
RAPID AI is designed to run on a single machine today and scale horizontally when demand requires it.
Stateless Services
Both the FastAPI engine and SvelteKit Explorer are stateless. No request depends on server-local state. Session data lives in PostgreSQL (via better-auth), analysis results live in PostgreSQL, and configuration lives in YAML files baked into the container image. This means multiple engine instances can run behind a load balancer with no session affinity required.
Database Scaling
- Read replicas: Dashboard queries (trend charts, asset lists, analysis history) are read-heavy and can be directed to a PostgreSQL read replica.
- Connection pooling: SQLAlchemy async with asyncpg uses pool_size=10 and max_overflow=20. For higher concurrency, PgBouncer can be added as a connection multiplexer.
- Partitioning: sensor_readings and analysis_results are candidates for time-based partitioning when row counts reach the tens of millions.
- Graceful degradation: If DATABASE_URL is not set, the engine runs in pure-compute mode: the pipeline works, persistence does not. This allows the physics engine to function independently during database maintenance.
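The graceful-degradation behavior can be pictured as a single check on the environment. The sketch below is an assumption about how such a switch might look, not the engine's actual code: the function names, the `sink` callback, the sample result, and the example connection string are all hypothetical.

```python
import os

def persistence_mode(env=None):
    """Return "persistent" when DATABASE_URL is set and non-empty, else "pure-compute"."""
    env = os.environ if env is None else env
    return "persistent" if env.get("DATABASE_URL", "").strip() else "pure-compute"

def store_result(result, sink, env=None):
    """Persist `result` via `sink` only when a database is configured."""
    if persistence_mode(env) == "pure-compute":
        return False  # the analysis itself still succeeds; nothing is written
    sink(result)
    return True

# Hypothetical usage with an in-memory list standing in for the database.
saved = []
skipped = store_result({"ssi": 0.42}, saved.append, env={})
stored = store_result(
    {"ssi": 0.42},
    saved.append,
    env={"DATABASE_URL": "postgresql://localhost:5432/rapid"},
)
```

Keeping the decision in one place means the pipeline code never needs to know whether persistence is available; it simply hands results to the storage step.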
Caching
- Rule store: The 119 component rules, 50 signal feature rules, and YAML configuration files (profiles, block scores, fusion weights, actions) are loaded into memory on startup. They change infrequently and are small enough to hold entirely in RAM.
- IMS cache: Failure mode signatures are loaded once and cached for the lifetime of the process. A restart picks up any changes.
- Vector embeddings: Rule vectors are synced on startup with content-hash deduplication. After the initial sync, subsequent startups require zero embedding API calls.
Asynchronous Processing
- Trend analysis (Module B.2) and SEDL entropy (Module B.3) are evaluated sequentially within Module B. A previous ThreadPoolExecutor(max_workers=4) implementation was removed because the GIL prevents true parallelism for CPU-bound Python work.
- Vector embedding after analysis is already a background task: the response returns to the client before the embedding is complete.
- Swarm tasks: The swarm engine dispatches agent tasks asynchronously. Long-running diagnostic reasoning (up to 5 LLM iterations) runs without blocking the HTTP response thread.
- SSE streaming: Server-sent events provide real-time updates to the dashboard without polling. The Explorer’s /api/stream route proxies the engine’s SSE stream and merges it with local event channels (alerts, sensors, health, swarm).
Security Architecture
Authentication
better-auth manages user authentication with admin and organization plugins. It creates and manages its own tables (user, session, account, verification, member, invitation). The Explorer’s auth.ts serves as the BFF bridge: the SvelteKit server validates sessions before proxying requests to the FastAPI engine.
Session tokens are signed with BETTER_AUTH_SECRET. In production, this must be a cryptographically random string, not the development default.
Authorization
Role-based access control is enforced at the SvelteKit BFF layer:
| Role | Permissions |
|---|---|
| Viewer | Read dashboards, view analysis history |
| Engineer | Run analyses, create work orders, manage spares |
| Admin | Manage users, organizations, system configuration |
| Approver | Approve engineering rule changes (governance) |
The FastAPI engine currently trusts requests from the SvelteKit server (internal network boundary). API key authentication for external consumers is planned for the multi-tenant phase.
Data Protection
- TLS in transit: All external communication runs over HTTPS. Internal service-to-service communication (SvelteKit to FastAPI) uses HTTP within the Docker network; in production, TLS terminates at the load balancer.
- Encryption at rest: Managed PostgreSQL providers (Supabase, Neon) encrypt storage by default. Self-hosted deployments should use encrypted volumes, because community PostgreSQL does not ship transparent data encryption; the pgcrypto extension can cover column-level encryption where needed.
- Secret management: API keys (GEMINI_API_KEY, OPENAI_API_KEY, CLOUDFLARE_API_TOKEN, BETTER_AUTH_SECRET) are injected via environment variables, never committed to source control.
Input Validation and Safety
- Pydantic models: Every API input is validated by a Pydantic model before reaching the service layer. Invalid requests are rejected with structured error responses.
- Parameterized SQL: All database queries use SQLAlchemy’s parameterized query builder. No string interpolation in SQL.
- No eval(): Rule evaluation uses a safe recursive-descent parser. No user-supplied input is ever executed as code.
- IP protection: Engineering rules are executed server-side only. The API returns diagnostic results and explanations, never the raw rules or their internal logic. Explainability summaries replace full logic disclosure.
Deployment Topology
Development: Local Docker Compose
Three services are defined in infra/docker-compose.yml:
```
services:
  db:      pgvector/pgvector:pg17          :5432
  engine:  Dockerfile.api (Python 3.13)    :8000  (depends: db)
  app:     Dockerfile.app (SvelteKit)      :5173  (depends: db + engine)
```
The Dockerfile.api uses a three-stage build: python:3.13-slim base, install dependencies via uv, copy application code, run as non-root user with uvicorn. The Dockerfile.app builds on oven/bun:alpine.
Quick start for local development:
```
make dev-db       # Start PostgreSQL in Docker
make dev-api      # Start FastAPI engine (uvicorn at :8000)
make dev-web      # Start SvelteKit app (Bun at :5173)
# Or all at once:
make dev          # API + Explorer (DB must be running)
```
Database management:
```
make db-push      # Push Drizzle schema to DB
make db-generate  # Generate migration files
make db-migrate   # Run pending migrations
make db-studio    # Open Drizzle Studio for visual inspection
```
Production: Containerized Deployment
Option A: Simple container hosting. Docker Compose on a single VPS with nginx as reverse proxy and TLS terminator. Suitable for early customers and proof-of-concept deployments.
Option B: Container orchestration — Kubernetes or Docker Swarm for horizontal scaling. The stateless services scale horizontally; the database runs as a managed service (Supabase, Neon, or AWS RDS with pgvector).
Option C: Edge-compatible — For plants with restricted connectivity, the FastAPI engine can run locally with periodic sync to a central database. The system degrades gracefully: without DATABASE_URL, the physics pipeline still functions in pure-compute mode.
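The graceful degradation in Option C could be selected at startup roughly as follows. This is only a sketch: `RuntimeMode` and `select_mode` are hypothetical names, and the real engine may detect a missing database differently.

```python
import os
from enum import Enum


class RuntimeMode(Enum):
    PERSISTENT = "persistent"      # DATABASE_URL is set: results can be stored
    PURE_COMPUTE = "pure-compute"  # no database: physics pipeline only


def select_mode(env=None):
    """Pick the runtime mode from the environment (defaults to os.environ)."""
    env = os.environ if env is None else env
    database_url = env.get("DATABASE_URL", "").strip()
    return RuntimeMode.PERSISTENT if database_url else RuntimeMode.PURE_COMPUTE
```

Passing the environment as a parameter keeps the decision testable without mutating the process environment.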
Frontend Hosting
The SvelteKit Explorer can be deployed in three configurations:
- Self-hosted: Node.js/Bun process behind a reverse proxy. Required when the BFF layer needs direct PostgreSQL access.
- Vercel/Cloudflare Pages: SvelteKit’s adapter system supports edge deployment. The BFF functions run as serverless functions.
- Static export: For dashboard-only deployments that consume the API without server-side rendering.
Health Checks
| Service | Method | Interval |
|---|---|---|
| PostgreSQL | pg_isready | 5s |
| Engine | GET /health (urllib, no external deps) | 10s |
| Explorer | curl http://localhost:5173/ | 15s |
The /health endpoint returns the engine version, all feature flag states, and available AI provider capabilities — giving operators a single endpoint to verify that the full stack is operational.
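The engine probe in the table relies on urllib so that the container needs no extra HTTP client. A dependency-free check along those lines might look like the sketch below; the function name `engine_is_healthy`, the default timeout, and the choice to treat anything other than HTTP 200 as unhealthy are assumptions, not the project's actual probe.

```python
import urllib.error
import urllib.request


def engine_is_healthy(url="http://localhost:8000/health", timeout=3.0):
    """Return True when the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or a malformed URL.
        return False
```

A container health check could call this function and exit with a non-zero status when it returns False.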
Environment Variables
Engine (Python):
| Variable | Required | Purpose |
|---|---|---|
| DATABASE_URL | For persistence | PostgreSQL connection string |
| GEMINI_API_KEY | For AI features | Google Gemini API key |
| OPENAI_API_KEY | Fallback | OpenAI API key |
| RAPID_AI_PROVIDER_CHAIN | No (default: gemini,openai,cloudflare,template) | AI provider priority |
| RAPID_FEATURE_* | No (all default ON) | Feature flags (set 0 to disable) |
| LOG_LEVEL | No (default: info) | Structured logging level |
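
As an illustration of how the RAPID_AI_PROVIDER_CHAIN value could be turned into an ordered fallback list, consider the sketch below. The function name `provider_chain`, the case-insensitive matching, and the decision to reject unknown names are assumptions; only the variable name and its default come from the table.

```python
import os

DEFAULT_CHAIN = "gemini,openai,cloudflare,template"  # documented default
KNOWN_PROVIDERS = {"gemini", "openai", "cloudflare", "template"}


def provider_chain(env=None):
    """Return the ordered provider list, dropping blanks and duplicates."""
    env = os.environ if env is None else env
    raw = env.get("RAPID_AI_PROVIDER_CHAIN") or DEFAULT_CHAIN
    chain = []
    for name in (part.strip().lower() for part in raw.split(",")):
        if not name or name in chain:
            continue
        if name not in KNOWN_PROVIDERS:
            # Assumption: fail fast on typos instead of silently skipping them.
            raise ValueError(f"unknown AI provider: {name!r}")
        chain.append(name)
    return chain
```

Callers would then try each provider in order and fall through to the next one on failure.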
Explorer (SvelteKit):
| Variable | Required | Purpose |
|---|---|---|
| DATABASE_URL | Yes | PostgreSQL for Drizzle + better-auth |
| BETTER_AUTH_SECRET | Yes | Session signing secret |
| ORIGIN | Yes | SvelteKit origin URL |
| RAPID_AI_ENGINE_URL | Yes | Python API URL for BFF proxy |
Architectural Decisions and Trade-offs
Several deliberate trade-offs shape this architecture:
Monolith over microservices. The diagnostic domain is deeply interconnected — Module B’s faults feed Module C’s fusion, which feeds Module D’s prognostics. Splitting these across network boundaries would add latency, complexity, and failure modes without adding value at current scale. The ring architecture ensures that extraction to services is possible later without rewriting domain logic.
PostgreSQL over specialized stores. A single PostgreSQL instance with pgvector handles relational data, vector search, and (via JSONB) semi-structured data. This eliminates the operational burden of managing a separate vector database, time-series database, or document store. When any single concern outgrows PostgreSQL, it can be extracted independently.
Drizzle over SQLAlchemy for schema ownership. The frontend team (SvelteKit) owns the schema because they are closest to the user-facing data model. Python mirrors the schema for queries but never modifies structure. This prevents migration conflicts between two ORMs.
YAML over database for rule configuration. The 119 fault rules, system profiles, and scoring criteria live in YAML files within the codebase. This makes them version-controlled, diff-able, and deployable alongside the code that interprets them. When rule governance requires runtime editing, a database-backed rule store can layer on top.
Feature flags over feature branches. Six feature flags (AI_BRIEF, AI_DIAGNOSTICIAN, RAG_RULES, V2_PIPELINE, SWARM_ENGINE, SWARM_EXPLORER) allow capabilities to be toggled at runtime via environment variables. This enables gradual rollout, A/B testing, and graceful degradation when external services (AI providers, database) are unavailable.
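A flag loader consistent with the documented behavior (all flags default ON, set 0 to disable) could be sketched as follows. The full variable names, formed here as `RAPID_FEATURE_` plus the flag name, and the function name `load_feature_flags` are assumptions inferred from the `RAPID_FEATURE_*` pattern.

```python
import os

FLAG_NAMES = (
    "AI_BRIEF", "AI_DIAGNOSTICIAN", "RAG_RULES",
    "V2_PIPELINE", "SWARM_ENGINE", "SWARM_EXPLORER",
)


def load_feature_flags(env=None):
    """Map each flag to a bool: ON unless its variable is set to "0"."""
    env = os.environ if env is None else env
    return {
        name: env.get(f"RAPID_FEATURE_{name}", "1").strip() != "0"
        for name in FLAG_NAMES
    }
```

For example, `load_feature_flags({"RAPID_FEATURE_SWARM_ENGINE": "0"})` disables only the swarm engine and leaves the other five flags on.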
These decisions optimize for the current stage of the product: a small team building a deep product, where developer velocity and debuggability matter more than theoretical scalability. The architecture is designed so that every decision can be revisited without a rewrite.
Standards Alignment
| Standard | Relevance to This Chapter |
|---|---|
| ISO 13374 — Condition monitoring and diagnostics of machines | The three-service architecture (FastAPI engine, SvelteKit Explorer, PostgreSQL) implements ISO 13374’s processing chain as a production system, with strict separation between data acquisition, processing, and presentation layers. |
| MIMOSA OSA-CBM — Open System Architecture for CBM | The API-first design with REST endpoints under /rapid-ai/v1/ follows OSA-CBM’s open architecture principles, enabling interoperability with existing plant historians, CMMS, and SCADA systems. |
| IEC 62443 — Industrial cybersecurity | The concentric ring architecture (Ring 0 pure physics, Ring 1 orchestration, Ring 2 infrastructure) implements IEC 62443’s defense-in-depth model with strict dependency direction and network boundary separation. |
| OWASP Top 10 — Web application security | The schema ownership rule (Drizzle owns DDL) and the BFF pattern prevent common web application vulnerabilities by isolating the diagnostic engine from direct public internet exposure. |
Changelog
| Version | Date | Author | Changes |
|---|---|---|---|
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |