Architecture Decision Records

Chapter 21: Architecture Decision Records

Every system is the sum of its decisions. Most documentation explains what was built; this chapter explains why. When a future engineer asks “why didn’t you just…?” the answer lives here. Records are never deleted; if reversed, the status changes to “Superseded by ADR-XXX.”

ADR-001: Physics-Based Rules Over Machine Learning

Status: Accepted

Context: RAPID AI must diagnose 19+ rotating-asset types (pumps, gearboxes, motors, fans, compressors, turbines, etc.) across vibration, thermal, electrical, and process domains. The system must be explainable to plant engineers, auditable by reliability managers, and deployable at sites with zero historical failure data.

Most diagnostic platforms default to supervised ML — train a model on labeled failure data, then classify new readings. This requires thousands of labeled examples per failure mode per asset type. In heavy industry, catastrophic failures are rare by design. A plant may see one inner-race bearing failure per decade. There is no dataset to train on.

Decision: Encode diagnostic intelligence as 451+ parseable physics-based rules, authored by domain experts, evaluated at runtime by a safe expression parser. Each rule maps directly to mechanical causality: “if 1x radial vibration exceeds 4.5 mm/s AND 2x axial is dominant, suspect angular misalignment.” The rules ARE the model.

Dibyendu De’s 28 years of field experience across hundreds of industrial sites is the training data — crystallized into deterministic, auditable logic rather than opaque weight matrices.

Alternatives Considered:

Random Forest / XGBoost — require labeled failure datasets that do not exist for most asset-failure combinations.
LSTM / sequence models — black boxes; cannot explain WHY a diagnosis was reached, which is unacceptable for safety-critical rotating equipment.
Transformer-based anomaly detection — massive compute and training data; overkill for structured physics problems where causality is known.
Hybrid rules + ML — planned for future phases (see White Paper Section 7), deferred until the deterministic foundation is proven.

Consequences: Fully explainable: every diagnosis traces back to named physical phenomena. Auditable: plant managers can inspect any rule. No cold-start problem: works on day one at a new site. Trade-off: adding new patterns requires domain expertise, creating a bottleneck on Dibyendu’s time until rule-authoring is systematized. Future risk: purely rule-based systems cannot discover unknown failure modes — unsupervised anomaly clustering is on the roadmap.

ADR-002: FastAPI Over Django/Express

Status: Accepted

Context: The backend serves computation-heavy diagnostic pipelines while exposing REST APIs. The science stack is Python (NumPy, SciPy, pandas), so the framework must be Python-native.

Decision: FastAPI for native async/await, Pydantic validation, and auto-generated OpenAPI docs that stay in sync with code.

Alternatives Considered:

Django REST Framework: ORM opinions, admin overhead, synchronous-first. Too heavy.
Express.js: forces a language boundary with Python science libraries.
Flask: WSGI-based, so async support is limited; no built-in validation, no auto-generated docs.

Consequences: Pydantic catches malformed requests at the boundary. The OpenAPI spec is generated from the code, so it stays current. Async lets I/O-bound requests proceed concurrently; CPU-bound diagnostic work must still run off the event loop. Trade-off: younger ecosystem than Django; some enterprise patterns require more manual setup.

ADR-003: PostgreSQL + pgvector Over MongoDB/TimescaleDB

Status: Accepted

Context: Three data needs: relational (assets, rules, diagnostics), vector similarity (semantic search for copilot), and time-series (sensor readings). Running three databases is operationally expensive for a small team.

Decision: PostgreSQL as the single engine, extended with pgvector for vector search. Time-series handled via partitioned tables.

Alternatives Considered:

MongoDB: document model with no enforced relations; expressing relational queries through aggregation pipelines is more awkward than SQL joins.
TimescaleDB: a PostgreSQL extension; adds management complexity alongside pgvector.
Pinecone / Weaviate: an external dependency and vendor lock-in that the current scale does not justify.
Separate databases per concern: three backup strategies, three failure modes.

Consequences: Single deployment, ACID guarantees for audit trails. Trade-off: PostgreSQL is not optimized for IoT-scale time-series; TimescaleDB can be added later.

ADR-004: SvelteKit Over React/Next.js

Status: Accepted

Context: Real-time diagnostic dashboards must render streaming sensor data on aging control-room workstations, so runtime performance matters.

Decision: SvelteKit with Svelte 5 runes. Svelte compiles components to minimal vanilla JavaScript, with no virtual DOM diffing and very little framework runtime in the browser.

Alternatives Considered:

Next.js (React): virtual DOM overhead on every re-render accumulates at 1s update intervals across dozens of components. Larger bundle size.
Remix: still React-based with the same runtime overhead.
Vue / Nuxt: smaller ecosystem than React without Svelte’s compile-time advantage.

Consequences: Smaller bundles, faster updates. Svelte 5 runes provide fine-grained reactivity. Trade-off: smaller talent pool, fewer component libraries.

ADR-005: Safe Expression Parser Over eval()

Status: Accepted

Context: The rule evaluator must parse and execute 451+ rule expressions at runtime. Rules are stored as strings in the database/CSV (e.g., "vibration_1x_radial > 4.5 AND vibration_2x_axial > 3.0"). These strings must be evaluated against live sensor data dictionaries. The obvious Python shortcut is eval(), which is also one of the most dangerous functions in the language: it executes whatever code the string contains. This is a critical security decision.

Decision: Custom recursive-descent parser (tokenize -> AST -> evaluate) implemented in rule_evaluator.py. Supports AND/OR logic, parentheses for grouping, comparison operators (>, <, >=, <=, ==, !=), semicolons-as-AND for compact notation, and computed vibration ratios via /. No user-supplied string is ever executed as Python code. Results are frozen dataclasses (see ADR-011).

Alternatives Considered:

Python eval() — one line of code, catastrophic security vulnerability. Any rule string reaching eval() can execute arbitrary Python including os.system(). Rejected unconditionally.
numexpr — safe for numeric expressions, but lacks boolean logic, string comparisons, and domain-specific syntax (semicolons-as-AND, ratio operators).
ANTLR / PLY compiler toolchain — correct in principle, but over-engineering for an expression language that fits in a recursive-descent parser under 300 lines.
Lua / embedded scripting — foreign runtime dependency for a pure-Python problem.

Consequences: Zero code-execution surface: the parser can only compare values, combine booleans, and compute ratios. Each evaluation returns a frozen RuleResult with matched/unmatched terms for full explainability. Trade-off: new syntax features require parser modifications — this is intentional; every new capability must be explicitly designed and reviewed.
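
The tokenize -> AST -> evaluate flow described in this ADR can be sketched in a few dozen lines. This is a minimal illustration, not the contents of rule_evaluator.py: the token pattern, the tuple-based AST, the function names, and the choice to give ";" the same precedence as AND are all assumptions made for the example. Parentheses here group boolean sub-expressions only.

```python
import re

# Assumed token set: numbers, sensor names, comparison operators,
# parentheses, "/" for ratios, and ";" as shorthand for AND.
NUMBER, NAME = r"\d+(?:\.\d+)?", r"[A-Za-z_]\w*"
TOKEN_RE = re.compile(rf"\s*({NUMBER}|{NAME}|>=|<=|==|!=|[><()/;])")
COMPARE = {
    ">": lambda a, b: a > b, "<": lambda a, b: a < b,
    ">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b,
    "==": lambda a, b: a == b, "!=": lambda a, b: a != b,
}

def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if not match:  # anything outside the grammar is rejected up front
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {tok!r}")
        self.i += 1
        return tok

    def parse(self):
        node = self.or_expr()
        if self.peek() is not None:
            raise ValueError(f"unexpected token {self.peek()!r}")
        return node

    def or_expr(self):  # lowest precedence
        node = self.and_expr()
        while self.peek() == "OR":
            self.take()
            node = ("or", node, self.and_expr())
        return node

    def and_expr(self):  # "AND" and ";" are treated identically
        node = self.comparison()
        while self.peek() in ("AND", ";"):
            self.take()
            node = ("and", node, self.comparison())
        return node

    def comparison(self):
        if self.peek() == "(":  # parentheses group boolean expressions
            self.take("(")
            node = self.or_expr()
            self.take(")")
            return node
        left, op = self.ratio(), self.take()
        if op not in COMPARE:
            raise ValueError(f"expected comparison operator, got {op!r}")
        return ("cmp", op, left, self.ratio())

    def ratio(self):
        node = self.operand()
        while self.peek() == "/":
            self.take()
            node = ("div", node, self.operand())
        return node

    def operand(self):
        tok = self.take()
        if re.fullmatch(NUMBER, tok):
            return ("num", float(tok))
        if re.fullmatch(NAME, tok) and tok not in ("AND", "OR"):
            return ("var", tok)
        raise ValueError(f"expected number or sensor name, got {tok!r}")

def evaluate(node, readings):
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "var":
        return readings[node[1]]  # KeyError if the reading is missing
    if kind == "div":
        return evaluate(node[1], readings) / evaluate(node[2], readings)
    if kind == "cmp":
        return COMPARE[node[1]](evaluate(node[2], readings), evaluate(node[3], readings))
    if kind == "and":
        return evaluate(node[1], readings) and evaluate(node[2], readings)
    return evaluate(node[1], readings) or evaluate(node[2], readings)  # "or"

def evaluate_rule(expression, readings):
    return bool(evaluate(Parser(tokenize(expression)).parse(), readings))
```

In this sketch a string such as "__import__('os')" fails in tokenize() with a ValueError, because quote characters are not part of the grammar; nothing in the evaluator can call a function or access an attribute.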

ADR-006: Monolith-First Over Microservices

Status: Accepted

Context: The System Architecture Blueprint (see Chapter 14) describes 17 microservices, which remain the correct target architecture. But RAPID AI is early-stage with a small team. Premature decomposition creates distributed debugging, network-boundary serialization, deployment orchestration, and the overhead of maintaining 17 separate CI/CD pipelines.

Decision: Build v1 as a single FastAPI application with a modular service layer. Each “service” from the Blueprint exists as a Python module within the monolith, communicating via function calls rather than HTTP. The module boundaries are designed so that extraction into independent services is a deployment decision, not an architectural refactoring.

Alternatives Considered:

Microservices from day one — architecturally pure but operationally premature. Network calls add latency, partial-failure handling, and observability requirements that a two-person team cannot sustain.
Event-driven architecture (Kafka/NATS) — excellent for decoupling at scale, but adds infrastructure (broker, schema registry, dead-letter queues) unjustified before product-market fit.
Serverless functions — cold-start latency unacceptable for real-time diagnostics.

Consequences: Single deployment simplifies CI/CD, debugging, and local development. Shared database eliminates distributed transaction complexity. Trade-off: cannot scale modules independently — acceptable at current volume. Extraction path is clear: wrap a module in its own FastAPI app + add HTTP/gRPC at the boundary.
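
One way to keep extraction a deployment decision is to have callers depend on an interface rather than on a concrete module. The sketch below assumes a hypothetical fusion module; the names FusionService, InProcessFusion, RemoteFusion, and diagnose are illustrative, the max() aggregation is a placeholder, and the transport is stubbed as a plain callable rather than a real HTTP or gRPC client.

```python
from typing import Callable, Protocol

class FusionService(Protocol):
    """Hypothetical module boundary: callers depend only on this interface."""
    def fuse(self, scores: dict[str, float]) -> float: ...

class InProcessFusion:
    """v1: a Python module inside the monolith, reached by a function call."""
    def fuse(self, scores: dict[str, float]) -> float:
        return max(scores.values(), default=0.0)  # placeholder aggregation

class RemoteFusion:
    """Later: same interface, backed by a network call (stubbed here)."""
    def __init__(self, transport: Callable[[dict], dict]):
        self._transport = transport

    def fuse(self, scores: dict[str, float]) -> float:
        return self._transport({"scores": scores})["fused"]

def diagnose(fusion: FusionService, scores: dict[str, float]) -> str:
    # Caller code is identical whichever implementation is injected.
    return "alert" if fusion.fuse(scores) >= 0.85 else "ok"
```

Swapping InProcessFusion for RemoteFusion changes wiring at startup, not the code in diagnose().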

ADR-007: Drizzle ORM Over Prisma/SQLAlchemy

Status: Accepted

Context: SvelteKit needs type-safe database access for SSR and API routes with PostgreSQL-specific features (pgvector, JSONB).

Decision: Drizzle ORM, for its SQL-like API, TypeScript-native design, types inferred directly from the schema definition (no separate code generation step), and lightweight runtime.

Alternatives Considered:

Prisma: heavier, schema-first; pgvector requires raw SQL escape hatches.
SQLAlchemy: Python only; forces every frontend data need through an API call.
Raw SQL: no type safety; every result is any unless manually typed.

Consequences: Type safety from schema to UI component. SQL-like API means no ORM-specific language to learn. Trade-off: newer, smaller community than Prisma.

ADR-008: better-auth Over NextAuth/Clerk

Status: Accepted

Context: Authentication must work with SvelteKit, support multiple strategies (email/password, OAuth, API keys), and be self-hosted for enterprise customers with air-gapped networks.

Decision: better-auth, which is framework-agnostic and self-hosted, with full control over auth flow, session storage, and token lifecycle.

Alternatives Considered:

Clerk: requires sending auth data to third-party servers. Enterprise customers with data-sovereignty requirements will reject this. Vendor lock-in.
Auth.js (formerly NextAuth): SvelteKit adapter historically unstable; optimized for Next.js.
Lucia: maintainer deprecated the project.

Consequences: Works in air-gapped environments. Framework-agnostic. Trade-off: less polished out-of-the-box UI; auth screens must be custom-built.

ADR-009: Data-Driven Rules Over Hardcoded Logic

Status: Accepted

Context: The earliest prototype used hardcoded if/elif chains: one function per asset type, one branch per failure mode. At 451+ rules across 19+ types, this does not scale.

Decision: Store rules as structured data (JSON/CSV), loaded at runtime, evaluated by a generic engine (ADR-005). Each rule is a row: asset type, failure mode, expression, severity, confidence weight, explanation text.

Alternatives Considered:

Hardcoded if/elif: combinatorial explosion; every new type needs a new function.
YAML/TOML configs: lack the structure and validation of a proper rule schema.
Database with admin UI: target state, deferred until rule schema stabilizes.

Consequences: Adding a diagnostic = adding a row, not code. Domain experts can author rules in spreadsheet format. Migration path: CSV/JSON moves to PostgreSQL with a rule-management UI (Blueprint Section 4.17).
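
A rule row with the six fields named in this ADR might be loaded as follows. This is a sketch under assumptions: the column names, the Rule class, the load_rules function, and the sample row (including its severity and weight values) are hypothetical, and the range check on the weight borrows the [0.0, 1.0] convention from ADR-012.

```python
import csv
import io
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    asset_type: str
    failure_mode: str
    expression: str
    severity: str
    confidence_weight: float
    explanation: str

COLUMNS = ("asset_type", "failure_mode", "expression",
           "severity", "confidence_weight", "explanation")

# Hypothetical sample data; real rule files may use different column names.
SAMPLE_CSV = (
    "asset_type,failure_mode,expression,severity,confidence_weight,explanation\n"
    "pump,angular_misalignment,"
    "vibration_1x_radial > 4.5 AND vibration_2x_axial > 3.0,"
    'high,0.8,"High 1x radial with elevated 2x axial"\n'
)

def load_rules(csv_text: str) -> list[Rule]:
    """Parse rule rows, rejecting incomplete rows and out-of-range weights."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rules = []
    for row in reader:
        missing = [c for c in COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            raise ValueError(f"line {reader.line_num}: missing {missing}")
        weight = float(row["confidence_weight"])
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"line {reader.line_num}: weight {weight} out of range")
        fields = {c: row[c].strip() for c in COLUMNS}
        fields["confidence_weight"] = weight
        rules.append(Rule(**fields))
    return rules
```

Validating at load time means a malformed spreadsheet row is caught before any expression reaches the evaluator.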

ADR-010: IMS as Central Knowledge Base

Status: Accepted

Context: Sensor evidence, failure modes, and actions must be connected. This mapping could live in code, separate tables per module, or a unified structure.

Decision: The 100x34 IMS (Integrated Matrix Structure) maps sensor evidence (rows) to failure modes (columns). Intersection cells encode confidence weights, severity levels, and action references. Every diagnostic module reads from this single structure.

Alternatives Considered:

Distributed per-module tables: consistency risks when failure modes are renamed.
Ontology / knowledge graph (RDF): specialized stack exceeding current needs.

Consequences: Single source of truth, human-reviewable in a spreadsheet. Trade-off: flat matrix cannot represent complex conditionals (handled by rule expressions instead).
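
The evidence-by-failure-mode lookup can be pictured as a sparse mapping keyed by (row, column). The sketch below is illustrative only: the Cell fields mirror the three things the ADR says a cell encodes, while the evidence names, failure-mode names, weights, severities, and action references are made-up placeholders rather than entries from the real IMS.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    confidence_weight: float
    severity: str
    action_ref: str

# Hypothetical three-cell excerpt; the real matrix is 100 rows by 34 columns.
IMS = {
    ("high_1x_radial", "angular_misalignment"): Cell(0.6, "medium", "ACT-012"),
    ("high_2x_axial", "angular_misalignment"): Cell(0.8, "high", "ACT-012"),
    ("high_1x_radial", "unbalance"): Cell(0.9, "medium", "ACT-003"),
}

def supporting_cells(observed: set[str]) -> dict[str, list[Cell]]:
    """Group the cells for the observed evidence rows by failure mode."""
    by_mode: dict[str, list[Cell]] = {}
    for (evidence, mode), cell in IMS.items():
        if evidence in observed:
            by_mode.setdefault(mode, []).append(cell)
    return by_mode
```

Because every module reads the same mapping, renaming a failure mode is a change in one place.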

ADR-011: Frozen Dataclasses for Engine Outputs

Status: Accepted

Context: Multiple engines (SEDL, Fusion, RUL, CDE) produce results consumed downstream. Accidental mutation after creation causes silently incorrect diagnostics.

Decision: All engine outputs use @dataclass(frozen=True). Assigning to a field after construction raises FrozenInstanceError. The RuleResult in rule_evaluator.py exemplifies this pattern.

Alternatives Considered:

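Mutable dataclasses: no mutation protection in a multi-module pipeline.
Named tuples: immutable, but they behave as plain tuples (indexable, unpackable, equal to any tuple with the same values) and lack dataclass ergonomics.
Pydantic frozen models: adds dependency in engine internals; Pydantic belongs at API boundaries (ADR-002).

Consequences: Results can be shared across modules and threads without defensive copies, provided the field values are themselves immutable (frozen=True blocks field reassignment, not mutation of a list or dict held in a field). Modification requires explicit dataclasses.replace(), making intent visible. Trade-off: slightly more verbose.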
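
The behavior this ADR relies on can be shown with a small example. The RuleResult fields below are assumptions chosen to match the matched/unmatched terms mentioned in ADR-005; the actual class in rule_evaluator.py may differ. Tuples are used for the term lists so the contents are immutable as well.

```python
from dataclasses import FrozenInstanceError, dataclass, replace

@dataclass(frozen=True)
class RuleResult:
    rule_id: str
    matched: bool
    matched_terms: tuple[str, ...] = ()
    unmatched_terms: tuple[str, ...] = ()

result = RuleResult("R-001", True, matched_terms=("vibration_1x_radial > 4.5",))

# Direct assignment is rejected at runtime.
try:
    result.matched = False
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# An intentional change produces a new object; the original is untouched.
revised = replace(result, matched=False)
```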

ADR-012: Confidence Scoring Standard

Status: Accepted

Context: Modules used incompatible confidence formats: text labels, 0-100 integers, 0.0-1.0 floats. Fusion across modules required ad-hoc conversion.

Decision: Canonical float in [0.0, 1.0] with shared confidence.py defining thresholds: >= 0.85 HIGH, >= 0.60 MEDIUM, >= 0.30 LOW, < 0.30 NEGLIGIBLE.

Alternatives Considered:

Text labels only: not computable; cannot average or threshold.
Integer 0-100: ambiguous when mixed with float outputs (is a value of 1 a confidence of 1% or of 100%?).
Per-module standards: O(n^2) conversion paths across 7+ modules.

Consequences: Fusion engine aggregates scores directly. Labels derived from scores, not the reverse. Trade-off: some granularity loss for categorical-output modules.
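
The thresholds in the Decision translate directly into a score-to-label function. The sketch below uses the threshold values stated above; the function names label and from_percent are hypothetical and are not taken from the actual confidence.py.

```python
HIGH, MEDIUM, LOW = 0.85, 0.60, 0.30

def label(score: float) -> str:
    """Derive the text label from a canonical score in [0.0, 1.0]."""
    if not 0.0 <= score <= 1.0:  # also rejects NaN
        raise ValueError(f"confidence {score!r} is outside [0.0, 1.0]")
    if score >= HIGH:
        return "HIGH"
    if score >= MEDIUM:
        return "MEDIUM"
    if score >= LOW:
        return "LOW"
    return "NEGLIGIBLE"

def from_percent(value: int) -> float:
    """Convert a legacy 0-100 integer score to the canonical float."""
    if not 0 <= value <= 100:
        raise ValueError(f"percent {value!r} is outside 0-100")
    return value / 100.0
```

Conversion happens once, at the module boundary; everything downstream sees only the canonical float.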

ADR-013: ISO 13374 Alignment

Status: Accepted

Context: Enterprise buyers evaluate vendors against industry standards. ISO 13374 defines a six-level condition monitoring hierarchy recognized across reliability engineering.

Decision: Align modules to ISO 13374: L1 (Ingestion) -> L2 (Signal Features, Module A) -> L3 (Condition Detection, Module B) -> L4 (Diagnostics, Modules C/D) -> L5 (Prognostics, Module E) -> L6 (Decision Support, Modules F/G).

Alternatives Considered:

Custom taxonomy: loses credibility and shared vocabulary with enterprise buyers.
MIMOSA OSA-CBM: adds complexity beyond module boundary definition needs.

Consequences: Reduces sales friction; industry-recognized vocabulary. Trade-off: some modules (CDE) span multiple ISO levels; pragmatic deviation is documented.

ADR-014: Module Pipeline Over Event-Driven Architecture

Status: Accepted

Context: Diagnostic modules must pass results from upstream (signal processing) to downstream (planning). Two patterns: strict forward pipeline or event-driven pub/sub.

Decision: Strict forward pipeline: A -> B -> C -> D -> E -> F -> G. Each module reads only the previous module’s output. No backward feedback loops in v1.

Alternatives Considered:

Event bus (pub/sub): non-deterministic execution order; debugging complexity.
Shared mutable state: hidden coupling, race conditions, violates ADR-011.
DAG scheduler: over-engineered for a linear pipeline.

Consequences: Deterministic and debuggable; if a module fails, the results of the modules that ran before it remain available. Trade-off: no backward feedback (Module B cannot adapt based on Module D). No parallelism between independent modules. Both are deliberate v1 simplifications.
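
A strict forward pipeline that keeps partial results on failure can be sketched as a simple loop. The run_pipeline function, the stage signature, and the error format are assumptions made for this example, not the project's actual orchestration code.

```python
from typing import Any, Callable, Optional

Stage = tuple[str, Callable[[Any], Any]]

def run_pipeline(stages: list[Stage], raw: Any) -> tuple[dict[str, Any], Optional[str]]:
    """Run stages in order; each stage sees only the previous stage's output.

    Returns the results collected so far and, if a stage failed,
    a message naming that stage. Later stages are not run.
    """
    results: dict[str, Any] = {}
    current = raw
    for name, stage in stages:
        try:
            current = stage(current)
        except Exception as exc:
            return results, f"{name}: {exc}"
        results[name] = current
    return results, None
```

Given the same stages and input, the execution order and outputs are always the same, which is the property the event-bus alternative gives up.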

ADR-015: Three-Track Product Model

Status: Accepted

Context: Diagnostic capability spans basic monitoring to design-out elimination. Different customers have different budgets and maturity. A single tier either prices out small plants or undervalues advanced capabilities.

Decision: Three solution tracks mapping to maintenance philosophies:

Track 1 (CBM, condition-based maintenance): Modules A+B. Signal processing, anomaly detection. SaaS pricing.
Track 2 (RCM, reliability-centered maintenance): Adds C+D+E. Fusion, root-cause, RUL. Premium subscription.
Track 3 (Design-Out): Adds F+G. CDE, design recommendations. Consulting + platform.

Alternatives Considered:

Single all-inclusive tier: prices out small plants; Track 3 needs consulting.
Per-module pricing: confusing menu of 7+ modules.
Usage-based pricing: unpredictable revenue, incentivizes minimal usage.

Consequences: Natural upsell path (Track 1 -> 2 -> 3). Each track maps to a recognized methodology. Track 3 creates high-margin consulting revenue. Trade-off: three onboarding flows and support tiers.

Index of Decisions

ADR	Title	Status
001	Physics-Based Rules Over Machine Learning	Accepted
002	FastAPI Over Django/Express	Accepted
003	PostgreSQL + pgvector Over MongoDB/TimescaleDB	Accepted
004	SvelteKit Over React/Next.js	Accepted
005	Safe Expression Parser Over eval()	Accepted
006	Monolith-First Over Microservices	Accepted
007	Drizzle ORM Over Prisma/SQLAlchemy	Accepted
008	better-auth Over NextAuth/Clerk	Accepted
009	Data-Driven Rules Over Hardcoded Logic	Accepted
010	IMS as Central Knowledge Base	Accepted
011	Frozen Dataclasses for Engine Outputs	Accepted
012	Confidence Scoring Standard	Accepted
013	ISO 13374 Alignment	Accepted
014	Module Pipeline Over Event-Driven Architecture	Accepted
015	Three-Track Product Model	Accepted

When adding new ADRs, append with the next sequential number. Never delete or renumber existing records. If a decision is reversed, change its Status to “Superseded by ADR-XXX” and create the new ADR explaining the reversal.

Standards Alignment

Standard	Relevance to This Chapter
ISO 13374: Condition monitoring and diagnostics of machines	ADR-013 (ISO 13374 Alignment) maps the modules to the standard’s six levels. ADR-001 (Physics-Based Rules Over Machine Learning) supports that alignment by keeping diagnostic processing deterministic and auditable at every level.
OWASP Top 10: Web application security	ADR-005 (Safe Expression Parser Over eval()) directly addresses OWASP A03:2021 (Injection), documenting the security rationale as a permanent architectural record.
IEC 62443: Industrial cybersecurity	Architecture decisions regarding network boundaries, service isolation, and the BFF pattern implement IEC 62443’s defense-in-depth requirements for industrial control system security.

Changelog

Version	Date	Author	Changes
2.1.0	2026-03-17	Rick D	Added standards alignment, living doc metadata, changelog
2.0.0	2026-03-17	Rick D	Enriched with production codebase content
1.0.0	2026-03-17	Rick D	Initial chapter creation