Chapter 23 — Operations & Deployment Guide

Operating RAPID AI in production means keeping a diagnostic intelligence engine healthy so it can keep industrial equipment healthy. This chapter covers everything from cloning the repository to handling a 3 AM database failure. The system is designed to be stateless at the application layer and data-heavy at the database layer — which simplifies deployment but demands disciplined database operations.

23.1 Development Environment Setup

Prerequisites

Dependency	Version	Purpose
Python	3.13+	FastAPI engine, diagnostic pipeline, swarm agents
uv	Latest	Fast Python package installer (Astral)
Bun	1.3+	JS runtime, package manager, build tool, production server
PostgreSQL	17 with pgvector	Primary data store, vector similarity search
Docker & Docker Compose	Latest stable	PostgreSQL for dev, full-stack containerization
Git	2.40+	Source control

Clone and Install

# Clone the monorepo
git clone https://github.com/your-org/rapid-ai.git
cd rapid-ai

# Install all JS workspaces (bun resolves workspace:* automatically)
bun install

# Build shared packages (required before first dev run)
make build-packages     # builds @rapidai/contracts + @rapidai/agents

# Install Python backend dependencies
cd apps/engine
uv pip install -e ".[dev]"

Environment Variables

Engine (rapid/.env):

Variable	Required	Default	Purpose
`DATABASE_URL`	Yes	—	PostgreSQL connection string
`GEMINI_API_KEY`	No	—	Google Gemini API key
`OPENAI_API_KEY`	No	—	OpenAI fallback
`CLOUDFLARE_ACCOUNT_ID`	No	—	CF account for AI Gateway
`CLOUDFLARE_AI_GATEWAY_ID`	No	—	AI Gateway identifier
`CLOUDFLARE_AIG_TOKEN`	No	—	AI Gateway auth token
`LOG_LEVEL`	No	`info`	Structured logging level
`RAPID_FEATURE_*`	No	`1`	Feature flags (set `0` to disable)

Horizon (horizon/.env):

Variable	Required	Default	Purpose
`DATABASE_URL`	Yes	—	PostgreSQL (Drizzle + better-auth)
`BETTER_AUTH_SECRET`	Yes	—	Auth session secret (32+ chars)
`ORIGIN`	No	`http://localhost:5173`	SvelteKit origin
`RAPID_AI_ENGINE_URL`	No	`http://localhost:8000`	Python API base URL

Never commit .env files. The repository includes .env.example templates.
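
A startup check along the following lines can catch a missing or malformed variable before a service accepts traffic. This is a minimal sketch, not the project's actual configuration loader: the names `validate_env` and `feature_enabled` are hypothetical, and the rules are taken only from the two tables above (required variables, the 32-character minimum for `BETTER_AUTH_SECRET`, and feature flags that default to enabled).

```python
from collections.abc import Mapping

# Required variables per service, taken from the tables above.
REQUIRED = {
    "engine": ["DATABASE_URL"],
    "horizon": ["DATABASE_URL", "BETTER_AUTH_SECRET"],
}


def validate_env(service: str, env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the config is usable."""
    problems = [f"{name} is not set" for name in REQUIRED[service] if not env.get(name)]
    secret = env.get("BETTER_AUTH_SECRET", "")
    if service == "horizon" and secret and len(secret) < 32:
        problems.append("BETTER_AUTH_SECRET is shorter than 32 characters")
    return problems


def feature_enabled(flag: str, env: Mapping[str, str]) -> bool:
    """RAPID_FEATURE_* flags default to enabled; only the value '0' disables them."""
    return env.get(f"RAPID_FEATURE_{flag}", "1") != "0"
```

Passing `os.environ` as `env` checks the live process environment; passing a plain dict keeps the check easy to unit-test.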

Running Locally (Makefile Targets)

# Start only PostgreSQL (Docker)
make dev-db

# Start Python API (native)
make dev-rapid        # uvicorn at :8000

# Start SvelteKit (native)
make dev-horizon      # dev server at :5173

# Start both API + Explorer
make dev

# Full Docker stack with hot-reload
make dev-docker       # docker compose watch

# Database management
make db-push          # Push Drizzle schema
make db-studio        # Open Drizzle Studio
make db-generate      # Generate migration files
make db-migrate       # Run pending migrations

# Testing
make test             # All tests (rapid + horizon)
make test-rapid       # Python pytest (488 tests)
make test-horizon     # SvelteKit type check
make lint-api         # Python ruff

# Build
make build-packages   # Build @rapidai/* packages
make build            # Production Explorer build
make build-docker     # Build all Docker images

# Docker lifecycle
make up               # Start all services
make up-build         # Rebuild and start
make down             # Stop all services
make logs             # Tail all service logs
make clean            # Remove build artifacts
make help             # Show all targets

The engine starts on http://localhost:8000. Swagger UI at http://localhost:8000/docs. Health check at http://localhost:8000/health.

SvelteKit Frontend:

cd frontend
bun run dev                      # or: npm run dev

The frontend starts on http://localhost:5173.

Seed Data Loading

After PostgreSQL is running with pgvector enabled, create the schema and load the IMS seed data:

# Push the Drizzle schema first (creates all tables)
cd frontend
bun run db:push                  # or: npx drizzle-kit push

# Load seed data (808 rows across 9 tables)
psql -U rapid -d rapidai -f platform/data/00_run_all_seed_inserts.sql

The master seed script loads tables in dependency-safe order: asset_master first (100 rows), then functional_failures, failure_modes, sensor_evidence_rules, rcm_rules, maintenance_tasks, dashboard_output_mappings, schema_relation_map (100 rows each), and finally table_dictionary (8 rows).

Verify the seed loaded correctly:

SELECT 'asset_master' AS tbl, count(*) FROM asset_master
UNION ALL SELECT 'schema_relation_map', count(*) FROM schema_relation_map;
-- Expected: 100, 100
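
The spot check above covers two tables. A fuller check could compare every table against the counts listed in the seed description. The sketch below assumes the actual counts have already been collected (for example with one `SELECT count(*)` per table) into a dict; `seed_mismatches` is a hypothetical helper name.

```python
# Expected row counts per table, taken from the seed description above.
EXPECTED_ROWS = {
    "asset_master": 100,
    "functional_failures": 100,
    "failure_modes": 100,
    "sensor_evidence_rules": 100,
    "rcm_rules": 100,
    "maintenance_tasks": 100,
    "dashboard_output_mappings": 100,
    "schema_relation_map": 100,
    "table_dictionary": 8,
}


def seed_mismatches(actual: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map each table whose count differs to (expected, actual)."""
    return {
        table: (expected, actual.get(table, 0))
        for table, expected in EXPECTED_ROWS.items()
        if actual.get(table, 0) != expected
    }
```

An empty result means all 9 tables hold the expected 808 rows in total.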

23.2 Database Operations

Migration Strategy

Drizzle ORM owns all DDL. Python (SQLAlchemy Core) mirrors the schema with Table objects for type-safe reads and writes but never creates or alters tables. This prevents two ORMs from fighting over the same database.

# Generate migration from schema changes
bun run drizzle-kit generate

# Apply pending migrations
bun run drizzle-kit migrate

# Push schema directly (development only)
bun run drizzle-kit push

Migration files are committed to git and applied in order during deployment. Never hand-edit a migration file after it has been applied to any environment.

Seed Data Management

Seed data follows versioned SQL files. When Dibyendu provides updated IMS rows or new failure mode libraries:

Generate new SQL insert files from updated CSVs
Create a migration that truncates and re-inserts affected tables
Test against a local database before merging
Apply to staging, validate diagnostic outputs, then apply to production

Backup and Restore

# Full backup (daily, automated via cron)
pg_dump -U rapid -d rapidai -Fc -f rapidai_$(date +%Y%m%d_%H%M%S).dump

# WAL archiving (continuous, for point-in-time recovery)
# Configure in postgresql.conf:
#   archive_mode = on
#   archive_command = 'cp %p /backup/wal/%f'

# Restore from full backup
pg_restore -U rapid -d rapidai --clean --if-exists rapidai_20260317_030000.dump

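# Point-in-time recovery
# pg_dump archives are logical backups and cannot be rolled forward with WAL.
# PITR requires a physical base backup (pg_basebackup) plus the archived WAL:
#   1. Restore the base backup into an empty data directory
#   2. Set restore_command and recovery_target_time in postgresql.conf
#   3. Create recovery.signal in the data directory and start PostgreSQL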

pgvector Index Maintenance

RAPID AI stores 768-dimensional embeddings for rule vectors and analysis vectors. IVFFlat computes its cluster centroids when the index is built, so recall can degrade after large inserts and the index needs periodic rebuilding:

-- Check index health
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname LIKE '%vector%';

-- Rebuild index after large embedding batch inserts
REINDEX INDEX CONCURRENTLY idx_rule_vectors_embedding;
REINDEX INDEX CONCURRENTLY idx_analysis_vectors_embedding;

-- If switching to HNSW (recommended for >100K vectors):
CREATE INDEX CONCURRENTLY idx_rule_vectors_hnsw
  ON rule_vectors USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
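
When an IVFFlat index is rebuilt, its `lists` parameter is worth revisiting as the table grows. The pgvector documentation suggests starting with rows / 1000 for up to 1M rows and sqrt(rows) above that, with `probes` around sqrt(lists) at query time. The sketch below encodes those starting points; the function names are hypothetical, and the results are tuning starting values rather than measured optima for this workload.

```python
import math


def ivfflat_lists(row_count: int) -> int:
    """Starting value for the IVFFlat `lists` option, per pgvector's guidance."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return round(math.sqrt(row_count))


def ivfflat_probes(lists: int) -> int:
    """Starting value for `ivfflat.probes`; higher favors recall, lower favors speed."""
    return max(1, round(math.sqrt(lists)))
```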

Performance Monitoring

-- Enable slow query log (queries > 200ms)
ALTER SYSTEM SET log_min_duration_statement = 200;
SELECT pg_reload_conf();

-- Check index usage (low idx_scan = unused index)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;

-- Check table bloat
SELECT relname, n_dead_tup, n_live_tup,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_live_tup > 0
ORDER BY dead_pct DESC;

23.3 Deployment

Docker Compose (Development and Staging)

services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: rapid
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: rapidai
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U rapid"]
      interval: 10s
      timeout: 5s
      retries: 5

  backend:
    build: ./backend
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: postgresql+asyncpg://rapid:${DB_PASSWORD}@postgres:5432/rapidai
    depends_on:
      postgres:
        condition: service_healthy

  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://rapid:${DB_PASSWORD}@postgres:5432/rapidai
      PUBLIC_API_BASE: http://backend:8000/rapid-ai/v1
    depends_on:
      - backend

volumes:
  pgdata:

Production Options

Option 1: Containerized VPS (Recommended for MVP)

Deploy Docker Compose to a single VPS (4 vCPU, 8 GB RAM minimum). Use a reverse proxy (Caddy or Traefik) for TLS termination. Suitable for single-plant deployments with up to 500 assets.

Option 2: Kubernetes

For multi-plant, multi-tenant deployments. Requires Helm charts with:

Resource limits: backend 1 CPU / 2 GB, frontend 0.5 CPU / 512 MB
Liveness probe: GET /health every 30s
Readiness probe: GET /health every 10s (backend must respond with database connectivity confirmation)
HPA: scale backend pods on CPU > 70%, targeting 2-8 replicas
PVC for PostgreSQL with regular snapshot backups
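
The HPA line above can be reasoned about with the autoscaler's documented formula, desired = ceil(current × currentMetric / targetMetric), clamped to the replica bounds. The sketch below reproduces that arithmetic for the 70% CPU target and the 2-8 replica range; `desired_replicas` is a hypothetical helper, and it deliberately ignores the autoscaler's tolerance band and stabilization windows.

```python
import math


def desired_replicas(
    current: int,
    cpu_pct: float,
    target_pct: float = 70.0,
    min_replicas: int = 2,
    max_replicas: int = 8,
) -> int:
    """Replica count the HPA formula would request, clamped to the allowed range."""
    raw = math.ceil(current * cpu_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 3 replicas averaging 90% CPU yield ceil(3 × 90 / 70) = 4 replicas, while any sustained load that would call for more than 8 is capped at 8.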

Option 3: Serverless Split

Frontend: Vercel (SvelteKit adapter-vercel) or Cloudflare Pages
Backend: Railway or Render (Docker container with persistent PostgreSQL add-on)
Database: Neon (serverless PostgreSQL with pgvector) or Supabase

This option works well for teams that want zero infrastructure management, but introduces latency between frontend and backend due to network hops.

Environment Configuration

Variable	Staging	Production
`LOG_LEVEL`	DEBUG	WARNING
`ENVIRONMENT`	staging	production
`CORS_ORIGINS`	staging domain	production domain
`DATABASE_URL`	staging DB	production DB (separate instance)
`RATE_LIMIT_RPM`	600	120

SSL/TLS and DNS

Use Caddy for automatic HTTPS via Let’s Encrypt:

# Caddyfile
rapidai.example.com {
    reverse_proxy /rapid-ai/v1/* backend:8000
    reverse_proxy /* frontend:3000
}

For Kubernetes, use cert-manager with a ClusterIssuer for Let’s Encrypt certificates.

23.4 Monitoring & Observability

Application Metrics

Metric	Target	Alert Threshold
Diagnostic request latency (p95)	< 200ms	> 500ms
Asset query latency (p95)	< 50ms	> 150ms
Swarm agent completion time	< 30s	> 60s
Rule evaluation throughput	> 500 rules/sec	< 200 rules/sec
Knowledge search latency (RAG)	< 300ms	> 800ms
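
To make the latency rows concrete, the sketch below computes a nearest-rank p95 from a window of samples and classifies it against a target and an alert threshold from the table. It is an illustration under assumptions: in practice the percentile would normally come from the metrics backend, and the names `percentile` and `latency_status` are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_status(samples_ms: list[float], target_ms: float, alert_ms: float) -> str:
    """Return 'ok' below target, 'warn' between target and alert, 'alert' above."""
    p95 = percentile(samples_ms, 95)
    if p95 > alert_ms:
        return "alert"
    if p95 >= target_ms:
        return "warn"
    return "ok"
```

With the diagnostic request row, `latency_status(samples, 200, 500)` stays "ok" as long as no more than 5% of requests are slow.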

Business Metrics

Track these in a separate dashboard from infrastructure metrics:

Total assets monitored (by plant, by type)
Diagnostics completed per day / per week
Failure modes detected (top 10 by frequency)
Accuracy feedback: operator confirmed vs. overridden diagnoses
RCM tasks generated vs. tasks completed
Copilot questions per day, response satisfaction

Logging Strategy

All services emit structured JSON logs with correlation IDs:

{
  "timestamp": "2026-03-17T14:22:01.332Z",
  "level": "INFO",
  "service": "backend",
  "correlation_id": "diag-7f3a9b2c",
  "asset_id": "AST-PUMP-001",
  "module": "mB",
  "message": "Fault detection complete",
  "failure_modes_matched": 3,
  "duration_ms": 47
}

Log levels: ERROR (system failures, data corruption), WARNING (degraded quality scores, provider fallbacks), INFO (diagnostic completions, API requests), DEBUG (rule evaluation traces, individual sensor readings).
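
One way to produce log lines in the shape shown above is a small formatter for Python's standard `logging` module. This is a sketch rather than the project's logging setup: the class name `JsonFormatter` and the convention of passing per-request fields under an `extra={"context": {...}}` key are assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} land on the record as an attribute.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)
```

A handler configured with `JsonFormatter("backend")` would then accept calls such as `logger.info("Fault detection complete", extra={"context": {"correlation_id": "diag-7f3a9b2c", "duration_ms": 47}})`.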

Alerting

Configure PagerDuty or OpsGenie for:

P1 (page immediately): Database unreachable, all diagnostic requests failing, backend process crash
P2 (alert within 15 min): Diagnostic latency > 1s sustained, provider fallback chain exhausted, disk > 90%
P3 (daily digest): Elevated error rates, slow query warnings, certificate expiry within 14 days

Grafana Dashboards

Create three dashboards:

System Health: Request rates, latency histograms, error rates, database connections, CPU/memory
Diagnostic Pipeline: Module-by-module latency breakdown (mA through mE), rule match rates, confidence distributions
Business Overview: Assets monitored, diagnostics per day, failure mode frequency, RCM task completion rates

23.5 Performance Optimization

ReferenceLoader Caching

The ReferenceLoader loads IMS rows, failure mode libraries, system profiles, and block scoring parameters. These are effectively static between rule updates:

Strategy: Lazy-load on first request, cache in memory indefinitely
Invalidation: Clear cache on POST /admin/schema/reload or on application restart
Memory footprint: ~15 MB for the full IMS (100 rows x 34 columns) plus all rule libraries
Impact: Eliminates ~40ms of database reads per diagnostic request
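
The lazy-load and invalidate behaviour described above can be sketched as a small wrapper. This is not the actual ReferenceLoader: `CachedReference` is a hypothetical class, and the `load` callable stands in for whatever reads the IMS rows and rule libraries from the database and filesystem.

```python
from collections.abc import Callable
from threading import Lock
from typing import Any


class CachedReference:
    """Load reference data on first use and keep it until explicitly invalidated."""

    def __init__(self, load: Callable[[], Any]) -> None:
        self._load = load
        self._lock = Lock()
        self._value: Any = None
        self._loaded = False

    def get(self) -> Any:
        with self._lock:
            if not self._loaded:
                self._value = self._load()  # the expensive read happens only here
                self._loaded = True
            return self._value

    def invalidate(self) -> None:
        """Drop the cached value; the next get() reloads it."""
        with self._lock:
            self._value = None
            self._loaded = False
```

A handler for POST /admin/schema/reload would call `invalidate()`, so the next diagnostic request pays the reload cost once and later requests hit memory again.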

Database Query Optimization

-- Essential indexes for diagnostic queries
CREATE INDEX idx_analysis_results_asset_created
  ON analysis_results (asset_id, created_at DESC);

CREATE INDEX idx_sensor_readings_asset_ts
  ON sensor_readings (asset_id, timestamp DESC);

-- Materialized view for dashboard plant overview
CREATE MATERIALIZED VIEW mv_plant_overview AS
SELECT
  a.plant_id,
  a.asset_type,
  count(*) AS asset_count,
  avg(ar.ssi_score) AS avg_ssi,
  count(*) FILTER (WHERE ar.health_stage = 'critical') AS critical_count
FROM assets a
LEFT JOIN LATERAL (
  SELECT ssi_score, health_stage
  FROM analysis_results
  WHERE asset_id = a.asset_id
  ORDER BY created_at DESC LIMIT 1
) ar ON true
GROUP BY a.plant_id, a.asset_type;

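-- CONCURRENTLY requires a unique index on the materialized view
CREATE UNIQUE INDEX idx_mv_plant_overview_key
  ON mv_plant_overview (plant_id, asset_type);

-- Refresh on schedule (every 5 minutes)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_plant_overview;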

API Response Time Targets

Endpoint	Target	Optimization
`POST /assets/{id}/end-to-end-evaluate`	< 200ms	Cached ReferenceLoader, parallel mB workers
`GET /assets/{id}`	< 50ms	Indexed lookup, no joins
`GET /dashboard/plant-overview`	< 100ms	Materialized view
`POST /copilot/query`	< 3s	Streaming SSE response

Connection Pooling

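# SQLAlchemy async engine pool configuration (backend, asyncpg driver)
import os

from sqlalchemy.ext.asyncio import create_async_engine

DATABASE_URL = os.environ["DATABASE_URL"]

engine = create_async_engine(
    DATABASE_URL,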
    pool_size=20,          # concurrent connections
    max_overflow=10,       # burst capacity
    pool_timeout=30,       # wait for connection
    pool_recycle=3600,     # recycle connections hourly
    pool_pre_ping=True,    # verify connections before use
)

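For production with multiple backend replicas, the total number of connections across all replicas must not exceed PostgreSQL’s max_connections (default 100). With 4 replicas at pool_size=20, the steady-state total is 80 connections, but max_overflow=10 lets each replica burst to 30, or 120 in total, which exceeds the default limit. Size the pool so that replicas × (pool_size + max_overflow) stays below max_connections with headroom for migrations, monitoring, and ad-hoc queries. For example, pool_size=15 with max_overflow=5 caps 4 replicas at 80 connections and leaves 20 free. Otherwise, raise max_connections or place a pooler such as PgBouncer in front of the database.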

23.6 Security Operations

Dependency Vulnerability Scanning

# Python (weekly CI job)
pip-audit --strict --desc

# Node.js (weekly CI job)
bun audit               # or: npm audit --audit-level=high

# Container images (on every build)
trivy image rapidai-backend:latest
trivy image rapidai-frontend:latest

API Key Rotation

AI provider keys (Gemini, OpenAI, Cloudflare): rotate every 90 days
BETTER_AUTH_SECRET: rotate every 180 days, with a 24-hour overlap window where both old and new secrets are valid
Database credentials: rotate on security incidents or personnel changes
All rotations are logged in the audit trail

Access Logging and Audit Trail

Every state-changing API request is logged to the audit_log table:

-- Audit log schema
CREATE TABLE audit_log (
  id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  timestamp   TIMESTAMPTZ NOT NULL DEFAULT now(),
  user_id     UUID NOT NULL REFERENCES users(id),
  action      TEXT NOT NULL,          -- 'diagnose', 'update_rule', 'export_report'
  resource    TEXT NOT NULL,          -- 'asset:AST-PUMP-001', 'rule:DG005'
  details     JSONB,                  -- request payload summary
  ip_address  INET
);

Incident Response Procedure

Detect: Automated alert fires (PagerDuty)
Triage: On-call engineer assesses severity (P1-P3)
Contain: If data breach, revoke compromised credentials immediately
Investigate: Review audit logs, structured application logs, database query history
Resolve: Deploy fix, verify with smoke tests
Postmortem: Document within 48 hours, identify preventive measures

Penetration Testing

Schedule annual penetration tests covering: API endpoint fuzzing, SQL injection attempts against diagnostic payloads, authentication bypass attempts, and SSRF through the copilot’s knowledge retrieval.

23.7 Disaster Recovery

Recovery Objectives

Objective	Target	Rationale
RTO (Recovery Time Objective)	4 hours	Maximum acceptable downtime before plant operators lose diagnostic capability
RPO (Recovery Point Objective)	1 hour	Maximum data loss window; diagnostic results from the last hour may need re-running

Backup Schedule

Backup Type	Frequency	Retention	Storage
WAL archiving	Continuous	7 days	S3/GCS bucket
Full database dump	Daily at 02:00 UTC	30 days	S3/GCS bucket
Database snapshot (cloud)	Daily	14 days	Cloud provider
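
Retention for the daily dumps can be enforced by a small pruning job. The sketch below only decides which files are past the 30-day window; it assumes the `rapidai_YYYYMMDD_HHMMSS.dump` naming used by the backup command earlier in this chapter, treats the timestamps in the names as being in the same timezone as `now`, and leaves the actual deletion (local disk or bucket lifecycle rule) to the caller. `expired_dumps` is a hypothetical name.

```python
import re
from datetime import datetime, timedelta

# Matches the naming scheme produced by the daily pg_dump command.
DUMP_NAME = re.compile(r"^rapidai_(\d{8}_\d{6})\.dump$")


def expired_dumps(filenames: list[str], now: datetime, retention_days: int = 30) -> list[str]:
    """Return the dump files whose embedded timestamp is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        taken = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)
```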

Recovery Procedures

Application failure (backend or frontend crash): Application is stateless. Restart the container or redeploy from the latest image. No data loss.

Database corruption or loss:

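# 1. Provision new PostgreSQL instance with pgvector
# 2. Restore from latest full backup (recovers to the time of the dump)
pg_restore -U rapid -d rapidai rapidai_latest.dump

# 3. To reach a later recovery point, restore a physical base backup instead
#    of the dump and replay archived WAL: set restore_command and
#    recovery_target_time, then create recovery.signal
#    (recovery.conf was removed in PostgreSQL 12)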

# 4. Verify seed data integrity
psql -U rapid -d rapidai -c "SELECT count(*) FROM schema_relation_map;"
# Expected: 100

# 5. Restart application services
docker compose up -d backend frontend

Rule data loss: All rule definitions (YAML profiles, block scores, fusion weights) are versioned in git. Redeploy from the repository. The POST /admin/schema/reload endpoint forces the ReferenceLoader to re-read from the database and filesystem.

Failure Scenario Runbook

Scenario	Symptom	Recovery Steps
Database unreachable	All API requests return 503	Check PostgreSQL process, restart if crashed, failover to replica if available
Corrupted embeddings	Knowledge search returns irrelevant results	Re-run embedding generation pipeline, rebuild pgvector index
AI provider outage	Copilot and swarm requests fail	Provider Registry auto-falls back: Gemini > OpenAI > Cloudflare > Template. If all fail, copilot returns “service degraded” message
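Disk full	Write operations fail	Check `pg_wal` directory size, fix a failing `archive_command` or stale replication slot, prune old archived WAL and dumps, expand volume. Never delete files from `pg_wal` by hand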
Memory exhaustion	Backend OOM kills	Reduce connection pool size, add swap temporarily, investigate query memory usage
Seed data mismatch	Diagnostic results nonsensical	Re-run `00_run_all_seed_inserts.sql`, clear ReferenceLoader cache via `/admin/schema/reload`
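
The provider fallback described in the AI provider outage row amounts to trying each provider in order and returning a fixed message when all of them fail. The sketch below shows that pattern only; it is not the Provider Registry's code, `answer_with_fallback` is a hypothetical name, and the callables stand in for real provider clients.

```python
from collections.abc import Callable, Sequence

DEGRADED_MESSAGE = "service degraded"


def answer_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first that succeeds."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # A real implementation would log the failure at WARNING before moving on.
            continue
    return "none", DEGRADED_MESSAGE
```

Ordering the `providers` sequence as Gemini, OpenAI, Cloudflare, Template reproduces the chain in the table, with the template provider acting as the last resort before the degraded message.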

Disaster Recovery Testing

Test the full recovery procedure quarterly:

Take a fresh backup
Provision a clean environment
Restore from backup
Run the diagnostic test suite against restored data
Verify all 100 IMS rows, all rule libraries, and a sample diagnostic request
Document recovery time and any issues encountered

Standards Alignment

Standard	Relevance to This Chapter
IEC 62443 — Industrial cybersecurity	The deployment procedures, environment variable management, and database backup/recovery processes implement IEC 62443’s operational security requirements for industrial automation systems.
NIST SP 800-82 — Guide to Industrial Control Systems security	The operations guide draws on NIST SP 800-82 guidance for securing industrial control system components, notably credential management, audit logging, and incident response procedures.
OWASP Top 10 — Web application security	The deployment configuration (CORS, rate limiting, authentication) addresses OWASP Top 10 vulnerabilities in the production environment through documented operational procedures.

Changelog

Version	Date	Author	Changes
2.1.0	2026-03-17	Rick D	Added standards alignment, living doc metadata, changelog
2.0.0	2026-03-17	Rick D	Enriched with production codebase content
1.0.0	2026-03-17	Rick D	Initial chapter creation