

Chapter 23 — Operations & Deployment Guide


Operating RAPID AI in production means keeping a diagnostic intelligence engine healthy so it can keep industrial equipment healthy. This chapter covers everything from cloning the repository to handling a 3 AM database failure. The system is designed to be stateless at the application layer and data-heavy at the database layer — which simplifies deployment but demands disciplined database operations.


| Dependency | Version | Purpose |
| --- | --- | --- |
| Python | 3.13+ | FastAPI engine, diagnostic pipeline, swarm agents |
| uv | Latest | Fast Python package installer (Astral) |
| Bun | 1.3+ | JS runtime, package manager, build tool, production server |
| PostgreSQL | 17 with pgvector | Primary data store, vector similarity search |
| Docker & Docker Compose | Latest stable | PostgreSQL for dev, full-stack containerization |
| Git | 2.40+ | Source control |

# Clone the monorepo
git clone https://github.com/your-org/rapid-ai.git
cd rapid-ai
# Install all JS workspaces (bun resolves workspace:* automatically)
bun install
# Build shared packages (required before first dev run)
make build-packages # builds @rapidai/contracts + @rapidai/agents
# Install Python backend dependencies
cd apps/engine
uv pip install -e ".[dev]"

Engine (rapid/.env):

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| DATABASE_URL | Yes | | PostgreSQL connection string |
| GEMINI_API_KEY | No | | Google Gemini API key |
| OPENAI_API_KEY | No | | OpenAI fallback |
| CLOUDFLARE_ACCOUNT_ID | No | | CF account for AI Gateway |
| CLOUDFLARE_AI_GATEWAY_ID | No | | AI Gateway identifier |
| CLOUDFLARE_AIG_TOKEN | No | | AI Gateway auth token |
| LOG_LEVEL | No | info | Structured logging level |
| RAPID_FEATURE_* | No | 1 | Feature flags (set 0 to disable) |

Horizon (horizon/.env):

| Variable | Required | Default | Purpose |
| --- | --- | --- | --- |
| DATABASE_URL | Yes | | PostgreSQL (Drizzle + better-auth) |
| BETTER_AUTH_SECRET | Yes | | Auth session secret (32+ chars) |
| ORIGIN | No | http://localhost:5173 | SvelteKit origin |
| RAPID_AI_ENGINE_URL | No | http://localhost:8000 | Python API base URL |

Never commit .env files. The repository includes .env.example templates.
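
A startup check can catch a missing or too-short variable before the first request does. The following is a minimal sketch that assumes the variable names, defaults, and the 32-character rule from the tables above; `validate_env` and `feature_enabled` are hypothetical helpers, not functions from the RAPID AI codebase.

```python
# Names and defaults mirror the Engine and Horizon tables; the helpers are illustrative.
ENGINE_REQUIRED = ("DATABASE_URL",)
HORIZON_REQUIRED = ("DATABASE_URL", "BETTER_AUTH_SECRET")
HORIZON_DEFAULTS = {
    "ORIGIN": "http://localhost:5173",
    "RAPID_AI_ENGINE_URL": "http://localhost:8000",
}


def validate_env(env, required, defaults=None):
    """Return (config, problems) without raising, so every problem is reported at once."""
    config = dict(defaults or {})
    # Empty strings count as "not set" and do not override a default.
    config.update({key: value for key, value in env.items() if value})
    problems = [
        f"{name} is required but not set" for name in required if not config.get(name)
    ]
    secret = config.get("BETTER_AUTH_SECRET")
    if secret is not None and len(secret) < 32:
        problems.append("BETTER_AUTH_SECRET must be at least 32 characters")
    return config, problems


def feature_enabled(env, flag):
    """RAPID_FEATURE_* flags default to 1 (enabled); setting 0 disables them."""
    return env.get(f"RAPID_FEATURE_{flag}", "1") != "0"
```

Passing `os.environ` as `env` checks the live process environment; passing a plain dict keeps the check easy to unit test.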

# Start only PostgreSQL (Docker)
make dev-db
# Start Python API (native)
make dev-rapid # uvicorn at :8000
# Start SvelteKit (native)
make dev-horizon # dev server at :5173
# Start both API + Explorer
make dev
# Full Docker stack with hot-reload
make dev-docker # docker compose watch
# Database management
make db-push # Push Drizzle schema
make db-studio # Open Drizzle Studio
make db-generate # Generate migration files
make db-migrate # Run pending migrations
# Testing
make test # All tests (rapid + horizon)
make test-rapid # Python pytest (488 tests)
make test-horizon # SvelteKit type check
make lint-api # Python ruff
# Build
make build-packages # Build @rapidai/* packages
make build # Production Explorer build
make build-docker # Build all Docker images
# Docker lifecycle
make up # Start all services
make up-build # Rebuild and start
make down # Stop all services
make logs # Tail all service logs
make clean # Remove build artifacts
make help # Show all targets

The engine starts on http://localhost:8000, with Swagger UI at http://localhost:8000/docs and a health check at http://localhost:8000/health.

SvelteKit Frontend:

cd frontend
bun run dev # or: npm run dev

The frontend starts on http://localhost:5173.

After PostgreSQL is running with pgvector enabled, create the schema and load the IMS seed data:

# Run Drizzle migrations first (creates all tables)
cd frontend
bun run db:push # or: npx drizzle-kit push
# Load seed data (808 rows across 9 tables)
psql -U rapid -d rapidai -f platform/data/00_run_all_seed_inserts.sql

The master seed script loads tables in dependency-safe order: asset_master first (100 rows), then functional_failures, failure_modes, sensor_evidence_rules, rcm_rules, maintenance_tasks, dashboard_output_mappings, schema_relation_map (100 rows each), and finally table_dictionary (8 rows).

Verify the seed loaded correctly:

SELECT 'asset_master' AS tbl, count(*) FROM asset_master
UNION ALL SELECT 'schema_relation_map', count(*) FROM schema_relation_map;
-- Expected: 100, 100
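
The spot check above covers two tables; the same idea extends to all nine. This sketch assumes the row counts stated in the seed description and leaves running the query to whatever database client is in use. `count_query` and `seed_mismatches` are hypothetical names.

```python
# Expected row counts, taken from the seed description (808 rows across 9 tables).
EXPECTED_SEED_COUNTS = {
    "asset_master": 100,
    "functional_failures": 100,
    "failure_modes": 100,
    "sensor_evidence_rules": 100,
    "rcm_rules": 100,
    "maintenance_tasks": 100,
    "dashboard_output_mappings": 100,
    "schema_relation_map": 100,
    "table_dictionary": 8,
}


def count_query(tables):
    """Build one UNION ALL query returning (table_name, row_count) per table.

    Table names are interpolated directly, so pass only trusted constants.
    """
    parts = [f"SELECT '{t}' AS tbl, count(*) FROM {t}" for t in tables]
    return "\nUNION ALL ".join(parts) + ";"


def seed_mismatches(actual_counts, expected=EXPECTED_SEED_COUNTS):
    """Return {table: (expected, actual)} for every table whose count is off."""
    return {
        table: (want, actual_counts.get(table))
        for table, want in expected.items()
        if actual_counts.get(table) != want
    }
```

An empty result from `seed_mismatches` means every table holds the expected number of rows; a missing table shows up with an actual count of `None`.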

Drizzle ORM owns all DDL. Python (SQLAlchemy Core) mirrors the schema with Table objects for type-safe reads and writes but never creates or alters tables. This prevents two ORMs from fighting over the same database.

# Generate migration from schema changes
bun run drizzle-kit generate
# Apply pending migrations
bun run drizzle-kit migrate
# Push schema directly (development only)
bun run drizzle-kit push

Migration files are committed to git and applied in order during deployment. Never hand-edit a migration file after it has been applied to any environment.

Seed data follows versioned SQL files. When Dibyendu provides updated IMS rows or new failure mode libraries:

  1. Generate new SQL insert files from updated CSVs
  2. Create a migration that truncates and re-inserts affected tables
  3. Test against a local database before merging
  4. Apply to staging, validate diagnostic outputs, then apply to production
# Full backup (daily, automated via cron)
pg_dump -U rapid -d rapidai -Fc -f rapidai_$(date +%Y%m%d_%H%M%S).dump
# WAL archiving (continuous, for point-in-time recovery)
# Configure in postgresql.conf:
# archive_mode = on
# archive_command = 'cp %p /backup/wal/%f'
# Restore from full backup
pg_restore -U rapid -d rapidai --clean --if-exists rapidai_20260317_030000.dump
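# Point-in-time recovery needs a physical base backup; WAL cannot be replayed onto a pg_dump restore
pg_basebackup -U rapid -D /backup/base -Ft -z -P   # target path is an example
# To recover: unpack the base backup into an empty data directory, set restore_command and
# recovery_target_time in postgresql.conf, create recovery.signal, then start PostgreSQL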
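
Daily dumps accumulate, so the cron job usually needs a pruning step. The sketch below assumes the `rapidai_YYYYMMDD_HHMMSS.dump` naming used above and a 30-day retention window; `expired_dumps` is a hypothetical helper that only selects files and leaves deletion to the caller.

```python
import re
from datetime import datetime, timedelta

# Matches the names produced by the pg_dump command above.
DUMP_NAME = re.compile(r"^rapidai_(\d{8}_\d{6})\.dump$")


def expired_dumps(filenames, now, retention_days=30):
    """Return dump files whose embedded timestamp is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match is None:
            continue  # never touch files that do not follow the naming scheme
        taken_at = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        if taken_at < cutoff:
            expired.append(name)
    return sorted(expired)
```

Reading the timestamp from the file name rather than the file's modification time keeps the result stable when dumps are copied between hosts or buckets.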

RAPID AI stores 768-dimensional embeddings for rule vectors and analysis vectors. The IVFFlat index requires periodic maintenance:

-- Check index health
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname LIKE '%vector%';
-- Rebuild index after large embedding batch inserts
REINDEX INDEX CONCURRENTLY idx_rule_vectors_embedding;
REINDEX INDEX CONCURRENTLY idx_analysis_vectors_embedding;
-- If switching to HNSW (recommended for >100K vectors):
CREATE INDEX CONCURRENTLY idx_rule_vectors_hnsw
ON rule_vectors USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Enable slow query log (queries > 200ms)
ALTER SYSTEM SET log_min_duration_statement = 200;
SELECT pg_reload_conf();
-- Check index usage (low idx_scan = unused index)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Check table bloat
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_live_tup > 0
ORDER BY dead_pct DESC;

docker-compose.yml
services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: rapid
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: rapidai
    volumes:
      - pgdata:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U rapid"]
      interval: 10s
      timeout: 5s
      retries: 5
  backend:
    build: ./backend
    ports:
      - "8000:8000"
    environment:
      DATABASE_URL: postgresql+asyncpg://rapid:${DB_PASSWORD}@postgres:5432/rapidai
    depends_on:
      postgres:
        condition: service_healthy
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://rapid:${DB_PASSWORD}@postgres:5432/rapidai
      PUBLIC_API_BASE: http://backend:8000/rapid-ai/v1
    depends_on:
      - backend
volumes:
  pgdata:

Option 1: Containerized VPS (Recommended for MVP)

Deploy Docker Compose to a single VPS (4 vCPU, 8 GB RAM minimum). Use a reverse proxy (Caddy or Traefik) for TLS termination. Suitable for single-plant deployments with up to 500 assets.

Option 2: Kubernetes

For multi-plant, multi-tenant deployments. Requires Helm charts with:

  • Resource limits: backend 1 CPU / 2 GB, frontend 0.5 CPU / 512 MB
  • Liveness probe: GET /health every 30s
  • Readiness probe: GET /health every 10s (backend must respond with database connectivity confirmation)
  • HPA: scale backend pods on CPU > 70%, targeting 2-8 replicas
  • PVC for PostgreSQL with regular snapshot backups

Option 3: Serverless Split

  • Frontend: Vercel (SvelteKit adapter-vercel) or Cloudflare Pages
  • Backend: Railway or Render (Docker container with persistent PostgreSQL add-on)
  • Database: Neon (serverless PostgreSQL with pgvector) or Supabase

This option works well for teams that want zero infrastructure management, but introduces latency between frontend and backend due to network hops.

| Variable | Staging | Production |
| --- | --- | --- |
| LOG_LEVEL | DEBUG | WARNING |
| ENVIRONMENT | staging | production |
| CORS_ORIGINS | staging domain | production domain |
| DATABASE_URL | staging DB | production DB (separate instance) |
| RATE_LIMIT_RPM | 600 | 120 |

Use Caddy for automatic HTTPS via Let’s Encrypt:

# Caddyfile
rapidai.example.com {
    reverse_proxy /rapid-ai/v1/* backend:8000
    reverse_proxy /* frontend:3000
}

For Kubernetes, use cert-manager with a ClusterIssuer for Let’s Encrypt certificates.


| Metric | Target | Alert Threshold |
| --- | --- | --- |
| Diagnostic request latency (p95) | < 200ms | > 500ms |
| Asset query latency (p95) | < 50ms | > 150ms |
| Swarm agent completion time | < 30s | > 60s |
| Rule evaluation throughput | > 500 rules/sec | < 200 rules/sec |
| Knowledge search latency (RAG) | < 300ms | > 800ms |
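
To make the p95 rows concrete, here is one way to compute a 95th percentile from raw latency samples and compare it with the alert thresholds. It is a sketch using the nearest-rank definition; a metrics backend may interpolate differently, and the metric keys are hypothetical.

```python
# Alert thresholds in milliseconds, copied from the latency rows of the table.
P95_ALERT_MS = {
    "diagnostic_request": 500,
    "asset_query": 150,
    "knowledge_search": 800,
}


def p95(samples):
    """Nearest-rank 95th percentile: smallest value with at least 95% of samples at or below it."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]


def breached(metric, samples):
    """True when the p95 of the samples exceeds the alert threshold for the metric."""
    return p95(samples) > P95_ALERT_MS[metric]
```

With 20 samples the rank is 19, so a single slow outlier does not trip the alert but two do, which is the behavior a p95 target is meant to give.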

Track these in a separate dashboard from infrastructure metrics:

  • Total assets monitored (by plant, by type)
  • Diagnostics completed per day / per week
  • Failure modes detected (top 10 by frequency)
  • Accuracy feedback: operator confirmed vs. overridden diagnoses
  • RCM tasks generated vs. tasks completed
  • Copilot questions per day, response satisfaction

All services emit structured JSON logs with correlation IDs:

{
  "timestamp": "2026-03-17T14:22:01.332Z",
  "level": "INFO",
  "service": "backend",
  "correlation_id": "diag-7f3a9b2c",
  "asset_id": "AST-PUMP-001",
  "module": "mB",
  "message": "Fault detection complete",
  "failure_modes_matched": 3,
  "duration_ms": 47
}

Log levels: ERROR (system failures, data corruption), WARNING (degraded quality scores, provider fallbacks), INFO (diagnostic completions, API requests), DEBUG (rule evaluation traces, individual sensor readings).
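
A log line of that shape can be produced with the standard library alone. This is a minimal sketch, not the project's actual logging setup: `JsonFormatter` and `build_logger` are hypothetical names, and extra fields travel in a `context` dict because `module` and `message` are reserved attribute names on Python's `LogRecord`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields from the sample above."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        stamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": stamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # Extra context (asset_id, module, duration_ms, ...) arrives via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(service, stream=None):
    """Return a logger that writes one JSON line per record to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(f"rapid.{service}")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `log.info("Fault detection complete", extra={"correlation_id": "diag-7f3a9b2c", "context": {"module": "mB", "duration_ms": 47}})` then emits a single JSON line carrying those fields.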

Configure PagerDuty or OpsGenie for:

  • P1 (page immediately): Database unreachable, all diagnostic requests failing, backend process crash
  • P2 (alert within 15 min): Diagnostic latency > 1s sustained, provider fallback chain exhausted, disk > 90%
  • P3 (daily digest): Elevated error rates, slow query warnings, certificate expiry within 14 days

Create three dashboards:

  1. System Health: Request rates, latency histograms, error rates, database connections, CPU/memory
  2. Diagnostic Pipeline: Module-by-module latency breakdown (mA through mE), rule match rates, confidence distributions
  3. Business Overview: Assets monitored, diagnostics per day, failure mode frequency, RCM task completion rates

The ReferenceLoader loads IMS rows, failure mode libraries, system profiles, and block scoring parameters. These are effectively static between rule updates:

  • Strategy: Lazy-load on first request, cache in memory indefinitely
  • Invalidation: Clear cache on POST /admin/schema/reload or on application restart
  • Memory footprint: ~15 MB for the full IMS (100 rows x 34 columns) plus all rule libraries
  • Impact: Eliminates ~40ms of database reads per diagnostic request
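
The lazy-load and invalidate pattern described in the list can be sketched in a few lines. This is an illustration of the caching strategy only, not the real ReferenceLoader; `CachedReference` and its `load` callable are hypothetical.

```python
import threading


class CachedReference:
    """Load a value on first use and keep it until reload() is called."""

    def __init__(self, load):
        self._load = load  # hypothetical callable that reads IMS rows and rule libraries
        self._lock = threading.Lock()
        self._value = None
        self._loaded = False

    def get(self):
        """Return the cached value, loading it once on the first call."""
        with self._lock:
            if not self._loaded:
                self._value = self._load()
                self._loaded = True
            return self._value

    def reload(self):
        """Drop the cached value; the next get() reads fresh data."""
        with self._lock:
            self._value = None
            self._loaded = False
```

A handler for POST /admin/schema/reload would call `reload()`; taking the lock in `get()` trades a little contention for the guarantee that concurrent first requests trigger only one load.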
-- Essential indexes for diagnostic queries
CREATE INDEX idx_analysis_results_asset_created
ON analysis_results (asset_id, created_at DESC);
CREATE INDEX idx_sensor_readings_asset_ts
ON sensor_readings (asset_id, timestamp DESC);
-- Materialized view for dashboard plant overview
CREATE MATERIALIZED VIEW mv_plant_overview AS
SELECT
a.plant_id,
a.asset_type,
count(*) AS asset_count,
avg(ar.ssi_score) AS avg_ssi,
count(*) FILTER (WHERE ar.health_stage = 'critical') AS critical_count
FROM assets a
LEFT JOIN LATERAL (
SELECT ssi_score, health_stage
FROM analysis_results
WHERE asset_id = a.asset_id
ORDER BY created_at DESC LIMIT 1
) ar ON true
GROUP BY a.plant_id, a.asset_type;
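-- CONCURRENTLY requires a unique index on the materialized view (index name is illustrative)
CREATE UNIQUE INDEX idx_mv_plant_overview_key
ON mv_plant_overview (plant_id, asset_type);
-- Refresh on schedule (every 5 minutes)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_plant_overview;
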
| Endpoint | Target | Optimization |
| --- | --- | --- |
| POST /assets/{id}/end-to-end-evaluate | < 200ms | Cached ReferenceLoader, parallel mB workers |
| GET /assets/{id} | < 50ms | Indexed lookup, no joins |
| GET /dashboard/plant-overview | < 100ms | Materialized view |
| POST /copilot/query | < 3s | Streaming SSE response |

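# asyncpg pool configuration (backend)
import os

from sqlalchemy.ext.asyncio import create_async_engine

DATABASE_URL = os.environ["DATABASE_URL"]  # postgresql+asyncpg://...

engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,        # persistent connections kept in the pool
    max_overflow=10,     # extra connections allowed during bursts
    pool_timeout=30,     # seconds to wait for a free connection
    pool_recycle=3600,   # recycle connections hourly
    pool_pre_ping=True,  # verify connections before use
)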

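For production with multiple backend replicas, the total number of pooled connections across all replicas must stay below PostgreSQL’s max_connections (default 100). With 4 replicas at pool_size=20, the steady-state total is 80 connections, leaving 20 for migrations, monitoring, and ad-hoc queries. Because max_overflow=10 lets each replica burst to 30 connections, the worst case is 120, which exceeds the default limit; lower max_overflow or raise max_connections if all replicas can burst at once.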
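
The connection arithmetic is easy to get wrong when replica counts change, so it helps to compute it. This is a small sketch using the pool settings shown above; the function names and the 20-connection reserve are assumptions for illustration.

```python
def connection_budget(replicas, pool_size, max_overflow, max_connections=100):
    """Compare steady-state and burst connection usage with the server limit."""
    steady = replicas * pool_size
    peak = replicas * (pool_size + max_overflow)
    return {
        "steady": steady,
        "peak": peak,
        "steady_headroom": max_connections - steady,
        "peak_headroom": max_connections - peak,  # negative means bursts can be refused
    }


def max_safe_replicas(pool_size, max_overflow, max_connections=100, reserved=20):
    """Largest replica count whose worst-case usage still leaves `reserved` connections free."""
    return (max_connections - reserved) // (pool_size + max_overflow)
```

For 4 replicas with pool_size=20 and max_overflow=10, the budget shows 80 steady and 120 peak connections, and `max_safe_replicas(20, 10)` returns 2 under the default limit.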


# Python (weekly CI job)
pip-audit --strict --desc
# Node.js (weekly CI job)
bun audit # or: npm audit --audit-level=high
# Container images (on every build)
trivy image rapidai-backend:latest
trivy image rapidai-frontend:latest
  • AI provider keys (Gemini, OpenAI, Cloudflare): rotate every 90 days
  • BETTER_AUTH_SECRET: rotate every 180 days, with a 24-hour overlap window where both old and new secrets are valid
  • Database credentials: rotate on security incidents or personnel changes
  • All rotations are logged in the audit trail
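
The rotation intervals above can be checked mechanically, for example in the same weekly CI job as the audits. The sketch assumes the date of each secret's last rotation is recorded somewhere (a hypothetical mapping here) and uses the 90- and 180-day intervals from the list; database credentials are event-driven and therefore omitted.

```python
from datetime import date

# Rotation intervals in days, from the policy list above.
ROTATION_DAYS = {
    "GEMINI_API_KEY": 90,
    "OPENAI_API_KEY": 90,
    "CLOUDFLARE_AIG_TOKEN": 90,
    "BETTER_AUTH_SECRET": 180,
}


def rotations_due(last_rotated, today):
    """Return {name: days_overdue} for every secret at or past its rotation interval.

    A secret with no recorded rotation date is reported with None.
    """
    due = {}
    for name, interval in ROTATION_DAYS.items():
        rotated_on = last_rotated.get(name)
        if rotated_on is None:
            due[name] = None  # no record: treat as due, age unknown
            continue
        overdue = (today - rotated_on).days - interval
        if overdue >= 0:
            due[name] = overdue
    return due
```

An empty result means nothing needs rotating; a value of 0 means the secret reaches its interval today.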

Every state-changing API request is logged to the audit_log table:

-- Audit log schema
CREATE TABLE audit_log (
id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
user_id UUID NOT NULL REFERENCES users(id),
action TEXT NOT NULL, -- 'diagnose', 'update_rule', 'export_report'
resource TEXT NOT NULL, -- 'asset:AST-PUMP-001', 'rule:DG005'
details JSONB, -- request payload summary
ip_address INET
);
  1. Detect: Automated alert fires (PagerDuty)
  2. Triage: On-call engineer assesses severity (P1-P3)
  3. Contain: If data breach, revoke compromised credentials immediately
  4. Investigate: Review audit logs, structured application logs, database query history
  5. Resolve: Deploy fix, verify with smoke tests
  6. Postmortem: Document within 48 hours, identify preventive measures

Schedule annual penetration tests covering: API endpoint fuzzing, SQL injection attempts against diagnostic payloads, authentication bypass attempts, and SSRF through the copilot’s knowledge retrieval.


| Objective | Target | Rationale |
| --- | --- | --- |
| RTO (Recovery Time Objective) | 4 hours | Maximum acceptable downtime before plant operators lose diagnostic capability |
| RPO (Recovery Point Objective) | 1 hour | Maximum data loss window; diagnostic results from the last hour may need re-running |

| Backup Type | Frequency | Retention | Storage |
| --- | --- | --- | --- |
| WAL archiving | Continuous | 7 days | S3/GCS bucket |
| Full database dump | Daily at 02:00 UTC | 30 days | S3/GCS bucket |
| Database snapshot (cloud) | Daily | 14 days | Cloud provider |

Application failure (backend or frontend crash): Application is stateless. Restart the container or redeploy from the latest image. No data loss.

Database corruption or loss:

# 1. Provision new PostgreSQL instance with pgvector
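# 2. Restore from latest full backup (recovers to the time of that dump)
pg_restore -U rapid -d rapidai rapidai_latest.dump
# 3. For point-in-time recovery, restore the latest physical base backup (pg_basebackup) instead,
#    set restore_command and recovery_target_time in postgresql.conf, and create recovery.signal;
#    PostgreSQL then replays archived WAL on startup (recovery.conf was removed in PostgreSQL 12)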
# 4. Verify seed data integrity
psql -U rapid -d rapidai -c "SELECT count(*) FROM schema_relation_map;"
# Expected: 100
# 5. Restart application services
docker compose up -d backend frontend

Rule data loss: All rule definitions (YAML profiles, block scores, fusion weights) are versioned in git. Redeploy from the repository. The POST /admin/schema/reload endpoint forces the ReferenceLoader to re-read from the database and filesystem.

| Scenario | Symptom | Recovery Steps |
| --- | --- | --- |
| Database unreachable | All API requests return 503 | Check PostgreSQL process, restart if crashed, failover to replica if available |
| Corrupted embeddings | Knowledge search returns irrelevant results | Re-run embedding generation pipeline, rebuild pgvector index |
| AI provider outage | Copilot and swarm requests fail | Provider Registry auto-falls back: Gemini > OpenAI > Cloudflare > Template. If all fail, copilot returns “service degraded” message |
| Disk full | Write operations fail | Clear old WAL files, expand volume, verify pg_wal directory size |
| Memory exhaustion | Backend OOM kills | Reduce connection pool size, add swap temporarily, investigate query memory usage |
| Seed data mismatch | Diagnostic results nonsensical | Re-run 00_run_all_seed_inserts.sql, clear ReferenceLoader cache via /admin/schema/reload |
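
The provider fallback in the AI provider outage row follows a common ordered-chain pattern. The sketch below illustrates that pattern under stated assumptions; it is not the Provider Registry's code, and `ProviderError`, `complete_with_fallback`, and the provider callables are hypothetical.

```python
class ProviderError(Exception):
    """Raised by a provider callable when it cannot answer."""


def complete_with_fallback(prompt, providers, degraded="service degraded"):
    """Try providers in order and stop at the first that succeeds.

    `providers` is an ordered list of (name, callable) pairs, for example
    Gemini, OpenAI, Cloudflare, then a template responder.
    Returns (provider_name, answer, errors); provider_name is None when all fail.
    """
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt), errors
        except ProviderError as exc:
            errors[name] = str(exc)  # remember why this link failed, then move on
    return None, degraded, errors
```

Returning the collected errors alongside the answer lets the caller log a WARNING for each fallback without the chain itself depending on a logger.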

Test the full recovery procedure quarterly:

  1. Take a fresh backup
  2. Provision a clean environment
  3. Restore from backup
  4. Run the diagnostic test suite against restored data
  5. Verify all 100 IMS rows, all rule libraries, and a sample diagnostic request
  6. Document recovery time and any issues encountered

| Standard | Relevance to This Chapter |
| --- | --- |
| IEC 62443: Industrial cybersecurity | The deployment procedures, environment variable management, and database backup/recovery processes implement IEC 62443’s operational security requirements for industrial automation systems. |
| NIST SP 800-82: Guide to Industrial Control Systems security | The operations guide follows NIST SP 800-82 guidelines for securing industrial control system components, including network segmentation, access control, and incident response procedures. |
| OWASP Top 10: Web application security | The deployment configuration (CORS, rate limiting, authentication) addresses OWASP Top 10 vulnerabilities in the production environment through documented operational procedures. |

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |