Operations and Deployment
Chapter 23 — Operations & Deployment Guide
Operating RAPID AI in production means keeping a diagnostic intelligence engine healthy so it can keep industrial equipment healthy. This chapter covers everything from cloning the repository to handling a 3 AM database failure. The system is designed to be stateless at the application layer and data-heavy at the database layer, which simplifies deployment but demands disciplined database operations.
23.1 Development Environment Setup
Prerequisites
| Dependency | Version | Purpose |
|---|---|---|
| Python | 3.13+ | FastAPI engine, diagnostic pipeline, swarm agents |
| uv | Latest | Fast Python package installer (Astral) |
| Bun | 1.3+ | JS runtime, package manager, build tool, production server |
| PostgreSQL | 17 with pgvector | Primary data store, vector similarity search |
| Docker & Docker Compose | Latest stable | PostgreSQL for dev, full-stack containerization |
| Git | 2.40+ | Source control |
Clone and Install

    # Clone the monorepo
    git clone https://github.com/your-org/rapid-ai.git
    cd rapid-ai

    # Install all JS workspaces (bun resolves workspace:* automatically)
    bun install

    # Build shared packages (required before first dev run)
    make build-packages   # builds @rapidai/contracts + @rapidai/agents

    # Install Python backend dependencies
    cd apps/engine
    uv pip install -e ".[dev]"

Environment Variables
Engine (rapid/.env):
| Variable | Required | Default | Purpose |
|---|---|---|---|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| GEMINI_API_KEY | No | - | Google Gemini API key |
| OPENAI_API_KEY | No | - | OpenAI fallback |
| CLOUDFLARE_ACCOUNT_ID | No | - | CF account for AI Gateway |
| CLOUDFLARE_AI_GATEWAY_ID | No | - | AI Gateway identifier |
| CLOUDFLARE_AIG_TOKEN | No | - | AI Gateway auth token |
| LOG_LEVEL | No | info | Structured logging level |
| RAPID_FEATURE_* | No | 1 | Feature flags (set 0 to disable) |
Horizon (horizon/.env):
| Variable | Required | Default | Purpose |
|---|---|---|---|
| DATABASE_URL | Yes | - | PostgreSQL (Drizzle + better-auth) |
| BETTER_AUTH_SECRET | Yes | - | Auth session secret (32+ chars) |
| ORIGIN | No | http://localhost:5173 | SvelteKit origin |
| RAPID_AI_ENGINE_URL | No | http://localhost:8000 | Python API base URL |
Never commit .env files. The repository includes .env.example templates.
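A startup check can turn a missing required variable into an immediate, readable failure instead of a connection error later. The sketch below is a minimal illustration, not the project's actual settings loader: the function name `load_engine_settings` and the shape of the returned dictionary are assumptions, while the variable names and defaults come from the Engine table.

```python
import os
from typing import Mapping

# Names and defaults mirror the Engine table; the loader itself is hypothetical.
REQUIRED = ("DATABASE_URL",)
DEFAULTS = {"LOG_LEVEL": "info"}


def load_engine_settings(env: Mapping[str, str] = os.environ) -> dict[str, object]:
    """Return validated settings, raising if a required variable is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )

    settings: dict[str, object] = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)

    # RAPID_FEATURE_* flags default to enabled; only an explicit "0" disables one.
    settings["disabled_features"] = sorted(
        name
        for name, value in env.items()
        if name.startswith("RAPID_FEATURE_") and value == "0"
    )
    return settings
```

Calling the function once at process start keeps configuration errors out of the request path.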
Running Locally (Makefile Targets)

    # Start only PostgreSQL (Docker)
    make dev-db

    # Start Python API (native)
    make dev-rapid        # uvicorn at :8000

    # Start SvelteKit (native)
    make dev-horizon      # dev server at :5173

    # Start both API + Explorer
    make dev

    # Full Docker stack with hot-reload
    make dev-docker       # docker compose watch

    # Database management
    make db-push          # Push Drizzle schema
    make db-studio        # Open Drizzle Studio
    make db-generate      # Generate migration files
    make db-migrate       # Run pending migrations

    # Testing
    make test             # All tests (rapid + horizon)
    make test-rapid       # Python pytest (488 tests)
    make test-horizon     # SvelteKit type check
    make lint-api         # Python ruff

    # Build
    make build-packages   # Build @rapidai/* packages
    make build            # Production Explorer build
    make build-docker     # Build all Docker images

    # Docker lifecycle
    make up               # Start all services
    make up-build         # Rebuild and start
    make down             # Stop all services
    make logs             # Tail all service logs
    make clean            # Remove build artifacts
    make help             # Show all targets

The engine starts on http://localhost:8000. Swagger UI is at http://localhost:8000/docs and the health check is at http://localhost:8000/health.
SvelteKit Frontend:

    cd frontend
    bun run dev   # or: npm run dev

The frontend starts on http://localhost:5173.
Seed Data Loading
After PostgreSQL is running with pgvector enabled, create the schema and load the IMS seed data:

    # Run Drizzle migrations first (creates all tables)
    cd frontend
    bun run db:push   # or: npx drizzle-kit push

    # Load seed data (808 rows across 9 tables)
    psql -U rapid -d rapidai -f platform/data/00_run_all_seed_inserts.sql

The master seed script loads tables in dependency-safe order: asset_master first (100 rows), then functional_failures, failure_modes, sensor_evidence_rules, rcm_rules, maintenance_tasks, dashboard_output_mappings, schema_relation_map (100 rows each), and finally table_dictionary (8 rows).
Verify the seed loaded correctly:

    SELECT 'asset_master' AS tbl, count(*) FROM asset_master
    UNION ALL
    SELECT 'schema_relation_map', count(*) FROM schema_relation_map;
    -- Expected: 100, 100

23.2 Database Operations
Migration Strategy
Drizzle ORM owns all DDL. Python (SQLAlchemy Core) mirrors the schema with Table objects for type-safe reads and writes but never creates or alters tables. This prevents two ORMs from fighting over the same database.

    # Generate migration from schema changes
    bun run drizzle-kit generate

    # Apply pending migrations
    bun run drizzle-kit migrate

    # Push schema directly (development only)
    bun run drizzle-kit push

Migration files are committed to git and applied in order during deployment. Never hand-edit a migration file after it has been applied to any environment.
Seed Data Management
Seed data follows versioned SQL files. When Dibyendu provides updated IMS rows or new failure mode libraries:
- Generate new SQL insert files from updated CSVs
- Create a migration that truncates and re-inserts affected tables
- Test against a local database before merging
- Apply to staging, validate diagnostic outputs, then apply to production
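The first step, turning an updated CSV into SQL insert statements, can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual generator: the function names are hypothetical, empty CSV fields are assumed to mean NULL, and table and column names are assumed to come from trusted files, since they are not escaped.

```python
import csv
import io
from typing import Optional


def sql_literal(value: Optional[str]) -> str:
    """Quote a CSV field as a SQL string literal; empty fields become NULL (assumption)."""
    if value is None or value == "":
        return "NULL"
    # Doubling single quotes is the standard SQL escape inside string literals.
    return "'" + value.replace("'", "''") + "'"


def csv_to_inserts(table: str, csv_text: str) -> list[str]:
    """Turn CSV text with a header row into one INSERT statement per data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    statements = []
    for row in reader:
        columns = ", ".join(row.keys())
        values = ", ".join(sql_literal(v) for v in row.values())
        statements.append(f"INSERT INTO {table} ({columns}) VALUES ({values});")
    return statements
```

Writing the generated statements to a new versioned file, rather than editing an existing one, keeps the seed history reviewable in git.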
Backup and Restore

    # Full backup (daily, automated via cron)
    pg_dump -U rapid -d rapidai -Fc -f rapidai_$(date +%Y%m%d_%H%M%S).dump

    # WAL archiving (continuous, for point-in-time recovery)
    # Configure in postgresql.conf:
    #   archive_mode = on
    #   archive_command = 'cp %p /backup/wal/%f'

    # Restore from full backup
    pg_restore -U rapid -d rapidai --clean --if-exists rapidai_20260317_030000.dump

    # Point-in-time recovery
    # WAL replay needs a physical base backup (pg_basebackup); a pg_dump archive
    # cannot be rolled forward. Restore the base backup into an empty data directory,
    # set restore_command and recovery_target_time, create recovery.signal, then start
    # PostgreSQL. A dump restores only to the moment it was taken:
    pg_restore -U rapid -d rapidai_recovery rapidai_20260317_030000.dump

pgvector Index Maintenance
RAPID AI stores 768-dimensional embeddings for rule vectors and analysis vectors. The IVFFlat index requires periodic maintenance:

    -- Check index health
    SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
    FROM pg_stat_user_indexes
    WHERE indexrelname LIKE '%vector%';

    -- Rebuild index after large embedding batch inserts
    REINDEX INDEX CONCURRENTLY idx_rule_vectors_embedding;
    REINDEX INDEX CONCURRENTLY idx_analysis_vectors_embedding;

    -- If switching to HNSW (recommended for >100K vectors):
    CREATE INDEX CONCURRENTLY idx_rule_vectors_hnsw
      ON rule_vectors USING hnsw (embedding vector_cosine_ops)
      WITH (m = 16, ef_construction = 64);

Performance Monitoring

    -- Enable slow query log (queries > 200ms)
    ALTER SYSTEM SET log_min_duration_statement = 200;
    SELECT pg_reload_conf();

    -- Check index usage (low idx_scan = unused index)
    -- pg_stat_user_indexes exposes relname and indexrelname, not tablename/indexname
    SELECT schemaname, relname, indexrelname, idx_scan
    FROM pg_stat_user_indexes
    ORDER BY idx_scan ASC;

    -- Check table bloat
    SELECT relname, n_dead_tup, n_live_tup,
           round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) AS dead_pct
    FROM pg_stat_user_tables
    WHERE n_live_tup > 0
    ORDER BY dead_pct DESC;

23.3 Deployment
Docker Compose (Development and Staging)

    services:
      postgres:
        image: pgvector/pgvector:pg17
        environment:
          POSTGRES_USER: rapid
          POSTGRES_PASSWORD: ${DB_PASSWORD}
          POSTGRES_DB: rapidai
        volumes:
          - pgdata:/var/lib/postgresql/data
        ports:
          - "5432:5432"
        healthcheck:
          test: ["CMD-SHELL", "pg_isready -U rapid"]
          interval: 10s
          timeout: 5s
          retries: 5

      backend:
        build: ./backend
        ports:
          - "8000:8000"
        environment:
          DATABASE_URL: postgresql+asyncpg://rapid:${DB_PASSWORD}@postgres:5432/rapidai
        depends_on:
          postgres:
            condition: service_healthy

      frontend:
        build: ./frontend
        ports:
          - "3000:3000"
        environment:
          DATABASE_URL: postgresql://rapid:${DB_PASSWORD}@postgres:5432/rapidai
          PUBLIC_API_BASE: http://backend:8000/rapid-ai/v1
        depends_on:
          - backend

    volumes:
      pgdata:

Production Options
Option 1: Containerized VPS (Recommended for MVP)
Deploy Docker Compose to a single VPS (4 vCPU, 8 GB RAM minimum). Use a reverse proxy (Caddy or Traefik) for TLS termination. Suitable for single-plant deployments with up to 500 assets.
Option 2: Kubernetes
For multi-plant, multi-tenant deployments. Requires Helm charts with:
- Resource limits: backend 1 CPU / 2 GB, frontend 0.5 CPU / 512 MB
- Liveness probe: GET /health every 30s
- Readiness probe: GET /health every 10s (backend must respond with database connectivity confirmation)
- HPA: scale backend pods on CPU > 70%, targeting 2-8 replicas
- PVC for PostgreSQL with regular snapshot backups
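The HPA setting in the list above can be reasoned about with the scaling rule Kubernetes documents for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), clamped to the configured bounds. The sketch below applies that rule to the 70% CPU target and 2-8 replica range; the function name is hypothetical, and the autoscaler's tolerance band and stabilization windows are deliberately left out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 70.0,
    min_replicas: int = 2,
    max_replicas: int = 8,
) -> int:
    """Apply ceil(current * metric / target), then clamp to the replica bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))
```

For example, two pods averaging 105% of their CPU request scale to three, while six pods at 140% are capped at the eight-replica maximum.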
Option 3: Serverless Split
- Frontend: Vercel (SvelteKit adapter-vercel) or Cloudflare Pages
- Backend: Railway or Render (Docker container with persistent PostgreSQL add-on)
- Database: Neon (serverless PostgreSQL with pgvector) or Supabase
This option works well for teams that want zero infrastructure management, but introduces latency between frontend and backend due to network hops.
Environment Configuration
| Variable | Staging | Production |
|---|---|---|
| LOG_LEVEL | DEBUG | WARNING |
| ENVIRONMENT | staging | production |
| CORS_ORIGINS | staging domain | production domain |
| DATABASE_URL | staging DB | production DB (separate instance) |
| RATE_LIMIT_RPM | 600 | 120 |
SSL/TLS and DNS
Use Caddy for automatic HTTPS via Let’s Encrypt:

    # Caddyfile
    rapidai.example.com {
        reverse_proxy /rapid-ai/v1/* backend:8000
        reverse_proxy /* frontend:3000
    }

For Kubernetes, use cert-manager with a ClusterIssuer for Let’s Encrypt certificates.
23.4 Monitoring & Observability
Application Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Diagnostic request latency (p95) | < 200ms | > 500ms |
| Asset query latency (p95) | < 50ms | > 150ms |
| Swarm agent completion time | < 30s | > 60s |
| Rule evaluation throughput | > 500 rules/sec | < 200 rules/sec |
| Knowledge search latency (RAG) | < 300ms | > 800ms |
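To make the target and alert columns concrete, the sketch below computes a nearest-rank p95 over a window of latency samples and classifies it against a row of the table. The function names and the intermediate "degraded" label (between target and alert threshold) are assumptions for illustration; a production setup would normally compute percentiles in the metrics backend rather than in application code.

```python
def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_status(samples_ms: list[float], target_ms: float, alert_ms: float) -> str:
    """Classify a window of latencies as 'ok', 'degraded', or 'alert'."""
    value = p95(samples_ms)
    if value > alert_ms:
        return "alert"
    if value >= target_ms:
        return "degraded"
    return "ok"
```

For the diagnostic request row, `latency_status(samples, target_ms=200, alert_ms=500)` returns "alert" only when the p95 of the window exceeds 500 ms.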
Business Metrics
Track these in a separate dashboard from infrastructure metrics:
- Total assets monitored (by plant, by type)
- Diagnostics completed per day / per week
- Failure modes detected (top 10 by frequency)
- Accuracy feedback: operator confirmed vs. overridden diagnoses
- RCM tasks generated vs. tasks completed
- Copilot questions per day, response satisfaction
Logging Strategy
All services emit structured JSON logs with correlation IDs:

    {
      "timestamp": "2026-03-17T14:22:01.332Z",
      "level": "INFO",
      "service": "backend",
      "correlation_id": "diag-7f3a9b2c",
      "asset_id": "AST-PUMP-001",
      "module": "mB",
      "message": "Fault detection complete",
      "failure_modes_matched": 3,
      "duration_ms": 47
    }

Log levels: ERROR (system failures, data corruption), WARNING (degraded quality scores, provider fallbacks), INFO (diagnostic completions, API requests), DEBUG (rule evaluation traces, individual sensor readings).
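One way to produce log lines in this shape with only the Python standard library is a custom `logging.Formatter`. This is a sketch, not the engine's actual logging setup: the class name `JsonFormatter` and the `pipeline_module` extra key are assumptions. The extra key is needed because `module` is already a built-in `LogRecord` attribute and cannot be passed through `extra=`.

```python
import json
import logging
from datetime import datetime, timezone

# Maps keys passed via `extra=` to output field names. "pipeline_module" is a
# hypothetical key: "module" is reserved on LogRecord and cannot be overwritten.
EXTRA_FIELDS = {
    "correlation_id": "correlation_id",
    "asset_id": "asset_id",
    "pipeline_module": "module",
    "duration_ms": "duration_ms",
}


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        for attr, field in EXTRA_FIELDS.items():
            if attr in record.__dict__:
                payload[field] = record.__dict__[attr]
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("backend"))
    logger = logging.getLogger("rapid.demo")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "Fault detection complete",
        extra={"correlation_id": "diag-7f3a9b2c", "pipeline_module": "mB", "duration_ms": 47},
    )
```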
Alerting
Configure PagerDuty or OpsGenie for:
- P1 (page immediately): Database unreachable, all diagnostic requests failing, backend process crash
- P2 (alert within 15 min): Diagnostic latency > 1s sustained, provider fallback chain exhausted, disk > 90%
- P3 (daily digest): Elevated error rates, slow query warnings, certificate expiry within 14 days
Grafana Dashboards
Create three dashboards:
- System Health: Request rates, latency histograms, error rates, database connections, CPU/memory
- Diagnostic Pipeline: Module-by-module latency breakdown (mA through mE), rule match rates, confidence distributions
- Business Overview: Assets monitored, diagnostics per day, failure mode frequency, RCM task completion rates
23.5 Performance Optimization
ReferenceLoader Caching
The ReferenceLoader loads IMS rows, failure mode libraries, system profiles, and block scoring parameters. These are effectively static between rule updates:
- Strategy: Lazy-load on first request, cache in memory indefinitely
- Invalidation: Clear cache on POST /admin/schema/reload or on application restart
- Memory footprint: ~15 MB for the full IMS (100 rows x 34 columns) plus all rule libraries
- Impact: Eliminates ~40ms of database reads per diagnostic request
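The lazy-load-then-invalidate strategy described in the list above can be sketched in a few lines. This is not the actual ReferenceLoader: the class name `LazyCache` and the loader callable are hypothetical stand-ins for whatever reads the IMS rows and rule libraries.

```python
import threading
from typing import Any, Callable


class LazyCache:
    """Load a value on first access and keep it until explicitly invalidated."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._value: Any = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> Any:
        # The lock ensures concurrent first requests trigger only one load.
        with self._lock:
            if not self._loaded:
                self._value = self._loader()
                self._loaded = True
            return self._value

    def invalidate(self) -> None:
        """Drop the cached value; the next get() reloads it."""
        with self._lock:
            self._value = None
            self._loaded = False
```

A reload endpoint would call `invalidate()`, so the cost of re-reading reference data is paid by the next request rather than by every request.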
Database Query Optimization

    -- Essential indexes for diagnostic queries
    CREATE INDEX idx_analysis_results_asset_created
      ON analysis_results (asset_id, created_at DESC);

    CREATE INDEX idx_sensor_readings_asset_ts
      ON sensor_readings (asset_id, timestamp DESC);

    -- Materialized view for dashboard plant overview
    CREATE MATERIALIZED VIEW mv_plant_overview AS
    SELECT a.plant_id, a.asset_type,
           count(*) AS asset_count,
           avg(ar.ssi_score) AS avg_ssi,
           count(*) FILTER (WHERE ar.health_stage = 'critical') AS critical_count
    FROM assets a
    LEFT JOIN LATERAL (
      SELECT ssi_score, health_stage
      FROM analysis_results
      WHERE asset_id = a.asset_id
      ORDER BY created_at DESC
      LIMIT 1
    ) ar ON true
    GROUP BY a.plant_id, a.asset_type;

    -- REFRESH ... CONCURRENTLY requires a unique index on the view
    -- (the index name here is illustrative)
    CREATE UNIQUE INDEX idx_mv_plant_overview_key
      ON mv_plant_overview (plant_id, asset_type);

    -- Refresh on schedule (every 5 minutes)
    REFRESH MATERIALIZED VIEW CONCURRENTLY mv_plant_overview;

API Response Time Targets
| Endpoint | Target | Optimization |
|---|---|---|
| POST /assets/{id}/end-to-end-evaluate | < 200ms | Cached ReferenceLoader, parallel mB workers |
| GET /assets/{id} | < 50ms | Indexed lookup, no joins |
| GET /dashboard/plant-overview | < 100ms | Materialized view |
| POST /copilot/query | < 3s | Streaming SSE response |
Connection Pooling

    # asyncpg pool configuration (backend)
    from sqlalchemy.ext.asyncio import create_async_engine

    engine = create_async_engine(
        DATABASE_URL,
        pool_size=20,        # concurrent connections
        max_overflow=10,     # burst capacity
        pool_timeout=30,     # seconds to wait for a connection
        pool_recycle=3600,   # recycle connections hourly
        pool_pre_ping=True,  # verify connections before use
    )

For production with multiple backend replicas, the total number of connections across all replicas must not exceed PostgreSQL’s max_connections (default 100). With 4 replicas at pool_size=20, the steady-state total is 80 connections, leaving 20 for migrations, monitoring, and ad-hoc queries. Note that max_overflow=10 lets each replica burst to 30 connections, so 4 replicas can peak at 120; lower max_overflow or raise max_connections if bursts are expected.
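The pool arithmetic can be checked with a few lines. SQLAlchemy's queue pool opens at most pool_size + max_overflow connections per engine, so the peak across replicas is higher than the steady-state figure. The helper name `connection_budget` and its `reserved` parameter (the 20 connections set aside for migrations, monitoring, and ad-hoc queries) are illustrative assumptions.

```python
def connection_budget(
    replicas: int,
    pool_size: int,
    max_overflow: int,
    max_connections: int = 100,
    reserved: int = 20,
) -> dict[str, object]:
    """Compare steady-state and peak pool usage against the PostgreSQL limit."""
    steady = replicas * pool_size
    peak = replicas * (pool_size + max_overflow)
    available = max_connections - reserved
    return {
        "steady": steady,
        "peak": peak,
        "available": available,
        "fits_at_peak": peak <= available,
    }
```

With 4 replicas, pool_size=20, and max_overflow=10 this reports a steady total of 80 and a peak of 120, which does not fit; setting max_overflow to 0 brings the peak back to 80.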
23.6 Security Operations
Dependency Vulnerability Scanning

    # Python (weekly CI job)
    pip-audit --strict --desc

    # Node.js (weekly CI job)
    bun audit   # or: npm audit --audit-level=high

    # Container images (on every build)
    trivy image rapidai-backend:latest
    trivy image rapidai-frontend:latest

API Key Rotation
- AI provider keys (Gemini, OpenAI, Cloudflare): rotate every 90 days
- BETTER_AUTH_SECRET: rotate every 180 days, with a 24-hour overlap window where both old and new secrets are valid
- Database credentials: rotate on security incidents or personnel changes
- All rotations are logged in the audit trail
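The overlap window in the rotation list means a verifier has to accept tokens signed with either secret for a limited time. The sketch below shows the general idea with HMAC-SHA256 from the standard library. It is a generic illustration, not how better-auth handles sessions internally, and the function names are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def sign(secret: str, message: bytes) -> str:
    """Return the hex HMAC-SHA256 signature of message under secret."""
    return hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()


def verify(
    message: bytes,
    signature: str,
    current_secret: str,
    previous_secret: Optional[str] = None,
) -> bool:
    """Accept the current secret, or the previous one while the overlap lasts."""
    candidates = [current_secret]
    if previous_secret is not None:
        candidates.append(previous_secret)
    # compare_digest avoids leaking match position through timing.
    return any(
        hmac.compare_digest(sign(secret, message), signature)
        for secret in candidates
    )
```

When the 24-hour window closes, the caller stops passing `previous_secret`, and signatures made with the old secret are rejected.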
Access Logging and Audit Trail
Every state-changing API request is logged to the audit_log table:

    -- Audit log schema
    CREATE TABLE audit_log (
      id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
      user_id UUID NOT NULL REFERENCES users(id),
      action TEXT NOT NULL,    -- 'diagnose', 'update_rule', 'export_report'
      resource TEXT NOT NULL,  -- 'asset:AST-PUMP-001', 'rule:DG005'
      details JSONB,           -- request payload summary
      ip_address INET
    );

Incident Response Procedure
- Detect: Automated alert fires (PagerDuty)
- Triage: On-call engineer assesses severity (P1-P3)
- Contain: If data breach, revoke compromised credentials immediately
- Investigate: Review audit logs, structured application logs, database query history
- Resolve: Deploy fix, verify with smoke tests
- Postmortem: Document within 48 hours, identify preventive measures
Penetration Testing
Schedule annual penetration tests covering: API endpoint fuzzing, SQL injection attempts against diagnostic payloads, authentication bypass attempts, and SSRF through the copilot’s knowledge retrieval.
23.7 Disaster Recovery
Recovery Objectives
| Objective | Target | Rationale |
|---|---|---|
| RTO (Recovery Time Objective) | 4 hours | Maximum acceptable downtime before plant operators lose diagnostic capability |
| RPO (Recovery Point Objective) | 1 hour | Maximum data loss window; diagnostic results from the last hour may need re-running |
Backup Schedule
| Backup Type | Frequency | Retention | Storage |
|---|---|---|---|
| WAL archiving | Continuous | 7 days | S3/GCS bucket |
| Full database dump | Daily at 02:00 UTC | 30 days | S3/GCS bucket |
| Database snapshot (cloud) | Daily | 14 days | Cloud provider |
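Retention only holds if something deletes old dumps. The sketch below selects full dumps older than the 30-day retention, relying on the `rapidai_YYYYMMDD_HHMMSS.dump` naming used by the backup command in Section 23.2. The function name is hypothetical, and the timestamps in the filenames are assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone


def expired_dumps(
    filenames: list[str], now: datetime, retention_days: int = 30
) -> list[str]:
    """Return dump filenames whose embedded timestamp is older than the retention."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in filenames:
        try:
            taken = datetime.strptime(name, "rapidai_%Y%m%d_%H%M%S.dump")
        except ValueError:
            continue  # not produced by the naming scheme; leave it alone
        # Assumption: filename timestamps are UTC.
        if taken.replace(tzinfo=timezone.utc) < cutoff:
            expired.append(name)
    return sorted(expired)
```

Files that do not match the pattern are skipped rather than deleted, so a manually named dump is never removed by accident.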
Recovery Procedures
Application failure (backend or frontend crash): Application is stateless. Restart the container or redeploy from the latest image. No data loss.
Database corruption or loss:

    # 1. Provision new PostgreSQL instance with pgvector
    # 2. Restore from latest full backup
    pg_restore -U rapid -d rapidai rapidai_latest.dump

    # 3. Replay WAL files to reach target recovery point
    #    (requires a pg_basebackup base backup plus the WAL archive; on PostgreSQL 12+
    #    recovery is configured with restore_command and recovery.signal, which replaced
    #    recovery.conf; WAL cannot be replayed on top of a pg_restore)

    # 4. Verify seed data integrity
    psql -U rapid -d rapidai -c "SELECT count(*) FROM schema_relation_map;"
    # Expected: 100

    # 5. Restart application services
    docker compose up -d backend frontend

Rule data loss:
All rule definitions (YAML profiles, block scores, fusion weights) are versioned in git. Redeploy from the repository. The POST /admin/schema/reload endpoint forces the ReferenceLoader to re-read from the database and filesystem.
Failure Scenario Runbook
| Scenario | Symptom | Recovery Steps |
|---|---|---|
| Database unreachable | All API requests return 503 | Check PostgreSQL process, restart if crashed, failover to replica if available |
| Corrupted embeddings | Knowledge search returns irrelevant results | Re-run embedding generation pipeline, rebuild pgvector index |
| AI provider outage | Copilot and swarm requests fail | Provider Registry auto-falls back: Gemini > OpenAI > Cloudflare > Template. If all fail, copilot returns “service degraded” message |
| Disk full | Write operations fail | Clear old WAL files, expand volume, verify pg_wal directory size |
| Memory exhaustion | Backend OOM kills | Reduce connection pool size, add swap temporarily, investigate query memory usage |
| Seed data mismatch | Diagnostic results nonsensical | Re-run 00_run_all_seed_inserts.sql, clear ReferenceLoader cache via /admin/schema/reload |
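The provider fallback order in the AI provider outage row (Gemini, then OpenAI, then Cloudflare, then Template) can be pictured as a loop over an ordered list of callables. This is a sketch of the pattern only, not the Provider Registry itself: the function name, the `(name, callable)` pair shape, and the "none" label for the exhausted case are assumptions.

```python
from typing import Callable, Sequence

# Each provider is a (name, callable) pair; the callable raises on failure.
Provider = tuple[str, Callable[[str], str]]


def ask_with_fallback(
    prompt: str,
    providers: Sequence[Provider],
    degraded_message: str = "service degraded",
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # A real implementation would log the failure before moving on.
            continue
    return "none", degraded_message
```

Returning the provider name alongside the answer makes it possible to log which link in the chain served each request, which feeds the WARNING-level "provider fallbacks" log entries.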
Disaster Recovery Testing
Test the full recovery procedure quarterly:
- Take a fresh backup
- Provision a clean environment
- Restore from backup
- Run the diagnostic test suite against restored data
- Verify all 100 IMS rows, all rule libraries, and a sample diagnostic request
- Document recovery time and any issues encountered
Standards Alignment
| Standard | Relevance to This Chapter |
|---|---|
| IEC 62443 — Industrial cybersecurity | The deployment procedures, environment variable management, and database backup/recovery processes implement IEC 62443’s operational security requirements for industrial automation systems. |
| NIST SP 800-82 — Guide to Industrial Control Systems security | The operations guide follows NIST SP 800-82 guidelines for securing industrial control system components, including network segmentation, access control, and incident response procedures. |
| OWASP Top 10 — Web application security | The deployment configuration (CORS, rate limiting, authentication) addresses OWASP Top 10 vulnerabilities in the production environment through documented operational procedures. |
Changelog
| Version | Date | Author | Changes |
|---|---|---|---|
| 2.1.0 | 2026-03-17 | Rick D | Added standards alignment, living doc metadata, changelog |
| 2.0.0 | 2026-03-17 | Rick D | Enriched with production codebase content |
| 1.0.0 | 2026-03-17 | Rick D | Initial chapter creation |