DevOps / Infrastructure

⚙️

Active

Auto-deploy pipeline, AI-powered bug triage, health monitoring, test suite, security layer, and 3 AI agent skills for status checks, incident response, and deploy management. The systems that keep production running without a dedicated ops team.

Active Systems

🚀 Auto-Deploy Pipeline Auto

Push to main auto-deploys frontend (Render static CDN) and backend (Render web service). Feature branches deploy to staging environments. Database migrations run automatically on deploy. Zero manual intervention.

Render Git push Auto-migrate Auto-seed

🔍 Triage Agent AI Agent

Autonomous Node.js agent polls GitHub issues hourly via macOS launchd. Spawns Claude Code instances (Agent SDK) to investigate bug reports — reads codebase, traces errors, creates fix branches + draft PRs for valid bugs, comments and closes invalid ones. POSTs results to server for email notifications.

Claude Agent SDK launchd GitHub API DRY_RUN mode

🛡️ Maintenance Detection Auto

4-state health machine: healthy → suspicious → down → recovering. Heartbeat polling (60s healthy, 5s suspicious), exponential backoff (2s→30s when down), recovery requires 3 consecutive successes. Admin-toggled (default OFF). Fixed overlay preserves app tree state.

State Machine Admin Toggle Exponential Backoff

🧪 Test Suite CI

789+ tests across frontend (Vitest/jsdom) and backend (supertest). Covers lesson JSON validation, interactive components, palette utilities, hooks, forum endpoints, admin API, auth flows. Run before every deploy.

Vitest supertest jsdom 789+ tests

🔐 Security Layer Auto

Multi-layer security: prompt injection guards (sanitizeForPrompt), HTML sanitization, content moderation (profanity + link spam via moderateContent), CSS color validation, input size limits via Zod, XML delimiter wrapping for AI inputs.

Zod sanitizeForPrompt moderateContent HTML sanitizer

AI Agent Skills

🩺 Status Check AI Skill

Quick 10-second production health probe. Curls API, frontend, and DB endpoints in parallel. Checks last deploy status and recent error count. Traffic light verdict: green/yellow/red.

/devops-status --full Render MCP

🚑 Incident Response AI Skill

Traces production errors from Render logs through the codebase to root cause. Classifies severity (critical/high/medium), identifies the breaking commit, proposes or implements fixes on a feature branch.

/devops-incident --fix --dry-run

📦 Deploy Manager AI Skill

Pre-deploy safety checklist (tests, migrations, env vars, breaking changes), post-deploy health verification, deploy history, and rollback guidance. The safety net before pushing to main.

/devops-deploy check | verify | rollback history

Missing — To Build

🚨 Error Monitoring Missing

No Sentry or equivalent. Frontend crashes in production go undetected unless a student manually reports via forum or GitHub issue. Need real-time error tracking with source maps and session replay.

Sentry? LogRocket?

📡 Uptime Monitoring Missing

No external uptime monitor. Maintenance page detects outages reactively from the client side. Need proactive alerts (Slack/email) and a public status page for trust.

UptimeRobot? Checkly?

Roadmap

3 AI skills active. Next: Sentry error monitoring integration, external uptime probes with status page, automated incident runbooks.