# Disaster Recovery Runbook
On-call engineer? Jump straight to the Quick-Reference Runbook Checklists.
This document defines disaster scenario categories, backup strategy, step-by-step recovery procedures, health check monitoring, RTO/RPO targets, and escalation contacts for ThreatWeaver. It is designed to be usable by an on-call engineer at 2am with no prior context.
## Disaster Scenario Categories
ThreatWeaver failures fall into five categories. Identify your scenario first, then follow the corresponding runbook.
| # | Category | Symptoms |
|---|---|---|
| 1 | Backend service crash | API returning 502/503, /health unreachable, Render shows service stopped |
| 2 | Database corruption or accidental data deletion | 500 errors referencing DB, missing assets/findings, TypeORM sync errors on startup |
| 3 | Failed deployment | New deploy broke the app, build logs show errors, API regression after push |
| 4 | Security incident | Unauthorized access detected, tokens leaked, suspicious SecurityAuditLog entries |
| 5 | Accidental deletion of scan data | Findings missing, assessment results gone, user reports data loss |
## Backup Strategy
### PostgreSQL: Render Managed Backups
ThreatWeaver's production database runs on Render PostgreSQL (Pro plan). Render automatically creates daily snapshots with a 7-day retention window.
What is backed up:
- All tenant schemas (public schema + per-tenant schemas)
- All application data: assets, vulnerabilities, findings, assessments, users, API keys, entitlements
- Migration state table (`migrations`)
What is NOT backed up automatically:
- In-memory state (none; ThreatWeaver is stateless between restarts)
- Redis cache (none locally; in production, Redis is ephemeral by design)
- Local `.env` files (these live only on the engineer's machine)
Render backup retention by plan:
| Plan | Retention | Point-in-time |
|---|---|---|
| Free | None | No |
| Starter | 1 day | No |
| Pro | 7 days | No |
| Pro Plus | 7 days | Yes (PITR) |
#### How to trigger a manual backup via the Render dashboard
- Log in to Render Dashboard
- Navigate to Databases in the left sidebar
- Select the ThreatWeaver PostgreSQL instance (`dpg-d6vc8nnfte5s73dppuqg-a` for UAT/dev)
- Click the Backups tab
- Click Create Backup (button in the top-right of the Backups panel)
- The backup will appear in the list within 1-2 minutes
- Note the backup timestamp; you will need it for restore operations
Always trigger a manual backup before running migrations, bulk deletes, schema changes, or any potentially destructive admin operation.
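In addition to the dashboard button, you can take a local logical snapshot before destructive work. A minimal sketch, assuming `pg_dump` is installed and `DATABASE_URL` holds the target connection string (the function name and file naming are illustrative, not an existing ThreatWeaver script):

```shell
# backup_before_change [LABEL]
# Write a timestamped pg_dump to the current directory and refuse to
# report success if the dump comes back empty.
backup_before_change() {
  label="${1:-pre-change}"
  out="${label}-backup-$(date +%Y%m%d-%H%M%S).sql"
  pg_dump "$DATABASE_URL" > "$out" || return 1
  # An empty dump means the backup did not actually happen.
  [ -s "$out" ] || { echo "ERROR: $out is empty" >&2; return 1; }
  echo "$out"
}
```

Usage might look like `backup_before_change migrations && npm run migrate:production`, so the migration only runs once a non-empty dump exists.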
### Export Data from ThreatWeaver (Admin UI)
For targeted exports without a full DB restore:
- Log in as an admin (`admin@company.com` locally, `testingadmin@blucypher.com` on UAT)
- Navigate to Admin → Archives → Export
- Select the data type: Findings, Assets, Assessments, or full tenant export
- Choose date range if applicable
- Click Export; a CSV/JSON file will download to your browser
- Store the export in a secure location (not in the repository)
Exported files may contain vulnerability details, IP addresses, and credentials used in test scans. Handle with the same care as production credentials. Never commit exports to Git.
### Git-Based Source Recovery
The source code is always recoverable from GitHub. Any deployment can be rebuilt from scratch:
git clone git@github.com:BluCypher1/ThreatWeaver.git
cd ThreatWeaver
git checkout dev # or main for the last stable release
- Frontend: Rebuild and redeploy to Vercel
- Backend: Rebuild and redeploy to Render
- Database schema: Recreate by running `npm run migrate:production` against a fresh PostgreSQL instance
## Recovery Procedures
### Scenario 1: Backend Service Crash (Render)
Indicators: API returns 502/503, /health endpoint unreachable, users report the app is down.
Expected recovery time: 2 minutes (auto-restart) to 15 minutes (manual intervention)
#### Step 1: Assess the crash reason
- Open Render Dashboard
- Navigate to Services → ThreatWeaver Backend (`kina-vulnerability-management-uq1t`)
- Check the Status badge:
  - Restarting: Render is already recovering; wait 60-90 seconds
  - Failed: requires manual action
- Click Logs tab and scroll to the crash point
- Identify the crash reason:
| Log pattern | Cause | Action |
|---|---|---|
| `JavaScript heap out of memory` | OOM kill | Increase Render instance size or fix memory leak |
| `ECONNREFUSED` to DB | Database unreachable | Check DB service status separately |
| `Cannot find module` | Bad deploy | Roll back to previous build (Step 3) |
| TypeORM `EntityMetadata` error | Entity definition mismatch | Check recent entity file changes |
| `Port already in use` | Zombie process | Render will auto-kill and restart |
#### Step 2: Monitor the auto-restart
Render automatically restarts crashed services. Monitor recovery:
# Poll the health endpoint every 10 seconds
watch -n 10 curl -s https://kina-vulnerability-management-uq1t.onrender.com/health
Expected healthy response:
{"status":"healthy","database":"connected","timestamp":"2026-04-05T..."}
If the service restarts cleanly within 2-3 minutes, the incident is resolved. Log it in docs/incidents/.
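The `watch` loop above runs forever; for unattended use it can be wrapped into a bounded poll that exits as soon as the service is healthy or a timeout expires. A sketch (the function name and defaults are illustrative):

```shell
# poll_health URL [TIMEOUT_S] [INTERVAL_S]
# Return 0 once the URL answers HTTP 200, 1 if the timeout expires first.
poll_health() {
  url="$1"; timeout="${2:-180}"; interval="${3:-10}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # curl prints the status code; on connection failure fall back to 000.
    code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url") || code=000
    if [ "$code" = "200" ]; then
      echo "healthy after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "still unhealthy after ${timeout}s" >&2
  return 1
}
```

For this incident you would call something like `poll_health https://kina-vulnerability-management-uq1t.onrender.com/health 300 10` and treat a non-zero exit as "proceed to Step 3".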
#### Step 3: If persistent, redeploy the last known good build
- In Render Dashboard β Services β ThreatWeaver Backend
- Click Deploys tab
- Find the last deploy with status Live before the incident started
- Click Redeploy on that commit
- Monitor build logs β the build typically takes 3-5 minutes
- After deploy: verify `/health` returns `{"status":"healthy","database":"connected"}`
#### Step 4: If OOM, check for a runaway scan
Scans with large target APIs can spike memory. If a scan is running:
- Open ThreatWeaver Admin → AppSec → Active Assessments
- Cancel any running assessment
- Wait for memory to normalize (Render metrics tab)
- Restart the service manually if needed: Render → Service → Restart
#### Step 5: Verify full recovery
# Backend health
curl -s https://kina-vulnerability-management-uq1t.onrender.com/health
# Auth service health
curl -s https://kina-vulnerability-management-uq1t.onrender.com/api/auth/health
# Full API health (requires auth token)
curl -s -H "Authorization: Bearer <token>" \
https://kina-vulnerability-management-uq1t.onrender.com/api/health
All three should return 200 with healthy status before declaring recovery complete.
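The three checks can be scripted into a single pass/fail gate. A sketch, assuming a valid admin bearer token is available (the function name and `token` parameter are illustrative):

```shell
# check_recovery BASE_URL TOKEN
# Exit non-zero if any of the three health endpoints is not returning 200.
check_recovery() {
  base="$1"; token="$2"; fail=0
  for path in /health /api/auth/health; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$base$path") || code=000
    [ "$code" = "200" ] || { echo "FAIL $path -> $code" >&2; fail=1; }
  done
  # /api/health requires authentication
  code=$(curl -s -o /dev/null --max-time 10 \
    -H "Authorization: Bearer $token" \
    -w '%{http_code}' "$base/api/health") || code=000
  [ "$code" = "200" ] || { echo "FAIL /api/health -> $code" >&2; fail=1; }
  return "$fail"
}
```

Usage: `check_recovery "https://kina-vulnerability-management-uq1t.onrender.com" "$TOKEN" && echo "recovery verified"`.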
### Scenario 2: Database Corruption or Accidental Data Deletion
Indicators: 500 errors with DB references in logs, missing records, TypeORM errors on startup, users report data gone.
Expected recovery time: 30-60 minutes
Expected data loss: Up to 24 hours (last daily snapshot) unless PITR is enabled
Before restoring, stop the backend service to prevent further writes to a corrupt state. Writes during restore can create inconsistency.
#### Step 1: Stop the backend service
- Render Dashboard → Services → ThreatWeaver Backend
- Click Suspend (or use the Render API: `mcp__render__update_web_service`)
- Confirm the service shows Suspended status
#### Step 2: Identify the last known good backup
- Render Dashboard → Databases → ThreatWeaver PostgreSQL
- Click Backups tab
- Review the backup list with timestamps
- Identify the backup created before the incident:
- If the incident just happened, use today's backup
- If data was deleted hours ago, use yesterday's backup
- Note the backup ID and timestamp
#### Step 3: Restore from backup
- In the Backups tab, find the target backup
- Click Restore next to the backup
- Render will prompt: "This will overwrite the current database. Confirm?"
- Type the confirmation string and click Restore
- The restore typically takes 5-20 minutes depending on database size
- Render will show a progress indicator
Any data written between the backup timestamp and the restore will be lost permanently. Communicate this window to affected tenants.
#### Step 4: Restart the backend
- Render Dashboard → Services → ThreatWeaver Backend
- Click Resume (or Deploy if the service was deleted)
- On startup, the backend runs `npm run migrate:production` automatically
- Migrations are forward-only and idempotent: they will detect the restored state and apply only missing migrations
- Monitor startup logs for any migration errors
#### Step 5: Verify data integrity
After the backend is running, verify key data is present:
# Check asset count per tenant
curl -s -H "Authorization: Bearer <admin-token>" \
https://kina-vulnerability-management-uq1t.onrender.com/api/admin/stats
# Check vulnerability count
curl -s -H "Authorization: Bearer <admin-token>" \
https://kina-vulnerability-management-uq1t.onrender.com/api/vulnerabilities?limit=1
# Check user list
curl -s -H "Authorization: Bearer <admin-token>" \
https://kina-vulnerability-management-uq1t.onrender.com/api/admin/users
Verify against your pre-incident knowledge of record counts. If counts are wrong, you may need an older backup.
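To compare counts against pre-incident expectations mechanically, the JSON responses can be reduced to single numbers. A sketch using `python3` for parsing (consistent with the other snippets in this runbook); the field name `assetCount` is an assumption — adjust it to the actual shape of the stats response:

```shell
# json_field FIELD — read a JSON object on stdin and print one top-level
# field, or 0 if the field is absent.
json_field() {
  python3 -c 'import sys, json; print(json.load(sys.stdin).get(sys.argv[1], 0))' "$1"
}

# Hypothetical usage against the admin stats endpoint:
# assets=$(curl -s -H "Authorization: Bearer $TOKEN" \
#   "$BASE/api/admin/stats" | json_field assetCount)
# [ "$assets" -ge "$EXPECTED_MIN_ASSETS" ] || echo "asset count looks low: $assets"
```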
#### Step 6: Inform affected tenants
Send an incident notification to affected tenant admins. Use the template in Communication Templates.
### Scenario 3: Failed Deployment
Indicators: App was working before a push, now returns errors. Build succeeded but runtime fails, or build itself failed.
Expected recovery time: 10-20 minutes
#### Step 1: Confirm this is a deployment regression
- Note the exact time the push happened (check `git log` or the Render deploy history)
- Confirm the issue started at or after that time
- Check if the issue affects all users or only specific features
#### Step 2: Roll back to the last successful deploy
- Render Dashboard → Services → ThreatWeaver Backend → Deploys tab
- Find the deploy immediately before the broken one (status: Live)
- Click Redeploy on that commit
- Monitor build logs carefully
While waiting for rollback, check what changed:
# What changed in the broken commit?
git log --oneline -5
git show HEAD --stat
git diff HEAD~1 HEAD -- backend/src/
#### Step 3: Diagnose the build failure (if the build failed)
Common build failures and fixes:
TypeScript compile error:
# Reproduce locally
cd backend && rm -rf dist && npm run build 2>&1
# Fix the TS error, commit, push
npm install failure (missing package):
# Check if package.json has the dependency
# Check if node_modules is being cached incorrectly
# Clear cache in Render: Service → Environment → Clear Build Cache
Missing environment variable:
# Render Dashboard → Service → Environment
# Compare against .env.example; add any missing vars
Migration fails on startup:
# Check migration logs in Render
# If a migration is stuck, you may need to manually mark it as run
# DB Shell (local): docker-compose exec -T postgres psql -U tenable -d tenable_dashboard
#### Step 4: Hotfix workflow
If a rollback is not sufficient and you need to fix the broken code:
# Create a hotfix on local branch
git checkout local
git pull origin local
# Make the minimal fix
# ... edit files ...
# Test the build locally FIRST
cd backend && rm -rf dist && npm run build
npm start # verify it runs
# Commit with a clear message
git commit -m "fix(deploy): resolve TypeScript error in routes/scan.ts [hotfix]"
# Push to local for review
git push origin local
# Only after explicit confirmation: push to dev
# git push origin local:dev
#### Step 5: Verify full recovery
After rollback or hotfix deploy:
curl -s https://kina-vulnerability-management-uq1t.onrender.com/health
# Expected: {"status":"healthy","database":"connected"}
Test the specific feature that was broken to confirm it is working.
### Scenario 4: Security Incident (Compromised Credentials)
Indicators: Unauthorized access in SecurityAuditLog, leaked JWT_SECRET, compromised Tenable/Anthropic API keys, suspicious logins.
Expected recovery time: 30-60 minutes for initial containment
Every minute of delay increases the blast radius. Follow these steps in order; do not skip or reorder.
#### Step 1: Revoke all active JWT sessions
- Log in to ThreatWeaver Admin (if admin credentials are still valid)
- Navigate to Admin → Security → Revoke All Sessions
- Click Revoke All; this invalidates all existing JWT tokens immediately
- All users will be logged out and must re-authenticate
If you cannot log in (credentials compromised):
# Direct DB intervention: invalidate all sessions by rotating the JWT secret
# Do this immediately in Render environment variables (Step 2)
#### Step 2: Rotate JWT_SECRET
- Render Dashboard → Services → ThreatWeaver Backend → Environment
- Find `JWT_SECRET`
- Generate a new strong secret: `node -e "console.log(require('crypto').randomBytes(64).toString('hex'))"`
- Update the value in Render → Save Changes
- Render will automatically redeploy with the new secret
- All existing tokens signed with the old secret are now invalid
#### Step 3: Force password reset for all users
- After the new deploy is live, log in with admin credentials
- Admin → Users → Force Password Reset (All)
- Users will receive reset emails (if email is configured) or be prompted on next login
#### Step 4: Check SecurityAuditLog for breach scope
# Via DB shell (local) or Render PostgreSQL query
docker-compose exec -T postgres psql -U tenable -d tenable_dashboard -c \
"SELECT * FROM security_audit_log WHERE created_at > NOW() - INTERVAL '48 hours' ORDER BY created_at DESC LIMIT 100;"
Look for:
- Logins from unexpected IP addresses
- Access to admin endpoints by non-admin users
- Bulk data exports
- Unusual API key usage
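The patterns above can be turned into targeted queries against the same `security_audit_log` table queried earlier. A sketch only: the column names `ip_address`, `user_id`, and `endpoint` are assumptions — check the actual schema before running these.

```shell
# Logins per source IP in the last 48 hours; unexpected IPs stand out.
docker-compose exec -T postgres psql -U tenable -d tenable_dashboard -c \
  "SELECT ip_address, COUNT(*) AS events FROM security_audit_log
   WHERE created_at > NOW() - INTERVAL '48 hours'
   GROUP BY ip_address ORDER BY events DESC;"

# Admin-endpoint access grouped by user; flag any non-admin user IDs.
docker-compose exec -T postgres psql -U tenable -d tenable_dashboard -c \
  "SELECT user_id, endpoint, COUNT(*) AS hits FROM security_audit_log
   WHERE endpoint LIKE '/api/admin/%'
     AND created_at > NOW() - INTERVAL '48 hours'
   GROUP BY user_id, endpoint ORDER BY hits DESC;"
```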
#### Step 5: Rotate all external secrets
Rotate these in order (highest impact first):
| Secret | Where to rotate | Impact of not rotating |
|---|---|---|
| `JWT_SECRET` | Render env vars (done in Step 2) | Token forgery |
| `TENABLE_API_KEY` | Tenable.io dashboard | Unauthorized Tenable API access |
| `ANTHROPIC_API_KEY` | console.anthropic.com | AI cost abuse |
| `DATABASE_URL` | Render DB → Reset credentials | DB access |
| `SESSION_SECRET` | Render env vars | Session hijacking |
After rotating each secret, update Render environment variables and trigger a redeploy.
#### Step 6: Review IP access patterns
Check Render access logs for unusual patterns:
- Render Dashboard → Service → Logs; filter by `[ERROR]` and unusual HTTP methods
- Look for: mass enumeration (sequential IDs), admin endpoint access outside business hours, requests from unexpected geos
If patterns indicate active scanning or exfiltration:
- Enable IP allowlisting in Render (Service → Settings → IP Allowlist)
- Add only known office/VPN IP ranges
- Block all other traffic temporarily
#### Step 7: Post-incident documentation
- Create an incident report in `docs/incidents/INCIDENT-<DATE>-<NAME>.md`
- Document: timeline, root cause, blast radius, remediation steps taken
- Update `docs/audits/ISSUE_TRACKER.md` with a security finding entry
- Notify all tenants of the incident (see Communication Templates)
### Scenario 5: Accidental Deletion of Scan Data
Indicators: Assessment results gone, findings missing, user reports losing work.
Expected recovery time: 5 minutes (from archive) to 60 minutes (from DB backup)
#### Step 1: Check the Archives first
ThreatWeaver archives scan data before deletion. Check here before doing a DB restore:
- Log in as admin
- Navigate to Admin → Archives
- Search for the assessment by name, date, or target URL
- If found: click Restore to bring the data back
- Verify the restored data is complete
This is the fastest path; always check archives before escalating to a DB restore.
#### Step 2: If not in archives, restore from a DB backup
Follow Scenario 2 steps.
Key consideration for scan data restore: Scan findings are tenant-scoped. If restoring a specific tenant's data, consider whether a full DB restore is warranted or if the data can be re-generated.
#### Step 3: Re-run the assessment (last resort)
For AppSec scanner assessments, scans are fully reproducible:
- Note the original assessment configuration (target URL, scan type, auth profile)
- Create a new assessment with identical settings
- Re-run the scan; results will be regenerated
- The new findings will not match the originals exactly (different timestamps, and results may vary with the target's current state)
## Health Check Endpoints
Monitor these endpoints to detect issues proactively:
### Endpoint Reference
| Endpoint | What it checks | Expected response |
|---|---|---|
| `GET /health` | Service alive (no auth required) | `{"status":"healthy"}` |
| `GET /api/health` | DB connected | `{"status":"healthy","database":"connected"}` |
| `GET /api/auth/health` | Auth service alive | `{"status":"healthy","auth":"operational"}` |
### Recommended Monitoring Setup
Set up uptime monitoring (e.g., UptimeRobot, BetterStack, or Render's built-in health checks) to poll GET /health every 60 seconds.
Alert thresholds:
- Warning: Response time > 2000ms
- Critical: Status code != 200 for 2 consecutive checks
- Down: Status code != 200 for 5 consecutive checks
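If you roll your own poller instead of using a hosted monitor, these thresholds reduce to a small classification function. A sketch; the `watch` label for a first, not-yet-alertable failure is an invented name, since the thresholds above only define warning/critical/down:

```shell
# classify HTTP_CODE LATENCY_MS CONSECUTIVE_FAILURES
# Map one health-check observation onto the alert thresholds above.
classify() {
  code="$1"; latency_ms="$2"; fails="$3"
  if [ "$code" != "200" ]; then
    if [ "$fails" -ge 5 ]; then echo down
    elif [ "$fails" -ge 2 ]; then echo critical
    else echo watch   # first failure: below the 2-check critical threshold
    fi
  elif [ "$latency_ms" -gt 2000 ]; then
    echo warning      # healthy but slow
  else
    echo ok
  fi
}
```

A caller would track the consecutive-failure count itself and page on `critical` or `down`.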
### Local Health Verification
# Quick health check (paste this into your terminal)
echo "=== Service Health ===" && \
curl -s http://localhost:4005/health | python3 -m json.tool && \
echo "=== DB Health ===" && \
curl -s http://localhost:4005/api/health | python3 -m json.tool && \
echo "=== Auth Health ===" && \
curl -s http://localhost:4005/api/auth/health | python3 -m json.tool
### Production Health Verification
BASE_URL="https://kina-vulnerability-management-uq1t.onrender.com"
echo "=== Service Health ===" && \
curl -s "$BASE_URL/health" | python3 -m json.tool && \
echo "=== DB Health ===" && \
curl -s "$BASE_URL/api/health" | python3 -m json.tool && \
echo "=== Auth Health ===" && \
curl -s "$BASE_URL/api/auth/health" | python3 -m json.tool
## Recovery Time and Point Objectives
| Scenario | RTO | RPO | Notes |
|---|---|---|---|
| Backend service crash (transient) | ~2 min | 0 (stateless) | Render auto-restart |
| Backend service crash (persistent) | ~15 min | 0 (stateless) | Manual redeploy required |
| Database restore (daily backup) | ~30-60 min | 24 hours | Last daily snapshot |
| Database restore (PITR, Pro Plus) | ~30-60 min | Minutes | Point-in-time recovery |
| Deployment rollback | ~10 min | 0 (code in Git) | Render redeploy |
| Security incident containment | ~30 min | N/A | Depends on breach scope |
| Scan data from archive | ~5 min | 0 | If archived |
| Scan data re-run | ~30-120 min | N/A | Reproducible |
## Quick-Reference Runbook Checklists
Copy these into a text editor or incident management tool at 2am.
### Runbook A: Backend Service Down
[ ] 1. Check Render Dashboard → Service status
[ ] 2. Read crash logs: identify OOM / runtime error / bad deploy
[ ] 3. Wait 2 min for auto-restart
[ ] 4. If still down: Render → Deploys → Redeploy last Live commit
[ ] 5. Monitor build logs (~3-5 min build time)
[ ] 6. Verify: curl /health returns 200
[ ] 7. Verify: curl /api/health returns database:connected
[ ] 8. Notify users if downtime > 5 min
[ ] 9. Document in docs/incidents/
### Runbook B: Database Restore
[ ] 1. Stop backend: Render → Service → Suspend
[ ] 2. Trigger manual DB backup NOW (documents current state even if corrupt)
[ ] 3. Identify target backup (last known good, from Render → DB → Backups)
[ ] 4. Click Restore on target backup
[ ] 5. Wait 5-20 min for restore
[ ] 6. Resume backend: Render → Service → Resume
[ ] 7. Monitor startup logs for migration errors
[ ] 8. Verify: curl /api/health returns database:connected
[ ] 9. Check asset count / vuln count / user list via API
[ ] 10. Notify affected tenants of data loss window
[ ] 11. Document in docs/incidents/
### Runbook C: Deployment Rollback
[ ] 1. Confirm issue started after recent push (check git log + Render deploy time)
[ ] 2. Render → Service → Deploys → find last Live deploy before incident
[ ] 3. Click Redeploy
[ ] 4. Wait for build (~3-5 min)
[ ] 5. Verify: curl /health returns 200
[ ] 6. Test the specific feature that was broken
[ ] 7. Investigate root cause in parallel
[ ] 8. Create hotfix on local branch, test locally, get confirmation before pushing to dev
### Runbook D: Security Incident Containment
[ ] 1. Admin → Security → Revoke All Sessions (immediate)
[ ] 2. Generate new JWT_SECRET: node -e "console.log(require('crypto').randomBytes(64).toString('hex'))"
[ ] 3. Render → Service → Environment → Update JWT_SECRET → Save (triggers redeploy)
[ ] 4. After redeploy: Admin → Users → Force Password Reset (All)
[ ] 5. Query SecurityAuditLog for breach scope (last 48 hours)
[ ] 6. Rotate TENABLE_API_KEY in Tenable.io dashboard
[ ] 7. Rotate ANTHROPIC_API_KEY in console.anthropic.com
[ ] 8. If active attack: enable IP allowlisting in Render → Service → Settings
[ ] 9. Notify all tenant admins
[ ] 10. Document full timeline in docs/incidents/INCIDENT-<DATE>.md
[ ] 11. Update ISSUE_TRACKER.md
### Runbook E: Scan Data Missing
[ ] 1. Admin → Archives → Search for assessment by name/date
[ ] 2. If found: click Restore; verify data is complete
[ ] 3. If not found: determine if DB restore is warranted vs re-scan
[ ] 4. If re-scan: create new assessment with identical settings, re-run
[ ] 5. If DB restore needed: follow Runbook B
[ ] 6. Notify user who reported the loss
## Environment-Specific Notes
### Local Development
- No managed backups: your Docker volume is the only copy
- Back up the local DB manually before destructive operations: `docker-compose exec -T postgres pg_dump -U tenable tenable_dashboard > backup-$(date +%Y%m%d).sql`
- Restore the local DB: `docker-compose exec -T postgres psql -U tenable tenable_dashboard < backup-20260405.sql`
- Local service crash: just restart with `npm run dev`
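The restore command above can be wrapped with a basic sanity check so a silently empty or missing dump is caught immediately. A sketch, assuming the same docker-compose service and credentials as the commands above (the function name is illustrative):

```shell
# local_db_restore DUMP.sql
# Restore a local dump into the docker-compose Postgres, then print how
# many public-schema tables exist as a quick sanity check.
local_db_restore() {
  dump="$1"
  # Refuse to restore from a missing or empty file.
  [ -s "$dump" ] || { echo "ERROR: $dump missing or empty" >&2; return 1; }
  docker-compose exec -T postgres psql -U tenable tenable_dashboard < "$dump" || return 1
  docker-compose exec -T postgres psql -U tenable -d tenable_dashboard -At \
    -c "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema = 'public';"
}
```

A table count of 0 after a restore means the dump was not what you thought it was.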
### UAT / Dev Environment (dev.threatweaver.ai)
- Backend: `threatweaver-backend.onrender.com` (Singapore, Pro Plus)
- DB: `dpg-d6vc8nnfte5s73dppuqg-a`
- Credentials: `testingadmin@blucypher.com` / `TestAdmin@Blu2026!`
- Same runbooks apply; treat UAT data loss as lower severity than production
### Production (kinavulnerabilitymanagement.vercel.app)
- Backend: `kina-vulnerability-management-uq1t.onrender.com`
- All incidents require tenant notification within 1 hour of discovery
- Any DB restore requires written approval from the project owner
## Escalation Contacts
| Role | Responsibility | When to escalate |
|---|---|---|
| On-call engineer | Initial triage, Scenarios 1-3 | Immediately on alert |
| Project owner (Tilak) | Scenarios 4-5, DB restore approval | Security incidents, data loss > 1 hour |
| Render Support | DB restore failures, platform issues | When Render UI/API is unresponsive |
| Supabase Support | Production DB issues (if on Supabase) | DB connection failures not resolved in 15 min |
Render Support: https://render.com/support
Render Status Page: https://status.render.com (bookmark this)
GitHub Repo: https://github.com/BluCypher1/ThreatWeaver
## Communication Templates
### Template 1: Service Disruption Notification
Subject: ThreatWeaver Service Disruption - [DATE] [TIME UTC]
Team,
We are currently experiencing a service disruption affecting ThreatWeaver.
Impact: [API unavailable / Slow response times / Data access issues]
Started: [TIME UTC]
Affected services: [Backend API / Dashboard / Scanner]
Cause: [Known / Under investigation]
Current status: [Investigating / Restoring / Monitoring recovery]
Next update: [TIME UTC]
We apologize for the inconvenience. Our team is working to resolve this as quickly as possible.
– ThreatWeaver Operations
### Template 2: Data Loss Notification
Subject: ThreatWeaver - Data Restoration Notice for Your Tenant
[Tenant Admin Name],
We are writing to inform you of a data restoration that affected your ThreatWeaver tenant.
What happened: [Brief description]
Data affected: [Findings / Assets / Assessments / User data]
Time window of data loss: [FROM timestamp] to [TO timestamp]
Data restored to: [Backup timestamp]
Action required: [None - your data has been restored / Please re-run assessments created after DATE]
We are sorry for any disruption this caused. Please contact us if you have any questions or notice any data discrepancies.
– ThreatWeaver Operations
### Template 3: Security Incident Notification
Subject: IMPORTANT - ThreatWeaver Security Notice - Action Required
[Tenant Admin Name],
We are writing to notify you of a security incident that may have affected your ThreatWeaver account.
What happened: [Brief, factual description - do not speculate]
When: [DATE/TIME UTC]
What we did: Revoked all active sessions, rotated credentials, forced password resets
What you need to do:
1. Re-authenticate to ThreatWeaver at [URL]
2. Set a new password when prompted
3. Review your SecurityAuditLog for any unauthorized activity
4. Rotate any API keys your team stored in ThreatWeaver
If you see any suspicious activity in your audit log, please reply to this email immediately.
We take security very seriously and are conducting a full post-incident review. We will share a detailed report within 72 hours.
– ThreatWeaver Security Team
## Post-Incident Review Checklist
After every incident, regardless of severity:
[ ] Write incident report: docs/incidents/INCIDENT-<YYYY-MM-DD>-<slug>.md
[ ] Timeline documented: detection time, response time, resolution time
[ ] Root cause identified and documented
[ ] Blast radius determined: which tenants/data were affected
[ ] Remediation steps documented
[ ] Prevention: what change prevents recurrence?
[ ] Update ISSUE_TRACKER.md if a code fix is needed
[ ] Update this runbook if the steps were unclear or incomplete
[ ] Share report with project owner within 24 hours of resolution
## Related Documentation
- Deployment Guide: deploy procedures, environment setup
- Runbooks: operational runbooks for common tasks
- Migration History: DB migration log
- Environment Variables: all env vars and their purpose
- Permission Matrix: what each role can access