Version: Local · In Progress

Disaster Recovery Runbook

On-call engineer? Jump straight to the Quick-Reference Runbook Checklists.

This document defines disaster scenario categories, backup strategy, step-by-step recovery procedures, health check monitoring, RTO/RPO targets, and escalation contacts for ThreatWeaver. It is designed to be usable by an on-call engineer at 2am with no prior context.

Disaster Scenario Categories

ThreatWeaver failures fall into five categories. Identify your scenario first, then follow the corresponding runbook.

#	Category	Symptoms
1	Backend service crash	API returning 502/503, `/health` unreachable, Render shows service stopped
2	Database corruption or accidental data deletion	500 errors referencing DB, missing assets/findings, TypeORM sync errors on startup
3	Failed deployment	New deploy broke the app, build logs show errors, API regression after push
4	Security incident	Unauthorized access detected, tokens leaked, suspicious SecurityAuditLog entries
5	Accidental deletion of scan data	Findings missing, assessment results gone, user reports data loss

Backup Strategy

PostgreSQL — Render Managed Backups

ThreatWeaver's production database runs on Render PostgreSQL (Pro plan). Render automatically creates daily snapshots with a 7-day retention window.

What is backed up:

All tenant schemas (public schema + per-tenant schemas)
All application data: assets, vulnerabilities, findings, assessments, users, API keys, entitlements
Migration state table (migrations)

What is NOT backed up automatically:

In-memory state (none — ThreatWeaver is stateless between restarts)
Redis cache (none in local; in production Redis is ephemeral by design)
Local .env files (these live only on the engineer's machine)

Render backup retention by plan:

Plan	Retention	Point-in-time
Free	None	No
Starter	1 day	No
Pro	7 days	No
Pro Plus	7 days	Yes (PITR)

How to trigger a manual backup via Render dashboard

Log in to Render Dashboard
Navigate to Databases in the left sidebar
Select the ThreatWeaver PostgreSQL instance (dpg-d6vc8nnfte5s73dppuqg-a for UAT/dev)
Click the Backups tab
Click Create Backup (button in the top-right of the Backups panel)
The backup will appear in the list within 1-2 minutes
Note the backup timestamp — you will need it for restore operations

Before any risky operation

Always trigger a manual backup before running migrations, bulk deletes, schema changes, or any potentially destructive admin operation.

Export Data from ThreatWeaver (Admin UI)

For targeted exports without a full DB restore:

Log in as an admin (admin@company.com locally, testingadmin@blucypher.com on UAT)
Navigate to Admin → Archives → Export
Select the data type: Findings, Assets, Assessments, or full tenant export
Choose date range if applicable
Click Export — a CSV/JSON file will download to your browser
Store the export in a secure location (not in the repository)

Exports contain sensitive data

Exported files may contain vulnerability details, IP addresses, and credentials used in test scans. Handle with the same care as production credentials. Never commit exports to Git.

Git-Based Source Recovery

The source code is always recoverable from GitHub. Any deployment can be rebuilt from scratch:

git clone git@github.com:BluCypher1/ThreatWeaver.git
cd ThreatWeaver
git checkout dev  # or main for the last stable release

Frontend: Rebuild and redeploy to Vercel
Backend: Rebuild and redeploy to Render
Database schema: Recreate by running npm run migrate:production against a fresh PostgreSQL instance

Recovery Procedures

Scenario 1: Backend Service Crash (Render)

Indicators: API returns 502/503, /health endpoint unreachable, users report the app is down.

Expected recovery time: 2 minutes (auto-restart) to 15 minutes (manual intervention)

Step 1 — Assess the crash reason

Open Render Dashboard
Navigate to Services → ThreatWeaver Backend (kina-vulnerability-management-uq1t)
Check the Status badge:
- Restarting — Render is already recovering; wait 60-90 seconds
- Failed — requires manual action
Click Logs tab and scroll to the crash point
Identify the crash reason:

Log pattern	Cause	Action
`JavaScript heap out of memory`	OOM kill	Increase Render instance size or fix memory leak
`ECONNREFUSED` to DB	Database unreachable	Check DB service status separately
`Cannot find module`	Bad deploy	Rollback to previous build (Step 3)
`TypeORM EntityMetadata` error	Entity definition mismatch	Check recent entity file changes
`Port already in use`	Zombie process	Render will auto-kill and restart

Step 2 — Monitor the auto-restart

Render automatically restarts crashed services. Monitor recovery:

# Poll the health endpoint every 10 seconds
watch -n 10 curl -s https://kina-vulnerability-management-uq1t.onrender.com/health

Expected healthy response:

{"status":"healthy","database":"connected","timestamp":"2026-04-05T..."}

If the service restarts cleanly within 2-3 minutes, the incident is resolved. Log it in docs/incidents/.
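
If you would rather have the terminal tell you when recovery is complete than watch the output, the polling can be wrapped in a loop with a hard stop. This is a minimal sketch, not part of the ThreatWeaver tooling: the function names `is_healthy_body` and `wait_for_healthy` are hypothetical, and the default of 18 attempts at 10-second intervals is an assumption chosen to cover the 2-3 minute restart window mentioned above.

```shell
#!/bin/sh
# Sketch only: helper names and defaults are illustrative assumptions.

# is_healthy_body BODY
# Succeeds if the JSON body contains "status":"healthy" (whitespace-tolerant).
is_healthy_body() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# wait_for_healthy URL [MAX_ATTEMPTS] [INTERVAL_SECONDS]
# Polls URL until the body reports healthy or the attempts run out.
wait_for_healthy() {
  local url="$1" max="${2:-18}" interval="${3:-10}"
  local attempt=1 body
  while [ "$attempt" -le "$max" ]; do
    # --max-time keeps a hung connection from stalling the loop
    body="$(curl -s --max-time 5 "$url" || true)"
    if is_healthy_body "$body"; then
      echo "healthy after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "still unhealthy after $max attempts" >&2
  return 1
}

# Usage:
#   wait_for_healthy https://kina-vulnerability-management-uq1t.onrender.com/health
```

A non-zero exit means the service did not recover on its own, which is the signal to move on to Step 3.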

Step 3 — If persistent: redeploy last known good build

In Render Dashboard → Services → ThreatWeaver Backend
Click Deploys tab
Find the last deploy with status Live before the incident started
Click Redeploy on that commit
Monitor build logs — the build typically takes 3-5 minutes
After deploy: verify /health returns {"status":"healthy","database":"connected"}

Step 4 — If OOM: check for runaway scan

Scans with large target APIs can spike memory. If a scan is running:

Open ThreatWeaver Admin → AppSec → Active Assessments
Cancel any running assessment
Wait for memory to normalize (Render metrics tab)
Restart the service manually if needed: Render → Service → Restart

Step 5 — Verify full recovery

# Backend health
curl -s https://kina-vulnerability-management-uq1t.onrender.com/health

# Auth service health
curl -s https://kina-vulnerability-management-uq1t.onrender.com/api/auth/health

# Full API health (requires auth token)
curl -s -H "Authorization: Bearer <token>" \
  https://kina-vulnerability-management-uq1t.onrender.com/api/health

All three should return 200 with healthy status before declaring recovery complete.
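
The "all three return 200" rule can be checked mechanically instead of by eye. The sketch below is one possible way to do it, assuming only the three endpoints listed above; `http_code` and `all_ok` are hypothetical helper names, and `<token>` remains a placeholder you must supply.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative assumptions.

# http_code URL [TOKEN]
# Prints only the HTTP status code of a GET request to URL.
http_code() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' \
      -H "Authorization: Bearer $2" "$1"
  else
    curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
  fi
}

# all_ok CODE...
# Succeeds only if at least one code is given and every code is 200.
all_ok() {
  local code
  [ "$#" -gt 0 ] || return 1
  for code in "$@"; do
    [ "$code" = "200" ] || return 1
  done
  return 0
}

# Usage:
#   BASE="https://kina-vulnerability-management-uq1t.onrender.com"
#   all_ok "$(http_code "$BASE/health")" \
#          "$(http_code "$BASE/api/auth/health")" \
#          "$(http_code "$BASE/api/health" "<token>")" \
#     && echo "recovery verified" || echo "NOT recovered"
```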

Scenario 2: Database Corruption or Accidental Data Deletion

Indicators: 500 errors with DB references in logs, missing records, TypeORM errors on startup, users report data gone.

Expected recovery time: 30-60 minutes
Expected data loss: Up to 24 hours (last daily snapshot) unless PITR is enabled

Stop all writes immediately

Before restoring, stop the backend service to prevent further writes to a corrupt state. Writes during restore can create inconsistency.

Step 1 — Stop the backend service

Render Dashboard → Services → ThreatWeaver Backend
Click Suspend (or use the Render MCP tool mcp__render__update_web_service)
Confirm the service shows Suspended status

Step 2 — Identify the last known good backup

Render Dashboard → Databases → ThreatWeaver PostgreSQL
Click Backups tab
Review the backup list with timestamps
Identify the backup created before the incident:
- If the incident happened after today's snapshot was taken, use today's backup
- If the data was deleted before today's snapshot, use yesterday's backup (or the newest backup that predates the deletion)
Note the backup ID and timestamp
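
Choosing the backup is a "latest timestamp strictly before the incident" selection, which is easy to get wrong at 2am. The following sketch shows that rule in isolation. It assumes you have copied the backup timestamps from the Render Backups tab by hand and written them as UTC ISO 8601 strings in one consistent format, so that string order equals chronological order; `pick_backup` is a hypothetical helper, not a Render command.

```shell
#!/bin/sh
# Sketch only: timestamps are typed in manually; nothing here talks to Render.

# pick_backup INCIDENT_TS BACKUP_TS...
# Prints the latest backup timestamp strictly earlier than INCIDENT_TS.
# Exits non-zero if no backup predates the incident.
pick_backup() {
  local incident="$1"
  shift
  printf '%s\n' "$@" | LC_ALL=C sort | awk -v inc="$incident" '
    # Appending "" forces a string comparison; input is sorted ascending,
    # so the last match is the newest backup before the incident.
    ($0 "") < (inc "") { best = $0 }
    END { if (best == "") exit 1; print best }
  '
}

# Usage:
#   pick_backup 2026-04-05T09:30:00Z \
#     2026-04-03T02:00:00Z 2026-04-04T02:00:00Z 2026-04-05T02:00:00Z
```

If the function exits non-zero, no retained backup predates the incident and the restore path cannot recover the data.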

Step 3 — Restore from backup

In the Backups tab, find the target backup
Click Restore next to the backup
Render will prompt: "This will overwrite the current database. Confirm?"
Type the confirmation string and click Restore
The restore typically takes 5-20 minutes depending on database size
Render will show a progress indicator

All data since backup is lost

Any data written between the backup timestamp and the restore will be lost permanently. Communicate this window to affected tenants.

Step 4 — Restart the backend

Render Dashboard → Services → ThreatWeaver Backend
Click Resume (or Deploy if the service was deleted)
On startup, the backend runs npm run migrate:production automatically
Migrations are forward-only and idempotent — they will detect the restored state and apply only missing migrations
Monitor startup logs for any migration errors
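
To see why re-running migrations against a restored database is safe, it helps to picture "apply only missing migrations" as a set difference between the migrations shipped in the code and the names recorded in the migrations table. The sketch below illustrates only that idea; it is not how TypeORM or `npm run migrate:production` is implemented, and the file names and migration names are hypothetical.

```shell
#!/bin/sh
# Sketch only: illustrates the set difference, not the real migration runner.

# pending_migrations APPLIED_FILE AVAILABLE_FILE
# Prints names listed in AVAILABLE_FILE that are absent from APPLIED_FILE,
# preserving the order of AVAILABLE_FILE (one name per line in both files).
pending_migrations() {
  # -v invert match, -x whole-line match, -F fixed strings, -f patterns from file
  grep -vxFf "$1" "$2" || true
}

# Usage (hypothetical files):
#   pending_migrations applied.txt available.txt
```

Running it twice with an up-to-date applied list prints nothing, which is the behavior "idempotent" describes.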

Step 5 — Verify data integrity

After the backend is running, verify key data is present:

# Check asset count per tenant
curl -s -H "Authorization: Bearer <admin-token>" \
  https://kina-vulnerability-management-uq1t.onrender.com/api/admin/stats

# Check vulnerability count
curl -s -H "Authorization: Bearer <admin-token>" \
  "https://kina-vulnerability-management-uq1t.onrender.com/api/vulnerabilities?limit=1"

# Check user list
curl -s -H "Authorization: Bearer <admin-token>" \
  https://kina-vulnerability-management-uq1t.onrender.com/api/admin/users

Verify against your pre-incident knowledge of record counts. If counts are wrong, you may need an older backup.
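
"Counts are wrong" is easier to judge against an explicit tolerance. As a rough sketch, the comparison could look like the following, assuming you extract the restored count from the API responses yourself (their JSON shape is not documented here). `count_within` is a hypothetical helper and the 5% default tolerance is an arbitrary assumption.

```shell
#!/bin/sh
# Sketch only: the helper name and default tolerance are assumptions.

# count_within EXPECTED ACTUAL [TOLERANCE_PERCENT]
# Succeeds when ACTUAL is within TOLERANCE_PERCENT (default 5) of EXPECTED.
count_within() {
  local expected="$1" actual="$2" tol="${3:-5}" diff
  diff=$((expected - actual))
  [ "$diff" -ge 0 ] || diff=$((0 - diff))
  # Integer math: diff/expected <= tol/100 rewritten as diff*100 <= tol*expected
  [ $((diff * 100)) -le $((tol * expected)) ]
}

# Usage:
#   count_within 1000 980 && echo "asset count plausible" || echo "try an older backup"
```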

Step 6 — Inform affected tenants

Send an incident notification to affected tenant admins. Use the template in Communication Templates.

Scenario 3: Failed Deployment

Indicators: App was working before a push, now returns errors. Build succeeded but runtime fails, or build itself failed.

Expected recovery time: 10-20 minutes

Step 1 — Confirm this is a deployment regression

Note the exact time the push happened (check git log or Render deploy history)
Confirm the issue started at or after that time
Check if the issue affects all users or only specific features

Step 2 — Rollback to last successful deploy

Render Dashboard → Services → ThreatWeaver Backend → Deploys tab
Find the deploy immediately before the broken one (status: Live)
Click Redeploy on that commit
Monitor build logs carefully

While waiting for rollback, check what changed:

# What changed in the broken commit?
git log --oneline -5
git show HEAD --stat
git diff HEAD~1 HEAD -- backend/src/

Step 3 — Diagnose the build failure (if build failed)

Common build failures and fixes:

TypeScript compile error:

# Reproduce locally
cd backend && rm -rf dist && npm run build 2>&1
# Fix the TS error, commit, push

npm install failure (missing package):

# Check if package.json has the dependency
# Check if node_modules is being cached incorrectly
# Clear cache in Render: Service → Manual Deploy → Clear build cache & deploy

Missing environment variable:

# Render Dashboard → Service → Environment
# Compare against .env.example — add any missing vars

Migration fails on startup:

# Check migration logs in Render
# If a migration is stuck, you may need to manually mark it as run
# DB Shell (local): docker-compose exec -T postgres psql -U tenable -d tenable_dashboard

Step 4 — Hotfix workflow

If a rollback is not sufficient and you need to fix the broken code:

# Create a hotfix on local branch
git checkout local
git pull origin local

# Make the minimal fix
# ... edit files ...

# Test the build locally FIRST
cd backend && rm -rf dist && npm run build
npm start  # verify it runs

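# Stage the fix (git add -u covers modified tracked files; use git add <path> for new files)
git add -u
# Commit with a clear message
git commit -m "fix(deploy): resolve TypeScript error in routes/scan.ts [hotfix]"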

# Push to local for review
git push origin local

# Only after explicit confirmation: push to dev
# git push origin local:dev

Step 5 — Verify full recovery

After rollback or hotfix deploy:

curl -s https://kina-vulnerability-management-uq1t.onrender.com/health
# Expected: {"status":"healthy","database":"connected"}

Test the specific feature that was broken to confirm it is working.

Scenario 4: Security Incident (Compromised Credentials)

Indicators: Unauthorized access in SecurityAuditLog, leaked JWT_SECRET, compromised Tenable/Anthropic API keys, suspicious logins.

Expected recovery time: 30-60 minutes for initial containment

Act immediately

Every minute of delay increases the blast radius. Follow these steps in order — do not skip or reorder.

Step 1 — Revoke all active JWT sessions

Log in to ThreatWeaver Admin (if admin credentials are still valid)
Navigate to Admin → Security → Revoke All Sessions
Click Revoke All — this invalidates all existing JWT tokens immediately
All users will be logged out and must re-authenticate

If you cannot log in (credentials compromised):

# No DB intervention is needed: rotating JWT_SECRET invalidates every existing token.
# Go straight to Step 2 and rotate it in the Render environment variables.

Step 2 — Rotate JWT_SECRET

Render Dashboard → Services → ThreatWeaver Backend → Environment
Find JWT_SECRET

Generate a new strong secret:

node -e "console.log(require('crypto').randomBytes(64).toString('hex'))"

Update the value in Render → Save Changes
Render will automatically redeploy with the new secret
All existing tokens signed with the old secret are now invalid
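
The reason rotation logs everyone out is that a token's signature only verifies under the secret that produced it. The sketch below demonstrates this with a plain HMAC-SHA256, assuming the backend signs JWTs symmetrically (HS256) with JWT_SECRET, which this runbook does not state explicitly. The helper names are hypothetical, the output is hex rather than the base64url used in real JWTs, and it requires the `openssl` CLI.

```shell
#!/bin/sh
# Sketch only: demonstrates the principle, not ThreatWeaver's token code.

# hmac_sign SECRET PAYLOAD
# Prints the hex HMAC-SHA256 of PAYLOAD under SECRET.
hmac_sign() {
  printf '%s' "$2" \
    | openssl dgst -sha256 -hmac "$1" -binary \
    | od -An -tx1 \
    | tr -d ' \n'
}

# hmac_verify SECRET PAYLOAD SIGNATURE
# Succeeds only if SIGNATURE is the HMAC of PAYLOAD under SECRET.
hmac_verify() {
  [ "$(hmac_sign "$1" "$2")" = "$3" ]
}

# Usage:
#   sig="$(hmac_sign old-secret 'header.payload')"
#   hmac_verify old-secret 'header.payload' "$sig" && echo "valid before rotation"
#   hmac_verify new-secret 'header.payload' "$sig" || echo "rejected after rotation"
```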

Step 3 — Force password reset for all users

After the new deploy is live, log in with admin credentials
Admin → Users → Force Password Reset (All)
Users will receive reset emails (if email is configured) or be prompted on next login

Step 4 — Check SecurityAuditLog for breach scope

# Via DB shell (local) or Render PostgreSQL query
docker-compose exec -T postgres psql -U tenable -d tenable_dashboard -c \
  "SELECT * FROM security_audit_log WHERE created_at > NOW() - INTERVAL '48 hours' ORDER BY created_at DESC LIMIT 100;"

Look for:

Logins from unexpected IP addresses
Access to admin endpoints by non-admin users
Bulk data exports
Unusual API key usage
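
For the first item, logins from unexpected IP addresses, a quick tally of audit events by source IP against a list of known addresses narrows the search. This is a sketch under stated assumptions: it expects the audit rows exported as CSV with a header and the columns `created_at,ip_address,action`, which is a hypothetical layout (check the real column names in security_audit_log before relying on it), and `unknown_ips` is an illustrative helper.

```shell
#!/bin/sh
# Sketch only: the CSV column layout is an assumption, not the real schema.

# unknown_ips ALLOWLIST_FILE < export.csv
# Reads CSV rows "created_at,ip_address,action" (with a header) on stdin and
# prints "COUNT IP" for every IP not listed in ALLOWLIST_FILE, busiest first.
unknown_ips() {
  awk -F',' -v allow="$1" '
    BEGIN { while ((getline line < allow) > 0) known[line] = 1 }
    NR > 1 && !($2 in known) { seen[$2]++ }   # NR > 1 skips the header row
    END { for (ip in seen) print seen[ip], ip }
  ' | sort -rn
}

# Usage (known-ips.txt holds one office/VPN address per line):
#   unknown_ips known-ips.txt < audit-export.csv
```

With psql 12 or later, the `--csv` flag on the query above produces output in this general shape, provided the SELECT lists the matching columns.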

Step 5 — Rotate all external secrets

Rotate these in order (highest impact first):

Secret	Where to rotate	Impact of not rotating
`JWT_SECRET`	Render env vars (done in Step 2)	Token forgery
`TENABLE_API_KEY`	Tenable.io dashboard	Unauthorized Tenable API access
`ANTHROPIC_API_KEY`	console.anthropic.com	AI cost abuse
`DATABASE_URL`	Render DB → Reset credentials	DB access
`SESSION_SECRET`	Render env vars	Session hijacking

After rotating each secret, update Render environment variables and trigger a redeploy.

Step 6 — Review IP access patterns

Check Render access logs for unusual patterns:

Render Dashboard → Service → Logs → filter by [ERROR] and unusual HTTP methods
Look for: mass enumeration (sequential IDs), admin endpoint access outside business hours, requests from unexpected geos

If patterns indicate active scanning or exfiltration:

Enable IP allowlisting in Render (Service → Settings → IP Allowlist)
Add only known office/VPN IP ranges
Block all other traffic temporarily

Step 7 — Post-incident documentation

Create an incident report in docs/incidents/INCIDENT-<YYYY-MM-DD>-<slug>.md
Document: timeline, root cause, blast radius, remediation steps taken
Update docs/audits/ISSUE_TRACKER.md with a security finding entry
Notify all tenants of the incident (see Communication Templates)
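
To keep incident report names uniform, the file path can be generated rather than typed. A small sketch, assuming the name is built from an ISO date followed by a lowercase hyphenated slug of the incident title; `incident_filename` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: the helper name and slug rules are assumptions.

# incident_filename DATE TITLE
# Prints docs/incidents/INCIDENT-<DATE>-<slug>.md, where the slug is TITLE
# lowercased with every run of non-alphanumeric characters collapsed to "-".
incident_filename() {
  local slug
  slug="$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')"
  printf 'docs/incidents/INCIDENT-%s-%s.md\n' "$1" "$slug"
}

# Usage:
#   touch "$(incident_filename "$(date -u +%Y-%m-%d)" "Leaked JWT secret")"
```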

Scenario 5: Accidental Deletion of Scan Data

Indicators: Assessment results gone, findings missing, user reports losing work.

Expected recovery time: 5 minutes (from archive) to 60 minutes (from DB backup)

Step 1 — Check the Archives first

ThreatWeaver archives scan data before deletion. Check here before doing a DB restore:

Log in as admin
Navigate to Admin → Archives
Search for the assessment by name, date, or target URL
If found: click Restore to bring the data back
Verify the restored data is complete

This is the fastest path — always check archives before escalating to DB restore.

Step 2 — If not in archives: restore from DB backup

Follow Scenario 2 steps.

Key consideration for scan data restore: Scan findings are tenant-scoped. If restoring a specific tenant's data, consider whether a full DB restore is warranted or if the data can be re-generated.

Step 3 — Re-run the assessment (last resort)

For AppSec scanner assessments, the scan can be re-run from its original configuration, although the results may not be identical:

Note the original assessment configuration (target URL, scan type, auth profile)
Create a new assessment with identical settings
Re-run the scan — results will be regenerated
The new findings will differ slightly from the originals (different timestamps, possibly slightly different findings depending on target state)

Health Check Endpoints

Monitor these endpoints to detect issues proactively:

Endpoint Reference

Endpoint	What it checks	Expected response
`GET /health`	Service alive (no auth required)	`{"status":"healthy"}`
`GET /api/health`	DB connected	`{"status":"healthy","database":"connected"}`
`GET /api/auth/health`	Auth service alive	`{"status":"healthy","auth":"operational"}`

Recommended Monitoring Setup

Set up uptime monitoring (e.g., UptimeRobot, BetterStack, or Render's built-in health checks) to poll GET /health every 60 seconds.

Alert thresholds:

Warning: Response time > 2000ms
Critical: Status code != 200 for 2 consecutive checks
Down: Status code != 200 for 5 consecutive checks
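
If your monitoring tool cannot express these thresholds directly, they reduce to a small decision function. The sketch below encodes them as written, assuming the caller tracks the number of consecutive non-200 checks and the latest response time itself. `classify_alert` is a hypothetical name, and treating a single failed check as not yet alerting is an interpretation of the thresholds above, not a stated rule.

```shell
#!/bin/sh
# Sketch only: the helper name and single-failure handling are assumptions.

# classify_alert CONSECUTIVE_FAILURES LAST_RESPONSE_MS
# Prints DOWN, CRITICAL, WARNING, or OK according to the thresholds.
classify_alert() {
  local failures="$1" ms="$2"
  if [ "$failures" -ge 5 ]; then
    echo "DOWN"        # status != 200 for 5 consecutive checks
  elif [ "$failures" -ge 2 ]; then
    echo "CRITICAL"    # status != 200 for 2 consecutive checks
  elif [ "$ms" -gt 2000 ]; then
    echo "WARNING"     # response time above 2000 ms
  else
    echo "OK"
  fi
}

# Usage:
#   classify_alert 3 180
```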

Local Health Verification

# Quick health check (paste this into your terminal)
echo "=== Service Health ===" && \
  curl -s http://localhost:4005/health | python3 -m json.tool && \
  echo "=== DB Health ===" && \
  curl -s http://localhost:4005/api/health | python3 -m json.tool && \
  echo "=== Auth Health ===" && \
  curl -s http://localhost:4005/api/auth/health | python3 -m json.tool

Production Health Verification

BASE_URL="https://kina-vulnerability-management-uq1t.onrender.com"

echo "=== Service Health ===" && \
  curl -s "$BASE_URL/health" | python3 -m json.tool && \
  echo "=== DB Health ===" && \
  curl -s "$BASE_URL/api/health" | python3 -m json.tool && \
  echo "=== Auth Health ===" && \
  curl -s "$BASE_URL/api/auth/health" | python3 -m json.tool

Recovery Time and Point Objectives

Scenario	RTO	RPO	Notes
Backend service crash (transient)	~2 min	0 (stateless)	Render auto-restart
Backend service crash (persistent)	~15 min	0 (stateless)	Manual redeploy required
Database restore (daily backup)	~30-60 min	24 hours	Last daily snapshot
Database restore (PITR, Pro Plus)	~30-60 min	Minutes	Point-in-time recovery
Deployment rollback	~10-20 min	0 (code in Git)	Render redeploy
Security incident containment	~30-60 min	N/A	Depends on breach scope
Scan data from archive	~5 min	0	If archived
Scan data re-run	~30-120 min	N/A	Reproducible

Quick-Reference Runbook Checklists

Copy these into a text editor or incident management tool at 2am.

Runbook A: Backend Service Down

[ ] 1. Check Render Dashboard → Service status
[ ] 2. Read crash logs → identify OOM / runtime error / bad deploy
[ ] 3. Wait 2 min for auto-restart
[ ] 4. If still down: Render → Deploys → Redeploy last Live commit
[ ] 5. Monitor build logs (~3-5 min build time)
[ ] 6. Verify: curl /health returns 200
[ ] 7. Verify: curl /api/health returns database:connected
[ ] 8. Notify users if downtime > 5 min
[ ] 9. Document in docs/incidents/

Runbook B: Database Restore

[ ] 1. Stop backend: Render → Service → Suspend
[ ] 2. Trigger manual DB backup NOW (documents current state even if corrupt)
[ ] 3. Identify target backup (last known good, from Render → DB → Backups)
[ ] 4. Click Restore on target backup
[ ] 5. Wait 5-20 min for restore
[ ] 6. Resume backend: Render → Service → Resume
[ ] 7. Monitor startup logs for migration errors
[ ] 8. Verify: curl /api/health returns database:connected
[ ] 9. Check asset count / vuln count / user list via API
[ ] 10. Notify affected tenants of data loss window
[ ] 11. Document in docs/incidents/

Runbook C: Deployment Rollback

[ ] 1. Confirm issue started after recent push (check git log + Render deploy time)
[ ] 2. Render → Service → Deploys → find last Live deploy before incident
[ ] 3. Click Redeploy
[ ] 4. Wait for build (~3-5 min)
[ ] 5. Verify: curl /health returns 200
[ ] 6. Test the specific feature that was broken
[ ] 7. Investigate root cause in parallel
[ ] 8. Create hotfix on local branch, test locally, get confirmation before pushing to dev

Runbook D: Security Incident Containment

[ ] 1. Admin → Security → Revoke All Sessions (immediate)
[ ] 2. Generate new JWT_SECRET: node -e "console.log(require('crypto').randomBytes(64).toString('hex'))"
[ ] 3. Render → Service → Environment → Update JWT_SECRET → Save (triggers redeploy)
[ ] 4. After redeploy: Admin → Users → Force Password Reset (All)
[ ] 5. Query SecurityAuditLog for breach scope (last 48 hours)
[ ] 6. Rotate TENABLE_API_KEY in Tenable.io dashboard
[ ] 7. Rotate ANTHROPIC_API_KEY in console.anthropic.com, then DATABASE_URL and SESSION_SECRET in Render (Scenario 4, Step 5)
[ ] 8. If active attack: enable IP allowlisting in Render → Service → Settings
[ ] 9. Notify all tenant admins
[ ] 10. Document full timeline in docs/incidents/INCIDENT-<YYYY-MM-DD>-<slug>.md
[ ] 11. Update ISSUE_TRACKER.md

Runbook E: Scan Data Missing

[ ] 1. Admin → Archives → Search for assessment by name/date
[ ] 2. If found: click Restore → verify data is complete
[ ] 3. If not found: determine if DB restore is warranted vs re-scan
[ ] 4. If re-scan: create new assessment with identical settings, re-run
[ ] 5. If DB restore needed: follow Runbook B
[ ] 6. Notify user who reported the loss

Environment-Specific Notes

Local Development

No managed backups — your Docker volume is the only copy

Back up local DB manually before destructive operations:

docker-compose exec -T postgres pg_dump -U tenable tenable_dashboard > backup-$(date +%Y%m%d).sql

Restore local DB:

docker-compose exec -T postgres psql -U tenable tenable_dashboard < backup-20260405.sql

Local service crash: just restart with npm run dev

UAT / Dev Environment (dev.threatweaver.ai)

Backend: threatweaver-backend.onrender.com (Singapore, Pro Plus)
DB: dpg-d6vc8nnfte5s73dppuqg-a
Credentials: testingadmin@blucypher.com / TestAdmin@Blu2026!
Same runbooks apply — treat UAT data loss as lower severity than production

Production (kinavulnerabilitymanagement.vercel.app)

Backend: kina-vulnerability-management-uq1t.onrender.com
All incidents require tenant notification within 1 hour of discovery
Any DB restore requires written approval from the project owner

Escalation Contacts

Role	Responsibility	When to escalate
On-call engineer	Initial triage, Scenarios 1-3	Immediately on alert
Project owner (Tilak)	Scenarios 4-5, DB restore approval	Security incidents, data loss > 1 hour
Render Support	DB restore failures, platform issues	When Render UI/API is unresponsive
Supabase Support	Production DB issues (if on Supabase)	DB connection failures not resolved in 15 min

Render Support: https://render.com/support
Render Status Page: https://status.render.com (bookmark this)
GitHub Repo: https://github.com/BluCypher1/ThreatWeaver

Communication Templates

Template 1: Service Disruption Notification

Subject: ThreatWeaver Service Disruption — [DATE] [TIME UTC]

Team,

We are currently experiencing a service disruption affecting ThreatWeaver.

Impact: [API unavailable / Slow response times / Data access issues]
Started: [TIME UTC]
Affected services: [Backend API / Dashboard / Scanner]
Cause: [Known / Under investigation]

Current status: [Investigating / Restoring / Monitoring recovery]

Next update: [TIME UTC]

We apologize for the inconvenience. Our team is working to resolve this as quickly as possible.

— ThreatWeaver Operations

Template 2: Data Loss Notification

Subject: ThreatWeaver — Data Restoration Notice for Your Tenant

[Tenant Admin Name],

We are writing to inform you of a data restoration that affected your ThreatWeaver tenant.

What happened: [Brief description]
Data affected: [Findings / Assets / Assessments / User data]
Time window of data loss: [FROM timestamp] to [TO timestamp]
Data restored to: [Backup timestamp]

Action required: [None — your data has been restored / Please re-run assessments created after DATE]

We are sorry for any disruption this caused. Please contact us if you have any questions or notice any data discrepancies.

— ThreatWeaver Operations

Template 3: Security Incident Notification

Subject: IMPORTANT — ThreatWeaver Security Notice — Action Required

[Tenant Admin Name],

We are writing to notify you of a security incident that may have affected your ThreatWeaver account.

What happened: [Brief, factual description — do not speculate]
When: [DATE/TIME UTC]
What we did: Revoked all active sessions, rotated credentials, forced password resets
What you need to do:
  1. Re-authenticate to ThreatWeaver at [URL]
  2. Set a new password when prompted
  3. Review your SecurityAuditLog for any unauthorized activity
  4. Rotate any API keys your team stored in ThreatWeaver

If you see any suspicious activity in your audit log, please reply to this email immediately.

We take security very seriously and are conducting a full post-incident review. We will share a detailed report within 72 hours.

— ThreatWeaver Security Team

Post-Incident Review Checklist

After every incident, regardless of severity:

[ ] Write incident report: docs/incidents/INCIDENT-<YYYY-MM-DD>-<slug>.md
[ ] Timeline documented: detection time, response time, resolution time
[ ] Root cause identified and documented
[ ] Blast radius determined: which tenants/data were affected
[ ] Remediation steps documented
[ ] Prevention: what change prevents recurrence?
[ ] Update ISSUE_TRACKER.md if a code fix is needed
[ ] Update this runbook if the steps were unclear or incomplete
[ ] Share report with project owner within 24 hours of resolution

Related Documentation

Deployment Guide — deploy procedures, environment setup
Runbooks — operational runbooks for common tasks
Migration History — DB migration log
Environment Variables — all env vars and their purpose
Permission Matrix — what each role can access
