Backup & Disaster Recovery

Owner: Brett (CXD)
Platforms: Windows (Primary + Backup)
Services: RABS Admin (backend + admin UI), PostgreSQL, NGINX, File storage (storage/, storage_user/)
RPO target: ≤ 3 hours (daytime cadence) • RTO: Manual cutover (~30–60 min)
Principle: “State first, then evolve.” We restore schema+data from dumps and keep app/DB/files in lock‑step.


0) Terms & Goals

  • RPO (Recovery Point Objective): Accept up to 3 hours of data/code loss during daytime; nightly “hard” backup reduces risk further.
  • RTO (Recovery Time Objective): Manual cutover (swap WAN IP / DNS or controlled LAN “backdoor”) within ~1 hour.
  • Primary: Windows host running apps, DB, file storage.
  • Backup: Off‑site Windows host that regularly restores the latest code+DB+files and can become primary on demand.
  • Backups restore state; migrations embody intent. We use backups to get running quickly; migrations carry old states forward when needed.
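The RPO target can be checked mechanically from the dump file names used in §2.1. The following is a minimal sketch, not part of the existing tooling: Get-DumpTimestamp and Test-RpoMet are hypothetical helper names, and the 3-hour default simply mirrors the target stated above.

```powershell
# Hypothetical helper: extract the timestamp from rabspocdb_YYYY-MM-DD_HH-mm.dump
function Get-DumpTimestamp {
    param([string]$FileName)
    if ($FileName -match '^rabspocdb_(\d{4}-\d{2}-\d{2}_\d{2}-\d{2})\.dump$') {
        [datetime]::ParseExact($Matches[1], 'yyyy-MM-dd_HH-mm', [cultureinfo]::InvariantCulture)
    }
}

# Hypothetical helper: $true when the dump is no older than the RPO window
function Test-RpoMet {
    param(
        [string]$FileName,
        [datetime]$Now = (Get-Date),
        [double]$RpoHours = 3      # mirrors the RPO target in this runbook
    )
    $ts = Get-DumpTimestamp -FileName $FileName
    if (-not $ts) { return $false }   # unrecognised name: treat as not meeting RPO
    ($Now - $ts).TotalHours -le $RpoHours
}
```

For example, Test-RpoMet 'rabspocdb_2024-05-01_09-00.dump' evaluated at 11:30 the same day is within the window, while the same file evaluated at 13:00 is not.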

1) Inventory & Critical Paths

1.1 Data & Config

  • PostgreSQL DB: rabspocdb (with pgcrypto, citext)
  • Secrets: backend/.env (JWT_SECRET, APP_ENCRYPTION_KEY, SMTP_* …) — must match on both machines
  • File storage:
    • C:\Users\brett\dev\rabs\storage\ (public drive)
    • C:\Users\brett\dev\rabs\storage_user\ (user drive)
  • Code: Git repo rabs (backend + admin)
  • Reverse proxy: NGINX configs (same version, same vhosts on both machines)

1.2 Health Gates (fast, lightweight)

  • DB connectivity: SELECT 1 (log the round-trip latency)
  • App health: /health endpoint returns 200
  • Disk free: ensure enough space for dumps and storage sync
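The three gates above could be wrapped in one small timing helper. This is a sketch under assumptions: Measure-Gate and Get-HealthReport are hypothetical names, and the health URL, drive letter and 10 GB threshold are placeholders to adapt. The database gate relies on PGPASSWORD or pgpass.conf being set up.

```powershell
# Hypothetical helper: run a check, time it, and never let it throw
function Measure-Gate {
    param([string]$Name, [scriptblock]$Check)
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    try   { $ok = [bool](& $Check) }
    catch { $ok = $false }
    $sw.Stop()
    [pscustomobject]@{ Gate = $Name; Ok = $ok; Ms = $sw.ElapsedMilliseconds }
}

# Hypothetical wrapper for the three gates; defaults are assumptions, not values from this runbook
function Get-HealthReport {
    param(
        [string]$Psql = 'C:\Program Files\PostgreSQL\16\bin\psql.exe',
        [string]$HealthUrl = 'http://localhost/health',
        [string]$Drive = 'C',
        [double]$MinFreeGB = 10
    )
    # DB connectivity: SELECT 1, success judged by psql's exit code
    Measure-Gate -Name 'db' -Check {
        & $Psql -h localhost -U postgres -d rabspocdb -t -A -c 'select 1' | Out-Null
        $LASTEXITCODE -eq 0
    }
    # App health: /health must return 200 (non-2xx responses throw and count as failed)
    Measure-Gate -Name 'app' -Check {
        (Invoke-WebRequest -Uri $HealthUrl -UseBasicParsing -TimeoutSec 5).StatusCode -eq 200
    }
    # Disk free: enough room for dumps and storage sync
    Measure-Gate -Name 'disk' -Check { (Get-PSDrive -Name $Drive).Free / 1GB -ge $MinFreeGB }
}
```

Calling Get-HealthReport returns one object per gate with Gate, Ok and Ms properties, which can be appended to a log or used to abort a dump or restore when any Ok is false.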

2) Schedules (Daily Cadence)

2.1 Daytime (every 3 hours)

Primary (00:00, 03:00, 06:00, …):

  1. Ensure Postgres parity/exts (see §4.1 script)
  2. Dump DB (custom format) → \\backupshare\db\rabspocdb_YYYY-MM-DD_HH-mm.dump
  3. Ship dump to Backup via SMB/SSH
  4. (File storage already GoodSync’d continuously)

Backup (stagger start +10 min):

  1. Stop app services (graceful)
  2. Ensure Postgres parity/exts (see §4.1)
  3. Restore DB (pg_restore)
  4. Start app services
  5. Log “restore OK” with timestamp

2.2 Nightly “Hard” Backup (00:00–01:00)

Primary @ 00:00:

  1. Stop app services
  2. Ensure Postgres parity/exts
  3. Dump DB (custom format + schema‑only dump)
  4. Ship dumps to Backup (SMB/SSH, as in §2.1) so the 00:30 restore has the nightly dump
  5. Archive code (tag commit + zip optional)
  6. Start app services @ ~00:30

Backup @ 00:30:

  1. Stop app services
  2. Ensure Postgres parity/exts
  3. Restore DB from the latest nightly dump
  4. Start app services
  5. Log “nightly restored OK”

GoodSync keeps storage/ & storage_user/ current continuously. If strict atomicity is ever required (DB references must exist on disk), add a brief file‑sync freeze while dumping.
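One way to drive the 3-hour cadence is Windows Task Scheduler. The sketch below is an assumption about how the jobs might be registered, not the current setup: Get-CadenceStart and Register-CadenceTask are hypothetical names, the task names are invented, and the account the task runs under is left at the default. The nightly hard backup would be a separate task.

```powershell
# Hypothetical helper: first run of the day for each role (Backup is staggered +10 min)
function Get-CadenceStart {
    param(
        [ValidateSet('Primary', 'Backup')][string]$Role,
        [datetime]$Day = (Get-Date).Date
    )
    if ($Role -eq 'Backup') { $Day.Date.AddMinutes(10) } else { $Day.Date }
}

# Hypothetical helper: register a task repeating every 3 hours.
# Requires Windows and the ScheduledTasks module; on older Windows versions
# -RepetitionDuration may also need to be specified.
function Register-CadenceTask {
    param(
        [ValidateSet('Primary', 'Backup')][string]$Role,
        [string]$ScriptPath
    )
    $trigger = New-ScheduledTaskTrigger -Once -At (Get-CadenceStart -Role $Role) `
        -RepetitionInterval (New-TimeSpan -Hours 3)
    $action = New-ScheduledTaskAction -Execute 'powershell.exe' `
        -Argument "-NoProfile -File `"$ScriptPath`""
    Register-ScheduledTask -TaskName "RABS-$Role-3h" -Trigger $trigger -Action $action
}
```

On Primary this might be called as Register-CadenceTask -Role Primary -ScriptPath with the path to Backup-And-Ship.ps1, and on Backup with -Role Backup and the path to Restore-And-Start.ps1 (see §10).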


3) Dump & Restore Procedures (PostgreSQL)

3.1 Dump (Primary)

Use custom format (compressed; supports selective restore and parallel restore via pg_restore -j):

$env:PGPASSWORD = '<postgres_password>'
$ts = (Get-Date).ToString('yyyy-MM-dd_HH-mm')
$dump = "C:\Backups\db\rabspocdb_$ts.dump"
# Full dump in custom format
& "C:\Program Files\PostgreSQL\16\bin\pg_dump.exe" `
    -h localhost -U postgres -F c -d rabspocdb -f $dump
# Native commands do not throw on failure, so check the exit code explicitly
if ($LASTEXITCODE -ne 0) { throw "pg_dump failed with exit code $LASTEXITCODE" }
# (Optional) schema-only file, plain SQL:
& "C:\Program Files\PostgreSQL\16\bin\pg_dump.exe" `
    -h localhost -U postgres -F p -s -d rabspocdb -f "C:\Backups\db\rabspocdb_schema_$ts.sql"

3.2 Restore (Backup)

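$env:PGPASSWORD = '<postgres_password>'
$dump = "C:\Backups\db\rabspocdb_YYYY-MM-DD_HH-mm.dump"   # substitute the actual timestamp
# Restore in place (drops and recreates objects as needed)
& "C:\Program Files\PostgreSQL\16\bin\pg_restore.exe" `
    -h localhost -U postgres -d rabspocdb --clean --if-exists $dump
# A non-zero exit code means errors or warnings occurred; review the output before starting services
if ($LASTEXITCODE -ne 0) { throw "pg_restore failed with exit code $LASTEXITCODE" }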

Notes

  • Ensure identical major Postgres versions on both machines.
  • If extension errors occur, run:
    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    CREATE EXTENSION IF NOT EXISTS citext;
    then rerun the failing part or the restore.

4) Nightly Checks & Parity

4.1 Ensure Postgres Parity & Extensions (PowerShell)

Manifest: C:\Users\brett\dev\rabs\ops\pg_extensions.json

{
  "required_server_major": "16",
  "extensions": [
    {"name": "pgcrypto", "min_version": "1.3"},
    {"name": "citext", "min_version": "1.6"}
  ]
}

Script: C:\Users\brett\dev\rabs\ops\Ensure-Postgres.ps1

param(
    [string]$DbHost = "localhost", [int]$DbPort = 5432, [string]$DbName = "rabspocdb",
    [string]$DbUser = "postgres", [string]$PgBin = "C:\Program Files\PostgreSQL\16\bin",
    [string]$Manifest = "C:\Users\brett\dev\rabs\ops\pg_extensions.json",
    [string]$EnvFile = "C:\Users\brett\dev\rabs\backend\.env",
    [string]$EnvHashFile = "C:\Users\brett\dev\rabs\ops\.env.sha256"
)
$ErrorActionPreference = "Stop"
$psql = Join-Path $PgBin "psql.exe"
# Runs one SQL statement (tuples only, unaligned). Relies on PGPASSWORD or pgpass.conf for auth.
function Invoke-Psql([string]$sql) { & $psql -h $DbHost -p $DbPort -U $DbUser -d $DbName -t -A -c $sql 2>$null }
# Returns a hashtable of installed extension name -> version.
function Get-InstalledExtensions {
    $map = @{}
    Invoke-Psql "select extname, extversion from pg_extension;" | ForEach-Object { $p = $_.Split("|"); if ($p.Count -ge 2) { $map[$p[0]] = $p[1] } }
    $map
}
# 1) Server major: server_version_num is e.g. 160004, so divide by 10000 (valid for PostgreSQL 10+).
$serverVer = Invoke-Psql "show server_version_num;"
if (-not $serverVer) { throw "Cannot query server_version_num" }
$serverMajor = [int][math]::Floor([int]$serverVer / 10000)
$manifestObj = Get-Content $Manifest -Raw | ConvertFrom-Json
if ($serverMajor -ne [int]$manifestObj.required_server_major) { throw "Server major $serverMajor != required $($manifestObj.required_server_major)" }
# 2) Create any missing extensions, then re-read the installed set.
$installed = Get-InstalledExtensions
foreach ($ext in $manifestObj.extensions) { if (-not $installed.ContainsKey($ext.name)) { Invoke-Psql "create extension if not exists $($ext.name);" } }
$installed = Get-InstalledExtensions
# 3) Warn on versions below the minimum. Compare as [version]: as strings, "1.10" sorts below "1.6".
foreach ($ext in $manifestObj.extensions) {
    if ($installed.ContainsKey($ext.name) -and $ext.min_version -and ([version]$installed[$ext.name] -lt [version]$ext.min_version)) {
        Write-Warning "$($ext.name) $($installed[$ext.name]) < $($ext.min_version)"
    }
}
# 4) Warn if .env changed since the last run on this host, then store the new hash.
if (Test-Path $EnvFile) {
    $hash = (Get-FileHash -Algorithm SHA256 $EnvFile).Hash
    if ((Test-Path $EnvHashFile) -and ((Get-Content $EnvHashFile -TotalCount 1) -ne $hash)) { Write-Warning ".env changed since last run" }
    Set-Content -Path $EnvHashFile -Value $hash -NoNewline
} else { Write-Warning "Missing .env at $EnvFile" }
Write-Host "Postgres parity check complete"

Run this before dumping on Primary and before restoring on Backup.
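The manifest's min_version values are dotted strings, and comparing them as text misorders versions such as 1.10 and 1.6. A minimal sketch of a numeric comparison follows; ConvertTo-ExtVersion and Test-ExtensionVersion are hypothetical helper names, and the sketch assumes purely numeric dotted versions, which holds for pgcrypto and citext.

```powershell
# Hypothetical helper: parse an extension version string as System.Version
function ConvertTo-ExtVersion {
    param([string]$Text)
    # System.Version needs at least major.minor, so pad a bare "2" to "2.0"
    if ($Text -notmatch '\.') { $Text = "$Text.0" }
    [version]$Text
}

# Hypothetical helper: $true when the installed version satisfies the minimum
function Test-ExtensionVersion {
    param([string]$Installed, [string]$Minimum)
    (ConvertTo-ExtVersion -Text $Installed) -ge (ConvertTo-ExtVersion -Text $Minimum)
}
```

With this, Test-ExtensionVersion -Installed '1.10' -Minimum '1.6' passes, whereas the plain string comparison '1.10' -ge '1.6' does not.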

4.2 Secrets Parity

  • .env on Primary and Backup must match (JWT_SECRET, APP_ENCRYPTION_KEY, SMTP_*, STORAGE_*_DIR).
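  • The script above stores a SHA-256 hash of .env and warns when the file changes between runs on the same host. To detect drift between Primary and Backup, compare the two hosts' hash files.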
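Cross-host parity can be checked without copying the secrets themselves by comparing the local .env against the other host's stored hash. This is a sketch under assumptions: Test-EnvParity is a hypothetical name, and it presumes the peer's .env.sha256 file has been made readable locally (for example over the same share used for dumps).

```powershell
# Hypothetical helper: compare the local .env with the peer host's stored SHA-256 hash
function Test-EnvParity {
    param(
        [string]$LocalEnv,       # e.g. C:\Users\brett\dev\rabs\backend\.env
        [string]$PeerHashFile    # assumed: a copy of the peer's ops\.env.sha256
    )
    $local = (Get-FileHash -Algorithm SHA256 -Path $LocalEnv).Hash
    # Cast to [string] so an empty file yields '' instead of $null
    $peer = ([string](Get-Content -Path $PeerHashFile -TotalCount 1)).Trim()
    $local -eq $peer             # string -eq is case-insensitive, so hex case does not matter
}
```

A job could emit Write-Warning, or fail outright, when Test-EnvParity returns $false.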

4.3 Config Parity (NGINX, service files)

  • Optionally keep a small checksum list of the NGINX configs and service definitions and compare it on each run. Fail the job on drift if you prefer strictness.
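A checksum list for config files might look like the sketch below. Get-ChecksumList and Get-ConfigDrift are hypothetical names, and which files to track (NGINX configs, service definitions) is left to the operator.

```powershell
# Hypothetical helper: build a baseline of path + SHA-256 for the given files
function Get-ChecksumList {
    param([string[]]$Paths)
    foreach ($p in $Paths) {
        [pscustomobject]@{ Path = $p; Hash = (Get-FileHash -Algorithm SHA256 -Path $p).Hash }
    }
}

# Hypothetical helper: return one object per file that no longer matches the baseline
function Get-ConfigDrift {
    param([object[]]$Baseline)   # objects with Path and Hash, e.g. from Import-Csv
    foreach ($entry in $Baseline) {
        $now = if (Test-Path $entry.Path) {
            (Get-FileHash -Algorithm SHA256 -Path $entry.Path).Hash
        } else { 'MISSING' }
        if ($now -ne $entry.Hash) {
            [pscustomobject]@{ Path = $entry.Path; Expected = $entry.Hash; Actual = $now }
        }
    }
}
```

The baseline can be saved with Export-Csv -NoTypeInformation and reloaded with Import-Csv. For a strict job, throw when Get-ConfigDrift returns anything.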

5) Storage Folders Sync

  • GoodSync runs continuously for:
    • storage/ ↔ backup storage/
    • storage_user/ ↔ backup storage_user/
  • If strict consistency is ever required (DB → file reference must exist), pause file sync for 1–2 minutes during DB dump and resume after transfer.

6) Start/Stop Order (to avoid mixed states)

On Backup during restore:

  1. Stop app services (backend/API, NGINX site, workers)
  2. Ensure Postgres parity
  3. Restore DB dump
  4. Start app services (backend first, then NGINX)

On Primary during nightly hard backup:

  1. Stop app services
  2. Ensure Postgres parity
  3. Dump DB
  4. Start app services
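On Primary, a failed dump should not leave the app stopped. A minimal sketch of enforcing the order with try/finally follows; Invoke-WithAppStopped is a hypothetical name, and the stop, work and start steps are passed as script blocks so service names stay in the caller.

```powershell
# Hypothetical helper: stop, do the work, and always run the start step afterwards
function Invoke-WithAppStopped {
    param(
        [scriptblock]$Stop,
        [scriptblock]$Work,
        [scriptblock]$Start
    )
    & $Stop
    try     { & $Work }
    finally { & $Start }   # runs even when $Work throws; the error still propagates
}
```

A nightly job might pass { Stop-Service RABS-Backend } as Stop, the parity check plus dump as Work, and { Start-Service RABS-Backend } as Start. On Backup it may be preferable to leave services stopped after a failed restore, so this pattern fits the Primary sequence best.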

7) Failover Plan (Manual)

  1. Confirm last Backup restore was OK (health checks).
  2. Swap WAN IP or DNS when ready.
  3. Monitor logs, 2xx/4xx/5xx, and Postgres connections.
  4. When Primary returns, either:
    • keep Backup as Primary until next maintenance window, or
    • perform reverse‑sync + controlled cutback.

Note: If both hosts served traffic during a split‑brain scenario (shouldn’t happen here), reconcile DB before resuming dual cadence.


8) Testing & Drills

  • Monthly drill: disable Primary network for 15–30 min, cut over to Backup, verify end‑to‑end.
  • Nightly job logs: write success/exit codes to C:\Backups\logs\YYYY-MM-DD.txt.
  • Alerting: simple email or SMS on failure of dump/restore tasks.
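Logging exit codes to the dated file could be done with a helper like the one below. This is only a sketch: Write-JobResult is a hypothetical name, the line format is an assumption, and the alert channel (email or SMS) is left to the caller.

```powershell
# Hypothetical helper: append one result line to <LogDir>\YYYY-MM-DD.txt
function Write-JobResult {
    param(
        [string]$Job,
        [int]$ExitCode,
        [string]$LogDir = 'C:\Backups\logs',
        [datetime]$Now = (Get-Date)
    )
    if (-not (Test-Path $LogDir)) { New-Item -ItemType Directory -Path $LogDir | Out-Null }
    $status = if ($ExitCode -eq 0) { 'OK' } else { 'FAILED' }
    $file = Join-Path $LogDir ($Now.ToString('yyyy-MM-dd') + '.txt')
    Add-Content -Path $file -Value "[$($Now.ToString('s'))] $Job $status exit=$ExitCode"
    $ExitCode -ne 0   # $true tells the caller to raise an alert
}
```

After a dump, a job might call Write-JobResult -Job 'pg_dump' -ExitCode $LASTEXITCODE and send the alert only when it returns $true.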

9) Health vs Schema‑Ensure (optional)

  • Health check: keep it tiny — SELECT 1, /health 200.
  • Optional schema ensure: a few idempotent DDLs (e.g., CREATE EXTENSION IF NOT EXISTS …, ALTER TABLE … ADD COLUMN IF NOT EXISTS …) after connectivity.
    Feature flag:
    ENSURE_SCHEMA_ON_BOOT=true

10) Quick PowerShell “One‑Button” Tasks (templates)

10.1 Primary: Backup-And-Ship.ps1

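$ErrorActionPreference = 'Stop'
$env:PGPASSWORD = '<pwd>'   # set first so Ensure-Postgres.ps1 inherits it

# 0) ensure postgres parity
powershell -File C:\Users\brett\dev\rabs\ops\Ensure-Postgres.ps1
if ($LASTEXITCODE -ne 0) { throw "Ensure-Postgres.ps1 failed" }

# 1) dump
$ts = (Get-Date).ToString('yyyy-MM-dd_HH-mm')
$dump = "C:\Backups\db\rabspocdb_$ts.dump"
& "C:\Program Files\PostgreSQL\16\bin\pg_dump.exe" -h localhost -U postgres -F c -d rabspocdb -f $dump
# native commands do not throw, so stop here rather than ship and log a bad dump
if ($LASTEXITCODE -ne 0) { throw "pg_dump failed with exit code $LASTEXITCODE" }

# 2) ship (SMB example)
Copy-Item $dump \\backup-host\backups\db\ -Force

# 3) log
Add-Content C:\Backups\logs\backup.log "[$(Get-Date -Format s)] dumped $dump"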

10.2 Backup: Restore-And-Start.ps1

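$ErrorActionPreference = 'Stop'
$env:PGPASSWORD = '<pwd>'   # set first so Ensure-Postgres.ps1 inherits it

# 0) stop app
Stop-Service RABS-Backend -ErrorAction SilentlyContinue
# nginx site stop: depends on your install

# 1) ensure postgres parity
powershell -File C:\Users\brett\dev\rabs\ops\Ensure-Postgres.ps1
if ($LASTEXITCODE -ne 0) { throw "Ensure-Postgres.ps1 failed" }

# 2) restore latest (only *.dump, so a schema-only .sql file is never picked)
$latest = Get-ChildItem C:\Backups\db\ -Filter *.dump | Sort-Object LastWriteTime -Descending | Select-Object -First 1
if (-not $latest) { throw "No dump found in C:\Backups\db\" }
& "C:\Program Files\PostgreSQL\16\bin\pg_restore.exe" -h localhost -U postgres -d rabspocdb --clean --if-exists $latest.FullName
# do not start the app or log success on a failed restore
if ($LASTEXITCODE -ne 0) { throw "pg_restore failed with exit code $LASTEXITCODE" }

# 3) start app
Start-Service RABS-Backend
# nginx site start…

# 4) log
Add-Content C:\Backups\logs\restore.log "[$(Get-Date -Format s)] restored $($latest.Name)"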

Adapt service names and NGINX commands to your environment.


11) Incident Runbook (Primary down)

  1. Verify Backup has restored the most recent dump (check restore log + app /health).
  2. Swap WAN IP or switch DNS when ready.
  3. Monitor 2xx/4xx/5xx, queue depths, and Postgres connections.
  4. Communicate status (internal + external if needed).
  5. When Primary is back, plan either a reverse‑sync or keep Backup as new Primary until maintenance window.
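Step 1 can be partly automated by reading the last entry from the restore log written in §10.2. The sketch below assumes that log's line format ("[timestamp] restored file"); Get-LastRestore is a hypothetical helper name.

```powershell
# Hypothetical helper: return the most recent "restored" entry from restore.log lines
function Get-LastRestore {
    param([string[]]$LogLines)
    $result = $null
    foreach ($line in $LogLines) {
        if ($line -match '^\[(?<ts>[^\]]+)\]\s+restored\s+(?<file>\S+)') {
            $result = [pscustomobject]@{
                # 's' is the sortable format produced by Get-Date -Format s
                When = [datetime]::ParseExact($Matches.ts, 's', [cultureinfo]::InvariantCulture)
                File = $Matches.file
            }
        }
    }
    $result   # $null when the log has no restore entries
}
```

For example, Get-LastRestore (Get-Content C:\Backups\logs\restore.log) yields the file name and time of the last restore, which can be compared against the current time before swapping WAN IP or DNS.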

12) Notes & Future Options

  • To reduce RPO further: streaming replication or WAL archiving + PITR.
  • To automate failover: Patroni/etc. (Linux‑friendly; Windows via containers/WSL possible but more complex).
  • Consider signed URLs or auth-checked endpoints for serving private files under storage_user/.

Status: v1.0 (initial runbook).
Next review: after Users/Auth go live and first agent pass on file manager is complete.