Backup & Disaster Recovery
Owner: Brett (CXD)
Platforms: Windows (Primary + Backup)
Services: RABS Admin (backend + admin UI), PostgreSQL, NGINX, File storage (storage/, storage_user/)
RPO target: ≤ 3 hours (daytime cadence) • RTO: Manual cutover (~30–60 min)
Principle: “State first, then evolve.” We restore schema+data from dumps and keep app/DB/files in lock‑step.
0) Terms & Goals
- RPO (Recovery Point Objective): Accept up to 3 hours of data/code loss during daytime; nightly “hard” backup reduces risk further.
- RTO (Recovery Time Objective): Manual cutover (swap WAN IP / DNS or controlled LAN “backdoor”) within ~1 hour.
- Primary: Windows host running apps, DB, file storage.
- Backup: Off‑site Windows host that regularly restores the latest code+DB+files and can become primary on demand.
- Backups restore state; migrations embody intent. We use backups to get running quickly; migrations carry old states forward when needed.
1) Inventory & Critical Paths
1.1 Data & Config
- PostgreSQL DB: rabspocdb (with pgcrypto, citext)
- Secrets: backend/.env (JWT_SECRET, APP_ENCRYPTION_KEY, SMTP_* …); must match on both machines
- File storage:
  - C:\Users\brett\dev\rabs\storage\ (public drive)
  - C:\Users\brett\dev\rabs\storage_user\ (user drive)
- Code: Git repo rabs (backend + admin)
- Reverse proxy: NGINX configs (same version, same vhosts on both machines)
1.2 Health Gates (fast, lightweight)
- DB connectivity: SELECT 1 (latency log)
- App health: /health endpoint returns 200
- Disk free: ensure enough space for dumps and storage sync
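The three gates above could be wrapped in one function, roughly as sketched below. The function name, the health URL, the drive letter, and the free-space threshold are all assumptions for illustration and are not taken from the RABS setup. It expects credentials to be available to psql (for example via $env:PGPASSWORD).

```powershell
# Sketch only: Test-HealthGates, $HealthUrl, $Drive and $MinFreeGB are assumed names/values.
function Test-HealthGates {
    param(
        [string]$PsqlPath  = 'C:\Program Files\PostgreSQL\16\bin\psql.exe',
        [string]$DbName    = 'rabspocdb',
        [string]$DbUser    = 'postgres',
        [string]$HealthUrl = 'http://localhost/health',
        [string]$Drive     = 'C',
        [double]$MinFreeGB = 20
    )

    # Gate 1: DB connectivity, timing a trivial query for the latency log.
    $dbOk = $false
    $sw = [System.Diagnostics.Stopwatch]::StartNew()
    try {
        $out = & $PsqlPath -h localhost -U $DbUser -d $DbName -t -A -c 'SELECT 1' 2>$null
        $dbOk = ($LASTEXITCODE -eq 0) -and ("$out".Trim() -eq '1')
    } catch { $dbOk = $false }   # psql missing or not runnable
    $sw.Stop()

    # Gate 2: app health endpoint must answer 200.
    $appOk = $false
    try {
        $resp = Invoke-WebRequest -Uri $HealthUrl -UseBasicParsing -TimeoutSec 5
        $appOk = ($resp.StatusCode -eq 200)
    } catch { $appOk = $false }  # connection errors and non-2xx responses land here

    # Gate 3: free disk space on the drive holding dumps and storage.
    $d = Get-PSDrive -Name $Drive -ErrorAction SilentlyContinue
    $freeGB = if ($d) { $d.Free / 1GB } else { 0 }

    [pscustomobject]@{
        DbOk        = $dbOk
        DbLatencyMs = $sw.ElapsedMilliseconds
        AppOk       = $appOk
        DiskOk      = ($freeGB -ge $MinFreeGB)
        FreeGB      = [math]::Round($freeGB, 1)
    }
}
```

A job could call `Test-HealthGates` first and skip the dump or restore when any of DbOk, AppOk, or DiskOk is false.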
2) Schedules (Daily Cadence)
2.1 Daytime (every 3 hours)
Primary (on the hour, every 3 hours: 00:00, 03:00, 06:00, …):
- Ensure Postgres parity/exts (see §4.1 script)
- Dump DB (custom format) → \\backupshare\db\rabspocdb_YYYY-MM-DD_HH-mm.dump
- Ship dump to Backup via SMB/SSH
- (File storage already GoodSync’d continuously)
Backup (stagger start +10 min):
- Stop app services (graceful)
- Ensure Postgres parity/exts (see §4.1)
- Restore DB (pg_restore)
- Start app services
- Log “restore OK” with timestamp
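The daytime cadence can be enumerated to sanity-check task triggers. This is a minimal sketch: the function name is hypothetical, and it assumes the 00:00 slot is left to the nightly hard backup, which the runbook does not state explicitly.

```powershell
# Sketch only: Get-DaytimeSlots is an assumed helper name.
# Lists each daytime Primary dump time and the staggered Backup restore time.
function Get-DaytimeSlots {
    param(
        [int]$IntervalHours  = 3,   # daytime cadence
        [int]$StaggerMinutes = 10   # Backup starts this long after Primary
    )
    $inv = [cultureinfo]::InvariantCulture
    $midnight = [datetime]::Today
    # Start at the first interval after midnight; 00:00 is assumed to be the nightly job.
    for ($h = $IntervalHours; $h -lt 24; $h += $IntervalHours) {
        $primary = $midnight.AddHours($h)
        [pscustomobject]@{
            Primary = $primary.ToString('HH:mm', $inv)
            Backup  = $primary.AddMinutes($StaggerMinutes).ToString('HH:mm', $inv)
        }
    }
}
```

`Get-DaytimeSlots | Format-Table` then gives a quick reference for setting up the scheduled tasks on both hosts.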
2.2 Nightly “Hard” Backup (00:00–01:00)
Primary @ 00:00:
- Stop app services
- Ensure Postgres parity/exts
- Dump DB (custom format + schema‑only dump)
- Archive code (tag commit + zip optional)
- Start app services @ ~00:30
Backup @ 00:30:
- Stop app services
- Ensure Postgres parity/exts
- Restore DB from the latest nightly dump
- Start app services
- Log “nightly restored OK”
GoodSync keeps storage/ and storage_user/ current continuously. If strict atomicity is ever required (DB references must exist on disk), add a brief file‑sync freeze while dumping.
3) Dump & Restore Procedures (PostgreSQL)
3.1 Dump (Primary)
Use custom format (-F c). pg_restore can restore it selectively and, with -j, in parallel:
$env:PGPASSWORD = '<postgres_password>'
$ts = (Get-Date).ToString('yyyy-MM-dd_HH-mm')
$dump = "C:\Backups\db\rabspocdb_$ts.dump"
& "C:\Program Files\PostgreSQL\16\bin\pg_dump.exe" `
  -h localhost -U postgres -F c -d rabspocdb -f $dump
# (Optional) schema-only file:
& "C:\Program Files\PostgreSQL\16\bin\pg_dump.exe" `
  -h localhost -U postgres -F p -s -d rabspocdb -f ("C:\Backups\db\rabspocdb_schema_$ts.sql")
3.2 Restore (Backup)
$env:PGPASSWORD = '<postgres_password>'
$dump = "C:\Backups\db\rabspocdb_YYYY-MM-DD_HH-mm.dump"
# restore in-place (drops/creates objects as needed)
& "C:\Program Files\PostgreSQL\16\bin\pg_restore.exe" `
  -h localhost -U postgres -d rabspocdb --clean --if-exists $dump
Notes
- Ensure identical major Postgres versions on both machines.
- If extension errors occur, run the following, then rerun the failing part or the whole restore:
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE EXTENSION IF NOT EXISTS citext;
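Before stopping services for a restore, it may be worth confirming that the shipped dump is readable at all. The sketch below uses pg_restore --list, which only prints the archive's table of contents and does not touch the database. The function name is an assumption, and this check is a suggestion rather than part of the existing procedure.

```powershell
# Sketch only: Test-DumpReadable is an assumed helper name.
function Test-DumpReadable {
    param(
        [Parameter(Mandatory)][string]$DumpPath,
        [string]$PgRestore = 'C:\Program Files\PostgreSQL\16\bin\pg_restore.exe'
    )
    # A missing or zero-byte file usually means the copy was interrupted.
    if (-not (Test-Path -LiteralPath $DumpPath -PathType Leaf)) { return $false }
    if ((Get-Item -LiteralPath $DumpPath).Length -eq 0) { return $false }
    try {
        # --list reads the archive header/TOC; a truncated or corrupt file makes it exit non-zero.
        & $PgRestore --list $DumpPath *> $null
        return ($LASTEXITCODE -eq 0)
    } catch {
        return $false   # pg_restore itself could not be started
    }
}
```

For example, a restore job could keep the app running on the previous state whenever `Test-DumpReadable -DumpPath $dump` returns false.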
4) Nightly Checks & Parity
4.1 Ensure Postgres Parity & Extensions (PowerShell)
Manifest: C:\Users\brett\dev\rabs\ops\pg_extensions.json
{
"required_server_major": "16",
"extensions": [
{"name":"pgcrypto", "min_version":"1.3"},
{"name":"citext", "min_version":"1.6"}
]
}
Script: C:\Users\brett\dev\rabs\ops\Ensure-Postgres.ps1
param(
[string]$DbHost="localhost",[int]$DbPort=5432,[string]$DbName="rabspocdb",
[string]$DbUser="postgres",[string]$PgBin="C:\Program Files\PostgreSQL\16\bin",
[string]$Manifest="C:\Users\brett\dev\rabs\ops\pg_extensions.json",
[string]$EnvFile="C:\Users\brett\dev\rabs\backend\.env",
[string]$EnvHashFile="C:\Users\brett\dev\rabs\ops\.env.sha256"
)
$ErrorActionPreference="Stop"; $psql=Join-Path $PgBin "psql.exe"
function Invoke-Psql([string]$sql){ & $psql -h $DbHost -p $DbPort -U $DbUser -d $DbName -t -A -c $sql 2>$null }
$serverVer=Invoke-Psql "show server_version_num;"
if(-not $serverVer){throw "Cannot query server_version_num"}
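# server_version_num looks like 160004; divide instead of slicing so the digit count does not matter
$serverMajor=[int][math]::Floor([int]$serverVer/10000)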
$manifestObj=Get-Content $Manifest|ConvertFrom-Json
if($serverMajor -ne [int]$manifestObj.required_server_major){throw "Server major $serverMajor != required $($manifestObj.required_server_major)"}
$installed=@{}; Invoke-Psql "select extname, extversion from pg_extension;"|%{ $p=$_.Split("|"); if($p.Count-ge 2){$installed[$p[0]]=$p[1]}}
foreach($ext in $manifestObj.extensions){ if(-not $installed.ContainsKey($ext.name)){ Invoke-Psql "create extension if not exists $($ext.name);" } }
$installed=@{}; Invoke-Psql "select extname, extversion from pg_extension;"|%{ $p=$_.Split("|"); if($p.Count-ge 2){$installed[$p[0]]=$p[1]}}
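# compare as [version], not as strings (as strings, "1.10" would sort below "1.3"); assumes major.minor style versions
foreach($ext in $manifestObj.extensions){ if($installed.ContainsKey($ext.name) -and $ext.min_version -and ([version]$installed[$ext.name] -lt [version]$ext.min_version)){ Write-Warning "$($ext.name) $($installed[$ext.name]) < $($ext.min_version)"; } }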
if(Test-Path $EnvFile){ $hash=(Get-FileHash -Algorithm SHA256 $EnvFile).Hash; if(Test-Path $EnvHashFile){ $prev=Get-Content $EnvHashFile -TotalCount 1; if($prev -ne $hash){ Write-Warning ".env changed since last run" } } Set-Content -Path $EnvHashFile -Value $hash -NoNewline } else { Write-Warning "Missing .env at $EnvFile" }
Write-Host "Postgres parity check complete"
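Run this before dumping on Primary and before restoring on Backup. The script calls psql without a password prompt and discards psql's error output, so set $env:PGPASSWORD (or configure pgpass.conf) before running it. Otherwise it fails with only "Cannot query server_version_num".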
4.2 Secrets Parity
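- .env on Primary and Backup must match (JWT_SECRET, APP_ENCRYPTION_KEY, SMTP_*, STORAGE_*_DIR).
- The script above stores a .env hash on each machine and warns when the file has changed since the previous run on that machine. It does not compare the two machines with each other, so compare the stored hashes across hosts to detect drift between them.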
4.3 Config Parity (NGINX, service files)
- Keep a small checksum list (optional). Fail the job on drift if you prefer strictness.
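A checksum list of this kind could be built with Get-FileHash, as in the following sketch. Both function names are hypothetical, and which files go on the list (NGINX configs, service definitions) is left to the operator.

```powershell
# Sketch only: Get-ConfigChecksums and Compare-ConfigChecksums are assumed helper names.

# Produce one {Path, Hash} record per config file; missing files are flagged rather than skipped.
function Get-ConfigChecksums {
    param([Parameter(Mandatory)][string[]]$Paths)
    foreach ($p in $Paths) {
        if (Test-Path -LiteralPath $p -PathType Leaf) {
            [pscustomobject]@{ Path = $p; Hash = (Get-FileHash -LiteralPath $p -Algorithm SHA256).Hash }
        } else {
            [pscustomobject]@{ Path = $p; Hash = 'MISSING' }
        }
    }
}

# Return the paths whose current hash differs from the expected (baseline) hash.
function Compare-ConfigChecksums {
    param(
        [Parameter(Mandatory)][object[]]$Expected,
        [Parameter(Mandatory)][object[]]$Actual
    )
    $actualByPath = @{}
    foreach ($a in $Actual) { $actualByPath[$a.Path] = $a.Hash }
    foreach ($e in $Expected) {
        if ($actualByPath[$e.Path] -ne $e.Hash) { $e.Path }
    }
}
```

The baseline can be saved with `Get-ConfigChecksums $paths | ConvertTo-Json | Set-Content <baseline file>` and reloaded with `Get-Content <baseline file> -Raw | ConvertFrom-Json`. For strictness, throw when `Compare-ConfigChecksums` returns any path.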
5) Storage Folders Sync
- GoodSync runs continuously for:
  - storage/ ↔ backup storage/
  - storage_user/ ↔ backup storage_user/
- If strict consistency is ever required (DB → file reference must exist), pause file sync for 1–2 minutes during DB dump and resume after transfer.
6) Start/Stop Order (to avoid mixed states)
On Backup during restore:
- Stop app services (backend/API, NGINX site, workers)
- Ensure Postgres parity
- Restore DB dump
- Start app services (backend first, then NGINX)
On Primary during nightly hard backup:
- Stop app services
- Ensure Postgres parity
- Dump DB
- Start app services
7) Failover Plan (Manual)
- Confirm last Backup restore was OK (health checks).
- Swap WAN IP or DNS when ready.
- Monitor logs, 2xx/4xx/5xx, and Postgres connections.
- When Primary returns, either:
- keep Backup as Primary until next maintenance window, or
- perform reverse‑sync + controlled cutback.
Note: If both hosts served traffic during a split‑brain scenario (shouldn’t happen here), reconcile DB before resuming dual cadence.
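The first step of the plan ("last Backup restore was OK") can be partly checked from the restore log. The sketch below parses a line in the format written by Restore-And-Start.ps1 in §10.2 and tests its age. The function name and the 3.5 hour default (3 hour cadence plus some slack for the stagger) are assumptions.

```powershell
# Sketch only: Test-RestoreFresh and the 3.5 h default are assumptions.
# Expects a log line like: [2025-01-01T12:10:00] restored rabspocdb_2025-01-01_12-00.dump
function Test-RestoreFresh {
    param(
        [Parameter(Mandatory)][AllowEmptyString()][string]$LogLine,
        [double]$MaxAgeHours = 3.5,
        [datetime]$Now = (Get-Date)
    )
    if ($LogLine -notmatch '^\[(?<ts>[^\]]+)\]\s+restored\s+\S+') { return $false }
    $ts = [datetime]::MinValue
    # 's' is the sortable format that "Get-Date -Format s" writes.
    $ok = [datetime]::TryParseExact($Matches['ts'], 's', [cultureinfo]::InvariantCulture,
        [System.Globalization.DateTimeStyles]::None, [ref]$ts)
    if (-not $ok) { return $false }
    $ageHours = ($Now - $ts).TotalHours
    # Reject timestamps from the future as well as stale ones.
    return ($ageHours -ge 0 -and $ageHours -le $MaxAgeHours)
}
```

Before cutover this could be run as `Test-RestoreFresh -LogLine (Get-Content C:\Backups\logs\restore.log -Tail 1)`. A false result means the latest restore is stale or the log line is not in the expected format, so the Backup should be inspected before swapping WAN IP or DNS.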
8) Testing & Drills
- Monthly drill: disable Primary network for 15–30 min, cutover to Backup, verify end‑to‑end.
- Nightly job logs: write success/exit codes to C:\Backups\logs\YYYY-MM-DD.txt.
- Alerting: simple email or SMS on failure of dump/restore tasks.
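Logging and alerting could share one helper, sketched below. The function name and the log line layout are assumptions. The alert itself is passed in as a script block so the email or SMS mechanism stays the operator's choice.

```powershell
# Sketch only: Write-JobResult and the line layout are assumptions.
# Appends one line per task to <LogDir>\YYYY-MM-DD.txt and invokes an alert callback on failure.
function Write-JobResult {
    param(
        [Parameter(Mandatory)][string]$Task,
        [Parameter(Mandatory)][int]$ExitCode,
        [string]$LogDir = 'C:\Backups\logs',
        [scriptblock]$OnFailure
    )
    if (-not (Test-Path -LiteralPath $LogDir)) {
        New-Item -ItemType Directory -Path $LogDir -Force | Out-Null
    }
    $status  = if ($ExitCode -eq 0) { 'OK' } else { 'FAIL' }
    $line    = '[{0}] {1} {2} exit={3}' -f (Get-Date -Format s), $Task, $status, $ExitCode
    $logFile = Join-Path $LogDir ((Get-Date -Format 'yyyy-MM-dd') + '.txt')
    Add-Content -LiteralPath $logFile -Value $line
    # Only failures trigger the alert callback; it receives the log line as its argument.
    if ($ExitCode -ne 0 -and $OnFailure) { & $OnFailure $line }
    return $line
}
```

A dump job might call `Write-JobResult -Task 'pg_dump' -ExitCode $LASTEXITCODE -OnFailure { param($l) <send email or SMS with $l> }` right after pg_dump returns.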
9) Health vs Schema‑Ensure (optional)
- Health check: keep it tiny: SELECT 1, /health 200.
- Optional schema ensure: a few idempotent DDLs (e.g., CREATE EXTENSION IF NOT EXISTS …, ALTER TABLE … ADD COLUMN IF NOT EXISTS …) after connectivity.
Feature flag: ENSURE_SCHEMA_ON_BOOT=true
10) Quick PowerShell “One‑Button” Tasks (templates)
10.1 Primary: Backup-And-Ship.ps1
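$ErrorActionPreference = 'Stop'
$env:PGPASSWORD='<pwd>'   # set first: Ensure-Postgres.ps1 calls psql and needs it too
# 0) ensure postgres parity (abort if the check fails)
powershell -File C:\Users\brett\dev\rabs\ops\Ensure-Postgres.ps1
if ($LASTEXITCODE -ne 0) { throw "Ensure-Postgres.ps1 failed (exit $LASTEXITCODE)" }
# 1) dump
$ts=(Get-Date).ToString('yyyy-MM-dd_HH-mm')
$dump="C:\Backups\db\rabspocdb_$ts.dump"
& "C:\Program Files\PostgreSQL\16\bin\pg_dump.exe" -h localhost -U postgres -F c -d rabspocdb -f $dump
if ($LASTEXITCODE -ne 0) { throw "pg_dump failed (exit $LASTEXITCODE)" }   # never ship a failed dump
# 2) ship (SMB example)
Copy-Item $dump \\backup-host\backups\db\ -Force
# 3) log
Add-Content C:\Backups\logs\backup.log "[$(Get-Date -Format s)] dumped $dump"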
10.2 Backup: Restore-And-Start.ps1
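$ErrorActionPreference = 'Stop'
$env:PGPASSWORD='<pwd>'   # set first: Ensure-Postgres.ps1 calls psql and needs it too
# 0) stop app
Stop-Service RABS-Backend -ErrorAction SilentlyContinue
# nginx site stop: depends on your install
# 1) ensure postgres parity (abort if the check fails)
powershell -File C:\Users\brett\dev\rabs\ops\Ensure-Postgres.ps1
if ($LASTEXITCODE -ne 0) { throw "Ensure-Postgres.ps1 failed (exit $LASTEXITCODE)" }
# 2) restore latest (only *.dump, so schema-only .sql files are never picked)
$latest = Get-ChildItem C:\Backups\db\ -Filter *.dump | Sort-Object LastWriteTime -Descending | Select-Object -First 1
if (-not $latest) { throw "No .dump file found in C:\Backups\db\" }
& "C:\Program Files\PostgreSQL\16\bin\pg_restore.exe" -h localhost -U postgres -d rabspocdb --clean --if-exists $latest.FullName
if ($LASTEXITCODE -ne 0) { throw "pg_restore failed (exit $LASTEXITCODE); app left stopped" }
# 3) start app
Start-Service RABS-Backend
# nginx site start…
# 4) log
Add-Content C:\Backups\logs\restore.log "[$(Get-Date -Format s)] restored $($latest.Name)"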
Adapt service names and NGINX commands to your environment.
11) Incident Runbook (Primary down)
- Verify Backup has restored the most recent dump (check restore log + app /health).
- Swap WAN IP or switch DNS when ready.
- Monitor 2xx/4xx/5xx, queue depths, and Postgres connections.
- Communicate status (internal + external if needed).
- When Primary is back, plan either a reverse‑sync or keep Backup as new Primary until maintenance window.
12) Notes & Future Options
- To reduce RPO further: streaming replication or WAL archiving + PITR.
- To automate failover: Patroni/etc. (Linux‑friendly; Windows via containers/WSL possible but more complex).
- Consider signed URLs or auth-checked endpoints for serving private files under storage_user/.
Status: v1.0 (initial runbook).
Next review: after Users/Auth go live and first agent pass on file manager is complete.