Patterns Over Patches: A Frontend Resilience Sweep
Spent yesterday's bench shift turning a single bug fix into a whole methodology. What started as "the cover-request toast keeps lying" ended with a regex map that defuses an entire class of 30-second timeout failures across the admin app, plus a structured backlog of repo-wide sweeps so the next pattern-bug we find becomes a one-line addition to a living index rather than another forensic dig. The sidebar got a tidy at the same time. Payroll is now just called PAYROLL.
The setup
Brett ran the cover-request feature on a Saturday evening. The flow:
- Browser POST to /api/v1/hr/shifts/request-cover
- Backend loops through 69 available staff and dispatches an SMS to each via Twilio (~600 ms per send)
- Discord notification fires
- Shift gets marked open for cover in the DB
- Response returned to the browser
Total elapsed: about 36 seconds. Backend logs show every step completed: 69 SMSes sent, Discord notification posted, shift marked. But the browser saw an error toast at the 30-second mark and went red.
The user is now in the worst state possible: the operation succeeded, the UI says it failed, and reflexively retrying would have fired 69 more SMSes. Brett caught it, asked the right question, and the diagnosis took ten seconds: default api-client timeout = 30 seconds; route legitimately needs ~40. The abort fires, the toast fires, the server quietly finishes a few seconds later.
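A back-of-envelope check on those numbers. This is only a sketch: it assumes the sends run strictly one after another at the quoted ~600 ms each, and the constant names are ours, purely for illustration.

```javascript
// Illustrative figures taken from the flow above; names are hypothetical.
const STAFF_COUNT = 69;          // SMS recipients in the broadcast
const MS_PER_SMS = 600;          // approximate Twilio send time per message
const CLIENT_TIMEOUT_MS = 30000; // default api-client timeout

// Time spent in the SMS loop alone if sends are sequential.
const smsLoopMs = STAFF_COUNT * MS_PER_SMS;

// How many sends have already gone out when the browser aborts.
const smsSentBeforeAbort = Math.floor(CLIENT_TIMEOUT_MS / MS_PER_SMS);

console.log(smsLoopMs, smsLoopMs > CLIENT_TIMEOUT_MS, smsSentBeforeAbort);
```

At that rate the loop alone lands a little over 40 seconds, the same ballpark as the observed run, and most of the SMSes are already out the door when the client gives up. That is why a reflexive retry is so costly here.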
The pattern
This is the fourth time in three weeks we've found this exact bug:
| Date | Route | What aborted | Time the server actually took |
|---|---|---|---|
| 2026-05-12 | /payrun/summary | 85-payslip Xero fetch | ~95s |
| 2026-05-12 | /payrun/push-timesheets | Per-staff EmploymentBasis loop | ~3min |
| 2026-05-12 | /payrun/push-allowances | Per-staff Xero push | ~4min |
| 2026-05-16 | /hr/shifts/request-cover | 69-SMS broadcast | ~40s |
Each was patched with {timeout: 600000} inline. Each patch was a forensic exercise: someone noticed the bug, opened the page JS, found the call, added the override. The pattern was obvious in retrospect, invisible in the moment.
So we did it once more, properly. Not as four inline patches across four files. As one regex-pattern map at the top of api-client.js.
What we built
A SLOW_ENDPOINT_PATTERNS table that the api-client checks before every request. Tiered by likely duration:
const SLOW_ENDPOINT_PATTERNS = [
  // 10 minutes: big external-API loops
  { rx: /\/payroll\/xero\/push-timesheets\b/, ms: 600000 },
  { rx: /\/staff\/xero\/sync-all\b/, ms: 600000 },
  { rx: /\/hr\/contracts\/batches\/[^/]+\/process\b/, ms: 600000 },
  // 5 minutes: PDFs, bulk SMS, recalcs, scans
  { rx: /\/hr\/shifts\/request-cover\b/, ms: 300000 },
  { rx: /\/billing-history\/report\/[^/]+\/save\b/, ms: 300000 },
  { rx: /\/catchup-surveys\/[^/]+\/send\b/, ms: 300000 },
  // 2 minutes: single external API calls
  { rx: /\/reggie\/(sms|call)\b/, ms: 120000 },
  // ...38 patterns total
];
The resolver runs in a three-tier waterfall. Caller-passed {timeout: x} still wins absolutely. Pattern map fires only when no explicit override is given. Pattern miss falls through to the original 30-second default. Routes not in the map have zero behaviour change.
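The waterfall can be sketched roughly like this. It is a minimal illustration, not the code in api-client.js: the function name resolveTimeout, the DEFAULT_TIMEOUT_MS constant and the options shape are assumptions, and only two of the 38 patterns are copied in.

```javascript
// Hypothetical sketch of the three-tier resolver; names are assumptions.
const DEFAULT_TIMEOUT_MS = 30000; // the original 30-second default

// Two entries copied from the map above; the real table has 38.
const SLOW_ENDPOINT_PATTERNS = [
  { rx: /\/payroll\/xero\/push-timesheets\b/, ms: 600000 },
  { rx: /\/hr\/shifts\/request-cover\b/, ms: 300000 },
];

function resolveTimeout(url, options = {}) {
  // Tier 1: an explicit caller override always wins.
  if (typeof options.timeout === "number") return options.timeout;

  // Tier 2: first matching entry in the pattern map.
  const hit = SLOW_ENDPOINT_PATTERNS.find((p) => p.rx.test(url));
  if (hit) return hit.ms;

  // Tier 3: everything else keeps the untouched default.
  return DEFAULT_TIMEOUT_MS;
}

console.log(resolveTimeout("/api/v1/hr/shifts/request-cover"));
console.log(resolveTimeout("/api/v1/hr/shifts/request-cover", { timeout: 5000 }));
console.log(resolveTimeout("/api/v1/staff/list"));
```

Because the first match wins in this sketch, more specific patterns would need to sit above broader ones if two could ever match the same URL.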
Result: an entire class of bugs is gone. Including the ones we don't know about yet.
Why this is structurally safer than per-file patches
| Approach | Risk |
|---|---|
| Per-file {timeout: 300000} inline | Drifts. Devs forget. New code reintroduces the bug. One more place to maintain per call site. |
| Bump the default to 5 minutes | Every fast endpoint now waits 5 minutes on real failure. UX regression. |
| Central pattern map | One line in one file per known-slow path. Fast endpoints unchanged. New code automatically benefits. |
And we improved the error too
When a real timeout does fire — say a route genuinely hangs, or there's a true outage — the new error message includes the elapsed seconds and an explicit warning:
Request timed out after 600s. The server may still be processing —
verify the action completed before retrying.
That last sentence matters. For routes that send SMSes or create payslips, blindly retrying duplicates the work. The user now gets nudged to check before pulling the trigger again.
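One way to produce that message, and to tell our own timeout apart from a caller-initiated abort, is sketched below. Everything here is an assumption for illustration (the requestWithTimeout and timeoutMessage names, the injectable fetchImpl, the TimeoutError name); only the message wording comes from the real client.

```javascript
// Hypothetical sketch; function names and option shape are assumptions.
function timeoutMessage(elapsedMs) {
  const seconds = Math.round(elapsedMs / 1000);
  // \u2014 is the dash used in the message shown above.
  return `Request timed out after ${seconds}s. The server may still be processing \u2014 ` +
    "verify the action completed before retrying.";
}

async function requestWithTimeout(url, { timeout = 30000, signal, fetchImpl = fetch, ...init } = {}) {
  const controller = new AbortController();
  const started = Date.now();
  let timedOut = false;

  // Our timer: flag the abort as a timeout before firing it.
  const timer = setTimeout(() => {
    timedOut = true;
    controller.abort();
  }, timeout);

  // Forward a caller-initiated abort without setting the timeout flag.
  const onCallerAbort = () => controller.abort();
  if (signal) {
    if (signal.aborted) controller.abort();
    else signal.addEventListener("abort", onCallerAbort, { once: true });
  }

  try {
    return await fetchImpl(url, { ...init, signal: controller.signal });
  } catch (err) {
    if (timedOut) {
      const e = new Error(timeoutMessage(Date.now() - started));
      e.name = "TimeoutError";
      throw e;
    }
    throw err; // caller abort or network failure: pass through unchanged
  } finally {
    clearTimeout(timer);
    if (signal) signal.removeEventListener("abort", onCallerAbort);
  }
}
```

The timedOut flag is the whole trick: both paths end in the same abort, so the flag is what lets the catch block decide whether the "may still be processing" warning applies.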
The bigger move: a folder for sweeps
The timeout map was the obvious win. But it's also the template for a class of fix that the codebase needs more of — pattern-based, fail-safe, additive. So we stood up admin/tasks/tasks_active/repoSCANS/:
repoSCANS/
├── 00_README.md — explains the system
├── 00_BACKLOG.md — master index with statuses
├── _TEMPLATE.md — copy this for new scans
│
├── SCAN_HOT_api-timeout-map.md [DONE: 2026-05-16]
├── SCAN_HOT_error-message-objects.md [NEW]
├── SCAN_HOT_template-string-objects.md [NEW]
├── SCAN_HOT_progress-modal-coverage.md [NEW]
│
├── SCAN_WARM_xero-loop-throttle.md [NEW]
├── SCAN_WARM_promise-all-external.md [NEW]
├── SCAN_WARM_eventsource-backoff.md [NEW]
├── SCAN_WARM_localstorage-quota.md [NEW]
├── SCAN_WARM_backend-timeouts.md [NEW]
│
├── SCAN_COOL_silent-api-calls.md [NEW]
├── SCAN_COOL_date-timezone.md [NEW]
├── SCAN_COOL_orphaned-files.md [NEW]
├── SCAN_COOL_active-staff-filter.md [NEW]
├── SCAN_COOL_pdf-helpers-consolidation.md [NEW]
│
├── SCAN_COLD_dark-theme-violations.md [NEW]
├── SCAN_COLD_aria-labels.md [NEW]
├── SCAN_COLD_auth-middleware-audit.md [NEW]
└── SCAN_COLD_console-log-leakage.md [NEW]
Each scan is a markdown file with the same shape:
- Hypothesis — what's the bug class? What evidence triggered the scan?
- Detection — copy-paste commands to find violations
- Remediation strategy — structural fix preferred
- Risk assessment — break risk, with reasoning
- Estimated effort — first run vs re-run
- Results — filled after running
- Status Log — append-only timestamped log
The Status Log is the bit I'm most pleased with. Not checkboxes. Not a [done] flag. A running timestamped record:
- 2026-05-16 22:30 AEST — Initial run by Reginald. SLOW_ENDPOINT_PATTERNS
deployed. 38 patterns. No downstream files needed patching.
- 2026-08-15 09:30 AEST — Re-run after Email-4 launch. Added 3 patterns
for email-pipeline endpoints. Removed 1 retired pattern.
Because sweeps aren't one-shot. New code can re-introduce the violation. A scan that comes up clean three times in a row over six months can be retired (renamed RETIRED_SCAN_*, kept on disk for history). A scan that finds new violations on each re-run is doing real ongoing work.
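The retirement rule is simple enough to write down. The sketch below is one reading of it, not an agreed definition: it assumes "clean three times in a row over six months" means the three most recent runs found nothing and span at least roughly six months, and the canRetire name and the { date, violations } record shape are made up for illustration.

```javascript
// Hypothetical sketch; record shape and threshold are assumptions.
const SIX_MONTHS_MS = 182 * 24 * 60 * 60 * 1000; // roughly six months

function canRetire(runs) {
  // Sort a copy oldest-to-newest and keep the three most recent runs.
  const lastThree = [...runs]
    .sort((a, b) => new Date(a.date) - new Date(b.date))
    .slice(-3);
  if (lastThree.length < 3) return false;

  const allClean = lastThree.every((run) => run.violations === 0);
  const spanMs = new Date(lastThree[2].date) - new Date(lastThree[0].date);
  return allClean && spanMs >= SIX_MONTHS_MS;
}

// Made-up run history, not real scan results.
console.log(canRetire([
  { date: "2026-05-16", violations: 0 },
  { date: "2026-08-15", violations: 0 },
  { date: "2026-11-20", violations: 0 },
]));
```

A scan that keeps finding violations never satisfies the first condition, which matches the intent: it stays active because it is still doing work.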
What's in the backlog
The four tiers stack like this:
| Tier | Logic | Example |
|---|---|---|
| HOT | Direct continuation of today's pattern work | Replace bare error.message with errMsg(e) helper — kills [object Object] toasts |
| WARM | Same family of "silent failure" bugs | Backend Xero loops without 1.1s throttle |
| COOL | Observability / maintainability uplift | Audit SQL queries that omit WHERE active = true |
| COLD | Speculative, low priority — until they're not | Auth middleware audit: promotes to HOT if any finding |
The auth-middleware audit is COLD because we have no specific reason to think there's a finding, but the severity of any finding is automatic-HOT. The folder convention lets us flag that explicitly in the file rather than guessing.
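For the HOT row in that table, the errMsg(e) helper could look something like the sketch below. Only the name comes from the backlog; the body, the fallback text and the shapes it handles are assumptions about what the eventual helper might cover.

```javascript
// Hypothetical sketch of errMsg(e); the real helper may differ.
function errMsg(e) {
  if (e == null) return "Unknown error";
  if (typeof e === "string") return e;

  // The normal case: an Error (or error-like object) with a string message.
  if (typeof e.message === "string" && e.message) return e.message;

  // Some API responses carry the text under `error` instead.
  if (typeof e.error === "string") return e.error;

  // A nested object is what produces "[object Object]" in a template
  // string, so serialise it rather than letting it coerce.
  try {
    return JSON.stringify(e.message ?? e) ?? String(e);
  } catch {
    return String(e); // last resort, e.g. circular structures
  }
}

console.log(errMsg(new Error("boom")));
console.log(errMsg({ message: { code: 42 } }));
```

The scan itself would then be a search for bare error.message inside toast and template-string calls, with each hit swapped for errMsg(e).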
Sidebar tidying (since we were in there)
Three small fixes while the bonnet was up:
- The old PAYROLL sidebar entry (pointing at v1) has been removed.
- The new CREATE PAYROLL V2 entry is now just PAYROLL. Payroll v2 is payroll.
- Reggie GPT in the sidebar is now REGGIE-GPT, matching the all-caps convention of every other entry.
The old page_payroll.html / .js files are still on disk. They'll show up in the orphaned-files scan when it's run.
What the user sees, end to end
Before today:
- Long-running operation kicks off → spinner spins for 30 seconds → red error toast → user has no idea whether it actually worked → reflexively retries → duplicates the work
After today:
- Long-running operation kicks off → spinner spins as long as the server needs (up to 10 minutes for the biggest jobs) → green tick when it actually finishes → user trusts the UI again
For the new error case (genuine outage, not false alarm):
- Long-running operation kicks off → spinner spins → after 5-10 minutes the new error message explicitly tells the user the server may still be processing and to verify before retrying → user checks, doesn't double-fire, doesn't duplicate the work
Quick reference for future agents
If you find a bug that fits a pattern (smells like every other bug of the same shape), don't patch it inline first. Ask:
- Is this the third or fourth time we've seen this exact class of bug?
- Could a regex or a helper function eliminate the whole class instead of this one instance?
- Is the fix purely additive (can't break what currently works)?
If yes to all three, draft a scan in repoSCANS/ first. The inline patch can come second.
Status check
| Thing | State |
|---|---|
| SLOW_ENDPOINT_PATTERNS map | Live in api-client.js, 38 patterns |
| Better timeout error message | Live, includes elapsed seconds + "may still be processing" |
| Distinguishes caller-abort from our-timeout | Live |
| Sidebar updates | Live |
| repoSCANS folder | Live with 21 files (18 scans: 1 done, 17 defined; plus README/BACKLOG/TEMPLATE) |
| Manual page on frontend resilience | Live at docs/04-system-implementation/02-core-modules/18-frontend-resilience-standards.md |
| Next scan queued | SCAN_HOT_error-message-objects — 15-minute win |
The 30-second wall is gone. We won't notice it not being there, which is exactly the point.
Written by Henry, Type-2 Field Engineer. Bench shift on a Saturday evening, after Brett caught the cover-request bug and asked the right follow-up question.