Frontend Resilience Standards
Overview
This document defines the standards for resilient HTTP requests, error handling, and long-running operation UX in the admin frontend. The patterns documented here exist to prevent an entire class of "the server succeeded but the UI lied" bugs.
The flagship implementation is admin/src/js/shared/api-client.js, which contains the SLOW_ENDPOINT_PATTERNS map. Every new long-running route should be added there, not patched per-caller.
Why this exists
A recurring bug pattern caused user-facing failures across the admin app for months. The flow:
- User triggers a long-running operation (push payroll, generate PDF, broadcast SMS, run KPI recalc)
- The browser POST starts
- Backend works correctly — loops through external API calls, sends SMS, writes to DB
- Default api-client timeout of 30 seconds fires
- Request aborts client-side
- Error toast displayed
- Backend completes successfully a few seconds later
- User sees red, doesn't trust the UI, reflexively retries
The 4th occurrence in three weeks (the cover-request feature on 2026-05-16) triggered the structural fix documented below.
The three-tier timeout resolution
The api-client.js request() method resolves a request's timeout in this order:
const resolvedTimeout = options.timeout // 1. Explicit caller override
|| resolveSlowTimeout(endpoint) // 2. Pattern-map lookup
|| this.timeout; // 3. Client default (30s)
- Explicit caller override wins absolutely. Existing inline {timeout: x} overrides are preserved.
- Pattern map auto-bumps timeout for known-slow paths.
- Client default (30s) for everything else. Hundreds of fast endpoints — forum, messenger, staff CRUD, calendar — are untouched.
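The resolution order can be sketched as a small standalone function. This is a minimal illustration, not the contents of api-client.js: the three pattern entries are a hypothetical subset (one per tier), the body of resolveSlowTimeout is an assumption since only its name appears in the snippet, and resolveTimeout is a made-up wrapper standing in for the expression inside request().

```javascript
// Hypothetical subset of the pattern list: one entry per tier.
const SLOW_ENDPOINT_PATTERNS = [
  { rx: /\/payroll\/xero\/push-timesheets\b/, ms: 600000 }, // 10-minute tier
  { rx: /\/hr\/shifts\/request-cover\b/, ms: 300000 },      // 5-minute tier
  { rx: /\/reggie\/sms\b/, ms: 120000 },                    // 2-minute tier
];

// Assumed implementation: return the first matching entry's timeout, else undefined.
function resolveSlowTimeout(endpoint) {
  const hit = SLOW_ENDPOINT_PATTERNS.find(({ rx }) => rx.test(endpoint));
  return hit ? hit.ms : undefined;
}

// Same three-step fallback as the snippet above, with the client default passed in.
function resolveTimeout(endpoint, options = {}, clientDefault = 30000) {
  return options.timeout           // 1. Explicit caller override
    || resolveSlowTimeout(endpoint) // 2. Pattern-map lookup
    || clientDefault;               // 3. Client default (30s)
}

console.log(resolveTimeout('/payroll/xero/push-timesheets'));                    // 600000
console.log(resolveTimeout('/payroll/xero/push-timesheets', { timeout: 5000 })); // 5000
console.log(resolveTimeout('/forum/posts'));                                     // 30000
```

Because the chain uses ||, a caller passing timeout: 0 does not disable the timeout; the value is falsy, so resolution falls through to the pattern map or the client default.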
Resolution properties
| Property | Why it matters |
|---|---|
| Pure additive | Routes NOT in the map have identical behaviour to before. Zero regression risk. |
| Explicit override wins | Callers that have already tuned a timeout are preserved. |
| Pattern miss = old behaviour | If a future slow route is added but the regex isn't, it just falls through to the 30s default — the same failure mode that existed before. No new bug introduced. |
| Caller AbortSignal distinguished from timeout | A search-debounce abort returns "Request aborted" with status 0; a true timeout returns "Request timed out after Xs" with status 408. |
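One way to tell a caller abort from a timeout is to record which of the two fired before aborting the underlying request. The sketch below is an assumption about how that could look, not the code in api-client.js: requestWithTimeout, the injected fetchImpl parameter, and the ApiError stand-in (including its status and details property names) are all hypothetical.

```javascript
// Stand-in for the real ApiError; property names are assumptions.
class ApiError extends Error {
  constructor(message, status, details = {}) {
    super(message);
    this.name = 'ApiError';
    this.status = status;
    this.details = details;
  }
}

// fetchImpl is any fetch-compatible function (window.fetch in the browser).
async function requestWithTimeout(fetchImpl, url, { signal, timeoutMs = 30000 } = {}) {
  const controller = new AbortController();
  let timedOut = false;

  // The timer sets a flag before aborting, so the catch block can tell the cases apart.
  const timer = setTimeout(() => {
    timedOut = true;
    controller.abort();
  }, timeoutMs);

  // Forward a caller-initiated abort (e.g. search debounce) to the internal controller.
  const onCallerAbort = () => controller.abort();
  if (signal) {
    if (signal.aborted) controller.abort();
    else signal.addEventListener('abort', onCallerAbort, { once: true });
  }

  try {
    return await fetchImpl(url, { signal: controller.signal });
  } catch (err) {
    if (timedOut) {
      throw new ApiError(
        `Request timed out after ${Math.round(timeoutMs / 1000)}s`,
        408,
        { timeout: true, timeoutMs }
      );
    }
    if (controller.signal.aborted) {
      throw new ApiError('Request aborted', 0, { aborted: true });
    }
    throw err; // Network or other failure: pass through unchanged.
  } finally {
    clearTimeout(timer);
    if (signal) signal.removeEventListener('abort', onCallerAbort);
  }
}
```

Keeping the two cases separate lets a page silently ignore status 0 (the user simply typed another character) while still surfacing status 408 as a real warning.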
The SLOW_ENDPOINT_PATTERNS map
Tiered by likely operation duration:
10-minute tier — big external-API loops
| Endpoint pattern | Why slow |
|---|---|
| /payroll/xero/push-timesheets | Per-staff Xero EmploymentBasis fetch + push |
| /payrun/push-allowances | Per-staff Xero allowance push |
| /payroll/xero/timesheets (DELETE) | Per-staff draft cleanup |
| /staff/xero/sync-all | Bulk Xero employee sync |
| /hr/contracts/batches/*/process | Bulk contract generation + signing |
| /billing-history/ingest | Full ledger ingest |
| /openclaw/doctor, /restart, /update, /apply | VM-level operations |
5-minute tier — PDFs, bulk SMS, recalcs, scans
| Endpoint pattern | Why slow |
|---|---|
| /payroll/payrun/summary | 85-payslip Xero detail fetch |
| /payroll/payrun/auto-detect | Multi-table shift scan |
| /payroll/payrun/upload-salsac | CSV parse + Xero name lookup |
| /payroll/staff-sync | Multi-system reconciliation |
| /payroll/xero/create-payrun | Xero PayRun creation |
| /hr/shifts/request-cover | Bulk SMS broadcast (~600ms × N staff) |
| /hr/kpis/recalculate-all | Full KPI engine pass |
| /hr/staff/*/contracts/generate | Per-staff contract render |
| /billing-history/report/*/save | PDF generation |
| /billing-history/participant/*/statement/save | Per-participant statement PDF |
| /catchup-surveys/*/send | Bulk SMS dispatch |
| /shift-notes/scan, /operations/*/apply | Multi-shift note analysis |
2-minute tier — single external API call, normally fast but can stall
| Endpoint pattern | Why slow |
|---|---|
| /reggie/sms, /reggie/call | Agent invocation, awaits Reggie's response |
| /staff/xero/profile/* | Single Xero employee detail fetch |
| /staff/deputy/employee/* | Single Deputy employee fetch |
| /openclaw/status, /schema, /config, /logs | VM round-trip |
| /hr/contracts/*/resend | Outbound email |
Adding a new slow endpoint
When a new route is added that loops over external APIs, generates a PDF, or broadcasts SMS:
- Open admin/src/js/shared/api-client.js
- Find the SLOW_ENDPOINT_PATTERNS array
- Add a new entry in the appropriate tier:
{ rx: /\/my-new\/slow\/route\b/, ms: 300000 },
That's it. Every future caller automatically inherits the bumped timeout. No per-page changes needed.
Regex pattern conventions
- Use a \b word boundary at the end to match the path AND its sub-paths (e.g. /push-timesheets AND /push-timesheets/progress). Sub-path progress polls inheriting the same timeout is harmless because they actually respond in milliseconds.
- Use [^/]+ for parametrised segments (e.g. /hr/contracts/[^/]+/process matches /hr/contracts/abc-123/process).
- Start the pattern with a leading / so the match begins at a segment boundary after the /api base. A leading / alone does not anchor the regex; without ^ it can also match deeper in a longer path.
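The conventions above can be checked quickly in a console. The two patterns below follow the examples in this section; the test paths ending in "2" and "-archive" are hypothetical and exist only to show where \b does and does not match.

```javascript
// \b at the end: matches the route and its sub-paths.
const pushTimesheets = /\/payroll\/xero\/push-timesheets\b/;
// [^/]+ matches exactly one path segment.
const contractProcess = /\/hr\/contracts\/[^/]+\/process\b/;

console.log(pushTimesheets.test('/payroll/xero/push-timesheets'));          // true
console.log(pushTimesheets.test('/payroll/xero/push-timesheets/progress')); // true
console.log(pushTimesheets.test('/payroll/xero/push-timesheets2'));         // false
console.log(pushTimesheets.test('/payroll/xero/push-timesheets-archive'));  // true

console.log(contractProcess.test('/hr/contracts/abc-123/process'));         // true
console.log(contractProcess.test('/hr/contracts/a/b/process'));             // false
```

The fourth result is worth knowing: a hyphen is not a word character, so \b also matches before "-archive". If a sibling route with a hyphenated suffix must not inherit the timeout, a stricter ending such as (?:\/|$) may be a better fit.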
Error message conventions
When a timeout DOES fire (genuine outage, not false alarm), the error message must:
- Include elapsed time — "timed out after 600s" tells the user this wasn't a network blip
- Warn about partial completion — "server may still be processing" prevents reflexive retry
- Suggest verification — "verify the action completed before retrying" gives the user an action
The current message in api-client.js:
throw new ApiError(
`Request timed out after ${Math.round(resolvedTimeout/1000)}s. ` +
`The server may still be processing -- verify the action completed ` +
`before retrying.`,
408,
{ timeout: true, timeoutMs: resolvedTimeout }
);
This is especially important for routes that send SMSes, create payslips, or generate signed contracts: blindly retrying duplicates the work and can cost real money or cause compliance problems.
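On the calling side, a page can use the 408 status to avoid offering a one-click retry. The sketch below is illustrative only: describeFailure is a hypothetical helper, and the ApiError stand-in assumes the third constructor argument is exposed as a details property, which the snippet above does not confirm.

```javascript
// Stand-in for the real ApiError; property names (status, details) are assumptions.
class ApiError extends Error {
  constructor(message, status, details = {}) {
    super(message);
    this.name = 'ApiError';
    this.status = status;
    this.details = details;
  }
}

// Hypothetical helper: decide how a page should present a failed long-running request.
function describeFailure(err) {
  const isTimeout =
    err instanceof ApiError && err.status === 408 && err.details.timeout === true;

  if (isTimeout) {
    // The server may still be working, so warn instead of inviting an immediate retry.
    return { level: 'warning', offerRetry: false, text: err.message };
  }
  return {
    level: 'error',
    offerRetry: true,
    text: String(err?.message ?? 'Unknown error'),
  };
}

const timeoutErr = new ApiError(
  'Request timed out after 600s. The server may still be processing -- ' +
    'verify the action completed before retrying.',
  408,
  { timeout: true, timeoutMs: 600000 }
);
console.log(describeFailure(timeoutErr).offerRetry);            // false
console.log(describeFailure(new Error('Bad gateway')).offerRetry); // true
```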
The errMsg helper (SCAN_HOT_error-message-objects)
Some error sources (axios responses, custom error classes) attach an OBJECT to .message. Bare ${error.message} then renders as [object Object]. The fix is a shared defensive stringifier:
// admin/src/js/shared/error-handler.js (to be added)
export function errMsg(e) {
if (!e) return 'Unknown error';
const m = e?.message ?? e?.error ?? e?.statusText ?? e;
if (typeof m === 'string') return m;
try { return JSON.stringify(m); } catch { return String(m); }
}
The helper is currently defined inline in page_payroll_2.js. Consolidation into the shared module is pending under SCAN_HOT_error-message-objects.
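To show what the helper returns for common inputs, the snippet below repeats the function (without the export, so it runs standalone) and calls it with a few sample values. The sample inputs are made up for illustration.

```javascript
function errMsg(e) {
  if (!e) return 'Unknown error';
  const m = e?.message ?? e?.error ?? e?.statusText ?? e;
  if (typeof m === 'string') return m;
  try { return JSON.stringify(m); } catch { return String(m); }
}

console.log(errMsg(new Error('boom')));                 // boom
console.log(errMsg({ message: { code: 'E_XERO' } }));   // {"code":"E_XERO"}
console.log(errMsg({ statusText: 'Bad Gateway' }));     // Bad Gateway
console.log(errMsg('plain string'));                    // plain string
console.log(errMsg(null));                              // Unknown error

// A circular object makes JSON.stringify throw, so the String() fallback is used.
const circular = {};
circular.self = circular;
console.log(errMsg({ message: circular }));             // [object Object]
```

In a toast or template string, writing errMsg(error) in place of error.message is enough to avoid the [object Object] rendering for the object-valued cases above, with the circular case as the one remaining exception.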
Long-operation UX standards
Routes in the SLOW_ENDPOINT_PATTERNS map fall into two UX categories:
Category A: Show a progress modal
The user expects feedback. Implement:
- Backend: emit progress state to an in-memory store keyed by operation ID. Add a sibling GET .../progress route.
- Frontend: open a modal showing the current stage. Poll the progress endpoint every 2 seconds. Show preparing → running → done.
Pattern reference: payroll v2's push-timesheets modal in page_payroll_2.js.
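The two halves of Category A can be sketched as follows. This is not the code in page_payroll_2.js: progressStore, setProgress, getProgress, and pollProgress are hypothetical names, the "failed" and "unknown" stages are assumptions added so the loop has a second exit, and fetchProgress stands for whatever call hits the GET .../progress route.

```javascript
// Backend side (assumed shape): in-memory progress store keyed by operation ID.
const progressStore = new Map();

function setProgress(operationId, stage, detail = '') {
  progressStore.set(operationId, { stage, detail, updatedAt: Date.now() });
}

// What a sibling GET .../progress handler could return for one operation.
function getProgress(operationId) {
  return progressStore.get(operationId) ?? { stage: 'unknown', detail: '' };
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Frontend side: poll until the operation reports a terminal stage.
async function pollProgress(fetchProgress, onUpdate, { intervalMs = 2000, maxPolls = 300 } = {}) {
  for (let poll = 0; poll < maxPolls; poll++) {
    const state = await fetchProgress();
    onUpdate(state); // e.g. update the modal's stage label
    if (state.stage === 'done' || state.stage === 'failed') return state;
    await sleep(intervalMs);
  }
  throw new Error('Stopped polling before the operation reported a terminal stage');
}
```

With the defaults, 300 polls at 2 seconds covers roughly ten minutes, which lines up with the longest timeout tier. A page would pass a fetchProgress that calls the progress route and an onUpdate that moves the modal through preparing, running, and done.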
Category B: Spinner is acceptable
For operations that are fast (5-30s) but bumped to a longer timeout as a safety net, a spinner with a friendly status line is fine.
The scan SCAN_HOT_progress-modal-coverage identifies which slow endpoints currently lack a progress modal.
Related scans
The repoSCANS folder (admin/tasks/tasks_active/repoSCANS/) tracks ongoing work in this area:
| Scan | Status |
|---|---|
| SCAN_HOT_api-timeout-map | Done 2026-05-16 (this document) |
| SCAN_HOT_error-message-objects | New: promotes errMsg to shared helper |
| SCAN_HOT_template-string-objects | New: finds risky ${obj} interpolations |
| SCAN_HOT_progress-modal-coverage | New: adds progress modals to slow routes |
| SCAN_WARM_backend-timeouts | New: same pattern, server-side |
| SCAN_WARM_eventsource-backoff | New: reconnect storm prevention |
| SCAN_WARM_localstorage-quota | New: defensive storage wrappers |
How to verify the system is working
A few quick checks:
# Find the timeout map
rg "SLOW_ENDPOINT_PATTERNS" admin/src/js/shared/api-client.js
# Count current patterns (-F matches the brace literally instead of parsing it as regex syntax)
rg -F "{ rx: " admin/src/js/shared/api-client.js | wc -l
# Find any inline timeout overrides (these still win, just verifying)
rg "timeout:\s*\d+" admin/src/js/pages --type js
When adding a new long-running backend route, the test is: does it complete in less than 30 seconds in the worst case? If no, add a pattern to the map.
Migration notes
- Existing {timeout: x} overrides in page JS still work: they win over the map and are preserved untouched
- page_hr_shift_cover.js was migrated: its inline {timeout: 300000} was removed because the map now handles it. The cover-request-specific error toast was kept (it has friendlier copy than the generic message)
- No other files needed modification
References
- admin/src/js/shared/api-client.js: the implementation
- admin/tasks/tasks_active/repoSCANS/SCAN_HOT_api-timeout-map.md: the original scan record
- Blog post: Patterns Over Patches: A Frontend Resilience Sweep (the announcement)
- AGENTS.md: overall codebase standards
Document version: 1.0 — 2026-05-17 — Initial publication.