Frontend Resilience Standards

Overview

This document defines the standards for resilient HTTP requests, error handling, and long-running operation UX in the admin frontend. The patterns documented here exist to prevent an entire class of "the server succeeded but the UI lied" bugs.

The flagship implementation is admin/src/js/shared/api-client.js, which contains the SLOW_ENDPOINT_PATTERNS map. Every new long-running route should be added there, not patched per-caller.

Why this exists

A recurring bug pattern caused user-facing failures across the admin app for months. The flow:

1. User triggers a long-running operation (push payroll, generate PDF, broadcast SMS, run KPI recalc)
2. The browser POST starts
3. Backend works correctly: it loops through external API calls, sends SMS, writes to the DB
4. Default api-client timeout of 30 seconds fires
5. Request aborts client-side
6. Error toast is displayed
7. Backend completes successfully a few seconds later
8. User sees red, doesn't trust the UI, and reflexively retries

The fourth occurrence in three weeks (the cover-request feature on 2026-05-16) triggered the structural fix documented below.

The three-tier timeout resolution

The api-client.js request() method resolves a request's timeout in this order:

const resolvedTimeout = options.timeout      // 1. Explicit caller override
  || resolveSlowTimeout(endpoint)            // 2. Pattern-map lookup
  || this.timeout;                           // 3. Client default (30s)

1. Explicit caller override wins absolutely. Existing inline {timeout: x} overrides are preserved.
2. The pattern map auto-bumps the timeout for known-slow paths.
3. The client default (30s) applies to everything else. Hundreds of fast endpoints (forum, messenger, staff CRUD, calendar) are untouched.
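
The resolution order above can be sketched end to end. This is a minimal illustration, not the code in api-client.js: the three pattern entries are a hypothetical subset, the tier values assume 10, 5 and 2 minutes expressed in milliseconds, and DEFAULT_TIMEOUT_MS and resolveTimeout are names invented here. Only resolveSlowTimeout and the { rx, ms } entry shape come from this document.

```javascript
// Hypothetical subset of the pattern list; the real entries live in api-client.js.
const SLOW_ENDPOINT_PATTERNS = [
  { rx: /\/payroll\/xero\/push-timesheets\b/, ms: 600000 }, // 10-minute tier
  { rx: /\/hr\/shifts\/request-cover\b/, ms: 300000 },      // 5-minute tier
  { rx: /\/reggie\/sms\b/, ms: 120000 },                    // 2-minute tier
];

// Assumed name for the client default (this.timeout in the real client).
const DEFAULT_TIMEOUT_MS = 30000;

// Returns the timeout of the first matching pattern, or undefined on a miss.
function resolveSlowTimeout(endpoint) {
  const hit = SLOW_ENDPOINT_PATTERNS.find(({ rx }) => rx.test(endpoint));
  return hit ? hit.ms : undefined;
}

// Same three-step || chain as the snippet above, wrapped in a function.
function resolveTimeout(endpoint, options = {}) {
  return options.timeout            // 1. Explicit caller override
    || resolveSlowTimeout(endpoint) // 2. Pattern-map lookup
    || DEFAULT_TIMEOUT_MS;          // 3. Client default
}

console.log(resolveTimeout('/hr/shifts/request-cover'));                    // 300000
console.log(resolveTimeout('/forum/posts'));                                // 30000
console.log(resolveTimeout('/hr/shifts/request-cover', { timeout: 5000 })); // 5000
```

One consequence of chaining with ||: an explicit timeout of 0 is falsy, so it falls through to the map or the default rather than meaning "no timeout".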

Resolution properties

Property	Why it matters
Purely additive	Routes NOT in the map behave exactly as before, so the change carries no regression risk for them.
Explicit override wins	Callers that have already tuned a timeout are preserved.
Pattern miss = old behaviour	If a future slow route is added but the regex isn't, it just falls through to the 30s default — the same failure mode that existed before. No new bug introduced.
Caller `AbortSignal` distinguished from timeout	A search-debounce abort returns "Request aborted" with status 0; a true timeout returns "Request timed out after Xs" with status 408.
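
The last property, telling a caller abort apart from a timeout, can be sketched as follows. This is one possible approach under stated assumptions, not the api-client.js implementation: runWithTimeout is a hypothetical helper, the ApiError class is a stand-in with an assumed (message, status, details) constructor, and run is any function that accepts an AbortSignal and rejects when it is aborted (as fetch does).

```javascript
// Stand-in for the real ApiError; the constructor shape is an assumption.
class ApiError extends Error {
  constructor(message, status, details = {}) {
    super(message);
    this.name = 'ApiError';
    this.status = status;
    this.details = details;
  }
}

// run: (signal) => Promise. callerSignal: optional AbortSignal owned by the caller.
async function runWithTimeout(run, timeoutMs, callerSignal) {
  const controller = new AbortController();
  let timedOut = false;

  // The timer sets a flag before aborting, so the catch block knows who aborted.
  const timer = setTimeout(() => {
    timedOut = true;
    controller.abort();
  }, timeoutMs);

  // Forward a caller abort (e.g. search debounce) to the internal controller.
  const onCallerAbort = () => controller.abort();
  if (callerSignal) {
    if (callerSignal.aborted) controller.abort();
    else callerSignal.addEventListener('abort', onCallerAbort, { once: true });
  }

  try {
    return await run(controller.signal);
  } catch (err) {
    if (timedOut) {
      throw new ApiError(
        `Request timed out after ${Math.round(timeoutMs / 1000)}s`,
        408,
        { timeout: true, timeoutMs }
      );
    }
    if (controller.signal.aborted) {
      throw new ApiError('Request aborted', 0, { aborted: true });
    }
    throw err; // Ordinary failure: pass through unchanged.
  } finally {
    clearTimeout(timer);
    if (callerSignal) callerSignal.removeEventListener('abort', onCallerAbort);
  }
}
```

In a browser, run would typically be something like (signal) => fetch(url, { signal }). Keeping the timedOut flag separate from the signal state is what lets the two cases produce different messages and status codes.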

The SLOW_ENDPOINT_PATTERNS map

Tiered by likely operation duration:

10-minute tier — big external-API loops

Endpoint pattern	Why slow
`/payroll/xero/push-timesheets`	Per-staff Xero EmploymentBasis fetch + push
`/payrun/push-allowances`	Per-staff Xero allowance push
`/payroll/xero/timesheets` (DELETE)	Per-staff draft cleanup
`/staff/xero/sync-all`	Bulk Xero employee sync
`/hr/contracts/batches/*/process`	Bulk contract generation + signing
`/billing-history/ingest`	Full ledger ingest
`/openclaw/doctor`, `/restart`, `/update`, `/apply`	VM-level operations

5-minute tier — PDFs, bulk SMS, recalcs, scans

Endpoint pattern	Why slow
`/payroll/payrun/summary`	85-payslip Xero detail fetch
`/payroll/payrun/auto-detect`	Multi-table shift scan
`/payroll/payrun/upload-salsac`	CSV parse + Xero name lookup
`/payroll/staff-sync`	Multi-system reconciliation
`/payroll/xero/create-payrun`	Xero PayRun creation
`/hr/shifts/request-cover`	Bulk SMS broadcast (~600ms × N staff)
`/hr/kpis/recalculate-all`	Full KPI engine pass
`/hr/staff/*/contracts/generate`	Per-staff contract render
`/billing-history/report/*/save`	PDF generation
`/billing-history/participant/*/statement/save`	Per-participant statement PDF
`/catchup-surveys/*/send`	Bulk SMS dispatch
`/shift-notes/scan`, `/operations/*/apply`	Multi-shift note analysis

2-minute tier — single external API call, normally fast but can stall

Endpoint pattern	Why slow
`/reggie/sms`, `/reggie/call`	Agent invocation, awaits Reggie's response
`/staff/xero/profile/*`	Single Xero employee detail fetch
`/staff/deputy/employee/*`	Single Deputy employee fetch
`/openclaw/status`, `/schema`, `/config`, `/logs`	VM round-trip
`/hr/contracts/*/resend`	Outbound email

Adding a new slow endpoint

When a new route is added that loops over external APIs, generates a PDF, or broadcasts SMS:

Open admin/src/js/shared/api-client.js
Find the SLOW_ENDPOINT_PATTERNS array
Add a new entry in the appropriate tier:

{ rx: /\/my-new\/slow\/route\b/, ms: 300000 },

That's it. Every future caller automatically inherits the bumped timeout. No per-page changes needed.

Regex pattern conventions

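End the pattern with a \b word boundary so it matches the path and its sub-paths (e.g. /push-timesheets and /push-timesheets/progress) but not a longer word such as /push-timesheetsX. Note that \b also matches before a hyphen, so /push-timesheets-v2 would match too. Sub-path progress polls inherit the same timeout, which is harmless because they respond in milliseconds.
Use [^/]+ for parametrised segments (e.g. /hr/contracts/[^/]+/process matches /hr/contracts/abc-123/process).
Start each pattern with a leading / so it begins on a segment boundary after the /api base. Patterns written like the example above have no ^ anchor, so they can match anywhere in the path, not only at its start.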
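
The conventions above can be checked quickly in a console before a new entry is committed. The two patterns below are illustrative, written in the style of the map entries rather than copied from api-client.js, and the matches helper is a name invented for this sketch.

```javascript
// Illustrative patterns in the { rx } style used by the map.
const pushTimesheets = /\/payroll\/xero\/push-timesheets\b/;
const batchProcess = /\/hr\/contracts\/batches\/[^/]+\/process\b/;

// Hypothetical helper: true when the endpoint would pick up the bumped timeout.
const matches = (rx, endpoint) => rx.test(endpoint);

// \b: the path itself and its sub-paths match.
console.log(matches(pushTimesheets, '/payroll/xero/push-timesheets'));          // true
console.log(matches(pushTimesheets, '/payroll/xero/push-timesheets/progress')); // true

// \b: a longer word does not match, but a hyphenated suffix does.
console.log(matches(pushTimesheets, '/payroll/xero/push-timesheetsX'));         // false
console.log(matches(pushTimesheets, '/payroll/xero/push-timesheets-v2'));       // true

// [^/]+ matches exactly one path segment.
console.log(matches(batchProcess, '/hr/contracts/batches/abc-123/process'));    // true
console.log(matches(batchProcess, '/hr/contracts/batches/a/b/process'));        // false

// No ^ anchor: a prefix such as /api does not prevent the match.
console.log(matches(batchProcess, '/api/hr/contracts/batches/42/process'));     // true
```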

Error message conventions

When a timeout DOES fire (genuine outage, not false alarm), the error message must:

Include elapsed time — "timed out after 600s" tells the user this wasn't a network blip
Warn about partial completion — "server may still be processing" prevents reflexive retry
Suggest verification — "verify the action completed before retrying" gives the user an action

The current message in api-client.js:

throw new ApiError(
  `Request timed out after ${Math.round(resolvedTimeout/1000)}s. ` +
  `The server may still be processing -- verify the action completed ` +
  `before retrying.`,
  408,
  { timeout: true, timeoutMs: resolvedTimeout }
);

This matters most for routes that send SMS messages, create payslips, or generate signed contracts: a blind retry duplicates the work and can cost real money or cause compliance problems.
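
On the calling side, a page can use the 408 status and the timeout flag to avoid offering a one-click retry. The sketch below is only an illustration of that idea: isTimeoutError and describeFailure are hypothetical helpers, and it assumes the third ApiError argument is exposed as err.details, which this document does not confirm.

```javascript
// Assumption: ApiError exposes .status and stores its third argument on .details.
function isTimeoutError(err) {
  return Boolean(err) && err.status === 408 && err.details?.timeout === true;
}

// Hypothetical helper: decide what a catch block should show the user.
// A timeout is shown as a warning with no retry button, because the server
// may still be working; other failures keep the normal error-and-retry path.
function describeFailure(err) {
  if (isTimeoutError(err)) {
    return { level: 'warning', message: err.message, offerRetry: false };
  }
  return { level: 'error', message: String(err?.message ?? err), offerRetry: true };
}

// Example with a plain object standing in for a thrown ApiError.
const timeoutErr = {
  status: 408,
  message: 'Request timed out after 300s. The server may still be processing.',
  details: { timeout: true, timeoutMs: 300000 },
};
console.log(describeFailure(timeoutErr).offerRetry);             // false
console.log(describeFailure(new Error('Bad Gateway')).offerRetry); // true
```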

The errMsg helper (SCAN_HOT_error-message-objects)

Some error sources (axios responses, custom error classes) attach an OBJECT to .message. Bare ${error.message} then renders as [object Object]. The fix is a shared defensive stringifier:

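// admin/src/js/shared/error-handler.js (to be added)
// Defensive stringifier: always returns a string, even when .message holds an object.
export function errMsg(e) {
  if (!e) return 'Unknown error';
  // Take the first field that is present; fall back to the value itself.
  const m = e?.message ?? e?.error ?? e?.statusText ?? e;
  if (typeof m === 'string') return m;
  // JSON.stringify can throw (circular refs) or return undefined (functions, symbols).
  try { return JSON.stringify(m) ?? String(m); } catch { return String(m); }
}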

errMsg is currently defined inline in page_payroll_2.js. Consolidating it into the shared module is pending under SCAN_HOT_error-message-objects.
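
To make the failure mode concrete, the snippet below contrasts bare interpolation with the helper. errMsg is reproduced here (without the export) only so the snippet runs on its own; the sample error objects are invented for illustration.

```javascript
// Local copy of the helper so this snippet is self-contained.
function errMsg(e) {
  if (!e) return 'Unknown error';
  const m = e?.message ?? e?.error ?? e?.statusText ?? e;
  if (typeof m === 'string') return m;
  try { return JSON.stringify(m); } catch { return String(m); }
}

// Invented example: an error whose .message is an object, not a string.
const objectMessageError = { message: { code: 'E1', detail: 'bad' } };

// Bare interpolation loses the content.
console.log(`Failed: ${objectMessageError.message}`);        // Failed: [object Object]

// The helper keeps it readable.
console.log(`Failed: ${errMsg(objectMessageError)}`);        // Failed: {"code":"E1","detail":"bad"}

// Ordinary cases still behave as expected.
console.log(errMsg(new Error('boom')));                      // boom
console.log(errMsg({ statusText: 'Bad Gateway' }));          // Bad Gateway
console.log(errMsg('plain string'));                         // plain string
console.log(errMsg(null));                                   // Unknown error
```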

Long-operation UX standards

Routes in the SLOW_ENDPOINT_PATTERNS map fall into two UX categories:

Category A: Show a progress modal

The user expects feedback. Implement:

Backend: emit progress state to an in-memory store keyed by operation ID. Add a sibling GET .../progress route.
Frontend: open a modal showing the current stage. Poll the progress endpoint every 2 seconds. Show preparing → running → done.

Pattern reference: payroll v2's push-timesheets modal in page_payroll_2.js.
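
The two halves of Category A can be sketched together. This is a minimal illustration under stated assumptions, not the push-timesheets code in page_payroll_2.js: progressStore, setProgress, getProgress and pollProgress are hypothetical names, the 'failed' and 'unknown' stages are additions for the sketch, and the progress fetch is injected so the loop can run without a server.

```javascript
// --- Backend side (hypothetical): in-memory progress keyed by operation ID ---
const progressStore = new Map();

function setProgress(operationId, stage, detail = '') {
  progressStore.set(operationId, { stage, detail, updatedAt: Date.now() });
}

// What a sibling GET .../progress handler could return for one operation.
function getProgress(operationId) {
  return progressStore.get(operationId) ?? { stage: 'unknown', detail: '' };
}

// --- Frontend side (hypothetical): poll until a terminal stage is reported ---
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchProgress: async function returning { stage }. onStage: updates the modal.
async function pollProgress(fetchProgress, onStage, { intervalMs = 2000, maxPolls = 300 } = {}) {
  for (let i = 0; i < maxPolls; i += 1) {
    const progress = await fetchProgress();
    onStage(progress.stage); // e.g. preparing, running, done
    if (progress.stage === 'done' || progress.stage === 'failed') {
      return progress.stage;
    }
    await sleep(intervalMs);
  }
  // Bounded so an abandoned operation cannot keep the modal polling forever.
  throw new Error('Stopped polling before the operation reported a terminal stage');
}
```

In the browser, fetchProgress would wrap the api-client GET against the sibling progress route, and onStage would update the modal text. The maxPolls cap is a design choice of this sketch; with the 2-second default it gives up after roughly ten minutes.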

Category B: Spinner is acceptable

For operations that are fast (5-30s) but bumped to a longer timeout as a safety net, a spinner with a friendly status line is fine.

The scan SCAN_HOT_progress-modal-coverage identifies which slow endpoints currently lack a progress modal.

Related scans

The repoSCANS folder (admin/tasks/tasks_active/repoSCANS/) tracks ongoing work in this area:

Scan	Status
`SCAN_HOT_api-timeout-map`	Done 2026-05-16 — this document
`SCAN_HOT_error-message-objects`	New — promotes `errMsg` to shared helper
`SCAN_HOT_template-string-objects`	New — finds risky `${obj}` interpolations
`SCAN_HOT_progress-modal-coverage`	New — adds progress modals to slow routes
`SCAN_WARM_backend-timeouts`	New — same pattern, server-side
`SCAN_WARM_eventsource-backoff`	New — reconnect storm prevention
`SCAN_WARM_localstorage-quota`	New — defensive storage wrappers

How to verify the system is working

A few quick checks:

# Find the timeout map
rg "SLOW_ENDPOINT_PATTERNS" admin/src/js/shared/api-client.js

# Count current patterns (-F treats the search string literally; a bare "{" is not a valid regex)
rg -cF "{ rx: " admin/src/js/shared/api-client.js

# Find any inline timeout overrides (these still win, just verifying)
rg "timeout:\s*\d+" admin/src/js/pages --type js

When adding a new long-running backend route, the test is: does it complete in under 30 seconds even in the worst case? If not, add a pattern to the map.

Migration notes

Existing {timeout: x} overrides in page JS still work — they win over the map and are preserved untouched
page_hr_shift_cover.js was migrated: its inline {timeout: 300000} was removed because the map now handles it. The cover-request-specific error toast was kept (it has friendlier copy than the generic message)
No other files needed modification

References

admin/src/js/shared/api-client.js — the implementation
admin/tasks/tasks_active/repoSCANS/SCAN_HOT_api-timeout-map.md — the original scan record
Blog post: Patterns Over Patches: A Frontend Resilience Sweep — the announcement
AGENTS.md — overall codebase standards

Document version: 1.0 — 2026-05-17 — Initial publication.
