
Frontend Resilience Standards

Overview

This document defines the standards for resilient HTTP requests, error handling, and long-running operation UX in the admin frontend. The patterns documented here exist to prevent an entire class of "the server succeeded but the UI lied" bugs.

The flagship implementation is admin/src/js/shared/api-client.js, which contains the SLOW_ENDPOINT_PATTERNS map. Every new long-running route should be added there, not patched per-caller.

Why this exists

A recurring bug pattern caused user-facing failures across the admin app for months. The flow:

  1. User triggers a long-running operation (push payroll, generate PDF, broadcast SMS, run KPI recalc)
  2. The browser POST starts
  3. Backend works correctly — loops through external API calls, sends SMS, writes to DB
  4. Default api-client timeout of 30 seconds fires
  5. Request aborts client-side
  6. Error toast displayed
  7. Backend completes successfully a few seconds later
  8. User sees red, doesn't trust the UI, reflexively retries

The fourth occurrence in three weeks (the cover-request feature, on 2026-05-16) triggered the structural fix documented below.

The three-tier timeout resolution

The api-client.js request() method resolves a request's timeout in this order:

const resolvedTimeout = options.timeout   // 1. Explicit caller override
  || resolveSlowTimeout(endpoint)         // 2. Pattern-map lookup
  || this.timeout;                        // 3. Client default (30s)

  1. Explicit caller override wins absolutely. Existing inline {timeout: x} overrides are preserved.
  2. Pattern map auto-bumps timeout for known-slow paths.
  3. Client default (30s) for everything else. Hundreds of fast endpoints — forum, messenger, staff CRUD, calendar — are untouched.
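
The resolution order above can be sketched in isolation. This is a minimal illustration rather than the code in api-client.js: the two sample entries, the body of resolveSlowTimeout, the DEFAULT_TIMEOUT_MS constant, and the standalone resolveTimeout function are all assumptions made for the example.

```javascript
// Hypothetical subset of the pattern map; the real entries live in api-client.js.
const SLOW_ENDPOINT_PATTERNS = [
  { rx: /\/payroll\/xero\/push-timesheets\b/, ms: 600000 }, // 10-minute tier
  { rx: /\/hr\/shifts\/request-cover\b/, ms: 300000 },      // 5-minute tier
];

const DEFAULT_TIMEOUT_MS = 30000; // client default (30s)

// Assumed shape of the lookup: first matching pattern wins, undefined on a miss.
function resolveSlowTimeout(endpoint) {
  const hit = SLOW_ENDPOINT_PATTERNS.find(({ rx }) => rx.test(endpoint));
  return hit ? hit.ms : undefined;
}

// Mirrors the three-tier chain: explicit override, then pattern map, then default.
function resolveTimeout(endpoint, options = {}) {
  return options.timeout || resolveSlowTimeout(endpoint) || DEFAULT_TIMEOUT_MS;
}

resolveTimeout('/payroll/xero/push-timesheets', { timeout: 5000 }); // 5000
resolveTimeout('/payroll/xero/push-timesheets');                    // 600000
resolveTimeout('/forum/posts');                                     // 30000
```

Because the chain uses ||, a falsy override such as timeout: 0 falls through to the next tier instead of disabling the timeout.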

Resolution properties

| Property | Why it matters |
|---|---|
| Purely additive | Routes NOT in the map have identical behaviour to before. Zero regression risk. |
| Explicit override wins | Callers that have already tuned a timeout are preserved. |
| Pattern miss = old behaviour | If a future slow route is added but the regex isn't, it just falls through to the 30s default, the same failure mode that existed before. No new bug introduced. |
| Caller AbortSignal distinguished from timeout | A search-debounce abort returns "Request aborted" with status 0; a true timeout returns "Request timed out after Xs" with status 408. |
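
One way to tell a caller abort from a timeout is to track which of the two triggered the abort. The sketch below is an assumption about how that could be done, not the code in api-client.js: the ApiError stand-in, the requestWithTimeout name, the details field, and the injectable fetchFn parameter are all hypothetical.

```javascript
// Minimal stand-in for the client's error class (assumed shape).
class ApiError extends Error {
  constructor(message, status, details = {}) {
    super(message);
    this.name = 'ApiError';
    this.status = status;
    this.details = details;
  }
}

// fetchFn is injectable so the sketch can be exercised without a network.
async function requestWithTimeout(url, { timeoutMs = 30000, signal, fetchFn = fetch } = {}) {
  const controller = new AbortController();
  let timedOut = false;

  // The timer records that it fired before aborting the request.
  const timer = setTimeout(() => {
    timedOut = true;
    controller.abort();
  }, timeoutMs);

  // Forward a caller-supplied AbortSignal to the internal controller.
  const onCallerAbort = () => controller.abort();
  if (signal) {
    if (signal.aborted) controller.abort();
    else signal.addEventListener('abort', onCallerAbort, { once: true });
  }

  try {
    return await fetchFn(url, { signal: controller.signal });
  } catch (err) {
    if (timedOut) {
      const seconds = Math.round(timeoutMs / 1000);
      throw new ApiError(`Request timed out after ${seconds}s`, 408, { timeout: true, timeoutMs });
    }
    if (controller.signal.aborted) {
      throw new ApiError('Request aborted', 0, { aborted: true });
    }
    throw err; // genuine network or server error: pass through unchanged
  } finally {
    clearTimeout(timer);
    if (signal) signal.removeEventListener('abort', onCallerAbort);
  }
}
```

A search box that cancels its previous request would then see status 0 and can ignore it, while a real timeout surfaces as 408 with the timeout flag set.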

The SLOW_ENDPOINT_PATTERNS map

Tiered by likely operation duration:

10-minute tier — big external-API loops

| Endpoint pattern | Why slow |
|---|---|
| /payroll/xero/push-timesheets | Per-staff Xero EmploymentBasis fetch + push |
| /payrun/push-allowances | Per-staff Xero allowance push |
| /payroll/xero/timesheets (DELETE) | Per-staff draft cleanup |
| /staff/xero/sync-all | Bulk Xero employee sync |
| /hr/contracts/batches/*/process | Bulk contract generation + signing |
| /billing-history/ingest | Full ledger ingest |
| /openclaw/doctor, /restart, /update, /apply | VM-level operations |

5-minute tier — PDFs, bulk SMS, recalcs, scans

| Endpoint pattern | Why slow |
|---|---|
| /payroll/payrun/summary | 85-payslip Xero detail fetch |
| /payroll/payrun/auto-detect | Multi-table shift scan |
| /payroll/payrun/upload-salsac | CSV parse + Xero name lookup |
| /payroll/staff-sync | Multi-system reconciliation |
| /payroll/xero/create-payrun | Xero PayRun creation |
| /hr/shifts/request-cover | Bulk SMS broadcast (~600ms × N staff) |
| /hr/kpis/recalculate-all | Full KPI engine pass |
| /hr/staff/*/contracts/generate | Per-staff contract render |
| /billing-history/report/*/save | PDF generation |
| /billing-history/participant/*/statement/save | Per-participant statement PDF |
| /catchup-surveys/*/send | Bulk SMS dispatch |
| /shift-notes/scan, /operations/*/apply | Multi-shift note analysis |

2-minute tier — single external API call, normally fast but can stall

| Endpoint pattern | Why slow |
|---|---|
| /reggie/sms, /reggie/call | Agent invocation, awaits Reggie's response |
| /staff/xero/profile/* | Single Xero employee detail fetch |
| /staff/deputy/employee/* | Single Deputy employee fetch |
| /openclaw/status, /schema, /config, /logs | VM round-trip |
| /hr/contracts/*/resend | Outbound email |
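
As a rough picture of how the tiers translate into map entries, the sketch below groups a few rows from the tables above under named tier constants. The constant names, the SAMPLE_PATTERNS array, and the tierFor helper are assumptions for illustration; only the { rx, ms } entry shape and the paths come from this document.

```javascript
// Tier durations in milliseconds (names are illustrative).
const TEN_MINUTES = 10 * 60 * 1000;
const FIVE_MINUTES = 5 * 60 * 1000;
const TWO_MINUTES = 2 * 60 * 1000;

// A few rows from the tier tables; "*" in the tables becomes [^/]+ here.
const SAMPLE_PATTERNS = [
  { rx: /\/staff\/xero\/sync-all\b/, ms: TEN_MINUTES },
  { rx: /\/hr\/contracts\/batches\/[^/]+\/process\b/, ms: TEN_MINUTES },
  { rx: /\/hr\/kpis\/recalculate-all\b/, ms: FIVE_MINUTES },
  { rx: /\/catchup-surveys\/[^/]+\/send\b/, ms: FIVE_MINUTES },
  { rx: /\/staff\/xero\/profile\/[^/]+/, ms: TWO_MINUTES },
  { rx: /\/hr\/contracts\/[^/]+\/resend\b/, ms: TWO_MINUTES },
];

// Returns the tier timeout for a path, or undefined when no pattern matches.
function tierFor(path) {
  const entry = SAMPLE_PATTERNS.find(({ rx }) => rx.test(path));
  return entry ? entry.ms : undefined;
}

tierFor('/api/hr/contracts/batches/42/process'); // 600000
tierFor('/api/catchup-surveys/7/send');          // 300000
tierFor('/api/staff/xero/profile/e-1');          // 120000
tierFor('/api/forum/posts');                     // undefined
```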

Adding a new slow endpoint

When a new route is added that loops over external APIs, generates a PDF, or broadcasts SMS:

  1. Open admin/src/js/shared/api-client.js
  2. Find the SLOW_ENDPOINT_PATTERNS array
  3. Add a new entry in the appropriate tier:

{ rx: /\/my-new\/slow\/route\b/, ms: 300000 }, // 5-minute tier

That's it. Every future caller automatically inherits the bumped timeout. No per-page changes needed.

Regex pattern conventions

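  • End the pattern with a \b word boundary so it matches the path AND its sub-paths (e.g. /push-timesheets AND /push-timesheets/progress) but not a longer name such as /push-timesheets2. Note that - is a non-word character, so /push-timesheets-v2 would also match. Sub-path progress polls inheriting the same timeout is harmless because they respond in milliseconds.
  • Use [^/]+ for parametrised segments (e.g. /hr/contracts/[^/]+/process matches /hr/contracts/abc-123/process).
  • Start each pattern with a leading / so the match begins on a segment boundary. The example entry has no ^ anchor, so it matches wherever the route appears in the URL, including after the /api base.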
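
The conventions above are easy to check directly against sample paths. The two patterns below are hypothetical entries written to follow those conventions; the sample URLs are made up for the demonstration.

```javascript
// Hypothetical patterns written per the conventions above.
const pushTimesheets = /\/payroll\/xero\/push-timesheets\b/;
const contractProcess = /\/hr\/contracts\/[^/]+\/process\b/;

// \b after the last segment matches the route and its sub-paths...
pushTimesheets.test('/api/payroll/xero/push-timesheets');          // true
pushTimesheets.test('/api/payroll/xero/push-timesheets/progress'); // true
// ...but not a longer segment that merely starts with the same characters.
pushTimesheets.test('/api/payroll/xero/push-timesheets2');         // false
// Caveat: "-" is a non-word character, so a hyphenated sibling still matches.
pushTimesheets.test('/api/payroll/xero/push-timesheets-v2');       // true

// [^/]+ matches exactly one non-empty path segment.
contractProcess.test('/api/hr/contracts/abc-123/process'); // true
contractProcess.test('/api/hr/contracts/a/b/process');     // false
contractProcess.test('/api/hr/contracts//process');        // false
```

If a hyphenated sibling route ever needs a different timeout, a stricter ending such as (?:\/|$) in place of \b would exclude it; that variant is a suggestion, not an existing convention.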

Error message conventions

When a timeout DOES fire (genuine outage, not false alarm), the error message must:

  1. Include elapsed time — "timed out after 600s" tells the user this wasn't a network blip
  2. Warn about partial completion — "server may still be processing" prevents reflexive retry
  3. Suggest verification — "verify the action completed before retrying" gives the user an action

The current message in api-client.js:

throw new ApiError(
  `Request timed out after ${Math.round(resolvedTimeout / 1000)}s. ` +
    `The server may still be processing -- verify the action completed ` +
    `before retrying.`,
  408,
  { timeout: true, timeoutMs: resolvedTimeout }
);

This is especially important for routes that send SMSes, create payslips, or generate signed contracts: blindly retrying duplicates the work and can cost real money or create compliance problems.
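
On the calling side, the timeout flag lets a page treat "may still be processing" differently from an ordinary failure. The helper below is a hypothetical sketch: describeFailure is not an existing function, and it assumes ApiError exposes its third constructor argument as error.details, which this document does not confirm.

```javascript
// Hypothetical caller-side helper. Assumes ApiError exposes its third
// constructor argument as `error.details`; adjust to the real field name.
function describeFailure(error) {
  const timedOut = error?.status === 408 && error?.details?.timeout === true;
  if (timedOut) {
    // The work may have completed: warn, and do not offer a one-click retry.
    return { kind: 'warning', offerRetry: false, text: error.message };
  }
  // Ordinary failure: show the error and let the user retry.
  return { kind: 'error', offerRetry: true, text: String(error?.message ?? 'Unknown error') };
}

describeFailure({
  status: 408,
  message: 'Request timed out after 600s. The server may still be processing.',
  details: { timeout: true, timeoutMs: 600000 },
}).kind; // "warning"
```

Withholding the retry button on a timeout is one possible design choice; the point is that the decision can hang off the structured flag instead of parsing the message text.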

The errMsg helper (SCAN_HOT_error-message-objects)

Some error sources (axios responses, custom error classes) attach an OBJECT to .message. Bare ${error.message} then renders as [object Object]. The fix is a shared defensive stringifier:

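// admin/src/js/shared/error-handler.js (to be added)
// Returns a readable string for any thrown value.
export function errMsg(e) {
  if (!e) return 'Unknown error';
  // Prefer the usual message fields; fall back to the value itself.
  const m = e?.message ?? e?.error ?? e?.statusText ?? e;
  if (typeof m === 'string') return m;
  // Serialise objects; values JSON cannot handle (e.g. circular) fall back to String().
  try { return JSON.stringify(m); } catch { return String(m); }
}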

errMsg is currently defined inline in page_payroll_2.js. Consolidation into the shared helper is pending under SCAN_HOT_error-message-objects.
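
To make the failure mode concrete, the snippet below compares bare interpolation with the helper. The helper is repeated (without export) so the snippet runs on its own; the sample error object and its XERO_429 code are made up for the demonstration.

```javascript
// Copy of the helper's logic so this snippet is self-contained.
function errMsg(e) {
  if (!e) return 'Unknown error';
  const m = e?.message ?? e?.error ?? e?.statusText ?? e;
  if (typeof m === 'string') return m;
  try { return JSON.stringify(m); } catch { return String(m); }
}

// Made-up error whose .message is an object rather than a string.
const objectMessage = { message: { code: 'XERO_429', detail: 'Rate limited' } };

const naive = `Push failed: ${objectMessage.message}`;
// "Push failed: [object Object]"

const safe = `Push failed: ${errMsg(objectMessage)}`;
// 'Push failed: {"code":"XERO_429","detail":"Rate limited"}'

errMsg(new Error('boom'));             // "boom"
errMsg('plain string');                // "plain string"
errMsg({ statusText: 'Bad Gateway' }); // "Bad Gateway"
errMsg(null);                          // "Unknown error"
```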

Long-operation UX standards

Routes in the SLOW_ENDPOINT_PATTERNS map fall into two UX categories:

Category A: Show a progress modal

The user expects feedback. Implement:

  1. Backend: emit progress state to an in-memory store keyed by operation ID. Add a sibling GET .../progress route.
  2. Frontend: open a modal showing the current stage. Poll the progress endpoint every 2 seconds. Show preparing → running → done.

Pattern reference: payroll v2's push-timesheets modal in page_payroll_2.js.
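
The two halves of Category A can be sketched as follows. This is not the push-timesheets implementation: the progressStore Map, the setProgress and getProgress helpers, the pollProgress function, and the extra "failed" stage are assumptions for illustration, and fetchProgress stands in for a GET to the sibling .../progress route.

```javascript
// Backend side (assumed shape): in-memory progress keyed by operation ID.
const progressStore = new Map();

function setProgress(operationId, stage, detail = '') {
  progressStore.set(operationId, { stage, detail, updatedAt: Date.now() });
}

function getProgress(operationId) {
  return progressStore.get(operationId) ?? { stage: 'unknown', detail: '' };
}

// Frontend side: poll until the operation reports a terminal stage.
// onUpdate would typically re-render the modal with the latest stage.
async function pollProgress(fetchProgress, onUpdate, intervalMs = 2000) {
  for (;;) {
    const state = await fetchProgress();
    onUpdate(state);
    if (state.stage === 'done' || state.stage === 'failed') return state;
    // Wait before the next poll (2 seconds by default, per the standard above).
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A page would call pollProgress alongside the long-running POST and close the modal when either the POST resolves or the poll returns a terminal stage. An in-memory store is lost on server restart, so the poll should tolerate an "unknown" stage.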

Category B: Spinner is acceptable

For operations that are fast (5-30s) but bumped to a longer timeout as a safety net, a spinner with a friendly status line is fine.

The scan SCAN_HOT_progress-modal-coverage identifies which slow endpoints currently lack a progress modal.

The repoSCANS folder (admin/tasks/tasks_active/repoSCANS/) tracks ongoing work in this area:

| Scan | Status |
|---|---|
| SCAN_HOT_api-timeout-map | Done 2026-05-16 (this document) |
| SCAN_HOT_error-message-objects | New: promotes errMsg to shared helper |
| SCAN_HOT_template-string-objects | New: finds risky ${obj} interpolations |
| SCAN_HOT_progress-modal-coverage | New: adds progress modals to slow routes |
| SCAN_WARM_backend-timeouts | New: same pattern, server-side |
| SCAN_WARM_eventsource-backoff | New: reconnect storm prevention |
| SCAN_WARM_localstorage-quota | New: defensive storage wrappers |

How to verify the system is working

A few quick checks:

# Find the timeout map
rg "SLOW_ENDPOINT_PATTERNS" admin/src/js/shared/api-client.js

# Count current patterns (-F treats "{ rx: " as a literal string; -c counts matching lines)
rg -c -F "{ rx: " admin/src/js/shared/api-client.js

# Find any inline timeout overrides (these still win, just verifying)
rg "timeout:\s*\d+" admin/src/js/pages --type js

When adding a new long-running backend route, the test is: does it complete in under 30 seconds in the worst case? If not, add a pattern to the map.

Migration notes

  • Existing {timeout: x} overrides in page JS still work — they win over the map and are preserved untouched
  • page_hr_shift_cover.js was migrated: its inline {timeout: 300000} was removed because the map now handles it. The cover-request-specific error toast was kept (it has friendlier copy than the generic message)
  • No other files needed modification

References

  • admin/src/js/shared/api-client.js — the implementation
  • admin/tasks/tasks_active/repoSCANS/SCAN_HOT_api-timeout-map.md — the original scan record
  • Blog post: Patterns Over Patches: A Frontend Resilience Sweep — the announcement
  • AGENTS.md — overall codebase standards

Document version: 1.0 — 2026-05-17 — Initial publication.