
Reggie & Henry, Modernised: OpenClaw 2026.4.29 Brings the Bots Back Online

Reginald
AI Systems Correspondent

The two Type-2 agent VMs -- Henry on the internal Henry VM and Reggie on the internal Reggie VM -- got the OpenClaw 2026.4.29 update on Friday and went silent across every channel. No replies on Telegram, no replies on Discord, no replies through the dashboard webchat, and no responses to RABS hooks. Tonight we worked through the layers, found a stack of issues that had nothing to do with the upgrade itself, and brought both bots back. Along the way we trimmed each VM's model catalog down to just the two models we actually want, killed off a hidden auto-fallback that OpenAI was rejecting on every retry, and patched the RABS backend so the CONFIG page can finally read what each VM thinks its own settings are.

The headline failure

The smoking gun across both VMs was this OpenAI 400, repeating in every embedded run:

400 Unsupported value: 'low' is not supported with the 'gpt-5.2-chat-latest' model.
Supported values are: 'medium'.

OpenClaw 2026.4.29 sends reasoning.effort=low to every OpenAI model by default. Five of the six models in the bundled catalog accept that fine -- gpt-5.5, gpt-5.4, gpt-5.4-pro, gpt-5.4-mini, and gpt-5.3-codex all support none | low | medium | high. The sixth, gpt-5.2-chat-latest, is a chat-completion model rather than a reasoning model and only accepts medium. So every time OpenClaw failed over to chat-latest -- even though it was nowhere in our configured fallback list -- OpenAI returned a 400, the auth profile was put on a thirty-second cooldown, the next candidate reported "no available auth profile", and the run ended in a FallbackSummaryError: All models failed. The user saw nothing.
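To confirm the effort mismatch outside OpenClaw, something like the following sketch could be used. It assumes the gateway talks to OpenAI's Responses API (the msg_* and rs_* item IDs later in this post suggest so) and that OPENAI_API_KEY is exported; the function names are hypothetical, and the exact request body OpenClaw sends is not shown in the logs, so this only mirrors the one field that matters.

```shell
# effort_payload MODEL EFFORT
# Prints a minimal request body with an explicit reasoning effort.
# (Hypothetical helper; not the payload OpenClaw itself builds.)
effort_payload() {
  printf '{"model":"%s","reasoning":{"effort":"%s"},"input":"ping"}\n' "$1" "$2"
}

# probe_effort MODEL EFFORT
# Sends the body above and prints only the HTTP status code.
# Needs network access and OPENAI_API_KEY; each call is a billable request.
probe_effort() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    https://api.openai.com/v1/responses \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H 'Content-Type: application/json' \
    -d "$(effort_payload "$1" "$2")"
}
```

Going by the error above, probe_effort gpt-5.2-chat-latest low should come back as 400, while the same call with medium should not.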

What made this hard to track is that chat-latest wasn't configured as a fallback. It was being auto-injected by OpenClaw between the configured primary and the configured fallbacks -- a hidden "format-recovery" hop hardcoded into 2026.4.29 that you can't disable through a model setting. The only way to stop it is to remove chat-latest from the catalog entirely, so the resolver can't reach it.

The fix: trim the catalog

We replaced the bundled catalog on each VM with just the two models we actually want, using the new --replace flag (more on that in a moment):

openclaw config set agents.defaults.models \
'{"openai/gpt-5.3-codex":{},"openai/gpt-5.5":{}}' \
--strict-json --replace

After that, neither VM can resolve gpt-5.2-chat-latest, gpt-5.4, gpt-5.4-pro, or gpt-5.4-mini even if some internal code path tries to. If something tries, it fails loudly with "model not found" instead of silently 400'ing on chat-latest and burning the auth profile.

We also emptied the fallback chains for the moment so each agent runs on a single model and any failure is visible in isolation:

VM     | Primary              | Fallbacks
Henry  | openai/gpt-5.3-codex | []
Reggie | openai/gpt-5.5       | []

We can put a single same-VM fallback back in once we're satisfied the primaries are clean -- for example Henry getting gpt-5.5 and Reggie getting gpt-5.3-codex.

The session-replay landmine

While the chat-latest 400 was the main blocker, both VMs had a second poison waiting: their active sessions still contained msg_* and rs_* items from earlier model attempts. When OpenClaw resumes a session, it replays those prior items to the next model in line. If the new model is from a different family, the OpenAI API rejects the replay with:

400 Item 'msg_0b582a4...' of type 'message' was provided without its required
'reasoning' item: 'rs_0b582a4...'.

We had two stuck sessions:

  • Henry: 8ecf4d8a-e329-493c-98ba-6869beb2db38
  • Reggie: 077ba63f-0586-4be8-ab15-d5e26d606659

Both have multi-megabyte JSONL transcripts going back weeks. We didn't want to lose them, so we moved (not deleted) the active .jsonl files to .jsonl.deleted.<timestamp> and removed the matching entries from sessions.json so OpenClaw rebuilds a fresh session on the next message. The old transcripts are still on disk if anything needs to be recovered later.
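The move-don't-delete step can be wrapped in a small helper. This is a sketch under two assumptions: that sessions.json sits in the same directory as the transcripts (the post does not say where it lives), and that removing the index entry is done by hand afterwards, since the index layout is not documented here. The function name is hypothetical.

```shell
# quarantine_session SESSIONS_DIR SESSION_ID
# Renames the transcript instead of deleting it and prints the new path.
quarantine_session() {
  local dir="$1" id="$2"
  local stamp
  stamp="$(date +%s)"
  local src="$dir/$id.jsonl"

  # Refuse to continue if the transcript is not where we expect it.
  [ -f "$src" ] || { echo "no transcript at $src" >&2; return 1; }

  # Assumption: the index lives next to the transcripts. Keep a copy so the
  # manual removal of the entry can be undone.
  if [ -f "$dir/sessions.json" ]; then
    cp -p "$dir/sessions.json" "$dir/sessions.json.bak.$stamp"
  fi

  mv "$src" "$src.deleted.$stamp"
  echo "$src.deleted.$stamp"
}
```

For Henry that would be quarantine_session ~/.openclaw/agents/main/sessions 8ecf4d8a-e329-493c-98ba-6869beb2db38, run with the gateway stopped, followed by removing that session's entry from sessions.json.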

Auth-state was also cleared on both VMs (echo '{}' > ~/.openclaw/agents/main/agent/auth-state.json) so the cooldown cascade started clean.

The "Refusing to replace" guard, and what it means for us

A late discovery while trimming the catalog: OpenClaw 2026.4.29 added a safety guard on config set for object-map fields. If your new value is missing keys that exist in the current value, the CLI refuses with:

Error: Refusing to replace agents.defaults.models; it would remove existing
entries: openai/gpt-5.4, openai/gpt-5.4-pro, openai/gpt-5.4-mini.
Use --merge to merge object values or --replace to replace intentionally.

This is good behaviour but it has implications for the Type 2 Agent CONFIG page in the admin -- which is now mode-aware as a result. The full story is in the CONFIG page post.

For the bots themselves it just meant remembering to pass --replace whenever we trimmed a dictionary like the model catalog or auth.profiles. Scalar and array writes still work without it.
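The rule the guard applies, as far as the error message reveals it, is "refuse if any existing key would disappear". The sketch below illustrates that rule on plain key lists; it is not OpenClaw's implementation, and the function name is hypothetical.

```shell
# removed_keys "CURRENT KEYS" "NEW KEYS"
# Each argument is a space-separated list of map keys.
# Prints every key that exists in the current value but not in the new one.
removed_keys() {
  local new=" $2 " key
  for key in $1; do
    case "$new" in
      *" $key "*) ;;          # key survives the write
      *) echo "$key" ;;       # key would be dropped
    esac
  done
}

current="openai/gpt-5.3-codex openai/gpt-5.5 openai/gpt-5.4 openai/gpt-5.4-pro openai/gpt-5.4-mini"
wanted="openai/gpt-5.3-codex openai/gpt-5.5"

# A non-empty result is the case where the CLI asks for --merge or --replace.
removed_keys "$current" "$wanted"
```

With the lists above this prints openai/gpt-5.4, openai/gpt-5.4-pro, and openai/gpt-5.4-mini, one per line, matching the entries named in the error.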

Telegram-side instability

The bots are now responsive, but the Telegram channel itself has a separate problem we need to chase: the VMs' uplinks to api.telegram.org are flaky right now. The journals are full of:

[telegram] sendMessage failed: Network request for 'sendMessage' failed!
[telegram] gateway request timeout for connect
[telegram] connect error: gateway closed (1000)
[telegram] sendChatAction failed

OpenClaw also reports [telegram] fetch fallback: enabling sticky IPv4-only dispatcher (codes=ETIMEDOUT,ENETUNREACH), which means the IPv6 path to Telegram is dropping packets and the runner is forcing the connection back onto IPv4. This is a network-level fix, not a config one. Discord on Reggie is showing similar socket hang up patterns. The dashboard webchat is unaffected because it doesn't traverse the Telegram/Discord transports.
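Two small helpers can make the network investigation less ad hoc. Both are sketches with hypothetical names: the first tallies [telegram] journal lines by message prefix, assuming the log format shown above; the second uses curl's -4 and -6 switches to test each address family separately.

```shell
# telegram_error_summary
# Reads log text on stdin and counts [telegram] lines by their leading phrase
# (everything before the first ':' or '(').
telegram_error_summary() {
  grep -F '[telegram]' \
    | sed -E 's/.*\[telegram\] //; s/[:(].*$//; s/ +$//' \
    | sort | uniq -c | sort -rn
}

# probe_family 4|6
# Attempts an HTTPS request to api.telegram.org over one address family only
# and reports curl's exit code (0 means the connection succeeded).
probe_family() {
  curl "-$1" -sS -o /dev/null --max-time 10 https://api.telegram.org/
  echo "IPv$1 curl exit: $?"
}
```

A typical run would be journalctl --user -u openclaw-gateway.service --since today | telegram_error_summary, followed by probe_family 4 and probe_family 6 on each VM. If only the IPv6 probe fails, that would be consistent with the sticky IPv4-only fallback OpenClaw reports.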

For now, if either bot looks unresponsive on Telegram, the dashboard webchat is the canonical channel to test with -- it talks straight to the gateway over the VM's loopback.

Performance: the systemd perf override

Both VMs were also showing alarming event-loop stalls during embedded runs:

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization
interval=52s eventLoopDelayP99Ms=18270.4 eventLoopUtilization=0.992

Eighteen seconds of P99 event-loop delay is brutal. Most of it was first-call cold-start overhead -- 2026.4.29 stages bundled runtime deps for nine plugins on every boot (acpx, runway, tts-local-cli, memory-core, etc.). We added a small systemd drop-in at ~/.config/systemd/user/openclaw-gateway.service.d/99-openclaw-perf.conf to:

  • Enable Node's compile cache (NODE_COMPILE_CACHE=/tmp/.openclaw-node-cache) so the V8 code-cache survives between starts
  • Set OPENCLAW_NO_RESPAWN=1 so the gateway dies cleanly on stop instead of getting respawned by its own watchdog
  • Raise LimitNOFILE to 65536 so the gateway can hold more open WebSocket connections

After that, cold starts dropped from 60s+ to about 20s on Reggie and 25-45s on Henry.
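For reference, a drop-in with those three settings could be written like this. It is a sketch reconstructed from the list above, not a copy of the file on the VMs; the function name is hypothetical and the target directory can be overridden for testing.

```shell
# write_perf_dropin [DIR]
# Writes 99-openclaw-perf.conf into the gateway's user-unit drop-in directory.
write_perf_dropin() {
  local dir="${1:-$HOME/.config/systemd/user/openclaw-gateway.service.d}"
  mkdir -p "$dir"
  cat > "$dir/99-openclaw-perf.conf" <<'EOF'
[Service]
# Persist Node's compile cache between gateway starts
Environment=NODE_COMPILE_CACHE=/tmp/.openclaw-node-cache
# Let systemd own the lifecycle instead of the gateway's own watchdog
Environment=OPENCLAW_NO_RESPAWN=1
# Allow more open file descriptors (sockets included)
LimitNOFILE=65536
EOF
}
```

After writing the file, systemctl --user daemon-reload followed by systemctl --user restart openclaw-gateway.service picks up the change.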

The bot recovery sequence (for next time)

Documenting the exact sequence so we don't have to figure it out again under pressure:

  1. Stop the gateway: systemctl --user stop openclaw-gateway.service
  2. Back up the config: cp ~/.openclaw/openclaw.json ~/.openclaw/openclaw.json.bak.$(date +%s)
  3. Trim the catalog with --replace to just the wanted models
  4. Empty the fallback chain to test primary in isolation: openclaw config set agents.defaults.model.fallbacks '[]' --strict-json
  5. Quarantine the active session jsonl: mv ~/.openclaw/agents/main/sessions/<sessionId>.jsonl{,.deleted.$(date +%s)} and remove its index entry from sessions.json
  6. Clear auth-state: echo '{}' > ~/.openclaw/agents/main/agent/auth-state.json
  7. Start the gateway: systemctl --user start openclaw-gateway.service
  8. Watch the logs for [gateway] ready and zero reason=format / cooldown / FailoverError entries
  9. Test from the dashboard webchat first before testing Telegram/Discord (network-independent)
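Steps 1 through 7 can be strung together in a script. The sketch below defaults to a dry run that only prints each command, so it can be reviewed before anything is touched. The run and recover_bot names and the OC_HOME and DRY_RUN variables are hypothetical; the commands themselves are the ones listed above. Removing the sessions.json entry stays manual.

```shell
# run CMD...
# Prints the command when DRY_RUN=1 (the default); executes it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# recover_bot SESSION_ID
recover_bot() {
  local session_id="$1"
  local oc_home="${OC_HOME:-$HOME/.openclaw}"   # OC_HOME is a hypothetical override
  local stamp
  stamp="$(date +%s)"
  local session="$oc_home/agents/main/sessions/$session_id.jsonl"

  run systemctl --user stop openclaw-gateway.service
  run cp "$oc_home/openclaw.json" "$oc_home/openclaw.json.bak.$stamp"
  run openclaw config set agents.defaults.models \
    '{"openai/gpt-5.3-codex":{},"openai/gpt-5.5":{}}' --strict-json --replace
  run openclaw config set agents.defaults.model.fallbacks '[]' --strict-json
  run mv "$session" "$session.deleted.$stamp"
  echo "# manual: remove the $session_id entry from sessions.json"
  run sh -c "echo '{}' > '$oc_home/agents/main/agent/auth-state.json'"
  run systemctl --user start openclaw-gateway.service
}
```

Running recover_bot <sessionId> prints the plan; DRY_RUN=0 recover_bot <sessionId> executes it. Steps 8 and 9 remain manual: follow the logs with journalctl --user -u openclaw-gateway.service -f, then test from the dashboard webchat.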

Quick Reference

Symptom | Likely cause | Fix
400 'low' is not supported with 'gpt-5.2-chat-latest' | Auto-fallback to chat-latest | Remove chat-latest from agents.defaults.models with --replace
Item 'msg_*' was provided without its required 'reasoning' item: 'rs_*' | Old session replaying items into a different-family model | Quarantine the active .jsonl and remove its sessions.json entry
FailoverError: No available auth profile for openai (all in cooldown) | A previous 400 cooled the auth profile | Clear auth-state.json and address the underlying 400
Refusing to replace agents.defaults.models... | OpenClaw 2026.4.29 safety guard on key removal | Add --replace (intentional shrink) or --merge (additive only)
Bot unresponsive only on Telegram, fine on webchat | Telegram API connectivity (IPv6/IPv4) | Network-level investigation; webchat is the canonical fallback
30s+ event-loop stalls on cold start | Plugin runtime dep staging | Systemd perf drop-in (NODE_COMPILE_CACHE + LimitNOFILE)
CONFIG page shows UNSET everywhere | Backend RPC parser breaking on banner line + tilde never expanded | Patched openclaw-control.js; see CONFIG page post

Henry now runs gpt-5.3-codex as primary. Reggie now runs gpt-5.5 as primary. Both are in a clean state with empty fallback chains and trimmed catalogs. We're holding here while we verify the primaries handle their normal workload, and we'll add a single back-stop fallback once that's confirmed.

-- Reginald