🧩 The Central LLM Gateway

The Central LLM Gateway is the single choke-point for every large-language-model call made anywhere inside RABS.
It guarantees:

  • Consistent three-tier prompt assembly
  • Prime-Directive safety filtering and PII redaction
  • Unified cost / token accounting
  • Vendor hot-swap through a declarative Model Matrix
  • End-to-end observability (internalmonologue events)

Historical design notes live in docs_archive/LLM.md.


1. Why Centralise?

| Need / Concern | Benefit of the Gateway |
|---|---|
| Model drift | Hot-swap models platform-wide without touching feature code. |
| Cost visibility | One place to meter token usage and dollar cost per module. |
| Prompt hygiene | The Prompt Assembly Engine enforces System → Situation → Welfare stacking, preventing rule override. |
| Compliance & PII | Redacts sensitive data and injects Prime Directives before any vendor sees the prompt. |
| Observability | Emits unified Prometheus metrics and LLM_CALL events to internalmonologue. |

2. High-Level Flow

┌───────────┐  JSON-RPC  ┌──────────────────┐  HTTPS  ┌────────────┐
│  Module   │ ──────────►│ LLM Gateway API  │ ───────►│ Vendor LLM │
└───────────┘            └────────┬─────────┘         └────────────┘
      ▲ (response stream)         │ (logs every call)
      └───────────────────────────▼
              internalmonologue.event_type = 'LLM_CALL'

Modules never call vendors directly; they invoke the internal Gateway API.
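As a rough illustration of this rule, a module-side helper might look like the sketch below. The `/llm/invoke` path and the `task` / `situation` fields are taken from the request lifecycle described in this document; the `Transport` type, the `invokeGateway` name, and the reply shape are assumptions made for the example, not the Gateway's actual client API.

```typescript
// Hypothetical module-side helper. Everything except the /llm/invoke path
// and the task/situation fields is an assumption made for illustration.

interface InvokeRequest {
  task: string; // abstract task key, e.g. "reasoning.v2"
  situation: Record<string, unknown>; // caller-provided context
}

// Minimal transport abstraction so the sketch does not depend on a
// particular HTTP client; in practice this could wrap fetch().
type Transport = (
  path: string,
  jsonBody: string,
) => Promise<{ status: number; body: string }>;

async function invokeGateway(
  send: Transport,
  request: InvokeRequest,
): Promise<string> {
  // Modules only ever address the internal Gateway route, never a vendor URL.
  const reply = await send("/llm/invoke", JSON.stringify(request));
  if (reply.status !== 200) {
    throw new Error(`Gateway returned HTTP ${reply.status}`);
  }
  return reply.body;
}
```

Injecting the transport keeps vendor URLs and API keys out of feature code entirely, which is the point of routing every call through the Gateway.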


3. Key Components

| Component | Path / File | Purpose |
|---|---|---|
| Model Matrix | config/models.yaml | Maps abstract task keys to concrete provider endpoints. |
| Prompt Assembly Engine | brainframe/promptEngine.ts | Builds the final prompt from System, Situation, Welfare layers. |
| Gateway API | server/routes/llm.ts | Validates request, resolves model, executes vendor call. |
| PII Redactor | middleware/redact.ts | Scrubs or masks user data before prompt leaves the platform. |
| Telemetry & Billing | middleware/metrics.ts | Counts tokens, estimates cost, pushes metrics to Grafana. |
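To make the Prompt Assembly Engine and PII Redactor rows more concrete, here is a minimal sketch of how the two might compose. It is not the code in promptEngine.ts or redact.ts: the function names, the bracketed layer labels, and the two redaction patterns are all assumptions chosen for illustration.

```typescript
// Sketch only: names, layer labels, and regex patterns are hypothetical.

interface PromptLayers {
  system: string; // Prime Directives + style guide
  situation: string; // caller-provided context
  welfare: string; // grounding line + risk disclaimer
}

// Stack the layers in a fixed order: System first, Welfare last.
function assemblePrompt(layers: PromptLayers): string {
  return [
    `[System]\n${layers.system}`,
    `[Situation]\n${layers.situation}`,
    `[Welfare]\n${layers.welfare}`,
  ].join("\n\n");
}

// Mask e-mail addresses and long digit runs (a crude stand-in for phone
// or account numbers). Real PII detection needs far more than this.
function redactPii(text: string): string {
  return text
    .replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]")
    .replace(/\d{7,}/g, "[NUMBER]");
}

// Redaction runs on the fully assembled prompt, just before it would
// leave the platform.
const prompt = redactPii(
  assemblePrompt({
    system: "Follow the Prime Directives.",
    situation: "Client jane.doe@example.com called from 0412345678.",
    welfare: "If unsure, say so.",
  }),
);
console.log(prompt);
```

Running redaction after assembly means every layer is scrubbed by the same pass, including any memory snippets merged into the Situation layer.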

4. The Model Matrix (models.yaml)

This file decouples intent from implementation:

version: 1.0
defaults:
  temperature: 0.2

tasks:
  summary.v3:
    provider: openai:gpt-4o
    max_tokens: 300

  reasoning.v2:
    provider: anthropic:claude-3-opus
    max_tokens: 1024

  embeddings.v1:
    provider: openai:text-embedding-3
    dims: 1536

Swapping every reasoning call from Claude to Gemini involves changing a single line here.
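A resolver over this file could be sketched roughly as follows. The matrix is mirrored as an in-memory object to keep the example self-contained (a real Gateway would presumably parse config/models.yaml); the type names, `resolveModel`, and the rule that task-level settings override `defaults` are assumptions, not confirmed behaviour.

```typescript
// In-memory mirror of the YAML above. All type and function names are
// hypothetical, as is the defaults-merging rule.

interface TaskConfig {
  provider: string; // "<vendor>:<model>"
  max_tokens?: number;
  dims?: number;
  temperature?: number;
}

interface ModelMatrix {
  defaults: { temperature: number };
  tasks: Record<string, TaskConfig>;
}

const matrix: ModelMatrix = {
  defaults: { temperature: 0.2 },
  tasks: {
    "summary.v3": { provider: "openai:gpt-4o", max_tokens: 300 },
    "reasoning.v2": { provider: "anthropic:claude-3-opus", max_tokens: 1024 },
    "embeddings.v1": { provider: "openai:text-embedding-3", dims: 1536 },
  },
};

interface ResolvedModel {
  vendor: string;
  model: string;
  temperature: number;
  maxTokens: number | undefined;
}

function resolveModel(m: ModelMatrix, task: string): ResolvedModel {
  const cfg = m.tasks[task];
  if (cfg === undefined) {
    throw new Error(`Unknown task key: ${task}`);
  }
  // Split on the first colon only, so model names may contain colons.
  const sep = cfg.provider.indexOf(":");
  if (sep < 0) {
    throw new Error(`Malformed provider string: ${cfg.provider}`);
  }
  return {
    vendor: cfg.provider.slice(0, sep),
    model: cfg.provider.slice(sep + 1),
    // Assumed rule: a task-level temperature wins over the shared default.
    temperature: cfg.temperature ?? m.defaults.temperature,
    maxTokens: cfg.max_tokens,
  };
}

console.log(resolveModel(matrix, "reasoning.v2"));
```

Because callers pass only the task key, editing the `provider` value for `reasoning.v2` changes the vendor for every caller without any code change on their side.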


5. Request Lifecycle

  1. Module Call – POST /llm/invoke with:
    { "task": "reasoning.v2", "situation": { ... } }
  2. Validation – JSON schema check, rate-limit enforcement.
  3. Prompt Build – Prompt Engine composes:
    [System: Prime Directives + Style Guide]
    [Situation: caller-provided context, memory snippets]
    [Welfare: grounding buffer line + risk disclaimer]
  4. Redaction – PII scrub + sensitive-term blacklist.
  5. Model Resolve – Look up reasoning.v2 in Model Matrix → claude-3-opus.
  6. Vendor Call – HTTPS request; retry with exponential back-off on 5xx.
  7. Logging – Hash of input, chosen model, tokens in/out, USD cost → internalmonologue.
  8. Response – Stream back to caller with latency stats.
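Step 6 (retry with exponential back-off on 5xx) can be sketched as below. The document only names the technique; the 200 ms base delay, the cap of four attempts, and the function names are assumptions picked for the example.

```typescript
// Sketch of the vendor-call retry step. Base delay, attempt cap, and all
// names are hypothetical.

interface VendorReply {
  status: number;
  body: string;
}

// Delay doubles on each retry: 200 ms, 400 ms, 800 ms, ...
function backoffDelayMs(attempt: number, baseMs: number = 200): number {
  return baseMs * Math.pow(2, attempt);
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

async function callWithRetry(
  call: () => Promise<VendorReply>,
  maxAttempts: number = 4,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<VendorReply> {
  let reply = await call();
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    // Only server-side failures are retried; 2xx-4xx replies return at once.
    if (reply.status < 500) {
      return reply;
    }
    await wait(backoffDelayMs(attempt));
    reply = await call();
  }
  return reply; // last reply, which may still be a 5xx
}
```

The `wait` parameter exists so the delay can be replaced in tests. Production back-off commonly adds random jitter as well, which this sketch leaves out, and exceptions thrown by `call` (for example network errors) propagate without being retried.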

6. Governance & Metrics

  • Code owners: AI Platform team. PRs modifying Gateway, Model Matrix, or Prompt Engine require Security + Tech-Lead approval.
  • Metrics:
    • llm_calls_total{task=…}
    • llm_tokens_in, llm_tokens_out, llm_cost_usd
    • llm_vendor_latency_seconds
  • Alerts:
    • Cost overrun > $50/day
    • Error-rate > 2 % over 5 minutes
    • PII redaction failure (blocked request)
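The cost metric and the first two alert thresholds reduce to simple arithmetic, sketched here in TypeScript. The per-1k-token prices are placeholders rather than real vendor rates, and all names are hypothetical; in a Prometheus-based setup the threshold checks would more likely live in alerting rules than in application code.

```typescript
// Illustrative arithmetic only. Prices are placeholders, names are
// hypothetical; the two limits come from the alert list above.

interface Price {
  inPer1k: number; // USD per 1,000 input tokens
  outPer1k: number; // USD per 1,000 output tokens
}

function estimateCostUsd(
  tokensIn: number,
  tokensOut: number,
  price: Price,
): number {
  return (tokensIn / 1000) * price.inPer1k + (tokensOut / 1000) * price.outPer1k;
}

interface WindowStats {
  calls: number; // calls observed in the 5-minute window
  errors: number; // failed calls in the same window
}

const DAILY_COST_LIMIT_USD = 50;
const ERROR_RATE_LIMIT = 0.02; // 2 %

function costOverrun(dailyCostUsd: number): boolean {
  return dailyCostUsd > DAILY_COST_LIMIT_USD;
}

function errorRateBreached(w: WindowStats): boolean {
  if (w.calls === 0) {
    return false; // no traffic, nothing to alert on
  }
  return w.errors / w.calls > ERROR_RATE_LIMIT;
}

// 2,000 input and 500 output tokens at the placeholder prices.
const exampleCost = estimateCostUsd(2000, 500, { inPer1k: 0.01, outPer1k: 0.03 });
console.log(exampleCost.toFixed(3));
```

Both checks use a strict greater-than comparison, matching the ">" in the alert definitions: exactly $50 in a day or exactly 2 % errors does not fire.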