🧩 The Central LLM Gateway
The Central LLM Gateway is the single choke-point for every large-language-model call made anywhere inside RABS.
It guarantees:
- Consistent three-tier prompt assembly
- Prime-Directive safety filtering and PII redaction
- Unified cost / token accounting
- Vendor hot-swap through a declarative Model Matrix
- End-to-end observability (`internalmonologue` events)
Historical design notes live in docs_archive/LLM.md.
1. Why Centralise?
| Need / Concern | Benefit of the Gateway |
|---|---|
| Model drift | Hot-swap models platform-wide without touching feature code. |
| Cost visibility | One place to meter token usage and dollar cost per module. |
| Prompt hygiene | The Prompt Assembly Engine enforces System → Situation → Welfare stacking, preventing rule override. |
| Compliance & PII | Redacts sensitive data and injects Prime Directives before any vendor sees the prompt. |
| Observability | Emits unified Prometheus metrics and LLM_CALL events to internalmonologue. |
2. High-Level Flow
    ┌───────────┐ JSON-RPC   ┌──────────────────┐ HTTPS   ┌────────────┐
    │  Module   │ ──────────►│ LLM Gateway API  │ ───────►│ Vendor LLM │
    └───────────┘            └────────┬─────────┘         └────────────┘
          ▲ (response stream)         │ (logs every call)
          └───────────────────────────▼
                  internalmonologue.event_type = 'LLM_CALL'
Modules never call vendors directly; they invoke the internal Gateway API.
3. Key Components
| Component | Path / File | Purpose |
|---|---|---|
| Model Matrix | config/models.yaml | Maps abstract task keys to concrete provider endpoints. |
| Prompt Assembly Engine | brainframe/promptEngine.ts | Builds the final prompt from System, Situation, Welfare layers. |
| Gateway API | server/routes/llm.ts | Validates request, resolves model, executes vendor call. |
| PII Redactor | middleware/redact.ts | Scrubs or masks user data before prompt leaves the platform. |
| Telemetry & Billing | middleware/metrics.ts | Counts tokens, estimates cost, pushes metrics to Grafana. |
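As a rough illustration of the Prompt Assembly Engine and PII Redactor rows above, the following TypeScript sketch stacks the three layers in a fixed order and then masks obvious PII before the prompt would leave the platform. The names `PromptLayers`, `assemblePrompt`, `redact`, and `buildOutgoingPrompt`, as well as the two regex patterns, are assumptions made for this example; they are not taken from `brainframe/promptEngine.ts` or `middleware/redact.ts`.

```typescript
// Hypothetical container for the three layers; field names are assumptions.
interface PromptLayers {
  system: string;    // Prime Directives + style guide
  situation: string; // caller-provided context, memory snippets
  welfare: string;   // grounding buffer line + risk disclaimer
}

// Illustrative patterns only; real PII detection needs much broader coverage.
const PII_PATTERNS: Array<[RegExp, string]> = [
  [/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[EMAIL]"],
  [/\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g, "[PHONE]"],
];

// Replace every match of every pattern with its mask.
function redact(text: string): string {
  return PII_PATTERNS.reduce(
    (acc, [pattern, mask]) => acc.replace(pattern, mask),
    text,
  );
}

// Fixed order: System first, caller content in the middle, Welfare last.
// The caller cannot reorder layers, which is what prevents rule override.
function assemblePrompt(layers: PromptLayers): string {
  return [
    `[System]\n${layers.system}`,
    `[Situation]\n${layers.situation}`,
    `[Welfare]\n${layers.welfare}`,
  ].join("\n\n");
}

// Build first, then redact the whole prompt.
function buildOutgoingPrompt(layers: PromptLayers): string {
  return redact(assemblePrompt(layers));
}
```

Redacting after assembly means any PII that slips into the System or Welfare layers is masked as well, at the cost of running the patterns over text that is usually static.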
4. The Model Matrix (models.yaml)
This file decouples intent from implementation:
    version: 1.0
    defaults:
      temperature: 0.2
    tasks:
      summary.v3:
        provider: openai:gpt-4o
        max_tokens: 300
      reasoning.v2:
        provider: anthropic:claude-3-opus
        max_tokens: 1024
      embeddings.v1:
        provider: openai:text-embedding-3
        dims: 1536
Swapping every reasoning call from Claude to Gemini involves changing a single line here.
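To make the lookup concrete, here is a minimal TypeScript sketch of how a task key could be resolved against the matrix. It assumes the `provider` string has the form `<vendor>:<model>` and that `defaults` apply to every task; the types and the `resolveTask` function are hypothetical, and the matrix is inlined as an object rather than parsed from `config/models.yaml`.

```typescript
// Hypothetical types mirroring the models.yaml structure.
interface TaskEntry {
  provider: string; // "<vendor>:<model>", e.g. "anthropic:claude-3-opus"
  max_tokens?: number;
  dims?: number;
}

interface ModelMatrix {
  version: number;
  defaults: { temperature: number };
  tasks: Record<string, TaskEntry>;
}

interface ResolvedModel {
  vendor: string;
  model: string;
  temperature: number;
  maxTokens: number | undefined;
  dims: number | undefined;
}

// In-memory copy of the example matrix; a real gateway would load the YAML file.
const matrix: ModelMatrix = {
  version: 1.0,
  defaults: { temperature: 0.2 },
  tasks: {
    "summary.v3": { provider: "openai:gpt-4o", max_tokens: 300 },
    "reasoning.v2": { provider: "anthropic:claude-3-opus", max_tokens: 1024 },
    "embeddings.v1": { provider: "openai:text-embedding-3", dims: 1536 },
  },
};

// Resolve an abstract task key to a concrete vendor and model.
function resolveTask(m: ModelMatrix, task: string): ResolvedModel {
  const entry = m.tasks[task];
  if (entry === undefined) {
    throw new Error(`Unknown task key: ${task}`);
  }
  const sep = entry.provider.indexOf(":");
  if (sep <= 0 || sep === entry.provider.length - 1) {
    throw new Error(`Malformed provider string: ${entry.provider}`);
  }
  return {
    vendor: entry.provider.slice(0, sep),
    model: entry.provider.slice(sep + 1),
    temperature: m.defaults.temperature,
    maxTokens: entry.max_tokens,
    dims: entry.dims,
  };
}
```

Because callers pass only the task key (for example `resolveTask(matrix, "reasoning.v2")`), editing the `provider` value for that key changes the vendor for every caller at once.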
5. Request Lifecycle
- Module Call: `POST /llm/invoke` with `{ "task": "reasoning.v2", "situation": { ... } }`
- Validation: JSON schema check, rate-limit enforcement.
- Prompt Build: Prompt Engine composes:
      [System: Prime Directives + Style Guide]
      [Situation: caller-provided context, memory snippets]
      [Welfare: grounding buffer line + risk disclaimer]
- Redaction: PII scrub + sensitive-term blacklist.
- Model Resolve: Lookup `reasoning.v2` in Model Matrix → `claude-3-opus`.
- Vendor Call: HTTPS request; retry with exponential back-off on 5xx.
- Logging: Hash of input, chosen model, tokens in/out, USD cost => `internalmonologue`.
- Response: Stream back to caller with latency stats.
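The Vendor Call step can be sketched as follows. This is one possible shape for "retry with exponential back-off on 5xx", not the Gateway's actual code: `VendorResponse`, `RetryOptions`, `backoffDelayMs`, and `callWithRetry` are hypothetical names, the doubling schedule is an assumption, and the vendor call is injected as a function so the sketch needs no HTTP client.

```typescript
// Minimal response shape for this sketch; a real vendor client returns more.
interface VendorResponse {
  status: number;
  body: string;
}

interface RetryOptions {
  maxRetries: number;                    // retries allowed after the first attempt
  baseDelayMs: number;                   // delay before the first retry
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay doubles with each retry: base, 2 * base, 4 * base, ...
function backoffDelayMs(retryIndex: number, baseDelayMs: number): number {
  return baseDelayMs * Math.pow(2, retryIndex);
}

function isServerError(status: number): boolean {
  return status >= 500 && status < 600;
}

// Calls the vendor once, then retries only while the response is a 5xx.
async function callWithRetry(
  call: () => Promise<VendorResponse>,
  opts: RetryOptions,
): Promise<VendorResponse> {
  const sleep = opts.sleep ?? defaultSleep;
  let response = await call();
  for (
    let retry = 0;
    retry < opts.maxRetries && isServerError(response.status);
    retry++
  ) {
    await sleep(backoffDelayMs(retry, opts.baseDelayMs));
    response = await call();
  }
  return response; // may still be a 5xx if every retry failed
}
```

4xx responses are returned immediately in this sketch, since retrying a request the vendor has rejected as invalid would only add cost and latency.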
6. Governance & Metrics
- Code owners: AI Platform team. PRs modifying Gateway, Model Matrix, or Prompt Engine require Security + Tech-Lead approval.
- Metrics:
  - `llm_calls_total{task=…}`
  - `llm_tokens_in`, `llm_tokens_out`, `llm_cost_usd`
  - `llm_vendor_latency_seconds`
- Alerts:
  - Cost overrun > $50/day
  - Error rate > 2% over 5 minutes
  - PII redaction failure (blocked request)
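The alert thresholds above reduce to simple comparisons, sketched below in TypeScript. The `GatewayWindow` snapshot, its field names, and `evaluateAlerts` are hypothetical; in a Prometheus-based setup such conditions would more commonly be written as alerting rules, so this only makes the arithmetic explicit.

```typescript
// Hypothetical snapshot of gateway counters; field names are assumptions.
interface GatewayWindow {
  costUsdToday: number;      // running cost total for the current day
  callsLast5Min: number;     // calls observed in the last 5 minutes
  errorsLast5Min: number;    // failed calls in the same window
  redactionFailures: number; // requests blocked because redaction failed
}

type Alert = "COST_OVERRUN" | "ERROR_RATE" | "PII_REDACTION_FAILURE";

const COST_LIMIT_USD_PER_DAY = 50;
const ERROR_RATE_LIMIT = 0.02; // 2 %

function evaluateAlerts(w: GatewayWindow): Alert[] {
  const alerts: Alert[] = [];
  if (w.costUsdToday > COST_LIMIT_USD_PER_DAY) {
    alerts.push("COST_OVERRUN");
  }
  // Guard against division by zero when the window saw no traffic.
  const errorRate =
    w.callsLast5Min === 0 ? 0 : w.errorsLast5Min / w.callsLast5Min;
  if (errorRate > ERROR_RATE_LIMIT) {
    alerts.push("ERROR_RATE");
  }
  if (w.redactionFailures > 0) {
    alerts.push("PII_REDACTION_FAILURE");
  }
  return alerts;
}
```

Both numeric thresholds are strict: exactly $50 of spend or exactly a 2% error rate does not fire an alert in this sketch.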