🧩 The Central LLM Gateway

The Central LLM Gateway is the single choke-point for every large-language-model call made anywhere inside RABS.
It guarantees:

- Consistent three-tier prompt assembly
- Prime-Directive safety filtering and PII redaction
- Unified cost / token accounting
- Vendor hot-swap through a declarative Model Matrix
- End-to-end observability (`internalmonologue` events)

Historical design notes live in docs_archive/LLM.md.

1. Why Centralise?

| Need / Concern | Benefit of the Gateway |
| --- | --- |
| Model drift | Hot-swap models platform-wide without touching feature code. |
| Cost visibility | One place to meter token usage and dollar cost per module. |
| Prompt hygiene | The Prompt Assembly Engine enforces System → Situation → Welfare stacking, preventing rule override. |
| Compliance & PII | Redacts sensitive data and injects Prime Directives before any vendor sees the prompt. |
| Observability | Emits unified Prometheus metrics and `LLM_CALL` events to `internalmonologue`. |

2. High-Level Flow

┌───────────┐  JSON-RPC  ┌──────────────────┐  HTTPS  ┌────────────┐
│  Module   │ ──────────►│  LLM Gateway API │ ───────►│ Vendor LLM │
└───────────┘            └────────┬─────────┘        └────────────┘
        ▲ (response stream)       │  (logs every call)
        └─────────────────────────▼
          internalmonologue.event_type = 'LLM_CALL'

Modules never call vendors directly; they invoke the internal Gateway API.
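
As a rough illustration of the module side of this flow, the sketch below builds the request a module might send to the Gateway. It assumes the `POST /llm/invoke` route and the `task` / `situation` body described under Request Lifecycle; the names `InvokeRequest`, `Transport`, `buildInvokeRequest`, and `invokeLLM` are hypothetical and not taken from the RABS codebase.

```typescript
// Hypothetical request shape, mirroring the { task, situation } body of POST /llm/invoke.
interface InvokeRequest {
  task: string;                       // abstract task key, e.g. "reasoning.v2"
  situation: Record<string, unknown>; // caller-provided context
}

// Hypothetical transport: anything that can POST a JSON string and resolve with the parsed reply.
type Transport = (path: string, body: string) => Promise<unknown>;

// Build the path and serialized body for a Gateway call.
function buildInvokeRequest(
  task: string,
  situation: Record<string, unknown>
): { path: string; body: string } {
  if (task.trim().length === 0) {
    throw new Error("task key must not be empty");
  }
  const request: InvokeRequest = { task, situation };
  return { path: "/llm/invoke", body: JSON.stringify(request) };
}

// Modules go through the Gateway only; the vendor is never addressed here.
async function invokeLLM(
  transport: Transport,
  task: string,
  situation: Record<string, unknown>
): Promise<unknown> {
  const { path, body } = buildInvokeRequest(task, situation);
  return transport(path, body);
}
```

Injecting the transport keeps the sketch independent of any particular HTTP client, and makes it easy to substitute a fake Gateway in module tests.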

3. Key Components

| Component | Path / File | Purpose |
| --- | --- | --- |
| Model Matrix | `config/models.yaml` | Maps abstract task keys to concrete provider endpoints. |
| Prompt Assembly Engine | `brainframe/promptEngine.ts` | Builds the final prompt from System, Situation, Welfare layers. |
| Gateway API | `server/routes/llm.ts` | Validates request, resolves model, executes vendor call. |
| PII Redactor | `middleware/redact.ts` | Scrubs or masks user data before prompt leaves the platform. |
| Telemetry & Billing | `middleware/metrics.ts` | Counts tokens, estimates cost, pushes metrics to Grafana. |
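
To make the Prompt Assembly Engine's role more concrete, here is a minimal sketch of fixed-order layer stacking. It is not the contents of `brainframe/promptEngine.ts`; the `PromptLayers` type, the `assemblePrompt` function, and the bracketed section labels are assumptions chosen for illustration.

```typescript
// Hypothetical input: each layer is a list of lines supplied by a different owner.
interface PromptLayers {
  system: string[];    // Prime Directives, style guide
  situation: string[]; // caller-provided context, memory snippets
  welfare: string[];   // grounding line, risk disclaimer
}

// Stack the layers in a fixed System → Situation → Welfare order.
// The order is hard-coded, so a caller cannot move its own text ahead of the system layer.
function assemblePrompt(layers: PromptLayers): string {
  const sections: Array<[string, string[]]> = [
    ["System", layers.system],
    ["Situation", layers.situation],
    ["Welfare", layers.welfare],
  ];
  return sections
    .map(([label, lines]) => `[${label}]\n${lines.join("\n")}`)
    .join("\n\n");
}

const prompt = assemblePrompt({
  system: ["Prime Directives"],
  situation: ["User asked about invoices"],
  welfare: ["Risk disclaimer"],
});
```

Here `prompt` contains three labelled sections separated by blank lines, always in the same order regardless of how the caller populated the object.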

4. The Model Matrix (`models.yaml`)

This file decouples intent from implementation:

version: 1.0
defaults:
  temperature: 0.2

tasks:
  summary.v3:
    provider: openai:gpt-4o
    max_tokens: 300

  reasoning.v2:
    provider: anthropic:claude-3-opus
    max_tokens: 1024

  embeddings.v1:
    provider: openai:text-embedding-3
    dims: 1536

Swapping every reasoning call from Claude to Gemini means changing only the `provider` line under `reasoning.v2`.
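
The lookup itself can be sketched as follows. This is an illustration under assumptions: the matrix is shown as an already-parsed object mirroring the YAML above, the `vendor:model` split on the first colon is inferred from the `provider` values, and the idea that a task may override `defaults.temperature` is a guess. `ModelMatrix`, `ResolvedModel`, and `resolveModel` are hypothetical names.

```typescript
interface TaskConfig {
  provider: string;     // "<vendor>:<model>", e.g. "anthropic:claude-3-opus"
  max_tokens?: number;
  dims?: number;
  temperature?: number; // assumed per-task override of the default
}

interface ModelMatrix {
  version: number;
  defaults: { temperature: number };
  tasks: Record<string, TaskConfig>;
}

interface ResolvedModel {
  vendor: string;
  model: string;
  temperature: number;
  maxTokens: number | undefined;
}

// In-memory equivalent of the models.yaml shown above.
const matrix: ModelMatrix = {
  version: 1.0,
  defaults: { temperature: 0.2 },
  tasks: {
    "summary.v3": { provider: "openai:gpt-4o", max_tokens: 300 },
    "reasoning.v2": { provider: "anthropic:claude-3-opus", max_tokens: 1024 },
    "embeddings.v1": { provider: "openai:text-embedding-3", dims: 1536 },
  },
};

// Resolve an abstract task key to a concrete vendor and model.
function resolveModel(m: ModelMatrix, task: string): ResolvedModel {
  const cfg = m.tasks[task];
  if (cfg === undefined) {
    throw new Error(`Unknown task key: ${task}`);
  }
  const sep = cfg.provider.indexOf(":");
  if (sep < 0) {
    throw new Error(`Malformed provider for ${task}: ${cfg.provider}`);
  }
  return {
    vendor: cfg.provider.slice(0, sep),
    model: cfg.provider.slice(sep + 1),
    temperature: cfg.temperature ?? m.defaults.temperature,
    maxTokens: cfg.max_tokens,
  };
}
```

Because feature code only ever passes the task key, editing the `provider` string in the matrix changes what `resolveModel` returns without any change on the caller's side.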

5. Request Lifecycle

1. Module Call – the module sends `POST /llm/invoke` with:

   { "task": "reasoning.v2", "situation": { ... } }

2. Validation – JSON schema check and rate-limit enforcement.

3. Prompt Build – the Prompt Engine composes:

   [System: Prime Directives + Style Guide]
   [Situation: caller-provided context, memory snippets]
   [Welfare: grounding buffer line + risk disclaimer]

4. Redaction – PII scrub plus sensitive-term blacklist.
5. Model Resolve – look up `reasoning.v2` in the Model Matrix → `claude-3-opus`.
6. Vendor Call – HTTPS request; retry with exponential back-off on 5xx.
7. Logging – hash of input, chosen model, tokens in/out, and USD cost → `internalmonologue`.
8. Response – stream back to the caller with latency stats.
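
Two of the steps above, redaction and the retried vendor call, lend themselves to a short sketch. Everything here is illustrative: the e-mail pattern stands in for whatever rules `middleware/redact.ts` actually applies, and the retry count, base delay, and the names `redact`, `backoffDelayMs`, and `callWithRetry` are assumptions.

```typescript
// Mask e-mail addresses and any listed sensitive terms before the prompt leaves the platform.
function redact(text: string, sensitiveTerms: string[] = []): string {
  let out = text.replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[REDACTED_EMAIL]");
  for (const term of sensitiveTerms) {
    // Escape regex metacharacters so the term is matched literally.
    const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    out = out.replace(new RegExp(escaped, "gi"), "[REDACTED]");
  }
  return out;
}

interface VendorResponse {
  status: number;
  body: string;
}

// Exponential back-off: base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * Math.pow(2, attempt);
}

// Retry only on 5xx responses; any other status is returned to the caller immediately.
// Thrown errors (e.g. network failures) are not caught here and propagate as-is.
async function callWithRetry(
  call: () => Promise<VendorResponse>,
  maxRetries: number = 3,
  baseMs: number = 200,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<VendorResponse> {
  for (let attempt = 0; ; attempt++) {
    const res = await call();
    const retryable = res.status >= 500 && res.status <= 599;
    if (!retryable || attempt >= maxRetries) {
      return res;
    }
    await sleep(backoffDelayMs(attempt, baseMs));
  }
}
```

With the defaults assumed here, a vendor that keeps returning 5xx is called four times in total (one initial attempt plus three retries), with waits of 200 ms, 400 ms, and 800 ms in between.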

6. Governance & Metrics

Code owners: AI Platform team. PRs modifying Gateway, Model Matrix, or Prompt Engine require Security + Tech-Lead approval.
Metrics:
- llm_calls_total{task=…}
- llm_tokens_in, llm_tokens_out, llm_cost_usd
- llm_vendor_latency_seconds
Alerts:
- Cost overrun > $50/day
- Error rate > 2% over 5 minutes
- PII redaction failure (blocked request)
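
The alert conditions above can be expressed as a small pure function. This is only a sketch of the logic; in practice such rules would more likely live in the monitoring stack than in application code. The `AlertInput` shape and the alert names are hypothetical, while the thresholds ($50/day, 2% over 5 minutes) come from the list above.

```typescript
interface AlertInput {
  dailyCostUsd: number;                              // running total of llm_cost_usd for the day
  lastFiveMinutes: { calls: number; errors: number }; // counts within the 5-minute window
  redactionFailures: number;                          // requests blocked by the PII redactor
}

const COST_LIMIT_USD_PER_DAY = 50;
const ERROR_RATE_LIMIT = 0.02;

// Return the names of all alerts that should fire for the given snapshot.
function evaluateAlerts(input: AlertInput): string[] {
  const alerts: string[] = [];
  if (input.dailyCostUsd > COST_LIMIT_USD_PER_DAY) {
    alerts.push("cost_overrun");
  }
  const { calls, errors } = input.lastFiveMinutes;
  // Guard against division by zero when the window saw no traffic.
  if (calls > 0 && errors / calls > ERROR_RATE_LIMIT) {
    alerts.push("error_rate");
  }
  if (input.redactionFailures > 0) {
    alerts.push("pii_redaction_failure");
  }
  return alerts;
}
```

For example, a snapshot with $62 of spend, 3 errors in 100 calls, and no redaction failures would yield `cost_overrun` and `error_rate`.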
