Embedding V2 Linked Context Architecture
Created: 2026-01-24
Status: Design Complete - Awaiting Implementation
Related: 27_Embedding&Semantic_Search_Strategy.md, 16_YP3000_Identity_System.md
Executive Summary
This document extends the embedding architecture (Doc 27) with Linked Groups, Identity Anchoring, and Temporal Re-processing. The goal is to transform isolated semantic units into rich, connected knowledge that surfaces complete context rather than fragmented mentions.
Core insight: A search for "roast pumpkin" should return the 9-message meal planning conversation, not a single "yes pumpkin please" message.
1. The Problem with Isolated Units
The current system extracts atomic units, which is great for precision but loses vital context:
| Scenario | Current Result | Desired Result |
|---|---|---|
| Search "pumpkin" | Single message: "yes pumpkin please" | Full conversation: 9-message meal planning discussion |
| Search incident | Incident report only | Incident + related shift notes from staff who were there |
| Search "John" | Scattered mentions | All context about John, grouped by topic/time |
| Search question | The question (unanswered) | Question + answer (if found later) |
We need a layer above units that groups related content into searchable contexts.
2. New Concepts
2.1 Linked Groups
A Linked Group is a collection of units that form a coherent searchable context:
| Group Type | Description | Example |
|---|---|---|
| conversation | Multi-message discussion | Discord meal planning thread |
| incident_context | Incident + supporting content | Incident report + 4 staff shift notes |
| qa_pair | Question and its resolution | "Has anyone seen the keys?" + "Found them in the van" |
| decision_thread | Discussion leading to decision | Budget approval conversation |
| topic_cluster | Auto-detected similar content | All mentions of vehicle maintenance |
| daily_context | Everything at location on date | All activity at Picton SIL on Tuesday |
Key: The group gets its own embedding representing the WHOLE context. This becomes the primary search target.
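The document does not say how the group embedding is produced. As one possibility, a minimal sketch is shown here that assumes a relevance-weighted centroid of the member unit embeddings, L2-normalized; embedding the LLM-generated summary text would be an equally plausible alternative. The names `MemberVector` and `groupCentroid` are hypothetical.

```typescript
// Hypothetical shape: one member unit's embedding plus its relevance_score (0-1).
interface MemberVector {
  embedding: number[];
  relevance: number;
}

// Relevance-weighted centroid of the member embeddings, L2-normalized.
// Assumption: this is only one candidate method for deriving group_embedding.
function groupCentroid(members: MemberVector[]): number[] {
  const first = members[0];
  if (first === undefined) {
    throw new Error("a group needs at least one member");
  }
  const sum: number[] = first.embedding.map(() => 0);
  for (const member of members) {
    if (member.embedding.length !== sum.length) {
      throw new Error("embedding dimensions differ");
    }
    member.embedding.forEach((value, i) => {
      sum[i] = (sum[i] ?? 0) + value * member.relevance;
    });
  }
  const norm = Math.sqrt(sum.reduce((acc, v) => acc + v * v, 0));
  // If every weight is zero there is no direction to normalize; return zeros.
  return norm === 0 ? sum : sum.map((v) => v / norm);
}
```

Weighting by relevance keeps loosely related members from pulling the group vector away from its main topic.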
2.2 Identity Anchoring (YP3000 Integration)
Every unit and group tracks WHO is involved:
| Role | Description |
|---|---|
| author | Who created/said this |
| subject | Who this is about |
| mentioned | Names that appear |
| recipient | Who it was sent to |
| witness | Present but not primary |
Benefits:
- "Show me everything about Sarah" works across all sources
- Relationship discovery: "Who does John interact with most?"
- Contact confidence: Multiple sources = higher trust
- Helps YP3000 validate identities through corroboration
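The relationship-discovery question above ("Who does John interact with most?") could be approximated by counting how often other identities are mentioned on the same units as the target. This is a rough sketch under that assumption; `IdentityMention` and `coMentionCounts` are hypothetical names, and co-mention is only a proxy for real interaction.

```typescript
// Hypothetical, trimmed-down view of an identity mention row.
interface IdentityMention {
  unitId: string;
  ypId: string;
}

// Counts, per other identity, the number of units shared with the target.
// Returns [ypId, sharedUnitCount] pairs, most frequent first.
function coMentionCounts(
  mentions: IdentityMention[],
  targetYpId: string
): Array<[string, number]> {
  const unitsWithTarget: Record<string, boolean> = {};
  for (const m of mentions) {
    if (m.ypId === targetYpId) unitsWithTarget[m.unitId] = true;
  }
  const counts: Record<string, number> = {};
  const counted: Record<string, boolean> = {};
  for (const m of mentions) {
    if (m.ypId === targetYpId || !unitsWithTarget[m.unitId]) continue;
    // A person appearing in several roles on one unit counts once.
    const pairKey = m.unitId + "|" + m.ypId;
    if (counted[pairKey]) continue;
    counted[pairKey] = true;
    counts[m.ypId] = (counts[m.ypId] ?? 0) + 1;
  }
  return Object.keys(counts)
    .map((id): [string, number] => [id, counts[id] ?? 0])
    .sort((a, b) => b[1] - a[1]);
}
```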
2.3 Temporal Re-processing
The Problem: Content processed at ingestion time lacks future context.
- Questions don't have answers yet
- Incidents don't have follow-up shift notes yet
- Conversations are incomplete
The Solution: Two-pass processing with delayed validation.
| Pass | Timing | Purpose |
|---|---|---|
| Pass 1 | Immediate | Best-effort extraction, create unit |
| Pass 2 | 24-48 hours later | Re-process with expanded context, link Q&A, validate |
2.4 Bi-directional Linking
Not all content types have symmetric relationships:
| Content A | Content B | Direction | Behavior |
|---|---|---|---|
| Incident Report | Shift Notes | A searches for B | Incident keeps looking until shift notes exist |
| Shift Notes | Incident Report | B finds A | Shift notes don't require incident; bonus if found |
| Question | Answer | A searches for B | Questions re-queue until resolved or timeout |
| Answer | Question | B finds A | Answers can exist without explicit question |
Implementation: Units get resolution_status and reprocess_priority:
- Unanswered questions: open + high priority
- Incidents without context: pending_context + medium priority
- Resolved content: complete + no priority
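The status and priority assignment above can be sketched as a small pure function. The numeric priorities (8 for open questions, 5 for incidents awaiting context) are taken from sections 5.2 and 5.3 of this document; the `UnitFlags` shape and the function name are hypothetical.

```typescript
type ResolutionStatus = "open" | "pending_context" | "complete";

// Hypothetical flags that Pass 1 extraction would set on a unit.
interface UnitFlags {
  containsQuestion: boolean;
  hasAnswer: boolean;
  isIncident: boolean;
  hasSupportingNotes: boolean;
}

interface ReprocessState {
  resolutionStatus: ResolutionStatus;
  reprocessPriority: number;
}

// Maps a unit's flags to its resolution status and re-processing priority.
function initialReprocessState(unit: UnitFlags): ReprocessState {
  if (unit.containsQuestion && !unit.hasAnswer) {
    return { resolutionStatus: "open", reprocessPriority: 8 };
  }
  if (unit.isIncident && !unit.hasSupportingNotes) {
    return { resolutionStatus: "pending_context", reprocessPriority: 5 };
  }
  // Nothing is missing, so the unit leaves the re-processing queue.
  return { resolutionStatus: "complete", reprocessPriority: 0 };
}
```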
3. Enhanced Schema
3.1 Linked Groups Table
CREATE TABLE embeddings.linked_groups (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
-- Classification
group_type text NOT NULL, -- 'conversation', 'incident_context', 'qa_pair', etc.
processing_profile text, -- Source-specific extraction strategy
-- Human-readable summary (LLM-generated)
title text, -- "Christmas lunch planning at Picton SIL"
summary text, -- Longer description
-- The "searchable unit" embedding - represents the WHOLE group
group_embedding vector(1536),
embedding_model text DEFAULT 'text-embedding-3-small',
-- Identity Anchors (YP3000 integration)
yp_ids uuid[], -- All identities involved
primary_yp_id uuid, -- Main subject (if applicable)
-- Location/Context Anchors
location_ids uuid[], -- Linked locations
-- Temporal Anchors
event_date date, -- When this HAPPENED (not when recorded)
event_date_end date, -- For ranges
earliest_content_at timestamptz,
latest_content_at timestamptz,
-- Topics/Domains
domain_tags text[], -- hr, incidents, roster, etc.
topic_tags text[], -- meal_planning, vehicle, participant_care
-- Quality & Completeness
coherence_score float, -- How well do these units relate? (0-1)
completeness text, -- 'complete', 'partial', 'evolving'
resolution_status text, -- 'open', 'resolved', 'superseded'
-- Re-processing
reprocess_priority int DEFAULT 0, -- Higher = sooner
last_reprocessed_at timestamptz,
-- Metadata
created_at timestamptz DEFAULT NOW(),
updated_at timestamptz DEFAULT NOW()
);
3.2 Group Members Table (Unit-to-Group Links)
CREATE TABLE embeddings.group_members (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
group_id uuid REFERENCES embeddings.linked_groups(id) ON DELETE CASCADE,
unit_id uuid REFERENCES embeddings.units(id) ON DELETE CASCADE,
-- Position/Role in the group
sequence_order int, -- 1st message, 2nd message, etc.
member_role text, -- 'initiator', 'response', 'resolution', 'context'
-- Source classification
source_type text, -- 'primary' (the incident) vs 'supporting' (shift notes)
relevance_score float, -- How relevant is this unit to the group? (0-1)
created_at timestamptz DEFAULT NOW(),
UNIQUE(group_id, unit_id)
);
3.3 Identity Mentions Table (YP3000 Links)
CREATE TABLE embeddings.identity_mentions (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
-- What this mention is attached to
unit_id uuid REFERENCES embeddings.units(id) ON DELETE CASCADE,
group_id uuid REFERENCES embeddings.linked_groups(id) ON DELETE SET NULL,
-- YP3000 Identity
yp_id uuid NOT NULL, -- References core_source.yp3000_identities
-- Role in this content
mention_type text NOT NULL, -- 'author', 'subject', 'mentioned', 'recipient', 'witness'
-- Confidence
confidence float DEFAULT 0.8,
match_reasoning text, -- Why we think this is this person
-- Context
context_snippet text, -- The relevant text mentioning them
created_at timestamptz DEFAULT NOW(),
UNIQUE(unit_id, yp_id, mention_type)
);
3.4 Enhanced Units Table
Add columns to existing embeddings.units:
ALTER TABLE embeddings.units
-- Each column needs its own ADD COLUMN clause in PostgreSQL
-- Processing profile (determines extraction strategy)
ADD COLUMN IF NOT EXISTS processing_profile text, -- 'discord_chat', 'incident_report', 'shift_note', etc.
-- Direct anchors
ADD COLUMN IF NOT EXISTS location_ids uuid[], -- Linked locations
ADD COLUMN IF NOT EXISTS event_date date, -- When this HAPPENED
-- Sentiment/Tone
ADD COLUMN IF NOT EXISTS sentiment_score float, -- -1 (negative) to 1 (positive)
ADD COLUMN IF NOT EXISTS urgency_level int, -- 1-5
-- Content classification
ADD COLUMN IF NOT EXISTS contains_question boolean DEFAULT false,
ADD COLUMN IF NOT EXISTS contains_action_item boolean DEFAULT false,
-- Resolution tracking
ADD COLUMN IF NOT EXISTS resolution_status text, -- 'open', 'resolved', 'superseded'
ADD COLUMN IF NOT EXISTS resolved_by_unit_id uuid, -- Links question to answer
-- Re-processing
ADD COLUMN IF NOT EXISTS reprocess_priority int DEFAULT 0,
ADD COLUMN IF NOT EXISTS reprocessed_at timestamptz,
ADD COLUMN IF NOT EXISTS reprocess_reason text; -- 'temporal', 'validation', 'context_expansion'
3.5 Indexes
-- Linked Groups: Vector search
CREATE INDEX idx_linked_groups_embedding
ON embeddings.linked_groups USING ivfflat (group_embedding vector_cosine_ops)
WITH (lists = 100);
-- Linked Groups: Filtering
CREATE INDEX idx_linked_groups_location ON embeddings.linked_groups USING gin (location_ids);
CREATE INDEX idx_linked_groups_yp ON embeddings.linked_groups USING gin (yp_ids);
CREATE INDEX idx_linked_groups_event_date ON embeddings.linked_groups (event_date);
CREATE INDEX idx_linked_groups_type ON embeddings.linked_groups (group_type);
CREATE INDEX idx_linked_groups_reprocess ON embeddings.linked_groups (reprocess_priority DESC)
WHERE reprocess_priority > 0;
-- Identity Mentions: Person lookups
CREATE INDEX idx_identity_mentions_yp ON embeddings.identity_mentions (yp_id);
CREATE INDEX idx_identity_mentions_unit ON embeddings.identity_mentions (unit_id);
-- Units: Re-processing queue
CREATE INDEX idx_units_reprocess ON embeddings.units (reprocess_priority DESC)
WHERE reprocess_priority > 0;
CREATE INDEX idx_units_questions ON embeddings.units (created_at DESC)
WHERE contains_question = true AND resolution_status = 'open';
4. Processing Profiles
Different content types need tailored extraction:
| Profile | Sources | Chunking Strategy | Special Handling |
|---|---|---|---|
| discord_chat | Discord messages | Thread-aware, speaker-grouped | Q&A pairing, emoji context |
| incident_report | Incident forms | Keep narrative whole, section headers | Entity extraction critical |
| shift_note | Daily summaries | Participant-focused sections | Link to incidents by date/location |
| resource_doc | Policies, procedures | Semantic sections, hierarchy | Authority level, version tracking |
| phone_call | Dialpad transcripts | Speaker turns, action items | Tone detection, urgency |
| sms_message | Twilio/TextMagic | Conversation threading | Very brief, context-dependent |
| email_thread | Inbound/outbound | Quote stripping, signature removal | CC/BCC = mentioned identities |
5. Group Detection Logic
5.1 Conversation Groups (Discord)
Trigger: Messages in same channel within time window
Logic:
- Cluster messages by channel_id
- Group by time gaps (<5 min = same conversation)
- Detect topic shifts via embedding distance
- Create group with synthesized summary
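The first two steps above (cluster by channel, then split on time gaps) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `ChatMessage` and `groupConversations` are hypothetical names, and topic-shift detection and summary generation are left out.

```typescript
// Hypothetical minimal message shape for grouping purposes.
interface ChatMessage {
  id: string;
  channelId: string;
  sentAt: Date;
}

// Gaps under 5 minutes keep messages in the same conversation.
const MAX_GAP_MS = 5 * 60 * 1000;

function groupConversations(messages: ChatMessage[]): ChatMessage[][] {
  // Step 1: cluster messages by channel.
  const byChannel: Record<string, ChatMessage[]> = {};
  for (const msg of messages) {
    const bucket = byChannel[msg.channelId] ?? [];
    bucket.push(msg);
    byChannel[msg.channelId] = bucket;
  }
  // Step 2: within each channel, start a new group when the gap is 5 min or more.
  const groups: ChatMessage[][] = [];
  for (const channelId of Object.keys(byChannel)) {
    const bucket = (byChannel[channelId] ?? [])
      .slice()
      .sort((a, b) => a.sentAt.getTime() - b.sentAt.getTime());
    let current: ChatMessage[] = [];
    for (const msg of bucket) {
      const prev = current[current.length - 1];
      if (prev !== undefined && msg.sentAt.getTime() - prev.sentAt.getTime() >= MAX_GAP_MS) {
        groups.push(current);
        current = [];
      }
      current.push(msg);
    }
    if (current.length > 0) groups.push(current);
  }
  return groups;
}
```

A fixed gap is a blunt instrument; the embedding-distance check in step three is what would separate two topics discussed back to back.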
5.2 Incident Context Groups
Trigger: Incident report created
Logic:
- Find incident location + date
- Search for shift notes at same location ±1 day
- Search for staff mentioned in incident
- Search for participant-related content
- Create group with incident as primary, notes as supporting
Bi-directional handling:
- Incident without notes: Set reprocess_priority = 5, completeness = 'partial'
- Re-process daily until notes found or 7 days elapsed
- Shift notes arriving later trigger incident re-evaluation
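The location and date matching, plus the retry rule, might look like the sketch below. It covers only the shift-note lookup (same location, ±1 day) and the 7-day cutoff. The type and function names are hypothetical, and marking an incident 'complete' once any note is found is an assumption the document does not state.

```typescript
// Hypothetical minimal shape; eventDate is an ISO date such as "2026-01-15".
interface DatedUnit {
  id: string;
  locationId: string;
  eventDate: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Shift notes at the incident's location within ±windowDays of its date.
function findSupportingNotes(
  incident: DatedUnit,
  notes: DatedUnit[],
  windowDays = 1
): DatedUnit[] {
  const incidentDay = Date.parse(incident.eventDate + "T00:00:00Z");
  return notes.filter((note) => {
    if (note.locationId !== incident.locationId) return false;
    const noteDay = Date.parse(note.eventDate + "T00:00:00Z");
    return Math.abs(noteDay - incidentDay) <= windowDays * DAY_MS;
  });
}

// Retry rule: keep priority 5 until notes are found or 7 days have elapsed.
function incidentReprocessState(supportingCount: number, daysSinceIncident: number) {
  if (supportingCount > 0) {
    // Assumption: any supporting note is enough to stop retrying.
    return { completeness: "complete", reprocessPriority: 0 };
  }
  if (daysSinceIncident >= 7) {
    return { completeness: "partial", reprocessPriority: 0 };
  }
  return { completeness: "partial", reprocessPriority: 5 };
}
```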
5.3 Q&A Pair Detection
Trigger: Unit marked contains_question = true
Logic:
- Search subsequent units in same context (thread, location, topic)
- Look for response patterns ("found it", "yes", "the answer is")
- LLM verification: "Does this answer that question?"
- Link via resolved_by_unit_id
Priority handling:
- Unanswered questions: reprocess_priority = 8 (high)
- Re-check at 6h, 24h, 48h, 7d
- After 7d with no answer: Mark resolution_status = 'timeout'
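The re-check schedule above can be expressed as a lookup over fixed offsets from the time the question was asked. This is a sketch with hypothetical names (`nextQuestionCheck`, `QuestionCheck`); it assumes the function is called right after a check that found no answer.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Re-check offsets from the document: 6h, 24h, 48h, 7d (168h).
const RECHECK_OFFSETS_HOURS = [6, 24, 48, 168];

type QuestionCheck = { kind: "recheck"; at: Date } | { kind: "timeout" };

// Returns the next scheduled re-check, or a timeout once 7 days have passed.
function nextQuestionCheck(askedAt: Date, now: Date): QuestionCheck {
  const elapsedHours = (now.getTime() - askedAt.getTime()) / HOUR_MS;
  for (const offset of RECHECK_OFFSETS_HOURS) {
    if (elapsedHours < offset) {
      return { kind: "recheck", at: new Date(askedAt.getTime() + offset * HOUR_MS) };
    }
  }
  // Past the final offset: the caller would set resolution_status = 'timeout'.
  return { kind: "timeout" };
}
```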
6. Search Architecture
6.1 Query Flow
User Query: "What happened with pumpkin at Christmas?"
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 1: QUERY ANALYSIS │
│ - Extract intent: information retrieval │
│ - Entities: none explicit │
│ - Time hints: "Christmas" → December 2025 │
│ - Topics: food, meal planning │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 2: SEARCH linked_groups (PRIMARY) │
│ SELECT * FROM linked_groups │
│ WHERE group_embedding <=> query_embedding < 0.5 │
│ AND (event_date BETWEEN '2025-12-01' AND '2025-12-31' │
│ OR 'christmas' = ANY(topic_tags)) │
│ ORDER BY group_embedding <=> query_embedding LIMIT 10 │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 3: EXPAND GROUP RESULTS │
│ For each group: │
│ - Fetch all group_members with their units │
│ - Fetch identity_mentions for participant names │
│ - Fetch original source content │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 4: FALLBACK TO UNITS (SECONDARY) │
│ If <3 group results, also search units directly │
│ Return as individual results with lower ranking │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 5: RETURN RICH RESULTS │
│ [ │
│ { │
│ type: 'conversation', │
│ title: 'Christmas lunch planning at Picton SIL', │
│ participants: ['Sarah', 'Mike', 'Jane'], │
│ location: 'Picton SIL', │
│ date: '2025-12-10', │
│ messages: [ ...9 messages in order... ], │
│ relevance: 0.89 │
│ } │
│ ] │
└─────────────────────────────────────────────────────────────┘
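Step 4 of the flow above (fall back to units when fewer than 3 groups match) can be sketched as a merge function. The 0.8 down-weighting factor is an invented placeholder for "lower ranking", and `SearchHit` and `mergeResults` are hypothetical names; a real implementation would run the unit search asynchronously against the database.

```typescript
// Hypothetical minimal hit shape shared by group and unit results.
interface SearchHit {
  id: string;
  kind: "group" | "unit";
  similarity: number;
}

const MIN_GROUP_HITS = 3;
// Assumption: unit hits are down-weighted by a fixed factor to rank lower.
const UNIT_PENALTY = 0.8;

// searchUnits is only invoked when the group search came back thin.
function mergeResults(groupHits: SearchHit[], searchUnits: () => SearchHit[]): SearchHit[] {
  const ranked = groupHits.slice();
  if (groupHits.length < MIN_GROUP_HITS) {
    for (const hit of searchUnits()) {
      ranked.push({ ...hit, similarity: hit.similarity * UNIT_PENALTY });
    }
  }
  return ranked.sort((a, b) => b.similarity - a.similarity);
}
```

Passing the unit search as a callback keeps the secondary query from running when the primary search already returned enough groups.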
6.2 Cross-Source Unified Search
The header search bar returns mixed results from ALL sources:
// Grouped response format
{
query: "vehicle inspection",
results: {
conversations: [
{ title: "Van brake issue discussion", source: "discord", count: 7 }
],
incidents: [
{ title: "Minor vehicle damage at Picton", date: "2026-01-15" }
],
resources: [
{ title: "Vehicle Inspection Checklist", type: "procedure" }
],
shift_notes: [
{ title: "Noted van making noise", staff: "John Smith" }
]
},
total_hits: 12
}
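One way to build the grouped response above from a flat list of hits is a simple bucketing pass. The sketch below assumes each hit already carries its result category; `FlatHit` and `groupBySource` are hypothetical names, and the four buckets are the ones shown in the example.

```typescript
type SourceBucket = "conversations" | "incidents" | "resources" | "shift_notes";

// Hypothetical flat hit as it might come out of ranking.
interface FlatHit {
  title: string;
  bucket: SourceBucket;
}

interface GroupedResponse {
  query: string;
  results: Record<SourceBucket, FlatHit[]>;
  total_hits: number;
}

// Buckets hits by source category, preserving their ranked order per bucket.
function groupBySource(query: string, hits: FlatHit[]): GroupedResponse {
  const results: Record<SourceBucket, FlatHit[]> = {
    conversations: [],
    incidents: [],
    resources: [],
    shift_notes: [],
  };
  for (const hit of hits) {
    results[hit.bucket].push(hit);
  }
  return { query, results, total_hits: hits.length };
}
```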
7. Temporal Re-processing System
7.1 Re-processing Queue
Units and groups with reprocess_priority > 0 are queued:
| Priority | Reason | Timing |
|---|---|---|
| 10 | Critical (incident without context) | Every 6 hours |
| 8 | Unanswered question | 6h, 24h, 48h, 7d |
| 5 | Partial context (incident with some notes) | Daily |
| 3 | Validation pass (all content >24h old) | Once at 48h |
| 1 | Periodic refresh (active conversations) | Weekly |
7.2 Re-processing Logic
┌─────────────────────────────────────────────────────────────┐
│ CRON: Every hour during quiet hours (2 AM - 6 AM) │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ 1. Fetch units/groups with reprocess_priority > 0 │
│ 2. Order by priority DESC, last_reprocessed_at ASC │
│ 3. Process batch (50 items max per run) │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ For each item: │
│ - Re-run extraction with expanded context window │
│ - Check for Q&A resolution │
│ - Look for new linked content │
│ - Update confidence scores │
│ - Either: resolve (priority=0) OR reschedule │
└─────────────────────────────────────────────────────────────┘
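The fetch, order, and batch steps above can be sketched in memory as follows. The names are hypothetical, and treating never-reprocessed items (null timestamp) as oldest is an assumption; in PostgreSQL the equivalent ordering would need NULLS FIRST, since ascending sorts place NULLs last by default.

```typescript
// Hypothetical queue row covering both units and groups.
interface QueueItem {
  id: string;
  reprocessPriority: number;
  lastReprocessedAt: Date | null;
}

const BATCH_SIZE = 50;

// Highest priority first; within a priority, least recently re-processed first.
function nextBatch(items: QueueItem[], batchSize: number = BATCH_SIZE): QueueItem[] {
  return items
    .filter((item) => item.reprocessPriority > 0)
    .sort((a, b) => {
      if (a.reprocessPriority !== b.reprocessPriority) {
        return b.reprocessPriority - a.reprocessPriority;
      }
      // Assumption: items never re-processed sort ahead of all others.
      const aTime = a.lastReprocessedAt === null ? 0 : a.lastReprocessedAt.getTime();
      const bTime = b.lastReprocessedAt === null ? 0 : b.lastReprocessedAt.getTime();
      return aTime - bTime;
    })
    .slice(0, batchSize);
}
```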
8. YP3000 Feedback Loop
8.1 Embeddings → YP3000
When processing content, we extract identity mentions and feed them to YP3000:
// During extraction, found mention of "Sarah Jones"
await yp3000.recordSighting({
raw_name: "Sarah Jones",
phone: null, // Not mentioned
email: null,
context: "Mentioned in shift note as participant",
source_type: "embedding_extraction",
source_id: unit.id,
confidence: 75
});
YP3000 uses these sightings to:
- Discover new identities
- Reinforce existing identity confidence
- Detect conflicts (different details for same person)
8.2 YP3000 → Embeddings
When YP3000 resolves or merges identities:
- Update identity_mentions.yp_id to canonical ID
- Recalculate linked_groups.yp_ids arrays
- Flag affected groups for re-indexing
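The three merge-handling steps above could be sketched in memory as shown below. This assumes yp_ids can be derived entirely from identity mentions, which the document does not confirm; `Mention` and `applyIdentityMerge` are hypothetical names. Rewritten rows are de-duplicated because two mentions can collapse onto the same (unit_id, yp_id, mention_type) key after a merge, which the unique constraint would reject.

```typescript
// Hypothetical in-memory view of an identity_mentions row.
interface Mention {
  unitId: string;
  groupId: string | null;
  ypId: string;
  mentionType: string;
}

interface MergeOutcome {
  mentions: Mention[];
  groupYpIds: Record<string, string[]>; // recalculated yp_ids per group
  affectedGroupIds: string[]; // groups to flag for re-indexing
}

function applyIdentityMerge(
  mentions: Mention[],
  mergedId: string,
  canonicalId: string
): MergeOutcome {
  const seen: Record<string, boolean> = {};
  const affected: Record<string, boolean> = {};
  const rewritten: Mention[] = [];
  for (const m of mentions) {
    const changed = m.ypId === mergedId;
    const next: Mention = changed ? { ...m, ypId: canonicalId } : m;
    if (changed && m.groupId !== null) affected[m.groupId] = true;
    // Drop rows that would violate UNIQUE(unit_id, yp_id, mention_type).
    const key = [next.unitId, next.ypId, next.mentionType].join("|");
    if (seen[key]) continue;
    seen[key] = true;
    rewritten.push(next);
  }
  const groupYpIds: Record<string, string[]> = {};
  for (const m of rewritten) {
    if (m.groupId === null) continue;
    const ids = groupYpIds[m.groupId] ?? [];
    if (ids.indexOf(m.ypId) === -1) ids.push(m.ypId);
    groupYpIds[m.groupId] = ids;
  }
  return { mentions: rewritten, groupYpIds, affectedGroupIds: Object.keys(affected) };
}
```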
9. Implementation Phases
Phase 1: Schema & Foundation
- Create linked_groups table
- Create group_members table
- Create identity_mentions table
- Add new columns to embeddings.units
- Create indexes
Phase 2: Processing Enhancements
- Add processing_profile to extraction pipeline
- Implement incident-specific extraction
- Implement shift note extraction
- Add identity extraction (link to YP3000)
- Add location extraction
Phase 3: Group Detection
- Conversation grouping (Discord threads)
- Incident context grouping (incident + shift notes)
- Q&A pair detection
- Group embedding generation
Phase 4: Temporal Re-processing
- Build re-processing queue handler
- Implement Q&A resolution detection
- Implement incident context expansion
- Add validation pass logic
Phase 5: Cross-Source Search
- Build unified search endpoint
- Implement header search bar
- Add source-type grouping
- Add rich result formatting
Phase 6: YP3000 Integration
- Add identity extraction to pipeline
- Build sighting feed to YP3000
- Handle identity merge events
- Build "everything about person X" query
10. Success Criteria
- Context Quality: Searches return complete conversations, not fragments
- Q&A Resolution: 80%+ of questions linked to answers within 48h (when answer exists)
- Incident Context: Incidents automatically link to related shift notes
- Identity Coverage: 90%+ of content has identity anchors
- Search Relevance: Cross-source search returns expected results in top 3
- Processing Latency: New content searchable within 5 minutes, re-processed within 48h
11. Related Documents
- Embedding & Semantic Search - Base embedding architecture
- YP3000 Identity System - Identity resolution system
- Data Storage & RAG Design - Core RAG concepts
Changelog
| Date | Change | Author |
|---|---|---|
| 2026-01-24 | Initial document - design complete | Droid |