
Embedding V2 Linked Context Architecture

Created: 2026-01-24
Status: Design Complete - Awaiting Implementation
Related: 27_Embedding&Semantic_Search_Strategy.md, 16_YP3000_Identity_System.md

Executive Summary

This document extends the embedding architecture (Doc 27) with Linked Groups, Identity Anchoring, and Temporal Re-processing. The goal is to transform isolated semantic units into rich, connected knowledge that surfaces complete context rather than fragmented mentions.

Core insight: A search for "roast pumpkin" should return the 9-message meal planning conversation, not a single "yes pumpkin please" message.


1. The Problem with Isolated Units

The current system extracts atomic units. That is great for precision, but it loses vital context:

| Scenario | Current Result | Desired Result |
| --- | --- | --- |
| Search "pumpkin" | Single message: "yes pumpkin please" | Full conversation: 9-message meal planning discussion |
| Search incident | Incident report only | Incident + related shift notes from staff who were there |
| Search "John" | Scattered mentions | All context about John, grouped by topic/time |
| Search question | The question (unanswered) | Question + answer (if found later) |

We need a layer above units that groups related content into searchable contexts.


2. New Concepts

2.1 Linked Groups

A Linked Group is a collection of units that form a coherent searchable context:

| Group Type | Description | Example |
| --- | --- | --- |
| conversation | Multi-message discussion | Discord meal planning thread |
| incident_context | Incident + supporting content | Incident report + 4 staff shift notes |
| qa_pair | Question and its resolution | "Has anyone seen the keys?" + "Found them in the van" |
| decision_thread | Discussion leading to decision | Budget approval conversation |
| topic_cluster | Auto-detected similar content | All mentions of vehicle maintenance |
| daily_context | Everything at location on date | All activity at Picton SIL on Tuesday |

Key: The group gets its own embedding representing the WHOLE context. This becomes the primary search target.
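
As a rough illustration of a group-level embedding input, the sketch below assembles one string from the group title and its members in sequence order. How the group text is actually composed is not specified in this document, so `GroupMemberText` and `buildGroupEmbeddingInput` are hypothetical names, and concatenating "author: text" lines is an assumption. The call to the embedding model is deliberately left out.

```typescript
// Hypothetical shape: fields loosely mirror group_members, but are assumptions.
interface GroupMemberText {
  sequenceOrder: number; // position in the group (1 = first message)
  author: string;
  text: string;
}

// Builds the single string that would be embedded to represent the WHOLE group.
function buildGroupEmbeddingInput(title: string, members: GroupMemberText[]): string {
  // Copy before sorting so the caller's array is not mutated.
  const ordered = members.slice().sort((a, b) => a.sequenceOrder - b.sequenceOrder);
  const lines = ordered.map((m) => `${m.author}: ${m.text}`);
  return [title].concat(lines).join("\n");
}
```

With this shape, a short reply such as "yes pumpkin please" is embedded alongside the messages that give it meaning, rather than on its own.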

2.2 Identity Anchoring (YP3000 Integration)

Every unit and group tracks WHO is involved:

| Role | Description |
| --- | --- |
| author | Who created/said this |
| subject | Who this is about |
| mentioned | Names that appear |
| recipient | Who it was sent to |
| witness | Present but not primary |

Benefits:

  • "Show me everything about Sarah" works across all sources
  • Relationship discovery: "Who does John interact with most?"
  • Contact confidence: Multiple sources = higher trust
  • Helps YP3000 validate identities through corroboration
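
To make the "everything about Sarah" lookup concrete, here is a minimal in-memory sketch that collects unit IDs for one identity, bucketed by role. The `IdentityMention` shape and the `unitsByRoleFor` helper are hypothetical; a real implementation would presumably query the identity mentions table instead.

```typescript
type MentionType = "author" | "subject" | "mentioned" | "recipient" | "witness";

// Hypothetical in-memory view of one identity mention row.
interface IdentityMention {
  unitId: string;
  ypId: string;
  mentionType: MentionType;
}

// Returns the unit IDs that involve the given identity, grouped by role.
function unitsByRoleFor(
  ypId: string,
  mentions: IdentityMention[],
): Partial<Record<MentionType, string[]>> {
  const byRole: Partial<Record<MentionType, string[]>> = {};
  for (const m of mentions) {
    if (m.ypId !== ypId) continue; // skip other identities
    const units = byRole[m.mentionType] ?? [];
    units.push(m.unitId);
    byRole[m.mentionType] = units;
  }
  return byRole;
}
```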

2.3 Temporal Re-processing

The Problem: Content processed at ingestion time lacks future context.

  • Questions don't have answers yet
  • Incidents don't have follow-up shift notes yet
  • Conversations are incomplete

The Solution: Two-pass processing with delayed validation.

| Pass | Timing | Purpose |
| --- | --- | --- |
| Pass 1 | Immediate | Best-effort extraction, create unit |
| Pass 2 | 24-48 hours later | Re-process with expanded context, link Q&A, validate |

2.4 Bi-directional Linking

Not all content types have symmetric relationships:

| Content A | Content B | Direction | Behavior |
| --- | --- | --- | --- |
| Incident Report | Shift Notes | A searches for B | Incident keeps looking until shift notes exist |
| Shift Notes | Incident Report | B finds A | Shift notes don't require incident; bonus if found |
| Question | Answer | A searches for B | Questions re-queue until resolved or timeout |
| Answer | Question | B finds A | Answers can exist without explicit question |

Implementation: Units get resolution_status and reprocess_priority:

  • Unanswered questions: open + high priority
  • Incidents without context: pending_context + medium priority
  • Resolved content: complete + no priority
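
The three cases above can be read as a small classification function. The sketch below is one possible reading, not the pipeline's actual logic: `UnitState` and `classifyUnit` are hypothetical names, and priorities are kept as labels because this section does not assign numbers to them.

```typescript
type ResolutionStatus = "open" | "pending_context" | "complete";
type PriorityLabel = "high" | "medium" | "none";

// Hypothetical summary of what Pass 1 knows about a unit.
interface UnitState {
  containsQuestion: boolean;
  answered: boolean;
  isIncident: boolean;
  hasShiftNotes: boolean;
}

function classifyUnit(u: UnitState): { status: ResolutionStatus; priority: PriorityLabel } {
  // Unanswered questions keep searching for an answer.
  if (u.containsQuestion && !u.answered) return { status: "open", priority: "high" };
  // Incidents keep searching until supporting shift notes exist.
  if (u.isIncident && !u.hasShiftNotes) return { status: "pending_context", priority: "medium" };
  // Everything else needs no further re-processing.
  return { status: "complete", priority: "none" };
}
```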

3. Enhanced Schema

3.1 Linked Groups Table

CREATE TABLE embeddings.linked_groups (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),

  -- Classification
  group_type text NOT NULL, -- 'conversation', 'incident_context', 'qa_pair', etc.
  processing_profile text, -- Source-specific extraction strategy

  -- Human-readable summary (LLM-generated)
  title text, -- "Christmas lunch planning at Picton SIL"
  summary text, -- Longer description

  -- The "searchable unit" embedding - represents the WHOLE group
  group_embedding vector(1536),
  embedding_model text DEFAULT 'text-embedding-3-small',

  -- Identity Anchors (YP3000 integration)
  yp_ids uuid[], -- All identities involved
  primary_yp_id uuid, -- Main subject (if applicable)

  -- Location/Context Anchors
  location_ids uuid[], -- Linked locations

  -- Temporal Anchors
  event_date date, -- When this HAPPENED (not when recorded)
  event_date_end date, -- For ranges
  earliest_content_at timestamptz,
  latest_content_at timestamptz,

  -- Topics/Domains
  domain_tags text[], -- hr, incidents, roster, etc.
  topic_tags text[], -- meal_planning, vehicle, participant_care

  -- Quality & Completeness
  coherence_score float, -- How well do these units relate? (0-1)
  completeness text, -- 'complete', 'partial', 'evolving'
  resolution_status text, -- 'open', 'resolved', 'superseded'

  -- Re-processing
  reprocess_priority int DEFAULT 0, -- Higher = sooner
  last_reprocessed_at timestamptz,

  -- Metadata
  created_at timestamptz DEFAULT NOW(),
  updated_at timestamptz DEFAULT NOW()
);

3.2 Group Members Table

CREATE TABLE embeddings.group_members (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  group_id uuid REFERENCES embeddings.linked_groups(id) ON DELETE CASCADE,
  unit_id uuid REFERENCES embeddings.units(id) ON DELETE CASCADE,

  -- Position/Role in the group
  sequence_order int, -- 1st message, 2nd message, etc.
  member_role text, -- 'initiator', 'response', 'resolution', 'context'

  -- Source classification
  source_type text, -- 'primary' (the incident) vs 'supporting' (shift notes)
  relevance_score float, -- How relevant is this unit to the group? (0-1)

  created_at timestamptz DEFAULT NOW(),

  UNIQUE(group_id, unit_id)
);

3.3 Identity Mentions Table

CREATE TABLE embeddings.identity_mentions (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),

  -- What this mention is attached to
  unit_id uuid REFERENCES embeddings.units(id) ON DELETE CASCADE,
  group_id uuid REFERENCES embeddings.linked_groups(id) ON DELETE SET NULL,

  -- YP3000 Identity
  yp_id uuid NOT NULL, -- References core_source.yp3000_identities

  -- Role in this content
  mention_type text NOT NULL, -- 'author', 'subject', 'mentioned', 'recipient', 'witness'

  -- Confidence
  confidence float DEFAULT 0.8,
  match_reasoning text, -- Why we think this is this person

  -- Context
  context_snippet text, -- The relevant text mentioning them

  created_at timestamptz DEFAULT NOW(),

  UNIQUE(unit_id, yp_id, mention_type)
);

3.4 Enhanced Units Table

Add columns to existing embeddings.units:

-- Each column needs its own ADD COLUMN clause.
ALTER TABLE embeddings.units
  -- Processing profile (determines extraction strategy)
  ADD COLUMN IF NOT EXISTS processing_profile text, -- 'discord_chat', 'incident_report', 'shift_note', etc.

  -- Direct anchors
  ADD COLUMN IF NOT EXISTS location_ids uuid[], -- Linked locations
  ADD COLUMN IF NOT EXISTS event_date date, -- When this HAPPENED

  -- Sentiment/Tone
  ADD COLUMN IF NOT EXISTS sentiment_score float, -- -1 (negative) to 1 (positive)
  ADD COLUMN IF NOT EXISTS urgency_level int, -- 1-5

  -- Content classification
  ADD COLUMN IF NOT EXISTS contains_question boolean DEFAULT false,
  ADD COLUMN IF NOT EXISTS contains_action_item boolean DEFAULT false,

  -- Resolution tracking
  ADD COLUMN IF NOT EXISTS resolution_status text, -- 'open', 'resolved', 'superseded', 'timeout'
  ADD COLUMN IF NOT EXISTS resolved_by_unit_id uuid, -- Links question to answer

  -- Re-processing
  ADD COLUMN IF NOT EXISTS reprocess_priority int DEFAULT 0,
  ADD COLUMN IF NOT EXISTS reprocessed_at timestamptz,
  ADD COLUMN IF NOT EXISTS reprocess_reason text; -- 'temporal', 'validation', 'context_expansion'

3.5 Indexes

-- Linked Groups: Vector search
CREATE INDEX idx_linked_groups_embedding
  ON embeddings.linked_groups USING ivfflat (group_embedding vector_cosine_ops)
  WITH (lists = 100);

-- Linked Groups: Filtering
CREATE INDEX idx_linked_groups_location ON embeddings.linked_groups USING gin (location_ids);
CREATE INDEX idx_linked_groups_yp ON embeddings.linked_groups USING gin (yp_ids);
CREATE INDEX idx_linked_groups_event_date ON embeddings.linked_groups (event_date);
CREATE INDEX idx_linked_groups_type ON embeddings.linked_groups (group_type);
CREATE INDEX idx_linked_groups_reprocess ON embeddings.linked_groups (reprocess_priority DESC)
  WHERE reprocess_priority > 0;

-- Identity Mentions: Person lookups
CREATE INDEX idx_identity_mentions_yp ON embeddings.identity_mentions (yp_id);
CREATE INDEX idx_identity_mentions_unit ON embeddings.identity_mentions (unit_id);

-- Units: Re-processing queue
CREATE INDEX idx_units_reprocess ON embeddings.units (reprocess_priority DESC)
  WHERE reprocess_priority > 0;
CREATE INDEX idx_units_questions ON embeddings.units (created_at DESC)
  WHERE contains_question = true AND resolution_status = 'open';

4. Processing Profiles

Different content types need tailored extraction:

| Profile | Sources | Chunking Strategy | Special Handling |
| --- | --- | --- | --- |
| discord_chat | Discord messages | Thread-aware, speaker-grouped | Q&A pairing, emoji context |
| incident_report | Incident forms | Keep narrative whole, section headers | Entity extraction critical |
| shift_note | Daily summaries | Participant-focused sections | Link to incidents by date/location |
| resource_doc | Policies, procedures | Semantic sections, hierarchy | Authority level, version tracking |
| phone_call | Dialpad transcripts | Speaker turns, action items | Tone detection, urgency |
| sms_message | Twilio/TextMagic | Conversation threading | Very brief, context-dependent |
| email_thread | Inbound/outbound | Quote stripping, signature removal | CC/BCC = mentioned identities |

5. Group Detection Logic

5.1 Conversation Groups (Discord)

Trigger: Messages in the same channel within a time window
Logic:

  1. Cluster messages by channel_id
  2. Group by time gaps (<5 min = same conversation)
  3. Detect topic shifts via embedding distance
  4. Create group with synthesized summary
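
Steps 1 and 2 above can be sketched as a pure function over messages. This is an illustrative sketch only: `ChatMessage` and `splitIntoConversations` are hypothetical names, timestamps are assumed to be epoch milliseconds, and topic-shift detection (step 3) and summary generation (step 4) are left out.

```typescript
// Hypothetical message shape; sentAt is epoch milliseconds.
interface ChatMessage {
  id: string;
  channelId: string;
  sentAt: number;
}

// A gap under 5 minutes keeps a message in the same conversation.
const CONVERSATION_GAP_MS = 5 * 60 * 1000;

function splitIntoConversations(messages: ChatMessage[]): ChatMessage[][] {
  // Step 1: cluster messages by channel.
  const byChannel: Record<string, ChatMessage[]> = {};
  for (const msg of messages) {
    const bucket = byChannel[msg.channelId] ?? [];
    bucket.push(msg);
    byChannel[msg.channelId] = bucket;
  }

  // Step 2: within each channel, start a new conversation at every long gap.
  const conversations: ChatMessage[][] = [];
  for (const channelId of Object.keys(byChannel)) {
    const ordered = (byChannel[channelId] ?? []).slice().sort((a, b) => a.sentAt - b.sentAt);
    let current: ChatMessage[] = [];
    for (const msg of ordered) {
      const previous = current[current.length - 1];
      if (previous !== undefined && msg.sentAt - previous.sentAt >= CONVERSATION_GAP_MS) {
        conversations.push(current);
        current = [];
      }
      current.push(msg);
    }
    if (current.length > 0) conversations.push(current);
  }
  return conversations;
}
```

Each returned array is a candidate conversation group. An embedding-distance check could then split it further where the topic changes.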

5.2 Incident Context Groups

Trigger: Incident report created
Logic:

  1. Find incident location + date
  2. Search for shift notes at same location ±1 day
  3. Search for staff mentioned in incident
  4. Search for participant-related content
  5. Create group with incident as primary, notes as supporting

Bi-directional handling:

  • Incident without notes: Set reprocess_priority = 5, completeness = 'partial'
  • Re-process daily until notes found or 7 days elapsed
  • Shift notes arriving later trigger incident re-evaluation
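
The retry rule for incidents without notes could look roughly like the sketch below. `IncidentGroupState` and `planIncidentReprocess` are hypothetical names, timestamps are assumed to be epoch milliseconds, and marking the group 'complete' once any note is linked is an assumption rather than something this document states.

```typescript
const INCIDENT_RETRY_WINDOW_DAYS = 7;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical snapshot of an incident context group.
interface IncidentGroupState {
  createdAt: number; // epoch ms
  supportingNoteCount: number; // shift notes linked so far
}

interface IncidentReprocessPlan {
  reprocessPriority: number;
  completeness: "complete" | "partial";
}

function planIncidentReprocess(group: IncidentGroupState, now: number): IncidentReprocessPlan {
  // Notes found: stop re-processing (treating this as 'complete' is an assumption).
  if (group.supportingNoteCount > 0) {
    return { reprocessPriority: 0, completeness: "complete" };
  }
  const ageDays = (now - group.createdAt) / MS_PER_DAY;
  // Still inside the 7-day window: keep the group in the daily queue.
  if (ageDays < INCIDENT_RETRY_WINDOW_DAYS) {
    return { reprocessPriority: 5, completeness: "partial" };
  }
  // Window elapsed without notes: stop retrying, leave the group partial.
  return { reprocessPriority: 0, completeness: "partial" };
}
```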

5.3 Q&A Pair Detection

Trigger: Unit marked contains_question = true
Logic:

  1. Search subsequent units in same context (thread, location, topic)
  2. Look for response patterns ("found it", "yes", "the answer is")
  3. LLM verification: "Does this answer that question?"
  4. Link via resolved_by_unit_id

Priority handling:

  • Unanswered questions: reprocess_priority = 8 (high)
  • Re-check at 6h, 24h, 48h, 7d
  • After 7d with no answer: Mark resolution_status = 'timeout'
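
The re-check schedule can be expressed as a lookup over fixed offsets from the time the question was asked. In this sketch `nextQuestionRecheck` is a hypothetical helper and timestamps are assumed to be epoch milliseconds. It returns the next checkpoint still in the future, or "timeout" once the 7-day checkpoint has passed.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Re-check offsets from the text: 6h, 24h, 48h, 7d (expressed in hours).
const RECHECK_OFFSETS_HOURS = [6, 24, 48, 7 * 24];

function nextQuestionRecheck(askedAt: number, now: number): number | "timeout" {
  for (const hours of RECHECK_OFFSETS_HOURS) {
    const due = askedAt + hours * MS_PER_HOUR;
    if (due > now) return due; // first checkpoint that has not happened yet
  }
  // All checkpoints have passed with no answer linked.
  return "timeout";
}
```

A caller would only consult this for questions that are still unanswered. A resolved question leaves the queue as soon as resolved_by_unit_id is set.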

6. Search Architecture

6.1 Query Flow

User Query: "What happened with pumpkin at Christmas?"
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 1: QUERY ANALYSIS │
│ - Extract intent: information retrieval │
│ - Entities: none explicit │
│ - Time hints: "Christmas" → December 2025 │
│ - Topics: food, meal planning │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 2: SEARCH linked_groups (PRIMARY) │
│ SELECT * FROM embeddings.linked_groups │
│ WHERE group_embedding <=> query_embedding < 0.5 │
│ AND (event_date BETWEEN '2025-12-01' AND '2025-12-31' │
│ OR 'christmas' = ANY(topic_tags)) │
│ ORDER BY group_embedding <=> query_embedding LIMIT 10 │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 3: EXPAND GROUP RESULTS │
│ For each group: │
│ - Fetch all group_members with their units │
│ - Fetch identity_mentions for participant names │
│ - Fetch original source content │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 4: FALLBACK TO UNITS (SECONDARY) │
│ If <3 group results, also search units directly │
│ Return as individual results with lower ranking │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ Step 5: RETURN RICH RESULTS │
│ [ │
│ { │
│ type: 'conversation', │
│ title: 'Christmas lunch planning at Picton SIL', │
│ participants: ['Sarah', 'Mike', 'Jane'], │
│ location: 'Picton SIL', │
│ date: '2025-12-10', │
│ messages: [ ...9 messages in order... ], │
│ relevance: 0.89 │
│ } │
│ ] │
└─────────────────────────────────────────────────────────────┘
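
Step 4 of the flow above (fall back to units when fewer than 3 groups match) might be sketched as follows. `RankedHit`, `mergeSearchResults`, and the 0.8 ranking penalty are all assumptions; the flow only says unit results get "lower ranking" without saying how.

```typescript
// Hypothetical result shape shared by group and unit hits.
interface RankedHit {
  id: string;
  kind: "group" | "unit";
  relevance: number; // 0-1, higher is better
}

const MIN_GROUP_RESULTS = 3;
const UNIT_RANK_PENALTY = 0.8; // assumed down-weighting for bare units

function mergeSearchResults(groupHits: RankedHit[], searchUnits: () => RankedHit[]): RankedHit[] {
  const results = groupHits.slice();
  // Only run the secondary unit search when the group search came back thin.
  if (groupHits.length < MIN_GROUP_RESULTS) {
    for (const hit of searchUnits()) {
      results.push({ ...hit, relevance: hit.relevance * UNIT_RANK_PENALTY });
    }
  }
  return results.sort((a, b) => b.relevance - a.relevance);
}
```

Passing the unit search as a callback means the secondary query is skipped entirely when enough groups were found.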

6.2 Unified Search

The header search bar returns mixed results from ALL sources:

// Grouped response format
{
  query: "vehicle inspection",
  results: {
    conversations: [
      { title: "Van brake issue discussion", source: "discord", count: 7 }
    ],
    incidents: [
      { title: "Minor vehicle damage at Picton", date: "2026-01-15" }
    ],
    resources: [
      { title: "Vehicle Inspection Checklist", type: "procedure" }
    ],
    shift_notes: [
      { title: "Noted van making noise", staff: "John Smith" }
    ]
  },
  total_hits: 12
}
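
One way to produce a response in this grouped shape is to bucket flat hits by source type, as in the sketch below. `SearchHit`, `GroupedResponse`, and `toGroupedResponse` are hypothetical names, only the title field is carried through, and computing total_hits as the number of hits passed in is an assumption.

```typescript
type SourceBucket = "conversations" | "incidents" | "resources" | "shift_notes";

// Hypothetical flat hit as it might come out of the search step.
interface SearchHit {
  bucket: SourceBucket;
  title: string;
}

interface GroupedResponse {
  query: string;
  results: Record<SourceBucket, { title: string }[]>;
  total_hits: number;
}

function toGroupedResponse(query: string, hits: SearchHit[]): GroupedResponse {
  // Start with every bucket present so the client can render empty sections.
  const results: Record<SourceBucket, { title: string }[]> = {
    conversations: [],
    incidents: [],
    resources: [],
    shift_notes: [],
  };
  for (const hit of hits) {
    results[hit.bucket].push({ title: hit.title });
  }
  return { query, results, total_hits: hits.length };
}
```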

7. Temporal Re-processing System

7.1 Re-processing Queue

Units and groups with reprocess_priority > 0 are queued:

| Priority | Reason | Timing |
| --- | --- | --- |
| 10 | Critical (incident without context) | Every 6 hours |
| 8 | Unanswered question | 6h, 24h, 48h, 7d |
| 5 | Partial context (incident with some notes) | Daily |
| 3 | Validation pass (all content >24h old) | Once at 48h |
| 1 | Periodic refresh (active conversations) | Weekly |

7.2 Re-processing Logic

┌─────────────────────────────────────────────────────────────┐
│ CRON: Every hour during quiet hours (2 AM - 6 AM) │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ 1. Fetch units/groups with reprocess_priority > 0 │
│ 2. Order by priority DESC, last_reprocessed_at ASC │
│ 3. Process batch (50 items max per run) │
└─────────────────────────────────────────────────────────────┘
|
v
┌─────────────────────────────────────────────────────────────┐
│ For each item: │
│ - Re-run extraction with expanded context window │
│ - Check for Q&A resolution │
│ - Look for new linked content │
│ - Update confidence scores │
│ - Either: resolve (priority=0) OR reschedule │
└─────────────────────────────────────────────────────────────┘
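
The batch selection in the second box (priority descending, then least recently re-processed, 50 items max) can be sketched in memory as below. `QueueItem` and `selectReprocessBatch` are hypothetical names. Treating never-reprocessed items as oldest is an assumption; an SQL ORDER BY ... ASC in PostgreSQL would place NULLs last unless NULLS FIRST is specified.

```typescript
// Hypothetical queue entry for a unit or group awaiting re-processing.
interface QueueItem {
  id: string;
  reprocessPriority: number;
  lastReprocessedAt: number | null; // epoch ms, null if never re-processed
}

const REPROCESS_BATCH_SIZE = 50;

function selectReprocessBatch(
  items: QueueItem[],
  batchSize: number = REPROCESS_BATCH_SIZE,
): QueueItem[] {
  return items
    .filter((item) => item.reprocessPriority > 0) // priority 0 means resolved
    .sort((a, b) => {
      if (a.reprocessPriority !== b.reprocessPriority) {
        return b.reprocessPriority - a.reprocessPriority; // higher priority first
      }
      // Tie-break: least recently re-processed first; never-processed counts as oldest.
      return (a.lastReprocessedAt ?? 0) - (b.lastReprocessedAt ?? 0);
    })
    .slice(0, batchSize);
}
```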

8. YP3000 Feedback Loop

8.1 Embeddings → YP3000

When processing content, we extract identity mentions and feed them to YP3000:

// During extraction, found mention of "Sarah Jones"
await yp3000.recordSighting({
  raw_name: "Sarah Jones",
  phone: null, // Not mentioned
  email: null,
  context: "Mentioned in shift note as participant",
  source_type: "embedding_extraction",
  source_id: unit.id,
  confidence: 75
});

YP3000 uses these sightings to:

  • Discover new identities
  • Reinforce existing identity confidence
  • Detect conflicts (different details for same person)

8.2 YP3000 → Embeddings

When YP3000 resolves or merges identities:

  • Update identity_mentions.yp_id to canonical ID
  • Recalculate linked_groups.yp_ids arrays
  • Flag affected groups for re-indexing
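
Recalculating a group's yp_ids after a merge amounts to replacing the retired ID with the canonical one and removing any duplicate this creates. A minimal sketch, with `remapYpIds` as a hypothetical helper operating on plain string IDs:

```typescript
// Rewrites one group's identity list after YP3000 merges `mergedFrom` into `canonical`.
function remapYpIds(ypIds: string[], mergedFrom: string, canonical: string): string[] {
  const remapped = ypIds.map((id) => (id === mergedFrom ? canonical : id));
  // Keep the first occurrence of each ID so the merged person appears only once.
  return remapped.filter((id, index) => remapped.indexOf(id) === index);
}
```

A group whose array changes under this function is one that would need to be flagged for re-indexing.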

9. Implementation Phases

Phase 1: Schema & Foundation

  • Create linked_groups table
  • Create group_members table
  • Create identity_mentions table
  • Add new columns to embeddings.units
  • Create indexes

Phase 2: Processing Enhancements

  • Add processing_profile to extraction pipeline
  • Implement incident-specific extraction
  • Implement shift note extraction
  • Add identity extraction (link to YP3000)
  • Add location extraction

Phase 3: Group Detection

  • Conversation grouping (Discord threads)
  • Incident context grouping (incident + shift notes)
  • Q&A pair detection
  • Group embedding generation

Phase 4: Temporal Re-processing

  • Build re-processing queue handler
  • Implement Q&A resolution detection
  • Implement incident context expansion
  • Add validation pass logic

Phase 5: Unified Search

  • Build unified search endpoint
  • Implement header search bar
  • Add source-type grouping
  • Add rich result formatting

Phase 6: YP3000 Integration

  • Add identity extraction to pipeline
  • Build sighting feed to YP3000
  • Handle identity merge events
  • Build "everything about person X" query

10. Success Criteria

  1. Context Quality: Searches return complete conversations, not fragments
  2. Q&A Resolution: 80%+ of questions linked to answers within 48h (when answer exists)
  3. Incident Context: Incidents automatically link to related shift notes
  4. Identity Coverage: 90%+ of content has identity anchors
  5. Search Relevance: Cross-source search returns expected results in top 3
  6. Processing Latency: New content searchable within 5 minutes, re-processed within 48h


Changelog

| Date | Change | Author |
| --- | --- | --- |
| 2026-01-24 | Initial document - design complete | Droid |