Embedding V2 Linked Context Architecture

Created: 2026-01-24
Status: Design Complete - Awaiting Implementation
Related: 27_Embedding&Semantic_Search_Strategy.md, 16_YP3000_Identity_System.md

Executive Summary

This document extends the embedding architecture (Doc 27) with Linked Groups, Identity Anchoring, and Temporal Re-processing. The goal is to transform isolated semantic units into rich, connected knowledge that surfaces complete context rather than fragmented mentions.

Core insight: A search for "roast pumpkin" should return the 9-message meal planning conversation, not a single "yes pumpkin please" message.

1. The Problem with Isolated Units

The current system extracts atomic units. That is great for precision, but it loses vital context:

Scenario	Current Result	Desired Result
Search "pumpkin"	Single message: "yes pumpkin please"	Full conversation: 9-message meal planning discussion
Search incident	Incident report only	Incident + related shift notes from staff who were there
Search "John"	Scattered mentions	All context about John, grouped by topic/time
Search question	The question (unanswered)	Question + answer (if found later)

We need a layer above units that groups related content into searchable contexts.

2. New Concepts

2.1 Linked Groups

A Linked Group is a collection of units that form a coherent searchable context:

Group Type	Description	Example
`conversation`	Multi-message discussion	Discord meal planning thread
`incident_context`	Incident + supporting content	Incident report + 4 staff shift notes
`qa_pair`	Question and its resolution	"Has anyone seen the keys?" + "Found them in the van"
`decision_thread`	Discussion leading to decision	Budget approval conversation
`topic_cluster`	Auto-detected similar content	All mentions of vehicle maintenance
`daily_context`	Everything at location on date	All activity at Picton SIL on Tuesday

Key: The group gets its own embedding representing the WHOLE context. This becomes the primary search target.

2.2 Identity Anchoring (YP3000 Integration)

Every unit and group tracks WHO is involved:

Role	Description
`author`	Who created/said this
`subject`	Who this is about
`mentioned`	Names that appear
`recipient`	Who it was sent to
`witness`	Present but not primary

Benefits:

"Show me everything about Sarah" works across all sources
Relationship discovery: "Who does John interact with most?"
Contact confidence: Multiple sources = higher trust
Helps YP3000 validate identities through corroboration

2.3 Temporal Re-processing

The Problem: Content processed at ingestion time lacks future context.

Questions don't have answers yet
Incidents don't have follow-up shift notes yet
Conversations are incomplete

The Solution: Two-pass processing with delayed validation.

Pass	Timing	Purpose
Pass 1	Immediate	Best-effort extraction, create unit
Pass 2	24-48 hours later	Re-process with expanded context, link Q&A, validate

2.4 Bi-directional Linking

Not all content types have symmetric relationships:

Content A	Content B	Direction	Behavior
Incident Report	Shift Notes	A searches for B	Incident keeps looking until shift notes exist
Shift Notes	Incident Report	B finds A	Shift notes don't require incident; bonus if found
Question	Answer	A searches for B	Questions re-queue until resolved or timeout
Answer	Question	B finds A	Answers can exist without explicit question

Implementation: Units get resolution_status and reprocess_priority:

Unanswered questions: open + high priority
Incidents without context: pending_context + medium priority
Resolved content: complete + no priority

3. Enhanced Schema

3.1 Linked Groups Table

CREATE TABLE embeddings.linked_groups (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  
  -- Classification
  group_type text NOT NULL,  -- 'conversation', 'incident_context', 'qa_pair', etc.
  processing_profile text,   -- Source-specific extraction strategy
  
  -- Human-readable summary (LLM-generated)
  title text,                -- "Christmas lunch planning at Picton SIL"
  summary text,              -- Longer description
  
  -- The "searchable unit" embedding - represents the WHOLE group
  group_embedding vector(1536),
  embedding_model text DEFAULT 'text-embedding-3-small',
  
  -- Identity Anchors (YP3000 integration)
  yp_ids uuid[],             -- All identities involved
  primary_yp_id uuid,        -- Main subject (if applicable)
  
  -- Location/Context Anchors
  location_ids uuid[],       -- Linked locations
  
  -- Temporal Anchors
  event_date date,           -- When this HAPPENED (not when recorded)
  event_date_end date,       -- For ranges
  earliest_content_at timestamptz,
  latest_content_at timestamptz,
  
  -- Topics/Domains
  domain_tags text[],        -- hr, incidents, roster, etc.
  topic_tags text[],         -- meal_planning, vehicle, participant_care
  
  -- Quality & Completeness
  coherence_score float,     -- How well do these units relate? (0-1)
  completeness text,         -- 'complete', 'partial', 'evolving'
  resolution_status text,    -- 'open', 'resolved', 'superseded'
  
  -- Re-processing
  reprocess_priority int DEFAULT 0,  -- Higher = sooner
  last_reprocessed_at timestamptz,
  
  -- Metadata
  created_at timestamptz DEFAULT NOW(),
  updated_at timestamptz DEFAULT NOW()
);
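
The schema does not say how group_embedding or coherence_score are computed. One plausible option, sketched here as an assumption rather than the chosen design, is to use the normalized centroid of the member unit embeddings and to score coherence as the mean cosine similarity of each unit to that centroid. Embedding the LLM-generated summary instead would be an equally reasonable choice. All function names below are hypothetical.

```typescript
// Hypothetical helpers; the design does not specify how these values are derived.

function l2Normalize(v: number[]): number[] {
  let sumSq = 0;
  for (const x of v) sumSq += x * x;
  const norm = Math.sqrt(sumSq);
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Mean of the unit embeddings, scaled to unit length.
function groupCentroid(unitEmbeddings: number[][]): number[] {
  if (unitEmbeddings.length === 0) throw new Error("group has no units");
  const dim = unitEmbeddings[0].length;
  const sum: number[] = [];
  for (let i = 0; i < dim; i++) sum.push(0);
  for (const e of unitEmbeddings) {
    for (let i = 0; i < dim; i++) sum[i] += e[i];
  }
  return l2Normalize(sum.map((x) => x / unitEmbeddings.length));
}

// Mean cosine similarity of each unit to the centroid, clamped to [0, 1].
function coherenceScore(unitEmbeddings: number[][]): number {
  const centroid = groupCentroid(unitEmbeddings);
  let total = 0;
  for (const e of unitEmbeddings) {
    const unit = l2Normalize(e);
    let dot = 0;
    for (let i = 0; i < unit.length; i++) dot += unit[i] * centroid[i];
    total += dot;
  }
  return Math.min(1, Math.max(0, total / unitEmbeddings.length));
}
```

With this definition, a group whose units all point the same way scores 1, and the score falls as the units diverge.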

3.2 Group Members Table (Unit-to-Group Links)

CREATE TABLE embeddings.group_members (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  group_id uuid REFERENCES embeddings.linked_groups(id) ON DELETE CASCADE,
  unit_id uuid REFERENCES embeddings.units(id) ON DELETE CASCADE,
  
  -- Position/Role in the group
  sequence_order int,        -- 1st message, 2nd message, etc.
  member_role text,          -- 'initiator', 'response', 'resolution', 'context'
  
  -- Source classification
  source_type text,          -- 'primary' (the incident) vs 'supporting' (shift notes)
  relevance_score float,     -- How relevant is this unit to the group? (0-1)
  
  created_at timestamptz DEFAULT NOW(),
  
  UNIQUE(group_id, unit_id)
);

3.3 Identity Mentions Table (YP3000 Links)

CREATE TABLE embeddings.identity_mentions (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  
  -- What this mention is attached to
  unit_id uuid REFERENCES embeddings.units(id) ON DELETE CASCADE,
  group_id uuid REFERENCES embeddings.linked_groups(id) ON DELETE SET NULL,
  
  -- YP3000 Identity
  yp_id uuid NOT NULL,       -- References core_source.yp3000_identities
  
  -- Role in this content
  mention_type text NOT NULL, -- 'author', 'subject', 'mentioned', 'recipient', 'witness'
  
  -- Confidence
  confidence float DEFAULT 0.8,
  match_reasoning text,      -- Why we think this is this person
  
  -- Context
  context_snippet text,      -- The relevant text mentioning them
  
  created_at timestamptz DEFAULT NOW(),
  
  UNIQUE(unit_id, yp_id, mention_type)
);

3.4 Enhanced Units Table

Add columns to existing embeddings.units:

ALTER TABLE embeddings.units
  -- Processing profile (determines extraction strategy)
  ADD COLUMN IF NOT EXISTS processing_profile text,   -- 'discord_chat', 'incident_report', 'shift_note', etc.

  -- Direct anchors
  ADD COLUMN IF NOT EXISTS location_ids uuid[],       -- Linked locations
  ADD COLUMN IF NOT EXISTS event_date date,           -- When this HAPPENED

  -- Sentiment/Tone
  ADD COLUMN IF NOT EXISTS sentiment_score float,     -- -1 (negative) to 1 (positive)
  ADD COLUMN IF NOT EXISTS urgency_level int,         -- 1-5

  -- Content classification
  ADD COLUMN IF NOT EXISTS contains_question boolean DEFAULT false,
  ADD COLUMN IF NOT EXISTS contains_action_item boolean DEFAULT false,

  -- Resolution tracking
  ADD COLUMN IF NOT EXISTS resolution_status text,    -- 'open', 'resolved', 'superseded', 'timeout'
  ADD COLUMN IF NOT EXISTS resolved_by_unit_id uuid,  -- Links question to answer

  -- Re-processing
  ADD COLUMN IF NOT EXISTS reprocess_priority int DEFAULT 0,
  ADD COLUMN IF NOT EXISTS reprocessed_at timestamptz,
  ADD COLUMN IF NOT EXISTS reprocess_reason text;     -- 'temporal', 'validation', 'context_expansion'

3.5 Indexes

-- Linked Groups: Vector search
CREATE INDEX idx_linked_groups_embedding 
  ON embeddings.linked_groups USING ivfflat (group_embedding vector_cosine_ops)
  WITH (lists = 100);

-- Linked Groups: Filtering
CREATE INDEX idx_linked_groups_location ON embeddings.linked_groups USING gin (location_ids);
CREATE INDEX idx_linked_groups_yp ON embeddings.linked_groups USING gin (yp_ids);
CREATE INDEX idx_linked_groups_event_date ON embeddings.linked_groups (event_date);
CREATE INDEX idx_linked_groups_type ON embeddings.linked_groups (group_type);
CREATE INDEX idx_linked_groups_reprocess ON embeddings.linked_groups (reprocess_priority DESC) 
  WHERE reprocess_priority > 0;

-- Identity Mentions: Person lookups
CREATE INDEX idx_identity_mentions_yp ON embeddings.identity_mentions (yp_id);
CREATE INDEX idx_identity_mentions_unit ON embeddings.identity_mentions (unit_id);

-- Units: Re-processing queue
CREATE INDEX idx_units_reprocess ON embeddings.units (reprocess_priority DESC) 
  WHERE reprocess_priority > 0;
CREATE INDEX idx_units_questions ON embeddings.units (created_at DESC) 
  WHERE contains_question = true AND resolution_status = 'open';

4. Processing Profiles

Different content types need tailored extraction:

Profile	Sources	Chunking Strategy	Special Handling
`discord_chat`	Discord messages	Thread-aware, speaker-grouped	Q&A pairing, emoji context
`incident_report`	Incident forms	Keep narrative whole, section headers	Entity extraction critical
`shift_note`	Daily summaries	Participant-focused sections	Link to incidents by date/location
`resource_doc`	Policies, procedures	Semantic sections, hierarchy	Authority level, version tracking
`phone_call`	Dialpad transcripts	Speaker turns, action items	Tone detection, urgency
`sms_message`	Twilio/TextMagic	Conversation threading	Very brief, context-dependent
`email_thread`	Inbound/outbound	Quote stripping, signature removal	CC/BCC = mentioned identities
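
As one concrete illustration of a profile-specific step, the email_thread profile's quote stripping and signature removal could look roughly like the sketch below. The function name and the exact patterns are assumptions. It relies only on common conventions: quoted lines start with ">", replies carry an "On ... wrote:" attribution line, and signatures follow a "-- " delimiter line. Real mail will need more cases.

```typescript
// Hypothetical pre-processing step for the email_thread profile.
// Keeps only the new text the sender wrote in this message.
function cleanEmailBody(body: string): string {
  const kept: string[] = [];
  for (const line of body.split(/\r?\n/)) {
    // Conventional signature delimiter: everything after it is dropped.
    if (line === "-- " || line === "--") break;
    // Quoted text from earlier messages in the thread.
    if (/^\s*>/.test(line)) continue;
    // Common reply attribution line, e.g. "On Mon, Jan 5, Jane wrote:".
    if (/^On .+ wrote:$/.test(line.trim())) continue;
    kept.push(line);
  }
  return kept.join("\n").trim();
}
```

Stripping quoted text before embedding keeps each unit focused on new content, so the same paragraph is not embedded once per reply in the thread.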

5. Group Detection Logic

5.1 Conversation Groups (Discord)

Trigger: Messages in same channel within time window

Logic:

Cluster messages by channel_id
Group by time gaps (<5 min = same conversation)
Detect topic shifts via embedding distance
Create group with synthesized summary
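
The first three steps above can be sketched as a single pass over messages sorted by channel and time. This is a minimal illustration under stated assumptions: the ChatMessage shape and the topic-shift threshold of 0.6 are hypothetical, and only the 5-minute gap comes from this document. Summary synthesis (step 4) is left out.

```typescript
// Hypothetical message shape; field names are assumptions.
interface ChatMessage {
  id: string;
  channelId: string;
  sentAt: Date;
  embedding?: number[]; // optional; enables topic-shift detection
}

const MAX_GAP_MS = 5 * 60 * 1000; // "<5 min = same conversation"
const TOPIC_SHIFT_DISTANCE = 0.6; // assumed threshold; needs tuning on real data

// Cosine distance: 0 for identical direction, 1 for orthogonal vectors.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  if (na === 0 || nb === 0) return 1;
  return 1 - dot / Math.sqrt(na * nb);
}

function startsNewConversation(prev: ChatMessage, next: ChatMessage): boolean {
  if (prev.channelId !== next.channelId) return true;
  if (next.sentAt.getTime() - prev.sentAt.getTime() >= MAX_GAP_MS) return true;
  if (prev.embedding !== undefined && next.embedding !== undefined) {
    return cosineDistance(prev.embedding, next.embedding) > TOPIC_SHIFT_DISTANCE;
  }
  return false;
}

function groupConversations(messages: ChatMessage[]): ChatMessage[][] {
  // Sort by channel, then by time, so each conversation is contiguous.
  const sorted = messages.slice().sort((x, y) => {
    if (x.channelId !== y.channelId) return x.channelId < y.channelId ? -1 : 1;
    return x.sentAt.getTime() - y.sentAt.getTime();
  });
  const groups: ChatMessage[][] = [];
  let current: ChatMessage[] = [];
  for (const m of sorted) {
    const prev = current.length > 0 ? current[current.length - 1] : undefined;
    if (prev !== undefined && startsNewConversation(prev, m)) {
      groups.push(current);
      current = [];
    }
    current.push(m);
  }
  if (current.length > 0) groups.push(current);
  return groups;
}
```

Comparing each message only with its predecessor is the simplest topic-shift test. Comparing against a running average of the conversation would be less sensitive to a single off-topic reply.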

5.2 Incident Context Groups

Trigger: Incident report created

Logic:

Find incident location + date
Search for shift notes at same location ±1 day
Search for staff mentioned in incident
Search for participant-related content
Create group with incident as primary, notes as supporting

Bi-directional handling:

Incident without notes: Set reprocess_priority = 5, completeness = 'partial'
Re-process daily until notes found or 7 days elapsed
Shift notes arriving later trigger incident re-evaluation
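
A minimal sketch of the location and date matching, plus the bi-directional handling, follows. The type and function names are hypothetical, and it covers only the shift-note search. Treating a single matching note as enough to mark the group 'complete' is a simplifying assumption; the design may want a stricter rule.

```typescript
// Hypothetical shapes; eventDate is an ISO date string such as "2026-01-15".
interface DatedUnit {
  id: string;
  locationId: string;
  eventDate: string;
}

interface IncidentContext {
  primaryUnitId: string;
  supportingUnitIds: string[];
  completeness: "complete" | "partial";
  reprocessPriority: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const MAX_WAIT_DAYS = 7; // stop re-processing after 7 days

// Date-only ISO strings parse as UTC midnight, so this is a whole number of days.
function daysApart(a: string, b: string): number {
  return Math.abs(Date.parse(a) - Date.parse(b)) / DAY_MS;
}

function buildIncidentContext(
  incident: DatedUnit,
  shiftNotes: DatedUnit[],
  incidentAgeDays: number
): IncidentContext {
  // Shift notes at the same location within one day either side of the incident.
  const supporting = shiftNotes
    .filter((n) => n.locationId === incident.locationId)
    .filter((n) => daysApart(n.eventDate, incident.eventDate) <= 1)
    .map((n) => n.id);
  const found = supporting.length > 0;
  const expired = incidentAgeDays >= MAX_WAIT_DAYS;
  return {
    primaryUnitId: incident.id,
    supportingUnitIds: supporting,
    completeness: found ? "complete" : "partial",
    // Priority 5 keeps the incident in the daily queue until notes arrive.
    reprocessPriority: found || expired ? 0 : 5,
  };
}
```

Running the same function again when a shift note arrives gives the "shift notes trigger incident re-evaluation" behavior without a separate code path.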

5.3 Q&A Pair Detection

Trigger: Unit marked contains_question = true

Logic:

Search subsequent units in same context (thread, location, topic)
Look for response patterns ("found it", "yes", "the answer is")
LLM verification: "Does this answer that question?"
Link via resolved_by_unit_id

Priority handling:

Unanswered questions: reprocess_priority = 8 (high)
Re-check at 6h, 24h, 48h, 7d
After 7d with no answer: Mark resolution_status = 'timeout'
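
The re-check schedule can be expressed as a small pure function. This is a sketch under the assumption that it is called each time a question is checked and found still unanswered; planRecheck and RecheckPlan are hypothetical names.

```typescript
const HOUR_MS = 60 * 60 * 1000;
// Re-check points measured from when the question was asked: 6h, 24h, 48h, 7d.
const RECHECK_OFFSETS_HOURS = [6, 24, 48, 7 * 24];

type RecheckPlan =
  | { status: "open"; nextCheckAt: Date }
  | { status: "timeout" };

// Call this for a question that is still unanswered at time `now`.
function planRecheck(askedAt: Date, now: Date): RecheckPlan {
  const elapsedMs = now.getTime() - askedAt.getTime();
  for (const hours of RECHECK_OFFSETS_HOURS) {
    if (elapsedMs < hours * HOUR_MS) {
      return {
        status: "open",
        nextCheckAt: new Date(askedAt.getTime() + hours * HOUR_MS),
      };
    }
  }
  // The 7-day check has passed with no answer.
  return { status: "timeout" };
}
```

Deriving the next check from the original timestamp, instead of storing a retry counter, keeps the schedule correct even if a scheduled run is skipped.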

6. Search Architecture

6.1 Query Flow

User Query: "What happened with pumpkin at Christmas?"
                    |
                    v
┌─────────────────────────────────────────────────────────────┐
│  Step 1: QUERY ANALYSIS                                     │
│  - Extract intent: information retrieval                    │
│  - Entities: none explicit                                  │
│  - Time hints: "Christmas" → December 2025                  │
│  - Topics: food, meal planning                              │
└─────────────────────────────────────────────────────────────┘
                    |
                    v
┌─────────────────────────────────────────────────────────────┐
│  Step 2: SEARCH linked_groups (PRIMARY)                     │
│  SELECT * FROM linked_groups                                │
│  WHERE group_embedding <=> query_embedding < 0.5            │
│  AND (event_date BETWEEN '2025-12-01' AND '2025-12-31'      │
│       OR 'christmas' = ANY(topic_tags))                     │
│  ORDER BY group_embedding <=> query_embedding LIMIT 10      │
└─────────────────────────────────────────────────────────────┘
                    |
                    v
┌─────────────────────────────────────────────────────────────┐
│  Step 3: EXPAND GROUP RESULTS                               │
│  For each group:                                            │
│  - Fetch all group_members with their units                 │
│  - Fetch identity_mentions for participant names            │
│  - Fetch original source content                            │
└─────────────────────────────────────────────────────────────┘
                    |
                    v
┌─────────────────────────────────────────────────────────────┐
│  Step 4: FALLBACK TO UNITS (SECONDARY)                      │
│  If <3 group results, also search units directly            │
│  Return as individual results with lower ranking            │
└─────────────────────────────────────────────────────────────┘
                    |
                    v
┌─────────────────────────────────────────────────────────────┐
│  Step 5: RETURN RICH RESULTS                                │
│  [                                                          │
│    {                                                        │
│      type: 'conversation',                                  │
│      title: 'Christmas lunch planning at Picton SIL',       │
│      participants: ['Sarah', 'Mike', 'Jane'],               │
│      location: 'Picton SIL',                                │
│      date: '2025-12-10',                                    │
│      messages: [ ...9 messages in order... ],               │
│      relevance: 0.89                                        │
│    }                                                        │
│  ]                                                          │
└─────────────────────────────────────────────────────────────┘
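
Steps 2 and 4 of the flow can be sketched as a ranking function that queries units only when fewer than three groups match. The hit shapes and the mergeWithFallback name are hypothetical. The unit search is passed in as a callback, and in a real service both searches would be asynchronous database calls.

```typescript
// Hypothetical hit shapes; distance is the vector distance (lower is closer).
interface GroupHit { kind: "group"; groupId: string; distance: number }
interface UnitHit { kind: "unit"; unitId: string; distance: number }
type SearchHit = GroupHit | UnitHit;

const MIN_GROUP_RESULTS = 3; // "If <3 group results, also search units directly"

function byDistance(a: SearchHit, b: SearchHit): number {
  return a.distance - b.distance;
}

function mergeWithFallback(
  groupHits: GroupHit[],
  searchUnits: () => UnitHit[],
  limit: number = 10
): SearchHit[] {
  // Groups are the primary search target and always rank first.
  const ranked: SearchHit[] = groupHits.slice().sort(byDistance);
  if (groupHits.length < MIN_GROUP_RESULTS) {
    // Unit hits are appended after all groups, i.e. with lower ranking.
    for (const u of searchUnits().slice().sort(byDistance)) ranked.push(u);
  }
  return ranked.slice(0, limit);
}
```

Passing the unit search as a callback means the second query is skipped entirely when the group search already returns enough results.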

6.2 Cross-Source Unified Search

The header search bar returns mixed results from ALL sources:

// Grouped response format
{
  query: "vehicle inspection",
  results: {
    conversations: [
      { title: "Van brake issue discussion", source: "discord", count: 7 }
    ],
    incidents: [
      { title: "Minor vehicle damage at Picton", date: "2026-01-15" }
    ],
    resources: [
      { title: "Vehicle Inspection Checklist", type: "procedure" }
    ],
    shift_notes: [
      { title: "Noted van making noise", staff: "John Smith" }
    ]
  },
  total_hits: 12
}
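
A typed version of this response, with a helper that buckets flat hits by source, might look like the sketch below. The type names are hypothetical, and it assumes total_hits is the number of hits passed in; the example above does not define how that figure is counted.

```typescript
// Bucket keys taken from the example response above.
type ResultBucket = "conversations" | "incidents" | "resources" | "shift_notes";

// Hypothetical minimal hit; real hits carry bucket-specific fields.
interface SearchResultItem {
  title: string;
  bucket: ResultBucket;
}

interface UnifiedSearchResponse {
  query: string;
  results: Record<ResultBucket, SearchResultItem[]>;
  total_hits: number;
}

function buildUnifiedResponse(
  query: string,
  hits: SearchResultItem[]
): UnifiedSearchResponse {
  // Every bucket is always present, so the UI can render empty sections.
  const results: Record<ResultBucket, SearchResultItem[]> = {
    conversations: [],
    incidents: [],
    resources: [],
    shift_notes: [],
  };
  for (const hit of hits) results[hit.bucket].push(hit);
  return { query, results, total_hits: hits.length };
}
```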

7. Temporal Re-processing System

7.1 Re-processing Queue

Units and groups with reprocess_priority > 0 are queued:

Priority	Reason	Timing
10	Critical (incident without context)	Every 6 hours
8	Unanswered question	6h, 24h, 48h, 7d
5	Partial context (incident with some notes)	Daily
3	Validation pass (all content >24h old)	Once at 48h
1	Periodic refresh (active conversations)	Weekly

7.2 Re-processing Logic

┌─────────────────────────────────────────────────────────────┐
│  CRON: Every hour during quiet hours (2 AM - 6 AM)          │
└─────────────────────────────────────────────────────────────┘
                    |
                    v
┌─────────────────────────────────────────────────────────────┐
│  1. Fetch units/groups with reprocess_priority > 0          │
│  2. Order by priority DESC, last_reprocessed_at ASC         │
│  3. Process batch (50 items max per run)                    │
└─────────────────────────────────────────────────────────────┘
                    |
                    v
┌─────────────────────────────────────────────────────────────┐
│  For each item:                                             │
│  - Re-run extraction with expanded context window           │
│  - Check for Q&A resolution                                 │
│  - Look for new linked content                              │
│  - Update confidence scores                                 │
│  - Either: resolve (priority=0) OR reschedule               │
└─────────────────────────────────────────────────────────────┘
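
The batch selection in the second box can be sketched as follows. QueueItem and selectBatch are hypothetical names, and placing never-processed items first within a priority level is an assumption, since the flow does not say how a missing timestamp should sort.

```typescript
// Hypothetical in-memory view of a queued unit or group.
interface QueueItem {
  id: string;
  reprocessPriority: number;
  lastReprocessedAt: Date | null; // null = never re-processed
}

const BATCH_SIZE = 50; // "50 items max per run"

function selectBatch(items: QueueItem[], batchSize: number = BATCH_SIZE): QueueItem[] {
  // Assumption: never-processed items sort before any processed item.
  const timeOf = (i: QueueItem): number =>
    i.lastReprocessedAt === null ? Number.NEGATIVE_INFINITY : i.lastReprocessedAt.getTime();
  return items
    .filter((i) => i.reprocessPriority > 0)
    .sort((a, b) => {
      // Highest priority first.
      if (a.reprocessPriority !== b.reprocessPriority) {
        return b.reprocessPriority - a.reprocessPriority;
      }
      // Then least recently re-processed first.
      const ta = timeOf(a);
      const tb = timeOf(b);
      return ta < tb ? -1 : ta > tb ? 1 : 0;
    })
    .slice(0, batchSize);
}
```

If the same ordering is done in SQL, note that PostgreSQL sorts NULLs last for ASC by default, so ORDER BY last_reprocessed_at ASC would need NULLS FIRST to match this sketch.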

8. YP3000 Feedback Loop

8.1 Embeddings → YP3000

When processing content, we extract identity mentions and feed them to YP3000:

// During extraction, found mention of "Sarah Jones"
await yp3000.recordSighting({
  raw_name: "Sarah Jones",
  phone: null,  // Not mentioned
  email: null,
  context: "Mentioned in shift note as participant",
  source_type: "embedding_extraction",
  source_id: unit.id,
  confidence: 75
});

YP3000 uses these sightings to:

Discover new identities
Reinforce existing identity confidence
Detect conflicts (different details for same person)

8.2 YP3000 → Embeddings

When YP3000 resolves or merges identities:

Update identity_mentions.yp_id to canonical ID
Recalculate linked_groups.yp_ids arrays
Flag affected groups for re-indexing
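
The three steps above can be sketched in memory as shown below. The shapes and the applyIdentityMerge name are hypothetical. One detail worth planning for: after two identities merge, a unit may hold two mentions that differ only by the old and canonical ID, which would violate UNIQUE(unit_id, yp_id, mention_type), so the sketch drops such duplicates.

```typescript
// Hypothetical in-memory mirrors of identity_mentions and linked_groups rows.
interface IdentityMention { unitId: string; ypId: string; mentionType: string }
interface LinkedGroup { id: string; ypIds: string[] }

interface MergeResult {
  mentions: IdentityMention[];
  groups: LinkedGroup[];
  groupsToReindex: string[];
}

function applyIdentityMerge(
  mergedId: string,
  canonicalId: string,
  mentions: IdentityMention[],
  groups: LinkedGroup[]
): MergeResult {
  const remap = (id: string): string => (id === mergedId ? canonicalId : id);

  // 1. Point mentions at the canonical ID, dropping rows that would
  //    collide on (unit_id, yp_id, mention_type) after the remap.
  const seen: { [key: string]: boolean } = {};
  const newMentions: IdentityMention[] = [];
  for (const m of mentions) {
    const ypId = remap(m.ypId);
    const key = [m.unitId, ypId, m.mentionType].join("|");
    if (seen[key]) continue;
    seen[key] = true;
    newMentions.push({ unitId: m.unitId, ypId: ypId, mentionType: m.mentionType });
  }

  // 2 and 3. Recalculate yp_ids arrays and flag the groups that changed.
  const groupsToReindex: string[] = [];
  const newGroups = groups.map((g) => {
    if (g.ypIds.indexOf(mergedId) === -1) return g;
    const mapped = g.ypIds.map(remap);
    const unique = mapped.filter((id, i) => mapped.indexOf(id) === i);
    groupsToReindex.push(g.id);
    return { id: g.id, ypIds: unique };
  });

  return { mentions: newMentions, groups: newGroups, groupsToReindex: groupsToReindex };
}
```

A SQL implementation would need the same care: a plain UPDATE of yp_id can fail on the unique constraint, so colliding rows have to be deleted or merged first.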

9. Implementation Phases

Phase 1: Schema & Foundation

Phase 2: Processing Enhancements

Add processing_profile to extraction pipeline
Implement incident-specific extraction
Implement shift note extraction
Add identity extraction (link to YP3000)
Add location extraction

Phase 3: Group Detection

Conversation grouping (Discord threads)
Incident context grouping (incident + shift notes)
Q&A pair detection
Group embedding generation

Phase 4: Temporal Re-processing

Build re-processing queue handler
Implement Q&A resolution detection
Implement incident context expansion
Add validation pass logic

Phase 5: Cross-Source Search

Build unified search endpoint
Implement header search bar
Add source-type grouping
Add rich result formatting

Phase 6: YP3000 Integration

Add identity extraction to pipeline
Build sighting feed to YP3000
Handle identity merge events
Build "everything about person X" query

10. Success Criteria

Context Quality: Searches return complete conversations, not fragments
Q&A Resolution: 80%+ of questions linked to answers within 48h (when answer exists)
Incident Context: Incidents automatically link to related shift notes
Identity Coverage: 90%+ of content has identity anchors
Search Relevance: Cross-source search returns expected results in top 3
Processing Latency: New content searchable within 5 minutes, re-processed within 48h

11. Related Documents

Embedding & Semantic Search - Base embedding architecture
YP3000 Identity System - Identity resolution system
Data Storage & RAG Design - Core RAG concepts

Changelog

Date	Change	Author
2026-01-24	Initial document - design complete	Droid
