Embedding & Semantic Search Strategy

Created: 2026-01-21
Status: Active - Living Document
Related: AI/Agent System, Knowledge Base, Chat Context

Executive Summary

This document defines our strategy for embedding, enriching, and searching content across all data sources. Our philosophy is quality over scale - with ~300 users, we can afford multi-pass LLM extraction, parallel embedding strategies, and thorough quality checks that would be cost-prohibitive at scale.

Core principle: Each piece of content is gold that deserves microscopic analysis.

1. Philosophy & Approach

1.1 Quality-First Mindset

We are NOT a scale app. We have:

~20 heavily active users
~100 moderate users
~200 casual users

This means we CAN:

Run multi-pass LLM extraction (2-4 rounds per document)
Employ parallel embedding strategies (atomic + context + summary)
Use expensive models for quality checks (GPT-4o verification)
Re-process everything when extraction logic improves
Treat each document like it's being prepared for expert review

Cost reality: ~$5-10/month for extremely high-quality search. A single hour of staff time saved pays for a year.

1.2 Embeddings Are Coordinates, Not Symbols

Key mental model:

Vectors cannot be "decoded" back to text
Search is pure math (cosine similarity)
All intelligence happens before (preparation) and after (ranking)
The embedding itself is just coordinates in meaning-space
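
To make the "pure math" point concrete, here is a minimal sketch of cosine similarity in plain JavaScript. The function name and the toy 3-dimensional vectors are illustrative assumptions; in production the comparison would normally run inside the database rather than in application code.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Cosine similarity is undefined for zero vectors; treat that as "no match".
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors stand in for real 3072-dimensional embeddings.
const toyQuery = [1, 0, 1];
console.log(cosineSimilarity(toyQuery, [2, 0, 2])); // same direction: approximately 1
console.log(cosineSimilarity(toyQuery, [0, 1, 0])); // orthogonal: 0
```

Nothing in this calculation knows about dates, authors, or trust, which is why those concerns are handled before and after the vector comparison.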

1.3 Separate Concerns

Stage	Responsibility	When Intelligence Happens
Ingestion	Store raw content (immutable)	None - just preservation
Extraction	Decompose into semantic units	LLM does the thinking
Enrichment	Add metadata, entities, tags	LLM + rules
Embedding	Generate vectors	API call (no intelligence)
Search	Similarity + filtering + ranking	App logic + optional LLM re-rank

1.4 Metadata Outside Vectors

Critical principle:

Domain tags, entity refs, authority flags live in columns, not embedded text
Never encode time/authority INTO the embedding text
Vectors handle meaning; app handles trust/relevance/recency

2. Data Classification

2.1 Container Types (Stable Set)

These are the source formats - unlikely to change often:

Container Type	Description	Examples
`short_message`	Brief informal text	Discord, SMS, chat turns
`long_message`	Extended text with structure	Emails, long posts
`staff_report`	Operational reports	Shift notes, incidents, supervision
`transcript`	Spoken word converted to text	Meeting transcripts, calls
`resource_document`	Official knowledge content	Policies, procedures, handbooks
`structured_record`	Field-based data	Form submissions, CRM entries, KPIs
`task_or_ticket`	Work items with status	Monday items, action registers
`conversation_thread`	Curated multi-message synthesis	Thread summaries, decision logs
`file_data`	Extracted text from files	PDF content, DOCX
`image_description`	Visual content as text	OCR, alt text, AI descriptions

Rule: When a new source appears, map it to an existing container. Only add a new container if the extraction logic fundamentally differs.
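
The mapping rule could be enforced with a small lookup, sketched below. The origin keys (`discord`, `sms`, `email`, `monday`) and the function name are hypothetical; only `shift_notes` appears as an `origin_system` value elsewhere in this document.

```javascript
// Hypothetical mapping from origin system to one of the stable container types.
const CONTAINER_BY_ORIGIN = new Map([
  ['discord', 'short_message'],
  ['sms', 'short_message'],
  ['email', 'long_message'],
  ['shift_notes', 'staff_report'],
  ['monday', 'task_or_ticket'],
]);

function containerTypeFor(originSystem) {
  const containerType = CONTAINER_BY_ORIGIN.get(originSystem);
  if (!containerType) {
    // Fail loudly so that a new source forces an explicit mapping decision.
    throw new Error(`No container mapping for origin system "${originSystem}"`);
  }
  return containerType;
}

console.log(containerTypeFor('shift_notes')); // staff_report
```

Throwing on an unknown origin, rather than defaulting to a generic container, keeps the "map first, add a container only if extraction differs" decision a deliberate one.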

2.2 Unit Types (What We Extract)

Each extracted semantic unit gets classified:

Facts & Descriptions

general_fact - Timeless objective facts
descriptive_fact - Attributes of person/place/object
state_fact - Current or recent state
historical_fact - Past event or outcome

Actions & Events

action_completed - Something done
action_in_progress - Ongoing work
action_required - Requested/expected action
event - Something that happened

Decisions & Intent

decision - Explicit choice or resolution
intent - Stated plan or aim
commitment - Promise or obligation

Knowledge & Rules

policy - Formal rule or requirement
procedure - Step-based instructions
exception - Override or edge case
definition - Meaning of a term

Questions & Uncertainty

question - Explicit query
assumption - Belief without confirmation
unknown - Explicit lack of knowledge

Risk & Assessment

risk - Potential harm or issue
assessment - Evaluation or judgment
status_update - Progress or state change

Meta/Administrative

reference - Pointer to external info
note - Contextual but non-actionable

2.3 Domain Tags

Organization-specific topic areas for filtering:

Domain	Description
`roster`	Scheduling, shifts, availability
`payroll`	Pay, timesheets, deductions
`hr`	Employment, contracts, policies
`incidents`	Safety, accidents, near-misses
`participants`	Client/participant matters
`vehicles`	Fleet, maintenance, bookings
`training`	Qualifications, certifications
`ndis`	Funding, plans, compliance
`facilities`	Buildings, equipment, maintenance
`communications`	Internal comms, announcements

3. Ingestion Pipeline

3.1 Overview

Raw Content
    |
    v
+-------------------------------------------------------+
|  Stage 1: INGEST                                      |
|  - Store raw content immutably                        |
|  - Classify container type                            |
|  - Record provenance (source, author, timestamp)      |
+-------------------------------------------------------+
    |
    v
+-------------------------------------------------------+
|  Stage 2: EXTRACT (LLM Pass 1)                        |
|  - Decompose into atomic semantic units               |
|  - Classify each unit by type                         |
|  - Identify entities (staff, participants, etc.)      |
|  - Tag with domains                                   |
+-------------------------------------------------------+
    |
    v
+-------------------------------------------------------+
|  Stage 3: VERIFY (LLM Pass 2)                         |
|  - Quality check extraction                           |
|  - Flag low confidence units                          |
|  - Identify potential gaps                            |
|  - Check entity resolution accuracy                   |
+-------------------------------------------------------+
    |
    v
+-------------------------------------------------------+
|  Stage 4: ENRICH (LLM Pass 3 + Rules)                 |
|  - Add inferred metadata                              |
|  - Resolve entity references to IDs                   |
|  - Generate search-optimized text                     |
|  - Create unit summary + context variants             |
+-------------------------------------------------------+
    |
    v
+-------------------------------------------------------+
|  Stage 5: EMBED (API Call)                            |
|  - Generate multiple embeddings per unit:             |
|    * Atomic (just the unit text)                      |
|    * Contextual (unit + surrounding context)          |
|    * Summary (synthesized overview)                   |
|  - Store all strategies with tags                     |
+-------------------------------------------------------+
    |
    v
+-------------------------------------------------------+
|  Stage 6: INDEX                                       |
|  - Store in unified schema                            |
|  - Update search indexes                              |
|  - Link to source entities                            |
+-------------------------------------------------------+

3.2 Stage Details

Stage 1: Ingest

Store the raw content exactly as received:

{
  container_type: 'staff_report',
  origin_system: 'shift_notes',
  origin_subtype: 'daily_report',
  external_id: 'shift-2026-01-21-venue-a',
  author_id: 'staff-uuid',
  author_display: 'Jane Smith',
  created_at: '2026-01-21T08:00:00Z',
  raw_text: '... original content ...',
  raw_payload: { /* any structured metadata */ }
}

Immutability rule: Raw content never changes. If the source is updated, store the update as a new version.

Stage 2: Extract

LLM prompt breaks content into atomic units:

Input: "Sarah arrived at 8am looking tired. She mentioned her dog was sick overnight. 
During morning activities she was engaged but needed reminders about her medication at 10am.
Note: Sarah's NDIS plan review is due next month."

Output:
- [state_fact] Sarah appeared tired on arrival at 8:00 AM
- [historical_fact] Sarah's dog was sick overnight (context for tiredness)
- [assessment] Sarah was engaged during morning activities
- [action_required] Sarah needed medication reminder at 10:00 AM
- [event] Medication reminder given at 10:00 AM (implied completed)
- [status_update] Sarah's NDIS plan review due next month

Each unit is self-contained and searchable independently.

Stage 3: Verify

Second LLM pass reviews extraction quality:

Questions to answer:
- Did we miss any facts?
- Are entities correctly identified?
- Any ambiguous interpretations?
- Confidence score for each unit?

Low-confidence units get flagged for human review or excluded from high-trust searches.
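
A minimal sketch of how verified units might be split by confidence. The 0.7 cut-off and the function name are assumptions for illustration; this document does not specify a review threshold.

```javascript
// Assumed cut-off; tune against real verification output.
const REVIEW_THRESHOLD = 0.7;

// Splits units into those trusted for search and those needing human review.
function triageUnits(units, threshold = REVIEW_THRESHOLD) {
  const trusted = [];
  const needsReview = [];
  for (const unit of units) {
    (unit.confidence >= threshold ? trusted : needsReview).push(unit);
  }
  return { trusted, needsReview };
}

const triaged = triageUnits([
  { text: 'Sarah appeared tired on arrival at 8:00 AM', confidence: 0.95 },
  { text: 'Medication reminder given at 10:00 AM', confidence: 0.55 },
]);
console.log(triaged.trusted.length, triaged.needsReview.length); // 1 1
```

The implied "reminder given" unit is the kind of inference that a verification pass would be expected to score lower than a directly stated fact.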

Stage 4: Enrich

Add metadata that helps with search:

{
  unit_type: 'action_required',
  text_for_embedding: 'Sarah needed medication reminder at 10:00 AM', // the date is carried by time.effective_at, not the embedded text (see 1.4)
  domain_tags: ['participants', 'health'],
  entity_refs: {
    participant_ids: ['sarah-uuid'],
    staff_ids: ['author-uuid']
  },
  time: {
    effective_at: '2026-01-21T10:00:00Z',
    is_time_sensitive: true
  },
  authority: {
    author_role: 'support_worker',
    is_official_record: true
  },
  safety: {
    pii_present: true,
    medical_sensitive: true
  }
}

Stage 5: Embed

Generate multiple embeddings for each unit:

Strategy	What It Embeds	Good For
`atomic`	Just the unit text	Precise fact retrieval
`contextual`	Unit + 2-3 surrounding units	Understanding in context
`summary`	Synthesized source overview	Broad topic search

All use text-embedding-3-large (3072 dimensions) for consistency.
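
The three strategies differ only in the text sent to the embedding API. The sketch below builds those three input strings for one unit; the function name, the `window` parameter, and the space-joined context format are assumptions, not the implemented behaviour.

```javascript
// Builds the text inputs for the atomic, contextual, and summary embeddings.
// `units` is the ordered list of unit texts extracted from one source.
function buildEmbeddingInputs(units, index, sourceSummary, window = 1) {
  if (index < 0 || index >= units.length) {
    throw new RangeError(`Unit index ${index} is out of range`);
  }
  const start = Math.max(0, index - window);
  const end = Math.min(units.length, index + window + 1);
  return {
    atomic: units[index],
    // The unit plus up to `window` neighbours on each side, in source order.
    contextual: units.slice(start, end).join(' '),
    summary: sourceSummary,
  };
}

const shiftUnits = [
  'Sarah appeared tired on arrival at 8:00 AM.',
  "Sarah's dog was sick overnight.",
  'Sarah was engaged during morning activities.',
];
const inputs = buildEmbeddingInputs(shiftUnits, 1, 'Shift note about Sarah: tired on arrival, engaged later.');
console.log(inputs.contextual);
```

With `window = 1` the contextual text includes up to two surrounding units; a larger window trades precision for more context.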

Stage 6: Index

Store everything in the unified schema with proper indexes for:

Vector similarity (HNSW index)
Domain tag filtering (GIN index)
Entity lookups (GIN index on entity_refs)
Time range queries (B-tree on effective_at)

4. Search Pipeline

4.1 Query Processing

User Query: "What happened with Sarah's medication last week?"
                    |
                    v
+-------------------------------------------------------+
|  Step 1: QUERY ANALYSIS                               |
|  - Extract intent (information retrieval)             |
|  - Identify entities (Sarah, medication)              |
|  - Identify time constraints (last week)              |
|  - Identify domains (participants, health)            |
+-------------------------------------------------------+
                    |
                    v
+-------------------------------------------------------+
|  Step 2: QUERY EXPANSION (Optional)                   |
|  LLM generates search variants:                       |
|  - "Sarah medication reminder"                        |
|  - "Sarah health medication administration"           |
|  - "participant medication compliance"                |
+-------------------------------------------------------+
                    |
                    v
+-------------------------------------------------------+
|  Step 3: HYBRID SEARCH                                |
|  Combine multiple signals:                            |
|  - Vector similarity (semantic match)                 |
|  - Entity filter (participant = Sarah)                |
|  - Domain filter (health, participants)               |
|  - Time filter (last 7 days)                          |
|  - Authority weight (official records higher)         |
+-------------------------------------------------------+
                    |
                    v
+-------------------------------------------------------+
|  Step 4: RESULT FUSION                                |
|  - Merge results from all strategies                  |
|  - De-duplicate by source                             |
|  - Score by combined relevance                        |
+-------------------------------------------------------+
                    |
                    v
+-------------------------------------------------------+
|  Step 5: LLM RE-RANK (Optional)                       |
|  - Take top 20 candidates                             |
|  - LLM scores relevance to original query             |
|  - Return top 5 with confidence                       |
+-------------------------------------------------------+
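
Step 4 (result fusion) can be sketched as follows, assuming each strategy returns hits shaped like `{ unitId, sourceId, similarity }`. That hit shape and the "keep the best hit per source" rule are assumptions made for illustration.

```javascript
// Merges hit lists from several strategies and keeps the best hit per source.
function fuseResults(resultLists) {
  const bestBySource = new Map();
  for (const hits of resultLists) {
    for (const hit of hits) {
      const current = bestBySource.get(hit.sourceId);
      if (!current || hit.similarity > current.similarity) {
        bestBySource.set(hit.sourceId, hit);
      }
    }
  }
  // Highest similarity first.
  return [...bestBySource.values()].sort((a, b) => b.similarity - a.similarity);
}

const atomicHits = [{ unitId: 'u1', sourceId: 's1', similarity: 0.82 }];
const contextualHits = [
  { unitId: 'u2', sourceId: 's1', similarity: 0.88 },
  { unitId: 'u3', sourceId: 's2', similarity: 0.75 },
];
console.log(fuseResults([atomicHits, contextualHits]).map((hit) => hit.unitId)); // [ 'u2', 'u3' ]
```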

4.2 Scoring Formula

final_score = (
  vector_similarity * 0.5
  + entity_match_bonus * 0.2
  + recency_score * 0.15
  + authority_score * 0.15
)

Weights are tunable per use case.
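
The formula translates directly into a small function. The sketch assumes every signal is already normalised to the range 0 to 1; how `recencyScore` and `authorityScore` are derived is not specified here, and the property names are illustrative.

```javascript
// Default weights from the scoring formula; they sum to 1.0.
const DEFAULT_WEIGHTS = { vector: 0.5, entity: 0.2, recency: 0.15, authority: 0.15 };

// Combines the four signals (each assumed to be in [0, 1]) into one score.
function finalScore(signals, weights = DEFAULT_WEIGHTS) {
  return (
    signals.vectorSimilarity * weights.vector +
    signals.entityMatchBonus * weights.entity +
    signals.recencyScore * weights.recency +
    signals.authorityScore * weights.authority
  );
}

// 0.8*0.5 + 1*0.2 + 0.5*0.15 + 1*0.15 = 0.825 (up to floating-point rounding)
console.log(finalScore({
  vectorSimilarity: 0.8,
  entityMatchBonus: 1,
  recencyScore: 0.5,
  authorityScore: 1,
}));
```

Passing a different `weights` object per search mode is one way to make the weights tunable without changing the function.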

4.3 Search Modes

Mode	Description	Use Case
`precise`	Strict filters, high threshold	Finding specific facts
`exploratory`	Loose filters, broader results	Understanding a topic
`entity_focused`	Filter by entity first	"Everything about X"
`temporal`	Time-weighted heavily	"What happened recently"

5. Data Cleaning & Quality

5.1 Noise Removal by Container Type

Container	What to Remove
`short_message`	Chatter, emojis (unless meaningful), "ok/thanks"
`long_message`	Signatures, legal footers, quoted history, unsubscribe blocks
`staff_report`	Boilerplate headers, template text
`transcript`	Filler words ("um", "like"), false starts
`resource_document`	Table of contents, page numbers, headers/footers
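
Two of these rules are simple enough to sketch with regular expressions. The word lists below are assumptions; "like" is deliberately left out of the filler pattern because it is often a meaningful word, so removing it safely would need more than a regex.

```javascript
// Filler sounds in transcripts ("um", "uh", and drawn-out variants).
const FILLER_WORDS = /\b(?:um+|uh+)\b,?\s*/gi;

function cleanTranscript(text) {
  return text.replace(FILLER_WORDS, '').replace(/\s{2,}/g, ' ').trim();
}

// Short acknowledgements that carry no searchable content.
const LOW_VALUE_REPLIES = new Set(['ok', 'okay', 'thanks', 'thank you']);

function isLowValueMessage(text) {
  // Lower-case and strip punctuation and emoji before comparing.
  const normalized = text.toLowerCase().replace(/[^a-z\s]/g, '').trim();
  return LOW_VALUE_REPLIES.has(normalized);
}

console.log(cleanTranscript('Um so the van is uh booked for Friday')); // so the van is booked for Friday
console.log(isLowValueMessage('Thanks!')); // true
console.log(isLowValueMessage('Thanks, the van is booked')); // false
```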

5.2 Entity Resolution

When content mentions people/things:

Extract mention text ("Sarah", "the blue van")
Fuzzy match against known entities
If confident (>80%), link to entity ID
If uncertain, flag for manual resolution
Store both mention text AND resolved ID
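
The steps above can be sketched with edit-distance matching. This is one possible fuzzy matcher, not necessarily the one used in the pipeline; the entity shape `{ id, name }` and the result fields are assumptions.

```javascript
// Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const saved = row[j];
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(row[j] + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = saved;
    }
  }
  return row[b.length];
}

// Similarity in [0, 1]: 1 means identical strings.
function nameSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Links a mention to the best-matching entity only when the score exceeds the threshold.
function resolveMention(mentionText, knownEntities, threshold = 0.8) {
  const needle = mentionText.trim().toLowerCase();
  let best = { entityId: null, score: 0 };
  for (const entity of knownEntities) {
    const score = nameSimilarity(needle, entity.name.toLowerCase());
    if (score > best.score) best = { entityId: entity.id, score };
  }
  const confident = best.score > threshold;
  return {
    mentionText, // always keep the original mention
    entityId: confident ? best.entityId : null,
    needsManualReview: !confident,
    score: best.score,
  };
}

const entities = [
  { id: 'sarah-uuid', name: 'Sarah' },
  { id: 'van-uuid', name: 'Blue Van' },
];
console.log(resolveMention('Sarrah', entities).entityId); // sarah-uuid
console.log(resolveMention('the blue van', entities).needsManualReview); // true
```

Note that "the blue van" falls below the threshold against "Blue Van" on raw edit distance, which is a reminder that some normalisation (for example stripping articles) would likely be needed before matching.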

5.3 PII & Sensitivity Handling

Flag units containing:

Full names + identifying details
Medical information
Financial data
Contact details
Incident details naming individuals

Flagged units:

Still searchable by authorized users
Excluded from cross-user knowledge extraction
Redacted in certain search contexts
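
A minimal sketch of the first two rules, using the `pii_present` and `medical_sensitive` flags from the unit metadata. The `viewer.canViewSensitive` property is a hypothetical stand-in for whatever authorization check the application uses.

```javascript
function isSensitive(unit) {
  return Boolean(unit.pii_present || unit.medical_sensitive);
}

// Flagged units stay searchable, but only for authorized viewers.
function visibleUnits(units, viewer) {
  return units.filter((unit) => !isSensitive(unit) || viewer.canViewSensitive);
}

// Flagged units never feed cross-user knowledge extraction.
function eligibleForCrossUserExtraction(units) {
  return units.filter((unit) => !isSensitive(unit));
}

const sampleUnits = [
  { id: 'u1', pii_present: false, medical_sensitive: false },
  { id: 'u2', pii_present: true, medical_sensitive: true },
];
console.log(visibleUnits(sampleUnits, { canViewSensitive: false }).map((u) => u.id)); // [ 'u1' ]
console.log(visibleUnits(sampleUnits, { canViewSensitive: true }).map((u) => u.id)); // [ 'u1', 'u2' ]
```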

5.4 Quality Metrics

Track per source:

Extraction confidence distribution
Entity resolution rate
Units per source (extraction yield)
Search hit rate (are these units being found?)
False positive rate (found but irrelevant)

6. Current Systems Audit

6.1 Existing Embedding Usage

System	Table	Dims	Status	Upgrade Plan
Discord	`comms.discord_messages`	3072	Active	Migrate to unified schema
Resources	`resources.chunks`	3072	Active	Add semantic decomposition
Chat Context	`core_ops.session_summaries`	1536	Active	Upgrade dims + add cross-user learning
HR Policies	`hr_policies.policy_vectors`	1536	Active	Upgrade dims + decomposition
History Ribbon	`core_ops.history_ribbon_*`	768	Legacy	Deprecate or upgrade

6.2 Migration Priority

Discord - High volume, needs decomposition
Resources/KB - Important reference material
HR Policies - Critical for compliance queries
Chat Context - Enable cross-user learning
History Ribbon - Evaluate usage first

7. Unified Schema

7.1 Sources Table (Immutable Raw Content)

CREATE TABLE embeddings.sources (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  container_type VARCHAR(50) NOT NULL,
  origin_system VARCHAR(50) NOT NULL,
  origin_subtype VARCHAR(100),
  external_id VARCHAR(255),
  author_id VARCHAR(255),
  author_display VARCHAR(255),
  created_at TIMESTAMPTZ NOT NULL,
  raw_text TEXT NOT NULL,
  raw_payload JSONB,
  extraction_status VARCHAR(20) DEFAULT 'pending',
  extracted_at TIMESTAMPTZ,
  schema_version INT DEFAULT 1,
  UNIQUE (origin_system, external_id)
);

7.2 Units Table (Searchable Atoms)

CREATE TABLE embeddings.units (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source_id UUID REFERENCES embeddings.sources(id),
  unit_type VARCHAR(50) NOT NULL,
  text_for_embedding TEXT NOT NULL,
  supporting_text TEXT,
  
  -- Multiple embedding strategies
  embedding_atomic vector(3072),
  embedding_contextual vector(3072),
  embedding_summary vector(3072),
  embedding_model VARCHAR(50),
  
  -- Filtering metadata
  domain_tags TEXT[],
  topic_tags TEXT[],
  entity_refs JSONB DEFAULT '{}',
  record_scope VARCHAR(20),
  
  -- Time relevance
  effective_at TIMESTAMPTZ,
  is_time_sensitive BOOLEAN DEFAULT false,
  
  -- Authority
  author_role VARCHAR(100),
  is_official_record BOOLEAN DEFAULT false,
  
  -- Quality
  confidence DECIMAL(3,2) DEFAULT 1.0,
  
  -- Safety
  pii_present BOOLEAN DEFAULT false,
  medical_sensitive BOOLEAN DEFAULT false
);

7.3 Indexes

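-- Vector similarity (HNSW for fast ANN)
-- pgvector indexes the vector type up to 2,000 dimensions only, so the
-- 3072-dim columns are indexed through a halfvec cast (pgvector 0.7.0+).
-- Queries must use the same cast expression to hit the index.
CREATE INDEX idx_units_atomic ON embeddings.units
  USING hnsw ((embedding_atomic::halfvec(3072)) halfvec_cosine_ops);
CREATE INDEX idx_units_contextual ON embeddings.units
  USING hnsw ((embedding_contextual::halfvec(3072)) halfvec_cosine_ops);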

-- Filtering
CREATE INDEX idx_units_domain ON embeddings.units USING GIN(domain_tags);
CREATE INDEX idx_units_entities ON embeddings.units USING GIN(entity_refs);
CREATE INDEX idx_units_time ON embeddings.units(effective_at);
CREATE INDEX idx_units_official ON embeddings.units(is_official_record) 
  WHERE is_official_record;
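
To show how these indexes might be exercised together, the sketch below builds a parameterised hybrid query against `embeddings.units`. The function name and option names are hypothetical. The distance expression is kept in one constant because it has to match the expression the vector index is built on; if the index is defined over a `halfvec` cast, the same cast belongs here.

```javascript
// Must match the indexed expression, e.g.
// 'embedding_atomic::halfvec(3072) <=> $1::halfvec(3072)' for a halfvec index.
const DISTANCE_EXPR = 'embedding_atomic <=> $1::vector';

// Returns { text, values } for a hybrid search: vector order + metadata filters.
function buildHybridSearchQuery({ queryEmbedding, domains, participantId, since, limit = 20 }) {
  const text = `
    SELECT id, source_id, text_for_embedding,
           1 - (${DISTANCE_EXPR}) AS similarity
    FROM embeddings.units
    WHERE domain_tags && $2::text[]
      AND entity_refs @> $3::jsonb
      AND effective_at >= $4
    ORDER BY ${DISTANCE_EXPR}
    LIMIT $5`;
  const values = [
    `[${queryEmbedding.join(',')}]`, // pgvector text literal
    domains,
    JSON.stringify({ participant_ids: [participantId] }),
    since,
    limit,
  ];
  return { text, values };
}

const hybridQuery = buildHybridSearchQuery({
  queryEmbedding: [0.1, 0.2],
  domains: ['participants'],
  participantId: 'sarah-uuid',
  since: new Date('2026-01-14T00:00:00Z'),
});
console.log(hybridQuery.values[0]); // [0.1,0.2]
```

The `&&` and `@>` operators are the ones the GIN indexes on `domain_tags` and `entity_refs` can serve, and `<=>` is pgvector's cosine distance, so `1 - distance` gives cosine similarity. The returned object has the `{ text, values }` shape accepted by node-postgres query calls.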

8. Implementation Status

Last updated: 2026-01-21

Architecture Implemented

Unified Server-Integrated Processor - Instead of a separate worker script, the embedding processor runs inside the main server as a continuous background loop. This ensures:

Single code path for backfill AND ongoing content
No "different versions" problem between scripts
Self-healing (auto-restarts with server)
Monitorable via API endpoints

┌─────────────────────────────────────────────────────────────┐
│                    Content Sources                          │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Discord Bot     │ Resources KB    │ Future (emails, etc)    │
│ (polling sync)  │ (PDF/MD ingest) │                         │
└────────┬────────┴────────┬────────┴────────┬────────────────┘
         │                 │                 │
         │  ingestToNew    │  ingestChunk    │
         │  Embeddings()   │  ToNewEmbed()   │
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────┐
│              embeddings.sources (pending queue)             │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│          embedding-processor.js (runs in server)            │
│                                                             │
│   while (running) {                                         │
│     1. Check for pending sources                            │
│     2. If found, process one through multi-pass LLM         │
│     3. Store units + embeddings                             │
│     4. Sleep (1s normal, 500ms overnight)                   │
│     5. Repeat                                               │
│   }                                                         │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│              embeddings.units (searchable atoms)            │
│   - 3 embeddings per unit (atomic, contextual, summary)     │
│   - Rich metadata (entities, sentiment, intent, domains)    │
│   - Linked back to original source                          │
└─────────────────────────────────────────────────────────────┘
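
The loop in the diagram might look roughly like this in JavaScript. The overnight window (22:00 to 06:00 local time) and the injected `fetchNextPending`, `processSource`, and `isRunning` functions are assumptions for illustration; the real `embedding-processor.js` may be structured differently.

```javascript
// Assumed overnight window, in local server time.
const OVERNIGHT_START_HOUR = 22;
const OVERNIGHT_END_HOUR = 6;

// 1s between iterations normally, 500ms overnight.
function sleepMsFor(date) {
  const hour = date.getHours();
  const overnight = hour >= OVERNIGHT_START_HOUR || hour < OVERNIGHT_END_HOUR;
  return overnight ? 500 : 1000;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Processes pending sources one at a time until isRunning() returns false.
async function runProcessor({ fetchNextPending, processSource, isRunning }) {
  while (isRunning()) {
    const source = await fetchNextPending();
    if (source) {
      try {
        await processSource(source); // multi-pass LLM, then store units + embeddings
      } catch (err) {
        // One bad source should not stop the loop.
        console.error(`Failed to process source ${source.id}:`, err);
      }
    }
    await sleep(sleepMsFor(new Date()));
  }
}
```

Injecting `isRunning` gives pause and resume controls a single flag to flip, and catching errors per source keeps the loop alive when one document fails extraction.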

Phase 1: Foundation ✅ COMPLETE

Create embeddings schema
Create sources and units tables
Create extraction_queue table
Create indexes (IVFFlat for 3072 dims)
Build ingest functions for each source type

Phase 2: Pipeline ✅ COMPLETE

Build LLM extraction prompts (anti-hallucination rules)
Build verification pass (confidence scoring)
Build enrichment pass (entities, sentiment, intent, why_context)
Build multi-strategy embedding (atomic, contextual, summary)
Thread context fetching (Discord: 10 preceding messages)

Phase 3: Migration ✅ COMPLETE (Queued)

Discord messages ingested (115,740 sources queued)
Resources/KB ingested (89 sources queued)
HR Policies - future
Chat Context - future

Phase 4: Search ✅ COMPLETE

Build searchWithSource() - returns units + original content
Multi-strategy search (tries atomic, contextual, summary)
Metadata filtering (domains, entities, sentiment)
API endpoint: GET /api/v1/dashboard/semantic-search
API endpoint: GET /api/v1/dashboard/semantic-stats
LLM re-ranking - future enhancement

Phase 5: Integration ✅ COMPLETE

Discord search page uses new semantic search (with legacy fallback)
Processor control endpoints (pause/resume)
Discord bot hooks into new system (new messages auto-queued)
Resource KB hooks into new system (new docs auto-queued)
Update Reggie SMS to use new search - future
Update analytics widgets - future

Files Created/Modified

File	Purpose
`backend/services/embedding-extraction.js`	Multi-pass LLM pipeline
`backend/services/embedding-processor.js`	Server-integrated background worker
`backend/services/embedding-worker.js`	CLI tool for bulk ingestion/status
`backend/routes_v1p/dashboard.js`	API endpoints for search & stats
`bot/discord-kb-sync.js`	Added new embedding ingest hook
`backend/services/resource-kb.js`	Added new embedding ingest hook
`admin/src/js/pages/page_discord_search.js`	Updated to use semantic search

SQL Migrations

File	Purpose
`20260121_embeddings_unified_schema.sql`	Core schema (sources, units, queue)
`20260121_embeddings_metadata_enrichment.sql`	Enrichment columns + functions

Current Queue Status

As of 2026-01-21:

115,829 sources queued for processing
Discord: 115,740 messages
Resources: 89 chunks
Processing rate: ~30s per source (multi-pass LLM)
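Estimated time: ~40 days for the full backlog at ~30s per source if sources are processed one at a time (115,829 × 30 s ≈ 3.47M s); the worker runs continuously and prioritizes new content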

To Start Processing

Just start the server - the processor starts automatically:
```
cd backend
node server.js
```

Monitor progress via API or CLI:

# CLI status
node services/embedding-worker.js status

# Or via API
curl http://localhost:3009/api/v1/dashboard/semantic-stats

Control the processor if needed:

# Pause processing
curl -X POST http://localhost:3009/api/v1/dashboard/embedding-processor/pause

# Resume processing  
curl -X POST http://localhost:3009/api/v1/dashboard/embedding-processor/resume

9. Cost Estimates

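Item	Unit Cost	Monthly Volume	Monthly Cost
Embedding (3072 dim)	$0.13/1M tokens	~0.5M tokens	~$0.07
LLM Extraction (GPT-4o-mini)	$0.15/1M tokens	~2M tokens	~$0.30
LLM Verification (GPT-4o-mini)	$0.15/1M tokens	~1M tokens	~$0.15
LLM Re-ranking (GPT-4o-mini)	$0.15/1M tokens	~0.5M tokens	~$0.08
Itemized subtotal			~$0.60
Total (budget ceiling)			~$5-10/month

The itemized rows sum to roughly $0.60/month, so the ~$5-10/month total is best read as a budget ceiling with headroom rather than the sum of the rows.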

10. Success Criteria

Search Quality: 80%+ of test queries return expected results in top 5
Coverage: All major content types indexed (Discord, Resources, Policies, Reports)
Latency: Search returns in <500ms
Freshness: New content indexed within 5 minutes
Maintainability: Adding new source type requires no schema changes

Changelog

Date	Change	Author
2026-01-21	Initial document created	Droid
2026-01-21	Full implementation complete - schema, pipeline, search, integrations	Droid
2026-01-21	Changed to server-integrated processor (vs separate worker script)	Droid
2026-01-21	Added hooks to Discord bot and Resource KB for automatic ingestion	Droid
2026-01-21	Queued 115,829 sources for processing (115,740 Discord + 89 Resources)	Droid
