Embedding & Semantic Search Strategy

Created: 2026-01-21
Status: Active - Living Document
Related: AI/Agent System, Knowledge Base, Chat Context

Executive Summary

This document defines our strategy for embedding, enriching, and searching content across all data sources. Our philosophy is quality over scale - with ~300 users, we can afford multi-pass LLM extraction, parallel embedding strategies, and thorough quality checks that would be cost-prohibitive at scale.

Core principle: Each piece of content is gold that deserves microscopic analysis.


1. Philosophy & Approach

1.1 Quality-First Mindset

We are NOT a scale app. We have:

  • ~20 heavily active users
  • ~100 moderate users
  • ~200 casual users

This means we CAN:

  • Run multi-pass LLM extraction (2-4 rounds per document)
  • Employ parallel embedding strategies (atomic + context + summary)
  • Use expensive models for quality checks (GPT-4o verification)
  • Re-process everything when extraction logic improves
  • Treat each document like it's being prepared for expert review

Cost reality: ~$5-10/month for extremely high-quality search. A single hour of staff time saved pays for a year.

1.2 Embeddings Are Coordinates, Not Symbols

Key mental model:

  • Vectors cannot be "decoded" back to text
  • Search is pure math (cosine similarity)
  • All intelligence happens before (preparation) and after (ranking)
  • The embedding itself is just coordinates in meaning-space
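
To make the "search is pure math" point concrete, here is a minimal sketch of cosine similarity over plain arrays. The function name is illustrative and not taken from the codebase; in production the database would normally do this comparison.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; 1 means the vectors point the same way.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // same direction
console.log(cosineSimilarity([1, 0], [0, 1])); // orthogonal
```

Nothing in this calculation knows about recency, authority, or entities, which is why those signals are handled outside the vector.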

1.3 Separate Concerns

| Stage | Responsibility | When Intelligence Happens |
|---|---|---|
| Ingestion | Store raw content (immutable) | None - just preservation |
| Extraction | Decompose into semantic units | LLM does the thinking |
| Enrichment | Add metadata, entities, tags | LLM + rules |
| Embedding | Generate vectors | API call (no intelligence) |
| Search | Similarity + filtering + ranking | App logic + optional LLM re-rank |

1.4 Metadata Outside Vectors

Critical principle:

  • Domain tags, entity refs, authority flags live in columns, not embedded text
  • Never encode time/authority INTO the embedding text
  • Vectors handle meaning; app handles trust/relevance/recency

2. Data Classification

2.1 Container Types (Stable Set)

These are the source formats - unlikely to change often:

| Container Type | Description | Examples |
|---|---|---|
| short_message | Brief informal text | Discord, SMS, chat turns |
| long_message | Extended text with structure | Emails, long posts |
| staff_report | Operational reports | Shift notes, incidents, supervision |
| transcript | Spoken word converted to text | Meeting transcripts, calls |
| resource_document | Official knowledge content | Policies, procedures, handbooks |
| structured_record | Field-based data | Form submissions, CRM entries, KPIs |
| task_or_ticket | Work items with status | Monday items, action registers |
| conversation_thread | Curated multi-message synthesis | Thread summaries, decision logs |
| file_data | Extracted text from files | PDF content, DOCX |
| image_description | Visual content as text | OCR, alt text, AI descriptions |

Rule: When a new source appears, map it to an existing container. Only add a new container if the extraction logic fundamentally differs.
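
One way to enforce this rule in code is a single lookup that every ingest path goes through. The sketch below is illustrative only: apart from `shift_notes`, the origin system names are hypothetical, and `containerTypeFor` is not an existing function.

```javascript
// The stable container set from the table above.
const KNOWN_CONTAINERS = new Set([
  'short_message', 'long_message', 'staff_report', 'transcript',
  'resource_document', 'structured_record', 'task_or_ticket',
  'conversation_thread', 'file_data', 'image_description',
]);

// Hypothetical origin-system names mapped onto existing containers.
const CONTAINER_BY_ORIGIN = new Map([
  ['discord', 'short_message'],
  ['sms', 'short_message'],
  ['email', 'long_message'],
  ['shift_notes', 'staff_report'],
  ['monday', 'task_or_ticket'],
]);

// Resolve an origin system to a container, failing loudly for unmapped
// sources so that nobody adds a new container by accident.
function containerTypeFor(originSystem) {
  const container = CONTAINER_BY_ORIGIN.get(originSystem);
  if (!container) {
    throw new Error(`No container mapping for origin system "${originSystem}"`);
  }
  if (!KNOWN_CONTAINERS.has(container)) {
    throw new Error(`"${container}" is not in the stable container set`);
  }
  return container;
}

console.log(containerTypeFor('shift_notes'));
```

Adding a new source then means adding one line to the map, which keeps the container list itself stable.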

2.2 Unit Types (What We Extract)

Each extracted semantic unit gets classified:

Facts & Descriptions

  • general_fact - Timeless objective facts
  • descriptive_fact - Attributes of person/place/object
  • state_fact - Current or recent state
  • historical_fact - Past event or outcome

Actions & Events

  • action_completed - Something done
  • action_in_progress - Ongoing work
  • action_required - Requested/expected action
  • event - Something that happened

Decisions & Intent

  • decision - Explicit choice or resolution
  • intent - Stated plan or aim
  • commitment - Promise or obligation

Knowledge & Rules

  • policy - Formal rule or requirement
  • procedure - Step-based instructions
  • exception - Override or edge case
  • definition - Meaning of a term

Questions & Uncertainty

  • question - Explicit query
  • assumption - Belief without confirmation
  • unknown - Explicit lack of knowledge

Risk & Assessment

  • risk - Potential harm or issue
  • assessment - Evaluation or judgment
  • status_update - Progress or state change

Meta/Administrative

  • reference - Pointer to external info
  • note - Contextual but non-actionable

2.3 Domain Tags

Organization-specific topic areas for filtering:

| Domain | Description |
|---|---|
| roster | Scheduling, shifts, availability |
| payroll | Pay, timesheets, deductions |
| hr | Employment, contracts, policies |
| incidents | Safety, accidents, near-misses |
| participants | Client/participant matters |
| vehicles | Fleet, maintenance, bookings |
| training | Qualifications, certifications |
| ndis | Funding, plans, compliance |
| facilities | Buildings, equipment, maintenance |
| communications | Internal comms, announcements |

3. Ingestion Pipeline

3.1 Overview

Raw Content
|
v
+-------------------------------------------------------+
| Stage 1: INGEST |
| - Store raw content immutably |
| - Classify container type |
| - Record provenance (source, author, timestamp) |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 2: EXTRACT (LLM Pass 1) |
| - Decompose into atomic semantic units |
| - Classify each unit by type |
| - Identify entities (staff, participants, etc.) |
| - Tag with domains |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 3: VERIFY (LLM Pass 2) |
| - Quality check extraction |
| - Flag low confidence units |
| - Identify potential gaps |
| - Check entity resolution accuracy |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 4: ENRICH (LLM Pass 3 + Rules) |
| - Add inferred metadata |
| - Resolve entity references to IDs |
| - Generate search-optimized text |
| - Create unit summary + context variants |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 5: EMBED (API Call) |
| - Generate multiple embeddings per unit: |
| * Atomic (just the unit text) |
| * Contextual (unit + surrounding context) |
| * Summary (synthesized overview) |
| - Store all strategies with tags |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 6: INDEX |
| - Store in unified schema |
| - Update search indexes |
| - Link to source entities |
+-------------------------------------------------------+

3.2 Stage Details

Stage 1: Ingest

Store the raw content exactly as received:

{
  container_type: 'staff_report',
  origin_system: 'shift_notes',
  origin_subtype: 'daily_report',
  external_id: 'shift-2026-01-21-venue-a',
  author_id: 'staff-uuid',
  author_display: 'Jane Smith',
  created_at: '2026-01-21T08:00:00Z',
  raw_text: '... original content ...',
  raw_payload: { /* any structured metadata */ }
}

Immutability rule: Raw content never changes. If the source is updated, store the update as a new version rather than overwriting the original.

Stage 2: Extract

LLM prompt breaks content into atomic units:

Input: "Sarah arrived at 8am looking tired. She mentioned her dog was sick overnight. 
During morning activities she was engaged but needed reminders about her medication at 10am.
Note: Sarah's NDIS plan review is due next month."

Output:
- [state_fact] Sarah appeared tired on arrival at 8:00 AM
- [historical_fact] Sarah's dog was sick overnight (context for tiredness)
- [assessment] Sarah was engaged during morning activities
- [action_required] Sarah needed medication reminder at 10:00 AM
- [event] Medication reminder given at 10:00 AM (implied completed)
- [status_update] Sarah's NDIS plan review due next month

Each unit is self-contained and searchable independently.
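
As an illustration of turning that output into records, the sketch below parses lines of the form `- [unit_type] text`. It assumes the LLM returns exactly the line format shown above; the real pipeline may well use structured JSON instead, and `parseExtractionOutput` is a hypothetical name.

```javascript
// Matches lines such as "- [state_fact] Sarah appeared tired on arrival".
const UNIT_LINE = /^-\s*\[([a-z_]+)\]\s+(.+)$/;

// Split raw LLM output into typed units. Lines that look like units but
// carry an unknown type are returned separately so they can be reviewed.
function parseExtractionOutput(output, knownTypes) {
  const units = [];
  const rejected = [];
  for (const rawLine of output.split('\n')) {
    const line = rawLine.trim();
    const match = UNIT_LINE.exec(line);
    if (!match) continue; // headings, blank lines, commentary
    const [, unitType, text] = match;
    if (knownTypes.has(unitType)) {
      units.push({ unit_type: unitType, text: text.trim() });
    } else {
      rejected.push(line);
    }
  }
  return { units, rejected };
}

const sample = [
  'Output:',
  '- [state_fact] Sarah appeared tired on arrival at 8:00 AM',
  '- [mood] Sarah seemed cheerful',
  '- [event] Medication reminder given at 10:00 AM',
].join('\n');

console.log(parseExtractionOutput(sample, new Set(['state_fact', 'event'])));
```

Rejecting unknown types at parse time keeps the unit type list in section 2.2 authoritative.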

Stage 3: Verify

Second LLM pass reviews extraction quality:

Questions to answer:
- Did we miss any facts?
- Are entities correctly identified?
- Any ambiguous interpretations?
- Confidence score for each unit?

Low-confidence units get flagged for human review or excluded from high-trust searches.
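
The flagging step could look roughly like the sketch below. Both thresholds are placeholders chosen for illustration; the document does not specify cut-off values.

```javascript
// Sort verified units into trust buckets by confidence score (0..1).
// reviewBelow and highTrustFrom are assumed values, not documented ones.
function triageByConfidence(units, { reviewBelow = 0.5, highTrustFrom = 0.8 } = {}) {
  const result = { highTrust: [], standard: [], needsReview: [] };
  for (const unit of units) {
    if (unit.confidence < reviewBelow) {
      result.needsReview.push(unit); // flag for human review
    } else if (unit.confidence >= highTrustFrom) {
      result.highTrust.push(unit); // eligible for high-trust searches
    } else {
      result.standard.push(unit); // searchable, but not high-trust
    }
  }
  return result;
}

const triaged = triageByConfidence([
  { text: 'Sarah appeared tired on arrival', confidence: 0.95 },
  { text: 'Sarah was engaged during morning activities', confidence: 0.7 },
  { text: 'Medication reminder given (implied)', confidence: 0.3 },
]);
console.log(triaged.needsReview.map((u) => u.text));
```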

Stage 4: Enrich

Add metadata that helps with search:

{
  unit_type: 'action_required',
  text_for_embedding: 'Sarah needed medication reminder at 10:00 AM on January 21, 2026',
  domain_tags: ['participants', 'health'],
  entity_refs: {
    participant_ids: ['sarah-uuid'],
    staff_ids: ['author-uuid']
  },
  time: {
    effective_at: '2026-01-21T10:00:00Z',
    is_time_sensitive: true
  },
  authority: {
    author_role: 'support_worker',
    is_official_record: true
  },
  safety: {
    pii_present: true,
    medical_sensitive: true
  }
}

Stage 5: Embed

Generate multiple embeddings for each unit:

| Strategy | What It Embeds | Good For |
|---|---|---|
| atomic | Just the unit text | Precise fact retrieval |
| contextual | Unit + 2-3 surrounding units | Understanding in context |
| summary | Synthesized source overview | Broad topic search |

All use text-embedding-3-large (3072 dimensions) for consistency.
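
A minimal sketch of how the three input texts might be assembled before calling the embedding API. The function name and the window parameters are assumptions; the default of one unit on each side is only one reading of "2-3 surrounding units".

```javascript
// Build the text sent to the embedding API for each strategy.
// unitTexts: all unit texts from one source, in document order.
// index: position of the unit being embedded.
// sourceSummary: synthesized overview of the whole source.
function buildEmbeddingInputs(unitTexts, index, sourceSummary, { before = 1, after = 1 } = {}) {
  if (index < 0 || index >= unitTexts.length) {
    throw new RangeError(`Unit index ${index} is out of range`);
  }
  const start = Math.max(0, index - before);
  const end = Math.min(unitTexts.length, index + after + 1);
  return {
    atomic: unitTexts[index],
    contextual: unitTexts.slice(start, end).join(' '),
    summary: sourceSummary,
  };
}

const texts = [
  'Sarah appeared tired on arrival at 8:00 AM.',
  'Sarah was engaged during morning activities.',
  'Sarah needed medication reminder at 10:00 AM.',
];
console.log(buildEmbeddingInputs(texts, 1, 'Shift note about Sarah.'));
```

Each returned string would then be embedded separately and stored in its own column.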

Stage 6: Index

Store everything in the unified schema with proper indexes for:

  • Vector similarity (HNSW index)
  • Domain tag filtering (GIN index)
  • Entity lookups (GIN index on entity_refs)
  • Time range queries (B-tree on effective_at)

4. Search Pipeline

4.1 Query Processing

User Query: "What happened with Sarah's medication last week?"
|
v
+-------------------------------------------------------+
| Step 1: QUERY ANALYSIS |
| - Extract intent (information retrieval) |
| - Identify entities (Sarah, medication) |
| - Identify time constraints (last week) |
| - Identify domains (participants, health) |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 2: QUERY EXPANSION (Optional) |
| LLM generates search variants: |
| - "Sarah medication reminder" |
| - "Sarah health medication administration" |
| - "participant medication compliance" |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 3: HYBRID SEARCH |
| Combine multiple signals: |
| - Vector similarity (semantic match) |
| - Entity filter (participant = Sarah) |
| - Domain filter (health, participants) |
| - Time filter (last 7 days) |
| - Authority weight (official records higher) |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 4: RESULT FUSION |
| - Merge results from all strategies |
| - De-duplicate by source |
| - Score by combined relevance |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 5: LLM RE-RANK (Optional) |
| - Take top 20 candidates |
| - LLM scores relevance to original query |
| - Return top 5 with confidence |
+-------------------------------------------------------+
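
Step 4 (result fusion) can be sketched as below. This version keeps the best-scoring unit per source, which is one simple interpretation of "de-duplicate by source"; the actual combination rule is not specified here, and `fuseResults` is a hypothetical name.

```javascript
// Merge per-strategy result lists, keep the best hit per source,
// and return the survivors sorted by score (highest first).
function fuseResults(resultsByStrategy) {
  const bestBySource = new Map();
  for (const [strategy, results] of Object.entries(resultsByStrategy)) {
    for (const result of results) {
      const current = bestBySource.get(result.source_id);
      if (!current || result.score > current.score) {
        bestBySource.set(result.source_id, { ...result, strategy });
      }
    }
  }
  return [...bestBySource.values()].sort((a, b) => b.score - a.score);
}

const fused = fuseResults({
  atomic: [
    { unit_id: 'u1', source_id: 's1', score: 0.9 },
    { unit_id: 'u2', source_id: 's2', score: 0.5 },
  ],
  contextual: [{ unit_id: 'u3', source_id: 's2', score: 0.7 }],
});
console.log(fused.map((r) => `${r.unit_id} via ${r.strategy}`));
```

Recording which strategy produced each surviving hit is useful later when tuning the strategies against each other.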

4.2 Scoring Formula

final_score = (
    vector_similarity * 0.5
  + entity_match_bonus * 0.2
  + recency_score * 0.15
  + authority_score * 0.15
)

Weights are tunable per use case.
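
The formula translates directly into code. The sketch below assumes every signal is already normalized to the 0..1 range; the exponential recency decay and its 7-day half-life are illustrative assumptions, not documented behavior.

```javascript
// Default weights from the scoring formula above.
const DEFAULT_WEIGHTS = { vector: 0.5, entity: 0.2, recency: 0.15, authority: 0.15 };

// Hypothetical recency signal: halves every halfLifeDays.
function recencyScore(ageDays, halfLifeDays = 7) {
  return Math.pow(0.5, Math.max(0, ageDays) / halfLifeDays);
}

// Weighted sum of the four signals, each expected in 0..1.
function finalScore(signals, weights = DEFAULT_WEIGHTS) {
  return (
    signals.vectorSimilarity * weights.vector +
    signals.entityMatchBonus * weights.entity +
    signals.recencyScore * weights.recency +
    signals.authorityScore * weights.authority
  );
}

const score = finalScore({
  vectorSimilarity: 0.8,
  entityMatchBonus: 1, // the query entity matched
  recencyScore: recencyScore(7), // one week old
  authorityScore: 1, // official record
});
console.log(score.toFixed(3));
```

For this input the score works out to 0.4 + 0.2 + 0.075 + 0.15, or about 0.825. A per-mode weights object can be passed as the second argument when a use case needs different emphasis.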

4.3 Search Modes

| Mode | Description | Use Case |
|---|---|---|
| precise | Strict filters, high threshold | Finding specific facts |
| exploratory | Loose filters, broader results | Understanding a topic |
| entity_focused | Filter by entity first | "Everything about X" |
| temporal | Time-weighted heavily | "What happened recently" |

5. Data Cleaning & Quality

5.1 Noise Removal by Container Type

| Container | What to Remove |
|---|---|
| short_message | Chatter, emojis (unless meaningful), "ok/thanks" |
| long_message | Signatures, legal footers, quoted history, unsubscribe blocks |
| staff_report | Boilerplate headers, template text |
| transcript | Filler words ("um", "like"), false starts |
| resource_document | Table of contents, page numbers, headers/footers |
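
A few of these rules are simple enough to sketch with string handling. The helpers below are hypothetical and deliberately conservative: the filler pattern strips only "um" and "uh", because "like" is usually a real word and needs context to remove safely.

```javascript
// short_message: acknowledgements that carry no searchable content.
const LOW_VALUE_MESSAGES = new Set(['ok', 'okay', 'thanks', 'thank you']);

function isLowValueMessage(text) {
  // Drop punctuation and emoji, then compare the remaining words.
  const normalized = text.toLowerCase().replace(/[^a-z\s]/g, '').trim();
  return LOW_VALUE_MESSAGES.has(normalized);
}

// transcript: remove standalone "um"/"uh" plus a trailing comma.
function stripFillers(text) {
  return text
    .replace(/\b(?:um+|uh+)\b,?\s*/gi, '')
    .replace(/\s{2,}/g, ' ')
    .trim();
}

// long_message: drop quoted reply history (lines starting with ">").
function stripQuotedHistory(text) {
  return text
    .split('\n')
    .filter((line) => !line.trimStart().startsWith('>'))
    .join('\n')
    .trim();
}

console.log(isLowValueMessage('Thanks 👍'));
console.log(stripFillers('Um, so we, uh, moved the van'));
console.log(stripQuotedHistory('Approved.\n> On Monday you wrote:\n> please approve'));
```

Cleaning runs on a copy of the text for extraction; the raw content in the sources table stays untouched.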

5.2 Entity Resolution

When content mentions people/things:

  1. Extract mention text ("Sarah", "the blue van")
  2. Fuzzy match against known entities
  3. If confident (>80%), link to entity ID
  4. If uncertain, flag for manual resolution
  5. Store both mention text AND resolved ID
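
The steps above can be sketched with a plain edit-distance matcher. This is an assumption about what "fuzzy match" means here; the names `nameSimilarity` and `resolveMention` and the entity shape are illustrative only.

```javascript
// Classic Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      row[j] = Math.min(
        above + 1, // deletion
        row[j - 1] + 1, // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// Similarity in 0..1, where 1 means identical after normalization.
function nameSimilarity(a, b) {
  const x = a.trim().toLowerCase();
  const y = b.trim().toLowerCase();
  const longest = Math.max(x.length, y.length);
  return longest === 0 ? 1 : 1 - levenshtein(x, y) / longest;
}

// Link only when the best match clears the threshold; otherwise flag it.
// The mention text is always kept alongside the resolved ID.
function resolveMention(mentionText, knownEntities, threshold = 0.8) {
  let best = null;
  for (const entity of knownEntities) {
    const score = nameSimilarity(mentionText, entity.name);
    if (!best || score > best.score) best = { entity, score };
  }
  const confident = best !== null && best.score > threshold;
  return {
    mention_text: mentionText,
    entity_id: confident ? best.entity.id : null,
    match_score: best ? best.score : 0,
    status: confident ? 'linked' : 'needs_review',
  };
}

const people = [
  { id: 'p-1', name: 'Sarah Jones' },
  { id: 'p-2', name: 'Sam Jonas' },
];
console.log(resolveMention('Sara Jones', people).status);
console.log(resolveMention('Sarah', people).status);
```

A first-name-only mention such as "Sarah" scores well below the threshold against full names, so it lands in manual review. A fuller resolver would also use context such as the venue or the author's participant list.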

5.3 PII & Sensitivity Handling

Flag units containing:

  • Full names + identifying details
  • Medical information
  • Financial data
  • Contact details
  • Incident details naming individuals

Flagged units:

  • Still searchable by authorized users
  • Excluded from cross-user knowledge extraction
  • Redacted in certain search contexts

5.4 Quality Metrics

Track per source:

  • Extraction confidence distribution
  • Entity resolution rate
  • Units per source (extraction yield)
  • Search hit rate (are these units being found?)
  • False positive rate (found but irrelevant)

6. Current Systems Audit

6.1 Existing Embedding Usage

| System | Table | Dims | Status | Upgrade Plan |
|---|---|---|---|---|
| Discord | comms.discord_messages | 3072 | Active | Migrate to unified schema |
| Resources | resources.chunks | 3072 | Active | Add semantic decomposition |
| Chat Context | core_ops.session_summaries | 1536 | Active | Upgrade dims + add cross-user |
| HR Policies | hr_policies.policy_vectors | 1536 | Active | Upgrade dims + decomposition |
| History Ribbon | core_ops.history_ribbon_* | 768 | Legacy | Deprecate or upgrade |

6.2 Migration Priority

  1. Discord - High volume, needs decomposition
  2. Resources/KB - Important reference material
  3. HR Policies - Critical for compliance queries
  4. Chat Context - Enable cross-user learning
  5. History Ribbon - Evaluate usage first

7. Unified Schema

7.1 Sources Table (Immutable Raw Content)

CREATE TABLE embeddings.sources (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    container_type VARCHAR(50) NOT NULL,
    origin_system VARCHAR(50) NOT NULL,
    origin_subtype VARCHAR(100),
    external_id VARCHAR(255),
    author_id VARCHAR(255),
    author_display VARCHAR(255),
    created_at TIMESTAMPTZ NOT NULL,
    raw_text TEXT NOT NULL,
    raw_payload JSONB,
    extraction_status VARCHAR(20) DEFAULT 'pending',
    extracted_at TIMESTAMPTZ,
    schema_version INT DEFAULT 1,
    UNIQUE (origin_system, external_id)
);

7.2 Units Table (Searchable Atoms)

CREATE TABLE embeddings.units (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_id UUID REFERENCES embeddings.sources(id),
    unit_type VARCHAR(50) NOT NULL,
    text_for_embedding TEXT NOT NULL,
    supporting_text TEXT,

    -- Multiple embedding strategies
    embedding_atomic vector(3072),
    embedding_contextual vector(3072),
    embedding_summary vector(3072),
    embedding_model VARCHAR(50),

    -- Filtering metadata
    domain_tags TEXT[],
    topic_tags TEXT[],
    entity_refs JSONB DEFAULT '{}',
    record_scope VARCHAR(20),

    -- Time relevance
    effective_at TIMESTAMPTZ,
    is_time_sensitive BOOLEAN DEFAULT false,

    -- Authority
    author_role VARCHAR(100),
    is_official_record BOOLEAN DEFAULT false,

    -- Quality
    confidence DECIMAL(3,2) DEFAULT 1.0,

    -- Safety
    pii_present BOOLEAN DEFAULT false,
    medical_sensitive BOOLEAN DEFAULT false
);

7.3 Indexes

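-- Vector similarity (HNSW for fast ANN).
-- Assumes pgvector >= 0.7.0. HNSW and IVFFlat indexes on the vector type are
-- limited to 2,000 dimensions, so the 3072-dim columns are indexed through a
-- halfvec cast (limit 4,000). Queries must ORDER BY the same cast expression
-- for the index to be used.
CREATE INDEX idx_units_atomic ON embeddings.units
    USING hnsw ((embedding_atomic::halfvec(3072)) halfvec_cosine_ops);
CREATE INDEX idx_units_contextual ON embeddings.units
    USING hnsw ((embedding_contextual::halfvec(3072)) halfvec_cosine_ops);

-- Filtering
CREATE INDEX idx_units_domain ON embeddings.units USING GIN(domain_tags);
CREATE INDEX idx_units_entities ON embeddings.units USING GIN(entity_refs);
CREATE INDEX idx_units_time ON embeddings.units(effective_at);
CREATE INDEX idx_units_official ON embeddings.units(is_official_record)
    WHERE is_official_record;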
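
To show how these indexes are meant to be exercised, here is a hedged sketch of a parameterized query builder. `buildUnitSearchQuery` and its option names are hypothetical. The operators are standard pgvector and PostgreSQL ones: `<=>` is cosine distance, `&&` is array overlap, and `@>` is JSONB containment. The `halfvecDims` option exists because the ORDER BY expression has to match the indexed expression for the ANN index to be used.

```javascript
// Build a filtered nearest-neighbour query over embeddings.units.
// Returns { text, values } in the shape accepted by node-postgres.
function buildUnitSearchQuery({ queryEmbedding, domains, participantId, since, limit = 20, halfvecDims }) {
  // pgvector accepts the "[1,2,3]" text form, which JSON.stringify produces.
  const values = [JSON.stringify(queryEmbedding)];
  const castType = Number.isInteger(halfvecDims) ? `halfvec(${halfvecDims})` : null;
  const column = castType ? `(embedding_atomic::${castType})` : 'embedding_atomic';
  const param = castType ? `$1::${castType}` : '$1::vector';

  const where = ['embedding_atomic IS NOT NULL'];
  if (domains && domains.length > 0) {
    values.push(domains);
    where.push(`domain_tags && $${values.length}::text[]`); // GIN on domain_tags
  }
  if (participantId) {
    values.push(JSON.stringify({ participant_ids: [participantId] }));
    where.push(`entity_refs @> $${values.length}::jsonb`); // GIN on entity_refs
  }
  if (since) {
    values.push(since);
    where.push(`effective_at >= $${values.length}`); // B-tree on effective_at
  }
  values.push(limit);

  const text = [
    'SELECT id, source_id, text_for_embedding,',
    `       1 - (${column} <=> ${param}) AS similarity`,
    'FROM embeddings.units',
    `WHERE ${where.join(' AND ')}`,
    `ORDER BY ${column} <=> ${param}`,
    `LIMIT $${values.length}`,
  ].join('\n');
  return { text, values };
}

const query = buildUnitSearchQuery({
  queryEmbedding: [0.1, 0.2],
  domains: ['participants'],
  participantId: 'sarah-uuid',
  since: '2026-01-14T00:00:00Z',
  limit: 5,
});
console.log(query.text);
```

All user-supplied values travel as bound parameters. Only the cast type is interpolated into the SQL, and it is built from an integer check rather than from free text.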

8. Implementation Status

Last updated: 2026-01-21

Architecture Implemented

Unified Server-Integrated Processor - Instead of a separate worker script, the embedding processor runs inside the main server as a continuous background loop. This ensures:

  • Single code path for backfill AND ongoing content
  • No "different versions" problem between scripts
  • Self-healing (auto-restarts with server)
  • Monitorable via API endpoints

┌─────────────────────────────────────────────────────────────┐
│ Content Sources                                             │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Discord Bot     │ Resources KB    │ Future (emails, etc)    │
│ (polling sync)  │ (PDF/MD ingest) │                         │
└────────┬────────┴────────┬────────┴────────┬────────────────┘
         │                 │                 │
         │ ingestToNew     │ ingestChunk     │
         │ Embeddings()    │ ToNewEmbed()    │
         ▼                 ▼                 ▼
┌─────────────────────────────────────────────────────────────┐
│ embeddings.sources (pending queue)                          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ embedding-processor.js (runs in server)                     │
│                                                             │
│ while (running) {                                           │
│   1. Check for pending sources                              │
│   2. If found, process one through multi-pass LLM           │
│   3. Store units + embeddings                               │
│   4. Sleep (1s normal, 500ms overnight)                     │
│   5. Repeat                                                 │
│ }                                                           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│ embeddings.units (searchable atoms)                         │
│ - 3 embeddings per unit (atomic, contextual, summary)       │
│ - Rich metadata (entities, sentiment, intent, domains)      │
│ - Linked back to original source                            │
└─────────────────────────────────────────────────────────────┘
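
The loop in the diagram can be sketched as a small factory with injected dependencies. This is not the contents of embedding-processor.js; `createProcessor`, `fetchNextPending`, `processSource`, and `sleep` are hypothetical names used to show the control flow, including the pause/resume behavior.

```javascript
// Background processing loop with pause, resume, and stop controls.
// Dependencies are injected so the loop can be tested without a database.
function createProcessor({ fetchNextPending, processSource, sleep, idleMs = 1000 }) {
  let running = false;
  let paused = false;

  async function run() {
    running = true;
    while (running) {
      if (!paused) {
        const source = await fetchNextPending(); // 1. check for pending sources
        if (source) {
          try {
            await processSource(source); // 2-3. multi-pass LLM, store units
          } catch (err) {
            // One bad source must not stop the loop.
            console.error(`Failed to process source ${source.id}: ${err.message}`);
          }
        }
      }
      await sleep(idleMs); // 4. sleep, then 5. repeat
    }
  }

  return {
    run,
    pause: () => { paused = true; },
    resume: () => { paused = false; },
    stop: () => { running = false; },
    isPaused: () => paused,
  };
}
```

In the server, `sleep` would typically be a promise around `setTimeout`, with `idleMs` switched between the normal and overnight values from the diagram. Catching errors per source is a design choice so that a single failing document cannot stall the whole queue.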

Phase 1: Foundation ✅ COMPLETE

  • Create embeddings schema
  • Create sources and units tables
  • Create extraction_queue table
  • Create indexes (IVFFlat for 3072 dims)
  • Build ingest functions for each source type

Phase 2: Pipeline ✅ COMPLETE

  • Build LLM extraction prompts (anti-hallucination rules)
  • Build verification pass (confidence scoring)
  • Build enrichment pass (entities, sentiment, intent, why_context)
  • Build multi-strategy embedding (atomic, contextual, summary)
  • Thread context fetching (Discord: 10 preceding messages)

Phase 3: Migration ✅ COMPLETE (Queued)

  • Discord messages ingested (115,740 sources queued)
  • Resources/KB ingested (89 sources queued)
  • HR Policies - future
  • Chat Context - future

Phase 4: Search ✅ COMPLETE

  • Build searchWithSource() - returns units + original content
  • Multi-strategy search (tries atomic, contextual, summary)
  • Metadata filtering (domains, entities, sentiment)
  • API endpoint: GET /api/v1/dashboard/semantic-search
  • API endpoint: GET /api/v1/dashboard/semantic-stats
  • LLM re-ranking - future enhancement

Phase 5: Integration ✅ COMPLETE

  • Discord search page uses new semantic search (with legacy fallback)
  • Processor control endpoints (pause/resume)
  • Discord bot hooks into new system (new messages auto-queued)
  • Resource KB hooks into new system (new docs auto-queued)
  • Update Reggie SMS to use new search - future
  • Update analytics widgets - future

Files Created/Modified

| File | Purpose |
|---|---|
| backend/services/embedding-extraction.js | Multi-pass LLM pipeline |
| backend/services/embedding-processor.js | Server-integrated background worker |
| backend/services/embedding-worker.js | CLI tool for bulk ingestion/status |
| backend/routes_v1p/dashboard.js | API endpoints for search & stats |
| bot/discord-kb-sync.js | Added new embedding ingest hook |
| backend/services/resource-kb.js | Added new embedding ingest hook |
| admin/src/js/pages/page_discord_search.js | Updated to use semantic search |

SQL Migrations

| File | Purpose |
|---|---|
| 20260121_embeddings_unified_schema.sql | Core schema (sources, units, queue) |
| 20260121_embeddings_metadata_enrichment.sql | Enrichment columns + functions |

Current Queue Status

As of 2026-01-21:

  • 115,829 sources queued for processing
  • Discord: 115,740 messages
  • Resources: 89 chunks
  • Processing rate: ~30s per source (multi-pass LLM)
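  • Estimated time: about 40 days to clear the current backlog if sources are processed one at a time at ~30s each (115,829 × 30 s ≈ 3.47 million seconds). The worker runs continuously and prioritizes new content.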
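
The backlog duration follows directly from the queue size and the per-source processing rate listed above. A small sketch of the arithmetic, assuming strictly sequential processing and ignoring the sleep between iterations; the `workers` parameter is hypothetical:

```javascript
const SECONDS_PER_DAY = 86400;

// Days needed to drain a queue at a fixed per-source processing time.
function estimateBacklogDays(queuedSources, secondsPerSource, workers = 1) {
  return (queuedSources * secondsPerSource) / workers / SECONDS_PER_DAY;
}

console.log(estimateBacklogDays(115829, 30).toFixed(1)); // "40.2"
```

At roughly 40 days for a full pass, any change to the extraction logic that requires re-processing everything carries a comparable wait unless throughput improves.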

To Start Processing

  1. Just start the server - the processor starts automatically:

    cd backend
    node server.js

  2. Monitor progress via API or CLI:

    # CLI status
    node services/embedding-worker.js status

    # Or via API
    curl http://localhost:3009/api/v1/dashboard/semantic-stats

  3. Control the processor if needed:

    # Pause processing
    curl -X POST http://localhost:3009/api/v1/dashboard/embedding-processor/pause

    # Resume processing
    curl -X POST http://localhost:3009/api/v1/dashboard/embedding-processor/resume

9. Cost Estimates

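| Item | Unit Cost | Monthly Volume | Monthly Cost |
|---|---|---|---|
| Embedding (3072 dim) | $0.00013/1K tokens | ~500K tokens | ~$0.07 |
| LLM Extraction (GPT-4o-mini) | $0.15/1M tokens | ~2M tokens | ~$0.30 |
| LLM Verification (GPT-4o-mini) | $0.15/1M tokens | ~1M tokens | ~$0.15 |
| LLM Re-ranking (GPT-4o-mini) | $0.15/1M tokens | ~500K tokens | ~$0.08 |
| Itemized subtotal | | | ~$0.60 |
| Total (stated budget; the line items above account for only ~$0.60 of it) | | | ~$5-10/month |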
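
As a cross-check on the table, the sketch below recomputes each line item from its unit price and monthly volume. The prices and volumes are copied from the table, not verified against current vendor pricing.

```javascript
// Line items from the cost table, with prices expressed per million tokens.
const LINE_ITEMS = [
  { item: 'Embedding (3072 dim)', usdPerMillionTokens: 0.13, monthlyTokens: 500000 },
  { item: 'LLM Extraction', usdPerMillionTokens: 0.15, monthlyTokens: 2000000 },
  { item: 'LLM Verification', usdPerMillionTokens: 0.15, monthlyTokens: 1000000 },
  { item: 'LLM Re-ranking', usdPerMillionTokens: 0.15, monthlyTokens: 500000 },
];

// Monthly cost of one line item in USD.
function monthlyCost({ usdPerMillionTokens, monthlyTokens }) {
  return (usdPerMillionTokens * monthlyTokens) / 1000000;
}

const itemizedTotal = LINE_ITEMS.reduce((sum, line) => sum + monthlyCost(line), 0);
console.log(itemizedTotal.toFixed(2)); // "0.59"
```

The itemized volumes come to about $0.59 per month (about $0.60 with the table's per-row rounding), so the ~$5-10/month figure leaves substantial headroom beyond the volumes listed.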

10. Success Criteria

  1. Search Quality: 80%+ of test queries return expected results in top 5
  2. Coverage: All major content types indexed (Discord, Resources, Policies, Reports)
  3. Latency: Search returns in <500ms
  4. Freshness: New content indexed within 5 minutes
  5. Maintainability: Adding new source type requires no schema changes
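
Criterion 1 can be measured with a small evaluation harness. The sketch below assumes a hand-built list of test queries, each with one expected source, and a synchronous `search` function for simplicity; a real harness would await the search API. All names are illustrative.

```javascript
// Fraction of test queries whose expected source appears in the top k results.
function hitRateAtK(testCases, search, k = 5) {
  if (testCases.length === 0) return 0;
  let hits = 0;
  for (const { query, expectedSourceId } of testCases) {
    const topK = search(query).slice(0, k);
    if (topK.some((result) => result.source_id === expectedSourceId)) {
      hits += 1;
    }
  }
  return hits / testCases.length;
}

// Stand-in search function so the example runs without a database.
const fakeIndex = {
  'sarah medication': [{ source_id: 's1' }, { source_id: 's2' }],
  'van booking': [{ source_id: 's9' }],
};
const fakeSearch = (query) => fakeIndex[query] || [];

const rate = hitRateAtK(
  [
    { query: 'sarah medication', expectedSourceId: 's2' },
    { query: 'van booking', expectedSourceId: 's3' },
  ],
  fakeSearch
);
console.log(rate >= 0.8 ? 'meets target' : 'below target');
```

Running the same test set after each change to extraction prompts or scoring weights gives a stable before/after comparison.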

Changelog

| Date | Change | Author |
|---|---|---|
| 2026-01-21 | Initial document created | Droid |
| 2026-01-21 | Full implementation complete - schema, pipeline, search, integrations | Droid |
| 2026-01-21 | Changed to server-integrated processor (vs separate worker script) | Droid |
| 2026-01-21 | Added hooks to Discord bot and Resource KB for automatic ingestion | Droid |
| 2026-01-21 | Queued 115,829 sources for processing (115,740 Discord + 89 Resources) | Droid |