Embedding & Semantic Search Strategy
Created: 2026-01-21
Status: Active - Living Document
Related: AI/Agent System, Knowledge Base, Chat Context
Executive Summary
This document defines our strategy for embedding, enriching, and searching content across all data sources. Our philosophy is quality over scale - with ~300 users, we can afford multi-pass LLM extraction, parallel embedding strategies, and thorough quality checks that would be cost-prohibitive at scale.
Core principle: Each piece of content is gold that deserves microscopic analysis.
1. Philosophy & Approach
1.1 Quality-First Mindset
We are NOT a scale app. We have:
- ~20 heavily active users
- ~100 moderate users
- ~200 casual users
This means we CAN:
- Run multi-pass LLM extraction (2-4 rounds per document)
- Employ parallel embedding strategies (atomic + context + summary)
- Use expensive models for quality checks (GPT-4o verification)
- Re-process everything when extraction logic improves
- Treat each document like it's being prepared for expert review
Cost reality: ~$5-10/month for extremely high-quality search. A single hour of staff time saved pays for a year.
1.2 Embeddings Are Coordinates, Not Symbols
Key mental model:
- Vectors cannot be "decoded" back to text
- Search is pure math (cosine similarity)
- All intelligence happens before (preparation) and after (ranking)
- The embedding itself is just coordinates in meaning-space
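To make "search is pure math" concrete, here is a minimal sketch of cosine similarity over plain arrays. The function name and the toy 3-dimensional vectors are illustrative only; in practice the database computes this over the stored 3072-dimension vectors.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; higher means "closer in meaning-space".
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings.
const queryVector = [1, 0, 1];
const sameDirection = [2, 0, 2]; // same direction, different length
const orthogonal = [0, 1, 0];    // shares no component with the query

console.log(cosineSimilarity(queryVector, sameDirection)); // approximately 1
console.log(cosineSimilarity(queryVector, orthogonal));    // 0
```

Nothing in this calculation knows about text, time, or trust, which is why preparation and ranking carry all the intelligence.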
1.3 Separate Concerns
| Stage | Responsibility | When Intelligence Happens |
|---|---|---|
| Ingestion | Store raw content (immutable) | None - just preservation |
| Extraction | Decompose into semantic units | LLM does the thinking |
| Enrichment | Add metadata, entities, tags | LLM + rules |
| Embedding | Generate vectors | API call (no intelligence) |
| Search | Similarity + filtering + ranking | App logic + optional LLM re-rank |
1.4 Metadata Outside Vectors
Critical principle:
- Domain tags, entity refs, authority flags live in columns, not embedded text
- Never encode time/authority INTO the embedding text
- Vectors handle meaning; app handles trust/relevance/recency
2. Data Classification
2.1 Container Types (Stable Set)
These are the source formats - unlikely to change often:
| Container Type | Description | Examples |
|---|---|---|
| short_message | Brief informal text | Discord, SMS, chat turns |
| long_message | Extended text with structure | Emails, long posts |
| staff_report | Operational reports | Shift notes, incidents, supervision |
| transcript | Spoken word converted to text | Meeting transcripts, calls |
| resource_document | Official knowledge content | Policies, procedures, handbooks |
| structured_record | Field-based data | Form submissions, CRM entries, KPIs |
| task_or_ticket | Work items with status | Monday items, action registers |
| conversation_thread | Curated multi-message synthesis | Thread summaries, decision logs |
| file_data | Extracted text from files | PDF content, DOCX |
| image_description | Visual content as text | OCR, alt text, AI descriptions |
Rule: When a new source appears, map it to an existing container. Only add a new container if the extraction logic fundamentally differs.
2.2 Unit Types (What We Extract)
Each extracted semantic unit gets classified:
Facts & Descriptions
- general_fact - Timeless objective facts
- descriptive_fact - Attributes of person/place/object
- state_fact - Current or recent state
- historical_fact - Past event or outcome
Actions & Events
- action_completed - Something done
- action_in_progress - Ongoing work
- action_required - Requested/expected action
- event - Something that happened
Decisions & Intent
- decision - Explicit choice or resolution
- intent - Stated plan or aim
- commitment - Promise or obligation
Knowledge & Rules
- policy - Formal rule or requirement
- procedure - Step-based instructions
- exception - Override or edge case
- definition - Meaning of a term
Questions & Uncertainty
- question - Explicit query
- assumption - Belief without confirmation
- unknown - Explicit lack of knowledge
Risk & Assessment
- risk - Potential harm or issue
- assessment - Evaluation or judgment
- status_update - Progress or state change
Meta/Administrative
- reference - Pointer to external info
- note - Contextual but non-actionable
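Because the LLM chooses the unit type, it helps to validate its answer against the fixed list above. The sketch below is one possible shape for that check; the constant and function names (`UNIT_TYPES`, `groupOfUnitType`) and the group keys are hypothetical, not taken from the codebase.

```javascript
// The 23 unit types from this section, keyed by a short (hypothetical) group name.
const UNIT_TYPES = {
  facts: ['general_fact', 'descriptive_fact', 'state_fact', 'historical_fact'],
  actions: ['action_completed', 'action_in_progress', 'action_required', 'event'],
  decisions: ['decision', 'intent', 'commitment'],
  knowledge: ['policy', 'procedure', 'exception', 'definition'],
  uncertainty: ['question', 'assumption', 'unknown'],
  risk: ['risk', 'assessment', 'status_update'],
  meta: ['reference', 'note'],
};

// Reverse lookup: unit type -> group.
const TYPE_TO_GROUP = new Map(
  Object.entries(UNIT_TYPES).flatMap(([group, types]) =>
    types.map((type) => [type, group])
  )
);

// Returns the group for a known unit type, or null for anything unrecognised.
function groupOfUnitType(type) {
  return TYPE_TO_GROUP.get(type) ?? null;
}

console.log(groupOfUnitType('policy'));   // 'knowledge'
console.log(groupOfUnitType('made_up'));  // null
```

A null result is a cheap signal that an extracted unit should be re-classified or sent to verification rather than stored as-is.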
2.3 Domain Tags
Organization-specific topic areas for filtering:
| Domain | Description |
|---|---|
| roster | Scheduling, shifts, availability |
| payroll | Pay, timesheets, deductions |
| hr | Employment, contracts, policies |
| incidents | Safety, accidents, near-misses |
| participants | Client/participant matters |
| vehicles | Fleet, maintenance, bookings |
| training | Qualifications, certifications |
| ndis | Funding, plans, compliance |
| facilities | Buildings, equipment, maintenance |
| communications | Internal comms, announcements |
3. Ingestion Pipeline
3.1 Overview
Raw Content
|
v
+-------------------------------------------------------+
| Stage 1: INGEST |
| - Store raw content immutably |
| - Classify container type |
| - Record provenance (source, author, timestamp) |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 2: EXTRACT (LLM Pass 1) |
| - Decompose into atomic semantic units |
| - Classify each unit by type |
| - Identify entities (staff, participants, etc.) |
| - Tag with domains |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 3: VERIFY (LLM Pass 2) |
| - Quality check extraction |
| - Flag low confidence units |
| - Identify potential gaps |
| - Check entity resolution accuracy |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 4: ENRICH (LLM Pass 3 + Rules) |
| - Add inferred metadata |
| - Resolve entity references to IDs |
| - Generate search-optimized text |
| - Create unit summary + context variants |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 5: EMBED (API Call) |
| - Generate multiple embeddings per unit: |
| * Atomic (just the unit text) |
| * Contextual (unit + surrounding context) |
| * Summary (synthesized overview) |
| - Store all strategies with tags |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Stage 6: INDEX |
| - Store in unified schema |
| - Update search indexes |
| - Link to source entities |
+-------------------------------------------------------+
3.2 Stage Details
Stage 1: Ingest
Store the raw content exactly as received:
{
container_type: 'staff_report',
origin_system: 'shift_notes',
origin_subtype: 'daily_report',
external_id: 'shift-2026-01-21-venue-a',
author_id: 'staff-uuid',
author_display: 'Jane Smith',
created_at: '2026-01-21T08:00:00Z',
raw_text: '... original content ...',
raw_payload: { /* any structured metadata */ }
}
Immutability rule: Raw content never changes. If the source is updated, store the update as a new version rather than overwriting the existing row.
Stage 2: Extract
LLM prompt breaks content into atomic units:
Input: "Sarah arrived at 8am looking tired. She mentioned her dog was sick overnight.
During morning activities she was engaged but needed reminders about her medication at 10am.
Note: Sarah's NDIS plan review is due next month."
Output:
- [state_fact] Sarah appeared tired on arrival at 8:00 AM
- [historical_fact] Sarah's dog was sick overnight (context for tiredness)
- [assessment] Sarah was engaged during morning activities
- [action_required] Sarah needed medication reminder at 10:00 AM
- [event] Medication reminder given at 10:00 AM (implied completed)
- [status_update] Sarah's NDIS plan review due next month
Each unit is self-contained and searchable independently.
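If the extraction prompt returns units in the `- [type] text` form shown above, turning them into records is a small parsing step. This is a sketch under that assumption; `parseExtractionOutput` is a hypothetical helper, and a real prompt might return JSON instead.

```javascript
// Matches lines such as "- [state_fact] Sarah appeared tired on arrival".
const UNIT_LINE = /^-\s*\[([a-z_]+)\]\s+(.+)$/;

// Converts raw LLM output into { unit_type, text } records.
// Lines that do not match the expected form are skipped, not guessed at.
function parseExtractionOutput(llmOutput) {
  const units = [];
  for (const rawLine of llmOutput.split('\n')) {
    const match = UNIT_LINE.exec(rawLine.trim());
    if (!match) continue;
    units.push({ unit_type: match[1], text: match[2].trim() });
  }
  return units;
}

const sampleOutput = [
  '- [state_fact] Sarah appeared tired on arrival at 8:00 AM',
  '- [assessment] Sarah was engaged during morning activities',
  'Some stray commentary the model added',
].join('\n');

console.log(parseExtractionOutput(sampleOutput));
```

Skipping unparseable lines keeps stray model commentary out of the units table; the verification pass is the place to catch anything that was dropped.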
Stage 3: Verify
Second LLM pass reviews extraction quality:
Questions to answer:
- Did we miss any facts?
- Are entities correctly identified?
- Any ambiguous interpretations?
- Confidence score for each unit?
Low-confidence units get flagged for human review or excluded from high-trust searches.
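The routing rule can be sketched as a simple split on the confidence score. The 0.8 cut-off and the names `REVIEW_THRESHOLD` and `routeByConfidence` are assumptions for illustration; the document does not fix a threshold for this stage.

```javascript
// Hypothetical cut-off; tune per source once confidence distributions are known.
const REVIEW_THRESHOLD = 0.8;

// Splits verified units into those safe for high-trust search and those
// that should be flagged for human review.
function routeByConfidence(units, threshold = REVIEW_THRESHOLD) {
  const trusted = [];
  const needsReview = [];
  for (const unit of units) {
    // A missing confidence compares as false here, so it is treated as low.
    if (unit.confidence >= threshold) {
      trusted.push(unit);
    } else {
      needsReview.push(unit);
    }
  }
  return { trusted, needsReview };
}

const routed = routeByConfidence([
  { text: 'Sarah appeared tired on arrival', confidence: 0.95 },
  { text: 'Medication reminder given (implied)', confidence: 0.6 },
]);
console.log(routed.trusted.length, routed.needsReview.length); // 1 1
```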
Stage 4: Enrich
Add metadata that helps with search:
{
unit_type: 'action_required',
text_for_embedding: 'Sarah needed a medication reminder at 10:00 AM',
domain_tags: ['participants', 'health'],
entity_refs: {
participant_ids: ['sarah-uuid'],
staff_ids: ['author-uuid']
},
time: {
effective_at: '2026-01-21T10:00:00Z',
is_time_sensitive: true
},
authority: {
author_role: 'support_worker',
is_official_record: true
},
safety: {
pii_present: true,
medical_sensitive: true
}
}
Stage 5: Embed
Generate multiple embeddings for each unit:
| Strategy | What It Embeds | Good For |
|---|---|---|
| atomic | Just the unit text | Precise fact retrieval |
| contextual | Unit + 2-3 surrounding units | Understanding in context |
| summary | Synthesized source overview | Broad topic search |
All use text-embedding-3-large (3072 dimensions) for consistency.
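The three strategies differ only in the text sent to the embedding model. A minimal sketch of assembling those inputs follows; `buildEmbeddingInputs` and its `windowSize` parameter are hypothetical, and the embedding API call itself is deliberately left out.

```javascript
// Builds the three input texts for one unit.
// units: ordered unit texts from a single source
// index: position of the unit being embedded
// sourceSummary: synthesized overview of the whole source
// windowSize: neighbours taken on each side (1 gives up to 2 surrounding units)
function buildEmbeddingInputs(units, index, sourceSummary, windowSize = 1) {
  const start = Math.max(0, index - windowSize);
  const end = Math.min(units.length, index + windowSize + 1);
  return {
    atomic: units[index],
    contextual: units.slice(start, end).join(' '),
    summary: sourceSummary,
  };
}

const unitTexts = [
  'Sarah appeared tired on arrival at 8:00 AM.',
  "Sarah's dog was sick overnight.",
  'Sarah was engaged during morning activities.',
];
console.log(buildEmbeddingInputs(unitTexts, 1, 'Shift note about Sarah.'));
```

Each of the three strings would then be embedded separately and stored in its own column, so search can pick the strategy that suits the query.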
Stage 6: Index
Store everything in the unified schema with proper indexes for:
- Vector similarity (HNSW index)
- Domain tag filtering (GIN index)
- Entity lookups (GIN index on entity_refs, which is JSONB)
- Time range queries (B-tree on effective_at)
4. Search Pipeline
4.1 Query Processing
User Query: "What happened with Sarah's medication last week?"
|
v
+-------------------------------------------------------+
| Step 1: QUERY ANALYSIS |
| - Extract intent (information retrieval) |
| - Identify entities (Sarah, medication) |
| - Identify time constraints (last week) |
| - Identify domains (participants, health) |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 2: QUERY EXPANSION (Optional) |
| LLM generates search variants: |
| - "Sarah medication reminder" |
| - "Sarah health medication administration" |
| - "participant medication compliance" |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 3: HYBRID SEARCH |
| Combine multiple signals: |
| - Vector similarity (semantic match) |
| - Entity filter (participant = Sarah) |
| - Domain filter (health, participants) |
| - Time filter (last 7 days) |
| - Authority weight (official records higher) |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 4: RESULT FUSION |
| - Merge results from all strategies |
| - De-duplicate by source |
| - Score by combined relevance |
+-------------------------------------------------------+
|
v
+-------------------------------------------------------+
| Step 5: LLM RE-RANK (Optional) |
| - Take top 20 candidates |
| - LLM scores relevance to original query |
| - Return top 5 with confidence |
+-------------------------------------------------------+
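Step 4 (result fusion) can be sketched as "keep the best hit per source across strategies". This is one simple fusion rule under assumed hit shapes (`source_id`, `score`); the actual `searchWithSource()` implementation may combine scores differently.

```javascript
// Merges hits from several embedding strategies, de-duplicating by source.
// resultsByStrategy: { atomic: [...hits], contextual: [...hits], ... }
// Each hit is assumed to look like { source_id, score, ... }.
function fuseResults(resultsByStrategy) {
  const bestBySource = new Map();
  for (const [strategy, hits] of Object.entries(resultsByStrategy)) {
    for (const hit of hits) {
      const current = bestBySource.get(hit.source_id);
      if (!current || hit.score > current.score) {
        // Remember which strategy produced the winning hit.
        bestBySource.set(hit.source_id, { ...hit, strategy });
      }
    }
  }
  // Highest score first.
  return [...bestBySource.values()].sort((a, b) => b.score - a.score);
}

const fused = fuseResults({
  atomic: [{ source_id: 's1', score: 0.9 }, { source_id: 's2', score: 0.5 }],
  contextual: [{ source_id: 's1', score: 0.7 }, { source_id: 's3', score: 0.8 }],
});
console.log(fused.map((hit) => `${hit.source_id}:${hit.strategy}`));
// [ 's1:atomic', 's3:contextual', 's2:atomic' ]
```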
4.2 Scoring Formula
final_score = (
vector_similarity * 0.5
+ entity_match_bonus * 0.2
+ recency_score * 0.15
+ authority_score * 0.15
)
Weights are tunable per use case.
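A direct translation of the formula, as a sketch: it assumes every signal has already been normalised to the range 0-1, which the formula implies but does not state. The names `DEFAULT_WEIGHTS` and `finalScore` are illustrative.

```javascript
// Default weights from the formula above; they sum to 1.0.
const DEFAULT_WEIGHTS = { vector: 0.5, entity: 0.2, recency: 0.15, authority: 0.15 };

// signals: each value is assumed to be normalised to [0, 1].
function finalScore(signals, weights = DEFAULT_WEIGHTS) {
  return (
    signals.vectorSimilarity * weights.vector +
    signals.entityMatchBonus * weights.entity +
    signals.recencyScore * weights.recency +
    signals.authorityScore * weights.authority
  );
}

const candidate = {
  vectorSimilarity: 0.8,
  entityMatchBonus: 1,  // the query's entity is referenced by this unit
  recencyScore: 0.5,
  authorityScore: 1,    // official record
};

console.log(finalScore(candidate).toFixed(3)); // 0.825

// Hypothetical per-use-case override that weights recency more heavily.
const recencyHeavy = { vector: 0.4, entity: 0.1, recency: 0.4, authority: 0.1 };
console.log(finalScore(candidate, recencyHeavy).toFixed(3)); // 0.720
```

Keeping the weights in a plain object makes "tunable per use case" a matter of passing a different object per search mode.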
4.3 Search Modes
| Mode | Description | Use Case |
|---|---|---|
| precise | Strict filters, high threshold | Finding specific facts |
| exploratory | Loose filters, broader results | Understanding a topic |
| entity_focused | Filter by entity first | "Everything about X" |
| temporal | Time-weighted heavily | "What happened recently" |
5. Data Cleaning & Quality
5.1 Noise Removal by Container Type
| Container | What to Remove |
|---|---|
| short_message | Chatter, emojis (unless meaningful), "ok/thanks" |
| long_message | Signatures, legal footers, quoted history, unsubscribe blocks |
| staff_report | Boilerplate headers, template text |
| transcript | Filler words ("um", "like"), false starts |
| resource_document | Table of contents, page numbers, headers/footers |
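Two of the rows above are easy to sketch with plain string handling. The word lists and function names here are assumptions for illustration; "like" is intentionally not stripped because a regex cannot tell the filler from the verb, so that case is better left to the LLM pass.

```javascript
// Hypothetical list of content-free acknowledgements for short messages.
const ACKNOWLEDGEMENTS = new Set(['ok', 'okay', 'thanks', 'thank you', 'ty']);

// True when a short message carries nothing worth extracting.
function isNoiseShortMessage(text) {
  const normalized = text.trim().toLowerCase().replace(/[.!]+$/, '');
  return normalized === '' || ACKNOWLEDGEMENTS.has(normalized);
}

// Removes unambiguous filler sounds from transcript text.
function stripTranscriptFillers(text) {
  return text
    .replace(/\b(um+|uh+|erm+)\b[,.]?\s*/gi, '') // whole-word fillers only
    .replace(/\s{2,}/g, ' ')
    .trim();
}

console.log(isNoiseShortMessage('Thanks!'));                 // true
console.log(isNoiseShortMessage('Ok, the van is booked'));   // false
console.log(stripTranscriptFillers('Um the roster is uh done')); // 'the roster is done'
```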
5.2 Entity Resolution
When content mentions people/things:
- Extract mention text ("Sarah", "the blue van")
- Fuzzy match against known entities
- If confident (>80%), link to entity ID
- If uncertain, flag for manual resolution
- Store both mention text AND resolved ID
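The steps above can be sketched with a basic edit-distance similarity. Levenshtein distance is used here only as a stand-in for whatever fuzzy matcher the pipeline actually uses; `resolveMention` and the entity shape (`id`, `name`) are hypothetical.

```javascript
// Classic single-row Levenshtein edit distance.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // distance for (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // distance for (i-1, j)
      row[j] = Math.min(
        above + 1,                                   // deletion
        row[j - 1] + 1,                              // insertion
        diagonal + (a[i - 1] === b[j - 1] ? 0 : 1)   // substitution
      );
      diagonal = above;
    }
  }
  return row[b.length];
}

// Similarity in [0, 1], case-insensitive.
function similarity(a, b) {
  const x = a.trim().toLowerCase();
  const y = b.trim().toLowerCase();
  const longest = Math.max(x.length, y.length);
  return longest === 0 ? 1 : 1 - levenshtein(x, y) / longest;
}

// Links a mention to the best-matching entity when confident (> threshold),
// otherwise flags it. The mention text is always kept alongside the result.
function resolveMention(mentionText, knownEntities, threshold = 0.8) {
  let best = null;
  for (const entity of knownEntities) {
    const score = similarity(mentionText, entity.name);
    if (!best || score > best.score) best = { entity, score };
  }
  const confident = best !== null && best.score > threshold;
  return {
    mention_text: mentionText,
    entity_id: confident ? best.entity.id : null,
    score: best ? best.score : 0,
    needs_manual_resolution: !confident,
  };
}

const participants = [
  { id: 'p1', name: 'Sarah Jones' },
  { id: 'p2', name: 'Sara Johnson' },
];
console.log(resolveMention('Sarah Jonse', participants)); // typo still links to p1
console.log(resolveMention('Sarah', participants));       // too ambiguous, flagged
```

A bare first name scores well below the threshold against a full name, which is the desired behaviour: context such as the author's venue or roster would be needed to resolve it safely.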
5.3 PII & Sensitivity Handling
Flag units containing:
- Full names + identifying details
- Medical information
- Financial data
- Contact details
- Incident details naming individuals
Flagged units:
- Still searchable by authorized users
- Excluded from cross-user knowledge extraction
- Redacted in certain search contexts
5.4 Quality Metrics
Track per source:
- Extraction confidence distribution
- Entity resolution rate
- Units per source (extraction yield)
- Search hit rate (are these units being found?)
- False positive rate (found but irrelevant)
6. Current Systems Audit
6.1 Existing Embedding Usage
| System | Table | Dims | Status | Upgrade Plan |
|---|---|---|---|---|
| Discord | comms.discord_messages | 3072 | Active | Migrate to unified schema |
| Resources | resources.chunks | 3072 | Active | Add semantic decomposition |
| Chat Context | core_ops.session_summaries | 1536 | Active | Upgrade dims + add cross-user |
| HR Policies | hr_policies.policy_vectors | 1536 | Active | Upgrade dims + decomposition |
| History Ribbon | core_ops.history_ribbon_* | 768 | Legacy | Deprecate or upgrade |
6.2 Migration Priority
- Discord - High volume, needs decomposition
- Resources/KB - Important reference material
- HR Policies - Critical for compliance queries
- Chat Context - Enable cross-user learning
- History Ribbon - Evaluate usage first
7. Unified Schema
7.1 Sources Table (Immutable Raw Content)
-- The schema must exist before the tables are created
CREATE SCHEMA IF NOT EXISTS embeddings;
CREATE TABLE embeddings.sources (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
container_type VARCHAR(50) NOT NULL,
origin_system VARCHAR(50) NOT NULL,
origin_subtype VARCHAR(100),
external_id VARCHAR(255),
author_id VARCHAR(255),
author_display VARCHAR(255),
created_at TIMESTAMPTZ NOT NULL,
raw_text TEXT NOT NULL,
raw_payload JSONB,
extraction_status VARCHAR(20) DEFAULT 'pending',
extracted_at TIMESTAMPTZ,
schema_version INT DEFAULT 1,
UNIQUE (origin_system, external_id)
);
7.2 Units Table (Searchable Atoms)
CREATE TABLE embeddings.units (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source_id UUID REFERENCES embeddings.sources(id),
unit_type VARCHAR(50) NOT NULL,
text_for_embedding TEXT NOT NULL,
supporting_text TEXT,
-- Multiple embedding strategies
embedding_atomic vector(3072),
embedding_contextual vector(3072),
embedding_summary vector(3072),
embedding_model VARCHAR(50),
-- Filtering metadata
domain_tags TEXT[],
topic_tags TEXT[],
entity_refs JSONB DEFAULT '{}',
record_scope VARCHAR(20),
-- Time relevance
effective_at TIMESTAMPTZ,
is_time_sensitive BOOLEAN DEFAULT false,
-- Authority
author_role VARCHAR(100),
is_official_record BOOLEAN DEFAULT false,
-- Quality
confidence DECIMAL(3,2) DEFAULT 1.0,
-- Safety
pii_present BOOLEAN DEFAULT false,
medical_sensitive BOOLEAN DEFAULT false
);
7.3 Indexes
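-- Vector similarity (HNSW for fast ANN)
-- pgvector's HNSW and IVFFlat indexes accept at most 2,000 dimensions for the
-- vector type, so the 3072-dim columns are indexed through a halfvec cast
-- (pgvector 0.7.0+, up to 4,000 dimensions). Queries must use the same cast
-- expression in ORDER BY for the index to be used.
CREATE INDEX idx_units_atomic ON embeddings.units
USING hnsw ((embedding_atomic::halfvec(3072)) halfvec_cosine_ops);
CREATE INDEX idx_units_contextual ON embeddings.units
USING hnsw ((embedding_contextual::halfvec(3072)) halfvec_cosine_ops);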
-- Filtering
CREATE INDEX idx_units_domain ON embeddings.units USING GIN(domain_tags);
CREATE INDEX idx_units_entities ON embeddings.units USING GIN(entity_refs);
CREATE INDEX idx_units_time ON embeddings.units(effective_at);
CREATE INDEX idx_units_official ON embeddings.units(is_official_record)
WHERE is_official_record;
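To show how the columns and indexes are meant to work together, here is a sketch of a parameterised query builder for the atomic strategy. `buildSearchQuery` is hypothetical; it returns a `{ text, values }` pair of the kind accepted by drivers such as node-postgres, and it does not execute anything. For an ANN index to be used, the ORDER BY expression has to match the indexed expression exactly, so adjust it to however the index is defined.

```javascript
// Builds a filtered similarity query against embeddings.units.
// queryEmbedding: array of numbers from the embedding model
// domains: match units tagged with ANY of these domains (array overlap)
// participantId: match units whose entity_refs contain this participant
// since: ISO timestamp lower bound on effective_at
function buildSearchQuery({ queryEmbedding, domains = [], participantId = null, since = null, limit = 20 }) {
  // pgvector accepts vectors in '[x,y,z]' text form.
  const values = [`[${queryEmbedding.join(',')}]`];
  const where = ['embedding_atomic IS NOT NULL'];

  if (domains.length > 0) {
    values.push(domains);
    where.push(`domain_tags && $${values.length}::text[]`);
  }
  if (participantId) {
    values.push(JSON.stringify({ participant_ids: [participantId] }));
    where.push(`entity_refs @> $${values.length}::jsonb`);
  }
  if (since) {
    values.push(since);
    where.push(`effective_at >= $${values.length}`);
  }

  values.push(limit);
  const text = `
    SELECT id, source_id, text_for_embedding,
           1 - (embedding_atomic <=> $1::vector) AS vector_similarity
    FROM embeddings.units
    WHERE ${where.join(' AND ')}
    ORDER BY embedding_atomic <=> $1::vector
    LIMIT $${values.length}`;
  return { text, values };
}

const searchQuery = buildSearchQuery({
  queryEmbedding: [0.1, 0.2],
  domains: ['participants'],
  participantId: 'sarah-uuid',
  since: '2026-01-14T00:00:00Z',
  limit: 5,
});
console.log(searchQuery.text);
console.log(searchQuery.values);
```

The filters run on ordinary columns (`domain_tags`, `entity_refs`, `effective_at`), so meaning stays in the vector while trust, scope, and recency stay in metadata.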
8. Implementation Status
Last updated: 2026-01-21
Architecture Implemented
Unified Server-Integrated Processor - Instead of a separate worker script, the embedding processor runs inside the main server as a continuous background loop. This ensures:
- Single code path for backfill AND ongoing content
- No "different versions" problem between scripts
- Self-healing (auto-restarts with server)
- Monitorable via API endpoints
┌─────────────────────────────────────────────────────────────┐
│ Content Sources │
├─────────────────┬─────────────────┬─────────────────────────┤
│ Discord Bot │ Resources KB │ Future (emails, etc) │
│ (polling sync) │ (PDF/MD ingest) │ │
└────────┬────────┴────────┬────────┴────────┬────────────────┘
│ │ │
│ ingestToNew │ ingestChunk │
│ Embeddings() │ ToNewEmbed() │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ embeddings.sources (pending queue) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ embedding-processor.js (runs in server) │
│ │
│ while (running) { │
│ 1. Check for pending sources │
│ 2. If found, process one through multi-pass LLM │
│ 3. Store units + embeddings │
│ 4. Sleep (1s normal, 500ms overnight) │
│ 5. Repeat │
│ } │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ embeddings.units (searchable atoms) │
│ - 3 embeddings per unit (atomic, contextual, summary) │
│ - Rich metadata (entities, sentiment, intent, domains) │
│ - Linked back to original source │
└─────────────────────────────────────────────────────────────┘
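The loop in the diagram can be sketched with its dependencies injected, which keeps it testable without a database or LLM. Everything named here is hypothetical (`runProcessor`, `sleepMsFor`, the dependency names), and the overnight window of 22:00-06:00 is an assumption; the document only says the loop sleeps 1s normally and 500ms overnight.

```javascript
// Assumed overnight window (local server time).
const OVERNIGHT_START_HOUR = 22;
const OVERNIGHT_END_HOUR = 6;

// Sleep between iterations: 500ms overnight, 1s otherwise.
function sleepMsFor(hour) {
  const overnight = hour >= OVERNIGHT_START_HOUR || hour < OVERNIGHT_END_HOUR;
  return overnight ? 500 : 1000;
}

// Continuous background loop.
// fetchPending(): resolves to the next pending source, or null if none
// processSource(source): runs extraction, verification, enrichment, embedding
// isRunning(): false once the processor is paused or the server is stopping
// sleep(ms): resolves after ms milliseconds
async function runProcessor({ fetchPending, processSource, isRunning, sleep, now = () => new Date() }) {
  while (isRunning()) {
    const source = await fetchPending();
    if (source) {
      try {
        await processSource(source);
      } catch (err) {
        // One bad source must not stop the loop.
        console.error(`Failed to process source ${source.id}:`, err);
      }
    }
    await sleep(sleepMsFor(now().getHours()));
  }
}

console.log(sleepMsFor(12)); // 1000
console.log(sleepMsFor(23)); // 500
```

Catching errors per source is what makes the loop safe to run inside the main server: a failed extraction is logged and the next iteration carries on.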
Phase 1: Foundation ✅ COMPLETE
- Create embeddings schema
- Create sources and units tables
- Create extraction_queue table
- Create indexes (IVFFlat for 3072 dims)
- Build ingest functions for each source type
Phase 2: Pipeline ✅ COMPLETE
- Build LLM extraction prompts (anti-hallucination rules)
- Build verification pass (confidence scoring)
- Build enrichment pass (entities, sentiment, intent, why_context)
- Build multi-strategy embedding (atomic, contextual, summary)
- Thread context fetching (Discord: 10 preceding messages)
Phase 3: Migration ✅ COMPLETE (Queued)
- Discord messages ingested (115,740 sources queued)
- Resources/KB ingested (89 sources queued)
- HR Policies - future
- Chat Context - future
Phase 4: Search ✅ COMPLETE
- Build searchWithSource() - returns units + original content
- Multi-strategy search (tries atomic, contextual, summary)
- Metadata filtering (domains, entities, sentiment)
- API endpoint: GET /api/v1/dashboard/semantic-search
- API endpoint: GET /api/v1/dashboard/semantic-stats
- LLM re-ranking - future enhancement
Phase 5: Integration ✅ COMPLETE
- Discord search page uses new semantic search (with legacy fallback)
- Processor control endpoints (pause/resume)
- Discord bot hooks into new system (new messages auto-queued)
- Resource KB hooks into new system (new docs auto-queued)
- Update Reggie SMS to use new search - future
- Update analytics widgets - future
Files Created/Modified
| File | Purpose |
|---|---|
| backend/services/embedding-extraction.js | Multi-pass LLM pipeline |
| backend/services/embedding-processor.js | Server-integrated background worker |
| backend/services/embedding-worker.js | CLI tool for bulk ingestion/status |
| backend/routes_v1p/dashboard.js | API endpoints for search & stats |
| bot/discord-kb-sync.js | Added new embedding ingest hook |
| backend/services/resource-kb.js | Added new embedding ingest hook |
| admin/src/js/pages/page_discord_search.js | Updated to use semantic search |
SQL Migrations
| File | Purpose |
|---|---|
| 20260121_embeddings_unified_schema.sql | Core schema (sources, units, queue) |
| 20260121_embeddings_metadata_enrichment.sql | Enrichment columns + functions |
Current Queue Status
As of 2026-01-21:
- 115,829 sources queued for processing
  - Discord: 115,740 messages
  - Resources: 89 chunks
- Processing rate: ~30s per source (multi-pass LLM)
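- Estimated time: at ~30s per source, the 115,829-source backlog needs roughly 40 days of continuous processing (115,829 x 30s is about 3.47 million seconds). The worker runs continuously and prioritizes new content.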
To Start Processing
- Just start the server - the processor starts automatically:
  cd backend
  node server.js
- Monitor progress via API or CLI:
  # CLI status
  node services/embedding-worker.js status
  # Or via API
  curl http://localhost:3009/api/v1/dashboard/semantic-stats
- Control the processor if needed:
  # Pause processing
  curl -X POST http://localhost:3009/api/v1/dashboard/embedding-processor/pause
  # Resume processing
  curl -X POST http://localhost:3009/api/v1/dashboard/embedding-processor/resume
9. Cost Estimates
| Item | Unit Cost | Monthly Volume | Monthly Cost |
|---|---|---|---|
| Embedding (3072 dim) | $0.00013/1K tokens | ~500K tokens | ~$0.07 |
| LLM Extraction (GPT-4o-mini) | $0.15/1M tokens | ~2M tokens | ~$0.30 |
| LLM Verification (GPT-4o-mini) | $0.15/1M tokens | ~1M tokens | ~$0.15 |
| LLM Re-ranking (GPT-4o-mini) | $0.15/1M tokens | ~500K tokens | ~$0.08 |
| Total | | | ~$0.60 from the line items above (Section 1.1 quotes ~$5-10/month overall) |
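The line items in the table can be recomputed with a few lines of arithmetic. This sketch uses only the unit costs and volumes listed above and treats every token at the listed rate; it does not model output-token pricing, retries, or the one-off backlog, so it is a lower bound on the steady-state bill rather than a forecast.

```javascript
// Line items copied from the cost table; prices are USD per token.
const lineItems = [
  { name: 'Embedding (3072 dim)', usdPerToken: 0.00013 / 1_000, monthlyTokens: 500_000 },
  { name: 'LLM Extraction', usdPerToken: 0.15 / 1_000_000, monthlyTokens: 2_000_000 },
  { name: 'LLM Verification', usdPerToken: 0.15 / 1_000_000, monthlyTokens: 1_000_000 },
  { name: 'LLM Re-ranking', usdPerToken: 0.15 / 1_000_000, monthlyTokens: 500_000 },
];

// Sum of unit cost x monthly volume across all items.
function monthlyCost(items) {
  return items.reduce((sum, item) => sum + item.usdPerToken * item.monthlyTokens, 0);
}

for (const item of lineItems) {
  console.log(`${item.name}: ~$${(item.usdPerToken * item.monthlyTokens).toFixed(3)}`);
}
console.log(`Itemized monthly total: ~$${monthlyCost(lineItems).toFixed(2)}`); // ~$0.59
```

Unrounded, the rows sum to about $0.59; rounding each row first, as the table does, gives about $0.60.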
10. Success Criteria
- Search Quality: 80%+ of test queries return expected results in top 5
- Coverage: All major content types indexed (Discord, Resources, Policies, Reports)
- Latency: Search returns in <500ms
- Freshness: New content indexed within 5 minutes
- Maintainability: Adding new source type requires no schema changes
Changelog
| Date | Change | Author |
|---|---|---|
| 2026-01-21 | Initial document created | Droid |
| 2026-01-21 | Full implementation complete - schema, pipeline, search, integrations | Droid |
| 2026-01-21 | Changed to server-integrated processor (vs separate worker script) | Droid |
| 2026-01-21 | Added hooks to Discord bot and Resource KB for automatic ingestion | Droid |
| 2026-01-21 | Queued 115,829 sources for processing (115,740 Discord + 89 Resources) | Droid |