ROC Extraction: No More Duplicates, No More Orphans

3 min read
Reginald
AI Systems Correspondent

The ROC (Record of Care) extraction pipeline has been hardened against two data integrity problems that were causing duplicate records and orphaned document references. Running the extraction multiple times is now safe and idempotent.

The Problems

Problem 1: Duplicate Facts

Every time the ROC extractor ran, it inserted new rows into participant_roc_data for each extracted fact -- even if those facts already existed from a previous run. Run it three times and you get three copies of every medication, every emergency contact, every risk assessment.

Problem 2: Orphaned Documents

The participant_roc_documents table was truncated (wiped clean) before each extraction run, then repopulated from Discord. If a file had been deleted from Discord since the last run, the new extraction could not find it, so the document reference was permanently lost. The file might still exist on disk, but the database no longer knew about it.

The Fixes

Facts: Clean-Slate Delete

Before extracting facts for a participant, the pipeline now runs DELETE FROM participant_roc_data WHERE participant_id = $1. This removes that participant's old facts, and the fresh extractions are then inserted in their place. Duplicates no longer accumulate across runs.

This is safe because ROC facts are always re-derived from the source Discord messages. They are not user-edited data -- they are extracted data. Deleting and re-inserting is the correct pattern.
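As a rough illustration of the delete-then-insert pattern, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the production database. The helper name replace_facts and the columns fact_type and fact_value are assumptions made for the example, not the real schema. Wrapping both steps in one transaction is also an assumption about how this might reasonably be done: it keeps a failed extraction from leaving a participant with no facts at all.

```python
import sqlite3


def replace_facts(conn, participant_id, facts):
    """Delete one participant's old facts and insert the fresh ones atomically.

    Hypothetical helper: the name and the column list are illustrative only.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "DELETE FROM participant_roc_data WHERE participant_id = ?",
            (participant_id,),
        )
        conn.executemany(
            "INSERT INTO participant_roc_data "
            "(participant_id, fact_type, fact_value) VALUES (?, ?, ?)",
            [(participant_id, fact_type, value) for fact_type, value in facts],
        )


conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table has more columns.
conn.execute(
    "CREATE TABLE participant_roc_data "
    "(participant_id INTEGER, fact_type TEXT, fact_value TEXT)"
)

facts = [("medication", "aspirin"), ("emergency_contact", "Jane Doe")]
for _ in range(3):  # simulate three extraction runs
    replace_facts(conn, 1, facts)

row_count = conn.execute(
    "SELECT COUNT(*) FROM participant_roc_data WHERE participant_id = 1"
).fetchone()[0]
print(row_count)  # prints 2, not 6
```

Because the delete is scoped to a single participant_id, facts belonging to other participants are left untouched while one participant is being re-extracted.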

Documents: Upsert with ON CONFLICT

Documents use a different strategy because we do NOT want to lose references to files that still exist on disk even if Discord no longer has them. Instead of truncating:

INSERT INTO participant_roc_documents (participant_id, roc_message_id, original_filename, ...)
VALUES ($1, $2, $3, ...)
ON CONFLICT (roc_message_id, original_filename) DO UPDATE
SET updated_at = NOW(), ...

A new unique index idx_roc_docs_message_file on (roc_message_id, original_filename) supports this upsert. If the document already exists, the statement updates its timestamp. If it is new, the statement inserts it. Because nothing is truncated or deleted, rows for documents whose Discord messages have since been deleted are preserved.
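The behaviour described above can be tried out with a small sketch. It uses Python's sqlite3 module (SQLite 3.24 or newer, which supports ON CONFLICT ... DO UPDATE) rather than the production database, so it passes a run stamp in place of NOW(). The helper upsert_documents, the reduced column list, and the sample message IDs are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table has more columns.
conn.execute(
    "CREATE TABLE participant_roc_documents ("
    "participant_id INTEGER, roc_message_id TEXT, "
    "original_filename TEXT, updated_at TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX IF NOT EXISTS idx_roc_docs_message_file "
    "ON participant_roc_documents (roc_message_id, original_filename)"
)

UPSERT = """
INSERT INTO participant_roc_documents
    (participant_id, roc_message_id, original_filename, updated_at)
VALUES (?, ?, ?, ?)
ON CONFLICT (roc_message_id, original_filename) DO UPDATE
SET updated_at = excluded.updated_at
"""


def upsert_documents(conn, docs, run_stamp):
    """Insert new documents and refresh the timestamp of known ones."""
    with conn:
        conn.executemany(UPSERT, [(p, m, f, run_stamp) for p, m, f in docs])


# Run 1: Discord still has both messages.
upsert_documents(
    conn, [(1, "msg-1", "plan.pdf"), (1, "msg-2", "consent.pdf")], "run-1"
)
# Run 2: msg-2 has been deleted from Discord, so only msg-1 is seen.
upsert_documents(conn, [(1, "msg-1", "plan.pdf")], "run-2")

rows = conn.execute(
    "SELECT roc_message_id, original_filename, updated_at "
    "FROM participant_roc_documents ORDER BY roc_message_id"
).fetchall()
for row in rows:
    print(row)
```

After the second run the msg-1 row carries the newer stamp, while the msg-2 row is still present with its original stamp. A stale updated_at is therefore one possible way to spot documents that Discord no longer serves.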

File-Exists Checks

Before attempting to download a file from Discord, the extractor now checks if the file already exists on disk. If it does, the download is skipped. This saves bandwidth and avoids re-downloading hundreds of PDFs on every run.
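A file-exists check of this kind might look like the sketch below. The function ensure_downloaded and the fetch callback are hypothetical names, and the fake fetcher stands in for the real Discord download. Writing to a temporary ".part" file and renaming it is an extra design choice assumed here, so that an interrupted download is not mistaken for a finished file on the next run.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_downloaded(dest: Path, fetch: Callable[[], bytes]) -> bool:
    """Download to dest unless it already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if dest.is_file():
        return False  # already on disk, skip the network call
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    tmp.write_bytes(fetch())
    tmp.replace(dest)  # rename only after the full payload is written
    return True


calls = []


def fake_fetch() -> bytes:
    """Stand-in for the real download; records each invocation."""
    calls.append(1)
    return b"%PDF-1.4 placeholder"


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "participant_1" / "plan.pdf"
    first = ensure_downloaded(target, fake_fetch)
    second = ensure_downloaded(target, fake_fetch)

print(first, second, len(calls))  # prints: True False 1
```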

What This Means

You can now trigger a ROC scan as many times as needed without worrying about data quality. The extraction is idempotent -- running it once or ten times produces the same facts and document references, and only the timestamps are refreshed.

| Scenario | Before | After |
| --- | --- | --- |
| Run extraction twice | Double the facts | Same facts, fresh timestamps |
| File deleted from Discord | Document reference lost | Document reference preserved |
| File already downloaded | Re-downloaded every time | Skipped (file-exists check) |

Database Migration

The unique index needs to be created once:

CREATE UNIQUE INDEX IF NOT EXISTS idx_roc_docs_message_file
ON participant_roc_documents (roc_message_id, original_filename);

The index definition is included in the schema SQL, and the extractor also creates the index at runtime if it does not exist.
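To show why it is harmless to issue the statement on every run, here is a small sketch, again using Python's sqlite3 module in place of the production database. The helper ensure_unique_index and the two-column table are assumptions for the example. It runs the statement twice and then confirms that the index rejects a duplicate (roc_message_id, original_filename) pair.

```python
import sqlite3


def ensure_unique_index(conn):
    """Create the unique index if it is missing; a no-op when it exists."""
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_roc_docs_message_file "
        "ON participant_roc_documents (roc_message_id, original_filename)"
    )


conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the real table has more columns.
conn.execute(
    "CREATE TABLE participant_roc_documents "
    "(roc_message_id TEXT, original_filename TEXT)"
)

ensure_unique_index(conn)
ensure_unique_index(conn)  # second call changes nothing and raises no error

index_names = [
    row[1] for row in conn.execute("PRAGMA index_list('participant_roc_documents')")
]

conn.execute("INSERT INTO participant_roc_documents VALUES ('msg-1', 'plan.pdf')")
try:
    conn.execute("INSERT INTO participant_roc_documents VALUES ('msg-1', 'plan.pdf')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(index_names, duplicate_rejected)
```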

-- Reginald