Bookface -- Facial Recognition & Safe Arrival

Bookface is the facial recognition subsystem of RABS. It identifies known staff and participants in photos (from Discord and CCTV) and powers two core capabilities: gallery auto-tagging and safe arrival/departure tracking across locations.

1. Purpose

RABS manages a network of locations with ~100 participants out at any given time, accompanied by ~60 staff. In a crisis, evacuation, or routine check, knowing who is physically where is critical. Currently this relies on manual sign-in sheets, phone calls, and memory.

Bookface solves this by:

Gallery tagging -- automatically identifying people in Discord photos for newsletters and reports.
Safe arrival tracking -- logging when a known person enters or leaves a premises via CCTV.
Location awareness -- providing a real-time dashboard of who is at which location.
Drill metrics -- measuring evacuation response times and headcount accuracy.

2. Model Selection

The Candidates

Model	Embedding Dims	Architecture	License	Low-Res Performance	ONNX Support	Notes
ArcFace (InsightFace)	512	ResNet-100/50	MIT	Good	Native	Industry standard, best accuracy-to-speed ratio
AdaFace	512	ResNet-100	MIT	Excellent	Via export	Specifically designed for low-quality/low-res images
FaceNet	128	Inception-ResNet	Apache 2.0	Moderate	Via export	Google's original, well-understood but dated
face-api.js	128	SSD + ResNet	MIT	Moderate	TF.js native	Pure JS, no native deps, lower accuracy
QMagFace	512	ResNet-100	Research	Excellent	Via export	Quality-aware, handles mixed-quality galleries
VGG-Face2	2048	SE-ResNet-50	CC BY-SA	Moderate	Via export	Large embeddings, older architecture

Recommendation: ArcFace (primary) + AdaFace (CCTV)

For Discord photos (high resolution, good lighting, cooperative subjects): ArcFace R100 is the best choice. It is one of the most widely deployed face recognition models, has native ONNX models available from InsightFace, runs efficiently on GPU, and achieves 99.8%+ accuracy on the LFW (Labeled Faces in the Wild) benchmark.

For CCTV footage (low resolution, variable lighting, motion blur, uncooperative angles): AdaFace is specifically trained to handle image quality degradation. During training it uses an adaptive margin that changes how strongly each sample is emphasised according to its estimated image quality, so the resulting embeddings hold up better on slightly-blurry-but-recognisable faces and are less prone to confidently misidentifying unrecognisable ones. The match threshold itself is still chosen separately (see section 7). This is critical for entrance/exit cameras.

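Both models produce 512-dim embeddings and can share the same comparison infrastructure. The embeddings themselves are not interchangeable between models, so the reference library needs a separate set of centroids for each. The system can run both and use whichever is appropriate for the source.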

Why Not Fine-Tune?

Fine-tuning a face recognition model on your own data is an option but probably unnecessary for this use case:

Arguments against fine-tuning (for now):

ArcFace/AdaFace are trained on millions of faces across all ethnicities, ages, and conditions -- they generalise extremely well
With ~200 known people, the matching problem is simple (small gallery, high embedding quality)
Fine-tuning requires careful data augmentation, validation splits, and risks overfitting to current staff/participants who will change over time
The reference photo library (bookface) already gives you per-person adaptation without touching model weights

When fine-tuning WOULD make sense:

If recognition accuracy on your specific CCTV cameras drops below acceptable thresholds after testing
If you have a challenging edge case (e.g. cameras with extreme IR distortion, specific lighting conditions)
If you expand to hundreds of locations with thousands of people where the embedding space gets crowded

Recommended approach: Start with pre-trained models. Measure accuracy. Fine-tune only if needed, and only on the CCTV model (not the gallery model which works on high-res photos).

3. Technical Architecture

Hardware

GPU: NVIDIA RTX 5080 (16GB VRAM) -- currently unused on the server
Runtime: ONNX Runtime with CUDA execution provider
Language: Node.js via onnxruntime-node

Performance Estimates (RTX 5080)

Operation	Time per Image	Batch of 100
Face detection (RetinaFace)	~10ms	~1s
Face alignment + crop	~2ms	~200ms
ArcFace embedding (512-dim)	~5ms	~500ms
Gallery comparison (200 people)	~1ms	~100ms
Total per image	~18ms	~1.8s

For CCTV processing at 1 frame per 5 seconds across 10 cameras, that's 2 frames/second -- well within the GPU's capacity with room to spare.
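
As a rough check on that headroom claim, the arithmetic can be written out. This is a sketch that uses only the estimates quoted in this section (10 cameras, one frame per 5 seconds, ~18ms per image); the function name is illustrative and the result is an estimate, not a measurement.

```javascript
// Estimate GPU time consumed per wall-clock second by CCTV processing.
// All inputs are the estimates from this section, not measured values.
function estimateGpuLoad(cameras, secondsPerFrame, msPerImage) {
  const framesPerSecond = cameras / secondsPerFrame;
  const busyMsPerSecond = framesPerSecond * msPerImage;
  return {
    framesPerSecond,
    busyMsPerSecond,
    utilisation: busyMsPerSecond / 1000, // fraction of each second spent on inference
  };
}

const load = estimateGpuLoad(10, 5, 18);
console.log(load.framesPerSecond); // 2
console.log(load.busyMsPerSecond); // 36
```

Under these assumptions the GPU is busy for about 36ms of every second. Frames containing several faces cost more, because each extra face adds an alignment and embedding pass.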

Processing Pipeline

Image Source (Discord / CCTV)
        │
        ▼
  ┌─────────────┐
  │  RetinaFace  │  ← Face detection: find bounding boxes
  │  (detector)  │     Handles multiple faces per image
  └──────┬──────┘
         │  [face crops]
         ▼
  ┌─────────────┐
  │  Face Align  │  ← Normalise to 112x112 using 5-point landmarks
  │  (landmark)  │     Eyes, nose, mouth corners
  └──────┬──────┘
         │  [aligned faces]
         ▼
  ┌─────────────┐
  │  ArcFace /   │  ← Generate 512-dim embedding
  │  AdaFace     │     ArcFace for photos, AdaFace for CCTV
  └──────┬──────┘
         │  [embeddings]
         ▼
  ┌─────────────┐
  │  Matcher     │  ← Compare against bookface reference library
  │  (cosine /   │     Distance < threshold = match
  │   euclidean) │     Unknown if no match within threshold
  └──────┬──────┘
         │  [identities]
         ▼
  ┌─────────────┐
  │  Store       │  ← Update metadata JSONB + face_count
  │  (postgres)  │     Tag with staff_id / participant_id
  └─────────────┘
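
The matcher stage in the diagram above can be sketched as follows. This is a minimal illustration using cosine distance (the metric used for the thresholds in section 7); the `library` entry shape and the function names are assumptions made for the example, not the actual implementation.

```javascript
// Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// library: array of { type, id, centroid } entries (hypothetical shape).
// Returns the closest entry within the threshold, or null for "unknown".
function matchFace(embedding, library, threshold) {
  let best = null;
  for (const person of library) {
    const distance = cosineDistance(embedding, person.centroid);
    if (distance < threshold && (best === null || distance < best.distance)) {
      best = { type: person.type, id: person.id, distance };
    }
  }
  return best;
}

// Toy 3-dim vectors for illustration; real embeddings are 512-dim.
const library = [
  { type: "staff", id: 142, centroid: [1, 0, 0] },
  { type: "participant", id: 305, centroid: [0, 1, 0] },
];
console.log(matchFace([0.9, 0.1, 0], library, 0.4)); // closest entry is staff 142
console.log(matchFace([0, 0, 1], library, 0.4));     // null (unknown)
```

With ~200 people a linear scan like this is cheap, so no index structure is needed at this scale.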

Embedding Storage

Face embeddings are stored in the metadata JSONB column of media.discord_media:

{
  "faces": [
    {
      "bbox": [120, 45, 280, 230],
      "confidence": 0.98,
      "embedding": [0.0234, -0.0891, ...],
      "match": {
        "type": "staff",
        "id": 142,
        "name": "Brett",
        "distance": 0.42
      }
    },
    {
      "bbox": [350, 60, 490, 240],
      "confidence": 0.95,
      "embedding": [0.0567, -0.0234, ...],
      "match": null
    }
  ]
}

The second face has "match": null -- detected but not recognised (unknown person / bystander).

Reference Library (Bookface Folder)

admin_drive/media/bookface/
  staff_142/           ← folder name = identity
    front.jpg
    left.jpg
    right.jpg
    smile.jpg
  participant_305/
    front.jpg
    outdoor.jpg

On startup (or when triggered), the worker:

Scans all bookface subfolders
Runs each reference photo through the same detection + embedding pipeline
Averages all embeddings per person into a centroid (mean embedding vector)
Caches the centroid library in memory for fast comparison

Adding new reference photos = drop files in the folder, trigger a re-scan. No retraining, no reprocessing of existing gallery images.
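
The centroid step can be sketched as below. Note one assumption that goes beyond the description above: each embedding is L2-normalised before averaging and the mean is normalised again, which is common practice when matching by cosine distance but is not stated in this document. Function names are illustrative.

```javascript
// Scale a vector to unit length.
function l2Normalise(v) {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map((x) => x / norm);
}

// Average per-photo embeddings into one centroid for a person.
// Assumption: normalise each embedding first so no single photo dominates,
// then re-normalise the mean so the centroid is unit length.
function buildCentroid(embeddings) {
  if (embeddings.length === 0) throw new Error("no reference embeddings");
  const dims = embeddings[0].length;
  const sum = new Array(dims).fill(0);
  for (const e of embeddings) {
    const unit = l2Normalise(e);
    for (let i = 0; i < dims; i++) sum[i] += unit[i];
  }
  return l2Normalise(sum.map((x) => x / embeddings.length));
}

// Toy 2-dim example; real embeddings are 512-dim.
const centroid = buildCentroid([[2, 0], [0, 3]]);
console.log(centroid); // both components equal, vector has unit length
```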

4. Safe Arrival -- CCTV Integration

Concept

Each RABS location has entrance/exit cameras. The system processes frames from these cameras to detect and identify people entering or leaving. This creates an automatic attendance log.

How It Works

CCTV Camera (location entrance)
        │
        │  frame grab every 5 seconds
        ▼
  ┌──────────────┐
  │  Frame Drop   │  ← Camera writes latest frame to a shared folder
  │  (overwrite)  │     e.g. //nas/cctv/location_01/entrance.jpg
  └──────┬───────┘
         │
         ▼
  ┌──────────────┐
  │  Watcher      │  ← Node.js file watcher or polling interval
  │  (per camera) │     Picks up new frames
  └──────┬───────┘
         │
         ▼
  ┌──────────────┐
  │  Recognition  │  ← Same pipeline as gallery (detect → embed → match)
  │  (AdaFace)    │     Uses AdaFace for better low-res handling
  └──────┬───────┘
         │  [identified people]
         ▼
  ┌──────────────┐
  │  Debounce     │  ← Don't log same person every 5 seconds
  │  (15 min gap) │     Only log arrival if not seen in last 15 mins
  └──────┬───────┘
         │
         ▼
  ┌──────────────┐
  │  Log Event    │  ← Insert into arrival/departure table
  │  (postgres)   │     person_type, person_id, location_id, timestamp, direction
  └──────┬───────┘
         │
         ▼
  ┌──────────────┐
  │  Dashboard    │  ← Real-time location board
  │  (websocket)  │     "Cathy arrived at HQ with Alana, Mary, Joe"
  └──────────────┘
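
The debounce stage in the diagram above could look something like this. It is a sketch only: the in-memory `Map`, the key format, and the function name are assumptions, and a real worker would need to decide whether this state survives a restart.

```javascript
const DEBOUNCE_MS = 15 * 60 * 1000; // the 15-minute gap from the diagram

// lastSeen maps "personKey|locationId" -> timestamp (ms) of the last sighting.
// Returns true when the sighting should be logged as a new arrival.
function shouldLogArrival(lastSeen, personKey, locationId, nowMs) {
  const key = `${personKey}|${locationId}`;
  const previous = lastSeen.get(key);
  lastSeen.set(key, nowMs); // every sighting refreshes the window
  return previous === undefined || nowMs - previous >= DEBOUNCE_MS;
}

const lastSeen = new Map();
const t0 = Date.UTC(2025, 0, 6, 9, 0, 0); // arbitrary example time
console.log(shouldLogArrival(lastSeen, "staff_142", 1, t0));                      // true
console.log(shouldLogArrival(lastSeen, "staff_142", 1, t0 + 5000));               // false
console.log(shouldLogArrival(lastSeen, "staff_142", 1, t0 + 5000 + DEBOUNCE_MS)); // true
```

Refreshing the timestamp on every sighting means someone who lingers at the entrance is logged once, which matches "only log arrival if not seen in last 15 mins".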

Rudimentary Version (Phase 3B)

The quickest path to a working prototype:

At each location, configure the CCTV camera (or a Dropbox-connected camera, or even a cheap USB webcam running a capture script) to save the latest entrance frame to a shared folder on the NAS, overwriting the same file:
```
//192.168.77.10/dev/rabs/storage/admin_drive/cctv/
  location_01_hq/
    entrance.jpg      ← overwritten every 5 seconds
  location_02_south/
    entrance.jpg
  location_03_north/
    entrance.jpg
```
A CCTV watcher worker (similar to media-grabber) polls these files every 5 seconds, runs face detection + recognition using AdaFace, and logs arrivals/departures.
The dashboard shows a simple table: Location | Person | Arrived | Left | Duration.

This version requires no integration with existing CCTV systems; it only needs a single frame image dropped into a network folder. Most IP cameras support FTP upload, or expose an RTSP stream that can be captured with a single ffmpeg command:

# Grab one frame every 5 seconds from an RTSP camera stream
ffmpeg -rtsp_transport tcp -i rtsp://camera_ip:554/stream1 \
  -vf "fps=1/5" -update 1 -y //nas/cctv/location_01/entrance.jpg
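
For several cameras, the same command can be generated from a small config list. The sketch below only assembles the arguments already shown in the command above; the camera object shape is hypothetical, and launching the process is left as a comment.

```javascript
// Build the ffmpeg argument list for one camera, mirroring the command above.
// camera: { rtspUrl, outputPath } (hypothetical config shape).
function buildFfmpegArgs(camera, intervalSeconds = 5) {
  return [
    "-rtsp_transport", "tcp",
    "-i", camera.rtspUrl,
    "-vf", `fps=1/${intervalSeconds}`,
    "-update", "1",
    "-y", camera.outputPath,
  ];
}

const cameras = [
  { rtspUrl: "rtsp://camera_ip:554/stream1", outputPath: "//nas/cctv/location_01/entrance.jpg" },
];
for (const camera of cameras) {
  // These arguments could be passed to child_process.spawn("ffmpeg", args).
  console.log(["ffmpeg", ...buildFfmpegArgs(camera)].join(" "));
}
```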

Full Version (Phase 3C)

Direction detection (entering vs leaving) using frame-to-frame tracking
Multiple cameras per location (entrance, exit, common areas)
Confidence scoring -- only log arrivals above a threshold to avoid false positives
Alert system -- "Participant X hasn't arrived at expected location by 9:30am"
Evacuation mode -- live headcount per location, checklist of who is accounted for
Historical analytics -- average arrival times, attendance patterns, drill response metrics
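
As an illustration of the evacuation mode item, a headcount checklist reduces to comparing two lists of ids. This sketch assumes an `expected` list is available from some roster source, which this document does not specify; the names are illustrative.

```javascript
// expected: ids rostered at the location (assumed to come from a roster);
// seen: ids the recognition pipeline has identified as present.
function evacuationChecklist(expected, seen) {
  const seenSet = new Set(seen);
  const expectedSet = new Set(expected);
  return {
    accountedFor: expected.filter((id) => seenSet.has(id)),
    missing: expected.filter((id) => !seenSet.has(id)),
    unexpected: seen.filter((id) => !expectedSet.has(id)),
  };
}

const checklist = evacuationChecklist(
  ["staff_142", "participant_305", "participant_306"],
  ["staff_142", "participant_305", "staff_150"],
);
console.log(checklist.missing);    // people still to be located
console.log(checklist.unexpected); // people present but not rostered here
```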

Arrival/Departure Schema (Future)

CREATE TABLE media.safe_arrival (
  id SERIAL PRIMARY KEY,
  person_type TEXT NOT NULL CHECK (person_type IN ('staff', 'participant')),
  person_id INT NOT NULL,               -- staff_id or participant_id
  location_id INT NOT NULL,             -- references a locations table
  camera_id TEXT,                       -- which camera spotted them
  direction TEXT CHECK (direction IN ('arrive', 'depart', 'unknown')),
  confidence FLOAT NOT NULL,            -- recognition confidence (0-1)
  frame_path TEXT,                      -- path to the triggering frame; NULL once the frame is deleted after processing
  detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  metadata JSONB NOT NULL DEFAULT '{}'  -- bbox, embedding distance, etc.
);
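
A worker writing to this table might build its insert along these lines. This is a sketch under assumptions: the `{ text, values }` shape suits database clients that accept `$1`-style placeholders, the event field names are hypothetical, and `frame_path` is left out because frames are not retained.

```javascript
// Build a parameterised INSERT for media.safe_arrival.
// The event object shape is hypothetical; detected_at falls back to the column default.
function buildArrivalInsert(event) {
  if (!["staff", "participant"].includes(event.personType)) {
    throw new Error(`invalid person_type: ${event.personType}`);
  }
  return {
    text:
      "INSERT INTO media.safe_arrival " +
      "(person_type, person_id, location_id, camera_id, direction, confidence, metadata) " +
      "VALUES ($1, $2, $3, $4, $5, $6, $7)",
    values: [
      event.personType,
      event.personId,
      event.locationId,
      event.cameraId ?? null,
      event.direction ?? "unknown",
      event.confidence,
      JSON.stringify(event.metadata ?? {}),
    ],
  };
}

const query = buildArrivalInsert({
  personType: "staff",
  personId: 142,
  locationId: 1,
  confidence: 0.93,
});
console.log(query.values);
```

Validating `person_type` before the query runs gives a clearer error than waiting for the CHECK constraint to reject the row.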

5. Privacy & Consent

Facial recognition in a care/support environment requires careful handling:

Legal Requirements (Australia)

Privacy Act 1988 and Australian Privacy Principles (APPs) apply
Biometric data (face embeddings) is sensitive information under the Privacy Act; APP 3 governs its collection
Collection requires consent or a permitted exception (safety of an individual)
A Privacy Impact Assessment (PIA) should be conducted before deployment

Recommended Safeguards

Safeguard	Implementation
Opt-in consent	Staff and participants must explicitly consent to being in the bookface library
Right to withdraw	Deleting a bookface folder and triggering a re-scan removes that person's reference data, including the cached centroid
No external transmission	All processing is local (GPU on server, no cloud APIs, no data leaves the LAN)
Data minimisation	See detailed rules below
Purpose limitation	Embeddings are used only for matching, never for profiling or behaviour analysis
Access control	Gallery face tags visible to management only, not general staff
Audit trail	All recognition events logged with confidence scores
CCTV signage	Locations using Safe Arrival must have visible "CCTV in operation" signage
Excluded individuals	Any person can be excluded from recognition (their face is detected but never matched)
Immediate deletion	CCTV frames are deleted immediately after processing -- only the result is stored, never the image

Data Minimisation Rules

The principle is: use the least identifying data possible for the task at hand.

Scenario	What to Store	What NOT to Store
Routine arrival count	Headcount + timestamp only (e.g. "4 people arrived at 9:15")	No names, no face data
Known person arrival	person_id + location + timestamp	No face crop, no embedding, no frame
Gallery auto-tagging	person_id tag in metadata	Embedding stored temporarily for re-matching, can be purged after tagging
Evacuation / drill	Full identification (name + location + time)	Only during active emergency/drill mode
CCTV frames	Deleted immediately after processing	Never retained on disk or in database

Anonymisation hierarchy (use the most anonymous level that satisfies the need):

Count only -- "3 staff, 2 participants arrived" (no identity)
Type only -- "staff member arrived" (no specific person)
Identity -- "Brett arrived" (only when operationally required)

The system should default to the most anonymous level (count only) and only escalate when the specific feature requires it. For example, the daily arrival dashboard can show counts by default, with a "reveal names" button restricted to authorised users.
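
One way to keep that default explicit in code is to make the anonymisation level a parameter that falls back to counts. The sketch below is illustrative only; the event shape and level names are assumptions.

```javascript
// Render a batch of arrivals at one of the three anonymisation levels.
// level: "count" | "type" | "identity"; anything else falls back to counts.
// The event shape { type, name } is hypothetical.
function describeArrivals(events, level = "count") {
  if (level === "identity") {
    return events.map((e) => `${e.name} arrived`);
  }
  if (level === "type") {
    return events.map((e) =>
      e.type === "staff" ? "staff member arrived" : "participant arrived");
  }
  const staff = events.filter((e) => e.type === "staff").length;
  const participants = events.length - staff;
  return [`${staff} staff, ${participants} participants arrived`];
}

const events = [
  { type: "staff", name: "Brett" },
  { type: "participant", name: "Alana" },
  { type: "participant", name: "Joe" },
];
console.log(describeArrivals(events));             // counts only, no names
console.log(describeArrivals(events, "identity")); // names, for authorised views
```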

Participant Considerations

For participants in support/care settings:

Guardian/carer consent may be required depending on participant capacity
Safe Arrival positioning: frame as a safety feature (knowing someone arrived safely) not surveillance
Consider having the recognition system only confirm presence, not log movement patterns within a building
Where a headcount satisfies the safety requirement, do not resolve to individual identity
Participant face embeddings must be deletable on request with no residual data in backups beyond retention policy

6. Model Files & Dependencies

Required Packages

onnxruntime-node          ← ONNX inference engine (CUDA support built-in)
sharp                     ← Image preprocessing (resize, crop, normalise)

ONNX Model Files

Download from InsightFace model zoo and place in a models/ directory:

Model	File	Size	Purpose
RetinaFace	`det_10g.onnx`	~16MB	Face detection
ArcFace R50	`w600k_r50.onnx`	~167MB	Gallery embedding (high-res)
AdaFace IR-101	`adaface_ir101_webface12m.onnx`	~250MB	CCTV embedding (low-res tolerant)

Total: ~433MB of model files. Downloaded once, stored on disk.

CUDA Requirements

NVIDIA driver 570+ (the RTX 5080 is a Blackwell-generation card and is not supported by older driver branches such as 535)
CUDA Toolkit 12.x
cuDNN 8.x or 9.x
onnxruntime-node uses CUDA when the session requests the CUDA execution provider; listing CPU after it provides a fallback if CUDA is unavailable

7. Matching Thresholds

Threshold tuning is critical -- too loose and you get false positives (wrong person identified), too strict and you get false negatives (known person tagged as unknown).

Recommended Starting Points

Scenario	Distance Metric	Threshold	Rationale
Gallery photos (ArcFace)	Cosine distance	0.40	High-res, good conditions, prioritise precision
CCTV arrival (AdaFace)	Cosine distance	0.50	Lower-res, allow slightly more tolerance
CCTV evacuation mode	Cosine distance	0.55	In emergency, accept more matches to maximise headcount
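
These starting points, together with the model choice from section 2, could be kept in one lookup so the rest of the pipeline never hard-codes a threshold. A minimal sketch; the profile names and the `source` values are assumptions.

```javascript
// Starting-point thresholds from the table above (cosine distance).
const PROFILES = {
  gallery:    { model: "arcface", threshold: 0.40 },
  cctv:       { model: "adaface", threshold: 0.50 },
  evacuation: { model: "adaface", threshold: 0.55 },
};

// source: "discord" or "cctv" (hypothetical labels); evacuationMode only affects CCTV.
function profileFor(source, evacuationMode = false) {
  if (source === "discord") return PROFILES.gallery;
  if (source === "cctv") return evacuationMode ? PROFILES.evacuation : PROFILES.cctv;
  throw new Error(`unknown source: ${source}`);
}

console.log(profileFor("discord"));    // ArcFace with the gallery threshold
console.log(profileFor("cctv", true)); // AdaFace with the evacuation threshold
```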

Calibration Process

Run recognition on a set of known test photos with ground truth labels
Plot the distribution of match distances for correct matches vs incorrect matches
Choose the threshold that minimises the overlap
Re-calibrate if cameras change, lighting conditions shift, or the population changes significantly
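
One simple reading of "minimises the overlap" is to sweep candidate thresholds and pick the one with the fewest total errors. The sketch below does that; it weights false accepts and false rejects equally, which is an assumption (a precision-first gallery might penalise false accepts more), and the distances are made up for illustration.

```javascript
// genuine: distances for same-person pairs; impostor: distances for different-person pairs.
// Sweeps thresholds from 0 to 1 and returns the first one with the fewest total errors,
// using the rule "distance < threshold = match".
function chooseThreshold(genuine, impostor, step = 0.01) {
  let best = { threshold: 0, errors: Infinity };
  for (let i = 0; i * step <= 1; i++) {
    const threshold = i * step;
    const falseRejects = genuine.filter((d) => d >= threshold).length;
    const falseAccepts = impostor.filter((d) => d < threshold).length;
    const errors = falseRejects + falseAccepts;
    if (errors < best.errors) best = { threshold, errors, falseRejects, falseAccepts };
  }
  return best;
}

// Made-up distances purely for illustration.
const genuine = [0.18, 0.25, 0.31, 0.36, 0.44];
const impostor = [0.41, 0.58, 0.66, 0.72, 0.81];
console.log(chooseThreshold(genuine, impostor));
```

In this toy data the two distributions overlap (one genuine pair at 0.44 sits above an impostor pair at 0.41), so no threshold is error-free; the sweep just makes the trade-off visible.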

Handling Twins / Look-Alikes

ArcFace embeddings can sometimes confuse similar-looking individuals. Mitigations:

Increase reference photo diversity (more angles, more conditions)
Lower the match threshold for those specific individuals (a smaller distance threshold is stricter)
Use a top-2 match approach: if the top two candidates are very close in distance, flag as "uncertain" rather than committing to one
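
The top-2 approach can be sketched as follows. The `margin` value is a hypothetical tuning parameter, not a recommendation from this document, and the candidate shape is assumed.

```javascript
// candidates: [{ id, distance }] for one face against the whole library.
// margin: how close the runner-up must be to count as ambiguous (hypothetical default).
function decideMatch(candidates, threshold, margin = 0.05) {
  const sorted = [...candidates].sort((a, b) => a.distance - b.distance);
  const [first, second] = sorted;
  if (!first || first.distance >= threshold) return { status: "unknown" };
  if (second && second.distance - first.distance < margin) {
    return { status: "uncertain", candidates: [first.id, second.id] };
  }
  return { status: "match", id: first.id };
}

console.log(decideMatch([{ id: 1, distance: 0.30 }, { id: 2, distance: 0.32 }], 0.4)); // uncertain
console.log(decideMatch([{ id: 1, distance: 0.20 }, { id: 2, distance: 0.60 }], 0.4)); // match, id 1
console.log(decideMatch([{ id: 1, distance: 0.70 }], 0.4));                            // unknown
```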

8. Implementation Phases

Phase 3A: Gallery Auto-Tagging (First)

Effort: 1-2 weeks

Install onnxruntime-node and download ArcFace ONNX model
Build bookface reference library scanner (read folders, generate centroids)
Build background worker to process faces_processed = FALSE photos
Store results in metadata.faces[] JSONB
Add face tags to gallery UI (hover to see names on detected faces)
Add "People" smart album to gallery sidebar

Validation: Process 100 known Discord photos. Measure precision (correct matches / total matches) and recall (correct matches / total known faces in photos). Target: 95%+ precision, 85%+ recall.
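
The two validation metrics can be computed from a labelled test batch as sketched below. The result shape is an assumption: one entry per detected face, with `null` standing for a bystander (actual) or an "unknown" result (predicted).

```javascript
// results: [{ actual: personId | null, predicted: personId | null }] (hypothetical shape).
// precision = correct matches / total matches; recall = correct matches / total known faces.
function precisionRecall(results) {
  const predicted = results.filter((r) => r.predicted !== null);
  const known = results.filter((r) => r.actual !== null);
  const correct = predicted.filter((r) => r.predicted === r.actual).length;
  return {
    precision: predicted.length ? correct / predicted.length : 0,
    recall: known.length ? correct / known.length : 0,
  };
}

const results = [
  { actual: 142, predicted: 142 },   // correct match
  { actual: 305, predicted: 305 },   // correct match
  { actual: 305, predicted: 142 },   // wrong person
  { actual: 17, predicted: null },   // known face missed
  { actual: null, predicted: null }, // bystander correctly left unknown
];
console.log(precisionRecall(results)); // precision 2/3, recall 2/4
```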

Phase 3B: Safe Arrival Prototype (Second)

Effort: 2-3 weeks

Set up frame capture from one test camera (ffmpeg RTSP or file drop)
Build CCTV watcher worker with AdaFace model
Build arrival/departure table and debounce logic
Build simple dashboard showing current location occupancy
Test with controlled entries (known people walking past camera)

Validation: Run for one week at one location. Measure false positive rate (unknown person identified as someone) and false negative rate (known person not identified). Target: <2% false positive, <10% false negative.

Phase 3C: Full Safe Arrival (Third)

Effort: 4-6 weeks

Multi-location rollout
Direction detection (arrive vs depart)
Evacuation mode dashboard
Alert system for missing arrivals
Historical analytics and drill metrics
Privacy impact assessment and consent collection

9. Alternative: Training a Custom Model

If pre-trained models prove insufficient for CCTV conditions at your specific locations, a custom fine-tuned model is an option.

What You Would Need

Training data: 50-100 images per person, ideally captured from the actual CCTV cameras at various times of day
Hardware: Your RTX 5080 is more than sufficient for fine-tuning (typically takes 2-6 hours)
Framework: PyTorch with InsightFace training scripts
Base model: Start from ArcFace R100 pre-trained weights, fine-tune the last few layers

Process

Capture frames from CCTV cameras over 1-2 weeks
Manually label faces in frames (which person is which) -- this is the most labour-intensive step
Split into training (80%) and validation (20%) sets
Fine-tune from pre-trained weights with a low learning rate
Export to ONNX for deployment
A/B test against the pre-trained model to measure improvement

When This Makes Sense

Pre-trained model accuracy drops below 85% on your CCTV footage
Specific lighting conditions (heavy IR, backlighting) cause consistent failures
You've exhausted other options (better cameras, more reference photos, threshold tuning)

When This Does NOT Make Sense

You have fewer than 30 images per person from CCTV (too little data)
The issue is camera quality (fix the camera instead)
The pre-trained model works fine (don't fix what isn't broken)

10. Quick-Start Checklist

For when you're ready to begin implementation:

Verify NVIDIA drivers and CUDA toolkit are installed on the server
Run nvidia-smi to confirm the 5080 is visible
Install onnxruntime-node and sharp in the backend
Download ONNX model files (RetinaFace + ArcFace) -- ~200MB
Populate bookface/ with at least front + left + right photos per person
Run the reference library scanner to generate centroids
Process a test batch of Discord photos and review accuracy
For Safe Arrival: set up one test camera with frame capture to NAS
Conduct privacy impact assessment before production rollout
