Bookface -- Facial Recognition & Safe Arrival

Bookface is the facial recognition subsystem of RABS. It identifies known staff and participants in photos (from Discord and CCTV) and powers two core capabilities: gallery auto-tagging and safe arrival/departure tracking across locations.


1. Purpose

RABS manages a network of locations with ~100 participants out at any given time accompanied by ~60 staff. In a crisis, evacuation, or routine check, knowing who is physically where is critical. Currently this relies on manual sign-in sheets, phone calls, and memory.

Bookface solves this by:

  1. Gallery tagging -- automatically identifying people in Discord photos for newsletters and reports.
  2. Safe arrival tracking -- logging when a known person enters or leaves a premises via CCTV.
  3. Location awareness -- providing a real-time dashboard of who is at which location.
  4. Drill metrics -- measuring evacuation response times and headcount accuracy.

2. Model Selection

The Candidates

| Model | Embedding Dims | Architecture | License | Low-Res Performance | ONNX Support | Notes |
|---|---|---|---|---|---|---|
| ArcFace (InsightFace) | 512 | ResNet-100/50 | MIT | Good | Native | Industry standard, best accuracy-to-speed ratio |
| AdaFace | 512 | ResNet-100 | MIT | Excellent | Via export | Specifically designed for low-quality/low-res images |
| FaceNet | 128 | Inception-ResNet | Apache 2.0 | Moderate | Via export | Google's original, well-understood but dated |
| face-api.js | 128 | SSD + ResNet | MIT | Moderate | TF.js native | Pure JS, no native deps, lower accuracy |
| QMagFace | 512 | ResNet-100 | Research | Excellent | Via export | Quality-aware, handles mixed-quality galleries |
| VGG-Face2 | 2048 | SE-ResNet-50 | CC BY-SA | Moderate | Via export | Large embeddings, older architecture |

Recommendation: ArcFace (primary) + AdaFace (CCTV)

For Discord photos (high resolution, good lighting, cooperative subjects): ArcFace R100 is the best choice. It is the most widely deployed face recognition model, has native ONNX models available from InsightFace, runs efficiently on GPU, and achieves 99.8%+ accuracy on the LFW (Labeled Faces in the Wild) benchmark.

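For CCTV footage (low resolution, variable lighting, motion blur, uncooperative angles): AdaFace is specifically trained to handle image quality degradation. Its training loss uses an adaptive margin that depends on image quality (approximated by the feature norm), so low-quality but recognisable faces are emphasised and unrecognisable ones are de-emphasised. In practice a blurry face is less likely to produce a confident false match, while a slightly-blurry-but-recognisable one should still match correctly. The adaptation happens at training time; at inference the match threshold is still set by the system (see Matching Thresholds). This is critical for entrance/exit cameras.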

Both models produce 512-dim embeddings and can share the same comparison infrastructure. The system can run both and use whichever is appropriate for the source.

Why Not Fine-Tune?

Fine-tuning a face recognition model on your own data is an option but probably unnecessary for this use case:

Arguments against fine-tuning (for now):

  • ArcFace/AdaFace are trained on millions of faces across all ethnicities, ages, and conditions -- they generalise extremely well
  • With ~200 known people, the matching problem is simple (small gallery, high embedding quality)
  • Fine-tuning requires careful data augmentation, validation splits, and risks overfitting to current staff/participants who will change over time
  • The reference photo library (bookface) already gives you per-person adaptation without touching model weights

When fine-tuning WOULD make sense:

  • If recognition accuracy on your specific CCTV cameras drops below acceptable thresholds after testing
  • If you have a challenging edge case (e.g. cameras with extreme IR distortion, specific lighting conditions)
  • If you expand to hundreds of locations with thousands of people where the embedding space gets crowded

Recommended approach: Start with pre-trained models. Measure accuracy. Fine-tune only if needed, and only on the CCTV model (not the gallery model which works on high-res photos).


3. Technical Architecture

Hardware

  • GPU: NVIDIA RTX 5080 (16GB VRAM) -- currently unused on the server
  • Runtime: ONNX Runtime with CUDA execution provider
  • Language: Node.js via onnxruntime-node

Performance Estimates (RTX 5080)

| Operation | Time per Image | Batch of 100 |
|---|---|---|
| Face detection (RetinaFace) | ~10ms | ~1s |
| Face alignment + crop | ~2ms | ~200ms |
| ArcFace embedding (512-dim) | ~5ms | ~500ms |
| Gallery comparison (200 people) | ~1ms | ~100ms |
| Total per image | ~18ms | ~1.8s |

For CCTV processing at 1 frame per 5 seconds across 10 cameras, that's 2 frames/second -- well within the GPU's capacity with room to spare.
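
As a quick sanity check, the arithmetic behind that claim can be written out. This is only a sketch: it reuses the ~18ms per-image estimate from the table and assumes one face per frame, so real utilisation will be higher when several people are in shot.

```javascript
// Rough GPU budget check using the per-image estimate from the table above.
const cameras = 10;
const secondsBetweenFrames = 5;
const msPerImage = 18; // "Total per image" estimate, not a measured value

const framesPerSecond = cameras / secondsBetweenFrames; // frames arriving per second
const busyMsPerSecond = framesPerSecond * msPerImage;   // GPU time needed per second
const utilisation = busyMsPerSecond / 1000;             // fraction of each second spent working

console.log(`${framesPerSecond} frames/s, ~${(utilisation * 100).toFixed(1)}% GPU time`);
```

At these estimates the CCTV load uses only a few percent of the available time, which leaves headroom for gallery batches to run alongside it.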

Processing Pipeline

Image Source (Discord / CCTV)
       │
       ▼
┌─────────────┐
│ RetinaFace  │ ← Face detection: find bounding boxes
│ (detector)  │   Handles multiple faces per image
└──────┬──────┘
       │ [face crops]
       ▼
┌─────────────┐
│ Face Align  │ ← Normalise to 112x112 using 5-point landmarks
│ (landmark)  │   Eyes, nose, mouth corners
└──────┬──────┘
       │ [aligned faces]
       ▼
┌─────────────┐
│ ArcFace /   │ ← Generate 512-dim embedding
│ AdaFace     │   ArcFace for photos, AdaFace for CCTV
└──────┬──────┘
       │ [embeddings]
       ▼
┌─────────────┐
│ Matcher     │ ← Compare against bookface reference library
│ (cosine     │   Cosine distance < threshold = match
│ distance)   │   Unknown if no match within threshold
└──────┬──────┘
       │ [identities]
       ▼
┌─────────────┐
│ Store       │ ← Update metadata JSONB + face_count
│ (postgres)  │   Tag with staff_id / participant_id
└─────────────┘
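
The matcher stage can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it uses cosine distance (the metric in the Matching Thresholds section), the function names `cosineDistance` and `matchFace` are hypothetical, and toy 3-dim vectors stand in for real 512-dim embeddings.

```javascript
// Cosine distance between two vectors: 1 - cos(angle). 0 means same direction.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// library: Map of identity key (e.g. "staff_142") -> centroid embedding.
// Returns the closest identity under the threshold, or null for "unknown".
function matchFace(embedding, library, threshold) {
  let best = null;
  for (const [identity, centroid] of library) {
    const distance = cosineDistance(embedding, centroid);
    if (distance < threshold && (best === null || distance < best.distance)) {
      best = { identity, distance };
    }
  }
  return best;
}

// Toy library: two people, 3-dim vectors instead of 512-dim embeddings.
const library = new Map([
  ["staff_142", [1, 0, 0]],
  ["participant_305", [0, 1, 0]],
]);

console.log(matchFace([0.9, 0.1, 0], library, 0.4)); // closest to staff_142
console.log(matchFace([0, 0, 1], library, 0.4));     // null: nothing within threshold
```

With ~200 centroids a linear scan like this is cheap, so no vector index is needed at this scale.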

Embedding Storage

Face embeddings are stored in the metadata JSONB column of media.discord_media:

{
  "faces": [
    {
      "bbox": [120, 45, 280, 230],
      "confidence": 0.98,
      "embedding": [0.0234, -0.0891, ...],
      "match": {
        "type": "staff",
        "id": 142,
        "name": "Brett",
        "distance": 0.42
      }
    },
    {
      "bbox": [350, 60, 490, 240],
      "confidence": 0.95,
      "embedding": [0.0567, -0.0234, ...],
      "match": null
    }
  ]
}

The second face has "match": null -- detected but not recognised (unknown person / bystander).

Reference Library (Bookface Folder)

admin_drive/media/bookface/
  staff_142/              ← folder name = identity
    front.jpg
    left.jpg
    right.jpg
    smile.jpg
  participant_305/
    front.jpg
    outdoor.jpg

On startup (or when triggered), the worker:

  1. Scans all bookface subfolders
  2. Runs each reference photo through the same detection + embedding pipeline
  3. Averages all embeddings per person into a centroid (mean embedding vector)
  4. Caches the centroid library in memory for fast comparison

Adding new reference photos = drop files in the folder, trigger a re-scan. No retraining, no reprocessing of existing gallery images.
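
Step 3 (averaging into a centroid) might look like the sketch below. L2-normalising each embedding before averaging, and the centroid afterwards, is an assumption here (it is common practice with cosine matching, but the source does not specify it). The function names are hypothetical and the vectors are toy 2-dim examples.

```javascript
// Scale a vector to unit length.
function l2Normalise(v) {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map((x) => x / norm);
}

// Average several embeddings for one person into a single unit-length centroid.
function centroid(embeddings) {
  const dims = embeddings[0].length;
  const mean = new Array(dims).fill(0);
  for (const e of embeddings.map(l2Normalise)) {
    for (let i = 0; i < dims; i++) mean[i] += e[i] / embeddings.length;
  }
  return l2Normalise(mean);
}

// Two reference "photos" of different magnitude pointing in different directions.
const c = centroid([[2, 0], [0, 3]]);
console.log(c); // both components equal, overall length 1
```

Normalising first stops one unusually large embedding from dominating the average.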


4. Safe Arrival -- CCTV Integration

Concept

Each RABS location has entrance/exit cameras. The system processes frames from these cameras to detect and identify people entering or leaving. This creates an automatic attendance log.

How It Works

CCTV Camera (location entrance)
       │
       │ frame grab every 5 seconds
       ▼
┌──────────────┐
│ Frame Drop   │ ← Camera writes latest frame to a shared folder
│ (overwrite)  │   e.g. //nas/cctv/location_01/entrance.jpg
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Watcher      │ ← Node.js file watcher or polling interval
│ (per camera) │   Picks up new frames
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Recognition  │ ← Same pipeline as gallery (detect → embed → match)
│ (AdaFace)    │   Uses AdaFace for better low-res handling
└──────┬───────┘
       │ [identified people]
       ▼
┌──────────────┐
│ Debounce     │ ← Don't log same person every 5 seconds
│ (15 min gap) │   Only log arrival if not seen in last 15 mins
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Log Event    │ ← Insert into arrival/departure table
│ (postgres)   │   person_id, location_id, timestamp, direction
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Dashboard    │ ← Real-time location board
│ (websocket)  │   "Cathy arrived at HQ with Alana, Mary, Joe"
└──────────────┘
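
The debounce stage is the least obvious box in the diagram, so here is one possible shape for it. This is a sketch under assumptions: state is held in an in-memory Map (a real worker might persist it), `shouldLog` is a hypothetical name, and every sighting refreshes the timer so that someone standing in view of the camera does not re-trigger an arrival.

```javascript
const DEBOUNCE_MS = 15 * 60 * 1000; // 15-minute gap from the diagram

// lastSeen: Map of "personKey@locationId" -> timestamp (ms) of the last sighting.
// Returns true when the sighting should be logged as a new arrival.
function shouldLog(lastSeen, personKey, locationId, nowMs) {
  const key = `${personKey}@${locationId}`;
  const previous = lastSeen.get(key);
  lastSeen.set(key, nowMs); // every sighting refreshes the timer
  return previous === undefined || nowMs - previous >= DEBOUNCE_MS;
}

const lastSeen = new Map();
const t0 = Date.UTC(2025, 0, 6, 9, 0, 0);

console.log(shouldLog(lastSeen, "staff_142", 1, t0));                      // true: first sighting
console.log(shouldLog(lastSeen, "staff_142", 1, t0 + 5000));               // false: seen 5 s ago
console.log(shouldLog(lastSeen, "staff_142", 1, t0 + 5000 + DEBOUNCE_MS)); // true: 15 min since last sighting
```

Keying on person and location together means the same person arriving at a second site is logged straight away.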

Rudimentary Version (Phase 3B)

The quickest path to a working prototype:

  1. At each location, configure the CCTV camera (or a Dropbox-connected camera, or even a cheap USB webcam running a capture script) to save the latest entrance frame to a shared folder on the NAS, overwriting the same file:

    //192.168.77.10/dev/rabs/storage/admin_drive/cctv/
      location_01_hq/
        entrance.jpg      ← overwritten every 5 seconds
      location_02_south/
        entrance.jpg
      location_03_north/
        entrance.jpg

  2. A CCTV watcher worker (similar to media-grabber) polls these files every 5 seconds, runs face detection + recognition using AdaFace, and logs arrivals/departures.

  3. The dashboard shows a simple table: Location | Person | Arrived | Left | Duration.

This version requires NO integration with existing CCTV systems -- just getting one frame image dropped to a network folder. Most IP cameras support FTP upload or RTSP streams that can be captured with a single ffmpeg command:

# Grab one frame every 5 seconds from an RTSP camera stream
ffmpeg -rtsp_transport tcp -i rtsp://camera_ip:554/stream1 \
  -vf "fps=1/5" -update 1 -y //nas/cctv/location_01/entrance.jpg
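
The dashboard table from step 3 (Location | Person | Arrived | Left | Duration) could be derived from logged events roughly as follows. This sketch assumes each event already carries an `arrive` or `depart` direction and that events are sorted by time; the field names and `buildOccupancyRows` are hypothetical.

```javascript
// events: [{ person, location, direction: "arrive" | "depart", at: Date }], sorted by time.
function buildOccupancyRows(events) {
  const open = new Map(); // "person@location" -> row still awaiting a departure
  const rows = [];
  for (const e of events) {
    const key = `${e.person}@${e.location}`;
    if (e.direction === "arrive" && !open.has(key)) {
      const row = { location: e.location, person: e.person, arrived: e.at, left: null, durationMin: null };
      open.set(key, row);
      rows.push(row);
    } else if (e.direction === "depart" && open.has(key)) {
      const row = open.get(key);
      row.left = e.at;
      row.durationMin = Math.round((e.at - row.arrived) / 60000); // ms -> minutes
      open.delete(key);
    }
  }
  return rows;
}

const rows = buildOccupancyRows([
  { person: "Cathy", location: "HQ", direction: "arrive", at: new Date("2025-01-06T09:00:00Z") },
  { person: "Joe", location: "HQ", direction: "arrive", at: new Date("2025-01-06T09:05:00Z") },
  { person: "Cathy", location: "HQ", direction: "depart", at: new Date("2025-01-06T15:30:00Z") },
]);
console.table(rows);
```

Rows with no departure yet (`left: null`) are the people currently on site.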

Full Version (Phase 3C)

  • Direction detection (entering vs leaving) using frame-to-frame tracking
  • Multiple cameras per location (entrance, exit, common areas)
  • Confidence scoring -- only log arrivals above a threshold to avoid false positives
  • Alert system -- "Participant X hasn't arrived at expected location by 9:30am"
  • Evacuation mode -- live headcount per location, checklist of who is accounted for
  • Historical analytics -- average arrival times, attendance patterns, drill response metrics
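
The alert rule in the list above ("Participant X hasn't arrived at expected location by 9:30am") reduces to a set difference between who was expected and who has been logged. The sketch below assumes a hypothetical schedule record with a `dueBy` time; where that schedule comes from is outside the scope of this document.

```javascript
// expected: [{ person, location, dueBy: Date }]  (hypothetical schedule records)
// arrivals: [{ person, location, at: Date }]     (arrival rows already logged)
function findMissingArrivals(expected, arrivals, now) {
  const arrived = new Set(arrivals.map((a) => `${a.person}@${a.location}`));
  return expected.filter(
    (e) => now >= e.dueBy && !arrived.has(`${e.person}@${e.location}`)
  );
}

const dueBy = new Date("2025-01-06T09:30:00");
const expected = [
  { person: "participant_305", location: "HQ", dueBy },
  { person: "participant_311", location: "HQ", dueBy },
];
const arrivals = [
  { person: "participant_305", location: "HQ", at: new Date("2025-01-06T09:12:00") },
];

const missing = findMissingArrivals(expected, arrivals, new Date("2025-01-06T09:31:00"));
console.log(missing.map((m) => m.person)); // only participant_311 is overdue
```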

Arrival/Departure Schema (Future)

CREATE TABLE media.safe_arrival (
    id          SERIAL PRIMARY KEY,
    person_type TEXT NOT NULL CHECK (person_type IN ('staff', 'participant')),
    person_id   INT NOT NULL,                        -- staff_id or participant_id
    location_id INT NOT NULL,                        -- references a locations table
    camera_id   TEXT,                                -- which camera spotted them
    direction   TEXT CHECK (direction IN ('arrive', 'depart', 'unknown')),
    confidence  FLOAT NOT NULL,                      -- recognition confidence (0-1)
    frame_path  TEXT,                                -- path to the frame that triggered the log
    detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    metadata    JSONB NOT NULL DEFAULT '{}'          -- bbox, embedding distance, etc.
);

5. Privacy & Safeguards

Facial recognition in a care/support environment requires careful handling:

  • Privacy Act 1988 and Australian Privacy Principles (APPs) apply
  • Biometric data (face embeddings) is sensitive information under the Privacy Act; APP 3 governs its collection
  • Collection requires consent or a permitted exception (safety of an individual)
  • A Privacy Impact Assessment (PIA) should be conducted before deployment

| Safeguard | Implementation |
|---|---|
| Opt-in consent | Staff and participants must explicitly consent to being in the bookface library |
| Right to withdraw | Deleting a bookface folder removes all reference data immediately |
| No external transmission | All processing is local (GPU on server, no cloud APIs, no data leaves the LAN) |
| Data minimisation | See detailed rules below |
| Purpose limitation | Embeddings are used only for matching, never for profiling or behaviour analysis |
| Access control | Gallery face tags visible to management only, not general staff |
| Audit trail | All recognition events logged with confidence scores |
| CCTV signage | Locations using Safe Arrival must have visible "CCTV in operation" signage |
| Excluded individuals | Any person can be excluded from recognition (their face is detected but never matched) |
| Immediate deletion | CCTV frames are deleted immediately after processing -- only the result is stored, never the image |

Data Minimisation Rules

The principle is: use the least identifying data possible for the task at hand.

| Scenario | What to Store | What NOT to Store |
|---|---|---|
| Routine arrival count | Headcount + timestamp only (e.g. "4 people arrived at 9:15") | No names, no face data |
| Known person arrival | person_id + location + timestamp | No face crop, no embedding, no frame |
| Gallery auto-tagging | person_id tag in metadata | Embedding stored temporarily for re-matching, can be purged after tagging |
| Evacuation / drill | Full identification (name + location + time) | Only during active emergency/drill mode |
| CCTV frames | Deleted immediately after processing | Never retained on disk or in database |

Anonymisation hierarchy (use the most anonymous level that satisfies the need; level 1 is the most anonymous):

  1. Count only -- "3 staff, 2 participants arrived" (no identity)
  2. Type only -- "staff member arrived" (no specific person)
  3. Identity -- "Brett arrived" (only when operationally required)

The system should default to level 1 (count only) and escalate only when a specific feature requires it. For example, the daily arrival dashboard can show counts by default, with a "reveal names" button restricted to authorised users.
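
One way to enforce the hierarchy in code is to make the level an explicit parameter that defaults to count only. The sketch below is illustrative: `describeArrivals` and its record shape are hypothetical, and falling back to counts when an unauthorised user requests names is an assumed policy, not one the source prescribes.

```javascript
// Levels from the hierarchy above: 1 = count only, 2 = type only, 3 = identity.
// arrivals: [{ type: "staff" | "participant", name: string }]
function describeArrivals(arrivals, level = 1, authorised = false) {
  if (arrivals.length === 0) return [];
  // Assumed policy: identity requests from unauthorised users drop back to counts.
  const effective = level === 3 && !authorised ? 1 : level;
  if (effective === 1) {
    const counts = {};
    for (const a of arrivals) counts[a.type] = (counts[a.type] || 0) + 1;
    const summary = Object.entries(counts).map(([type, n]) => `${n} ${type}`).join(", ");
    return [`${summary} arrived`];
  }
  if (effective === 2) return arrivals.map((a) => `${a.type} arrived`);
  return arrivals.map((a) => `${a.name} arrived`);
}

const arrivals = [
  { type: "staff", name: "Brett" },
  { type: "staff", name: "Cathy" },
  { type: "participant", name: "Joe" },
];

console.log(describeArrivals(arrivals));          // counts only
console.log(describeArrivals(arrivals, 3));       // still counts: caller not authorised
console.log(describeArrivals(arrivals, 3, true)); // names, for authorised users
```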

Participant Considerations

For participants in support/care settings:

  • Guardian/carer consent may be required depending on participant capacity
  • Safe Arrival positioning: frame as a safety feature (knowing someone arrived safely) not surveillance
  • Consider having the recognition system only confirm presence, not log movement patterns within a building
  • Where a headcount satisfies the safety requirement, do not resolve to individual identity
  • Participant face embeddings must be deletable on request with no residual data in backups beyond retention policy

6. Model Files & Dependencies

Required Packages

onnxruntime-node   ← ONNX inference engine (CUDA support built-in)
sharp              ← Image preprocessing (resize, crop, normalise)

ONNX Model Files

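Download the detector and ArcFace files from the InsightFace model zoo. AdaFace is distributed separately and, as noted in the model comparison, needs exporting to ONNX. Place all three in a models/ directory: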

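| Model | File | Size | Purpose |
|---|---|---|---|
| RetinaFace | det_10g.onnx | ~16MB | Face detection (note: in InsightFace's buffalo_l pack this file is the SCRFD detector, which fills the same role) |
| ArcFace R50 | w600k_r50.onnx | ~167MB | Gallery embedding (high-res). The file name indicates the ResNet-50 variant; the R100 weights recommended above ship under a different file name |
| AdaFace IR-101 | adaface_ir101_webface12m.onnx | ~250MB | CCTV embedding (low-res tolerant) |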

Total: ~433MB of model files. Downloaded once, stored on disk.

CUDA Requirements

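  • NVIDIA driver 570+ (the RTX 5080 is a Blackwell-generation card; the 535 series predates it)
  • CUDA Toolkit 12.x (12.8 or later for native Blackwell support)
  • cuDNN 8.x or 9.x, matching the onnxruntime-node build in use
  • onnxruntime-node runs on CPU by default; request the CUDA execution provider when creating the session and list CPU after it as the fallback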
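
For reference, requesting CUDA in onnxruntime-node is done through the session options. The fragment below only builds the options object so it can run without the package installed; the commented lines show how it would be passed to `InferenceSession.create`. Whether session creation fails or quietly falls back when the CUDA libraries are missing may depend on the onnxruntime-node version, so that behaviour is worth verifying on the target server.

```javascript
// Execution providers are listed in priority order; "cpu" acts as the fallback.
const sessionOptions = {
  executionProviders: ["cuda", "cpu"],
};

// Usage sketch (requires onnxruntime-node; the model path is a placeholder):
//   const ort = require("onnxruntime-node");
//   const session = await ort.InferenceSession.create("models/det_10g.onnx", sessionOptions);

console.log(`Provider order: ${sessionOptions.executionProviders.join(" -> ")}`);
```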

7. Matching Thresholds

Threshold tuning is critical -- too loose and you get false positives (wrong person identified), too strict and you get false negatives (known person tagged as unknown).

| Scenario | Distance Metric | Threshold | Rationale |
|---|---|---|---|
| Gallery photos (ArcFace) | Cosine distance | 0.40 | High-res, good conditions, prioritise precision |
| CCTV arrival (AdaFace) | Cosine distance | 0.50 | Lower-res, allow slightly more tolerance |
| CCTV evacuation mode | Cosine distance | 0.55 | In emergency, accept more matches to maximise headcount |

Calibration Process

  1. Run recognition on a set of known test photos with ground truth labels
  2. Plot the distribution of match distances for correct matches vs incorrect matches
  3. Choose the threshold that minimises the overlap
  4. Re-calibrate if cameras change, lighting conditions shift, or the population changes significantly
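
Steps 2 and 3 can be automated once the labelled distances exist. The sketch below treats "minimise the overlap" as minimising false accepts plus false rejects, which is one reasonable reading rather than the only one; `pickThreshold` is a hypothetical name and the distances are made-up sample values, not measurements.

```javascript
// genuine:  distances for pairs that ARE the same person
// impostor: distances for pairs that are NOT the same person
// The match rule is "distance < threshold", as in the table above.
function pickThreshold(genuine, impostor) {
  const candidates = [...new Set([...genuine, ...impostor])].sort((a, b) => a - b);
  let best = { threshold: null, errors: Infinity };
  for (const t of candidates) {
    const falseRejects = genuine.filter((d) => d >= t).length;
    const falseAccepts = impostor.filter((d) => d < t).length;
    const errors = falseRejects + falseAccepts;
    if (errors < best.errors) best = { threshold: t, errors, falseRejects, falseAccepts };
  }
  return best;
}

// Illustrative values only.
const genuine = [0.21, 0.28, 0.33, 0.37, 0.46];
const impostor = [0.44, 0.52, 0.61, 0.68, 0.75];

console.log(pickThreshold(genuine, impostor));
```

If false positives are costlier than misses (as in routine arrival logging), the two error counts can be weighted differently before summing.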

Handling Twins / Look-Alikes

ArcFace embeddings can sometimes confuse similar-looking individuals. Mitigations:

  • Increase reference photo diversity (more angles, more conditions)
  • Lower the match threshold for those specific individuals (a smaller distance threshold is stricter)
  • Use a top-2 match approach: if the top two candidates are very close in distance, flag as "uncertain" rather than committing to one
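
The top-2 approach might be implemented along these lines. The 0.05 margin is a placeholder that would need calibrating like any other threshold, and `decide` and its return shape are hypothetical.

```javascript
// candidates: [{ identity, distance }], one entry per person in the library.
// margin: minimum gap required between the best and second-best distance.
function decide(candidates, threshold, margin = 0.05) {
  const sorted = [...candidates].sort((a, b) => a.distance - b.distance);
  const [first, second] = sorted;
  if (!first || first.distance >= threshold) return { status: "unknown" };
  if (second && second.distance - first.distance < margin) {
    return { status: "uncertain", candidates: [first.identity, second.identity] };
  }
  return { status: "match", identity: first.identity, distance: first.distance };
}

// Two look-alikes score almost the same: flag instead of guessing.
console.log(decide([
  { identity: "staff_142", distance: 0.30 },
  { identity: "staff_143", distance: 0.32 },
], 0.4));

// One clear winner: commit to the match.
console.log(decide([
  { identity: "staff_142", distance: 0.30 },
  { identity: "participant_305", distance: 0.70 },
], 0.4));
```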

8. Implementation Phases

Phase 3A: Gallery Tagging (First)

Effort: 1-2 weeks

  1. Install onnxruntime-node and download ArcFace ONNX model
  2. Build bookface reference library scanner (read folders, generate centroids)
  3. Build background worker to process faces_processed = FALSE photos
  4. Store results in metadata.faces[] JSONB
  5. Add face tags to gallery UI (hover to see names on detected faces)
  6. Add "People" smart album to gallery sidebar

Validation: Process 100 known Discord photos. Measure precision (correct matches / total matches) and recall (correct matches / total known faces in photos). Target: 95%+ precision, 85%+ recall.
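
The two validation metrics can be computed directly from a labelled test set. This sketch follows the definitions in the paragraph above; the record shape and `precisionRecall` are hypothetical, and faces the detector missed entirely would need adding to the list (with `predicted: null`) for recall to be honest.

```javascript
// faces: one record per face in the test photos.
//   predicted: identity the system assigned, or null for "unknown"
//   actual:    ground-truth identity, or null if the person is not in the library
function precisionRecall(faces) {
  const totalMatches = faces.filter((f) => f.predicted !== null).length;
  const correctMatches = faces.filter((f) => f.predicted !== null && f.predicted === f.actual).length;
  const knownFaces = faces.filter((f) => f.actual !== null).length;
  return {
    precision: totalMatches ? correctMatches / totalMatches : null,
    recall: knownFaces ? correctMatches / knownFaces : null,
  };
}

const result = precisionRecall([
  { predicted: "staff_142", actual: "staff_142" },             // correct match
  { predicted: "participant_305", actual: "participant_305" }, // correct match
  { predicted: "staff_142", actual: "staff_150" },             // wrong person
  { predicted: null, actual: "staff_160" },                    // known person missed
  { predicted: null, actual: null },                           // bystander, correctly unknown
]);
console.log(result); // precision 2/3, recall 2/4
```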

Phase 3B: Safe Arrival Prototype (Second)

Effort: 2-3 weeks

  1. Set up frame capture from one test camera (ffmpeg RTSP or file drop)
  2. Build CCTV watcher worker with AdaFace model
  3. Build arrival/departure table and debounce logic
  4. Build simple dashboard showing current location occupancy
  5. Test with controlled entries (known people walking past camera)

Validation: Run for one week at one location. Measure false positive rate (unknown person identified as someone) and false negative rate (known person not identified). Target: <2% false positive, <10% false negative.

Phase 3C: Full Safe Arrival (Third)

Effort: 4-6 weeks

  1. Multi-location rollout
  2. Direction detection (arrive vs depart)
  3. Evacuation mode dashboard
  4. Alert system for missing arrivals
  5. Historical analytics and drill metrics
  6. Privacy impact assessment and consent collection

9. Alternative: Training a Custom Model

If pre-trained models prove insufficient for CCTV conditions at your specific locations, a custom fine-tuned model is an option.

What You Would Need

  • Training data: 50-100 images per person, ideally captured from the actual CCTV cameras at various times of day
  • Hardware: Your RTX 5080 is more than sufficient for fine-tuning (typically takes 2-6 hours)
  • Framework: PyTorch with InsightFace training scripts
  • Base model: Start from ArcFace R100 pre-trained weights, fine-tune the last few layers

Process

  1. Capture frames from CCTV cameras over 1-2 weeks
  2. Manually label faces in frames (which person is which) -- this is the most labour-intensive step
  3. Split into training (80%) and validation (20%) sets
  4. Fine-tune from pre-trained weights with a low learning rate
  5. Export to ONNX for deployment
  6. A/B test against the pre-trained model to measure improvement

When This Makes Sense

  • Pre-trained model accuracy drops below 85% on your CCTV footage
  • Specific lighting conditions (heavy IR, backlighting) cause consistent failures
  • You've exhausted other options (better cameras, more reference photos, threshold tuning)

When This Does NOT Make Sense

  • You have fewer than 30 images per person from CCTV (too little data)
  • The issue is camera quality (fix the camera instead)
  • The pre-trained model works fine (don't fix what isn't broken)

10. Quick-Start Checklist

For when you're ready to begin implementation:

  • Verify NVIDIA drivers and CUDA toolkit are installed on the server
  • Run nvidia-smi to confirm the 5080 is visible
  • Install onnxruntime-node and sharp in the backend
  • Download ONNX model files (RetinaFace + ArcFace) -- ~200MB
  • Populate bookface/ with at least front + left + right photos per person
  • Run the reference library scanner to generate centroids
  • Process a test batch of Discord photos and review accuracy
  • For Safe Arrival: set up one test camera with frame capture to NAS
  • Conduct privacy impact assessment before production rollout