Bookface -- Facial Recognition & Safe Arrival
Bookface is the facial recognition subsystem of RABS. It identifies known staff and participants in photos (from Discord and CCTV) and powers two core capabilities: gallery auto-tagging and safe arrival/departure tracking across locations.
1. Purpose
RABS manages a network of locations with ~100 participants out at any given time accompanied by ~60 staff. In a crisis, evacuation, or routine check, knowing who is physically where is critical. Currently this relies on manual sign-in sheets, phone calls, and memory.
Bookface solves this by:
- Gallery tagging -- automatically identifying people in Discord photos for newsletters and reports.
- Safe arrival tracking -- logging when a known person enters or leaves a premises via CCTV.
- Location awareness -- providing a real-time dashboard of who is at which location.
- Drill metrics -- measuring evacuation response times and headcount accuracy.
2. Model Selection
The Candidates
| Model | Embedding Dims | Architecture | License | Low-Res Performance | ONNX Support | Notes |
|---|---|---|---|---|---|---|
| ArcFace (InsightFace) | 512 | ResNet-100/50 | MIT (code); pretrained models are released for non-commercial research use only | Good | Native | Industry standard, best accuracy-to-speed ratio |
| AdaFace | 512 | ResNet-100 | MIT | Excellent | Via export | Specifically designed for low-quality/low-res images |
| FaceNet | 128 | Inception-ResNet | Apache 2.0 | Moderate | Via export | Google's original, well-understood but dated |
| face-api.js | 128 | SSD + ResNet | MIT | Moderate | TF.js native | Pure JS, no native deps, lower accuracy |
| QMagFace | 512 | ResNet-100 | Research | Excellent | Via export | Quality-aware, handles mixed-quality galleries |
| VGG-Face2 | 2048 | SE-ResNet-50 | CC BY-SA | Moderate | Via export | Large embeddings, older architecture |
Recommendation: ArcFace (primary) + AdaFace (CCTV)
For Discord photos (high resolution, good lighting, cooperative subjects): ArcFace R100 is the best choice. It is the most widely deployed face recognition model, has native ONNX models available from InsightFace, runs efficiently on GPU, and achieves 99.8%+ accuracy on the LFW (Labeled Faces in the Wild) benchmark.
For CCTV footage (low resolution, variable lighting, motion blur, uncooperative angles): AdaFace is specifically trained to handle image quality degradation. Its training loss uses an adaptive margin that treats the feature norm as a proxy for image quality, emphasising hard-but-recognisable samples and de-emphasising unidentifiable ones. The adaptation happens during training, not at match time, so match thresholds still have to be set separately (see Section 7). In practice the embeddings degrade more gracefully on blurry or low-resolution faces, which is critical for entrance/exit cameras.
Both models produce 512-dim embeddings and can share the same comparison code. Embeddings from the two models are not interchangeable, though, so each model needs its own reference centroids computed from the bookface photos. The system can run both and use whichever is appropriate for the source.
Why Not Fine-Tune?
Fine-tuning a face recognition model on your own data is an option but probably unnecessary for this use case:
Arguments against fine-tuning (for now):
- ArcFace/AdaFace are trained on millions of faces across all ethnicities, ages, and conditions -- they generalise extremely well
- With ~200 known people, the matching problem is simple (small gallery, high embedding quality)
- Fine-tuning requires careful data augmentation, validation splits, and risks overfitting to current staff/participants who will change over time
- The reference photo library (bookface) already gives you per-person adaptation without touching model weights
When fine-tuning WOULD make sense:
- If recognition accuracy on your specific CCTV cameras drops below acceptable thresholds after testing
- If you have a challenging edge case (e.g. cameras with extreme IR distortion, specific lighting conditions)
- If you expand to hundreds of locations with thousands of people where the embedding space gets crowded
Recommended approach: Start with pre-trained models. Measure accuracy. Fine-tune only if needed, and only on the CCTV model (not the gallery model which works on high-res photos).
3. Technical Architecture
Hardware
- GPU: NVIDIA RTX 5080 (16GB VRAM) -- currently unused on the server
- Runtime: ONNX Runtime with CUDA execution provider
- Language: Node.js via onnxruntime-node
Performance Estimates (RTX 5080)
| Operation | Time per Image | Batch of 100 |
|---|---|---|
| Face detection (RetinaFace) | ~10ms | ~1s |
| Face alignment + crop | ~2ms | ~200ms |
| ArcFace embedding (512-dim) | ~5ms | ~500ms |
| Gallery comparison (200 people) | ~1ms | ~100ms |
| Total per image | ~18ms | ~1.8s |
For CCTV processing at 1 frame per 5 seconds across 10 cameras, that's 2 frames/second -- well within the GPU's capacity with room to spare.
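The load figure above can be checked with a few lines of arithmetic. This is a rough sketch only: `cctvLoad` is a hypothetical helper, and it assumes one face per frame and the estimated (not measured) timings from the table.

```javascript
// Back-of-envelope GPU load, using the per-image estimates from the table above.
// Assumption: one face per frame; alignment and embedding cost grow with face count.
const PER_FRAME_MS = 10 + 2 + 5 + 1; // detect + align + embed + compare

function cctvLoad(cameras, secondsPerFrame, perFrameMs = PER_FRAME_MS) {
  const framesPerSecond = cameras / secondsPerFrame;
  const busyMsPerSecond = framesPerSecond * perFrameMs;
  return {
    framesPerSecond,
    busyMsPerSecond,
    utilisation: busyMsPerSecond / 1000, // fraction of each second spent on inference
  };
}

console.log(cctvLoad(10, 5));
```

With 10 cameras at one frame per 5 seconds this gives 2 frames/second and about 36 ms of work per second, i.e. under 4% of the estimated capacity.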
Processing Pipeline
Image Source (Discord / CCTV)
│
▼
┌─────────────┐
│ RetinaFace │ ← Face detection: find bounding boxes
│ (detector) │ Handles multiple faces per image
└──────┬──────┘
│ [face crops]
▼
┌─────────────┐
│ Face Align │ ← Normalise to 112x112 using 5-point landmarks
│ (landmark) │ Eyes, nose, mouth corners
└──────┬──────┘
│ [aligned faces]
▼
┌─────────────┐
│ ArcFace / │ ← Generate 512-dim embedding
│ AdaFace │ ArcFace for photos, AdaFace for CCTV
└──────┬──────┘
│ [embeddings]
▼
┌─────────────┐
│ Matcher │ ← Compare against bookface reference library
│ (cosine / │ Distance < threshold = match
│ euclidean) │ Unknown if no match within threshold
└──────┬──────┘
│ [identities]
▼
┌─────────────┐
│ Store │ ← Update metadata JSONB + face_count
│ (postgres) │ Tag with staff_id / participant_id
└─────────────┘
Embedding Storage
Face embeddings are stored in the metadata JSONB column of media.discord_media:
{
"faces": [
{
"bbox": [120, 45, 280, 230],
"confidence": 0.98,
"embedding": [0.0234, -0.0891, ...],
"match": {
"type": "staff",
"id": 142,
"name": "Brett",
"distance": 0.42
}
},
{
"bbox": [350, 60, 490, 240],
"confidence": 0.95,
"embedding": [0.0567, -0.0234, ...],
"match": null
}
]
}
The second face has "match": null -- detected but not recognised (unknown person / bystander).
Reference Library (Bookface Folder)
admin_drive/media/bookface/
  staff_142/ ← folder name = identity
    front.jpg
    left.jpg
    right.jpg
    smile.jpg
  participant_305/
    front.jpg
    outdoor.jpg
On startup (or when triggered), the worker:
- Scans all bookface subfolders
- Runs each reference photo through the same detection + embedding pipeline
- Averages all embeddings per person into a centroid (mean embedding vector)
- Caches the centroid library in memory for fast comparison
Adding new reference photos = drop files in the folder, trigger a re-scan. No retraining, no reprocessing of existing gallery images.
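The folder-to-centroid step could look roughly like the sketch below. `parseIdentity`, `l2Normalise` and `centroid` are hypothetical helper names, and the L2 normalisation before and after averaging is an assumption added here (the text above only says the embeddings are averaged); it keeps each reference photo equally weighted when cosine distance is used for matching.

```javascript
// Parse a bookface folder name such as "staff_142" into an identity.
// Returns null for folders that do not follow the <type>_<id> convention.
function parseIdentity(folderName) {
  const m = /^(staff|participant)_(\d+)$/.exec(folderName);
  return m ? { type: m[1], id: Number(m[2]) } : null;
}

// Scale a vector to unit length (a zero vector is returned unchanged).
function l2Normalise(vec) {
  const norm = Math.sqrt(vec.reduce((acc, v) => acc + v * v, 0));
  return norm === 0 ? vec.slice() : vec.map((v) => v / norm);
}

// Mean of the unit-length embeddings for one person, re-normalised.
// Assumption: normalising is an addition of this sketch, not stated in the design.
function centroid(embeddings) {
  if (embeddings.length === 0) throw new Error('no embeddings for this person');
  const dim = embeddings[0].length;
  const sum = new Array(dim).fill(0);
  for (const e of embeddings) {
    const unit = l2Normalise(e);
    for (let i = 0; i < dim; i++) sum[i] += unit[i];
  }
  return l2Normalise(sum);
}
```

A re-scan would then rebuild a map from identity to centroid, one entry per folder, without touching stored gallery results.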
4. Safe Arrival -- CCTV Integration
Concept
Each RABS location has entrance/exit cameras. The system processes frames from these cameras to detect and identify people entering or leaving. This creates an automatic attendance log.
How It Works
CCTV Camera (location entrance)
│
│ frame grab every 5 seconds
▼
┌──────────────┐
│ Frame Drop │ ← Camera writes latest frame to a shared folder
│ (overwrite) │ e.g. //nas/cctv/location_01/entrance.jpg
└──────┬───────┘
│
▼
┌──────────────┐
│ Watcher │ ← Node.js file watcher or polling interval
│ (per camera) │ Picks up new frames
└──────┬───────┘
│
▼
┌──────────────┐
│ Recognition │ ← Same pipeline as gallery (detect → embed → match)
│ (AdaFace) │ Uses AdaFace for better low-res handling
└──────┬───────┘
│ [identified people]
▼
┌──────────────┐
│ Debounce │ ← Don't log same person every 5 seconds
│ (15 min gap) │ Only log arrival if not seen in last 15 mins
└──────┬───────┘
│
▼
┌──────────────┐
│ Log Event │ ← Insert into arrival/departure table
│ (postgres) │ staff_id, location_id, timestamp, direction
└──────┬───────┘
│
▼
┌──────────────┐
│ Dashboard │ ← Real-time location board
│ (websocket) │ "Cathy arrived at HQ with Alana, Mary, Joe"
└──────────────┘
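The debounce stage in the diagram can be sketched as a small in-memory class. `ArrivalDebouncer` and the key format are hypothetical; the sketch assumes a single worker process, so state is lost on restart unless it is reloaded from the database.

```javascript
const DEBOUNCE_MS = 15 * 60 * 1000; // 15 minute gap, as in the diagram

class ArrivalDebouncer {
  constructor(gapMs = DEBOUNCE_MS) {
    this.gapMs = gapMs;
    this.lastSeen = new Map(); // "personKey@locationId" -> last sighting (ms)
  }

  // Returns true when this sighting should be logged as a new event.
  // Every sighting refreshes the timer, so someone standing in view of the
  // camera is logged once, not once per 15 minutes.
  shouldLog(personKey, locationId, nowMs) {
    const key = `${personKey}@${locationId}`;
    const prev = this.lastSeen.get(key);
    this.lastSeen.set(key, nowMs);
    return prev === undefined || nowMs - prev >= this.gapMs;
  }
}
```

Refreshing the timer on every sighting is a design choice: it matches "not seen in last 15 mins" literally, at the cost of never re-logging a person who stays in frame.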
Rudimentary Version (Phase 3B)
The quickest path to a working prototype:
- At each location, configure the CCTV camera (or a Dropbox-connected camera, or even a cheap USB webcam running a capture script) to save the latest entrance frame to a shared folder on the NAS, overwriting the same file:
//192.168.77.10/dev/rabs/storage/admin_drive/cctv/
  location_01_hq/
    entrance.jpg ← overwritten every 5 seconds
  location_02_south/
    entrance.jpg
  location_03_north/
    entrance.jpg
- A CCTV watcher worker (similar to media-grabber) polls these files every 5 seconds, runs face detection + recognition using AdaFace, and logs arrivals/departures.
- The dashboard shows a simple table: Location | Person | Arrived | Left | Duration.
This version requires NO integration with existing CCTV systems. It only needs the latest frame written to a network folder. Most IP cameras support FTP upload or RTSP streams that can be captured with a single ffmpeg command:
# Grab one frame every 5 seconds from an RTSP camera stream
ffmpeg -rtsp_transport tcp -i rtsp://camera_ip:554/stream1 \
-vf "fps=1/5" -update 1 -y //nas/cctv/location_01/entrance.jpg
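On the watcher side, detecting that the overwritten file has changed can be done by comparing modification times. The sketch below is one possible approach using only Node's built-in fs module; `FramePoller` is a hypothetical name, and the empty-file check is only a crude guard against reading a frame while it is still being written.

```javascript
const fs = require('node:fs');

class FramePoller {
  constructor(framePath) {
    this.framePath = framePath;
    this.lastMtimeMs = null;
  }

  // Returns true when the frame file exists, is non-empty, and has been
  // modified since the previous call that returned true.
  hasNewFrame() {
    let stat;
    try {
      stat = fs.statSync(this.framePath);
    } catch (err) {
      if (err.code === 'ENOENT') return false; // camera has not written yet
      throw err;
    }
    if (stat.size === 0 || stat.mtimeMs === this.lastMtimeMs) return false;
    this.lastMtimeMs = stat.mtimeMs;
    return true;
  }
}
```

A worker would call `hasNewFrame()` from a 5-second `setInterval` per camera and hand the path to the recognition pipeline when it returns true. Modification times on network shares can be coarse, so this should be verified against the actual NAS.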
Full Version (Phase 3C)
- Direction detection (entering vs leaving) using frame-to-frame tracking
- Multiple cameras per location (entrance, exit, common areas)
- Confidence scoring -- only log arrivals above a threshold to avoid false positives
- Alert system -- "Participant X hasn't arrived at expected location by 9:30am"
- Evacuation mode -- live headcount per location, checklist of who is accounted for
- Historical analytics -- average arrival times, attendance patterns, drill response metrics
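The evacuation-mode checklist reduces to a set comparison between who is expected and who has been sighted. A minimal sketch, assuming a hypothetical `evacuationChecklist` helper and person keys of the form `type:id`:

```javascript
// expected:  person keys rostered at the location, e.g. ["staff:142"]
// sightings: recognition results shaped as { personKey, locationId }
function evacuationChecklist(expected, sightings, locationId) {
  const seen = new Set(
    sightings.filter((s) => s.locationId === locationId).map((s) => s.personKey)
  );
  return {
    accounted: expected.filter((k) => seen.has(k)),
    missing: expected.filter((k) => !seen.has(k)),
    // Seen at this location but not on its roster (visitors, staff from other sites).
    unexpected: [...seen].filter((k) => !expected.includes(k)),
  };
}
```

The `missing` list is what a warden would work through manually; recognition only shortens that list and does not replace a physical check.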
Arrival/Departure Schema (Future)
CREATE TABLE media.safe_arrival (
id SERIAL PRIMARY KEY,
person_type TEXT NOT NULL CHECK (person_type IN ('staff', 'participant')),
person_id INT NOT NULL, -- staff_id or participant_id
location_id INT NOT NULL, -- references a locations table
camera_id TEXT, -- which camera spotted them
direction TEXT CHECK (direction IN ('arrive', 'depart', 'unknown')),
confidence FLOAT NOT NULL, -- recognition confidence (0-1)
frame_path TEXT, -- path to the triggering frame; leave NULL when frames are deleted after processing (see Section 5)
detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
metadata JSONB NOT NULL DEFAULT '{}' -- bbox, embedding distance, etc.
);
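Rows in this shape are enough to derive the current occupancy board. The sketch below works on plain objects with the same column names; `currentOccupancy` is a hypothetical helper, and treating a latest event with direction 'unknown' as "present" is an assumption of this sketch.

```javascript
// events: rows shaped like media.safe_arrival
// ({ person_type, person_id, location_id, direction, detected_at }).
// Returns Map<location_id, string[]> of people whose latest event is not a departure.
function currentOccupancy(events) {
  const sorted = [...events].sort(
    (a, b) => new Date(a.detected_at) - new Date(b.detected_at)
  );
  const latest = new Map(); // "type:id" -> most recent event
  for (const e of sorted) latest.set(`${e.person_type}:${e.person_id}`, e);

  const byLocation = new Map();
  for (const [personKey, e] of latest) {
    if (e.direction === 'depart') continue; // assumption: 'unknown' counts as present
    if (!byLocation.has(e.location_id)) byLocation.set(e.location_id, []);
    byLocation.get(e.location_id).push(personKey);
  }
  return byLocation;
}
```

In production the same result would more likely come from a SQL query over the table; the in-memory version only illustrates the "latest event per person wins" rule.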
5. Privacy & Consent
Facial recognition in a care/support environment requires careful handling:
Legal Requirements (Australia)
- Privacy Act 1988 and Australian Privacy Principles (APPs) apply
- Biometric data (face embeddings) used for automated identification is sensitive information under the Privacy Act; APP 3 governs its collection
- Collection requires consent or a permitted exception (safety of an individual)
- A Privacy Impact Assessment (PIA) should be conducted before deployment
Recommended Safeguards
| Safeguard | Implementation |
|---|---|
| Opt-in consent | Staff and participants must explicitly consent to being in the bookface library |
| Right to withdraw | Deleting a bookface folder removes all reference data immediately |
| No external transmission | All processing is local (GPU on server, no cloud APIs, no data leaves the LAN) |
| Data minimisation | See detailed rules below |
| Purpose limitation | Embeddings are used only for matching, never for profiling or behaviour analysis |
| Access control | Gallery face tags visible to management only, not general staff |
| Audit trail | All recognition events logged with confidence scores |
| CCTV signage | Locations using Safe Arrival must have visible "CCTV in operation" signage |
| Excluded individuals | Any person can be excluded from recognition (their face is detected but never matched) |
| Immediate deletion | CCTV frames are deleted immediately after processing -- only the result is stored, never the image |
Data Minimisation Rules
The principle is: use the least identifying data possible for the task at hand.
| Scenario | What to Store | What NOT to Store |
|---|---|---|
| Routine arrival count | Headcount + timestamp only (e.g. "4 people arrived at 9:15") | No names, no face data |
| Known person arrival | person_id + location + timestamp | No face crop, no embedding, no frame |
| Gallery auto-tagging | person_id tag in metadata | Embedding stored temporarily for re-matching, can be purged after tagging |
| Evacuation / drill | Full identification (name + location + time) | Only during active emergency/drill mode |
| CCTV frames | Deleted immediately after processing | Never retained on disk or in database |
Anonymisation hierarchy, from most to least anonymous (use the most anonymous level that satisfies the need):
- Count only -- "3 staff, 2 participants arrived" (no identity)
- Type only -- "staff member arrived" (no specific person)
- Identity -- "Brett arrived" (only when operationally required)
The system should default to the most anonymous level (count only) and escalate only when the specific feature requires it. For example, the daily arrival dashboard can show counts by default, with a "reveal names" button restricted to authorised users.
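One way to enforce the hierarchy is to shape the data before it reaches the dashboard, so identities are never sent to a client that is not entitled to them. The sketch below uses hypothetical helpers `effectiveLevel` and `present` and assumes events carry `person_type`, `person_id` and `detected_at`.

```javascript
// Levels ordered from most anonymous to most identifying.
const LEVELS = ['count', 'type', 'identity'];

// Fall back to counts for unknown levels or unauthorised identity requests.
function effectiveLevel(requested, isAuthorised) {
  if (!LEVELS.includes(requested)) return 'count';
  if (requested === 'identity' && !isAuthorised) return 'count';
  return requested;
}

// Reduce arrival events to the fields allowed at the given level.
function present(events, level) {
  if (level === 'identity') {
    return events.map((e) => ({
      person_type: e.person_type,
      person_id: e.person_id,
      detected_at: e.detected_at,
    }));
  }
  if (level === 'type') {
    return events.map((e) => ({ person_type: e.person_type, detected_at: e.detected_at }));
  }
  // Count only: e.g. { staff: 3, participant: 2 }
  const counts = {};
  for (const e of events) counts[e.person_type] = (counts[e.person_type] || 0) + 1;
  return counts;
}
```

A "reveal names" button would call `present(events, effectiveLevel('identity', user.isAuthorised))` on the server, where `user.isAuthorised` stands in for whatever access check RABS already has.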
Participant Considerations
For participants in support/care settings:
- Guardian/carer consent may be required depending on participant capacity
- Safe Arrival positioning: frame as a safety feature (knowing someone arrived safely) not surveillance
- Consider having the recognition system only confirm presence, not log movement patterns within a building
- Where a headcount satisfies the safety requirement, do not resolve to individual identity
- Participant face embeddings must be deletable on request with no residual data in backups beyond retention policy
6. Model Files & Dependencies
Required Packages
onnxruntime-node ← ONNX inference engine (CUDA support built-in)
sharp ← Image preprocessing (resize, crop, normalise)
ONNX Model Files
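Download the detector and ArcFace files from the InsightFace model zoo and place them in a models/ directory. AdaFace is not part of that zoo: its authors publish PyTorch checkpoints, which need exporting to ONNX first.
| Model | File | Size | Purpose |
|---|---|---|---|
| Face detector (SCRFD-10GF; referred to as RetinaFace elsewhere in this document) | det_10g.onnx | ~16MB | Face detection |
| ArcFace R50 (ResNet-50 backbone, not the R100 named in Section 2) | w600k_r50.onnx | ~167MB | Gallery embedding (high-res) |
| AdaFace IR-101 | adaface_ir101_webface12m.onnx | ~250MB | CCTV embedding (low-res tolerant) |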
Total: ~433MB of model files. Downloaded once, stored on disk.
CUDA Requirements
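- NVIDIA driver 570+ (the RTX 5080 is a Blackwell-generation GPU, which the 535 driver branch predates)
- CUDA Toolkit 12.8 or later (the first release with Blackwell support)
- cuDNN 9.x
- onnxruntime-node uses CUDA only when the session requests the CUDA execution provider; list the CPU provider after it as the fallback, and confirm GPU support for the installed onnxruntime-node version and platform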
7. Matching Thresholds
Threshold tuning is critical -- too loose and you get false positives (wrong person identified), too strict and you get false negatives (known person tagged as unknown).
Recommended Starting Points
| Scenario | Distance Metric | Threshold | Rationale |
|---|---|---|---|
| Gallery photos (ArcFace) | Cosine distance | 0.40 | High-res, good conditions, prioritise precision |
| CCTV arrival (AdaFace) | Cosine distance | 0.50 | Lower-res, allow slightly more tolerance |
| CCTV evacuation mode | Cosine distance | 0.55 | In emergency, accept more matches to maximise headcount |
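Applied to the matcher, these starting points amount to a nearest-centroid search under cosine distance with a per-scenario cut-off. A minimal sketch follows; `THRESHOLDS`, `cosineDistance` and `matchFace` are hypothetical names, and the library is assumed to be a Map from person key to centroid vector.

```javascript
// Starting thresholds from the table above (cosine distance).
const THRESHOLDS = { gallery: 0.40, cctv: 0.50, evacuation: 0.55 };

// Cosine distance = 1 - cosine similarity. Range 0 (same direction) to 2 (opposite).
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns { personKey, distance } for the closest centroid under the
// scenario's threshold, or null when nobody is close enough (unknown face).
function matchFace(embedding, library, scenario) {
  const threshold = THRESHOLDS[scenario];
  if (threshold === undefined) throw new Error(`unknown scenario: ${scenario}`);
  let best = null;
  for (const [personKey, centroid] of library) {
    const distance = cosineDistance(embedding, centroid);
    if (best === null || distance < best.distance) best = { personKey, distance };
  }
  return best !== null && best.distance < threshold ? best : null;
}
```

With ~200 centroids a linear scan like this is cheap; an index structure only becomes worth considering at much larger gallery sizes.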
Calibration Process
- Run recognition on a set of known test photos with ground truth labels
- Plot the distribution of match distances for correct matches vs incorrect matches
- Choose the threshold that minimises the overlap
- Re-calibrate if cameras change, lighting conditions shift, or the population changes significantly
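The "minimise the overlap" step can be approximated by sweeping candidate thresholds and counting errors on the labelled test set. This sketch interprets overlap as false positives plus false negatives, which is one reasonable reading, not the only one; `calibrateThreshold` is a hypothetical helper.

```javascript
// genuine:  distances between a face and the centroid of its true identity
// impostor: distances between a face and centroids of other identities
// A pair counts as a match when distance < threshold.
function calibrateThreshold(genuine, impostor) {
  const candidates = [...new Set([...genuine, ...impostor])].sort((a, b) => a - b);
  let best = { threshold: null, falseNegatives: 0, falsePositives: 0, errors: Infinity };
  for (const t of candidates) {
    const falseNegatives = genuine.filter((d) => d >= t).length;
    const falsePositives = impostor.filter((d) => d < t).length;
    const errors = falseNegatives + falsePositives;
    // Strict "<" keeps the lowest (strictest) threshold on ties, favouring precision.
    if (errors < best.errors) best = { threshold: t, falseNegatives, falsePositives, errors };
  }
  return best;
}

console.log(calibrateThreshold([0.2, 0.3, 0.35, 0.45], [0.42, 0.5, 0.6, 0.7]));
```

For Safe Arrival, where a wrong identity is worse than an unknown, the two error counts could be weighted differently before summing.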
Handling Twins / Look-Alikes
ArcFace embeddings can sometimes confuse similar-looking individuals. Mitigations:
- Increase reference photo diversity (more angles, more conditions)
- Lower the distance threshold (i.e. require a closer match) for those specific individuals
- Use a top-2 match approach: if the top two candidates are very close in distance, flag as "uncertain" rather than committing to one
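The top-2 approach can be sketched as a margin check on the two nearest candidates. The `minMargin` default of 0.05 is an illustrative value, not a tuned one, and `classifyWithMargin` is a hypothetical helper.

```javascript
// candidates: [{ personKey, distance }] for every person in the library.
// Returns a status of 'match', 'uncertain' or 'unknown'.
function classifyWithMargin(candidates, threshold, minMargin = 0.05) {
  const ranked = [...candidates].sort((a, b) => a.distance - b.distance);
  const [first, second] = ranked;
  if (!first || first.distance >= threshold) return { status: 'unknown' };
  if (second && second.distance - first.distance < minMargin) {
    // The runner-up is nearly as close: do not commit to either identity.
    return { status: 'uncertain', candidates: [first.personKey, second.personKey] };
  }
  return { status: 'match', personKey: first.personKey, distance: first.distance };
}
```

Uncertain results could be queued for manual review in the gallery, or counted without a name in Safe Arrival.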
8. Implementation Phases
Phase 3A: Gallery Auto-Tagging (First)
Effort: 1-2 weeks
- Install onnxruntime-node and download the ArcFace ONNX model
- Build bookface reference library scanner (read folders, generate centroids)
- Build background worker to process faces_processed = FALSE photos
- Store results in metadata.faces[] JSONB
- Add face tags to gallery UI (hover to see names on detected faces)
- Add "People" smart album to gallery sidebar
Validation: Process 100 known Discord photos. Measure precision (correct matches / total matches) and recall (correct matches / total known faces in photos). Target: 95%+ precision, 85%+ recall.
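The two validation metrics can be computed from a hand-labelled list of faces. In this sketch each entry pairs the ground-truth identity with what the system predicted, using null for "unknown"; `precisionRecall` and the entry shape are assumptions for illustration.

```javascript
// results: [{ truth: 'staff:142' | null, predicted: 'staff:142' | null }]
// truth is null for bystanders; predicted is null when no match was made.
function precisionRecall(results) {
  let correct = 0;
  let totalMatches = 0;
  let totalKnown = 0;
  for (const r of results) {
    if (r.predicted !== null) totalMatches++;
    if (r.truth !== null) totalKnown++;
    if (r.predicted !== null && r.predicted === r.truth) correct++;
  }
  return {
    precision: totalMatches > 0 ? correct / totalMatches : null, // correct / total matches
    recall: totalKnown > 0 ? correct / totalKnown : null, // correct / total known faces
  };
}
```

Known faces that the detector missed entirely still need an entry with `predicted: null`, otherwise recall is overstated.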
Phase 3B: Safe Arrival Prototype (Second)
Effort: 2-3 weeks
- Set up frame capture from one test camera (ffmpeg RTSP or file drop)
- Build CCTV watcher worker with AdaFace model
- Build arrival/departure table and debounce logic
- Build simple dashboard showing current location occupancy
- Test with controlled entries (known people walking past camera)
Validation: Run for one week at one location. Measure false positive rate (unknown person identified as someone) and false negative rate (known person not identified). Target: <2% false positive, <10% false negative.
Phase 3C: Full Safe Arrival (Third)
Effort: 4-6 weeks
- Multi-location rollout
- Direction detection (arrive vs depart)
- Evacuation mode dashboard
- Alert system for missing arrivals
- Historical analytics and drill metrics
- Privacy impact assessment and consent collection
9. Alternative: Training a Custom Model
If pre-trained models prove insufficient for CCTV conditions at your specific locations, a custom fine-tuned model is an option.
What You Would Need
- Training data: 50-100 images per person, ideally captured from the actual CCTV cameras at various times of day
- Hardware: Your RTX 5080 is more than sufficient for fine-tuning (typically takes 2-6 hours)
- Framework: PyTorch with InsightFace training scripts
- Base model: Start from ArcFace R100 pre-trained weights, fine-tune the last few layers
Process
- Capture frames from CCTV cameras over 1-2 weeks
- Manually label faces in frames (which person is which) -- this is the most labour-intensive step
- Split into training (80%) and validation (20%) sets
- Fine-tune from pre-trained weights with a low learning rate
- Export to ONNX for deployment
- A/B test against the pre-trained model to measure improvement
When This Makes Sense
- Pre-trained model accuracy drops below 85% on your CCTV footage
- Specific lighting conditions (heavy IR, backlighting) cause consistent failures
- You've exhausted other options (better cameras, more reference photos, threshold tuning)
When This Does NOT Make Sense
- You have fewer than 30 images per person from CCTV (too little data)
- The issue is camera quality (fix the camera instead)
- The pre-trained model works fine (don't fix what isn't broken)
10. Quick-Start Checklist
For when you're ready to begin implementation:
- Verify NVIDIA drivers and CUDA toolkit are installed on the server
- Run nvidia-smi to confirm the 5080 is visible
- Install onnxruntime-node and sharp in the backend
- Download ONNX model files (RetinaFace + ArcFace) -- ~200MB
- Populate bookface/ with at least front + left + right photos per person
- Run the reference library scanner to generate centroids
- Process a test batch of Discord photos and review accuracy
- For Safe Arrival: set up one test camera with frame capture to NAS
- Conduct privacy impact assessment before production rollout