Face Detection — SCRFD

Overview

SCRFD (Sample and Computation Redistribution for Face Detection) is the detection backbone. Published by InsightFace (DeepInsight), it achieves state-of-the-art results on WIDER FACE by redistributing training samples and computation across network stages according to how difficult faces are to detect at each scale.

Variants

| Model | WIDER FACE AP (Easy / Medium / Hard) | Params | FLOPs | Inference (VGA) |
|------------|------------------------|-------|------|-------|
| SCRFD-10G | 95.16 / 93.87 / 83.05 | 3.86M | 10G | ~5 ms |
| SCRFD-2.5G | 93.78 / 92.16 / 77.87 | 0.67M | 2.5G | ~4 ms |
| SCRFD-500M | 90.57 / 88.12 / 68.51 | 0.57M | 500M | ~3 ms |

All models use 3 feature map strides (8, 16, 32) with 2 anchors per location. KPS variants include 5-point facial landmark regression.

Preprocessing

  1. Letterbox resize: Scale image to fit within input size (default 640x640) maintaining aspect ratio
  2. Zero-pad: Place resized image in top-left corner of a zero-filled canvas
  3. Normalize: cv2.dnn.blobFromImage with mean=(127.5, 127.5, 127.5), scale=1/128, swapRB=True
  4. Output: [1, 3, 640, 640] float32 tensor (NCHW, RGB, range ~[-1, 1])
det_scale = new_height / original_height  # Used to map detections back

Model Outputs

For SCRFD with keypoints (9 outputs total):

| Index | Name | Shape | Description |
|-------|------|-------|-------------|
| 0-2 | scores | (1, N_i, 1) | Detection confidence per anchor at stride 8/16/32 |
| 3-5 | bboxes | (1, N_i, 4) | Distance predictions (left, top, right, bottom) |
| 6-8 | keypoints | (1, N_i, 10) | 5-point landmark offsets (x, y pairs) |

Where N_i = 2 * H_i * W_i (2 anchors per location), and H_i = W_i = 640 / stride_i for the default 640x640 input.
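As a quick sanity check, the per-stride anchor counts for the default 640x640 input work out as follows:

```python
# Per-stride anchor counts for the default 640x640 input
input_size = 640
anchor_counts = {}
for stride in (8, 16, 32):
    h = w = input_size // stride       # feature-map size at this stride
    anchor_counts[stride] = 2 * h * w  # 2 anchors per location
print(anchor_counts)  # {8: 12800, 16: 3200, 32: 800}
```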

Postprocessing

Anchor Generation

centers = np.stack(np.mgrid[:height, :width][::-1], axis=-1)  # (H, W, 2) grid of (x, y)
centers = (centers * stride).reshape(-1, 2)                   # scale to pixel coords
centers = np.stack([centers] * 2, axis=1).reshape(-1, 2)      # duplicate for 2 anchors

Distance-to-BBox Conversion

x1 = anchor_x - pred_left * stride
y1 = anchor_y - pred_top * stride
x2 = anchor_x + pred_right * stride
y2 = anchor_y + pred_bottom * stride
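Vectorized over all anchors, the conversion above looks like this (a minimal sketch; the function name `distance2bbox` is an assumption, and it assumes the raw predictions are in feature-map units, hence the multiply by stride):

```python
import numpy as np

def distance2bbox(centers, distances, stride):
    """Decode (left, top, right, bottom) distances into (x1, y1, x2, y2) boxes.

    Hypothetical helper. centers: (N, 2) anchor centers in pixel coords;
    distances: (N, 4) raw predictions in feature-map units.
    """
    d = distances * stride               # convert to pixel distances
    x1 = centers[:, 0] - d[:, 0]         # left
    y1 = centers[:, 1] - d[:, 1]         # top
    x2 = centers[:, 0] + d[:, 2]         # right
    y2 = centers[:, 1] + d[:, 3]         # bottom
    return np.stack([x1, y1, x2, y2], axis=-1)
```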

Non-Maximum Suppression (NMS)

Standard IoU-based NMS with configurable threshold (default 0.4). Implemented in pure numpy — sufficient for typical face counts (< 100 pre-NMS candidates).
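A pure-numpy greedy NMS of this kind can be sketched as follows (an illustrative version, not necessarily the exact one used here):

```python
import numpy as np

def nms(dets, iou_thresh=0.4):
    """Greedy IoU-based NMS. dets: (N, 5) array of [x1, y1, x2, y2, score].

    Returns indices of kept boxes, highest score first.
    """
    x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = dets[:, 4].argsort()[::-1]   # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```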

Post-NMS

  1. Scale bounding boxes back by 1 / det_scale
  2. Clip to original image dimensions
  3. Sort by confidence descending
  4. Apply max_faces limit
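The four post-NMS steps can be sketched in one helper (the function name `finalize` and its signature are assumptions for illustration):

```python
import numpy as np

def finalize(dets, det_scale, img_w, img_h, max_faces=None):
    """Map kept detections back to original image space, clip, sort, trim.

    Hypothetical helper. dets: (N, 5) array of [x1, y1, x2, y2, score].
    """
    dets = dets.copy()
    dets[:, :4] /= det_scale                               # undo letterbox scaling
    dets[:, [0, 2]] = dets[:, [0, 2]].clip(0, img_w - 1)   # clip x to image width
    dets[:, [1, 3]] = dets[:, [1, 3]].clip(0, img_h - 1)   # clip y to image height
    dets = dets[dets[:, 4].argsort()[::-1]]                # confidence descending
    return dets[:max_faces] if max_faces else dets
```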

Landmarks

Five facial landmarks, in order:

  1. Left eye
  2. Right eye
  3. Nose tip
  4. Left mouth corner
  5. Right mouth corner

Encoded as (x, y) offsets from anchor centers in feature-map units. Decoding mirrors the bounding-box case, except each stride-scaled offset is simply added to the anchor center (there is no sign flip as with the left/top box distances).
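The landmark decode can be sketched as follows (a minimal illustration; the function name `distance2kps` is an assumption):

```python
import numpy as np

def distance2kps(centers, kps_preds, stride):
    """Decode 5-point landmark offsets into absolute pixel coordinates.

    Hypothetical helper. centers: (N, 2) anchor centers in pixel coords;
    kps_preds: (N, 10) raw predictions, five (dx, dy) pairs in
    feature-map units.
    """
    out = kps_preds.reshape(-1, 5, 2) * stride  # scale offsets to pixels
    out = out + centers[:, None, :]             # add the (x, y) anchor center
    return out.reshape(-1, 10)
```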