Face Detection — SCRFD

Overview

SCRFD (Sample and Computation Redistribution for Face Detection) is the detection backbone. Published by InsightFace (DeepInsight), it achieves state-of-the-art results on WIDER FACE by redistributing training samples and computation across network stages according to how difficult faces are to detect at each scale.

Variants

| Model | WIDER FACE AP (Easy / Medium / Hard) | Params | FLOPs | Inference (VGA) |
|------------|------------------------|-------|------|-------|
| SCRFD-10G | 95.16 / 93.87 / 83.05 | 3.86M | 10G | ~5 ms |
| SCRFD-2.5G | 93.78 / 92.16 / 77.87 | 0.67M | 2.5G | ~4 ms |
| SCRFD-500M | 90.57 / 88.12 / 68.51 | 0.57M | 500M | ~3 ms |

All models use 3 feature map strides (8, 16, 32) with 2 anchors per location. KPS variants include 5-point facial landmark regression.

Preprocessing

  1. Letterbox resize: Scale image to fit within input size (default 640x640) maintaining aspect ratio
  2. Zero-pad: Place resized image in top-left corner of a zero-filled canvas
  3. Normalize: cv2.dnn.blobFromImage with mean=(127.5, 127.5, 127.5), scale=1/128, swapRB=True
  4. Output: [1, 3, 640, 640] float32 tensor (NCHW, RGB, range ~[-1, 1])
det_scale = new_height / original_height  # Used to map detections back

Model Outputs

For SCRFD with keypoints (9 outputs total):

| Index | Name | Shape | Description |
|-------|------|-------|-------------|
| 0-2 | scores | (1, N_i, 1) | Detection confidence per anchor at stride 8/16/32 |
| 3-5 | bboxes | (1, N_i, 4) | Distance predictions (left, top, right, bottom) |
| 6-8 | keypoints | (1, N_i, 10) | 5-point landmark offsets (x, y pairs) |

Where N_i = 2 * H_i * W_i (2 anchors per location), and H_i = W_i = 640 / stride_i for the default 640x640 input.
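As a quick sanity check, the per-stride anchor counts for the default 640x640 input work out as follows:

```python
# Per-stride anchor counts for the default 640x640 input
input_size = 640
anchor_counts = {}
for stride in (8, 16, 32):
    h = w = input_size // stride       # feature-map size at this stride
    anchor_counts[stride] = 2 * h * w  # 2 anchors per location
print(anchor_counts)  # {8: 12800, 16: 3200, 32: 800}
```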

Postprocessing

Anchor Generation

centers = np.stack(np.mgrid[:height, :width][::-1], axis=-1)  # (H, W, 2) grid of (x, y)
centers = (centers * stride).reshape(-1, 2)                   # scale to pixel coords
centers = np.stack([centers] * 2, axis=1).reshape(-1, 2)      # duplicate for 2 anchors

Distance-to-BBox Conversion

x1 = anchor_x - pred_left * stride
y1 = anchor_y - pred_top * stride
x2 = anchor_x + pred_right * stride
y2 = anchor_y + pred_bottom * stride
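Vectorized over all anchors, the conversion above looks like this (a minimal sketch; the function name `distance2bbox` is an assumption, and it assumes the raw predictions are in feature-map units, hence the multiply by stride):

```python
import numpy as np

def distance2bbox(centers, distances, stride):
    """Decode (left, top, right, bottom) distances into (x1, y1, x2, y2) boxes.

    Hypothetical helper. centers: (N, 2) anchor centers in pixel coords;
    distances: (N, 4) raw predictions in feature-map units.
    """
    d = distances * stride               # convert to pixel distances
    x1 = centers[:, 0] - d[:, 0]         # left
    y1 = centers[:, 1] - d[:, 1]         # top
    x2 = centers[:, 0] + d[:, 2]         # right
    y2 = centers[:, 1] + d[:, 3]         # bottom
    return np.stack([x1, y1, x2, y2], axis=-1)
```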

Non-Maximum Suppression (NMS)

Standard IoU-based NMS with configurable threshold (default 0.4). Implemented in pure numpy — sufficient for typical face counts (< 100 pre-NMS candidates).
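A pure-numpy greedy NMS of this kind can be sketched as follows (an illustrative version, not necessarily the exact one used here):

```python
import numpy as np

def nms(dets, iou_thresh=0.4):
    """Greedy IoU-based NMS. dets: (N, 5) array of [x1, y1, x2, y2, score].

    Returns indices of kept boxes, highest score first.
    """
    x1, y1, x2, y2 = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = dets[:, 4].argsort()[::-1]   # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1 + 1) * np.maximum(0.0, yy2 - yy1 + 1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```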

Post-NMS

  1. Scale bounding boxes back by 1 / det_scale
  2. Clip to original image dimensions
  3. Sort by confidence descending
  4. Apply max_faces limit
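The four post-NMS steps can be sketched in one helper (the function name `finalize` and its signature are assumptions for illustration):

```python
import numpy as np

def finalize(dets, det_scale, img_w, img_h, max_faces=None):
    """Map kept detections back to original image space, clip, sort, trim.

    Hypothetical helper. dets: (N, 5) array of [x1, y1, x2, y2, score].
    """
    dets = dets.copy()
    dets[:, :4] /= det_scale                               # undo letterbox scaling
    dets[:, [0, 2]] = dets[:, [0, 2]].clip(0, img_w - 1)   # clip x to image width
    dets[:, [1, 3]] = dets[:, [1, 3]].clip(0, img_h - 1)   # clip y to image height
    dets = dets[dets[:, 4].argsort()[::-1]]                # confidence descending
    return dets[:max_faces] if max_faces else dets
```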

Landmarks

Five facial landmarks, in order:

  1. Left eye
  2. Right eye
  3. Nose tip
  4. Left mouth corner
  5. Right mouth corner

Encoded as (x, y) offsets from anchor centers in feature-map units. Decoding mirrors the bounding-box case, except each stride-scaled offset is simply added to the anchor center (there is no sign flip as with the left/top box distances).
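The landmark decode can be sketched as follows (a minimal illustration; the function name `distance2kps` is an assumption):

```python
import numpy as np

def distance2kps(centers, kps_preds, stride):
    """Decode 5-point landmark offsets into absolute pixel coordinates.

    Hypothetical helper. centers: (N, 2) anchor centers in pixel coords;
    kps_preds: (N, 10) raw predictions, five (dx, dy) pairs in
    feature-map units.
    """
    out = kps_preds.reshape(-1, 5, 2) * stride  # scale offsets to pixels
    out = out + centers[:, None, :]             # add the (x, y) anchor center
    return out.reshape(-1, 10)
```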