Anti-spoofing detection has become a necessity for face recognition systems
due to the security threat posed by spoofing attacks. Although most
deep-learning-based methods handle traditional attacks well, they perform
poorly against 3D masks, which closely simulate real faces in appearance and
structure. Because these methods operate only in the spatial domain on
single-frame input, they generalize poorly. This has been mitigated by the recent
introduction of a biomedical technology called rPPG (remote
photoplethysmography). However, rPPG-based methods are sensitive to noise
and require at least one second (> 25 frames) of observation time,
which incurs high computational overhead. To address these challenges, we
propose a novel 3D mask detection framework, called FASTEN
(Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the
network to focus on fine-grained details in large movements, which can
eliminate redundant spatio-temporal feature interference and quickly capture
the splicing traces of 3D masks in fewer frames. Our proposed network contains
three key modules: 1) a facial optical flow network to obtain non-RGB
inter-frame flow information; 2) flow attention to assign different
significance to each frame; 3) spatio-temporal aggregation to aggregate
high-level spatial features and temporal transition features. Extensive
experiments show that FASTEN requires only five frames of input and outperforms
eight competitors in both intra-dataset and cross-dataset evaluations across
multiple detection metrics. Moreover, FASTEN has been deployed on real-world
mobile devices for practical 3D mask detection.

Comment: 13 pages, 5 figures. Accepted to NeurIPS 202
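
The abstract gives no formulas for the flow-attention step, so the following is only a minimal sketch of the general idea of assigning a different significance to each frame and then aggregating over time. It assumes one scalar motion score per frame and a softmax weighting. Both are illustrative assumptions, not the authors' method, and all names and values are hypothetical.

```python
import math


def flow_attention_weights(motion_scores):
    """Turn per-frame motion scores into weights that sum to 1.

    Assumption: a plain softmax is used as the scoring rule; the
    abstract does not specify how FASTEN computes frame significance.
    """
    peak = max(motion_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in motion_scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(frame_features, weights):
    """Weighted sum of per-frame feature vectors into one clip-level vector."""
    dim = len(frame_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, frame_features))
        for i in range(dim)
    ]


# Five frames, matching the input length stated in the abstract.
# The scores and features below are made-up placeholder values.
motion_scores = [0.1, 0.4, 2.0, 0.3, 0.2]
frame_features = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.5, 0.5],
    [0.0, 0.0],
]

weights = flow_attention_weights(motion_scores)
clip_feature = aggregate(frame_features, weights)
print(weights)
print(clip_feature)
```

In this sketch the frame with the largest motion score receives the largest weight, so it dominates the aggregated clip-level feature. A real system would learn the scoring function rather than use raw scores.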