GhostVLAD for set-based face recognition
The objective of this paper is to learn a compact representation of image
sets for template-based face recognition. We make the following contributions:
first, we propose a network architecture which aggregates and embeds the face
descriptors produced by deep convolutional neural networks into a compact
fixed-length representation. This compact representation requires minimal
memory storage and enables efficient similarity computation. Second, we propose
a novel GhostVLAD layer that includes ghost clusters, which do not
contribute to the aggregation. We show that a quality weighting on the input
faces emerges automatically such that informative images contribute more than
those with low quality, and that the ghost clusters enhance the network's
ability to deal with poor-quality images. Third, we explore how the input feature
dimension, the number of clusters and different training techniques affect the
recognition performance. Based on this analysis, we train a network that far
exceeds the state of the art on the IJB-B face recognition dataset, currently
one of the most challenging public benchmarks, surpassing the state of the art
on both the identification and verification protocols.
Comment: Accepted by ACCV 201
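The GhostVLAD aggregation described above can be sketched in a few lines of plain Python. This is a minimal, non-authoritative sketch, not the authors' implementation: each descriptor is soft-assigned over K real plus G ghost clusters, but only the K real clusters accumulate residuals, so a descriptor whose assignment mass falls on the ghost clusters contributes little to the fixed-length output. The function names, the hand-set assignment weights in the demo, and the NetVLAD-style per-cluster normalization followed by a global L2 normalization are assumptions for illustration; in the paper the parameters are learned end-to-end.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v: Sequence[float], eps: float = 1e-12) -> Vector:
    """Scale v to unit L2 norm (all-zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def ghostvlad(
    descriptors: Sequence[Sequence[float]],  # N per-image descriptors, each of size D
    centers: Sequence[Sequence[float]],      # K real cluster centres c_k, each of size D
    assign_w: Sequence[Sequence[float]],     # K + G assignment weights w_k (real first, ghosts last)
    assign_b: Sequence[float],               # K + G assignment biases b_k
) -> Vector:
    """Aggregate a variable-size set of descriptors into one fixed-length K*D vector."""
    k_real = len(centers)
    dim = len(centers[0])
    if len(assign_w) != len(assign_b) or len(assign_w) < k_real:
        raise ValueError("need one (w, b) pair per real or ghost cluster")
    residuals = [[0.0] * dim for _ in range(k_real)]
    for x in descriptors:
        # Soft-assign x over ALL K + G clusters: a_k = softmax_k(w_k . x + b_k).
        logits = [sum(wj * xj for wj, xj in zip(w, x)) + b for w, b in zip(assign_w, assign_b)]
        a = softmax(logits)
        # Accumulate residuals only for the K real clusters; ghost clusters are dropped here.
        for k in range(k_real):
            for d in range(dim):
                residuals[k][d] += a[k] * (x[d] - centers[k][d])
    # Assumed NetVLAD-style post-processing: per-cluster L2 normalization, then global L2.
    flat: Vector = []
    for row in residuals:
        flat.extend(l2_normalize(row))
    return l2_normalize(flat)


if __name__ == "__main__":
    # Hypothetical toy setup: D = 2, K = 1 real cluster, G = 1 ghost cluster.
    centers = [[0.0, 0.0]]
    assign_w = [[1.0, 0.0], [-1.0, 0.0]]  # real cluster, then ghost cluster
    assign_b = [0.0, 0.0]
    good = [2.0, 0.0]   # mostly assigned to the real cluster
    poor = [-2.0, 1.0]  # mostly assigned to the ghost cluster
    v = ghostvlad([good, poor], centers, assign_w, assign_b)
    print([round(t, 3) for t in v])  # -> [1.0, 0.009]
```

For comparison, dropping the ghost cluster (G = 0, plain NetVLAD with `assign_w = [[1.0, 0.0]]`, `assign_b = [0.0]`) gives both descriptors an assignment weight of 1, so the poor descriptor's residual cancels the good one and the output becomes [0.0, 1.0], dominated by the low-quality input.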
Face Selection for Improving Set-to-Set Face Verification
In this thesis we propose a statistical model that predicts the performance of a pre-trained face verification system by analysing the quality of its input images. The core of the proposed model is a convolutional neural network, named CNN-FQ, which classifies an input facial image as either low-quality or high-quality. The concept of quality is not defined explicitly; instead, it is learned from the mistakes the verification system makes when ranking triplets of faces. We applied CNN-FQ to set-based face verification to reduce the negative impact of low-quality faces when aggregating them into a template descriptor. On the IJB-B 1:1 face verification benchmark, using the CNN-FQ quality predictor for template aggregation leads to consistently higher recognition accuracy than previously used face quality scores.
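Quality-aware template aggregation can be illustrated with a small sketch. It is not the thesis's implementation: it assumes the template descriptor is a weighted average of L2-normalized per-image descriptors, with each image weighted by a non-negative quality score (for example, a CNN-FQ output), and that 1:1 verification compares two templates by cosine similarity. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float], eps: float = 1e-12) -> Vector:
    """Scale v to unit L2 norm (all-zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def aggregate_template(descriptors: Sequence[Sequence[float]], quality: Sequence[float]) -> Vector:
    """Quality-weighted average of per-image descriptors, returned with unit L2 norm.

    quality[i] >= 0 is a per-image score, e.g. a hypothetical probability that
    image i is high-quality. Dividing by sum(quality) is skipped because the
    final L2 normalization removes any global scale anyway.
    """
    if not descriptors or len(descriptors) != len(quality):
        raise ValueError("need one quality score per descriptor")
    if any(q < 0 for q in quality) or sum(quality) == 0:
        raise ValueError("quality scores must be non-negative and not all zero")
    dim = len(descriptors[0])
    acc = [0.0] * dim
    for f, q in zip(descriptors, quality):
        f_unit = l2_normalize(f)  # normalize each image first so only q sets its weight
        for d in range(dim):
            acc[d] += q * f_unit[d]
    return l2_normalize(acc)


def verification_score(template_a: Sequence[float], template_b: Sequence[float]) -> float:
    """Cosine similarity; both templates are already unit-norm, so a dot product suffices."""
    return sum(x * y for x, y in zip(template_a, template_b))


if __name__ == "__main__":
    # Hypothetical toy data: template A holds one clean and one degraded image.
    a_images = [[1.0, 0.0], [0.0, 1.0]]
    a_quality = [0.9, 0.1]
    b_template = aggregate_template([[1.0, 0.1]], [1.0])
    weighted = verification_score(aggregate_template(a_images, a_quality), b_template)
    uniform = verification_score(aggregate_template(a_images, [1.0, 1.0]), b_template)
    print(round(weighted, 3), round(uniform, 3))  # -> 1.0 0.774
```

In the toy example, down-weighting the degraded image keeps template A close to the clean identity direction, so its similarity to a matching template B is higher than with uniform weights. This is the kind of effect the abstract attributes to quality-aware aggregation.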
Utterance-level Aggregation For Speaker Recognition In The Wild
The objective of this paper is speaker recognition "in the wild", where
utterances may be of variable length and may also contain irrelevant signals.
Crucial elements in the design of deep networks for this task are the type of
trunk (frame-level) network and the method of temporal aggregation. We propose
a speaker recognition deep network that uses a "thin-ResNet" trunk
architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate
features across time; the whole network can be trained end-to-end. We show that
our network achieves state-of-the-art performance by a significant margin on
the VoxCeleb1 test set for speaker recognition, whilst requiring fewer
parameters than previous methods. We also investigate the effect of utterance
length on performance, and conclude that for "in the wild" data, longer
utterances are beneficial.
Comment: To appear in: International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2019. (Oral Presentation)