Audio-visual person recognition (AVPR) has received extensive attention.
However, most datasets used for AVPR research so far have been collected in
constrained environments, and thus cannot reflect the true performance of AVPR
systems in real-world scenarios. To meet the demand for research on AVPR in
unconstrained conditions, this paper presents a multi-genre AVPR dataset
collected `in the wild', named CN-Celeb-AV. This dataset contains more than
419k video segments of 1,136 persons, collected from public media. In particular,
we emphasize two real-world complexities: (1) data in multiple genres; (2)
segments with partial information. A comprehensive study was conducted to
compare CN-Celeb-AV with two popular public AVPR benchmark datasets, and the
results demonstrated that CN-Celeb-AV is more in line with real-world scenarios
and can be regarded as a new benchmark dataset for AVPR research. The dataset
also includes a development set that can be used to boost the performance of
AVPR systems in real-life situations. The dataset is free for researchers and
can be downloaded from http://cnceleb.org/.
Comment: INTERSPEECH 202