VGGFace2: A dataset for recognising faces across pose and age
In this paper, we introduce a new large-scale face dataset named VGGFace2.
The dataset contains 3.31 million images of 9131 subjects, with an average of
362.6 images for each subject. Images are downloaded from Google Image Search
and have large variations in pose, age, illumination, ethnicity and profession
(e.g. actors, athletes, politicians). The dataset was collected with three
goals in mind: (i) to have both a large number of identities and also a large
number of images for each identity; (ii) to cover a large range of pose, age
and ethnicity; and (iii) to minimize the label noise. We describe how the
dataset was collected, in particular the automated and manual filtering stages
to ensure a high accuracy for the images of each identity. To assess face
recognition performance using the new dataset, we train ResNet-50 (with and
without Squeeze-and-Excitation blocks) Convolutional Neural Networks on
VGGFace2, on MS-Celeb-1M, and on their union, and show that training on
VGGFace2 leads to improved recognition performance over pose and age. Finally,
using the models trained on these datasets, we demonstrate state-of-the-art
performance on all the IARPA Janus face recognition benchmarks, e.g. IJB-A,
IJB-B and IJB-C, exceeding the previous state-of-the-art by a large margin.
Datasets and models are publicly available. Comment: This paper has been accepted by the IEEE Conference on Automatic Face and
Gesture Recognition (F&G), 2018. (Oral)
Template Adaptation for Face Verification and Identification
Face recognition performance evaluation has traditionally focused on
one-to-one verification, popularized by the Labeled Faces in the Wild dataset
for imagery and the YouTubeFaces dataset for videos. In contrast, the newly
released IJB-A face recognition dataset unifies evaluation of one-to-many face
identification with one-to-one face verification over templates, or sets of
imagery and videos for a subject. In this paper, we study the problem of
template adaptation, a form of transfer learning to the set of media in a
template. Extensive performance evaluations on IJB-A show a surprising result,
that perhaps the simplest method of template adaptation, combining deep
convolutional network features with template specific linear SVMs, outperforms
the state-of-the-art by a wide margin. We study the effects of template size,
negative set construction and classifier fusion on performance, then compare
template adaptation to convolutional networks with metric learning, 2D and 3D
alignment. Our unexpected conclusion is that these other methods, when combined
with template adaptation, all achieve nearly the same top performance on IJB-A
for template-based face verification and identification.
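The abstract above names the method only as deep network features combined with template-specific linear SVMs. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the features are toy 3-D vectors rather than CNN embeddings, the SVM is trained with a plain hinge-loss subgradient loop, and the symmetric scoring rule in `template_similarity` (a hypothetical name) is an assumption about how two templates might be compared.

```python
import random

def train_linear_svm(positives, negatives, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Fit a linear SVM (hinge loss + L2 penalty) by subgradient descent."""
    rng = random.Random(seed)
    data = [(x, 1.0) for x in positives] + [(x, -1.0) for x in negatives]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 shrinkage on every step; hinge subgradient only inside the margin.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def svm_score(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def template_similarity(template_p, template_q, negatives):
    """Assumed scoring rule: score P's media with Q's SVM and vice versa, then average."""
    svm_p = train_linear_svm(template_p, negatives)
    svm_q = train_linear_svm(template_q, negatives)
    p_on_q = sum(svm_score(svm_q, x) for x in template_p) / len(template_p)
    q_on_p = sum(svm_score(svm_p, x) for x in template_q) / len(template_q)
    return 0.5 * (p_on_q + q_on_p)

# Toy features: axis 0 and axis 1 stand for two subjects, axis 2 for the negative set.
template_a = [[1.0, 0.0, 0.0], [0.9, 0.0, 0.0]]   # subject 1
template_b = [[0.8, 0.0, 0.0], [1.0, 0.0, 0.0]]   # subject 1 again
template_c = [[0.0, 0.8, 0.0], [0.0, 1.0, 0.0]]   # subject 2
negative_set = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.9]]

print("same subject:     ", template_similarity(template_a, template_b, negative_set))
print("different subject:", template_similarity(template_a, template_c, negative_set))
```

In this toy setup the same-subject pair scores higher than the different-subject pair. With real embeddings, the choice of negative set would matter far more than it does here, which is one of the factors the abstract says the paper studies.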
CrossScore: towards multi-view image evaluation and scoring
We introduce a novel cross-reference image quality assessment method that effectively fills the gap in the image assessment landscape, complementing the array of established evaluation schemes – ranging from full-reference metrics like SSIM, no-reference metrics such as NIQE, to general-reference metrics including FID, and multi-modal-reference metrics, e.g., CLIPScore. Utilising a neural network with the cross-attention mechanism and a unique data collection pipeline from NVS optimisation, our method enables accurate image quality assessment without requiring ground truth references. By comparing a query image against multiple views of the same scene, our method addresses the limitations of existing metrics in novel view synthesis (NVS) and similar tasks where direct reference images are unavailable. Experimental results show that our method is closely correlated with the full-reference metric SSIM, while not requiring ground truth references.
Modeling Multimodal Cues in a Deep Learning-based Framework for Emotion Recognition in the Wild
In this paper, we propose a multimodal deep learning architecture for emotion recognition in video, developed for our participation in the audio-video based sub-challenge of the Emotion Recognition in the Wild 2017 challenge. Our model combines cues from multiple video modalities, including static facial features, motion patterns related to the evolution of the human expression over time, and audio information. Specifically, it is composed of three sub-networks trained separately: the first and second extract static visual features and dynamic patterns through 2D and 3D Convolutional Neural Networks (CNNs), while the third consists of a pretrained audio network that is used to extract useful deep acoustic signals from video. In the audio branch, we also apply Long Short-Term Memory (LSTM) networks to capture the temporal evolution of the audio features. To identify and exploit possible relationships among different modalities, we propose a fusion network that merges cues from the different modalities into one representation. The proposed architecture outperforms the challenge baselines (38.81% and 40.47%): we achieve accuracies of 50.39% and 49.92% on the validation and testing data, respectively.
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception
We introduce the Aria Digital Twin (ADT) - an egocentric dataset captured
using Aria glasses with extensive object, environment, and human level ground
truth. This ADT release contains 200 sequences of real-world activities
conducted by Aria wearers in two real indoor scenes with 398 object instances
(324 stationary and 74 dynamic). Each sequence consists of: a) raw data of two
monochrome camera streams, one RGB camera stream, two IMU streams; b) complete
sensor calibration; c) ground truth data including continuous
6-degree-of-freedom (6DoF) poses of the Aria devices, object 6DoF poses, 3D eye
gaze vectors, 3D human poses, 2D image segmentations, image depth maps; and d)
photo-realistic synthetic renderings. To the best of our knowledge, there is no
existing egocentric dataset with a level of accuracy, photo-realism and
comprehensiveness comparable to ADT. By contributing ADT to the research
community, our mission is to set a new standard for evaluation in the
egocentric machine perception domain, which includes very challenging research
problems such as 3D object detection and tracking, scene reconstruction and
understanding, sim-to-real learning, human pose prediction - while also
inspiring new machine perception tasks for augmented reality (AR) applications.
To kick start exploration of the ADT research use cases, we evaluated several
existing state-of-the-art methods for object detection, segmentation and image
translation tasks that demonstrate the usefulness of ADT as a benchmarking
dataset.
AXES at TRECVID 2012: KIS, INS, and MED
The AXES project participated in the interactive instance search task (INS), the known-item search task (KIS), and the multimedia event detection task (MED) for TRECVid 2012. As in our TRECVid 2011 system, we used nearly identical search systems and user interfaces for both INS and KIS. Our interactive INS and KIS systems focused this year on using classifiers trained at query time with positive examples collected from external search engines. Participants in our KIS experiments were media professionals from the BBC; our INS experiments were carried out by students and researchers at Dublin City University. We performed comparatively well in both experiments. Our best KIS run found 13 of the 25 topics, and our best INS runs outperformed all other submitted runs in terms of P@100. For MED, the system presented was based on a minimal number of low-level descriptors, which we chose to be as large as computationally feasible. These descriptors are aggregated to produce high-dimensional video-level signatures, which are used to train a set of linear classifiers. Our MED system achieved the second-best score of all submitted runs in the main track, and best score in the ad-hoc track, suggesting that a simple system based on state-of-the-art low-level descriptors can give relatively high performance. This paper describes in detail our KIS, INS, and MED systems and the results and findings of our experiments.
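The P@100 figure quoted above is precision at rank k: the fraction of the top k returned items that are relevant. A minimal sketch of the measure follows. The shot identifiers and relevance judgments are made up for illustration, and the official TRECVid scoring tools may handle ties and short result lists differently.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    # Dividing by k (not len(top_k)) penalises runs that return fewer than k items.
    return hits / k

# Hypothetical ranked result list and ground-truth set for one topic.
ranked = ["shot3", "shot7", "shot1", "shot9", "shot4"]
relevant = {"shot3", "shot1", "shot8"}

print(precision_at_k(ranked, relevant, 2))  # 1 hit in the top 2 -> 0.5
print(precision_at_k(ranked, relevant, 5))  # 2 hits in the top 5 -> 0.4
```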
EgoBlur: Responsible Innovation in Aria
Project Aria pushes the frontiers of Egocentric AI with large-scale
real-world data collection using purpose-designed glasses with a privacy-first
approach. To protect the privacy of bystanders being recorded by the glasses,
our research protocols are designed to ensure recorded video is processed by an
AI anonymization model that removes bystander faces and vehicle license plates.
Detected face and license plate regions are processed with a Gaussian blur so
that these personally identifiable information (PII) regions are obscured. This
process helps to ensure that anonymized versions of the video are retained for
research purposes. In Project Aria, we have developed a state-of-the-art
anonymization system, EgoBlur. In this paper, we present an extensive analysis of
EgoBlur on challenging datasets, comparing its performance with other
state-of-the-art systems from industry and academia, including an extensive
Responsible AI analysis on the recently released Casual Conversations V2 dataset.
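The blurring step described above (a Gaussian blur applied only to detected regions) can be sketched as follows. This is an illustration under stated assumptions, not EgoBlur itself: the detector is not reproduced, the `boxes` argument is a hypothetical stand-in for its output, the image is a grayscale list of lists, and the blur samples only pixels inside each box.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian weights over [-radius, radius]."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_1d(values, kernel):
    """Convolve a 1-D sequence with the kernel, clamping indices at the ends."""
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out

def blur_regions(image, boxes, sigma=2.0):
    """Return a copy of `image` with each (x0, y0, x1, y1) box Gaussian-blurred.

    Boxes are half-open: columns x0..x1-1 and rows y0..y1-1 are blurred.
    """
    kernel = gaussian_kernel(sigma, max(1, int(3 * sigma)))
    result = [row[:] for row in image]
    for x0, y0, x1, y1 in boxes:
        # Separable blur: horizontal pass over the patch, then vertical pass.
        patch = [blur_1d(row[x0:x1], kernel) for row in result[y0:y1]]
        columns = [blur_1d(list(col), kernel) for col in zip(*patch)]
        for dx, col in enumerate(columns):
            for dy, value in enumerate(col):
                result[y0 + dy][x0 + dx] = value
    return result

# 8x8 checkerboard with one hypothetical detection covering the centre.
image = [[255.0 if (r + c) % 2 else 0.0 for c in range(8)] for r in range(8)]
blurred = blur_regions(image, [(2, 2, 6, 6)])
print(blurred[0] == image[0])   # rows outside the box are untouched
print(round(blurred[3][3], 1))  # a pixel inside the box is now a mid-grey value
```

Clamping at the box border keeps pixels outside the region from leaking into the blurred area. A production system would more likely call an optimised library routine on the cropped region.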
The AXES research video search system
We will demonstrate a multimedia content information retrieval engine developed for audiovisual digital libraries targeted at academic researchers and journalists. It is the second of three multimedia IR systems being developed by the AXES project. The system brings together traditional text IR and state-of-the-art content indexing and retrieval technologies to allow users to search and browse digital libraries in novel ways. Key features include: metadata and ASR search and filtering, on-the-fly visual concept classification (categories, faces, places, and logos), and similarity search (instances and faces).
The AXES submissions at TrecVid 2013
The AXES project participated in the interactive instance search task (INS), the semantic indexing task (SIN), the multimedia event recounting task (MER), and the multimedia event detection task (MED) for TRECVid 2013. Our interactive INS focused this year on using classifiers trained at query time with positive examples collected from external search engines. Our INS experiments were carried out by students and researchers at Dublin City University. Our best INS runs performed on par with the top ranked INS runs in terms of P@10 and P@30, and around the median in terms of mAP.
For SIN, MED and MER, we use systems based on state-of-the-art local low-level descriptors for motion, image, and sound, as well as high-level features to capture speech and text from the audio and visual streams, respectively. The low-level descriptors are aggregated by means of Fisher vectors into high-dimensional video-level signatures; the high-level features are aggregated into bag-of-word histograms. Using these features, we train linear classifiers and use early and late fusion to combine the different features. Our MED system achieved the best score of all submitted runs in the main track, as well as in the ad-hoc track.
This paper describes in detail our INS, MER, and MED systems and the results and findings of our experiments.
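The abstract above mentions early and late fusion of features with linear classifiers. The difference can be sketched as follows. This is an illustration only: the feature vectors, classifier weights, and fusion weights are made-up values, and the function names are hypothetical rather than taken from the AXES system.

```python
def linear_score(weights, bias, features):
    """Decision value of a linear classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def early_fusion_score(classifier, feature_sets):
    """Early fusion: concatenate all per-modality features, then classify once."""
    weights, bias = classifier
    fused = [x for features in feature_sets for x in features]
    return linear_score(weights, bias, fused)

def late_fusion_score(classifiers, feature_sets, fusion_weights):
    """Late fusion: classify each modality separately, then average the scores."""
    scores = [linear_score(w, b, f) for (w, b), f in zip(classifiers, feature_sets)]
    total = sum(fusion_weights)
    return sum(fw * s for fw, s in zip(fusion_weights, scores)) / total

# Hypothetical video-level signatures for three modalities.
motion, image, audio = [1.0, 2.0], [0.5], [-1.0, 0.0]
feature_sets = [motion, image, audio]

# One classifier over the concatenated 5-D vector (early fusion).
early = early_fusion_score(([0.5, 0.25, 1.0, 0.5, 0.0], 0.0), feature_sets)

# One classifier per modality, combined with weights 2:1:1 (late fusion).
per_modality = [([0.5, 0.25], 0.0), ([1.0], 0.0), ([0.5, 0.0], 0.0)]
late = late_fusion_score(per_modality, feature_sets, [2.0, 1.0, 1.0])

print(early, late)
```

Early fusion lets a single classifier weigh all dimensions jointly. Late fusion keeps the modalities independent until the final score, which makes it easy to reweight or drop one of them.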