VoxCeleb2: Deep Speaker Recognition
The objective of this paper is speaker recognition under noisy and
unconstrained conditions.
We make two key contributions. First, we introduce a very large-scale
audio-visual speaker recognition dataset collected from open-source media.
Using a fully automated pipeline, we curate VoxCeleb2 which contains over a
million utterances from over 6,000 speakers. This is several times larger than
any publicly available speaker recognition dataset.
Second, we develop and compare Convolutional Neural Network (CNN) models and
training strategies that can effectively recognise identities from voice under
various conditions. The models trained on the VoxCeleb2 dataset surpass the
performance of previous works on a benchmark dataset by a significant margin.Comment: To appear in Interspeech 2018. The audio-visual dataset can be
downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 .
1806.05622v2: minor fixes; 5 page
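The verification side of such a system typically maps each utterance to a fixed-dimensional embedding with a CNN and then compares embeddings. A minimal sketch of the comparison step, assuming cosine scoring and an illustrative threshold (neither taken from the paper):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two speaker embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb1, emb2, threshold=0.7):
    # Verification decision: accept the pair if similarity clears a threshold.
    # The threshold here is illustrative; in practice it is tuned on a
    # held-out trial list.
    return cosine_similarity(emb1, emb2) >= threshold
```

Trained well, embeddings of the same speaker cluster tightly, so a single similarity threshold separates same-speaker from different-speaker pairs.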
Cross-view Semantic Alignment for Livestreaming Product Recognition
Live commerce is the act of selling products online through live streaming.
The customer's diverse demands for online products introduce more challenges to
Livestreaming Product Recognition. Previous works have primarily focused on
fashion clothing data or utilized single-modal input, which does not reflect
the real-world scenario where multimodal data from various categories are present.
In this paper, we present LPR4M, a large-scale multimodal dataset that covers
34 categories, comprises 3 modalities (image, video, and text), and is 50x
larger than the largest publicly available dataset. LPR4M contains diverse
videos and noisy modality pairs while exhibiting a long-tailed distribution,
resembling real-world problems. Moreover, a cRoss-vIew semantiC alignmEnt
(RICE) model is proposed to learn discriminative instance features from the
image and video views of the products. This is achieved through instance-level
contrastive learning and cross-view patch-level feature propagation. A novel
Patch Feature Reconstruction loss is proposed to penalize the semantic
misalignment between cross-view patches. Extensive experiments demonstrate the
effectiveness of RICE and provide insights into the importance of dataset
diversity and expressivity. The dataset and code are available at
https://github.com/adxcreative/RICE
Comment: Accepted to ICCV2023
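Instance-level contrastive learning of the kind RICE uses can be illustrated with an InfoNCE-style loss that pulls an image/video pair of the same product together while pushing other products away. The function below is a generic plain-Python sketch, not the paper's implementation:

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    # InfoNCE-style contrastive loss: maximize the anchor's similarity to
    # its positive (same instance, other view) relative to the negatives
    # (other instances). Lower loss means better alignment.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Positive logit first, then one logit per negative.
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]
    # Numerically stable cross-entropy with the positive as the target.
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

When the positive pair is already well aligned the loss is near zero; a misaligned pair is penalized heavily, which is what drives cross-view features of the same product together.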
k-Same-Siamese-GAN: k-Same Algorithm with Generative Adversarial Network for Facial Image De-identification with Hyperparameter Tuning and Mixed Precision Training
For a data holder, such as a hospital or a government entity, that maintains a
private collection of personal data in which the revealing and/or processing of
personally identifiable data is restricted or prohibited by law, a challenging
question arises: "how can we ensure the data holder conceals the identity of
each individual in the imagery while still preserving certain useful aspects of
the data after de-identification?"
In this work, we propose an approach towards high-resolution facial image
de-identification, called k-Same-Siamese-GAN, which leverages the
k-Same-Anonymity mechanism, the Generative Adversarial Network, and the
hyperparameter tuning methods. Moreover, to speed up model training and reduce
memory consumption, the mixed precision training technique is also applied to
make kSS-GAN provide guarantees regarding privacy protection on closed-form
identities and be trained much more efficiently as well. Finally, to validate
its applicability, the proposed work has been applied to actual datasets - RafD
and CelebA for performance testing. Besides protecting the privacy of
high-resolution facial images, the proposed system is also validated for its
ability to automate hyperparameter tuning and to overcome the limit on the
number of adjustable parameters.
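The k-Same-Anonymity mechanism that kSS-GAN builds on replaces each record with an aggregate of its k nearest records, so any released item corresponds to at least k originals. A toy sketch on feature vectors (names and the averaging scheme here are illustrative; the paper realizes the idea at high resolution through a GAN):

```python
def k_same(vectors, k):
    # k-Same anonymization sketch: replace each record with the centroid
    # of its k nearest records (itself included). Every output therefore
    # matches at least k inputs, bounding re-identification probability
    # at 1/k.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    anonymized = []
    for v in vectors:
        # k nearest records to v, including v itself.
        nearest = sorted(vectors, key=lambda u: sq_dist(v, u))[:k]
        # Component-wise centroid of the k nearest records.
        centroid = [sum(col) / k for col in zip(*nearest)]
        anonymized.append(centroid)
    return anonymized
```

On face images this naive averaging produces blurry composites, which is precisely the quality gap the GAN in kSS-GAN is meant to close while keeping the k-anonymity guarantee.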
Minimum margin loss for deep face recognition
Face recognition has achieved great progress owing to the fast development of
deep neural networks in the past few years. As an important part of deep
neural networks, a number of loss functions have been proposed that
significantly improve on the state-of-the-art methods. In this paper, we
propose a new loss function called Minimum Margin Loss (MML), which aims at
enlarging the margin between overly close class centre pairs so as to enhance
the discriminative ability of the deep features. MML supervises the training
process together with the Softmax Loss and the Centre Loss, and also
compensates for the shortcomings of Softmax + Centre Loss. Experimental
results on the MegaFace, LFW and YTF datasets show that the proposed method
achieves state-of-the-art performance, which demonstrates the effectiveness of
the proposed MML.
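The core idea of penalizing overly close class centre pairs can be sketched as a margin penalty averaged over the violating pairwise centre distances. This is a simplified plain-Python illustration; the exact formulation in the paper may differ:

```python
import math

def minimum_margin_loss(centres, margin):
    # Penalize every pair of class centres whose Euclidean distance falls
    # below the margin, pushing overly close centres apart. Pairs already
    # separated by at least the margin contribute nothing.
    loss = 0.0
    violations = 0
    for i in range(len(centres)):
        for j in range(i + 1, len(centres)):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(centres[i], centres[j])))
            if d < margin:
                loss += margin - d
                violations += 1
    return loss / violations if violations else 0.0
```

Used alongside Softmax and Centre Loss, such a term supplies the inter-class separation pressure that Centre Loss (which only compacts each class) lacks.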
QDTrack: Quasi-Dense Similarity Learning for Appearance-Only Multiple Object Tracking
Similarity learning has been recognized as a crucial step for object
tracking. However, existing multiple object tracking methods only use sparse
ground truth matching as the training objective, while ignoring the majority of
the informative regions in images. In this paper, we present Quasi-Dense
Similarity Learning, which densely samples hundreds of object regions on a pair
of images for contrastive learning. We combine this similarity learning with
multiple existing object detectors to build Quasi-Dense Tracking (QDTrack),
which does not require displacement regression or motion priors. We find that
the resulting distinctive feature space admits a simple nearest neighbor search
at inference time for object association. In addition, we show that our
similarity learning scheme is not limited to video data, but can learn
effective instance similarity even from static input, enabling a competitive
tracking performance without training on videos or using tracking supervision.
We conduct extensive experiments on a wide variety of popular MOT benchmarks.
We find that, despite its simplicity, QDTrack rivals the performance of
state-of-the-art tracking methods on all benchmarks and sets a new
state-of-the-art on the large-scale BDD100K MOT benchmark, while introducing
negligible computational overhead to the detector.
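Appearance-only association at inference reduces to matching each detection embedding to its most similar track embedding. Below is a greedy nearest-neighbour sketch with an illustrative similarity threshold; QDTrack's actual association uses a bi-directional softmax, so this is a simplification:

```python
import math

def associate(track_embs, det_embs, sim_threshold=0.5):
    # Greedy nearest-neighbour association: match each detection to the
    # most similar unused track, provided the cosine similarity clears
    # the threshold. Returns {detection index: track index}.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    matches = {}
    used_tracks = set()
    for d_idx, d in enumerate(det_embs):
        best, best_sim = None, sim_threshold
        for t_idx, t in enumerate(track_embs):
            if t_idx in used_tracks:
                continue
            s = cos(d, t)
            if s > best_sim:
                best, best_sim = t_idx, s
        if best is not None:
            matches[d_idx] = best
            used_tracks.add(best)
    return matches
```

Because the quasi-dense training yields a distinctive feature space, this kind of similarity lookup suffices without any displacement regression or motion prior.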
Deep Active Learning for Computer Vision: Past and Future
As an important data selection schema, active learning emerges as an essential
component when iterating an Artificial Intelligence (AI) model. It becomes
even more critical given the dominance in application of deep neural network
based models, which contain large numbers of parameters and are data hungry.
Despite its indispensable role in developing AI models, research
on active learning is not as intensive as other research directions. In this
paper, we present a review of active learning through deep active learning
approaches from the following perspectives: 1) technical advancements in active
learning, 2) applications of active learning in computer vision, 3) industrial
systems leveraging or with potential to leverage active learning for data
iteration, 4) current limitations and future research directions. We expect
this paper to clarify the significance of active learning in a modern AI model
manufacturing process and to bring additional research attention to active
learning. By addressing data automation challenges and coping with automated
machine learning systems, active learning will facilitate democratization of AI
technologies by boosting model production at scale.
Comment: Accepted by APSIPA Transactions on Signal and Information Processing
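A classic data selection strategy in active learning is uncertainty sampling: rank unlabeled examples by predictive entropy and send the most uncertain ones to annotators first. A minimal sketch (function and parameter names are illustrative):

```python
import math

def select_most_uncertain(predictions, budget):
    # Uncertainty sampling: rank unlabeled examples by the entropy of the
    # model's predicted class distribution and return the indices of the
    # `budget` most uncertain examples for labeling.
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:budget]
```

In a data-iteration loop, the selected indices are labeled, added to the training set, the model is retrained, and the cycle repeats until the labeling budget is exhausted.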
Towards Open World Object Detection
Humans have a natural instinct to identify unknown object instances in their
environments. The intrinsic curiosity about these unknown instances aids in
learning about them, when the corresponding knowledge is eventually available.
This motivates us to propose a novel computer vision problem called: `Open
World Object Detection', where a model is tasked to: 1) identify objects that
have not been introduced to it as `unknown', without explicit supervision to do
so, and 2) incrementally learn these identified unknown categories without
forgetting previously learned classes, when the corresponding labels are
progressively received. We formulate the problem, introduce a strong evaluation
protocol and provide a novel solution, which we call ORE: Open World Object
Detector, based on contrastive clustering and energy based unknown
identification. Our experimental evaluation and ablation studies analyze the
efficacy of ORE in achieving Open World objectives. As an interesting
by-product, we find that identifying and characterizing unknown instances helps
to reduce confusion in an incremental object detection setting, where we
achieve state-of-the-art performance, with no extra methodological effort. We
hope that our work will attract further research into this newly identified,
yet crucial research direction.
Comment: To appear in CVPR 2021 as an ORAL paper. Code is available at
https://github.com/JosephKJ/OWO
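Energy-based unknown identification scores an input by the free energy of its class logits: inputs resembling known classes yield low energy, unfamiliar inputs yield high energy. ORE fits distributions over these energies; the sketch below substitutes a simple fixed threshold, so it is an illustration of the scoring idea rather than the paper's method:

```python
import math

def energy_score(logits, temperature=1.0):
    # Free energy of a set of class logits: -T * logsumexp(logits / T).
    # Confident known-class predictions (one large logit) give low energy;
    # flat, uncertain logits give higher energy.
    m = max(l / temperature for l in logits)
    return -temperature * (m + math.log(sum(math.exp(l / temperature - m)
                                            for l in logits)))

def is_unknown(logits, energy_threshold):
    # Flag an input as an unknown instance when its energy exceeds the
    # threshold (the threshold here is a stand-in for ORE's learned
    # energy distributions).
    return energy_score(logits) > energy_threshold
```

Detections flagged as unknown can then be set aside and folded into the label space once their labels are progressively received, which is the incremental-learning half of the open-world protocol.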