592 research outputs found
Exploring Object Relation in Mean Teacher for Cross-Domain Detection
Rendering synthetic data (e.g., 3D CAD-rendered images) to generate
annotations for learning deep models in vision tasks has attracted increasing
attention in recent years. However, simply applying the models learnt on
synthetic images may lead to high generalization error on real images due to
domain shift. To address this issue, recent progress in cross-domain
recognition has featured the Mean Teacher, which directly simulates
unsupervised domain adaptation as semi-supervised learning. The domain gap is
thus naturally bridged with consistency regularization in a teacher-student
scheme. In this work, we advance this Mean Teacher paradigm to be applicable
for cross-domain detection. Specifically, we present Mean Teacher with Object
Relations (MTOR) that novelly remolds Mean Teacher under the backbone of Faster
R-CNN by integrating the object relations into the measure of consistency cost
between teacher and student modules. Technically, MTOR firstly learns
relational graphs that capture similarities between pairs of regions for
teacher and student respectively. The whole architecture is then optimized with
three consistency regularizations: 1) region-level consistency to align the
region-level predictions between teacher and student, 2) inter-graph
consistency for matching the graph structures between teacher and student, and
3) intra-graph consistency to enhance the similarity between regions of same
class within the graph of student. Extensive experiments are conducted on the
transfers across Cityscapes, Foggy Cityscapes, and SIM10k, and superior results
are reported when comparing to state-of-the-art approaches. More remarkably, we
obtain a new record of single model: 22.8% of mAP on Syn2Real detection
dataset.Comment: CVPR 2019; The codes and model of our MTOR are publicly available at:
https://github.com/caiqi/mean-teacher-cross-domain-detectio
Multimodal Grounding for Sequence-to-Sequence Speech Recognition
Humans are capable of processing speech by making use of multiple sensory
modalities. For example, the environment where a conversation takes place
generally provides semantic and/or acoustic context that helps us to resolve
ambiguities or to recall named entities. Motivated by this, there have been
many works studying the integration of visual information into the speech
recognition pipeline. Specifically, in our previous work, we propose a
multistep visual adaptive training approach which improves the accuracy of an
audio-based Automatic Speech Recognition (ASR) system. This approach, however,
is not end-to-end as it requires fine-tuning the whole model with an adaptation
layer. In this paper, we propose novel end-to-end multimodal ASR systems and
compare them to the adaptive approach by using a range of visual
representations obtained from state-of-the-art convolutional neural networks.
We show that adaptive training is effective for S2S models leading to an
absolute improvement of 1.4% in word error rate. As for the end-to-end systems,
although they perform better than baseline, the improvements are slightly less
than adaptive training, 0.8 absolute WER reduction in single-best models. Using
ensemble decoding, end-to-end models reach a WER of 15% which is the lowest
score among all systems.Comment: ICASSP 201
Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles
In the dynamic realm of deepfake detection, this work presents an innovative
approach to validate video content. The methodology blends advanced
2-dimensional and 3-dimensional Convolutional Neural Networks. The 3D model is
uniquely tailored to capture spatiotemporal features via sliding filters,
extending through both spatial and temporal dimensions. This configuration
enables nuanced pattern recognition in pixel arrangement and temporal evolution
across frames. Simultaneously, the 2D model leverages EfficientNet
architecture, harnessing auto-scaling in Convolutional Neural Networks.
Notably, this ensemble integrates Voting Ensembles and Adaptive Weighted
Ensembling. Strategic prioritization of the 3-dimensional model's output
capitalizes on its exceptional spatio-temporal feature extraction. Experimental
validation underscores the effectiveness of this strategy, showcasing its
potential in countering deepfake generation's deceptive practices.Comment: 6 pages, 2 figure
Unsupervised Learning of Visual Representations using Videos
Is strong supervision necessary for learning a good visual representation? Do
we really need millions of semantically-labeled images to train a Convolutional
Neural Network (CNN)? In this paper, we present a simple yet surprisingly
powerful approach for unsupervised learning of CNN. Specifically, we use
hundreds of thousands of unlabeled videos from the web to learn visual
representations. Our key idea is that visual tracking provides the supervision.
That is, two patches connected by a track should have similar visual
representation in deep feature space since they probably belong to the same
object or object part. We design a Siamese-triplet network with a ranking loss
function to train this CNN representation. Without using a single image from
ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train
an ensemble of unsupervised networks that achieves 52% mAP (no bounding box
regression). This performance comes tantalizingly close to its
ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We
also show that our unsupervised network can perform competitively in other
tasks such as surface-normal estimation
- …