Look, Listen and Learn - A Multimodal LSTM for Speaker Identification
Speaker identification refers to the task of localizing the face of the person
who has the same identity as the ongoing voice in a video. This task not only
requires collective perception over both visual and auditory signals, but also
robustness to severe quality degradations and unconstrained content
variations. In this paper, we describe a novel
multimodal Long Short-Term Memory (LSTM) architecture which seamlessly unifies
both visual and auditory modalities from the beginning of each sequence input.
The key idea is to extend the conventional LSTM by not only sharing weights
across time steps, but also sharing weights across modalities. We show that
modeling the temporal dependency across face and voice can significantly
improve the robustness to content quality degradations and variations. We also
found that our multimodal LSTM is robust to distractors, namely
non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory
dataset and showed that our system outperforms the state-of-the-art systems in
speaker identification with lower false alarm rate and higher recognition
accuracy. Comment: The 30th AAAI Conference on Artificial Intelligence (AAAI-16)
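The abstract's key idea, reusing one set of LSTM weights across both time steps and modalities, can be sketched in plain Python. This is a toy scalar-state cell with made-up weights, not the paper's implementation; all names and dimensions here are illustrative.

```python
import math

def lstm_step(x, h, c, W):
    """One LSTM step. The same weight dict W is reused for every
    time step AND for every modality (the weight-sharing idea)."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    i = sigmoid(W["wi"] * x + W["ui"] * h + W["bi"])    # input gate
    f = sigmoid(W["wf"] * x + W["uf"] * h + W["bf"])    # forget gate
    o = sigmoid(W["wo"] * x + W["uo"] * h + W["bo"])    # output gate
    g = math.tanh(W["wg"] * x + W["ug"] * h + W["bg"])  # candidate cell value
    c = f * c + i * g
    h = o * math.tanh(c)
    return h, c

def run_multimodal(face_seq, voice_seq, W):
    """Process the two aligned streams with one shared cell,
    keeping a separate (h, c) state per modality."""
    states = {"face": (0.0, 0.0), "voice": (0.0, 0.0)}
    for xf, xv in zip(face_seq, voice_seq):
        states["face"] = lstm_step(xf, *states["face"], W)
        states["voice"] = lstm_step(xv, *states["voice"], W)
    return states

# Toy scalar "features" standing in for face crops and audio frames.
W = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
out = run_multimodal([0.2, 0.4, 0.1], [0.9, 0.8, 0.7], W)
```

Because the weights are shared, gradients from both modalities would update the same parameters during training, which is how the two streams are unified "from the beginning of each sequence input."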
DocLangID: Improving Few-Shot Training to Identify the Language of Historical Documents
Language identification describes the task of recognizing the language of
written text in documents. This information is crucial because it can be used
to support the analysis of a document's vocabulary and context. Supervised
learning methods in recent years have advanced the task of language
identification. However, these methods usually require large labeled datasets,
which often need to be collected separately for different image domains, such as
documents or scene images. In this work, we propose DocLangID, a transfer
learning approach to identify the language of unlabeled historical documents.
We achieve this by first leveraging labeled data from a different but related
domain of historical documents. Secondly, we implement a distance-based
few-shot learning approach to adapt a convolutional neural network to new
languages of the unlabeled dataset. By introducing small amounts of manually
labeled examples from the set of unlabeled images, our feature extractor
develops better adaptability to new and different data distributions of
historical documents. We show that such a model can be effectively fine-tuned
for the unlabeled set of images by only reusing the same few-shot examples. We
showcase our work across 10 languages that mostly use the Latin script. Our
experiments on historical documents demonstrate that our combined approach
improves the language identification performance, achieving 74% recognition
accuracy on the four unseen languages of the unlabeled dataset. Comment: 6 pages (including references and excluding appendix)
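The distance-based few-shot step described above can be sketched as a nearest-prototype classifier: average the CNN features of the few labeled shots per language, then assign each unlabeled document to the closest prototype. This is a generic sketch; the paper's exact distance metric and feature extractor are not reproduced, and the toy 2-D vectors stand in for real embeddings.

```python
import math

def prototype(examples):
    """Mean feature vector of the few labeled shots for one language."""
    dim = len(examples[0])
    return [sum(e[d] for e in examples) / len(examples) for d in range(dim)]

def classify(feature, prototypes):
    """Assign the language whose prototype is nearest in Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(prototypes, key=lambda lang: dist(feature, prototypes[lang]))

# Hypothetical 2-D "features" standing in for CNN embeddings of document images.
support = {
    "latin_a": [[0.0, 0.1], [0.1, 0.0]],
    "latin_b": [[1.0, 0.9], [0.9, 1.0]],
}
protos = {lang: prototype(shots) for lang, shots in support.items()}
print(classify([0.05, 0.05], protos))  # → latin_a
```

Reusing the same few shots both to adapt the feature extractor and to build the prototypes mirrors the abstract's claim that the model is fine-tuned "by only reusing the same few-shot examples."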
Tracking by Prediction: A Deep Generative Model for Multi-Person Localisation and Tracking
Current multi-person localisation and tracking systems rely heavily
on appearance models for target re-identification, and almost no
approaches employ a complete deep learning solution for both objectives. We
present a novel, complete deep learning framework for multi-person localisation
and tracking. In this context we first introduce a lightweight sequential
Generative Adversarial Network architecture for person localisation, which
overcomes issues related to occlusions and noisy detections, typically found in
multi-person environments. In the proposed tracking framework we build upon
recent advances in pedestrian trajectory prediction approaches and propose a
novel data association scheme based on predicted trajectories. This removes the
need for computationally expensive person re-identification systems based on
appearance features and generates human-like trajectories with minimal
fragmentation. The proposed method is evaluated on multiple public benchmarks
including both static and dynamic cameras and achieves
outstanding performance, especially among other recently proposed deep neural
network based approaches. Comment: To appear in IEEE Winter Conference on Applications of Computer
Vision (WACV), 201
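The prediction-based data association scheme can be sketched as follows: predict each track's next position, then greedily match predictions to new detections by distance, so no appearance features are needed. The constant-velocity predictor and greedy matcher below are simplified stand-ins (the paper uses a learned trajectory predictor); all names are illustrative.

```python
def predict(track):
    """Constant-velocity prediction from the last two positions;
    a stand-in for the learned trajectory-prediction model."""
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def associate(tracks, detections):
    """Greedily match each predicted position to the closest unclaimed
    detection (cheapest pairs first), avoiding appearance features."""
    preds = {tid: predict(tr) for tid, tr in tracks.items()}
    pairs = sorted(
        ((px - dx) ** 2 + (py - dy) ** 2, tid, j)
        for tid, (px, py) in preds.items()
        for j, (dx, dy) in enumerate(detections)
    )
    assigned, used_t, used_d = {}, set(), set()
    for _, tid, j in pairs:
        if tid not in used_t and j not in used_d:
            assigned[tid] = j
            used_t.add(tid)
            used_d.add(j)
    return assigned

tracks = {"p1": [(0, 0), (1, 0)], "p2": [(0, 5), (0, 6)]}
detections = [(0, 7), (2, 0)]  # p2's next point listed first, then p1's
print(associate(tracks, detections))  # → {'p1': 1, 'p2': 0}
```

A production system would replace the greedy loop with optimal assignment (e.g. the Hungarian algorithm) and gate matches by a maximum distance, but the principle, associating by predicted motion rather than appearance, is the same.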