51,417 research outputs found
Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which so far has received no attention by the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Different to prior works on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods.Comment: Accepted at ECCV-201
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite the recent advances in opinion mining for written reviews, few works
have tackled the problem on other sources of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence these extra modalities are key in better
understanding video reviews.Comment: Second Grand Challenge and Workshop on Multimodal Language ACL 202
Symbol Emergence in Robotics: A Survey
Humans can learn the use of language through physical interaction with their
environment and semiotic communication with other people. It is very important
to obtain a computational understanding of how humans can form a symbol system
and obtain semiotic skills through their autonomous mental development.
Recently, many studies have been conducted on the construction of robotic
systems and machine-learning methods that can learn the use of language through
embodied multimodal interaction with their environment and other systems.
Understanding human social interactions and developing a robot that can
smoothly communicate with human users in the long term, requires an
understanding of the dynamics of symbol systems and is crucially important. The
embodied cognition and social interaction of participants gradually change a
symbol system in a constructive manner. In this paper, we introduce a field of
research called symbol emergence in robotics (SER). SER is a constructive
approach towards an emergent symbol system. The emergent symbol system is
socially self-organized through both semiotic communications and physical
interactions with autonomous cognitive developmental agents, i.e., humans and
developmental robots. Specifically, we describe some state-of-art research
topics concerning SER, e.g., multimodal categorization, word discovery, and a
double articulation analysis, that enable a robot to obtain words and their
embodied meanings from raw sensory--motor information, including visual
information, haptic information, auditory information, and acoustic speech
signals, in a totally unsupervised manner. Finally, we suggest future
directions of research in SER.Comment: submitted to Advanced Robotic
End-to-End Attention-based Large Vocabulary Speech Recognition
Many of the current state-of-the-art Large Vocabulary Continuous Speech
Recognition Systems (LVCSR) are hybrids of neural networks and Hidden Markov
Models (HMMs). Most of these systems contain separate components that deal with
the acoustic modelling, language modelling and sequence decoding. We
investigate a more direct approach in which the HMM is replaced with a
Recurrent Neural Network (RNN) that performs sequence prediction directly at
the character level. Alignment between the input features and the desired
character sequence is learned automatically by an attention mechanism built
into the RNN. For each predicted character, the attention mechanism scans the
input sequence and chooses relevant frames. We propose two methods to speed up
this operation: limiting the scan to a subset of most promising frames and
pooling over time the information contained in neighboring frames, thereby
reducing source sequence length. Integrating an n-gram language model into the
decoding process yields recognition accuracies similar to other HMM-free
RNN-based approaches
- …