A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite the recent advances in opinion mining for written reviews, few works
have tackled the problem on other sources of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence that these extra modalities are key to a better
understanding of video reviews.
Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 202
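The abstract describes sentence-level fusion of text, audio, and video features with aspect and sentiment prediction, but gives no implementation details. The following is a minimal illustrative sketch of such a late-fusion setup; the feature dimensions, layer widths, and label counts are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: late fusion of text/audio/video features for
# sentence-level aspect and sentiment prediction. All sizes are assumptions.
import torch
import torch.nn as nn

class LateFusionOpinionModel(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, video_dim=512,
                 hidden=256, n_aspects=10, n_polarities=3):
        super().__init__()
        # One small encoder per modality; a real system would use pretrained
        # text/audio/video encoders rather than single linear layers.
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        # Two heads over the fused representation: which aspect is discussed,
        # and the sentiment orientation towards it.
        self.aspect_head = nn.Linear(3 * hidden, n_aspects)
        self.sentiment_head = nn.Linear(3 * hidden, n_polarities)

    def forward(self, text_feats, audio_feats, video_feats):
        fused = torch.cat([self.text_enc(text_feats),
                           self.audio_enc(audio_feats),
                           self.video_enc(video_feats)], dim=-1)
        return self.aspect_head(fused), self.sentiment_head(fused)

# One sentence-level example with random features standing in for real inputs.
model = LateFusionOpinionModel()
aspect_logits, sentiment_logits = model(torch.randn(1, 768),
                                         torch.randn(1, 128),
                                         torch.randn(1, 512))
print(aspect_logits.shape, sentiment_logits.shape)  # (1, 10) (1, 3)
```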
Crossmodal Attentive Skill Learner
This paper presents the Crossmodal Attentive Skill Learner (CASL), integrated
with the recently-introduced Asynchronous Advantage Option-Critic (A2OC)
architecture [Harb et al., 2017] to enable hierarchical reinforcement learning
across multiple sensory inputs. We provide concrete examples where the approach
not only improves performance in a single task, but accelerates transfer to new
tasks. We demonstrate the attention mechanism anticipates and identifies useful
latent features, while filtering irrelevant sensor modalities during execution.
We modify the Arcade Learning Environment [Bellemare et al., 2013] to support
audio queries, and conduct evaluations of crossmodal learning in the Atari 2600
game Amidar. Finally, building on the recent work of Babaeizadeh et al. [2017],
we open-source a fast hybrid CPU-GPU implementation of CASL.
Comment: International Conference on Autonomous Agents and Multiagent Systems
(AAMAS) 2018; NIPS 2017 Deep Reinforcement Learning Symposium
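The abstract centers on attending over multiple sensory inputs so that irrelevant modalities are effectively filtered. As an illustration only, the sketch below shows soft attention over per-modality feature vectors; the dimensions and the scoring network are assumptions and do not reproduce the A2OC/CASL architecture.

```python
# Illustrative sketch only: soft attention that weights (and thereby filters)
# per-modality feature vectors before they reach a downstream policy head.
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # One scalar relevance score per modality feature vector.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, modality_feats):
        # modality_feats: (batch, n_modalities, feat_dim), e.g. video and audio.
        weights = torch.softmax(self.score(modality_feats), dim=1)  # (B, M, 1)
        fused = (weights * modality_feats).sum(dim=1)               # (B, feat_dim)
        return fused, weights.squeeze(-1)

attn = ModalityAttention()
video = torch.randn(4, 1, 64)   # placeholder video features
audio = torch.randn(4, 1, 64)   # placeholder audio features
fused, weights = attn(torch.cat([video, audio], dim=1))
print(fused.shape, weights.shape)  # torch.Size([4, 64]) torch.Size([4, 2])
```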
Listening while Speaking and Visualizing: Improving ASR through Multimodal Chain
Previously, a machine speech chain, which is based on sequence-to-sequence
deep learning, was proposed to mimic speech perception and production behavior.
Such chains separately processed listening and speaking by automatic speech
recognition (ASR) and text-to-speech synthesis (TTS) and simultaneously enabled
them to teach each other in semi-supervised learning when they received
unpaired data. Unfortunately, this speech chain study is limited to the speech
and textual modalities, whereas natural communication is multimodal and
involves both the auditory and visual sensory systems. Moreover, although the
speech chain reduces the requirement for fully paired data, it still needs a
large amount of unpaired data. In this research, we take a further step and
construct a multimodal chain, designing a closely knit architecture that
combines ASR, TTS, image captioning, and image production models into a single
framework. The framework allows each component to be trained without requiring
a large amount of parallel multimodal data. Our experimental results also show
that the ASR can be further trained without speech and text data, and that
cross-modal data augmentation remains possible through our proposed chain,
improving ASR performance.
Comment: Accepted in IEEE ASRU 201
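The abstract describes a closed-loop chain in which components supervise one another from unpaired data, without giving the training procedure. The sketch below only illustrates that data flow; the ASR, TTS, and captioning models are stand-in linear maps over fixed-size feature vectors, and the losses and optimizer choices are assumptions, not the paper's method.

```python
# Illustrative sketch only: one semi-supervised "chain" update in which each
# component provides a training signal for another from unpaired inputs.
import torch
import torch.nn as nn

dim = 32
asr = nn.Linear(dim, dim)        # stand-in: speech features -> text features
tts = nn.Linear(dim, dim)        # stand-in: text features   -> speech features
captioner = nn.Linear(dim, dim)  # stand-in: image features  -> text features
opt = torch.optim.Adam(list(asr.parameters()) + list(tts.parameters()), lr=1e-3)
mse = nn.MSELoss()

def chain_step(unpaired_speech=None, unpaired_text=None, unpaired_image=None):
    """One update: components teach each other without parallel data."""
    loss = torch.tensor(0.0)
    if unpaired_speech is not None:
        # ASR transcribes speech; TTS reconstructs it; the reconstruction
        # error is a training signal obtained without paired text.
        loss = loss + mse(tts(asr(unpaired_speech)), unpaired_speech)
    if unpaired_text is not None:
        # TTS synthesizes speech; ASR transcribes it back, training ASR.
        loss = loss + mse(asr(tts(unpaired_text)), unpaired_text)
    if unpaired_image is not None:
        # A caption generated from an image gives the speech chain extra
        # pseudo-text to cycle through, enabling cross-modal augmentation.
        pseudo_text = captioner(unpaired_image)
        loss = loss + mse(asr(tts(pseudo_text)), pseudo_text)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(chain_step(unpaired_speech=torch.randn(8, dim),
                 unpaired_text=torch.randn(8, dim),
                 unpaired_image=torch.randn(8, dim)))
```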