EmbraceNet for Activity: A Deep Multimodal Fusion Architecture for Activity Recognition
Human activity recognition using multiple sensors has been a challenging but
promising task in recent decades. In this paper, we propose a deep multimodal
fusion model for activity recognition based on the recently proposed feature
fusion architecture named EmbraceNet. Our model processes each sensor data
independently, combines the features with the EmbraceNet architecture, and
post-processes the fused feature to predict the activity. In addition, we
propose additional processes to boost the performance of our model. We submit
the results obtained from our proposed model to the SHL recognition challenge
with the team name "Yonsei-MCML."
Comment: Accepted in HASCA at ACM UbiComp/ISWC 2019; won second place in the
SHL Recognition Challenge 2019.
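The pipeline this abstract describes (independent per-sensor processing, EmbraceNet-style fusion, then classification) can be illustrated with a minimal NumPy forward pass. This is a rough sketch, not the authors' implementation: the sensor names, dimensions, and weights are hypothetical, and EmbraceNet's learned docking layers and modality-dropout details are omitted. The core idea shown is the "embracement" step, in which each index of the fused vector is filled by sampling one modality's docked feature value.

```python
import numpy as np

rng = np.random.default_rng(0)

def dock(x, W):
    # Docking layer (sketch): project each modality to a common dimension.
    return np.maximum(0.0, x @ W)

def embrace(docked, rng):
    # Embracement layer (sketch): for each feature index, pick one
    # modality's docked value via sampling (equal probabilities here).
    stacked = np.stack(docked)              # (n_modalities, d)
    n_mod, d = stacked.shape
    choice = rng.integers(0, n_mod, size=d)
    return stacked[choice, np.arange(d)]    # fused feature, shape (d,)

# Two hypothetical sensor modalities with different raw dimensions
x_acc, x_gyro = rng.normal(size=16), rng.normal(size=24)
W_acc, W_gyro = rng.normal(size=(16, 8)), rng.normal(size=(24, 8))
fused = embrace([dock(x_acc, W_acc), dock(x_gyro, W_gyro)], rng)
```

A classifier head would then map `fused` to activity logits; in the full model the sampling probabilities and docking weights are trained end to end.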
CentralNet: a Multilayer Approach for Multimodal Fusion
This paper proposes a novel multimodal fusion approach, aiming to produce the
best possible decisions by integrating information coming from multiple media.
While most of the past multimodal approaches either work by projecting the
features of different modalities into the same space, or by coordinating the
representations of each modality through the use of constraints, our approach
borrows from both visions. More specifically, assuming each modality can be
processed by a separate deep convolutional network, allowing decisions to be
taken independently for each modality, we introduce a central network linking
the modality-specific networks. This central network not only provides a common
feature embedding but also regularizes the modality-specific networks through
the use of multi-task learning. The proposed approach is validated on four
different computer vision tasks, on which it consistently improves the accuracy
of existing multimodal fusion approaches.
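The layer-wise fusion rule at the heart of this approach, where the next central hidden state is a learned weighted sum of the previous central state and each modality branch's hidden state at the same depth, can be sketched as follows. The branch names and fusion weights are illustrative, and the per-layer transformations of each branch are omitted:

```python
import numpy as np

def central_step(h_central, h_modalities, alphas):
    # CentralNet-style fusion (sketch): the next central representation
    # is a weighted sum of the previous central hidden state and each
    # modality branch's hidden state at the same depth; the alphas are
    # learned scalars in the actual model.
    out = alphas[0] * h_central
    for a, h in zip(alphas[1:], h_modalities):
        out = out + a * h
    return out

h_c = np.zeros(4)          # initial central state
h_a = np.ones(4)           # e.g. an audio-branch hidden state (assumed)
h_v = np.full(4, 2.0)      # e.g. a visual-branch hidden state (assumed)
fused = central_step(h_c, [h_a, h_v], alphas=[0.5, 0.3, 0.2])
```

Because each branch still produces its own decision, the central network acts as a multi-task regularizer rather than replacing the per-modality predictors.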
AJILE Movement Prediction: Multimodal Deep Learning for Natural Human Neural Recordings and Video
Developing useful interfaces between brains and machines is a grand challenge
of neuroengineering. An effective interface has the capacity to not only
interpret neural signals, but predict the intentions of the human to perform an
action in the near future; prediction is made even more challenging outside
well-controlled laboratory experiments. This paper describes our approach to
detecting and predicting natural human arm movements before they occur, a key
challenge in brain-computer interfacing that has never before been attempted.
We introduce the novel Annotated Joints in Long-term ECoG (AJILE) dataset;
AJILE includes automatically annotated poses of 7 upper body joints for four
human subjects over 670 total hours (more than 72 million frames), along with
the corresponding simultaneously acquired intracranial neural recordings. The
size and scope of AJILE greatly exceeds all previous datasets with movements
and electrocorticography (ECoG), making it possible to take a deep learning
approach to movement prediction. We propose a multimodal model that combines
deep convolutional neural networks (CNN) with long short-term memory (LSTM)
blocks, leveraging both ECoG and video modalities. We demonstrate that our
models are able to detect movements and predict future movements up to 800 msec
before movement initiation. Further, our multimodal movement prediction models
exhibit resilience to simulated ablation of input neural signals. We believe a
multimodal approach to natural neural decoding that takes context into account
is critical in advancing bioelectronic technologies and human neuroscience.
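A minimal sketch of the kind of CNN+LSTM pipeline this abstract describes, under assumed feature sizes: per-timestep ECoG and video features are concatenated and fed through a single NumPy LSTM cell, with a toy movement-onset probability read out from the final hidden state. The CNN feature extractors are stubbed out as random vectors, and all names and dimensions are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    # One LSTM step over a fused feature vector x (ECoG features
    # concatenated with video features); gates stacked as [i, f, o, g].
    z = W @ x + U @ h + b
    d = h.size
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    g = np.tanh(z[3*d:])
    c = f * c + i * g
    return np.tanh(c) * o, c

rng = np.random.default_rng(1)
d_in, d_h, T = 12, 6, 5                  # hypothetical sizes
W = rng.normal(scale=0.1, size=(4*d_h, d_in))
U = rng.normal(scale=0.1, size=(4*d_h, d_h))
b = np.zeros(4*d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for _ in range(T):
    # Stand-ins for per-timestep CNN features from each modality
    ecog, video = rng.normal(size=8), rng.normal(size=4)
    h, c = lstm_step(np.concatenate([ecog, video]), h, c, W, U, b)
p_move = sigmoid(h.sum())                # toy movement-onset probability
```

The resilience to ablated neural input reported in the abstract follows naturally from this design: if the ECoG half of the fused vector is zeroed, the video half still drives the recurrent state.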
Multispectral Deep Neural Networks for Pedestrian Detection
Multispectral pedestrian detection is essential for around-the-clock
applications, e.g., surveillance and autonomous driving. We thoroughly analyze
Faster R-CNN for the multispectral pedestrian detection task and then model it
as a convolutional network (ConvNet) fusion problem. Further, we discover that
ConvNet-based pedestrian detectors trained by color or thermal images
separately provide complementary information in discriminating human instances.
Thus, there is great potential to improve pedestrian detection by using color
and thermal images in DNNs simultaneously. We carefully design four ConvNet
fusion architectures that integrate the two-branch ConvNets at different DNN
stages, all of which yield better performance compared with the baseline
detector. Our experimental results on the KAIST pedestrian benchmark show that
the Halfway Fusion model, which performs fusion on middle-level convolutional
features, outperforms the baseline method by 11% and yields a miss rate 3.5%
lower than the other proposed architectures.
Comment: 13 pages, 8 figures, BMVC 2016 oral
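The Halfway Fusion idea, concatenating the two branches' middle-level convolutional feature maps along the channel axis and reducing them with a 1x1 convolution (network-in-network), can be sketched in NumPy. The channel counts and spatial size below are hypothetical, and the surrounding detector layers are omitted:

```python
import numpy as np

def conv1x1(x, W):
    # 1x1 convolution as a per-pixel linear map over channels:
    # (c_out, c_in) applied to (c_in, H, W) -> (c_out, H, W)
    return np.tensordot(W, x, axes=([1], [0]))

rng = np.random.default_rng(2)
feat_color = rng.normal(size=(32, 8, 8))    # mid-level color-branch features
feat_thermal = rng.normal(size=(32, 8, 8))  # mid-level thermal-branch features
stacked = np.concatenate([feat_color, feat_thermal], axis=0)  # channel concat
W = rng.normal(scale=0.1, size=(32, 64))    # 1x1 conv reducing 64 -> 32 channels
fused = conv1x1(stacked, W)
```

Fusing at this middle stage lets each branch first learn modality-specific low-level filters while still sharing the high-level detection head.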
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite the recent advances in opinion mining for written reviews, few works
have tackled the problem on other sources of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence that these extra modalities are key to better
understanding video reviews.
Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 2020
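A sentence-level early-fusion baseline in the spirit of this approach, concatenating per-sentence text, audio, and video features and applying separate linear heads for aspect and sentiment, can be sketched as follows. The feature dimensions, class counts, and weights are hypothetical, not the authors' model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
# Stand-ins for per-sentence features from each modality
f_text = rng.normal(size=20)
f_audio = rng.normal(size=10)
f_video = rng.normal(size=10)
x = np.concatenate([f_text, f_audio, f_video])   # early fusion per sentence
W_aspect = rng.normal(scale=0.1, size=(5, 40))   # 5 hypothetical aspect classes
W_sent = rng.normal(scale=0.1, size=(3, 40))     # negative / neutral / positive
p_aspect, p_sent = softmax(W_aspect @ x), softmax(W_sent @ x)
```

Working at the sentence level, as the abstract notes, avoids the need for time-aligned annotations: each transcribed sentence is scored independently for its aspect and sentiment orientation.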