Universal EEG Encoder for Learning Diverse Intelligent Tasks
Brain-Computer Interfaces (BCI) have become very popular, with Electroencephalography (EEG) being one of the most commonly used signal acquisition techniques. A major challenge in BCI studies is the individualistic analysis required for each task: task-specific feature extraction and classification are performed, which fail to generalize to other tasks with similar time-series EEG input data. To this end, we design a GRU-based universal deep encoding architecture to extract meaningful features from publicly available datasets for five diverse EEG-based classification tasks. Our network generates task- and format-independent data representations and outperforms the state-of-the-art EEGNet architecture on most experiments. We also compare our results with CNN-based and autoencoder networks, which in turn perform local, spatial, temporal and unsupervised analysis on the data.
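As a rough illustration of the kind of architecture described above (not the authors' code), the following PyTorch sketch shows a shared GRU encoder with one lightweight classification head per task; the layer sizes, number of tasks, and class counts are assumptions made for the example.

```python
# Hypothetical sketch of a shared GRU encoder with per-task heads.
import torch
import torch.nn as nn

class UniversalEEGEncoder(nn.Module):
    """Shared GRU encoder; sizes and class counts are illustrative only."""
    def __init__(self, n_channels=64, hidden=128, n_layers=2,
                 task_classes=(4, 2, 3, 5, 2)):
        super().__init__()
        self.gru = nn.GRU(input_size=n_channels, hidden_size=hidden,
                          num_layers=n_layers, batch_first=True)
        # One lightweight head per task; the shared encoder supplies the
        # task- and format-independent representation.
        self.heads = nn.ModuleList([nn.Linear(hidden, c) for c in task_classes])

    def forward(self, x, task_id):
        # x: (batch, time, channels) EEG segment
        _, h = self.gru(x)              # h: (layers, batch, hidden)
        z = h[-1]                       # final hidden state of the last layer
        return self.heads[task_id](z)   # task-specific logits

# Usage: classify a batch of 64-channel, 256-sample segments for task 0.
model = UniversalEEGEncoder()
logits = model(torch.randn(8, 256, 64), task_id=0)
```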
Video Face Super-Resolution with Motion-Adaptive Feedback Cell
Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNN). Current state-of-the-art CNN methods usually treat the VSR problem as a large number of separate multi-frame super-resolution tasks, in which a batch of low-resolution (LR) frames is used to generate a single high-resolution (HR) frame, and a sliding window is run over the entire video to select LR frames and produce a series of HR frames. However, due to the complex temporal dependency between frames, the quality of the reconstructed HR frames degrades as the number of LR input frames increases. The reason is that these methods lack the ability to model complex temporal dependencies and struggle to provide accurate motion estimation and compensation for the VSR process, which makes performance degrade drastically when the motion in the frames is complex. In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block that can efficiently capture motion compensation and feed it back to the network in an adaptive way. Our approach efficiently utilizes inter-frame motion information, so the network's dependence on explicit motion estimation and compensation methods can be avoided. In addition, benefiting from the properties of MAFC, the network achieves better performance in extremely complex motion scenarios. Extensive evaluations and comparisons validate the strengths of our approach, and the experimental results demonstrate that the proposed framework outperforms state-of-the-art methods.
Comment: To appear in AAAI 2020.
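To make the idea of a feedback cell concrete, here is a hypothetical PyTorch sketch (not the paper's implementation) in which inter-frame feature differences are turned into a motion cue and gated back into the current frame's features; the module structure, channel count, and gating scheme are assumptions.

```python
# Hypothetical sketch of a motion-adaptive feedback block.
import torch
import torch.nn as nn

class MotionAdaptiveFeedbackCell(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Encode the frame-to-frame feature change into a motion cue.
        self.motion = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        # Gate deciding, per position and channel, how much of the motion
        # cue is fed back into the current features.
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 1), nn.Sigmoid())

    def forward(self, feat_curr, feat_prev):
        m = self.motion(feat_curr - feat_prev)           # motion cue
        g = self.gate(torch.cat([feat_curr, m], dim=1))  # adaptive gate
        return feat_curr + g * m                         # fed-back features

# Usage: refine current-frame features using the previous frame's features.
cell = MotionAdaptiveFeedbackCell(64)
out = cell(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```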
Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Deep learning has shown excellent performance in learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where the temporal structures of different data modalities, such as audio and video, should be taken into account. Music video retrieval from a given musical audio query is a natural way to search for and interact with music content. In this work, we study cross-modal music video retrieval in terms of emotion similarity. In particular, audio of an arbitrary length is used to retrieve a longer or full-length music video. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space to bridge the semantic gap between the two modalities. This also preserves the similarity between audio and visual content from different videos with the same class label, as well as the temporal structure. The contribution of our approach is mainly manifested in two aspects: i) we propose to select the top-k audio chunks with an attention-based Long Short-Term Memory (LSTM) model, which yields a good audio summarization with local properties; ii) we propose an end-to-end deep model for cross-modal audio-visual learning in which S-DCCA is trained to learn the semantic correlation between the audio and visual modalities. Due to the lack of a suitable music video dataset, we construct a 10K music video dataset from the YouTube-8M dataset. Promising results in terms of MAP and precision-recall show that our proposed model can be applied to music video retrieval.
Comment: 8 pages, 9 figures. Accepted by ISM 2018.
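As an illustrative sketch only (not the authors' model), the PyTorch snippet below shows an attention-based LSTM that scores audio chunks, keeps the top-k, and projects their summary into a shared embedding space; the dimensions, the value of k, and the mean-pooling of selected chunks are assumptions.

```python
# Hypothetical sketch of attention-based top-k audio chunk summarization.
import torch
import torch.nn as nn

class ChunkAttentionSummarizer(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, embed_dim=64, k=3):
        super().__init__()
        self.k = k
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)            # attention score per chunk
        self.project = nn.Linear(hidden, embed_dim)  # into the shared space

    def forward(self, chunks):
        # chunks: (batch, n_chunks, feat_dim) pre-extracted audio features
        h, _ = self.lstm(chunks)                   # (batch, n_chunks, hidden)
        scores = self.score(h).squeeze(-1)         # (batch, n_chunks)
        topk = scores.topk(self.k, dim=1).indices  # indices of top-k chunks
        selected = torch.gather(
            h, 1, topk.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        summary = selected.mean(dim=1)             # audio summary vector
        return self.project(summary)               # shared-space embedding

# Usage: embed 10 audio chunks of one track into the shared space.
model = ChunkAttentionSummarizer()
audio_emb = model(torch.randn(4, 10, 128))
```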