Hidden Two-Stream Convolutional Networks for Action Recognition
Analyzing videos of human actions involves understanding the temporal
relationships among video frames. State-of-the-art action recognition
approaches rely on traditional optical flow estimation methods to pre-compute
motion information for CNNs. Such a two-stage approach is computationally
expensive, storage-demanding, and not end-to-end trainable. In this paper, we
present a novel CNN architecture that implicitly captures motion information
between adjacent frames. We name our approach hidden two-stream CNNs because it
only takes raw video frames as input and directly predicts action classes
without explicitly computing optical flow. Our end-to-end approach is 10x
faster than its two-stage baseline. Experimental results on four challenging
action recognition datasets (UCF101, HMDB51, THUMOS14, and ActivityNet v1.2) show
that our approach significantly outperforms the previous best real-time
approaches.
Comment: Accepted at ACCV 2018, camera ready. Code available at
https://github.com/bryanyzhu/Hidden-Two-Strea
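The "hidden" two-stream idea above can be sketched in a few lines. This is a toy stand-in, not the authors' architecture: the names `motion_net` and `temporal_head`, the linear layers, and all shapes are illustrative assumptions. The point it shows is that a motion sub-network maps stacked adjacent raw frames to learned flow-like features consumed directly by a classifier, so no optical flow is pre-computed or stored as a separate stage.

```python
import numpy as np

rng = np.random.default_rng(0)

def motion_net(frames, w):
    """Toy MotionNet stand-in: pairs of adjacent frames -> flow-like features."""
    pairs = np.concatenate([frames[:-1], frames[1:]], axis=-1)  # (T-1, H, 2W)
    x = pairs.reshape(pairs.shape[0], -1)                       # flatten each pair
    return np.tanh(x @ w)                                       # learned motion cues, (T-1, D)

def temporal_head(features, w):
    """Toy temporal-stream head: average-pool motion features -> class probabilities."""
    logits = features.mean(axis=0) @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

T, H, W, D, C = 8, 16, 16, 32, 5          # frames, height, width, feature dim, classes
frames = rng.standard_normal((T, H, W))    # raw video frames are the only input
w_motion = 0.1 * rng.standard_normal((2 * H * W, D))
w_cls = 0.1 * rng.standard_normal((D, C))

probs = temporal_head(motion_net(frames, w_motion), w_cls)
print(probs.shape)  # (5,)
```

Because every step is differentiable, gradients from the classification loss would flow back into `motion_net`, which is what makes the pipeline end-to-end trainable, unlike a pre-computed optical-flow stage.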
Meta predictive learning model of natural languages
Large language models based on self-attention mechanisms have achieved
astonishing performance not only on natural language itself but also on a
variety of tasks of a different nature. However, the human brain may not
process language using the same principle, which has prompted a debate about
the connection between brain computation and the artificial self-supervision
adopted in large language models. One of the most influential hypotheses in
brain computation is the predictive coding framework, which proposes
minimizing the prediction error through local learning. However, the role
of predictive coding and the associated credit assignment in language
processing remains unknown. Here, we propose a mean-field learning model within
the predictive coding framework, assuming that the synaptic weight of each
connection follows a spike and slab distribution, and only the distribution is
trained. This meta predictive learning is successfully validated on classifying
handwritten digits, where pixels are fed to the network in sequence, and on
toy and real language corpora. Our model reveals that most of the
connections become deterministic after learning, while the output connections
have a higher level of variability. The performance of the resulting network
ensemble changes continuously with data load, further improving with more
training data, in analogy with the emergent behavior of large language models.
Therefore, our model provides a starting point to investigate the physics and
biology correspondences of the language processing and the unexpected general
intelligence.
Comment: 23 pages, 6 figures; code is available in the main text with the
lin
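The spike-and-slab assumption above has simple closed-form mean-field moments, which can be checked numerically. This is a sketch in our own notation, not the paper's code: we write each weight as w = s·g with s ~ Bernoulli(pi) (the "spike" zeroes the connection with probability 1 − pi) and g ~ N(m, xi) (the "slab"), so only the distribution parameters (pi, m, xi) would be trained.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.standard_normal(N)       # inputs to one unit
pi = rng.uniform(0.2, 0.9, N)    # slab probability per connection
m = rng.standard_normal(N)       # slab mean
xi = rng.uniform(0.1, 0.5, N)    # slab variance

# Mean-field moments of the pre-activation z = sum_i w_i x_i, using
# E[w] = pi*m and Var[w] = pi*(xi + m^2) - (pi*m)^2
z_mean = np.sum(pi * m * x)
z_var = np.sum((pi * (xi + m**2) - (pi * m) ** 2) * x**2)

# Monte-Carlo check of the closed-form moments
S = 50_000
spike = rng.random((S, N)) < pi                          # Bernoulli spikes
slab = m + np.sqrt(xi) * rng.standard_normal((S, N))     # Gaussian slabs
z = (spike * slab) @ x
print(abs(z.mean() - z_mean) < 0.3, abs(z.var() - z_var) / z_var < 0.05)
```

These moments are what a mean-field forward pass propagates in place of sampled weights; a connection becoming "deterministic after learning", as the abstract describes, corresponds to pi approaching 1 and xi approaching 0.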
Audio self-supervised learning: a survey
Inspired by humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) aims to discover general representations
from large-scale data without requiring human annotations, which are
expensive and time-consuming to obtain. Its success in the fields of computer
vision and natural language processing has prompted its recent adoption into the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
suitable benchmarks to evaluate the power of SSL in the computer audition
domain. Finally, we discuss some open problems and point out future
directions for the development of audio SSL.
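One common family among the SSL objectives such a survey covers is contrastive learning. The sketch below is illustrative, not any specific paper's loss: an InfoNCE-style objective where embeddings of two augmented "views" of the same audio clip are positives and all other clips in the batch are negatives; the embedding dimensions and temperature are assumptions.

```python
import numpy as np

def info_nce(za, zb, tau=0.1):
    """za, zb: (B, D) embeddings of two views per clip; positives on the diagonal."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)   # L2-normalise
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                              # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on positives

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
aligned = info_nce(z, z + 0.01 * rng.standard_normal((8, 16)))  # matched views
random_ = info_nce(z, rng.standard_normal((8, 16)))             # mismatched views
print(aligned < random_)
```

Minimizing this loss pulls the two views of each clip together while pushing apart different clips, which is how such objectives learn representations without labels.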
Applied Deep Learning: Case Studies in Computer Vision and Natural Language Processing
Deep learning has proved successful for many computer vision and natural language processing applications. In this dissertation, three studies have been conducted to show the efficacy of deep learning models for computer vision and natural language processing. In the first study, an efficient deep learning model was proposed for seagrass scar detection in multispectral images, which produced robust, accurate scar mappings. In the second study, an arithmetic deep learning model was developed to fuse multispectral images collected at different times with different resolutions to generate high-resolution images for downstream tasks including change detection, object detection, and land cover classification. In addition, a super-resolution deep model was implemented to further enhance remote sensing images. In the third study, a deep learning-based framework was proposed for fact-checking on social media to spot fake scientific news. The framework leveraged deep learning, information retrieval, and natural language processing techniques to retrieve pertinent scholarly papers for given scientific news and evaluate the credibility of the news.
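The retrieval step of such a fact-checking framework can be sketched with plain TF-IDF ranking. This is a minimal illustration under our own assumptions, not the dissertation's actual pipeline: the tiny corpus, the claim text, and the use of unweighted TF-IDF cosine similarity are all hypothetical.

```python
import math
from collections import Counter

# Toy corpus of scholarly abstracts (illustrative only)
corpus = [
    "seagrass scar detection in multispectral images",
    "vaccines cause no increase in autism risk large cohort study",
    "super resolution for remote sensing image enhancement",
]

def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) and the shared IDF table."""
    tokenised = [d.split() for d in docs]
    df = Counter(t for doc in tokenised for t in set(doc))   # document frequency
    n = len(docs)
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}         # smoothed IDF
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in tokenised]
    return vecs, idf

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs, idf = tf_idf_vectors(corpus)
claim = "study links vaccines to autism"
q = {t: c * idf.get(t, 0.0) for t, c in Counter(claim.split()).items()}
best = max(range(len(corpus)), key=lambda i: cosine(q, vecs[i]))
print(best)  # 1 -- the vaccine/autism abstract ranks first
```

A real system would retrieve from a large scholarly index and pass the top-ranked papers to a downstream model that scores the claim's credibility against them.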