A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends
In today's digital age, Convolutional Neural Networks (CNNs), a subset of
Deep Learning (DL), are widely used for various computer vision tasks such as
image classification, object detection, and image segmentation. There are
numerous types of CNNs designed to meet specific needs and requirements,
including 1D, 2D, and 3D CNNs, as well as dilated, grouped, attention, and
depthwise convolutions, and architectures discovered through Neural
Architecture Search (NAS), among others. Each type of CNN has a unique
structure and characteristics that make it suitable for specific tasks. A
thorough comparative analysis of these different CNN types is therefore
essential to understand their strengths and weaknesses.
Furthermore, studying the performance, limitations, and practical applications
of each type of CNN can aid in the development of new and improved
architectures. We also examine, from several perspectives, the platforms and
frameworks that researchers use for this research and development.
Additionally, we explore the main CNN research fields, such as 6D
vision, generative models, and meta-learning. This survey paper provides a
comprehensive examination and comparison of various CNN architectures,
highlighting their architectural differences and emphasizing their respective
advantages, disadvantages, applications, challenges, and future trends.
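For intuition, here is a minimal PyTorch sketch (illustrative, not taken from the survey; channel counts and kernel sizes are arbitrary) of several of the convolution variants discussed above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)                    # (batch, channels, H, W)

standard  = nn.Conv2d(16, 32, kernel_size=3, padding=1)
dilated   = nn.Conv2d(16, 32, kernel_size=3, padding=2, dilation=2)
grouped   = nn.Conv2d(16, 32, kernel_size=3, padding=1, groups=4)
depthwise = nn.Conv2d(16, 16, kernel_size=3, padding=1, groups=16)
pointwise = nn.Conv2d(16, 32, kernel_size=1)      # completes a depthwise-separable conv

for name, conv in [("standard", standard), ("dilated", dilated),
                   ("grouped", grouped), ("depthwise", depthwise)]:
    n_params = sum(p.numel() for p in conv.parameters())
    print(f"{name:9s} out={tuple(conv(x).shape)} params={n_params}")

y = pointwise(depthwise(x))                       # depthwise-separable: (1, 32, 32, 32)
```

The parameter counts printed above make the trade-offs concrete: grouped and depthwise convolutions use far fewer weights than the standard convolution, while the dilated convolution enlarges the receptive field at no extra parameter cost.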
NAC-TCN: Temporal Convolutional Networks with Causal Dilated Neighborhood Attention for Emotion Understanding
In the task of emotion recognition from videos, a key improvement has been to
focus on emotions over time rather than a single frame. There are many
architectures to address this task such as GRUs, LSTMs, Self-Attention,
Transformers, and Temporal Convolutional Networks (TCNs). However, these
methods suffer from high memory usage, large numbers of operations, or poorly
behaved gradients. We propose Neighborhood Attention with Convolutions TCN
(NAC-TCN), which combines the benefits of attention and Temporal Convolutional
Networks while ensuring that causal relationships are preserved, resulting in
reduced computation and memory cost. We accomplish this by introducing a
causal version of Dilated Neighborhood Attention and combining it with
convolutions. Our model achieves performance comparable to or better than
TCNs, TCAN, LSTMs, and GRUs, including state-of-the-art results, on standard
emotion recognition datasets while requiring fewer parameters.
We publish our code online for easy reproducibility and use in other projects.
Comment: 8 pages, presented at ICVIP 202
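A hedged sketch of the core mechanism described in this abstract, not the authors' implementation: causal dilated neighborhood attention restricts each time step to a small, dilated window of past positions. The masking approach below is O(T²) and purely illustrative; the point of a native windowed implementation is to avoid that cost.

```python
import torch
import torch.nn.functional as F

def causal_dilated_neighborhood_attention(q, k, v, window=3, dilation=2):
    # q, k, v: (batch, time, dim). Position t attends only to positions
    # {t, t - dilation, ..., t - (window - 1) * dilation}: causal and local.
    B, T, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5   # (B, T, T)
    t = torch.arange(T)
    rel = t[:, None] - t[None, :]                 # offset from query to key
    allowed = (rel >= 0) & (rel < window * dilation) & (rel % dilation == 0)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 16, 64)
out = causal_dilated_neighborhood_attention(q, k, v)  # (2, 16, 64)
```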
Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
Pretrained contextual word representations in NLP have greatly improved
performance on various downstream tasks. For speech, we propose contextual
frame representations that capture phonetic information at the acoustic frame
level and can be used for utterance-level language, speaker, and speech
recognition. These representations come from the frame-wise intermediate
representations of an end-to-end, self-attentive ASR model (SAN-CTC) on spoken
utterances. We first train the model on the Fisher English corpus with
context-independent phoneme labels, then use its representations at inference
time as features for task-specific models on the NIST LRE07 closed-set language
recognition task and a Fisher speaker recognition task, giving significant
improvements over the state-of-the-art on both (e.g., language EER of 4.68% on
3-second utterances, 23% relative reduction in speaker EER). Results remain
competitive when using a novel dilated convolutional model for language
recognition, or when ASR pretraining is done with character labels only.
Comment: submitted to INTERSPEECH 201
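A minimal sketch of the feature-extraction pattern this abstract describes, using a generic Transformer encoder as a stand-in for the paper's SAN-CTC model; the layer index and dimensions are placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained self-attentive ASR encoder; not SAN-CTC itself.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=4, batch_first=True),
    num_layers=6,
)

captured = {}
def grab(module, inputs, output):
    captured["frames"] = output.detach()          # (batch, time, dim)

# Tap an intermediate layer; which depth works best is task-dependent.
encoder.layers[3].register_forward_hook(grab)

feats = torch.randn(2, 300, 80)                   # ~3 s of 10 ms acoustic frames
with torch.no_grad():
    encoder(feats)

# Pool the frame-wise representations into an utterance-level feature for a
# downstream language- or speaker-recognition classifier.
utt_embedding = captured["frames"].mean(dim=1)    # (batch, dim)
```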
Deep audio-visual speech recognition
Decades of research in acoustic speech recognition have led to systems that we use in our everyday lives. However, even the most advanced speech recognition systems fail in the presence of noise. This degraded performance can be compensated for by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of suitable architectures and annotated data.
This thesis contributes to the problem of Audio-Visual Speech Recognition (AVSR) from several aspects. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods, which consist of a two-step approach of feature extraction followed by recognition, we present an End-to-End (E2E) approach inside a deep neural network, which leads to significant improvements in audio-only, visual-only, and audio-visual experiments. We further replace the Bi-directional Gated Recurrent Unit (BGRU) with Temporal Convolutional Networks (TCNs) to greatly simplify the training procedure.
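For illustration only (dimensions are placeholders, and neither snippet is the thesis implementation), the swap amounts to replacing a recurrent temporal back-end with a stack of dilated 1-D convolutions over per-frame features:

```python
import torch
import torch.nn as nn

feats = torch.randn(4, 29, 512)                   # (batch, frames, dim), e.g. 29 video frames

# Recurrent back-end: Bi-directional GRU.
bgru = nn.GRU(512, 256, num_layers=2, bidirectional=True, batch_first=True)
out_gru, _ = bgru(feats)                          # (4, 29, 512)

# Convolutional back-end: dilated 1-D convolutions over the time axis.
tcn = nn.Sequential(
    nn.Conv1d(512, 512, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
    nn.Conv1d(512, 512, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
)
out_tcn = tcn(feats.transpose(1, 2)).transpose(1, 2)  # (4, 29, 512)
```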
Secondly, we extend our AVSR model to continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/Attention model that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentation.
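A hedged sketch of the hybrid objective: a weighted sum of a CTC loss on the encoder outputs and a cross-entropy loss on the attention decoder's predictions. The shapes, vocabulary size, and weight lambda below are placeholders, not the thesis's settings.

```python
import torch
import torch.nn.functional as F

lam = 0.2                                         # CTC weight; illustrative value
vocab, T, B, S = 40, 75, 4, 20

# CTC branch: frame-wise log-probabilities from the encoder, (time, batch, vocab).
log_probs = torch.randn(T, B, vocab).log_softmax(dim=-1)
targets = torch.randint(1, vocab, (B, S))         # index 0 is reserved for the CTC blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), S, dtype=torch.long)
ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

# Attention branch: decoder logits per output token, (batch, steps, vocab).
decoder_logits = torch.randn(B, S, vocab)
att_loss = F.cross_entropy(decoder_logits.reshape(-1, vocab), targets.reshape(-1))

loss = lam * ctc_loss + (1 - lam) * att_loss      # train end-to-end on this sum
```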
Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading.
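Conceptually, the pretext task is a regression from visual frames to time-aligned acoustic targets. The sketch below uses placeholder dimensions and a toy encoder rather than the thesis's ResNet+Conformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy visual encoder standing in for the ResNet+Conformer front-end.
visual_encoder = nn.Sequential(
    nn.Linear(136, 256), nn.ReLU(), nn.Linear(256, 80),
)

video_feats = torch.randn(4, 29, 136)             # per-frame visual features
acoustic_targets = torch.randn(4, 29, 80)         # time-aligned acoustic features

pred = visual_encoder(video_feats)
loss = F.l1_loss(pred, acoustic_targets)          # self-supervised regression loss
loss.backward()                                   # pretrain, then fine-tune for lip-reading
```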
We also investigate the influence of the Lombard effect in an end-to-end AVSR system; this is the first work to use end-to-end deep architectures for this problem and to present results on unseen speakers. We show that even when a relatively small amount of Lombard speech is added to the training set, performance in a real scenario, where noisy Lombard speech is present, can be significantly improved.
Lastly, we propose a detection method against adversarial examples in an AVSR system, which exploits the strong correlation between the audio and visual streams. A synchronisation confidence score serves as a proxy for audio-visual correlation, and adversarial attacks are detected when it is anomalously low. We apply recent adversarial attacks to two AVSR models, and the experimental results demonstrate that the proposed approach is an effective way of detecting such attacks.
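A minimal sketch of the detection idea, with placeholder embeddings and threshold: score the synchronisation between time-aligned audio and visual embeddings, and flag inputs whose score is anomalously low.

```python
import torch
import torch.nn.functional as F

def sync_confidence(audio_emb, visual_emb):
    # Mean cosine similarity between time-aligned audio and visual embeddings;
    # adversarial perturbations on one stream tend to break this agreement.
    return F.cosine_similarity(audio_emb, visual_emb, dim=-1).mean(dim=-1)

audio_emb = torch.randn(4, 29, 256)               # (batch, time, dim)
visual_emb = torch.randn(4, 29, 256)

score = sync_confidence(audio_emb, visual_emb)    # one score per utterance
is_adversarial = score < 0.5                      # threshold chosen on clean data
```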
Lip-reading with Densely Connected Temporal Convolutional Networks
In this work, we present the Densely Connected Temporal Convolutional Network
(DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional
Networks (TCNs) have recently demonstrated great potential in many vision
tasks, their receptive fields are not dense enough to model the complex
temporal dynamics in lip-reading scenarios. To address this problem, we
introduce dense
connections into the network to capture more robust temporal features.
Moreover, our approach utilises the Squeeze-and-Excitation block, a
light-weight attention mechanism, to further enhance the model's classification
power. Without bells and whistles, our DC-TCN method has achieved 88.36%
accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the
LRW-1000 dataset, surpassing all baseline methods and setting a new
state of the art on both datasets.
Comment: WACV 202
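A hedged sketch combining the two ingredients named in this abstract, dense connections across dilated temporal convolutions and a Squeeze-and-Excitation gate. The layer count, growth rate, and dilations are placeholders, not the DC-TCN configuration.

```python
import torch
import torch.nn as nn

class SE1d(nn.Module):
    # Squeeze-and-Excitation: global-average-pool the time axis, pass through
    # a bottleneck MLP, and re-weight each channel with a sigmoid gate.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (batch, channels, time)
        gate = self.fc(x.mean(dim=-1))            # (batch, channels)
        return x * gate.unsqueeze(-1)

class DenseTCNBlock(nn.Module):
    # DenseNet-style connectivity: each dilated conv layer sees the
    # concatenation of the block input and all earlier layers' outputs.
    def __init__(self, in_channels, growth=32, num_layers=3, kernel=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for i in range(num_layers):
            d = 2 ** i                            # dilations 1, 2, 4, ...
            self.layers.append(nn.Sequential(
                nn.Conv1d(channels, growth, kernel, dilation=d,
                          padding=d * (kernel - 1) // 2),
                nn.ReLU(),
                SE1d(growth),
            ))
            channels += growth

    def forward(self, x):                         # x: (batch, in_channels, time)
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

block = DenseTCNBlock(64)
out = block(torch.randn(2, 64, 29))               # (2, 64 + 3 * 32, 29)
```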