Recurrent Neural Networks for End-to-End Speech Recognition: A Comparative Analysis
Speech recognition is the task of correctly transcribing spoken utterances by machine. Deep learning has emerged as a powerful approach for representing sequential data such as speech. Deep learning frameworks such as Recurrent Neural Networks (RNNs) have successfully replaced traditional speech models such as Hidden Markov Models and Gaussian mixtures, boosting recognition performance to a large extent. RNNs, being suited to sequence-to-sequence modeling, are a powerful tool for sequence labeling. End-to-end methods such as Connectionist Temporal Classification (CTC) are used with RNNs for speech recognition. This paper presents a comparative analysis of RNNs for end-to-end speech recognition. Models are trained with different RNN architectures, namely simple RNN cells (SRNN), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU), as well as bidirectional variants of each, and compared on the LibriSpeech corpus.
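As an illustration of the setup this abstract describes, below is a minimal PyTorch sketch of an RNN acoustic model trained with CTC loss, with the recurrent cell type selectable for comparison. The class name, layer sizes, and vocabulary size are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): an RNN acoustic model trained
# with CTC loss, with the recurrent cell type selectable for comparison.
import torch
import torch.nn as nn

class SpeechRNN(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, cell="lstm", num_mels=80, hidden=256,
                 vocab_size=29, bidirectional=False):
        super().__init__()
        rnn_cls = {"srnn": nn.RNN, "lstm": nn.LSTM, "gru": nn.GRU}[cell]
        self.rnn = rnn_cls(num_mels, hidden, num_layers=2,
                           batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.proj = nn.Linear(out_dim, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, feats):                 # feats: (batch, time, num_mels)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(dim=-1)

model = SpeechRNN(cell="gru", bidirectional=True)
ctc = nn.CTCLoss(blank=0)
feats = torch.randn(4, 200, 80)               # dummy filterbank features
log_probs = model(feats).transpose(0, 1)      # CTCLoss wants (time, batch, classes)
targets = torch.randint(1, 30, (4, 30))       # dummy label sequences (0 = blank)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 30, dtype=torch.long))
```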
A Multimodal Approach for Dementia Detection from Spontaneous Speech with Tensor Fusion Layer
Alzheimer's disease (AD) is a progressive neurological disorder, meaning that
the symptoms develop gradually throughout the years. It is also the main cause
of dementia, which affects memory, thinking skills, and mental abilities.
Nowadays, researchers have moved their interest towards AD detection from
spontaneous speech, since it constitutes a time-effective procedure. However,
existing state-of-the-art works proposing multimodal approaches do not take
into consideration the inter- and intra-modal interactions and propose early
and late fusion approaches. To tackle these limitations, we propose deep
neural networks that can be trained end-to-end and capture the
inter- and intra-modal interactions. Firstly, each audio file is converted to
an image consisting of three channels, i.e., log-Mel spectrogram, delta, and
delta-delta. Next, each transcript is passed through a BERT model followed by a
gated self-attention layer. Similarly, each image is passed through a Swin
Transformer followed by an independent gated self-attention layer. Acoustic
features are also extracted from each audio file. Finally, the representation
vectors from the different modalities are fed to a tensor fusion layer for
capturing the inter-modal interactions. Extensive experiments conducted on the
ADReSS Challenge dataset indicate that our introduced approaches obtain
valuable advantages over existing research initiatives, reaching an Accuracy
and F1-score of up to 86.25% and 85.48%, respectively.

Comment: 2022 IEEE-EMBS International Conference on Biomedical and Health
Informatics (BHI) - Oral Presentation
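For the audio-to-image step the abstract describes, a conversion along these lines can be written with librosa; the sample rate, mel-band count, and file name are assumptions, not the authors' settings.

```python
# Illustrative sketch: turn an audio file into the three-channel
# log-Mel / delta / delta-delta "image" described in the abstract.
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)      # assumed sample rate and file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)                # channel 1: log-Mel spectrogram
delta = librosa.feature.delta(log_mel)            # channel 2: first derivative
delta2 = librosa.feature.delta(log_mel, order=2)  # channel 3: second derivative
image = np.stack([log_mel, delta, delta2], axis=-1)  # shape: (n_mels, time, 3)
```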
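The abstract does not spell out the tensor fusion layer beyond its role; a minimal two-modality sketch in the spirit of the standard tensor fusion formulation (outer product of representation vectors each augmented with a constant 1) might look as follows. Class name, dimensions, and the projection layer are assumptions.

```python
# Minimal two-modality tensor fusion sketch (assumed formulation, not the
# authors' code): the outer product of [v; 1] vectors keeps unimodal terms
# alongside the bimodal interaction terms.
import torch
import torch.nn as nn

class TensorFusion(nn.Module):  # hypothetical name
    def __init__(self, dim_a, dim_b, out_dim):
        super().__init__()
        self.proj = nn.Linear((dim_a + 1) * (dim_b + 1), out_dim)

    def forward(self, vec_a, vec_b):          # (batch, dim_a), (batch, dim_b)
        ones = vec_a.new_ones(vec_a.size(0), 1)
        a = torch.cat([vec_a, ones], dim=1)   # append constant 1 per modality
        b = torch.cat([vec_b, ones], dim=1)
        fused = torch.bmm(a.unsqueeze(2), b.unsqueeze(1))  # outer product
        return self.proj(fused.flatten(start_dim=1))

fusion = TensorFusion(dim_a=768, dim_b=1024, out_dim=128)
text_vec, image_vec = torch.randn(4, 768), torch.randn(4, 1024)
joint = fusion(text_vec, image_vec)           # (4, 128) fused representation
```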
Improving speech recognition by revising gated recurrent units
Speech recognition is largely taking advantage of deep learning, showing that
substantial benefits can be obtained by modern Recurrent Neural Networks
(RNNs). The most popular RNNs are Long Short-Term Memory (LSTMs), which
typically reach state-of-the-art performance in many tasks thanks to their
ability to learn long-term dependencies and robustness to vanishing gradients.
Nevertheless, LSTMs have a rather complex design with three multiplicative
gates, which might impair their efficient implementation. An attempt to simplify
LSTMs has recently led to Gated Recurrent Units (GRUs), which are based on just
two multiplicative gates.
This paper builds on these efforts by further revising GRUs and proposing a
simplified architecture potentially more suitable for speech recognition. The
contribution of this work is two-fold. First, we suggest removing the reset
gate in the GRU design, resulting in a more efficient single-gate architecture.
Second, we propose to replace tanh with ReLU activations in the state update
equations. Results show that, in our implementation, the revised architecture
reduces the per-epoch training time by more than 30% and consistently
improves recognition performance across different tasks, input features, and
noisy conditions when compared to a standard GRU.
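A minimal sketch of the revised cell this abstract describes (reset gate removed, tanh replaced by ReLU) could look like the following; the class name is hypothetical, and this is not the authors' implementation (it omits, for example, any normalization they may pair with the ReLU recurrence).

```python
# Minimal sketch (not the authors' code) of a single-gate recurrent cell:
# the GRU reset gate is removed and tanh is replaced by ReLU, per the
# two changes the abstract proposes.
import torch
import torch.nn as nn

class SingleGateReluCell(nn.Module):  # hypothetical name
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.wz = nn.Linear(input_size, hidden_size)
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h):
        z = torch.sigmoid(self.wz(x) + self.uz(h))    # single update gate
        h_cand = torch.relu(self.wh(x) + self.uh(h))  # ReLU candidate state
        return z * h + (1 - z) * h_cand               # no reset gate

cell = SingleGateReluCell(input_size=40, hidden_size=256)
h = torch.zeros(4, 256)
for x_t in torch.randn(100, 4, 40):  # iterate over 100 time steps
    h = cell(x_t, h)
```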