TabAttention: Learning Attention Conditionally on Tabular Data
Medical data analysis often combines both imaging and tabular data processing
using machine learning algorithms. While previous studies have investigated the
impact of attention mechanisms on deep learning models, few have explored
integrating attention modules with tabular data. In this paper, we introduce
TabAttention, a novel module that enhances the performance of Convolutional
Neural Networks (CNNs) with an attention mechanism that is trained
conditionally on tabular data. Specifically, we extend the Convolutional Block
Attention Module to 3D by adding a Temporal Attention Module that uses
multi-head self-attention to learn attention maps. Furthermore, we enhance all
attention modules by integrating tabular data embeddings. Our approach is
demonstrated on the fetal birth weight (FBW) estimation task, using 92 fetal
abdominal ultrasound video scans and fetal biometry measurements. Our results
indicate that TabAttention outperforms clinicians and existing methods that
rely on tabular and/or imaging data for FBW prediction. This novel approach has
the potential to improve computer-aided diagnosis in various clinical workflows
where imaging and tabular data are combined. We provide source code for
integrating TabAttention in CNNs at
https://github.com/SanoScience/Tab-Attention.
Comment: Accepted for the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 202
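The abstract describes channel attention conditioned on tabular-data embeddings. The following is a minimal, hypothetical sketch of that idea (a CBAM-style channel gate whose input combines a pooled image descriptor with a tabular embedding); the function and weight names are illustrative and do not come from the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_with_tabular(feature_map, tab_embedding, w_feat, w_tab):
    """Hypothetical CBAM-style channel attention conditioned on tabular data.
    feature_map: (C, H, W), tab_embedding: (D,), w_feat: (C, C), w_tab: (D, C)."""
    # Global average pooling over spatial dims -> per-channel descriptor.
    pooled = feature_map.mean(axis=(1, 2))                   # (C,)
    # Combine the image descriptor with the tabular embedding in one gate.
    gate = sigmoid(pooled @ w_feat + tab_embedding @ w_tab)  # (C,)
    # Re-weight channels with the tabular-conditioned gate.
    return feature_map * gate[:, None, None]

rng = np.random.default_rng(0)
C, H, W, D = 4, 8, 8, 3
out = channel_attention_with_tabular(
    rng.standard_normal((C, H, W)), rng.standard_normal(D),
    rng.standard_normal((C, C)), rng.standard_normal((D, C)))
print(out.shape)  # (4, 8, 8)
```

The paper additionally extends this to 3D with a temporal attention module using multi-head self-attention; this sketch only illustrates the tabular conditioning of a single channel gate.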
Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
Deep learning techniques have considerably improved speech processing in
recent years. Speech representations extracted by deep learning models are
being used in a wide range of tasks such as speech recognition, speaker
recognition, and speech emotion recognition. Attention models play an important
role in improving deep learning models. However, current attention mechanisms
are unable to attend to fine-grained information items. In this paper, we
propose the novel Fine-grained Early Frequency Attention (FEFA) for speech
signals. This model is capable of focusing on information items as small as
frequency bins. We evaluate the proposed model on two popular tasks of speaker
recognition and speech emotion recognition. Two widely used public datasets,
VoxCeleb and IEMOCAP, are used for our experiments. The model is implemented on
top of several prominent deep models as backbone networks to evaluate its
impact on performance compared to the original networks and other related work.
Our experiments show that by adding FEFA to different CNN architectures,
performance is consistently improved by substantial margins, even setting a new
state-of-the-art for the speaker recognition task. We also tested our model
under different levels of added noise, showing improved robustness and
reduced sensitivity compared to the backbone networks.
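FEFA's defining property is attention at the granularity of individual frequency bins. A toy sketch of that idea, assuming a per-bin scalar gate computed from each bin's time-averaged energy (the parameterization here is hypothetical, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frequency_bin_attention(spec, w, b):
    """Hypothetical early frequency attention in the spirit of FEFA:
    each frequency bin of a (freq, time) spectrogram gets its own scalar
    gate derived from that bin's time-averaged energy."""
    energy = spec.mean(axis=1)       # (F,) per-bin average energy
    gate = sigmoid(w * energy + b)   # (F,) one weight per frequency bin
    return spec * gate[:, None]      # scale each bin before the backbone

rng = np.random.default_rng(1)
spec = np.abs(rng.standard_normal((64, 100)))   # toy magnitude spectrogram
attended = frequency_bin_attention(spec, w=0.5, b=0.0)
print(attended.shape)  # (64, 100)
```

Because the gate is applied before the backbone network ("early" attention), any CNN can consume the re-weighted spectrogram unchanged.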
Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review
Artificial Neural Networks (ANNs) were inspired by the neural networks of the human brain and have been widely applied in speech processing. The application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, among others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, there has been a growing number of papers proposing ANNs supported by deep learning algorithms in conjunction with some mechanism to achieve symmetry with the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention integrated into the deep learning algorithms and its relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) the most relevant features, (ii) the ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed, and their strengths and weaknesses were determined.
Interpreting intermediate feature representations of raw-waveform deep CNNs by sonification
The majority of recent works that address the interpretability of raw-waveform deep neural networks (DNNs) for audio processing focus on interpreting spectral and frequency response information, often limiting themselves to visual and signal-theoretic means of interpretation, and solely for the first layer. This work proposes sonification, a method to interpret intermediate feature representations of sound event recognition (SER) 1D-convolutional neural networks (1D-CNNs) trained on raw waveforms by mapping these representations back into the discrete-time input signal domain, highlighting substructures in the input that maximally activate a feature map as intelligible acoustic events. Sonification is used to compare supervised and contrastive self-supervised feature representations, observing how the latter learn more acoustically discernible representations, especially in the deeper layers. A metric to quantify acoustic similarity between the interpretations and their corresponding inputs is proposed, and a layer-by-layer analysis of the trained feature representations using this metric supports the observations made.
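The core operation of sonification, mapping a feature-map activation back to the input substructure that produced it, can be illustrated with a single toy 1D-convolution layer. This is a hypothetical simplification (the paper works through deeper trained stacks; the names and stride here are illustrative only):

```python
import numpy as np

def sonify_max_activation(wave, kernel, stride):
    """Toy sketch of sonification for one 1D-conv layer: locate the
    maximally activating position in the feature map and map it back to
    the input-domain segment inside its receptive field."""
    k = len(kernel)
    n_out = (len(wave) - k) // stride + 1
    # Valid strided 1D convolution (correlation) over the raw waveform.
    acts = np.array([wave[i * stride:i * stride + k] @ kernel
                     for i in range(n_out)])
    i_max = int(np.argmax(np.abs(acts)))
    start = i_max * stride
    out = np.zeros_like(wave)
    # Keep only the maximally activating substructure; zeros elsewhere.
    out[start:start + k] = wave[start:start + k]
    return out, (start, start + k)

rng = np.random.default_rng(2)
wave = rng.standard_normal(1000)
sonified, (s, e) = sonify_max_activation(
    wave, kernel=rng.standard_normal(32), stride=8)
print(e - s)  # 32
```

Playing `sonified` back as audio is what renders the interpretation as an intelligible acoustic event rather than a purely visual one.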
Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks
In speaker recognition, deep neural networks deliver state-of-the-art performance due
to their large capacities and powerful feature extraction abilities. However, this performance can be highly affected by interference from background noise and other speakers.
This thesis focuses on new neural network architectures that are designed to overcome
such interference and thereby improve the robustness of the speaker recognition system.
In order to improve the noise robustness of the speaker recognition model, two
novel network architectures are proposed. The first is the hierarchical attention network, which is able to capture both local and global features in order to improve the
robustness of the network. The experimental results show it can deliver results that
are comparable to published state-of-the-art methods, reaching a 4.28% equal error
rate on the VoxCeleb1 training and test sets. The second approach is a joint speech
enhancement and speaker recognition system consisting of two networks: the first
integrates speech enhancement and speaker recognition into one framework to better
filter out noise, while the second feeds speaker embeddings into the speech
enhancement network, giving it prior knowledge that improves its performance. The results show that a joint
system with a speaker-dependent speech enhancement model can deliver results that
are comparable to published state-of-the-art methods, reaching a 4.15% equal error
rate on the VoxCeleb1 training and test sets.
To overcome interfering speakers, two novel approaches are proposed. The
first, referred to as embedding de-mixing, separates the speaker and content properties of a two-speaker signal in an embedding space, rather than
in a signal space. The results show that the de-mixed embeddings are close to the
clean embeddings in terms of quality, and the back-end speaker recognition model can
make use of the de-mixed embeddings to reach 96.9% speaker identification accuracy,
compared to the 98.5% achieved using clean embeddings on the TIMIT dataset. The
second approach is the first end-to-end weakly supervised speaker identification approach, based on a novel hierarchical transformer network architecture. The results
show that the proposed model can capture speaker properties from two speakers in
one input utterance. The hierarchical transformer network achieves more than a 3%
relative improvement over the baselines in all test conditions.
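The intuition behind embedding de-mixing can be sketched in a few lines, under the strong (hypothetical) assumption that a two-speaker embedding behaves like an additive mixture: projecting out a known reference speaker leaves an estimate of the other. The thesis learns this mapping with a network; this is only the geometric intuition, not its method:

```python
import numpy as np

def demix_embedding(mixed_emb, ref_emb):
    """Toy de-mixing under an assumed additive-mixture model: remove the
    component of the mixed embedding along the known reference speaker,
    leaving an estimate of the other speaker's embedding."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    return mixed_emb - (mixed_emb @ ref) * ref  # project out the known speaker

rng = np.random.default_rng(3)
spk_a = rng.standard_normal(128)   # reference speaker embedding
spk_b = rng.standard_normal(128)   # interfering speaker embedding
mixed = spk_a + spk_b              # assumed additive two-speaker embedding
est_b = demix_embedding(mixed, spk_a)

# The estimate should align with speaker B, not speaker A.
cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(est_b, spk_b) > cos(est_b, spk_a))  # True
```

In the thesis, the de-mixed embeddings are good enough for a back-end speaker recognition model to reach 96.9% identification accuracy, close to the 98.5% obtained with clean embeddings.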