
    TabAttention: Learning Attention Conditionally on Tabular Data

    Medical data analysis often combines imaging and tabular data processing using machine learning algorithms. While previous studies have investigated the impact of attention mechanisms on deep learning models, few have explored integrating attention modules with tabular data. In this paper, we introduce TabAttention, a novel module that enhances the performance of Convolutional Neural Networks (CNNs) with an attention mechanism trained conditionally on tabular data. Specifically, we extend the Convolutional Block Attention Module to 3D by adding a Temporal Attention Module that uses multi-head self-attention to learn attention maps. Furthermore, we enhance all attention modules by integrating tabular data embeddings. Our approach is demonstrated on the fetal birth weight (FBW) estimation task, using 92 fetal abdominal ultrasound video scans and fetal biometry measurements. Our results indicate that TabAttention outperforms clinicians and existing methods that rely on tabular and/or imaging data for FBW prediction. This novel approach has the potential to improve computer-aided diagnosis in various clinical workflows where imaging and tabular data are combined. We provide source code for integrating TabAttention into CNNs at https://github.com/SanoScience/Tab-Attention. Comment: Accepted for the 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2023.
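    The core idea, an attention gate whose weights also depend on an embedding of the tabular features, can be illustrated with a short sketch. The class below is a minimal, hypothetical example of a tabular-conditioned channel-attention block for 3D (video) features; the layer sizes, names, and squeeze-and-excitation-style design are assumptions for illustration, not the authors' published implementation.

```python
# Minimal sketch of tabular-conditioned channel attention (illustrative only;
# see https://github.com/SanoScience/Tab-Attention for the authors' code).
import torch
import torch.nn as nn

class TabConditionedChannelAttention(nn.Module):
    def __init__(self, channels: int, tab_dim: int, hidden: int = 64):
        super().__init__()
        self.tab_embed = nn.Sequential(nn.Linear(tab_dim, hidden), nn.ReLU())
        self.gate = nn.Sequential(
            nn.Linear(channels + hidden, channels // 2),
            nn.ReLU(),
            nn.Linear(channels // 2, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, tab: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video features; tab: (B, tab_dim) tabular features
        pooled = x.mean(dim=(2, 3, 4))               # global average pool -> (B, C)
        cond = torch.cat([pooled, self.tab_embed(tab)], dim=1)
        weights = self.gate(cond)                    # per-channel gates in (0, 1)
        return x * weights[:, :, None, None, None]   # reweight channels
```

    In the paper's setting, the same conditioning idea is applied to all attention modules (channel, spatial, and temporal); the sketch shows only the channel case.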

    Knowing What to Listen to: Early Attention for Deep Speech Representation Learning

    Deep learning techniques have considerably improved speech processing in recent years. Speech representations extracted by deep learning models are used in a wide range of tasks such as speech recognition, speaker recognition, and speech emotion recognition. Attention models play an important role in improving deep learning models. However, current attention mechanisms are unable to attend to fine-grained information items. In this paper, we propose Fine-grained Early Frequency Attention (FEFA), a novel attention mechanism for speech signals that is capable of focusing on information items as small as frequency bins. We evaluate the proposed model on two popular tasks: speaker recognition and speech emotion recognition. Two widely used public datasets, VoxCeleb and IEMOCAP, are used for our experiments. The model is implemented on top of several prominent deep models as backbone networks to evaluate its impact on performance compared to the original networks and other related work. Our experiments show that adding FEFA to different CNN architectures consistently improves performance by substantial margins, even setting a new state-of-the-art for the speaker recognition task. We also test our model against different levels of added noise, showing improved robustness and lower sensitivity compared to the backbone networks.
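    A minimal sketch of the general idea, attention weights computed per frequency bin and applied early, before the backbone network, is given below. The module name, layer sizes, and the use of time-averaged bin energy as the attention input are assumptions for illustration, not the published FEFA implementation.

```python
# Illustrative sketch of early, frequency-bin-level attention on a spectrogram.
import torch
import torch.nn as nn

class EarlyFrequencyAttention(nn.Module):
    def __init__(self, n_freq_bins: int, hidden: int = 128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(n_freq_bins, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (B, F, T) spectrogram with F frequency bins and T frames
        energy = spec.mean(dim=2)           # time-averaged energy per bin -> (B, F)
        weights = self.score(energy)        # one attention weight per frequency bin
        return spec * weights.unsqueeze(2)  # emphasise informative bins

# Example: attended = EarlyFrequencyAttention(257)(torch.randn(4, 257, 300))
```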

    Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review

    Artificial Neural Networks (ANNs) were created inspired by the neural networks in the human brain and have been widely applied in speech processing. The application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, a growing number of papers have proposed ANNs supported by deep learning algorithms in conjunction with some mechanism intended to mirror the human attention process. However, while these ANN approaches include attention, there is no categorization of how attention is integrated into the deep learning algorithms or of its relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) the most relevant features, (ii) the ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed, and their strengths and weaknesses were determined.

    Interpreting intermediate feature representations of raw-waveform deep CNNs by sonification

    The majority of recent works that address the interpretability of raw-waveform deep neural networks (DNNs) for audio processing focus on interpreting spectral and frequency-response information, often limiting themselves to visual and signal-theoretic means of interpretation, and only for the first layer. This work proposes sonification, a method to interpret the intermediate feature representations of sound event recognition (SER) 1D-convolutional neural networks (1D-CNNs) trained on raw waveforms by mapping these representations back into the discrete-time input signal domain, highlighting substructures in the input that maximally activate a feature map as intelligible acoustic events. Sonification is used to compare supervised and contrastive self-supervised feature representations, showing that the latter learn more acoustically discernible representations, especially in the deeper layers. A metric to quantify acoustic similarity between the interpretations and their corresponding inputs is proposed, and a layer-by-layer analysis of the trained feature representations using this metric supports these observations.
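    One plausible way to realise such a mapping is to upsample a feature map's activations to the length of the input waveform and use them as a soft mask, so that only the regions driving that feature map remain audible. The sketch below illustrates this general idea; the function name and the linear-interpolation masking scheme are assumptions, not the paper's exact procedure.

```python
# Illustrative sketch: turn a 1D-CNN feature map into an audible interpretation
# by masking the raw waveform with its (upsampled, normalised) activations.
import torch
import torch.nn.functional as F

def sonify(waveform: torch.Tensor, feature_map: torch.Tensor) -> torch.Tensor:
    # waveform: (B, 1, N) raw audio; feature_map: (B, T) activations of one channel
    act = feature_map.clamp(min=0)                      # keep positive activations
    act = act / (act.amax(dim=1, keepdim=True) + 1e-8)  # normalise to [0, 1]
    mask = F.interpolate(act.unsqueeze(1), size=waveform.shape[-1],
                         mode="linear", align_corners=False)
    return waveform * mask                              # activating regions stay audible

# Example: sonified = sonify(audio_batch, layer3_activations[:, channel_idx, :])
```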

    Improving the Robustness of Speaker Recognition in Noise and Multi-Speaker Conditions Using Deep Neural Networks

    In speaker recognition, deep neural networks deliver state-of-the-art performance due to their large capacities and powerful feature extraction abilities. However, this performance can be strongly affected by interference from background noise and other speakers. This thesis focuses on new neural network architectures designed to overcome such interference and thereby improve the robustness of the speaker recognition system. In order to improve the noise robustness of the speaker recognition model, two novel network architectures are proposed. The first is a hierarchical attention network, which captures both local and global features in order to improve the robustness of the network. The experimental results show it delivers results comparable to published state-of-the-art methods, reaching a 4.28% equal error rate using the VoxCeleb1 training and test sets. The second is a joint speech enhancement and speaker recognition system consisting of two networks: one integrates speech enhancement and speaker recognition into a single framework to better filter out noise, while the other additionally feeds speaker embeddings into the speech enhancement network, providing prior knowledge that improves its performance. The results show that a joint system with a speaker-dependent speech enhancement model delivers results comparable to published state-of-the-art methods, reaching a 4.15% equal error rate using the VoxCeleb1 training and test sets. In order to overcome interfering speakers, two novel approaches are proposed. The first, referred to as embedding de-mixing, separates the speaker and content properties of a two-speaker signal in an embedding space rather than in the signal space. The results show that the de-mixed embeddings are close to the clean embeddings in quality, and the back-end speaker recognition model can use the de-mixed embeddings to reach 96.9% speaker identification accuracy, compared to 98.5% achieved using clean embeddings, on the TIMIT dataset. The second approach is the first end-to-end weakly supervised speaker identification approach based on a novel hierarchical transformer network architecture. The results show that the proposed model can capture the properties of two speakers from one input utterance, and the hierarchical transformer network achieves more than 3% relative improvement over the baselines in all test conditions.
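    The embedding de-mixing idea, separating two speakers in an embedding space rather than in the signal space, can be sketched as a small network that maps a mixture embedding (here together with a reference embedding of the target speaker) to an estimate of that speaker's clean embedding. The architecture, dimensions, and loss below are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch of embedding de-mixing in an embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingDemixer(nn.Module):
    def __init__(self, emb_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, mixed_emb: torch.Tensor, ref_emb: torch.Tensor) -> torch.Tensor:
        # mixed_emb: (B, emb_dim) embedding of the two-speaker signal
        # ref_emb:   (B, emb_dim) reference embedding of the target speaker
        return self.net(torch.cat([mixed_emb, ref_emb], dim=1))

# Training could push the de-mixed embedding towards the clean one, e.g.:
# loss = 1 - F.cosine_similarity(demixer(mixed, ref), clean).mean()
```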