Speaker verification using attentive multi-scale convolutional recurrent network
In this paper, we propose a speaker verification method based on an Attentive
Multi-scale Convolutional Recurrent Network (AMCRN). The proposed AMCRN can
acquire both local spatial information and global sequential information from
the input speech recordings. In the proposed method, a logarithmic Mel spectrum is
extracted from each speech recording and then fed to the proposed AMCRN to
learn a speaker embedding. Afterwards, the learned speaker embedding is fed to
a back-end classifier (such as a cosine similarity metric) for scoring in the
testing stage.
testing stage. The proposed method is compared with state-of-the-art methods
for speaker verification. Experimental data are three public datasets that are
selected from two large-scale speech corpora (VoxCeleb1 and VoxCeleb2).
Experimental results show that our method outperforms the baseline methods in
terms of equal error rate and minimum detection cost function, and has
advantages over most of the baseline methods in terms of computational
complexity and memory requirements. In addition, our method generalizes well
across truncated speech segments of different durations, and the speaker
embedding learned by the proposed AMCRN generalizes more strongly across the two back-end
classifiers.
Comment: 21 pages, 6 figures, 8 tables. Accepted for publication in Applied Soft Computing.
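The back-end scoring step described above (a learned speaker embedding compared with a cosine similarity metric) can be sketched as follows. This is a minimal illustration, not the paper's code: the embeddings here are toy vectors standing in for what the AMCRN would produce, and `cosine_score` is a hypothetical helper name.

```python
import numpy as np

def cosine_score(emb_enroll, emb_test):
    """Cosine similarity between an enrollment and a test speaker embedding.

    Returns a score in [-1, 1]; higher means the two utterances are more
    likely to come from the same speaker.
    """
    a = np.asarray(emb_enroll, dtype=float)
    b = np.asarray(emb_test, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (stand-ins for learned AMCRN embeddings).
same_speaker = cosine_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
diff_speaker = cosine_score([1.0, 2.0, 3.0], [-3.0, 0.5, -1.0])
```

In a verification system, the score is then compared against a threshold tuned on development data (which is where the equal error rate and detection cost function mentioned above are measured).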
Interpreter identification in the Polish Interpreting Corpus
This paper describes the automated identification of interpreter voices in the Polish Interpreting Corpus (PINC). After collecting a set of voice samples from the interpreters, a deep neural network model was used to match every utterance in the corpus with a specific individual. The final result is highly accurate and provides a considerable saving of time and human effort.
Knowing What to Listen to: Early Attention for Deep Speech Representation Learning
Deep learning techniques have considerably improved speech processing in
recent years. Speech representations extracted by deep learning models are
being used in a wide range of tasks such as speech recognition, speaker
recognition, and speech emotion recognition. Attention models play an important
role in improving deep learning models. However, current attention mechanisms
are unable to attend to fine-grained information items. In this paper, we
propose the novel Fine-grained Early Frequency Attention (FEFA) for speech
signals. This model is capable of focusing on information items as small as
frequency bins. We evaluate the proposed model on two popular tasks of speaker
recognition and speech emotion recognition. Two widely used public datasets,
VoxCeleb and IEMOCAP, are used for our experiments. The model is implemented on
top of several prominent deep models as backbone networks to evaluate its
impact on performance compared to the original networks and other related work.
Our experiments show that by adding FEFA to different CNN architectures,
performance is consistently improved by substantial margins, even setting a new
state-of-the-art for the speaker recognition task. We also tested our model
against different levels of added noise, showing improved robustness and
lower sensitivity compared to the backbone networks.
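The core idea described above, attending to information items as small as individual frequency bins, can be sketched as a per-bin attention map applied to the input spectrogram before the backbone network. This is a hedged illustration of bin-level attention in the spirit of FEFA, not the paper's implementation; `frequency_bin_attention` and its parameters `W` and `b` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frequency_bin_attention(spec, W, b):
    """Reweight each frequency bin of a spectrogram, frame by frame.

    spec: (F, T) spectrogram; W: (F, F) and b: (F,) are the parameters of
    a per-bin scoring layer. The attention weights are normalized over the
    frequency axis, so each frame distributes emphasis across its bins.
    """
    scores = W @ spec + b[:, None]   # one score per (bin, frame)
    alpha = softmax(scores, axis=0)  # attention over frequency bins
    return alpha * spec              # element-wise reweighted input

# Toy usage: a random 40-bin, 100-frame spectrogram.
rng = np.random.default_rng(0)
F, T = 40, 100
spec = rng.standard_normal((F, T))
out = frequency_bin_attention(spec, 0.01 * rng.standard_normal((F, F)),
                              np.zeros(F))
```

The reweighted spectrogram has the same shape as the input, so a block like this can sit in front of an arbitrary CNN backbone, which matches the abstract's description of FEFA being added on top of several prominent deep models.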
Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review
Artificial Neural Networks (ANNs) were created inspired by the neural networks in the human brain and have been widely applied in speech processing. Application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, a growing number of papers have proposed ANNs supported by deep learning algorithms in conjunction with some mechanism to achieve symmetry with the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention mechanisms integrated into deep learning algorithms or of their relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) the most relevant features, (ii) the ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed, and their strengths and weaknesses were determined.