Robust ASR using Support Vector Machines
The improved theoretical properties of Support Vector Machines with respect to other machine learning alternatives due to their max-margin training paradigm have led us to suggest them as a good technique for robust speech recognition. However, important shortcomings have had to be circumvented, the most important being the normalisation of the time duration of different realisations of the acoustic speech units.
In this paper, we have compared two approaches in noisy environments: first, a hybrid HMM–SVM solution where a fixed number of frames is selected by means of an HMM segmentation and second, a normalisation kernel called Dynamic Time Alignment Kernel (DTAK) first introduced in Shimodaira et al. [Shimodaira, H., Noma, K., Nakai, M., Sagayama, S., 2001. Support vector machine with dynamic time-alignment kernel for speech recognition. In: Proc. Eurospeech, Aalborg, Denmark, pp. 1841–1844] and based on DTW (Dynamic Time Warping). Special attention has been paid to the adaptation of both alternatives to noisy environments, comparing two types of parameterisations and performing suitable feature normalisation operations. The results show that the DTA Kernel provides important advantages over the baseline HMM system in moderate to severe noise conditions, also outperforming the hybrid system.
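The abstract above says that DTAK is based on DTW, which aligns two realisations of different durations. As a rough illustration only, the sketch below computes a plain DTW alignment cost between two sequences of feature vectors. It is not the DTAK kernel from Shimodaira et al. and not the authors' implementation; the function name `dtw_distance`, the Euclidean local distance, and the three-way step pattern are all assumptions made for this example.

```python
import math


def dtw_distance(a, b):
    """Plain DTW alignment cost between two sequences of feature vectors.

    Assumptions (not from the paper): Euclidean local distance and the
    basic step pattern (insertion, deletion, match) with no slope limits.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two toy one-dimensional "utterances" of different lengths
short = [(0.0,), (1.0,), (2.0,)]
stretched = [(0.0,), (1.0,), (1.0,), (2.0,)]
dtw_cost = dtw_distance(short, stretched)
```

Here the second sequence is a time-stretched copy of the first, so the alignment absorbs the duration difference. That is the duration-normalisation problem the abstract describes.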
LibriMix: An Open-Source Dataset for Generalizable Speech Separation
In recent years, wsj0-2mix has become the reference dataset for
single-channel speech separation. Most deep learning-based speech separation
models today are benchmarked on it. However, recent studies have shown
important performance drops when models trained on wsj0-2mix are evaluated on
other, similar datasets. To address this generalization issue, we created
LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension,
WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker
mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we
achieve competitive performance on all LibriMix versions. In order to fairly
evaluate across datasets, we introduce a third test set based on VCTK for
speech and WHAM! for noise. Our experiments show that the generalization error
is smaller for models trained with LibriMix than with WHAM!, in both clean and
noisy conditions. Aiming towards evaluation in more realistic,
conversation-like scenarios, we also release a sparsely overlapping version of
LibriMix's test set. Comment: submitted to INTERSPEECH 202
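The abstract describes speech mixtures combined with ambient noise. As a hedged illustration of what additive mixing can look like, the sketch below scales a noise signal so that the speech-to-noise ratio matches a requested value in dB. It is not the LibriMix generation recipe, whose level-setting details are not given in the abstract. The names `rms` and `mix_at_snr`, and the choice of an RMS-based SNR, are assumptions for this example.

```python
import math


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target SNR in dB (RMS-based, an assumption).

    Returns the mixture and the gain that was applied to the noise.
    """
    noise_level = rms(noise)
    if noise_level == 0.0:
        raise ValueError("noise is silent; the SNR is undefined")
    # Choose gain so that 20*log10(rms(speech) / (gain*rms(noise))) == snr_db
    gain = rms(speech) / (noise_level * 10 ** (snr_db / 20))
    mixture = [s + gain * n for s, n in zip(speech, noise)]
    return mixture, gain


# Toy signals standing in for a speech clip and an ambient-noise clip
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixture, noise_gain = mix_at_snr(speech, noise, snr_db=0.0)
```

At 0 dB the scaled noise has the same RMS level as the speech. Larger `snr_db` values attenuate the noise further.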
Attention-Based Models for Text-Dependent Speaker Verification
Attention-based models have recently shown great performance on a range of
tasks, such as speech recognition, machine translation, and image captioning
due to their ability to summarize relevant information that spans the
entire length of an input sequence. In this paper, we analyze the use of
attention mechanisms for the problem of sequence summarization in our end-to-end
text-dependent speaker recognition system. We explore different topologies of
the attention layer and their variants, and compare different pooling methods on
the attention weights. Ultimately, we show that attention-based models can
improve the Equal Error Rate (EER) of our speaker verification system by
14% relative compared to our non-attention LSTM baseline model. Comment: Submitted to ICASSP 201
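The result above is reported as a relative improvement in EER. For readers unfamiliar with the metric, the sketch below estimates an EER from verification scores by sweeping a threshold until the false-acceptance and false-rejection rates are as close as possible, and then computes a relative reduction between two EER values. The scores and EER values are made-up toy numbers, not results from the paper, and the function names are hypothetical.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the error rate where FAR and FRR are closest.

    A trial is accepted when its score is >= the threshold. Candidate
    thresholds are the observed scores themselves (a simplification).
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection: a genuine (target) trial scored below threshold
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scored at or above threshold
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline` (lower is better)."""
    return (baseline - improved) / baseline


# Toy scores, invented for illustration only
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)

# Hypothetical EER values chosen to give a 14% relative reduction
rel = relative_reduction(2.00, 1.72)
```

A relative figure such as 14% describes the ratio of the reduction to the baseline EER, not a drop of 14 percentage points.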
Deep Speaker Feature Learning for Text-independent Speaker Verification
Recently deep neural networks (DNNs) have been used to learn speaker
features. However, the quality of the learned features is not sufficiently
good, so a complex back-end model, either neural or probabilistic, has to be
used to address the residual uncertainty when applied to speaker verification,
just as with raw features. This paper presents a convolutional time-delay deep
neural network structure (CT-DNN) for speaker feature learning. Our
experimental results on the Fisher database demonstrated that this CT-DNN can
produce high-quality speaker features: even with a single feature (0.3 seconds
including the context), the EER can be as low as 7.68%. This effectively
confirmed that the speaker trait is largely a deterministic short-time property
rather than a long-time distributional pattern, and therefore can be extracted
from just dozens of frames. Comment: deep neural networks, speaker verification, speaker featur
Exploring Language-Independent Emotional Acoustic Features via Feature Selection
We propose a novel feature selection strategy to discover
language-independent acoustic features that tend to be responsible for emotions
regardless of language, linguistics and other factors. Experimental results
suggest that the language-independent feature subset discovered yields
performance comparable to that of the full feature set on various emotional
speech corpora. Comment: 15 pages, 2 figures, 6 table