Scaling Speech Enhancement in Unseen Environments with Noise Embeddings
We address the problem of generalising speech enhancement to unseen
environments by performing two manipulations. First, we compute an embedding
from an additional noise-only recording of the environment, and use this
embedding to alter activations in the main enhancement subnetwork. Second, we
scale the number of noise environments seen at training time to 16,784.
Experimental results show that both manipulations reduce the word error rate of
a pretrained speech recognition system and improve enhancement quality
according to a number of performance measures. Specifically, our best model
reduces the word error rate from 34.04% on noisy speech to 15.46% on the
enhanced speech.
Enhanced audio samples can be found at
https://speechenhancement.page.link/samples
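One plausible reading of "use this embedding to alter activations" is a
FiLM-style per-channel scale and shift derived from the noise embedding. The
abstract does not specify the mechanism, so the sketch below is an assumption:
the function name, projection weights, and shapes are all hypothetical.

```python
import numpy as np

def film_modulate(activations, env_embedding, w_scale, w_shift):
    """Condition enhancement-subnetwork activations on an environment
    (noise-only) embedding via a learned per-channel scale and shift.
    This is a hypothetical FiLM-style mechanism, not necessarily the
    one used in the paper."""
    gamma = env_embedding @ w_scale  # per-channel scale, shape (channels,)
    beta = env_embedding @ w_shift   # per-channel shift, shape (channels,)
    return activations * (1.0 + gamma) + beta

# Toy demo: with zero projection weights the activations pass through
# unchanged; nonzero weights rescale and shift them per channel.
acts = np.array([[0.2, -1.0, 0.5]])   # (batch=1, channels=3)
env = np.array([1.0, 2.0])            # embedding of the noise-only clip
out = film_modulate(acts, env, np.zeros((2, 3)), np.zeros((2, 3)))
print(np.allclose(out, acts))  # True
```

Making the modulation a residual perturbation of the identity (the `1.0 +
gamma` term) keeps the enhancement subnetwork's behaviour stable when the
embedding carries little information about the environment.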
Sequence Transduction with Recurrent Neural Networks
Many machine learning tasks can be expressed as the transformation---or
\emph{transduction}---of input sequences into output sequences: speech
recognition, machine translation, protein secondary structure prediction and
text-to-speech to name but a few. One of the key challenges in sequence
transduction is learning to represent both the input and output sequences in a
way that is invariant to sequential distortions such as shrinking, stretching
and translating. Recurrent neural networks (RNNs) are a powerful sequence
learning architecture that has proven capable of learning such representations.
However, RNNs traditionally require a pre-defined alignment between the input
and output sequences to perform transduction. This is a severe limitation since
\emph{finding} the alignment is the most difficult aspect of many sequence
transduction problems. Indeed, even determining the length of the output
sequence is often challenging. This paper introduces an end-to-end,
probabilistic sequence transduction system, based entirely on RNNs, that is in
principle able to transform any input sequence into any finite, discrete output
sequence. Experimental results for phoneme recognition are provided on the
TIMIT speech corpus.
Comment: First published in the International Conference on Machine Learning
(ICML) 2012 Workshop on Representation Learning.
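The way the transducer avoids a pre-defined alignment is to marginalise over
all monotonic alignments with a forward algorithm on a (frames × output
positions) lattice, where each step either emits a blank (advance one frame)
or a target label (advance one output position). A minimal numeric sketch of
that forward pass, assuming per-node blank and label probabilities are already
given (in practice they come from the prediction and transcription networks):

```python
import numpy as np

def rnnt_forward_prob(blank_p, label_p):
    """Total probability of a target sequence under the transducer,
    summed over all monotonic alignments.

    blank_p[t, u]: probability of emitting blank at frame t, output position u
    label_p[t, u]: probability of emitting the next target label there
    Both arrays have shape (T, U) with U = len(target) + 1 positions.
    """
    T, U = blank_p.shape
    alpha = np.zeros((T, U))
    alpha[0, 0] = 1.0
    for t in range(T):
        for u in range(U):
            if t == 0 and u == 0:
                continue
            # Arrive by consuming a frame with a blank emission...
            from_blank = alpha[t - 1, u] * blank_p[t - 1, u] if t > 0 else 0.0
            # ...or by emitting the next target label at the same frame.
            from_label = alpha[t, u - 1] * label_p[t, u - 1] if u > 0 else 0.0
            alpha[t, u] = from_blank + from_label
    # A final blank terminates the alignment at the last lattice node.
    return alpha[T - 1, U - 1] * blank_p[T - 1, U - 1]

# Toy example: 2 frames, 1 target label, uniform 0.5/0.5 probabilities.
p = rnnt_forward_prob(np.full((2, 2), 0.5), np.full((2, 2), 0.5))
print(round(p, 4))  # 0.25
```

Because this sum is computed by dynamic programming, it is differentiable with
respect to the per-node probabilities, which is what makes end-to-end training
without an explicit alignment possible. (Real implementations work in log
space for numerical stability; plain probabilities are used here for clarity.)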