Learning from Past Mistakes: Improving Automatic Speech Recognition Output via Noisy-Clean Phrase Context Modeling
Automatic speech recognition (ASR) systems often make unrecoverable errors
due to subsystem pruning (acoustic, language and pronunciation models); for
example, words may be pruned on acoustic evidence using short-term context
before rescoring with long-term linguistic context. In this work we model
ASR as a phrase-based noisy transformation channel and propose an error
correction system that can learn from the aggregate errors of all the
independent modules constituting the ASR and attempt to invert them. The
proposed system can exploit long-term context using a neural network language
model, and it can both choose better among existing ASR output possibilities
and re-introduce previously pruned or unseen (out-of-vocabulary) phrases. It
provides corrections under poorly performing ASR conditions without degrading
accurate transcriptions, and the gains are larger on out-of-domain and
mismatched-data ASR. Our system consistently improves over the baseline ASR,
even when the baseline is further optimized through recurrent neural network
language model rescoring. This demonstrates that ASR improvements can be
exploited independently and that our proposed system can still provide
benefits on highly optimized ASR. Finally, we present an extensive analysis
of the types of errors corrected by our system.
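The noisy-channel view lends itself to a compact sketch. Below is a minimal, hypothetical Python illustration: a phrase table maps frequently mis-recognized phrases to candidate corrections, and a stubbed language-model score arbitrates. The phrase table, the lm_score stub, and all data are invented for illustration; this is not the paper's actual system.

import math

# Hypothetical phrase table learned from aggregate ASR errors:
# noisy phrase -> [(clean candidate, log P(noisy | clean)), ...]
PHRASE_TABLE = {
    "eye scream": [("ice cream", math.log(0.7)), ("eye scream", math.log(0.3))],
    "wreck a nice beach": [("recognize speech", math.log(0.6)),
                           ("wreck a nice beach", math.log(0.4))],
}

def lm_score(sentence):
    # Stub for a neural LM log-probability; a real system would use a
    # long-context neural network language model here.
    preferred = {"i want ice cream", "systems that recognize speech"}
    return 0.0 if sentence in preferred else -5.0

def correct(asr_output):
    """Noisy-channel decoding: argmax over clean of log P(noisy|clean) + log P(clean)."""
    best = (asr_output, lm_score(asr_output))
    for noisy, candidates in PHRASE_TABLE.items():
        if noisy in asr_output:
            for clean, channel_logp in candidates:
                hyp = asr_output.replace(noisy, clean)
                score = channel_logp + lm_score(hyp)
                if score > best[1]:
                    best = (hyp, score)
    return best[0]

print(correct("i want eye scream"))  # -> "i want ice cream"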
Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities
The Softmax function on top of a final linear layer is the de facto method to
output probability distributions in neural networks. In many applications such
as language models or text generation, this model has to produce distributions
over large output vocabularies. Recently, this has been shown to have limited
representational capacity due to its connection with the rank bottleneck in
matrix factorization. However, little is known about the limitations of
Linear-Softmax for quantities of practical interest such as cross entropy or
mode estimation, a direction that we explore here. As an efficient and
effective solution to alleviate this issue, we propose to learn parametric
monotonic functions on top of the logits. We theoretically investigate the rank
increasing capabilities of such monotonic functions. Empirically, our method
improves on two different quality metrics over the traditional Linear-Softmax
layer in synthetic and real language model experiments, adding little time or
memory overhead while remaining comparable to the more computationally
expensive mixture of Softmaxes.
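As a rough illustration of the idea (not the authors' exact parameterization), a learnable, strictly monotonic pointwise function can be built as a positive combination of monotonic primitives and applied to each logit before the softmax. The NumPy sketch below shows only the forward computation, with made-up parameter values.

import numpy as np

def monotonic(z, a, b, c):
    """Pointwise monotonic map f(z) = z + sum_k a_k * tanh(b_k * z + c_k).

    With a_k, b_k >= 0 every term is non-decreasing in z, so f is strictly
    increasing; in training, a_k and b_k would be kept positive (e.g. via
    softplus) and learned by gradient descent."""
    return z + np.sum(a * np.tanh(b * z[..., None] + c), axis=-1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Made-up parameters for 3 primitive components.
a = np.array([0.5, 1.0, 0.2])   # non-negative mixing weights
b = np.array([1.0, 2.0, 0.5])   # non-negative slopes
c = np.array([0.0, -1.0, 1.0])  # arbitrary shifts

logits = np.array([1.2, -0.3, 0.7, 0.0])
probs = softmax(monotonic(logits, a, b, c))
print(probs, probs.sum())  # a valid distribution over the vocabulary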
The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge
This paper describes the NTNU ASR system participating in the Interspeech
2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD
group of ISCA. This shared task is made especially challenging by the combined
variability of non-native and child speech. In the closed-track evaluation,
all participants were restricted to developing their systems using only the
speech and text corpora provided by the organizer. To cope with this
low-resource condition, we built our ASR system on top of CNN-TDNNF-based
acoustic models, combined with several data augmentation strategies, including
utterance- and word-level speed perturbation and spectrogram augmentation,
alongside a simple yet effective data-cleansing approach. All variants of our
ASR system employed an RNN-based language model, trained solely on the text
dataset released by the organizer, to rescore the first-pass recognition
hypotheses. Our best configuration came in second place with a word error rate
(WER) of 17.59%, while the top-performing, second runner-up and official
baseline systems achieved 15.67%, 18.71% and 35.09%, respectively. Comment:
Submitted to Interspeech 2020 Special Session: Shared Task on Automatic Speech
Recognition for Non-Native Children's Speech
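Spectrogram augmentation here refers to SpecAugment-style masking; a minimal NumPy sketch of the core idea (random frequency and time masks on a log-mel spectrogram) follows, with mask counts and widths chosen arbitrarily for illustration.

import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, max_f=8, max_t=20):
    """Zero out random frequency bands and time spans of a spectrogram.

    spec: array of shape (n_mels, n_frames). Returns an augmented copy."""
    out = spec.copy()
    n_mels, n_frames = out.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, max(1, n_mels - f))
        out[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(1, n_frames - t))
        out[:, t0:t0 + t] = 0.0
    return out

# Example: augment a fake 80-mel, 300-frame utterance.
augmented = spec_augment(rng.standard_normal((80, 300)))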
The Z-loss: a shift and scale invariant classification loss belonging to the Spherical Family
Despite being the standard loss function to train multi-class neural
networks, the log-softmax has two potential limitations. First, it involves
computations that scale linearly with the number of output classes, which can
restrict the size of problems we are able to tackle with current hardware.
Second, it remains unclear how closely it matches task losses such as the
top-k error rate or other non-differentiable evaluation metrics that we
ultimately aim to optimize. In this paper, we introduce an alternative classification
loss function, the Z-loss, which is designed to address these two issues.
Unlike the log-softmax, it has the desirable property of belonging to the
spherical loss family (Vincent et al., 2015), a class of loss functions for
which training can be performed very efficiently with a complexity independent
of the number of output classes. We show experimentally that it significantly
outperforms the other spherical loss functions previously investigated.
Furthermore, we show on a word language modeling task that it also outperforms
the log-softmax with respect to certain ranking scores, such as top-k scores,
suggesting that the Z-loss has the flexibility to better match the task loss.
These qualities make the Z-loss an appealing candidate for efficiently
training networks with very large output layers, such as word-level language
models and other extreme classification problems. On the One Billion Word
(Chelba et al., 2014) dataset, we are able to train a model with the Z-loss 40
times faster than with the log-softmax and more than 4 times faster than with
the hierarchical softmax.
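For context, here is a minimal NumPy sketch of the spherical softmax, the best-known member of the spherical family cited above (Vincent et al., 2015). The Z-loss itself has a different, shift- and scale-invariant form, so this illustrates the family, not the Z-loss.

import numpy as np

def spherical_softmax(o, eps=1e-12):
    """Spherical family example: p_i = o_i^2 / sum_j o_j^2.

    Its log-likelihood depends on the outputs only through o_target and the
    squared norm ||o||^2, which is what enables training at a cost
    independent of the number of classes (Vincent et al., 2015)."""
    sq = o ** 2
    return sq / (sq.sum(axis=-1, keepdims=True) + eps)

def spherical_nll(o, target):
    """Negative log-likelihood under the spherical softmax."""
    p = spherical_softmax(o)
    return -np.log(p[..., target] + 1e-12)

logits = np.array([2.0, -1.0, 0.5])
print(spherical_softmax(logits))        # [0.7619, 0.1905, 0.0476]
print(float(spherical_nll(logits, 0)))  # ~0.272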
Recent Progresses in Deep Learning based Acoustic Models (Updated)
In this paper, we summarize recent progress in deep learning based
acoustic models and the motivation and insights behind the surveyed techniques.
We first discuss acoustic models that can effectively exploit variable-length
contextual information, such as recurrent neural networks (RNNs), convolutional
neural networks (CNNs), and their various combinations with other models. We
then describe acoustic models that are optimized end-to-end, with emphasis on
feature representations learned jointly with the rest of the system, the
connectionist temporal classification (CTC) criterion, and the attention-based
sequence-to-sequence model. We further illustrate robustness issues in speech
recognition systems, and discuss acoustic model adaptation, speech enhancement
and separation, and robust training strategies. We also cover modeling
techniques that lead to more efficient decoding and discuss possible future
directions in acoustic model research. Comment: This is an updated version,
with literature through ICASSP 2018, of the paper: Dong Yu and Jinyu Li,
"Recent Progresses in Deep Learning based Acoustic Models," IEEE/CAA Journal
of Automatica Sinica, vol. 4, no. 3, 2017.
Investigation of Large-Margin Softmax in Neural Language Modeling
To encourage intra-class compactness and inter-class separability among
trainable feature vectors, large-margin softmax methods have been developed
and are widely applied in the face recognition community. The introduction of
the large-margin concept into the softmax is reported to have good properties
such as enhanced discriminative power, less overfitting and well-defined
geometric intuitions. Nowadays, language modeling is commonly approached with
neural networks using softmax and cross entropy. In this work, we investigate
whether introducing large margins into neural language models improves
perplexity and, consequently, word error rate in automatic speech recognition.
Specifically, we first implement and test various types of conventional
margins following previous work in face recognition. To address the distribution
of natural language data, we then compare different strategies for word vector
norm-scaling. After that, we apply the best norm-scaling setup in combination
with various margins, and conduct neural language model rescoring experiments
in automatic speech recognition. We find that although perplexity deteriorates
slightly, neural language models with large-margin softmax can yield word
error rates similar to that of the standard softmax baseline. Finally, the
expected margins are analyzed through visualization of word vectors, showing
that syntactic and semantic relationships are preserved. Comment: submitted to
INTERSPEECH 2020
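One common conventional margin from the face recognition literature is the additive-margin (AM) softmax; the self-contained NumPy sketch below, with made-up scale and margin values, shows one concrete instance of the margins discussed, not the paper's exact setup.

import numpy as np

def am_softmax_loss(features, weights, target, s=30.0, m=0.35):
    """Additive-margin softmax: use s * (cos(theta_y) - m) for the target
    class and s * cos(theta_j) elsewhere, then apply cross entropy.

    features: (d,) feature vector; weights: (C, d) class weight matrix."""
    f = features / np.linalg.norm(features)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ f                              # cosine similarity to each class
    logits = s * cos
    logits[target] = s * (cos[target] - m)   # subtract the margin from the target
    logits -= logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

rng = np.random.default_rng(0)
x = rng.standard_normal(16)          # a word/feature vector
W = rng.standard_normal((100, 16))   # 100-class output layer
print(am_softmax_loss(x, W, target=7))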
Spatial postprocessing of ensemble forecasts for temperature using nonhomogeneous Gaussian regression
Statistical postprocessing techniques are commonly used to improve the skill
of ensembles of numerical weather forecasts. This paper considers spatial
extensions of the well-established nonhomogeneous Gaussian regression (NGR)
postprocessing technique for surface temperature, and a recent modification
thereof in which the local climatology is included in the regression model for
locally adaptive postprocessing. In a comparative study employing 21-h
forecasts from the COSMO-DE ensemble prediction system over Germany, two
approaches for modeling spatial forecast error correlations are considered: a
parametric Gaussian random field model and the ensemble copula coupling
approach, which utilizes the spatial rank correlation structure of the raw
ensemble. Additionally, the NGR methods are compared to both univariate and
spatial versions of the ensemble Bayesian model averaging (BMA) postprocessing
technique.
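To make the NGR idea concrete: the predictive distribution is Gaussian with mean affine in the ensemble mean and variance affine in the ensemble variance, with coefficients fit by minimizing a proper score such as the CRPS. A small SciPy sketch with synthetic data follows; the coefficient values and data are invented, and this is the basic (non-spatial) NGR, not the paper's spatial extension.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_normal(mu, sigma, y):
    """Closed-form CRPS of a normal predictive distribution at observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z)
                    - 1 / np.sqrt(np.pi))

def ngr_fit(ens_mean, ens_var, obs):
    """Fit NGR coefficients (a, b, c, d) in  Y ~ N(a + b*mean, c + d*var)."""
    def objective(p):
        a, b, c, d = p
        sigma = np.sqrt(np.maximum(c + d * ens_var, 1e-6))
        return crps_normal(a + b * ens_mean, sigma, obs).mean()
    return minimize(objective, x0=[0.0, 1.0, 1.0, 1.0],
                    method="Nelder-Mead").x

# Synthetic toy data: 200 forecast cases, 20-member ensemble with a warm bias.
rng = np.random.default_rng(1)
truth = 15 + 5 * rng.standard_normal(200)
ens = truth[:, None] + 1.5 + 2.0 * rng.standard_normal((200, 20))
a, b, c, d = ngr_fit(ens.mean(axis=1), ens.var(axis=1), truth)
print(a, b, c, d)  # a should absorb the +1.5 bias, b should be near 1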
Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets
With the rapid advance of the Internet, search engines (e.g., Google, Bing,
Yahoo!) are used by billions of users every day. The main function of a
search engine is to locate the most relevant webpages for what the user
requests. This report focuses on the core problem of information retrieval:
how to learn the relevance between a document (very often a webpage) and a
query given by the user. Our analysis consists of two parts: 1) we use
standard statistical methods to select important features among 137 candidates
provided by information retrieval researchers from Microsoft; we find that not
all the features are useful, and give interpretations of the top-selected
ones; 2) we give prediction baselines on the real-world MSLR-WEB dataset using
various learning algorithms, finding that boosted-tree and random-forest
models generally achieve the best predictive performance. This agrees with the
mainstream view in the information retrieval community that tree-based
algorithms outperform the other candidates for this problem. Comment: 24 pages
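A minimal scikit-learn sketch of this kind of pipeline might look like the following; hypothetical synthetic data stands in for MSLR-WEB (137 features, graded relevance labels in 0..4), and a pointwise boosted-tree regressor serves as the baseline.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for MSLR-WEB: 137 features, graded relevance in {0..4}.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 137))
y = np.clip(np.round(X[:, 0] + 0.5 * X[:, 1]
                     + rng.normal(0, 0.5, 5000) + 2), 0, 4)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pointwise baseline: regress the relevance grade with boosted trees.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                                  random_state=0)
model.fit(X_tr, y_tr)
print("test R^2:", model.score(X_te, y_te))

# Feature selection view: rank features by learned importance.
top = np.argsort(model.feature_importances_)[::-1][:10]
print("top features:", top)  # features 0 and 1 should dominate here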
Joint Optimization of Masks and Deep Recurrent Neural Networks for Monaural Source Separation
Monaural source separation is important for many real-world applications. It
is challenging because, with only a single channel of information available
and no further constraints, infinitely many solutions are possible. In this
paper, we explore joint optimization of masking functions and deep recurrent
neural networks for monaural source separation tasks, including monaural speech
separation, monaural singing voice separation, and speech denoising. The joint
optimization of the deep recurrent neural networks with an extra masking layer
enforces a reconstruction constraint. Moreover, we explore a discriminative
criterion for training neural networks to further enhance the separation
performance. We evaluate the proposed system on the TSP, MIR-1K, and TIMIT
datasets for speech separation, singing voice separation, and speech denoising
tasks, respectively. Our approaches achieve 2.30--4.98 dB SDR gain compared to
NMF models in the speech separation task, 2.30--2.48 dB GNSDR gain and
4.32--5.42 dB GSIR gain compared to existing models in the singing voice
separation task, and outperform NMF and DNN baselines in the speech denoising
task.
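The extra masking layer enforces that the estimated sources sum back to the mixture; a NumPy sketch of this soft time-frequency masking step is given below, with invented magnitudes and the recurrent network itself omitted.

import numpy as np

def soft_mask_separate(mixture_mag, est1_mag, est2_mag, eps=1e-8):
    """Masking layer: rescale two network estimates so they sum to the mixture.

    mask_i = |est_i| / (|est_1| + |est_2|); source_i = mask_i * mixture.
    This enforces the reconstruction constraint source_1 + source_2 = mixture."""
    denom = est1_mag + est2_mag + eps
    m1 = est1_mag / denom
    m2 = est2_mag / denom
    return m1 * mixture_mag, m2 * mixture_mag

# Toy magnitude spectrograms (freq bins x frames); a deep RNN would
# normally produce est1_mag and est2_mag from the mixture.
rng = np.random.default_rng(0)
mix = rng.uniform(size=(257, 100))
s1, s2 = soft_mask_separate(mix, rng.uniform(size=mix.shape),
                            rng.uniform(size=mix.shape))
assert np.allclose(s1 + s2, mix, atol=1e-5)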
Learning K-way D-dimensional Discrete Codes for Compact Embedding Representations
Conventional embedding methods directly associate each symbol with a
continuous embedding vector, which is equivalent to applying a linear
transformation based on a "one-hot" encoding of the discrete symbols. Despite
its simplicity, such an approach yields a number of parameters that grows
linearly with the vocabulary size and can lead to overfitting. In this work, we
propose a much more compact K-way D-dimensional discrete encoding scheme to
replace the "one-hot" encoding. In the proposed "KD encoding", each symbol is
represented by a D-dimensional code with a cardinality of K, and the final
symbol embedding vector is generated by composing the code embedding vectors.
To learn semantically meaningful codes end-to-end, we derive a relaxed discrete
optimization approach based on stochastic gradient descent, which can be
generally applied to any differentiable computational graph with an embedding
layer. In our experiments with various applications from natural language
processing to graph convolutional networks, the total size of the embedding
layer can be reduced by up to 98% while achieving similar or better
performance. Comment: ICML 2018. arXiv admin note: text overlap with
arXiv:1711.0306
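A NumPy sketch of the lookup-and-compose step (with invented sizes; the code assignment, which the paper learns end-to-end, is fixed randomly here) illustrates the parameter savings: V*d embedding parameters shrink to roughly D*K*d code-embedding parameters.

import numpy as np

rng = np.random.default_rng(0)

V, d = 10000, 128   # vocabulary size, embedding dimension
K, D = 16, 8        # K-way codes, D code positions

# In the paper these codes are learned end-to-end; here they are random.
codes = rng.integers(0, K, size=(V, D))          # each symbol -> D-dim code
code_emb = rng.standard_normal((D, K, d)) * 0.1  # one K x d table per position

def kd_embed(symbol_id):
    """Compose a symbol embedding by summing its D code embeddings."""
    c = codes[symbol_id]                          # (D,)
    return code_emb[np.arange(D), c].sum(axis=0)  # (d,)

vec = kd_embed(42)
print(vec.shape)                 # (128,)
print("one-hot params:", V * d)  # 1,280,000
print("KD params:", D * K * d)   # 16,384 (plus the small code table)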