The Diagonalized Newton Algorithm for Nonnegative Matrix Factorization
Non-negative matrix factorization (NMF) has become a popular machine learning
approach to many problems in text mining, speech and image processing,
bio-informatics and seismic data analysis to name a few. In NMF, a matrix of
non-negative data is approximated by the low-rank product of two matrices with
non-negative entries. In this paper, the approximation quality is measured by
the Kullback-Leibler divergence between the data and its low-rank
reconstruction. The existence of the simple multiplicative update (MU)
algorithm for computing the matrix factors has contributed to the success of
NMF. Despite the availability of algorithms showing faster convergence, MU
remains popular due to its simplicity. In this paper, a diagonalized Newton
algorithm (DNA) is proposed showing faster convergence while the implementation
remains simple and suitable for high-rank problems. The DNA algorithm is
applied to various publicly available data sets, showing a substantial speed-up
on modern hardware.
Comment: 8 pages + references; International Conference on Learning Representations, 201
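As context for the abstract above, the sketch below gives the classical multiplicative-update (MU) baseline for KL-divergence NMF that the paper compares against; it is not the proposed diagonalized Newton algorithm, and the function name, iteration count, and tolerances are illustrative assumptions.

```python
import numpy as np

def nmf_kl_mu(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF minimizing the KL divergence between V and W @ H."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)   # update H
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)   # update W
    return W, H

# Example: factorize a random non-negative 20 x 30 matrix at rank 5.
V = np.abs(np.random.default_rng(1).normal(size=(20, 30)))
W, H = nmf_kl_mu(V, rank=5)
```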
Improving Source Separation via Multi-Speaker Representations
Lately there have been novel developments in deep learning towards solving
the cocktail party problem. Initial results are very promising and motivate
further research in the domain. One technique that has not yet been explored in
the neural network approach to this task is speaker adaptation. Intuitively,
information on the speakers that we are trying to separate seems fundamentally
important for the speaker separation task. However, retrieving this speaker
information is challenging since the speaker identities are not known a priori
and multiple speakers are simultaneously active. This poses a chicken-and-egg
problem. To tackle it, source signals and i-vectors are
estimated alternately. We show that blind multi-speaker adaptation improves the
results of the network and that (in our case) the network is not capable of
adequately retrieving this useful speaker information by itself.
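A minimal sketch of the alternating estimation scheme described above, assuming hypothetical `separate` and `extract_ivector` callables standing in for the separation network and the i-vector extractor (neither is a real library API):

```python
import numpy as np

def blind_multi_speaker_adaptation(mixture, separate, extract_ivector,
                                   n_speakers=2, n_rounds=3, ivec_dim=100):
    # Start from neutral (zero) i-vectors: speaker identities are unknown a priori.
    ivectors = [np.zeros(ivec_dim) for _ in range(n_speakers)]
    sources = None
    for _ in range(n_rounds):
        # 1) Separate the mixture, conditioned on the current speaker estimates.
        sources = separate(mixture, ivectors)
        # 2) Re-estimate one i-vector per separated source signal.
        ivectors = [extract_ivector(s) for s in sources]
    return sources, ivectors
```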
Multi-encoder attention-based architectures for sound recognition with partial visual assistance
Large-scale sound recognition data sets typically consist of acoustic
recordings obtained from multimedia libraries. As a consequence, modalities
other than audio can often be exploited to improve the outputs of models
designed for associated tasks. Frequently, however, not all contents are
available for all samples of such a collection: For example, the original
material may have been removed from the source platform at some point, and
therefore, non-auditory features can no longer be acquired.
We demonstrate that a multi-encoder framework can be employed to deal with
this issue by applying this method to attention-based deep learning systems,
which are currently part of the state of the art in the domain of sound
recognition. More specifically, we show that the proposed model extension can
successfully be utilized to incorporate partially available visual information
into the operational procedures of such networks, which normally only use
auditory features during training and inference. Experimentally, we verify that
the considered approach leads to improved predictions in a number of evaluation
scenarios pertaining to audio tagging and sound event detection. Additionally,
we scrutinize some properties and limitations of the presented technique.
Comment: Submitted to EURASIP Journal on Audio, Speech, and Music Processing
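For illustration, a minimal PyTorch sketch of a two-encoder attention block that falls back to audio-only features when the visual modality is missing; the dimensions, gating scheme, and class name are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiEncoderFusion(nn.Module):
    """Audio features are always present; visual features may be missing per sample."""

    def __init__(self, audio_dim=128, visual_dim=512, model_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, model_dim)
        self.visual_proj = nn.Linear(visual_dim, model_dim)
        self.attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)

    def forward(self, audio, visual, visual_available):
        # audio: (B, T, audio_dim), visual: (B, S, visual_dim)
        # visual_available: (B,) boolean mask, False where no video exists
        a = self.audio_proj(audio)
        v = self.visual_proj(visual)
        # Cross-attend from audio frames to visual tokens.
        fused, _ = self.attn(query=a, key=v, value=v)
        # Keep audio-only features for samples without visual data.
        gate = visual_available.float().view(-1, 1, 1)
        return a + gate * fused
```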
Character-Word LSTM Language Models
We present a Character-Word Long Short-Term Memory Language Model which both
reduces the perplexity with respect to a baseline word-level language model and
reduces the number of parameters of the model. Character information can reveal
structural (dis)similarities between words and can even be used when a word is
out-of-vocabulary, thus improving the modeling of infrequent and unknown words.
By concatenating word and character embeddings, we achieve up to 2.77% relative
improvement on English and 4.57% on Dutch compared to a baseline model with a
similar number of parameters. Moreover, we also outperform baseline word-level
models with a larger number of parameters.
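A minimal PyTorch sketch of the concatenation of word and character embeddings at the LSTM input; the embedding sizes, fixed characters-per-word count, and class name are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharWordLSTMLM(nn.Module):
    """Language model whose LSTM input concatenates word and character embeddings."""

    def __init__(self, word_vocab, char_vocab, word_dim=150, char_dim=25,
                 chars_per_word=8, hidden=512):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.lstm = nn.LSTM(word_dim + chars_per_word * char_dim, hidden,
                            batch_first=True)
        self.out = nn.Linear(hidden, word_vocab)

    def forward(self, word_ids, char_ids):
        # word_ids: (B, T), char_ids: (B, T, chars_per_word)
        w = self.word_emb(word_ids)          # (B, T, word_dim)
        c = self.char_emb(char_ids)          # (B, T, chars_per_word, char_dim)
        c = c.flatten(start_dim=2)           # (B, T, chars_per_word * char_dim)
        x = torch.cat([w, c], dim=-1)        # concatenated word + char embeddings
        h, _ = self.lstm(x)
        return self.out(h)                   # next-word logits
```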
Memory Time Span in LSTMs for Multi-Speaker Source Separation
With deep learning approaches becoming state-of-the-art in many speech (as
well as non-speech) related machine learning tasks, efforts are being made to
delve into the neural networks which are often considered as a black box. In
this paper, it is analyzed how recurrent neural networks (RNNs) cope with
temporal dependencies by determining the relevant memory time span in a long
short-term memory (LSTM) cell. This is done by leaking the state variable with
a controlled lifetime and evaluating the task performance. This technique can
be used for any task to estimate the time span the LSTM exploits in that
specific scenario. The focus in this paper is on the task of separating
speakers from overlapping speech. We discern two effects: a long-term effect,
probably due to speaker characterization, and a short-term effect, probably
exploiting phone-sized formant tracks.
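The state-leaking probe can be sketched as below, assuming a per-step multiplicative leak applied to the LSTM cell state; the exact leaking scheme and lifetime parameterization used in the paper may differ.

```python
import torch
import torch.nn as nn

def run_leaky_lstm(cell: nn.LSTMCell, inputs, leak):
    """Run an LSTMCell while leaking the cell state after every step.

    `leak` in (0, 1] bounds the effective memory lifetime: values close to 1
    preserve long-term memory, smaller values forget faster. Re-evaluating the
    task at several leak settings reveals the time span the LSTM relies on.
    """
    B, T, _ = inputs.shape
    h = inputs.new_zeros(B, cell.hidden_size)
    c = inputs.new_zeros(B, cell.hidden_size)
    outputs = []
    for t in range(T):
        h, c = cell(inputs[:, t], (h, c))
        c = leak * c  # controlled lifetime of the state variable
        outputs.append(h)
    return torch.stack(outputs, dim=1)
```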
Unsupervised Accent Adaptation Through Masked Language Model Correction Of Discrete Self-Supervised Speech Units
Self-supervised pre-trained speech models have strongly improved speech
recognition, yet they are still sensitive to domain shifts and accented or
atypical speech. Many of these models rely on quantisation or clustering to
learn discrete acoustic units. We propose to correct the discovered discrete
units for accented speech back to a standard pronunciation in an unsupervised
manner. A masked language model is trained on discrete units from a standard
accent and iteratively corrects an accented token sequence by masking
unexpected cluster sequences and predicting their common variant. Small accent
adapter blocks are inserted in the pre-trained model and fine-tuned by
predicting the corrected clusters, which leads to an increased robustness of
the pre-trained model towards a target accent, without any supervision. We
are able to improve a state-of-the-art HuBERT Large model on a downstream
accented speech recognition task by altering the training regime with the
proposed method.
Comment: Submitted to ICASSP202
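A minimal sketch of the iterative masked-correction loop described above, assuming a hypothetical `mlm` callable that maps a unit sequence to per-position logits; the mask token id, the probability threshold, and the number of rounds are illustrative assumptions, not the paper's settings.

```python
import torch

def correct_accented_units(units, mlm, n_rounds=3, threshold=0.1, mask_id=0):
    """Iteratively replace unexpected discrete units with the masked LM's prediction.

    units: (T,) LongTensor of cluster indices from accented speech.
    mlm(tokens): returns (T, vocab) logits under the standard-accent model.
    """
    tokens = units.clone()
    for _ in range(n_rounds):
        # Score each position's observed unit under the standard-accent MLM.
        probs = torch.softmax(mlm(tokens), dim=-1)
        token_prob = probs.gather(1, tokens.unsqueeze(1)).squeeze(1)
        # Mask positions whose unit is unexpected under the model.
        unexpected = token_prob < threshold
        masked = tokens.masked_fill(unexpected, mask_id)
        # Predict the common (standard-accent) variant for the masked positions.
        pred = torch.softmax(mlm(masked), dim=-1).argmax(dim=-1)
        tokens = torch.where(unexpected, pred, tokens)
    return tokens
```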
Multi-candidate missing data imputation for robust speech recognition
The application of Missing Data Techniques (MDT) to increase the noise robustness of HMM/GMM-based large vocabulary speech recognizers is hampered by a large computational burden: the likelihood evaluations imply solving many constrained least squares (CLSQ) optimization problems. As an alternative, researchers have proposed frontend MDT or have made oversimplifying independence assumptions for the backend acoustic model. In this article, we propose a fast Multi-Candidate (MC) approach that solves the per-Gaussian CLSQ problems approximately by selecting the best from a small set of candidate solutions, which are generated as the MDT solutions on a reduced set of cluster Gaussians. Experiments show that MC MDT runs as fast as the uncompensated recognizer while achieving the accuracy of the full backend optimization approach. The experiments also show that exploiting the more accurate acoustic model of the backend does pay off in terms of accuracy when compared to frontend MDT.
Wang Y., Van hamme H., ''Multi-candidate missing data imputation for robust speech recognition'', EURASIP Journal on Audio, Speech, and Music Processing, vol. 17, 20 pp., 2012.
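A simplified NumPy sketch of the multi-candidate idea under diagonal-covariance assumptions: candidate imputations come from a small set of cluster Gaussians and each backend Gaussian simply keeps its best-scoring candidate. The candidate-generation rule shown here (bounded cluster means) is an illustrative stand-in for the per-cluster MDT solutions described in the abstract.

```python
import numpy as np

def mc_mdt_loglik(obs, missing, cluster_means, gauss_mean, gauss_var):
    """Approximate the per-Gaussian MDT likelihood by picking the best candidate.

    obs: observed feature vector, missing: boolean mask of unreliable dims,
    cluster_means: list of cluster-Gaussian means used to generate candidates,
    gauss_mean / gauss_var: one diagonal backend Gaussian.
    """
    candidates = []
    for mu_c in cluster_means:
        # Impute unreliable dims from the cluster mean, capped by the noisy
        # observation (clean energy cannot exceed the observed energy).
        x = np.where(missing, np.minimum(mu_c, obs), obs)
        candidates.append(x)
    # Diagonal-Gaussian log-likelihood of each candidate; keep the best one.
    lls = [
        -0.5 * np.sum((x - gauss_mean) ** 2 / gauss_var
                      + np.log(2 * np.pi * gauss_var))
        for x in candidates
    ]
    return max(lls)
```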