Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems
In this paper, we present a series of complementary approaches to improving the
recognition of underrepresented named entities (NEs) in hybrid ASR systems
without compromising overall word error rate. The underrepresented words
correspond to rare or out-of-vocabulary (OOV) words in the training data and
therefore cannot be modeled reliably. We begin with a graphemic lexicon, which
removes the need for phonetic models in hybrid ASR; we study it under different
settings and demonstrate its effectiveness in dealing with underrepresented
NEs. Next, we study the impact of a neural language model (LM) with
letter-based features derived to handle infrequent words. After that, we enrich
the representations of underrepresented NEs in a pretrained neural LM by
borrowing the embedding representations of well-represented words, which yields
a significant improvement on underrepresented NE recognition. Finally, we boost
the likelihood scores of utterances containing NEs in the word lattices
rescored by neural LMs and gain a further improvement. The combination of these
approaches improves NE recognition by up to 42% relative.
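The embedding-borrowing step can be pictured with a small sketch. This is not the authors' implementation: the donor words, the plain averaging scheme, and the toy vectors below are all hypothetical. The idea is that an underrepresented NE inherits the embedding of well-represented words:

```python
import numpy as np

def borrow_embedding(rare_word, donor_words, embeddings):
    """Overwrite a rare word's unreliable embedding with the mean
    embedding of well-represented donor words (hypothetical scheme)."""
    donors = np.stack([embeddings[w] for w in donor_words])
    embeddings[rare_word] = donors.mean(axis=0)
    return embeddings[rare_word]

# Toy 4-dimensional embedding table (illustrative values only).
emb = {
    "london": np.array([0.9, 0.1, 0.0, 0.2]),
    "paris":  np.array([0.8, 0.2, 0.1, 0.1]),
    "tartu":  np.zeros(4),  # underrepresented NE, seen too rarely in training
}
borrow_embedding("tartu", ["london", "paris"], emb)
```

After borrowing, the rare NE sits in the same region of embedding space as the frequent place names, so the LM scores it more plausibly in similar contexts.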
Neurally-Guided Procedural Models: Amortized Inference for Procedural Graphics Programs using Neural Networks
Probabilistic inference algorithms such as Sequential Monte Carlo (SMC)
provide powerful tools for constraining procedural models in computer graphics,
but they require many samples to produce desirable results. In this paper, we
show how to create procedural models which learn how to satisfy constraints. We
augment procedural models with neural networks which control how the model
makes random choices based on the output it has generated thus far. We call
such models neurally-guided procedural models. As a pre-computation, we train
these models to maximize the likelihood of example outputs generated via SMC.
They are then used as efficient SMC importance samplers, generating
high-quality results with very few samples. We evaluate our method on
L-system-like models with image-based constraints. Given a desired quality
threshold, neurally-guided models can generate satisfactory results up to 10x
faster than unguided models.
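The core efficiency argument, that a learned proposal close to the target distribution yields useful samples far sooner than a broad prior, can be sketched with self-normalized importance sampling. The Gaussians below are hypothetical stand-ins for the procedural model's prior and its neural guide:

```python
import numpy as np

rng = np.random.default_rng(0)

def ess(target_logpdf, xs, proposal_logpdf):
    """Effective sample size of self-normalized importance weights:
    near n when the proposal matches the target, tiny when it doesn't."""
    logw = target_logpdf(xs) - proposal_logpdf(xs)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

target = lambda x: -0.5 * (x - 2.0) ** 2  # unnormalized N(2, 1)

# Broad, unguided proposal N(0, 5) vs. a "guided" proposal N(2, 1.2).
broad = rng.normal(0.0, 5.0, 1000)
guided = rng.normal(2.0, 1.2, 1000)
ess_broad = ess(target, broad, lambda x: -0.5 * (x / 5.0) ** 2)
ess_guided = ess(target, guided, lambda x: -0.5 * ((x - 2.0) / 1.2) ** 2)
```

With the same sample budget, the guided proposal keeps far more of its samples effective, which is exactly why a neurally-guided SMC sampler reaches a given quality threshold with fewer particles.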
Advanced Rich Transcription System for Estonian Speech
This paper describes the current TTÜ speech transcription system for
Estonian speech. The system is designed to handle semi-spontaneous speech, such
as broadcast conversations, lecture recordings and interviews recorded in
diverse acoustic conditions. The system is based on the Kaldi toolkit.
Multi-condition training using background noise profiles extracted
automatically from untranscribed data is used to improve the robustness of the
system. Out-of-vocabulary words are recovered using a phoneme n-gram based
decoding subgraph and a FST-based phoneme-to-grapheme model. The system
achieves a word error rate of 8.1% on a test set of broadcast conversations.
The system also performs punctuation recovery and speaker identification.
Speaker identification models are trained using a recently proposed weakly
supervised training method.
Comment: Published in Baltic HLT 2018 (posted to arXiv because Google Scholar
does not index it properly).
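The phoneme-to-grapheme stage can be pictured with a heavily simplified sketch. The real system composes a phoneme lattice with a weighted FST, but a greedy best path over per-phoneme rewrite probabilities conveys the idea; the probability table below is invented for illustration:

```python
# Hypothetical per-phoneme grapheme probabilities. A real P2G model is a
# weighted FST trained on a pronunciation lexicon; these numbers are made up.
P2G = {
    "k": [("k", 0.9), ("g", 0.1)],
    "a": [("a", 0.95), ("aa", 0.05)],
    "s": [("s", 1.0)],
}

def phonemes_to_graphemes(phones):
    """Greedy best path: pick the most likely grapheme for each phoneme."""
    return "".join(max(P2G[p], key=lambda kv: kv[1])[0] for p in phones)

# Recover a written form for an OOV word from its decoded phoneme string.
word = phonemes_to_graphemes(["k", "a", "s", "k"])
```

An FST additionally keeps alternative spellings alive with their weights instead of committing greedily per phoneme, which matters for languages with less transparent orthography.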
Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline
This paper describes a new baseline system for automatic speech recognition
(ASR) in the CHiME-4 challenge, intended to promote the development of noisy
ASR in the speech processing community by providing 1) a state-of-the-art
system whose simplified single design is comparable to the complicated top
systems in the challenge, and 2) a publicly available and reproducible recipe
in the main repository of the Kaldi speech recognition toolkit. The proposed
system adopts
generalized eigenvalue beamforming with bidirectional long short-term memory
(LSTM) mask estimation. We also propose a time delay neural network (TDNN)
acoustic model trained with the lattice-free version of the maximum mutual
information criterion (LF-MMI) on data augmented with all six microphone
channels plus the enhanced data after beamforming. Finally, we use an LSTM
language model for lattice and n-best re-scoring. The final system achieved
2.74% WER on the real test set in the
6-channel track, which corresponds to the 2nd place in the challenge. In
addition, the proposed baseline recipe includes four speech enhancement
measures for the simulation test set: the short-time objective intelligibility
measure (STOI), extended STOI (eSTOI), the perceptual evaluation of speech
quality (PESQ), and the speech distortion ratio (SDR). Thus, the recipe also
provides an experimental platform for speech enhancement studies with these
performance measures.
Comment: Submitted for Interspeech 201
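The generalized eigenvalue (max-SNR) beamformer at the heart of the front end solves, per frequency bin, an eigenproblem that maximizes the ratio of speech power to noise power. A minimal numpy sketch on a synthetic two-microphone covariance pair follows; a real system also needs the LSTM masks and per-bin processing, which are omitted here:

```python
import numpy as np

def gev_beamformer(phi_speech, phi_noise):
    """Return the filter w maximizing (w^H Phi_s w) / (w^H Phi_n w),
    i.e. the principal generalized eigenvector of (Phi_s, Phi_n)."""
    vals, vecs = np.linalg.eig(np.linalg.inv(phi_noise) @ phi_speech)
    w = vecs[:, np.argmax(vals.real)].real
    return w / np.linalg.norm(w)

# Toy 2-mic example: speech arrives in phase, interference out of phase.
d = np.array([1.0, 1.0]) / np.sqrt(2)    # speech steering direction
e = np.array([1.0, -1.0]) / np.sqrt(2)   # interference direction
phi_s = np.outer(d, d)                   # speech spatial covariance
phi_n = 0.1 * np.eye(2) + np.outer(e, e) # noise + interference covariance
w = gev_beamformer(phi_s, phi_n)
```

The recovered filter aligns with the speech direction and nulls the interference, which is the behavior the mask-estimated covariances are meant to induce on real multichannel audio.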
The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge
This paper describes the NTNU ASR system participating in the Interspeech
2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD
group of ISCA. This shared task is made especially challenging by the combined
variability of non-native and child speech characteristics. In
the setting of closed-track evaluation, all participants were restricted to
develop their systems merely based on the speech and text corpora provided by
the organizer. To cope with this low-resource constraint, we built our ASR
system on top of CNN-TDNNF-based acoustic models, meanwhile harnessing the
synergistic power of various data augmentation strategies, including both
utterance- and word-level speed perturbation and spectrogram augmentation,
alongside a simple yet effective data-cleansing approach. All variants of our
ASR system employed an RNN-based language model to rescore the first-pass
recognition hypotheses, which was trained solely on the text dataset released
by the organizer. Our best-configured system finished in second place with a
word error rate (WER) of 17.59%, while the top-performing, second runner-up,
and official baseline systems achieved 15.67%, 18.71%, and 35.09% WER,
respectively.
Comment: Submitted to Interspeech 2020 Special Session: Shared Task on
Automatic Speech Recognition for Non-Native Children's Speech
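The spectrogram-augmentation step can be sketched as SpecAugment-style time and frequency masking. This is a simplified stand-in: the mask widths and the single-mask-per-axis policy below are arbitrary choices, not the system's actual configuration:

```python
import numpy as np

def spec_augment(spec, max_f=8, max_t=10, rng=None):
    """Zero out one random frequency band and one random time span of a
    (freq x time) spectrogram, returning an augmented copy."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    f0 = int(rng.integers(0, spec.shape[0] - max_f))
    t0 = int(rng.integers(0, spec.shape[1] - max_t))
    out[f0:f0 + int(rng.integers(1, max_f + 1)), :] = 0.0  # frequency mask
    out[:, t0:t0 + int(rng.integers(1, max_t + 1))] = 0.0  # time mask
    return out

# 80 mel bins x 300 frames; ones stand in for real log-mel energies.
aug = spec_augment(np.ones((80, 300)), rng=np.random.default_rng(7))
```

Masking forces the acoustic model not to over-rely on any single band or span, which is especially useful when, as here, the training corpus is small.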
Generating Sequences With Recurrent Neural Networks
This paper shows how Long Short-term Memory recurrent neural networks can be
used to generate complex sequences with long-range structure, simply by
predicting one data point at a time. The approach is demonstrated for text
(where the data are discrete) and online handwriting (where the data are
real-valued). It is then extended to handwriting synthesis by allowing the
network to condition its predictions on a text sequence. The resulting system
is able to generate highly realistic cursive handwriting in a wide variety of
styles.
Comment: Thanks to Peng Liu and Sergey Zyrianov for various corrections
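The "one data point at a time" generation loop boils down to repeatedly sampling the next symbol from the network's predictive distribution. A minimal sketch of that sampling step with a temperature knob follows; the logits are placeholders for a trained LSTM's output:

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=None):
    """Draw the next symbol index from softmax(logits / temperature);
    low temperature approaches greedy argmax decoding."""
    rng = rng or np.random.default_rng()
    z = (logits - logits.max()) / temperature  # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# With a very low temperature, sampling collapses onto the argmax.
idx = sample_next(np.array([0.0, 5.0, 1.0]), temperature=0.05)
```

The same step drives both the text and handwriting experiments; for real-valued pen offsets the categorical draw is replaced by sampling from a predicted mixture density, but the autoregressive loop is identical.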
A new robust feature selection method using variance-based sensitivity analysis
Excluding irrelevant features in a pattern recognition task plays an important
role in keeping the machine learning model simple and the computation
efficient. With the rise of large-scale datasets, feature selection is in great
demand, as it becomes a central issue when facing high-dimensional data. The
present study provides a new measure of saliency
for features by employing a Sensitivity Analysis (SA) technique called the
extended Fourier amplitude sensitivity test, and a well-trained Feedforward
Neural Network (FNN) model, which ultimately leads to the selection of a
promising optimal feature subset. The ideas of the paper are demonstrated
mainly by adopting the FNN model for feature selection in classification
problems. Finally, a generalization framework is discussed to give insight into
its use in regression problems and to show how other function-approximation
models can be deployed. The effectiveness of the proposed
method is verified by result analysis and data visualization for a series of
experiments over several well-known datasets drawn from UCI machine learning
repository.
Comment: 9 pages, 4 figures
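A crude stand-in for the sensitivity-based saliency score, not the paper's extended-FAST method, is permutation sensitivity: resample one feature's values from its marginal and measure how much a trained model's output varies. The linear "model" below replaces the trained FNN purely for illustration:

```python
import numpy as np

def saliency(model, X, n_rounds=30, rng=None):
    """Score each feature by the mean squared output change when that
    feature's column is randomly permuted (one-at-a-time sensitivity)."""
    rng = rng or np.random.default_rng(0)
    base = model(X)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_rounds):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(X[:, j])
            scores[j] += np.mean((model(Xp) - base) ** 2)
    return scores / n_rounds

# Stand-in for a trained FNN: feature 0 matters most, feature 1 not at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
model = lambda X: X @ np.array([3.0, 0.0, 0.1])
ranking = np.argsort(saliency(model, X))[::-1]  # most salient first
```

Variance-based indices like extended FAST refine this idea by decomposing output variance over systematic input frequencies rather than random permutations, but the selection criterion, keep the features the output is sensitive to, is the same.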
CNNs-based Acoustic Scene Classification using Multi-Spectrogram Fusion and Label Expansions
Spectrograms such as the STFT spectrogram and the MFCC spectrogram have been
widely used in Convolutional Neural Network (CNN) based schemes for acoustic
scene classification. They have different time-frequency characteristics,
contributing to their own advantages and disadvantages in recognizing acoustic
scenes. In this letter, a novel multi-spectrogram fusion framework is proposed,
making the spectrograms complement each other. In the framework, a single CNN
architecture is applied to multiple spectrograms for feature extraction. The
deep features extracted from multiple spectrograms are then fused to
discriminate the acoustic scenes. Moreover, motivated by the inter-class
similarities in acoustic scene datasets, a label expansion method is further
proposed in which super-class labels are constructed upon the original classes.
With the help of the expanded labels, the CNN models are transformed into a
multitask learning form that improves acoustic scene classification by
appending the auxiliary task of super-class classification. To verify the
effectiveness of the proposed methods, extensive experiments have been
performed on the DCASE2017 and the LITIS Rouen datasets. Experimental results
show that the proposed method can achieve promising accuracies on both
datasets. Specifically, accuracies of 0.9744, 0.8865, and 0.7778 are obtained
on the LITIS Rouen dataset, the DCASE development set, and the DCASE evaluation
set, respectively.
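The label-expansion idea, grouping confusable scenes under super-classes and predicting both, can be sketched as follows. The super-class grouping below is invented; the paper derives its own groups from inter-class similarities:

```python
# Hypothetical super-class map built from inter-class similarities.
SUPER = {
    "bus": "vehicle", "car": "vehicle", "tram": "vehicle",
    "park": "outdoor", "beach": "outdoor",
    "office": "indoor", "home": "indoor",
}

def expand_labels(labels):
    """Attach a super-class target to each original label, yielding the
    (fine, coarse) pairs used as main and auxiliary multitask targets."""
    return [(c, SUPER[c]) for c in labels]

pairs = expand_labels(["bus", "park", "office"])
```

The CNN then carries two output heads, one per label granularity, and the coarse head's loss acts as a regularizer that pulls similar scenes toward shared features.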
Deep Learning over Multi-field Categorical Data: A Case Study on User Response Prediction
Predicting user responses, such as click-through rate and conversion rate, is
critical in many web applications, including web search, personalised
recommendation, and online advertising. Unlike the continuous raw features
usually found in the image and audio domains, the input features in the web
space are typically multi-field and mostly discrete and categorical, while
their dependencies are little known. Major user response prediction models have
to either limit themselves to linear models or require manually building up
high-order combination features. The former loses the ability to explore
feature interactions, while the latter incurs heavy computation in the large
feature space. To tackle this issue, we propose two novel models using
deep neural networks (DNNs) to automatically learn effective patterns from
categorical feature interactions and make predictions of users' ad clicks. To
make our DNNs work efficiently, we propose to leverage three feature
transformation methods, i.e., factorisation machines (FMs), restricted
Boltzmann machines (RBMs) and denoising auto-encoders (DAEs). This paper
presents the structure of our models and their efficient training algorithms.
Large-scale experiments with real-world data demonstrate that our methods
outperform major state-of-the-art models.
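Of the three feature transformations, the factorization machine is the most self-contained to sketch: second-order interactions between sparse categorical features are computed through factorized weights using the standard O(kd) identity. The toy weights below are illustrative, not learned:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM: w0 + <w, x> + sum_{i<j} <v_i, v_j> x_i x_j, with
    the pairwise term computed as 0.5 * sum_f [(Vx)_f^2 - (V^2 x^2)_f]."""
    pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return w0 + x @ w + pairwise

# Two active one-hot features with 1-dimensional latent factors 1.0 and 2.0:
# the interaction term is <v_0, v_1> * x_0 * x_1 = 2.0.
x = np.array([1.0, 1.0])
V = np.array([[1.0], [2.0]])
y = fm_predict(x, 0.0, np.zeros(2), V)
```

Because every feature gets a dense latent vector, the FM (like the RBM and DAE) turns sparse multi-field input into the continuous representation the downstream DNN layers need.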
What's Going On in Neural Constituency Parsers? An Analysis
A number of differences have emerged between modern and classic approaches to
constituency parsing in recent years, with structural components like grammars
and feature-rich lexicons becoming less central while recurrent neural network
representations rise in popularity. The goal of this work is to analyze the
extent to which information provided directly by the model structure in
classical systems is still being captured by neural methods. To this end, we
propose a high-performance neural model (92.08 F1 on PTB) that is
representative of recent work and perform a series of investigative
experiments. We find that our model implicitly learns to encode much of the
same information that was explicitly provided by grammars and lexicons in the
past, indicating that this scaffolding can largely be subsumed by powerful
general-purpose neural machinery.
Comment: NAACL 201