2,934 research outputs found
Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation
Human language is a combination of elemental languages/domains/styles that
change across and sometimes within discourses. Language models, which play a
crucial role in speech recognizers and machine translation systems, are
particularly sensitive to such changes, unless some form of adaptation takes
place. One approach to speech language model adaptation is self-training, in
which a language model's parameters are tuned based on automatically
transcribed audio. However, transcription errors can misguide self-training,
particularly in challenging settings such as conversational speech. In this
work, we propose a model that considers the confusions (errors) of the ASR
channel. By modeling the likely confusions in the ASR output instead of using
just the 1-best, we improve self-training efficacy by obtaining a more reliable
reference transcription estimate. We demonstrate improved topic-based language
modeling adaptation results over both 1-best and lattice self-training using
our ASR channel confusion estimates on telephone conversations.Comment: Technical Report 8, Human Language Technology Center of Excellence,
Johns Hopkins Universit
An Empirical Study on Language Model Adaptation Using a Metric of Domain Similarity
Abstract. This paper presents an empirical study on four techniques of language model adaptation, including a maximum a posteriori (MAP) method and three discriminative training models, in the application of Japanese Kana-Kanji conversion. We compare the performance of these methods from various angles by adapting the baseline model to four adaptation domains. In particular, we attempt to interpret the results given in terms of the character error rate (CER) by correlating them with the characteristics of the adaptation domain measured using the information-theoretic notion of cross entropy. We show that such a metric correlates well with the CER performance of the adaptation methods, and also show that the discriminative methods are not only superior to a MAP-based method in terms of achieving larger CER reduction, but are also more robust against the similarity of background and adaptation domains.
Approximation Lasso Methods for Language Modeling
Lasso is a regularization method for parameter estimation in linear models. It optimizes the model parameters with respect to a loss function subject to model complexities. This paper explores the use of lasso for statistical language modeling for text input. Owing to the very large number of parameters, directly optimizing the penalized lasso loss function is impossible. Therefore, we investigate two approximation methods, the boosted lasso (BLasso) and the forward stagewise linear regression (FSLR). Both methods, when used with the exponential loss function, bear strong resemblance to the boosting algorithm which has been used as a discriminative training method for language modeling. Evaluations on the task of Japanese text input show that BLasso is able to produce the best approximation to the lasso solution, and leads to a significant improvement, in terms of character error rate, over boosting and the traditional maximum likelihood estimation.
Speaker recognition by means of restricted Boltzmann machine adaptation
Restricted Boltzmann Machines (RBMs) have shown success in speaker recognition. In this paper, RBMs are investigated in a framework comprising a universal model training and model adaptation. Taking advantage of RBM unsupervised learning algorithm, a global model is trained based on all available background data. This general speaker-independent model, referred to as URBM, is further adapted to the data of a specific speaker to build speaker-dependent model. In order to show its effectiveness, we have applied this framework to two different tasks. It has been used to discriminatively model target and impostor spectral features for classification. It has been also utilized to produce a vector-based representation for speakers. This vector-based representation, similar to i-vector, can be further used for speaker recognition using either cosine scoring or Probabilistic Linear Discriminant Analysis (PLDA). The evaluation is performed on the core test condition of the NIST SRE 2006 database.Peer ReviewedPostprint (author's final draft
Video Summarization with Long Short-term Memory
We propose a novel supervised learning technique for summarizing videos by
automatically selecting keyframes or key subshots. Casting the problem as a
structured prediction problem on sequential data, our main idea is to use Long
Short-Term Memory (LSTM), a special type of recurrent neural networks to model
the variable-range dependencies entailed in the task of video summarization.
Our learning models attain the state-of-the-art results on two benchmark video
datasets. Detailed analysis justifies the design of the models. In particular,
we show that it is crucial to take into consideration the sequential structures
in videos and model them. Besides advances in modeling techniques, we introduce
techniques to address the need of a large number of annotated data for training
complex learning models. There, our main idea is to exploit the existence of
auxiliary annotated video datasets, albeit heterogeneous in visual styles and
contents. Specifically, we show domain adaptation techniques can improve
summarization by reducing the discrepancies in statistical properties across
those datasets.Comment: To appear in ECCV 201
Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification
In this paper a novel cross-device text-independent speaker verification
architecture is proposed. Majority of the state-of-the-art deep architectures
that are used for speaker verification tasks consider Mel-frequency cepstral
coefficients. In contrast, our proposed Siamese convolutional neural network
architecture uses Mel-frequency spectrogram coefficients to benefit from the
dependency of the adjacent spectro-temporal features. Moreover, although
spectro-temporal features have proved to be highly reliable in speaker
verification models, they only represent some aspects of short-term acoustic
level traits of the speaker's voice. However, the human voice consists of
several linguistic levels such as acoustic, lexicon, prosody, and phonetics,
that can be utilized in speaker verification models. To compensate for these
inherited shortcomings in spectro-temporal features, we propose to enhance the
proposed Siamese convolutional neural network architecture by deploying a
multilayer perceptron network to incorporate the prosodic, jitter, and shimmer
features. The proposed end-to-end verification architecture performs feature
extraction and verification simultaneously. This proposed architecture displays
significant improvement over classical signal processing approaches and deep
algorithms for forensic cross-device speaker verification.Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory,
Applications, and Systems (BTAS 2018
Towards Reduced Reference Parametric Models for Estimating Audiovisual Quality in Multimedia Services
We have developed reduced reference parametric models for estimating
perceived quality in audiovisual multimedia services. We have created 144
unique configurations for audiovisual content including various application and
network parameters such as bitrates and distortions in terms of bandwidth,
packet loss rate and jitter. To generate the data needed for model training and
validation we have tasked 24 subjects, in a controlled environment, to rate the
overall audiovisual quality on the absolute category rating (ACR) 5-level
quality scale. We have developed models using Random Forest and Neural Network
based machine learning methods in order to estimate Mean Opinion Scores (MOS)
values. We have used information retrieved from the packet headers and side
information provided as network parameters for model training. Random Forest
based models have performed better in terms of Root Mean Square Error (RMSE)
and Pearson correlation coefficient. The side information proved to be very
effective in developing the model. We have found that, while the model
performance might be improved by replacing the side information with more
accurate bit stream level measurements, they are performing well in estimating
perceived quality in audiovisual multimedia services.Comment: Accepted to ICC 201
Tile2Vec: Unsupervised representation learning for spatially distributed data
Geospatial analysis lacks methods like the word vector representations and
pre-trained networks that significantly boost performance across a wide range
of natural language and computer vision tasks. To fill this gap, we introduce
Tile2Vec, an unsupervised representation learning algorithm that extends the
distributional hypothesis from natural language -- words appearing in similar
contexts tend to have similar meanings -- to spatially distributed data. We
demonstrate empirically that Tile2Vec learns semantically meaningful
representations on three datasets. Our learned representations significantly
improve performance in downstream classification tasks and, similar to word
vectors, visual analogies can be obtained via simple arithmetic in the latent
space.Comment: 8 pages, 4 figures in main text; 9 pages, 11 figures in appendi
Yes, we GAN: Applying Adversarial Techniques for Autonomous Driving
Generative Adversarial Networks (GAN) have gained a lot of popularity from
their introduction in 2014 till present. Research on GAN is rapidly growing and
there are many variants of the original GAN focusing on various aspects of deep
learning. GAN are perceived as the most impactful direction of machine learning
in the last decade. This paper focuses on the application of GAN in autonomous
driving including topics such as advanced data augmentation, loss function
learning, semi-supervised learning, etc. We formalize and review key
applications of adversarial techniques and discuss challenges and open problems
to be addressed.Comment: Accepted for publication in Electronic Imaging, Autonomous Vehicles
and Machines 2019. arXiv admin note: text overlap with arXiv:1606.05908 by
other author
Very Deep Convolutional Neural Networks for Robust Speech Recognition
This paper describes the extension and optimization of our previous work on
very deep convolutional neural networks (CNNs) for effective recognition of
noisy speech in the Aurora 4 task. The appropriate number of convolutional
layers, the sizes of the filters, pooling operations and input feature maps are
all modified: the filter and pooling sizes are reduced and dimensions of input
feature maps are extended to allow adding more convolutional layers.
Furthermore appropriate input padding and input feature map selection
strategies are developed. In addition, an adaptation framework using joint
training of very deep CNN with auxiliary features i-vector and fMLLR features
is developed. These modifications give substantial word error rate reductions
over the standard CNN used as baseline. Finally the very deep CNN is combined
with an LSTM-RNN acoustic model and it is shown that state-level weighted log
likelihood score combination in a joint acoustic model decoding scheme is very
effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%,
further 7.99% with auxiliary feature joint training, and 7.09% with LSTM-RNN
joint decoding.Comment: accepted by SLT 201
- …