Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation

Authors: Dredze Mark; Karakos Damianos; Khudanpur Sanjeev
Publication date: 20/03/2013

Human language is a combination of elemental languages/domains/styles that change across and sometimes within discourses. Language models, which play a crucial role in speech recognizers and machine translation systems, are particularly sensitive to such changes, unless some form of adaptation takes place. One approach to speech language model adaptation is self-training, in which a language model's parameters are tuned based on automatically transcribed audio. However, transcription errors can misguide self-training, particularly in challenging settings such as conversational speech. In this work, we propose a model that considers the confusions (errors) of the ASR channel. By modeling the likely confusions in the ASR output instead of using just the 1-best, we improve self-training efficacy by obtaining a more reliable reference transcription estimate. We demonstrate improved topic-based language modeling adaptation results over both 1-best and lattice self-training using our ASR channel confusion estimates on telephone conversations.

Comment: Technical Report 8, Human Language Technology Center of Excellence, Johns Hopkins University

arXiv.org e-Print Archive

An Empirical Study on Language Model Adaptation Using a Metric of Domain Similarity

Authors: Hisami Suzuki; Jianfeng Gao; Wei Yuan
Publication date: 01/01/2005

This paper presents an empirical study on four techniques of language model adaptation, including a maximum a posteriori (MAP) method and three discriminative training models, in the application of Japanese Kana-Kanji conversion. We compare the performance of these methods from various angles by adapting the baseline model to four adaptation domains. In particular, we attempt to interpret the results given in terms of the character error rate (CER) by correlating them with the characteristics of the adaptation domain measured using the information-theoretic notion of cross entropy. We show that such a metric correlates well with the CER performance of the adaptation methods, and also show that the discriminative methods are not only superior to a MAP-based method in terms of achieving larger CER reduction, but are also more robust against the similarity of background and adaptation domains.

CiteSeerX
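
The cross-entropy measure of domain similarity mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a unigram model with add-one smoothing estimated on background text, and measures the cross entropy (in bits per token) of adaptation text under that model. The function names and toy sentences are hypothetical, and this is not the authors' implementation.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(model, tokens):
    """Average negative log2 probability of tokens under the model (bits/token)."""
    return -sum(math.log2(model[w]) for w in tokens) / len(tokens)


# Toy data (assumption): one background domain and two candidate adaptation domains.
background = "the cat sat on the mat".split()
similar = "the cat sat on the cat".split()
different = "stock prices fell on monday".split()

vocab = set(background) | set(similar) | set(different)
model = unigram_model(background, vocab)

h_similar = cross_entropy(model, similar)
h_different = cross_entropy(model, different)
print(h_similar, h_different)
```

A lower cross entropy indicates that the adaptation text looks more like the background text under the model, which is the sense in which the measure acts as a similarity metric here.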

Approximation Lasso Methods for Language Modeling

Authors: Bin Yu; Jianfeng Gao
Publication date: 01/01/2006

Lasso is a regularization method for parameter estimation in linear models. It optimizes the model parameters with respect to a loss function subject to model complexities. This paper explores the use of lasso for statistical language modeling for text input. Owing to the very large number of parameters, directly optimizing the penalized lasso loss function is impossible. Therefore, we investigate two approximation methods, the boosted lasso (BLasso) and the forward stagewise linear regression (FSLR). Both methods, when used with the exponential loss function, bear strong resemblance to the boosting algorithm which has been used as a discriminative training method for language modeling. Evaluations on the task of Japanese text input show that BLasso is able to produce the best approximation to the lasso solution, and leads to a significant improvement, in terms of character error rate, over boosting and the traditional maximum likelihood estimation.

CiteSeerX
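
Forward stagewise linear regression, one of the two approximation methods named in the abstract above, can be sketched in a few lines. The version below is a generic textbook form using squared loss rather than the exponential loss the abstract mentions, with an assumed step size and iteration count and made-up toy data; it is an illustration only, not the authors' language-modeling setup.

```python
def forward_stagewise(X, y, step=0.01, n_iters=2000):
    """Forward stagewise linear regression with squared loss.

    X is a list of rows (lists of floats), y is a list of targets.
    At each iteration, the coefficient of the feature whose column has the
    largest absolute inner product with the residual is nudged by +/- step.
    """
    n_features = len(X[0])
    coef = [0.0] * n_features
    residual = list(y)
    for _ in range(n_iters):
        # Inner product of each feature column with the current residual.
        corr = [sum(row[j] * r for row, r in zip(X, residual))
                for j in range(n_features)]
        j = max(range(n_features), key=lambda k: abs(corr[k]))
        if corr[j] == 0.0:
            break
        delta = step if corr[j] > 0 else -step
        coef[j] += delta
        residual = [r - delta * row[j] for row, r in zip(X, residual)]
    return coef


# Toy data (assumption): y depends on feature 0 only; feature 1 is irrelevant.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
coef = forward_stagewise(X, y)
print(coef)
```

Because each update is a small fixed step, coefficients of irrelevant features tend to stay near zero, which is why this procedure is often discussed as an approximation to the lasso path.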

Speaker recognition by means of restricted Boltzmann machine adaptation

Authors: Ghahabi Esfahani Omid; Hernando Pericás Francisco Javier; Safari Pooyan
Publication venue: Servicio de Publicaciones de la Universidad Autonoma de Madrid
Publication date: 01/01/2016

Restricted Boltzmann Machines (RBMs) have shown success in speaker recognition. In this paper, RBMs are investigated in a framework comprising a universal model training and model adaptation. Taking advantage of the RBM unsupervised learning algorithm, a global model is trained based on all available background data. This general speaker-independent model, referred to as URBM, is further adapted to the data of a specific speaker to build a speaker-dependent model. In order to show its effectiveness, we have applied this framework to two different tasks. It has been used to discriminatively model target and impostor spectral features for classification. It has also been utilized to produce a vector-based representation for speakers. This vector-based representation, similar to i-vector, can be further used for speaker recognition using either cosine scoring or Probabilistic Linear Discriminant Analysis (PLDA). The evaluation is performed on the core test condition of the NIST SRE 2006 database.

Peer reviewed. Postprint (author's final draft)
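
The cosine scoring option mentioned in the abstract above compares two speaker vectors by the cosine of the angle between them. The sketch below assumes tiny hand-written vectors and an arbitrary decision threshold, both hypothetical; real systems use high-dimensional vectors and tune the threshold on held-out trials.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical low-dimensional speaker vectors, for illustration only.
enrolled = [0.9, 0.1, 0.3]
same_speaker_test = [0.8, 0.2, 0.35]
other_speaker_test = [-0.2, 0.9, -0.4]

threshold = 0.5  # assumed decision threshold, not taken from the paper
for name, vec in [("same", same_speaker_test), ("other", other_speaker_test)]:
    score = cosine_score(enrolled, vec)
    print(name, round(score, 3), "accept" if score >= threshold else "reject")
```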

Video Summarization with Long Short-term Memory

Authors: Chao Wei-Lun; Grauman Kristen; Sha Fei; Zhang Ke
Publication date: 29/07/2016

We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots. Casting the problem as a structured prediction problem on sequential data, our main idea is to use Long Short-Term Memory (LSTM), a special type of recurrent neural network, to model the variable-range dependencies entailed in the task of video summarization. Our learning models attain the state-of-the-art results on two benchmark video datasets. Detailed analysis justifies the design of the models. In particular, we show that it is crucial to take into consideration the sequential structures in videos and model them. Besides advances in modeling techniques, we introduce techniques to address the need for a large amount of annotated data for training complex learning models. There, our main idea is to exploit the existence of auxiliary annotated video datasets, albeit heterogeneous in visual styles and contents. Specifically, we show domain adaptation techniques can improve summarization by reducing the discrepancies in statistical properties across those datasets.

Comment: To appear in ECCV 201

arXiv.org e-Print Archive

Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification

Authors: Dabouei Ali; Dawson Jeremy; Iranmanesh Seyed Mehdi; Kazemi Hadi; Nasrabadi Nasser M.; Soleymani Sobhan
Publication date: 31/07/2018

In this paper, a novel cross-device text-independent speaker verification architecture is proposed. The majority of the state-of-the-art deep architectures that are used for speaker verification tasks consider Mel-frequency cepstral coefficients. In contrast, our proposed Siamese convolutional neural network architecture uses Mel-frequency spectrogram coefficients to benefit from the dependency of the adjacent spectro-temporal features. Moreover, although spectro-temporal features have proved to be highly reliable in speaker verification models, they only represent some aspects of short-term acoustic level traits of the speaker's voice. However, the human voice consists of several linguistic levels such as acoustic, lexicon, prosody, and phonetics, that can be utilized in speaker verification models. To compensate for these inherited shortcomings in spectro-temporal features, we propose to enhance the proposed Siamese convolutional neural network architecture by deploying a multilayer perceptron network to incorporate the prosodic, jitter, and shimmer features. The proposed end-to-end verification architecture performs feature extraction and verification simultaneously. This proposed architecture displays significant improvement over classical signal processing approaches and deep algorithms for forensic cross-device speaker verification.

Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS 2018)

arXiv.org e-Print Archive

Towards Reduced Reference Parametric Models for Estimating Audiovisual Quality in Multimedia Services

Authors: Demirbilek Edip; Grégoire Jean-Charles
Publication date: 25/04/2016

We have developed reduced reference parametric models for estimating perceived quality in audiovisual multimedia services. We have created 144 unique configurations for audiovisual content including various application and network parameters such as bitrates and distortions in terms of bandwidth, packet loss rate and jitter. To generate the data needed for model training and validation we have tasked 24 subjects, in a controlled environment, to rate the overall audiovisual quality on the absolute category rating (ACR) 5-level quality scale. We have developed models using Random Forest and Neural Network based machine learning methods in order to estimate Mean Opinion Score (MOS) values. We have used information retrieved from the packet headers and side information provided as network parameters for model training. Random Forest based models have performed better in terms of Root Mean Square Error (RMSE) and Pearson correlation coefficient. The side information proved to be very effective in developing the model. We have found that, while the model performance might be improved by replacing the side information with more accurate bit stream level measurements, the models perform well in estimating perceived quality in audiovisual multimedia services.

Comment: Accepted to ICC 201

arXiv.org e-Print Archive
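
The two evaluation measures named in the abstract above, RMSE and the Pearson correlation coefficient, can be computed as in the following minimal sketch. The MOS values are made up for illustration and are not data from the study.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up MOS values on the 1-5 ACR scale (assumption, not study data).
observed_mos = [1.5, 2.0, 3.0, 4.0, 4.5]
predicted_mos = [1.7, 2.2, 2.8, 3.9, 4.4]
print(rmse(predicted_mos, observed_mos), pearson(predicted_mos, observed_mos))
```

A lower RMSE and a Pearson coefficient closer to 1 both indicate that predicted scores track the subjective ratings more closely.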

Tile2Vec: Unsupervised representation learning for spatially distributed data

Authors: Azzari George; Ermon Stefano; Jean Neal; Lobell David; Samar Anshul; Wang Sherrie
Publication date: 30/05/2018

Geospatial analysis lacks methods like the word vector representations and pre-trained networks that significantly boost performance across a wide range of natural language and computer vision tasks. To fill this gap, we introduce Tile2Vec, an unsupervised representation learning algorithm that extends the distributional hypothesis from natural language -- words appearing in similar contexts tend to have similar meanings -- to spatially distributed data. We demonstrate empirically that Tile2Vec learns semantically meaningful representations on three datasets. Our learned representations significantly improve performance in downstream classification tasks and, similar to word vectors, visual analogies can be obtained via simple arithmetic in the latent space.

Comment: 8 pages, 4 figures in main text; 9 pages, 11 figures in appendix

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Yes, we GAN: Applying Adversarial Techniques for Autonomous Driving

Authors: Denny Patrick; Hurych David; Krizek Pavel; Sobh Ibrahim; Uricar Michal; Yogamani Senthil
Publication venue: Society for Imaging Science & Technology
Publication date: 02/02/2020

Generative Adversarial Networks (GAN) have gained a lot of popularity from their introduction in 2014 to the present. Research on GAN is rapidly growing and there are many variants of the original GAN focusing on various aspects of deep learning. GAN are perceived as the most impactful direction of machine learning in the last decade. This paper focuses on the application of GAN in autonomous driving including topics such as advanced data augmentation, loss function learning, semi-supervised learning, etc. We formalize and review key applications of adversarial techniques and discuss challenges and open problems to be addressed.

Comment: Accepted for publication in Electronic Imaging, Autonomous Vehicles and Machines 2019. arXiv admin note: text overlap with arXiv:1606.05908 by other authors

arXiv.org e-Print Archive

Very Deep Convolutional Neural Networks for Robust Speech Recognition

Authors: Qian Yanmin; Woodland Philip C
Publication date: 02/10/2016

This paper describes the extension and optimization of our previous work on very deep convolutional neural networks (CNNs) for effective recognition of noisy speech in the Aurora 4 task. The appropriate number of convolutional layers, the sizes of the filters, pooling operations and input feature maps are all modified: the filter and pooling sizes are reduced and dimensions of input feature maps are extended to allow adding more convolutional layers. Furthermore, appropriate input padding and input feature map selection strategies are developed. In addition, an adaptation framework using joint training of very deep CNN with auxiliary i-vector and fMLLR features is developed. These modifications give substantial word error rate reductions over the standard CNN used as baseline. Finally, the very deep CNN is combined with an LSTM-RNN acoustic model and it is shown that state-level weighted log likelihood score combination in a joint acoustic model decoding scheme is very effective. On the Aurora 4 task, the very deep CNN achieves a WER of 8.81%, which falls to 7.99% with auxiliary feature joint training and to 7.09% with LSTM-RNN joint decoding.

Comment: accepted by SLT 201

arXiv.org e-Print Archive
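
The state-level weighted log likelihood score combination mentioned in the abstract above can be pictured as a per-state linear interpolation of two acoustic models' scores. The sketch below assumes a single frame, three states, invented scores, and an interpolation weight of 0.5; none of these values come from the paper, and the function name is hypothetical.

```python
def combine_log_likelihoods(ll_a, ll_b, weight):
    """Weighted per-state combination of two models' log likelihoods.

    ll_a and ll_b hold one log likelihood per state for a single frame;
    weight is the interpolation weight given to the first model.
    """
    if len(ll_a) != len(ll_b):
        raise ValueError("both models must score the same set of states")
    return [weight * a + (1.0 - weight) * b for a, b in zip(ll_a, ll_b)]


# Hypothetical log likelihoods for three states in one frame.
cnn_scores = [-1.2, -0.4, -3.0]
lstm_scores = [-0.9, -1.1, -2.5]
combined = combine_log_likelihoods(cnn_scores, lstm_scores, weight=0.5)

# Index of the state with the highest combined score.
best_state = max(range(len(combined)), key=lambda s: combined[s])
print(combined, best_state)
```

In a decoder the combined scores would replace the single-model scores for every frame, with the weight chosen on development data.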