Topic-based mixture language modelling
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling.
A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost.
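The idea of a mixture of topic-specific language models, evaluated by test set perplexity, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the cluster assignments are assumed to be given (the paper derives them by unsupervised clustering), the component models are add-one-smoothed unigrams, and the EM re-estimation of mixture weights is only one plausible reading of the "adaptive procedure". All names (`train_unigram`, `em_weights`, the toy clusters) are hypothetical.

```python
import math
from collections import Counter

def train_unigram(docs, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_prob(word, models, weights):
    """P(word) under a linear interpolation of the topic models."""
    return sum(lam * m[word] for lam, m in zip(weights, models))

def perplexity(text, models, weights):
    """Perplexity of a word sequence under the mixture."""
    log_sum = sum(math.log(mixture_prob(w, models, weights)) for w in text)
    return math.exp(-log_sum / len(text))

def em_weights(text, models, iters=20):
    """Re-estimate mixture weights on `text` with EM (assumed adaptation step)."""
    k = len(models)
    weights = [1.0 / k] * k
    for _ in range(iters):
        resp = [0.0] * k
        for w in text:
            denom = mixture_prob(w, models, weights)
            for i, m in enumerate(models):
                resp[i] += weights[i] * m[w] / denom
        weights = [r / len(text) for r in resp]
    return weights

# Toy clusters standing in for automatically derived topics (hypothetical data).
clusters = [
    [["goal", "match", "team"], ["team", "goal", "score"]],
    [["bank", "market", "price"], ["price", "bank", "rate"]],
]
vocab = sorted({w for docs in clusters for d in docs for w in d})
models = [train_unigram(docs, vocab) for docs in clusters]

test_text = ["team", "goal", "match", "score"]
uniform = [0.5, 0.5]
adapted = em_weights(test_text, models)
pp_uniform = perplexity(test_text, models, uniform)
pp_adapted = perplexity(test_text, models, adapted)
```

On this toy text the adapted weights shift toward the first cluster, and the adapted perplexity cannot exceed the uniform-weight perplexity because EM starts from the uniform weights and never decreases the likelihood.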
Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation
Neural language models (NLMs) achieve strong generalization capability by
learning dense representations of words and using them to estimate
probability distributions. However, learning the representations of rare
words is a challenging problem that causes the NLM to produce unreliable
probability estimates. To address this problem, we propose a method to enrich
the representations of rare words in a pre-trained NLM and consequently improve
its probability estimation performance. The proposed method augments the word
embedding matrices of the pre-trained NLM while keeping other parameters unchanged.
Specifically, our method updates the embedding vectors of rare words using
embedding vectors of other semantically and syntactically similar words. To
evaluate the proposed method, we enrich the rare street names in the
pre-trained NLM and use it to rescore the 100-best hypotheses output by a
Singapore English speech recognition system. The enriched NLM reduces the word
error rate by 6% relative and improves the recognition accuracy of the rare
words by 16% absolute as compared to the baseline NLM.
Comment: 5 pages, 2 figures, accepted to INTERSPEECH 201
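The embedding update described in this abstract might look like the sketch below. The abstract does not state the exact update rule, so a plain average of the rare word's vector with the vectors of its similar words is assumed here; the function name, the word labels, and the toy vectors are all hypothetical, and the selection of similar words is taken as given.

```python
def enrich_embeddings(embeddings, rare_word, similar_words):
    """Return a copy of `embeddings` with the rare word's vector replaced.

    Assumed rule: average the rare word's own vector with the vectors of
    the given similar words. All other entries are left unchanged, which
    mirrors keeping the remaining model parameters fixed.
    """
    vecs = [embeddings[rare_word]] + [embeddings[w] for w in similar_words]
    dim = len(vecs[0])
    new_vec = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    enriched = dict(embeddings)
    enriched[rare_word] = new_vec
    return enriched

# Toy 2-dimensional embedding table (hypothetical values).
embeddings = {
    "rare_street": [0.0, 0.0],
    "street_a": [1.0, 2.0],
    "street_b": [2.0, 4.0],
    "the": [9.0, 9.0],
}
enriched = enrich_embeddings(embeddings, "rare_street", ["street_a", "street_b"])
```

Only the row for the rare word changes, so the same idea could be applied to both the input and output embedding matrices of a trained model without retraining.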
The Microsoft 2017 Conversational Speech Recognition System
We describe the 2017 version of Microsoft's conversational speech recognition
system, in which we update our 2016 system with recent developments in
neural-network-based acoustic and language modeling to further advance the
state of the art on the Switchboard speech recognition task. The system adds a
CNN-BLSTM acoustic model to the set of model architectures we combined
previously, and includes character-based and dialog session aware LSTM language
models in rescoring. For system combination we adopt a two-stage approach,
whereby subsets of acoustic models are first combined at the senone/frame
level, followed by word-level voting via confusion networks. We also added a
confusion network rescoring step after system combination. The resulting system
yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
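The word-level voting stage can be illustrated with a much simplified sketch. It assumes the system outputs are already aligned slot by slot (real confusion network combination builds that alignment and uses word posteriors), and it uses an empty string for a slot where a system emits no word. The function name, the optional weights, and the example hypotheses are hypothetical.

```python
from collections import defaultdict

def vote(aligned_hyps, weights=None):
    """Weighted per-slot vote over pre-aligned hypotheses of equal length."""
    if weights is None:
        weights = [1.0] * len(aligned_hyps)
    result = []
    for slot in zip(*aligned_hyps):
        scores = defaultdict(float)
        for word, weight in zip(slot, weights):
            scores[word] += weight
        best = max(scores, key=scores.get)
        if best:  # skip slots where the empty (deleted) word wins
            result.append(best)
    return result

# Three hypothetical system outputs, already aligned.
hyps = [
    ["i", "see", "the", "cat"],
    ["i", "sea", "the", "cat"],
    ["i", "see", "a", "cat"],
]
combined = vote(hyps)
```

Per-system weights let a stronger system count for more in each slot, which is one simple way to reflect differing model quality.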