6,512 research outputs found
Variable Word Rate N-grams
The rate of occurrence of words is not uniform but varies from document to
document. Despite this observation, parameters for conventional n-gram language
models are usually derived using the assumption of a constant word rate. In
this paper we investigate the use of variable word rate assumption, modelled by
a Poisson distribution or a continuous mixture of Poissons. We present an
approach to estimating the relative frequencies of words or n-grams taking
prior information of their occurrences into account. Discounting and smoothing
schemes are also considered. Using the Broadcast News task, the approach
demonstrates a reduction of perplexity up to 10%.Comment: 4 pages, 4 figures, ICASSP-200
Topic-based mixture language modelling
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling.
A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost
Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling
Fixed-vocabulary language models fail to account for one of the most
characteristic statistical facts of natural language: the frequent creation and
reuse of new word types. Although character-level language models offer a
partial solution in that they can create word types not attested in the
training corpus, they do not capture the "bursty" distribution of such words.
In this paper, we augment a hierarchical LSTM language model that generates
sequences of word tokens character by character with a caching mechanism that
learns to reuse previously generated words. To validate our model we construct
a new open-vocabulary language modeling corpus (the Multilingual Wikipedia
Corpus, MWC) from comparable Wikipedia articles in 7 typologically diverse
languages and demonstrate the effectiveness of our model across this range of
languages.Comment: ACL 201
- …