Producing power-law distributions and damping word frequencies with two-stage language models
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes, the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process, that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology.
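The two-stage idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the uniform generator over a hypothetical 1000-word vocabulary, and the parameter values `a=0.5`, `b=1.0` are all assumptions made for the example. The adaptor uses the standard seating rule of the two-parameter (Pitman-Yor) Chinese restaurant process, with discount `a` and concentration `b`.

```python
import random


def uniform_generator(rng):
    """Hypothetical first stage: draw a word ID uniformly from 1000 types."""
    return rng.randrange(1000)


def two_stage_sample(generator, n_tokens, a=0.5, b=1.0, seed=0):
    """Sample tokens from a generator wrapped in a Pitman-Yor CRP adaptor.

    a is the discount (0 <= a < 1) and b the concentration (b > 0 here).
    Returns the tokens, the label of each table, and the table sizes.
    """
    rng = random.Random(seed)
    labels = []   # word assigned to each table, drawn from the generator
    counts = []   # number of customers (tokens) seated at each table
    tokens = []
    for n in range(n_tokens):
        k = len(counts)
        # A new table is opened with probability (b + a*k) / (n + b).
        if rng.random() < (b + a * k) / (n + b):
            labels.append(generator(rng))
            counts.append(1)
            tokens.append(labels[-1])
        else:
            # Otherwise join table j with probability proportional to counts[j] - a.
            weights = [c - a for c in counts]
            j = rng.choices(range(k), weights=weights)[0]
            counts[j] += 1
            tokens.append(labels[j])
    return tokens, labels, counts


tokens, labels, counts = two_stage_sample(uniform_generator, 1000)
print(len(counts), "tables; largest table sizes:", sorted(counts, reverse=True)[:5])
```

Although the generator is uniform, the rich-get-richer seating rule concentrates most tokens on a few tables, which is the reshaping of frequencies that the abstract attributes to the adaptor stage.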
Large-alphabet sequence modelling - a comparative study
Most raw data is not binary, but over some often large and structured alphabet. Sometimes it is convenient to deal with a binarised data sequence, but exploiting the original structure of the data typically improves performance significantly in many practical applications. In this thesis, we study Martin-Löf random sequences that are maximally incompressible and provide a topological view on the size of the set of random sequences. We also investigate the relationship between binary data compression techniques and modelling natural language text, with the latter using raw unbinarised data sequences from a large alphabet. We perform an experimental comparative study of them, including an empirical comparison between Kneser-Ney (KN) variants with the regular Context Tree Weighting algorithm (CTW) and phase CTW, and with large-alphabet CTW with different estimators. We also apply the idea of Hutter's adaptive sparse Dirichlet-multinomial coding to the KN method and provide a heuristic to make the discounting parameter adaptive. The KN with this adaptive discounting parameter outperforms the traditional KN method on the Large Calgary corpus.
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, which have emerged as the de facto choice for language modeling
in both academia and industry thanks to their relatively low perplexity.
Estimating such models from large textual sources poses the
challenge of devising algorithms that make parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step thanks to
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.
Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2
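The context-based encoding described in this abstract can be illustrated with a small sketch. It is a simplification under stated assumptions, not the paper's data structure: words are assumed to be integer vocabulary IDs, the function names are hypothetical, and plain Python dictionaries and lists stand in for the compressed trie. Each word is replaced by its rank among the distinct words that follow the same context, so the stored integer is bounded by the number of successors of that context rather than by the vocabulary size.

```python
from bisect import bisect_left
from collections import defaultdict


def build_successors(ngrams):
    """Map each context (all words but the last) to its sorted distinct successors."""
    succ = defaultdict(set)
    for *context, word in ngrams:
        succ[tuple(context)].add(word)
    return {ctx: sorted(words) for ctx, words in succ.items()}


def encode(ngram, successors):
    """Replace the last word ID by its rank among the context's successors."""
    *context, word = ngram
    return bisect_left(successors[tuple(context)], word)


def decode(context, code, successors):
    """Recover the original word ID from the context and the small code."""
    return successors[tuple(context)][code]


# Trigrams over integer word IDs, so the context length k is 2 (illustrative data).
ngrams = [(10, 20, 7001), (10, 20, 93), (10, 20, 45012), (30, 40, 45012)]
successors = build_successors(ngrams)
codes = [encode(g, successors) for g in ngrams]
print(codes)  # [1, 0, 2, 0]
```

The word ID 45012 is stored as 2 after the context (10, 20) and as 0 after (30, 40). The codes stay small whenever few words follow a context, which is the property the abstract relies on for compression.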
Stepwise API usage assistance based on N-gram language models
Software development requires the use of external Application Programming Interfaces
(APIs) in order to reuse libraries and frameworks. Programmers often
struggle with unfamiliar APIs because of a lack of resources or an uncommon design.
Such difficulties often lead to incorrect sequences of API calls that may
not produce the desired outcome. Language models have shown the ability to
capture regularities in text as well as in code.
In this work we explore the use of n-gram language models and their ability to
capture regularities in API usage through an intrinsic and extrinsic evaluation of
these models on some of the most widely used APIs for the Java programming
language. To achieve this, several language models were trained over source
code corpora containing several hundred GitHub Java projects that use the
desired APIs. In order to fully assess the performance of the language models, we
have selected APIs from multiple domains and vocabulary sizes.
From their low perplexity values and their high overall coverage, which reaches
100% in some cases, we conclude that n-gram language models are able to capture
API usage patterns. This encouraged us to create a code completion tool that
helps programmers stay on the right path when using unknown APIs while still
allowing for some exploration.
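As a rough illustration of how an n-gram model can drive stepwise API completion, the sketch below counts which call follows each history of n-1 calls and ranks the candidates by frequency. It is not the tool described in the abstract: the function names, the call-name tokens, and the three training sequences are invented for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train(sequences, n=3):
    """Count, for each history of n-1 calls, the calls that follow it (n >= 2)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(n - 1, len(padded)):
            counts[tuple(padded[i - n + 1:i])][padded[i]] += 1
    return counts


def suggest(counts, history, n=3, top=3):
    """Return up to `top` next calls for the history, most frequent first."""
    padded = ["<s>"] * (n - 1) + list(history)
    context = tuple(padded[-(n - 1):])
    return [call for call, _ in counts.get(context, Counter()).most_common(top)]


# Hypothetical call sequences, as might be extracted from client code.
sequences = [
    ["File.new", "FileReader.new", "BufferedReader.new",
     "BufferedReader.readLine", "BufferedReader.close"],
    ["File.new", "FileReader.new", "BufferedReader.new",
     "BufferedReader.readLine", "BufferedReader.readLine", "BufferedReader.close"],
    ["File.new", "FileWriter.new", "FileWriter.write", "FileWriter.close"],
]
model = train(sequences)
print(suggest(model, ["File.new"]))  # ['FileReader.new', 'FileWriter.new']
```

Returning several ranked candidates rather than a single best call is one simple way to leave room for the exploration mentioned in the abstract.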
Reified Context Models
A classic tension exists between exact inference in a simple model and
approximate inference in a complex model. The latter offers expressivity and
thus accuracy, but the former provides coverage of the space, an important
property for confidence estimation and learning with indirect supervision. In
this work, we introduce a new approach, reified context models, to reconcile
this tension. Specifically, we let the amount of context (the arity of the
factors in a graphical model) be chosen "at run-time" by reifying it---that is,
letting this choice itself be a random variable inside the model. Empirically,
we show that our approach obtains expressivity and coverage on three natural
language tasks.
Detection is the central problem in real-word spelling correction
Real-word spelling correction differs from non-word spelling correction in
its aims and its challenges. Here we show that the central problem in real-word
spelling correction is detection. Methods from non-word spelling correction,
which focus instead on selection among candidate corrections, do not address
detection adequately, because detection is either assumed in advance or heavily
constrained. As we demonstrate in this paper, merely discriminating between the
intended word and a random close variation of it within the context of a
sentence is a task that can be performed with high accuracy using
straightforward models. Trigram models are sufficient in almost all cases. The
difficulty comes when every word in the sentence is a potential error, with a
large set of possible candidate corrections. Despite their strengths, trigram
models cannot reliably find true errors without introducing many more, at least
not when used in the obvious sequential way without added structure. The
detection task exposes weaknesses not visible in the selection task.
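The selection task that the abstract calls easy, choosing between the intended word and a close variation in context, can be sketched with a small trigram model. Everything below is an assumption made for illustration: the toy training sentences, the function names, and the use of add-one smoothing are not taken from the paper.

```python
import math
from collections import Counter


def train_trigrams(sentences):
    """Collect trigram counts, their two-word context counts, and the vocabulary size."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        vocab.update(words[2:])
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            ctx[tuple(words[i - 2:i])] += 1
    return tri, ctx, len(vocab)


def log_score(sentence, tri, ctx, v):
    """Add-one smoothed trigram log-probability of a whole sentence."""
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((tri[tuple(words[i - 2:i + 1])] + 1) / (ctx[tuple(words[i - 2:i])] + v))
        for i in range(2, len(words))
    )


corpus = ["a piece of cake", "a piece of pie", "peace and quiet", "world peace now"]
tri, ctx, v = train_trigrams(corpus)
candidates = ["a piece of cake", "a peace of cake"]
best = max(candidates, key=lambda s: log_score(s, tri, ctx, v))
print(best)  # a piece of cake
```

Detection is harder than this sketch suggests: every word in the sentence would need its own candidate set, and each comparison is a chance to introduce a false correction.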