Variable Word Rate N-grams
The rate of occurrence of words is not uniform but varies from document to
document. Despite this observation, parameters for conventional n-gram language
models are usually derived using the assumption of a constant word rate. In
this paper we investigate the use of a variable word rate assumption, modelled by
a Poisson distribution or a continuous mixture of Poissons. We present an
approach to estimating the relative frequencies of words or n-grams taking
prior information of their occurrences into account. Discounting and smoothing
schemes are also considered. Using the Broadcast News task, the approach
demonstrates a reduction in perplexity of up to 10%.
Comment: 4 pages, 4 figures, ICASSP-200
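As a rough illustration of the variable word rate idea, the sketch below compares a plain Poisson count model with a continuous mixture of Poissons. It assumes the mixing distribution is a gamma, which yields the negative binomial; the abstract does not say which mixture the paper uses, and the function names and the `shape` and `scale` parameters are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(count = k) when a word occurs at a constant rate lam per document."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def gamma_poisson_pmf(k, shape, scale):
    """P(count = k) when the Poisson rate itself follows Gamma(shape, scale).

    Assumption: a gamma mixing distribution, which gives the negative
    binomial. The mean count is shape * scale.
    """
    p = scale / (1.0 + scale)
    log_coeff = math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
    return math.exp(log_coeff + k * math.log(p) + shape * math.log(1.0 - p))


if __name__ == "__main__":
    # Both models have mean 2, but the mixture puts more mass on
    # "word absent" and on large counts (bursty behaviour).
    for k in range(6):
        print(k, poisson_pmf(k, 2.0), gamma_poisson_pmf(k, 0.5, 4.0))
```

With the same mean, the mixture assigns a higher probability to a word not appearing at all in a document, which is one way to express a rate that varies between documents.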
A Machine learning approach to POS tagging
We have applied inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (NLP)
task of morphosyntactic disambiguation (Part Of Speech Tagging).
The learning process is supervised and obtains a language
model oriented to resolve POS ambiguities. This model consists
of a set of statistical decision trees expressing the distribution of
tags and words in some relevant contexts.
The acquired language models are complete enough to be directly
used as sets of POS disambiguation rules, and include more complex
contextual information than simple collections of n-grams usually
used in statistical taggers.
We have implemented a quite simple and fast tagger that has been
tested and evaluated on the Wall Street Journal (WSJ) corpus with
a remarkable accuracy.
However, better results can be obtained by translating the trees
into rules to feed a flexible relaxation labelling based tagger.
In this direction we describe a tagger which is able to use
information of any kind (n-grams, automatically acquired constraints,
linguistically motivated manually written constraints, etc.), and in
particular to incorporate the machine learned decision trees.
Simultaneously, we address the problem of tagging when only
a small amount of training material is available, which is crucial in any process
of constructing, from scratch, an annotated corpus. We show that quite
high accuracy can be achieved with our system in this situation.
Postprint (published version
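The relaxation labelling step mentioned above can be sketched in a much simplified form. This is not the tagger described in the abstract: it assumes only pairwise constraints between adjacent tags, with hypothetical compatibility values in [-1, 1], and the function name `relax` is made up for illustration.

```python
def relax(tag_probs, compat, iterations=10):
    """Simplified relaxation labelling over a sentence.

    tag_probs: one dict {tag: weight} per word, each summing to 1.
    compat: {(left_tag, right_tag): value in [-1, 1]}; missing pairs count as 0.
    Assumption: constraints only relate adjacent tags.
    """
    probs = [dict(p) for p in tag_probs]
    for _ in range(iterations):
        updated = []
        for i, dist in enumerate(probs):
            neighbours = (i > 0) + (i + 1 < len(probs))
            new = {}
            for tag, weight in dist.items():
                support = 0.0
                if i > 0:
                    support += sum(q * compat.get((t, tag), 0.0)
                                   for t, q in probs[i - 1].items())
                if i + 1 < len(probs):
                    support += sum(q * compat.get((tag, t), 0.0)
                                   for t, q in probs[i + 1].items())
                # Average the support so that 1 + support stays non-negative.
                support /= max(1, neighbours)
                new[tag] = weight * (1.0 + support)
            total = sum(new.values())
            # Keep the old distribution if every tag was driven to zero.
            updated.append({t: v / total for t, v in new.items()} if total > 0 else dict(dist))
        probs = updated
    return probs


if __name__ == "__main__":
    sentence = [{"DT": 1.0}, {"MD": 0.5, "NN": 0.5}]  # e.g. "the can"
    constraints = {("DT", "NN"): 0.8, ("DT", "MD"): -0.8}  # hypothetical values
    print(relax(sentence, constraints))
```

Each iteration raises the weight of tags that are compatible with their neighbours' current weights and lowers the others, so an ambiguous word drifts towards the tag its context supports.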
CES-479 A Linear Estimation-of-Distribution GP System
We present N-gram GP, an estimation of distribution algorithm for the evolution of linear computer programs. The algorithm learns and samples the joint probability distribution of triplets of instructions (or 3-grams) at the same time as it is learning and sampling a program length distribution. We have tested N-gram GP on symbolic regression problems where the target function is a polynomial of up to degree 12 and lawn-mower problems with lawn sizes of up to 12 × 12. Results show that the algorithm is effective and scales better on these problems than either linear GP or simple stochastic hill-climbing.
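A minimal sketch of the learn-and-sample loop, under stated assumptions: triplet counts are collected from a set of selected programs and new programs are sampled one instruction at a time, conditioned on the previous two, with the length drawn from the observed lengths. The padding symbol, the function names and the sampling details are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict

START = "<s>"  # hypothetical padding symbol marking the program start


def learn_trigrams(programs):
    """Count instruction triplets; counts[(a, b)][c] is how often c follows a, b."""
    counts = defaultdict(Counter)
    for prog in programs:
        padded = [START, START] + list(prog)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b)][c] += 1
    return counts


def sample_program(counts, lengths, rng):
    """Sample a program: length from the observed lengths, instructions from 3-gram counts."""
    length = rng.choice(lengths)
    prog = [START, START]
    for _ in range(length):
        options = counts.get((prog[-2], prog[-1]))
        if not options:
            break  # unseen context: stop early in this simplified sketch
        instructions = list(options)
        weights = [options[ins] for ins in instructions]
        prog.append(rng.choices(instructions, weights=weights, k=1)[0])
    return prog[2:]


if __name__ == "__main__":
    selected = [["ADD", "MUL", "SUB"], ["ADD", "MUL", "ADD"]]
    model = learn_trigrams(selected)
    rng = random.Random(0)
    print(sample_program(model, [len(p) for p in selected], rng))
```

In a full estimation-of-distribution loop, the sampled programs would be evaluated, the better ones selected, and the counts re-estimated from them on each generation.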
A theoretical model for n-gram distribution in big data corpora
There is a wide diversity of applications relying on the identification of the sequences of n consecutive words (n-grams) occurring in corpora. Many studies follow an empirical approach for determining the statistical distribution of the n-grams but are usually constrained by the corpora sizes, which for practical reasons stay far away from Big Data. However, Big Data sizes imply hidden behaviors to the applications, such as extraction of relevant information from Web scale sources. In this paper we propose a theoretical approach for estimating the number of distinct n-grams in each corpus. It is based on the Zipf-Mandelbrot Law and the Poisson distribution, and it allows an efficient estimation of the number of distinct 1-grams, 2-grams, 6-grams, for any corpus size. The proposed model was validated for English and French corpora. We illustrate a practical application of this approach to the extraction of relevant expressions from natural language corpora, and predict its asymptotic behaviour for increasingly large sizes.
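One way to see how a Zipf-Mandelbrot law and a Poisson approximation combine is sketched below. It is an illustration under assumptions, not the paper's model: n-gram probabilities follow p(r) ∝ 1/(r + beta)^alpha over a finite set of `vocab_size` possible n-grams, and an n-gram with probability p is taken to be absent from a corpus of N tokens with probability exp(-N p). All names and parameter values are hypothetical.

```python
import math


def zipf_mandelbrot_probs(vocab_size, alpha, beta):
    """Normalised Zipf-Mandelbrot probabilities for ranks 1..vocab_size."""
    weights = [1.0 / (rank + beta) ** alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct(corpus_size, probs):
    """Expected number of distinct n-grams seen in corpus_size tokens.

    Poisson approximation: an n-gram with probability p occurs at least
    once with probability 1 - exp(-corpus_size * p).
    """
    return sum(1.0 - math.exp(-corpus_size * p) for p in probs)


if __name__ == "__main__":
    probs = zipf_mandelbrot_probs(vocab_size=100_000, alpha=1.1, beta=2.7)
    for n_tokens in (10_000, 1_000_000, 100_000_000):
        print(n_tokens, round(expected_distinct(n_tokens, probs)))
```

The estimate grows quickly for small corpora and flattens as the corpus size increases, approaching the assumed total number of possible n-grams.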
Anatomía de los antropónimos españoles
The paper presents some results from an analysis of the morphology and lexicon of Spanish surnames. The analysis includes: the structure, the distribution of lengths and frequencies, the position of letters, the n-grams, the relationship between vocabulary and corpus volume (Zipf-Mandelbrot law), the entropy and the equivalent vocabulary. Some comparisons are made with general Spanish and with US surnames.
Postprint (published version
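The entropy of letter n-grams in a list of names can be computed along the following lines. This is a generic Shannon entropy sketch, not the paper's procedure; the function names are hypothetical and the surnames in the usage example are illustrative only.

```python
import math
from collections import Counter


def letter_ngrams(word, n):
    """All contiguous letter sequences of length n in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_entropy(words, n):
    """Shannon entropy, in bits, of the letter n-gram distribution over words."""
    counts = Counter(g for w in words for g in letter_ngrams(w.lower(), n))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    surnames = ["García", "López", "Martínez"]  # illustrative sample only
    print(ngram_entropy(surnames, 1), ngram_entropy(surnames, 2))
```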