8,639 research outputs found
A theoretical model for n-gram distribution in big data corpora
Many applications rely on identifying the sequences of n consecutive words (n-grams) that occur in corpora. Many studies follow an empirical approach for determining the statistical distribution of the n-grams but are usually constrained by corpus sizes, which for practical reasons stay far away from Big Data. However, Big Data sizes reveal behaviors that remain hidden at smaller scales and that affect applications such as the extraction of relevant information from Web-scale sources. In this paper we propose a theoretical approach for estimating the number of distinct n-grams in each corpus. It is based on the Zipf-Mandelbrot Law and the Poisson distribution, and it allows an efficient estimation of the number of distinct 1-grams, 2-grams, 6-grams, for any corpus size. The proposed model was validated for English and French corpora. We illustrate a practical application of this approach to the extraction of relevant expressions from natural language corpora, and predict its asymptotic behaviour for increasingly large sizes.
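As a rough illustration of how a Zipf-Mandelbrot law and a Poisson approximation can be combined, the sketch below estimates the expected number of distinct types in a corpus of a given size. It is not the paper's model: the function names, the vocabulary size, and the parameter values `alpha` and `beta` are assumptions chosen only for the example.

```python
import math


def zipf_mandelbrot_probs(vocab_size, alpha=1.0, beta=2.7):
    """Normalised probabilities p_r proportional to 1 / (r + beta)**alpha.

    alpha and beta are illustrative values, not fitted to any corpus.
    """
    weights = [1.0 / (rank + beta) ** alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct(corpus_size, probs):
    """Expected number of types seen at least once in corpus_size tokens.

    Under a Poisson approximation, a type with probability p is absent
    from the corpus with probability exp(-corpus_size * p).
    """
    return sum(1.0 - math.exp(-corpus_size * p) for p in probs)


# Hypothetical vocabulary of 50,000 types, for illustration only.
probs = zipf_mandelbrot_probs(vocab_size=50_000)
for size in (10_000, 1_000_000, 100_000_000):
    print(size, round(expected_distinct(size, probs)))
```

The estimate grows with corpus size and saturates at the assumed vocabulary size, which is the kind of asymptotic behaviour such a model can be used to explore.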
Languages cool as they expand: Allometric scaling and the decreasing need for new words
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which, unlike the Zipf and Heaps laws, is dynamical in nature.
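The scaling relation between corpus size N and vocabulary size V is commonly written as Heaps' law, V = K * N^b. A minimal sketch of estimating K and b by least squares in log-log space follows; the data are synthetic and the function name `fit_heaps` is an assumption, not something taken from the paper.

```python
import math


def fit_heaps(corpus_sizes, vocab_sizes):
    """Least-squares fit of log V = log K + b * log N; returns (K, b)."""
    xs = [math.log(n) for n in corpus_sizes]
    ys = [math.log(v) for v in vocab_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    k = math.exp(mean_y - b * mean_x)
    return k, b


# Synthetic data generated from V = 20 * N**0.6, purely for illustration.
sizes = [10**3, 10**4, 10**5, 10**6]
vocab = [20 * n**0.6 for n in sizes]
k, b = fit_heaps(sizes, vocab)
print(f"K = {k:.2f}, b = {b:.2f}")
```

An exponent b below 1 means the vocabulary grows sublinearly, so each additional token is less likely to introduce a new word.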
Population size predicts lexical diversity, but so does the mean sea level - why it is important to correctly account for the structure of temporal data
In order to demonstrate why it is important to correctly account for the
(serial dependent) structure of temporal data, we document an apparently
spectacular relationship between population size and lexical diversity: for
five out of seven investigated languages, there is a strong relationship
between population size and lexical diversity of the primary language in this
country. We show that this relationship is the result of a misspecified model
that does not consider the temporal aspect of the data by presenting a similar
but nonsensical relationship between the global annual mean sea level and
lexical diversity. Given the fact that in the recent past, several studies were
published that present surprising links between different economic, cultural,
political and (socio-)demographical variables on the one hand and cultural or
linguistic characteristics on the other hand, but seem to suffer from exactly
this problem, we explain the cause of the misspecification and show that it has
profound consequences. We demonstrate how simple transformation of the time
series can often solve problems of this type and argue that the evaluation of
the plausibility of a relationship is important in this context. We hope that
our paper will help both researchers and reviewers to understand why it is
important to use special models for the analysis of data with a natural
temporal ordering.
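The effect described above can be reproduced with simulated data. The sketch below generates two independent random walks with drift, shows that their levels correlate strongly, and then applies first differencing, one simple transformation of the kind the abstract mentions. The series, seed, and helper names are assumptions for illustration and do not come from the paper's data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def drifting_walk(steps, rng):
    """Random walk with unit drift: each step adds 1 plus Gaussian noise."""
    level, series = 0.0, []
    for _ in range(steps):
        level += 1.0 + rng.gauss(0.0, 1.0)
        series.append(level)
    return series


def first_differences(series):
    """Year-on-year changes, which remove the shared upward trend."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(0)
a = drifting_walk(200, rng)  # stands in for, e.g., population size
b = drifting_walk(200, rng)  # an unrelated trending series
print("levels:     ", round(pearson(a, b), 3))
print("differences:", round(pearson(first_differences(a), first_differences(b)), 3))
```

The two series are independent by construction, so the high correlation of the levels comes only from their shared trend; the differenced series show no such relationship.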
Asynchronous Training of Word Embeddings for Large Text Corpora
Word embeddings are a powerful approach for analyzing language and have been
widely popular in numerous tasks in information retrieval and text mining.
Training embeddings over huge corpora is computationally expensive because the
input is typically sequentially processed and parameters are synchronously
updated. Distributed architectures for asynchronous training that have been
proposed either focus on scaling vocabulary sizes and dimensionality or suffer
from expensive synchronization latencies.
In this paper, we propose a scalable approach to train word embeddings by
partitioning the input space instead in order to scale to massive text corpora
while not sacrificing the performance of the embeddings. Our training procedure
does not involve any parameter synchronization except a final sub-model merge
phase that typically executes in a few minutes. Our distributed training scales
seamlessly to large corpus sizes, and we obtain comparable performance, and sometimes improvements of up to
45%, on a variety of NLP benchmarks using models trained
by our distributed procedure which requires of the time taken by the
baseline approach. Finally, we show that our approach is robust to missing words in
sub-models and is able to effectively reconstruct word representations. Comment: This paper contains 9 pages and has been accepted in the WSDM201
Neutral evolution and turnover over centuries of English word popularity
Here we test Neutral models against the evolution of English word frequency
and vocabulary at the population scale, as recorded in annual word frequencies
from three centuries of English language books. Against these data, we test
both static and dynamic predictions of two neutral models, including the
relation between corpus size and vocabulary size, frequency distributions, and
turnover within those frequency distributions. Although a commonly used Neutral
model fails to replicate all these emergent properties at once, we find that
a modified two-stage Neutral model does replicate the static and dynamic
properties of the corpus data. This two-stage model is meant to represent a
relatively small corpus (population) of English books, analogous to a `canon',
sampled by an exponentially increasing corpus of books in the wider population
of authors. More broadly, this model -- a smaller neutral model within a larger
neutral model -- could represent those situations where mass
attention is focused on a small subset of the cultural variants. Comment: 12 pages, 5 figures, 1 table
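For readers unfamiliar with neutral models, the sketch below simulates the basic single-population version: in each generation every token is either copied at random from the previous generation or, with probability `mu`, replaced by a new variant. This is a generic textbook form, not the paper's two-stage model, and the population size, innovation rate, and function names are assumptions.

```python
import random
from collections import Counter


def neutral_step(population, mu, rng, next_id):
    """One generation: copy a random earlier word, or invent a new one with probability mu."""
    new_population = []
    for _ in range(len(population)):
        if rng.random() < mu:
            new_population.append(next_id)  # innovation: a brand-new variant
            next_id += 1
        else:
            new_population.append(rng.choice(population))  # unbiased copying
    return new_population, next_id


def simulate(pop_size=1000, mu=0.01, generations=200, seed=0):
    """Run the copying process and return variant counts in the final generation."""
    rng = random.Random(seed)
    population = list(range(pop_size))  # start with all-distinct variants
    next_id = pop_size
    for _ in range(generations):
        population, next_id = neutral_step(population, mu, rng, next_id)
    return Counter(population)


counts = simulate()
print("distinct variants:", len(counts))
print("top 5 frequencies:", [c for _, c in counts.most_common(5)])
```

Frequency distributions and turnover in the top-ranked variants can be read off successive generations of such a simulation and compared with corpus data.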
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality. Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence
Research
Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
An automatic word classification system has been designed which processes
word unigram and bigram frequency statistics extracted from a corpus of natural
language utterances. The system implements a binary top-down form of word
clustering which employs an average class mutual information metric. Resulting
classifications are hierarchical, allowing variable class granularity. Words
are represented as structural tags --- unique -bit numbers the most
significant bit-patterns of which incorporate class information. Access to a
structural tag immediately provides access to all classification levels for the
corresponding word. The classification system has successfully revealed some of
the structure of English, from the phonemic to the semantic level. The system
has been compared --- directly and indirectly --- with other recent word
classification systems. Class based interpolated language models have been
constructed to exploit the extra information supplied by the classifications
and some experiments have shown that the new models improve model performance. Comment: 17 Page Paper. Self-extracting PostScript File
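The class mutual information metric mentioned above can be sketched as the mutual information between the classes of adjacent words, computed from bigram counts. The example below covers only this metric, not the top-down clustering procedure or the structural tags; the toy counts, class assignments, and function name are invented for illustration.

```python
import math
from collections import Counter


def class_mutual_information(bigram_counts, word_class):
    """Mutual information (in bits) between the classes of adjacent words."""
    joint = Counter()
    for (left, right), count in bigram_counts.items():
        joint[(word_class[left], word_class[right])] += count
    total = sum(joint.values())
    left_marginal, right_marginal = Counter(), Counter()
    for (c1, c2), count in joint.items():
        left_marginal[c1] += count
        right_marginal[c2] += count
    mi = 0.0
    for (c1, c2), count in joint.items():
        p_joint = count / total
        p_indep = (left_marginal[c1] / total) * (right_marginal[c2] / total)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


# Toy bigram counts, invented for illustration.
bigrams = {("the", "cat"): 4, ("the", "dog"): 4, ("cat", "sat"): 3, ("dog", "ran"): 3}
by_role = {"the": 0, "cat": 1, "dog": 1, "sat": 2, "ran": 2}
one_class = {word: 0 for word in by_role}
print(class_mutual_information(bigrams, by_role))
print(class_mutual_information(bigrams, one_class))
```

A classification that groups words with similar neighbours scores higher than one that lumps every word into a single class, which scores zero; a clustering procedure can use this difference to choose between candidate splits.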