28,065 research outputs found
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, that have emerged as the de-facto choice for language modeling
in both academia and industry, thanks to their relatively low perplexity
performance. Estimating such models from large textual sources poses the
challenge of devising algorithms that make a parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step thanks to
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February
2019, Article No: 2
Human assessments of document similarity
Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for future development of document visualization systems
N-Gram Models
Contains fulltext :
227631.pdf (publisher's version ) (Open Access
Bug or Not? Bug Report Classification Using N-Gram IDF
Previous studies have found that a significant number of bug reports are
misclassified between bugs and non-bugs, and that manually classifying bug
reports is a time-consuming task. To address this problem, we propose a bug
reports classification model with N-gram IDF, a theoretical extension of
Inverse Document Frequency (IDF) for handling words and phrases of any length.
N-gram IDF enables us to extract key terms of any length from texts, these key
terms can be used as the features to classify bug reports. We build
classification models with logistic regression and random forest using features
from N-gram IDF and topic modeling, which is widely used in various software
engineering tasks. With a publicly available dataset, our results show that our
N-gram IDF-based models have a superior performance than the topic-based models
on all of the evaluated cases. Our models show promising results and have a
potential to be extended to other software engineering tasks.Comment: 5 pages, ICSME 201
Recursive n-gram hashing is pairwise independent, at best
Many applications use sequences of n consecutive symbols (n-grams). Hashing
these n-grams can be a performance bottleneck. For more speed, recursive hash
families compute hash values by updating previous values. We prove that
recursive hash families cannot be more than pairwise independent. While hashing
by irreducible polynomials is pairwise independent, our implementations either
run in time O(n) or use an exponential amount of memory. As a more scalable
alternative, we make hashing by cyclic polynomials pairwise independent by
ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials
is is twice as fast as hashing by irreducible polynomials. We also show that
randomized Karp-Rabin hash families are not pairwise independent.Comment: See software at https://github.com/lemire/rollinghashcp
- …