
    Optimized Compressed Data Structures for Infinite-order Language Models

    In recent years, highly compact succinct text indexes developed in bioinformatics have spread to the domain of natural language processing, in particular n-gram indexing. One line of research has been to use compressed suffix trees as both the text index and the language model. Compressed suffix trees have several favourable properties for compressing n-gram strings and associated satellite data while allowing both fast access and fast computation of language model probabilities over the text. For count-based n-gram language models, and especially for low-order n-gram models, the Kneser-Ney language model has long been the de facto industry standard. Shareghi et al. showed how to use a compressed suffix tree to build a highly compact index that is competitive in space with state-of-the-art language models. In addition, they showed how the index can serve as a language model, allowing modified Kneser-Ney probabilities to be computed directly from the data structure. This thesis analyzes and extends the work of Shareghi et al. on building a compressed suffix tree based modified Kneser-Ney language model. We explain their solution and present three attempts to improve the approach. Of the three experiments, one performed far worse than the original approach, but two showed minor gains in time with no real loss in space.
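
    As a rough illustration of the probabilities such an index must expose, the sketch below computes an interpolated Kneser-Ney bigram probability from plain counts in Python. It uses a single discount, whereas the modified variant discussed above uses three count-dependent discounts, and it is not tied to any suffix-tree implementation; all names are illustrative.

    from collections import Counter, defaultdict

    def kneser_ney_bigram(tokens, discount=0.75):
        # Interpolated Kneser-Ney with one discount; the modified variant
        # uses three count-dependent discounts (D1, D2, D3+) instead.
        bigram_count = Counter(zip(tokens, tokens[1:]))
        history_count = Counter(tokens[:-1])       # c(u): bigrams starting with u
        followers = defaultdict(set)               # distinct words seen after u
        predecessors = defaultdict(set)            # distinct histories seen before w
        for u, w in bigram_count:
            followers[u].add(w)
            predecessors[w].add(u)
        bigram_types = len(bigram_count)           # N1+(.,.)

        def prob(u, w):
            continuation = len(predecessors[w]) / bigram_types
            c_u = history_count[u]
            if c_u == 0:                           # unseen history: pure continuation
                return continuation
            higher = max(bigram_count[(u, w)] - discount, 0.0) / c_u
            backoff_weight = discount * len(followers[u]) / c_u
            return higher + backoff_weight * continuation

        return prob

    p = kneser_ney_bigram("the cat sat on the mat the cat ran".split())
    print(p("the", "cat"))                         # roughly 0.49 on this toy corpus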

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can easily be held in memory, while supporting the queries needed to compute language model probabilities on-the-fly. We present several optimisations which improve query runtimes by up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).
    Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201
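
    The on-the-fly counting principle can be illustrated with a plain, uncompressed suffix array over the token stream: n-gram counts are recovered by binary search at query time instead of being stored explicitly. This is only a sketch of that principle in Python (requiring Python 3.10+ for the bisect key argument), not of the compressed suffix tree machinery of the paper.

    import bisect

    def build_suffix_array(tokens):
        # Suffix start positions, sorted lexicographically by suffix.
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def ngram_count(tokens, sa, pattern):
        # All suffixes starting with `pattern` form a contiguous run in sa.
        pattern = list(pattern)
        prefix = lambda i: tokens[i:i + len(pattern)]
        lo = bisect.bisect_left(sa, pattern, key=prefix)
        hi = bisect.bisect_right(sa, pattern, key=prefix)
        return hi - lo

    text = "a rose is a rose is a rose".split()
    sa = build_suffix_array(text)
    c_bigram = ngram_count(text, sa, ("a", "rose"))    # 3
    c_unigram = ngram_count(text, sa, ("a",))          # 3
    print(c_bigram / c_unigram)                        # MLE P(rose | a) = 1.0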

    Beyond Word N-Grams

    We describe, analyze, and evaluate experimentally a new probabilistic model for word-sequence prediction in natural language based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These mixtures have provably and practically better performance than almost any single model. We evaluate the model on several corpora. The low perplexity achieved by relatively small PST mixture models suggests that they may be an advantageous alternative, both theoretically and practically, to the widely used n-gram models.
    Comment: 15 pages, one PostScript figure, uses psfig.sty and fullname.sty. Revised version of a paper in the Proceedings of the Third Workshop on Very Large Corpora, MIT, 199
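
    To give a minimal flavour of suffix-context prediction, the hypothetical Python sketch below keeps next-word counts for every context up to a fixed depth and predicts with the longest context actually observed; it deliberately ignores the Bayesian mixture over all possible PSTs and the unbounded-vocabulary machinery described above.

    from collections import Counter, defaultdict

    def train_contexts(tokens, max_depth=3):
        # Next-word counts for every suffix context of up to max_depth words.
        counts = defaultdict(Counter)
        for i, w in enumerate(tokens):
            for d in range(max_depth + 1):
                if i - d < 0:
                    break
                counts[tuple(tokens[i - d:i])][w] += 1
        return counts

    def predict(counts, history, max_depth=3):
        # Back off from the longest observed suffix of the history.
        for d in range(min(max_depth, len(history)), -1, -1):
            context = tuple(history[len(history) - d:])
            if context in counts:
                return counts[context].most_common(1)[0][0], context
        return None, ()

    toks = "the dog barks the cat meows the dog barks".split()
    model = train_contexts(toks)
    print(predict(model, ["the", "dog"]))    # ('barks', ('the', 'dog'))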

    Language As An Emergent System


    Streaming Property Testing of Visibly Pushdown Languages

    In the context of language recognition, we demonstrate the superiority of streaming property testers over streaming algorithms and property testers when the two are not combined. Initiated by Feigenbaum et al., a streaming property tester is a streaming algorithm recognizing a language under the property-testing approximation: it must distinguish inputs of the language from those that are ε-far from it, while using the smallest possible memory (rather than limiting its number of input queries). Our main result is a streaming ε-property tester for visibly pushdown languages (VPL) with one-sided error using memory space poly((log n)/ε). This construction relies on a (non-streaming) property tester for weighted regular languages based on a previous tester by Alon et al. We provide a simple application of this tester to the streaming testing of special cases of VPL instances that are already hard for both streaming algorithms and property testers. Our main algorithm combines an original simulation of visibly pushdown automata using a stack of small height, whose items may nonetheless be of linear size. In a second step, those items are replaced by small sketches. These sketches rely on a notion of suffix sampling that we introduce. This sampling is the key idea connecting our streaming tester algorithm to property testers.
    Comment: 23 pages. Major modifications in the presentation
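
    For intuition, the exact (non-streaming) stack simulation that such a tester builds on can be sketched in a few lines of Python: call symbols push, return symbols must pop a matching call, and internal symbols leave the stack untouched. The streaming contribution of the paper, replacing possibly linear-size stack items by small suffix-sampling sketches, is not reproduced here; symbol names are illustrative.

    def well_matched(word, calls, returns):
        # Exact stack simulation over a visibly pushdown alphabet:
        # calls push, returns pop and must match, anything else is internal.
        stack = []
        for symbol in word:
            if symbol in calls:
                stack.append(symbol)
            elif symbol in returns:
                if not stack or stack.pop() != returns[symbol]:
                    return False
        return not stack

    # 'a'/'b' are call symbols, 'A'/'B' their matching returns, 'x' is internal.
    print(well_matched("abxBA", {"a", "b"}, {"A": "a", "B": "b"}))    # True
    print(well_matched("abAB", {"a", "b"}, {"A": "a", "B": "b"}))     # False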

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. In an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X in total running time over the state-of-the-art approach.
    Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2
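
    The context-based remapping idea can be sketched as follows: rather than storing its global vocabulary identifier, the word closing each n-gram is stored as its rank among the few distinct words observed after that context, so the stored integers are small and compress well. The Python below illustrates only this mapping, under assumed names, not the compressed trie layout of the paper.

    from collections import defaultdict

    def remap_last_words(ngrams):
        # Encode the last word of each n-gram by its rank among the distinct
        # words seen after its context; ranks are tiny compared to global
        # vocabulary identifiers and therefore compress well.
        followers = defaultdict(set)
        for ng in ngrams:
            followers[ng[:-1]].add(ng[-1])
        rank = {ctx: {w: r for r, w in enumerate(sorted(ws))}
                for ctx, ws in followers.items()}
        return [(ng[:-1], rank[ng[:-1]][ng[-1]]) for ng in ngrams]

    trigrams = [("new", "york", "city"),
                ("new", "york", "times"),
                ("new", "york", "city")]
    print(remap_last_words(trigrams))
    # [(('new', 'york'), 0), (('new', 'york'), 1), (('new', 'york'), 0)]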

    On the descriptional complexity of iterative arrays

    The descriptional complexity of iterative arrays (IAs) is studied. Iterative arrays are a parallel computational model with sequential processing of the input. It is shown that IAs, when compared to deterministic finite automata or pushdown automata, may provide savings in size that are not bounded by any recursive function, so-called non-recursive trade-offs. Additional non-recursive trade-offs are proven to exist between IAs working in linear time and IAs working in real time. Furthermore, the descriptional complexity of IAs is compared with that of cellular automata (CAs), and non-recursive trade-offs are proven between two restricted classes. Finally, it is shown that many decidability questions for IAs are undecidable and not semidecidable.