
    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show a 4.5x average improvement in the total running time of the state-of-the-art approach.
    Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2
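    The context-based remapping can be pictured with a minimal sketch, assuming toy (k+1)-gram tuples; the function names below (build_context_maps, encode) are illustrative and not the paper's data structure, which stores these ranks inside a compressed trie rather than a Python dict:

```python
# Sketch of the remapping idea: a word that follows a length-k context is
# replaced by its rank among the words observed after that context, so the
# stored integers stay small and compress well.
from collections import defaultdict

def build_context_maps(ngrams, k):
    """Map each k-word context to {word: small integer rank}."""
    context_maps = defaultdict(dict)
    for gram in ngrams:                       # gram is a tuple of k+1 words
        context, word = tuple(gram[:k]), gram[k]
        ids = context_maps[context]
        if word not in ids:
            ids[word] = len(ids)              # rank = order of first appearance
    return context_maps

def encode(ngrams, k):
    """Replace the last word of every (k+1)-gram by its per-context rank."""
    maps = build_context_maps(ngrams, k)
    return [(tuple(g[:k]), maps[tuple(g[:k])][g[k]]) for g in ngrams]

if __name__ == "__main__":
    grams = [("the", "cat", "sat"), ("the", "cat", "ran"), ("a", "dog", "sat")]
    print(encode(grams, k=2))
    # [(('the', 'cat'), 0), (('the', 'cat'), 1), (('a', 'dog'), 0)]
```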

    Recursive n-gram hashing is pairwise independent, at best

    Many applications use sequences of n consecutive symbols (n-grams). Hashing these n-grams can be a performance bottleneck. For more speed, recursive hash families compute hash values by updating previous values. We prove that recursive hash families cannot be more than pairwise independent. While hashing by irreducible polynomials is pairwise independent, our implementations either run in time O(n) or use an exponential amount of memory. As a more scalable alternative, we make hashing by cyclic polynomials pairwise independent by ignoring n-1 bits. Experimentally, we show that hashing by cyclic polynomials is twice as fast as hashing by irreducible polynomials. We also show that randomized Karp-Rabin hash families are not pairwise independent.
    Comment: See software at https://github.com/lemire/rollinghashcp
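    A minimal sketch of recursive (rolling) hashing by cyclic polynomials, assuming a 32-bit word and a random byte table; this illustrates the scheme discussed above, not the code in the linked repository, and it omits the n-1-bit masking trick that makes the family pairwise independent:

```python
# Rolling hash by cyclic polynomials: the hash of the new window is obtained
# from the previous one by a rotation and two XORs, instead of rehashing all
# n symbols from scratch.
import random

W = 32                                    # hash width in bits (illustrative)
MASK = (1 << W) - 1
random.seed(0)
TABLE = [random.getrandbits(W) for _ in range(256)]   # random value per byte

def rot(x, r):
    """Cyclic left rotation of a W-bit word."""
    r %= W
    return ((x << r) | (x >> (W - r))) & MASK

def hash_window(window):
    """Direct definition: H = rot^{n-1}(T[c1]) ^ ... ^ rot(T[c_{n-1}]) ^ T[cn]."""
    h = 0
    for c in window:
        h = rot(h, 1) ^ TABLE[c]
    return h

def roll(h, outgoing, incoming, n):
    """Recursive update when the length-n window slides by one symbol."""
    return rot(h, 1) ^ rot(TABLE[outgoing], n) ^ TABLE[incoming]

if __name__ == "__main__":
    data, n = b"abracadabra", 4
    h = hash_window(data[:n])
    for i in range(1, len(data) - n + 1):
        h = roll(h, data[i - 1], data[i + n - 1], n)
        assert h == hash_window(data[i:i + n])   # rolling matches recomputation
    print("rolling hash consistent with direct recomputation")
```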

    Fast N-Gram Language Model Look-Ahead for Decoders With Static Pronunciation Prefix Trees

    Decoders that make use of token passing restrict their search space by various types of token pruning. With the Language Model Look-Ahead (LMLA) technique it is possible to increase the number of tokens that can be pruned without loss of decoding precision. Unfortunately, for token-passing decoders that use a single static pronunciation prefix tree, full n-gram LMLA considerably increases the number of language model probability calculations needed. In this paper, a method for applying full n-gram LMLA in a decoder with a single static pronunciation tree is introduced. The experiments show that this method improves the speed of the decoder without an increase in search errors.
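    A toy sketch of the basic look-ahead idea, assuming a made-up pronunciation tree and LM probabilities: each tree node stores the best LM probability of any word reachable below it, so tokens on unpromising branches can be pruned earlier. This illustrates plain LMLA, not the paper's specific method for static trees:

```python
# Each prefix-tree node caches the maximum LM probability over the words
# reachable below it; a decoder can combine this with the acoustic score to
# prune tokens before a full word is reached.

class Node:
    def __init__(self):
        self.children = {}     # phone -> Node
        self.word = None       # word ending at this node, if any
        self.lookahead = 0.0   # max LM probability over reachable words

def insert(root, phones, word):
    node = root
    for p in phones:
        node = node.children.setdefault(p, Node())
    node.word = word

def compute_lookahead(node, lm_prob):
    """Bottom-up pass filling node.lookahead for every tree node."""
    best = lm_prob.get(node.word, 0.0) if node.word else 0.0
    for child in node.children.values():
        best = max(best, compute_lookahead(child, lm_prob))
    node.lookahead = best
    return best

if __name__ == "__main__":
    root = Node()
    insert(root, ["k", "ae", "t"], "cat")
    insert(root, ["k", "aa", "r"], "car")
    lm = {"cat": 0.02, "car": 0.05}      # e.g. n-gram probs given the history
    compute_lookahead(root, lm)
    print(root.children["k"].lookahead)  # 0.05: best word reachable via 'k'
```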

    EXTRACTION AND PREDICTION OF SYSTEM PROPERTIES USING VARIABLE-N-GRAM MODELING AND COMPRESSIVE HASHING

    In modern computer systems, memory accesses and power management are the two major performance-limiting factors. Accesses to main memory are very slow when compared to operations within a processor chip. Hardware write buffers, caches, out-of-order execution, and prefetch logic are commonly used to reduce the time spent waiting for main memory accesses. Compiler loop interchange and data layout transformations can also help. Unfortunately, large data structures often have access patterns for which none of the standard approaches are useful. Using smaller data structures can significantly improve performance by allowing the data to reside in higher levels of the memory hierarchy. This dissertation proposes using a lossy data compression technology called "Compressive Hashing" to create "surrogates" that augment the original large data structures to yield faster typical data access. One way to optimize system performance for power consumption is to provide predictive control of system-level energy use. This dissertation creates a novel instruction-level cost model called the variable-n-gram model, which is closely related to the N-Gram analysis commonly used in computational linguistics. This model does not require direct knowledge of complex architectural details, and is capable of determining performance relationships between instructions from an execution trace. Experimental measurements are used to derive a context-sensitive model for the performance of each type of instruction in the context of an N-instruction sequence. Dynamic runtime power prediction mechanisms often suffer from high overhead costs. To reduce the overhead, this dissertation encodes the static instruction-level predictions into a data structure and uses compressive hashing to provide on-demand runtime access to those predictions. Genetic programming is used to evolve compressive hash functions, and performance analysis of applications shows that runtime access overhead can be reduced by a factor of ~3x-9x.
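    A rough sketch of the general compressive-hashing idea, assuming toy instruction addresses and costs: a large table of static predictions is folded into a much smaller lossy surrogate indexed by a cheap hash. The hash function below is a simple stand-in; the dissertation evolves its hash functions with genetic programming, which is not reproduced here:

```python
# A lossy surrogate: predictions that hash to the same bucket are averaged,
# trading occasional collisions for a much smaller memory footprint.

def cheap_hash(key, size):
    # placeholder low-cost hash; any fast integer mix would do
    return (key * 2654435761) % size

def build_surrogate(predictions, size):
    """Compress {instruction_address: predicted_cost} into `size` buckets."""
    sums = [0.0] * size
    counts = [0] * size
    for addr, cost in predictions.items():
        b = cheap_hash(addr, size)
        sums[b] += cost
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def lookup(surrogate, addr):
    """On-demand runtime access to the (approximate) stored prediction."""
    return surrogate[cheap_hash(addr, len(surrogate))]

if __name__ == "__main__":
    preds = {0x400123: 3.0, 0x400150: 5.0, 0x400188: 3.5}   # toy data
    surrogate = build_surrogate(preds, size=64)
    print(lookup(surrogate, 0x400150))
```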

    Locality-preserving minimal perfect hashing of k-mers

    Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1,...,n} bijectively. It is well known that n log2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k-1 symbols, it seems possible to beat the classic log2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in better evaluation time when querying consecutive k-mers. Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
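    A toy illustration of the locality property, assuming a short DNA string: consecutive k-mers overlap by k-1 symbols, and a mapping that assigns addresses in first-occurrence order sends runs of consecutive k-mers to consecutive addresses. This only demonstrates the property; the paper's MPHF additionally compresses the mapping to a few bits per key, which a plain Python dict does not attempt:

```python
# Distinct k-mer -> address in [0, n), assigned in first-occurrence order.

def kmers(s, k):
    return (s[i:i + k] for i in range(len(s) - k + 1))

def first_occurrence_map(s, k):
    """Bijection from the distinct k-mers of s to {0, ..., n-1}."""
    f = {}
    for km in kmers(s, k):
        if km not in f:
            f[km] = len(f)
    return f

if __name__ == "__main__":
    s, k = "ACGTACGGT", 4
    f = first_occurrence_map(s, k)
    print([f[km] for km in kmers(s, k)])
    # Consecutive k-mers mostly receive consecutive addresses: the locality
    # of reference the paper exploits for faster streamed queries.
```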

    Learning with Scalability and Compactness

    Artificial Intelligence has been thriving for decades since its birth. Traditional AI features heuristic search and planning, providing good strategies for tasks that are inherently search problems, such as games and GPS routing. Meanwhile, machine learning, arguably the hottest subfield of AI, embraces a data-driven methodology with great success in a wide range of applications such as computer vision and speech recognition. As a new trend, the applications of both learning and search have shifted toward mobile and embedded devices, which demands not only scalability but also compactness of the models. Under this general paradigm, we propose a series of works to address the issues of scalability and compactness within machine learning and its applications to heuristic search. We first focus on the scalability of memory-based heuristic search, which was recently ameliorated by Maximum Variance Unfolding (MVU), a manifold learning algorithm capable of learning state embeddings as effective heuristics to speed up A* search. Though it achieves unprecedented online search performance under memory-footprint constraints, MVU is notoriously slow to train offline. To address this problem, we introduce Maximum Variance Correction (MVC), which finds large-scale feasible solutions to MVU by post-processing embeddings from any manifold learning algorithm. It increases the scale of MVU embeddings by several orders of magnitude and is naturally parallel. We further propose the Goal-oriented Euclidean Heuristic (GOEH), a variant of MVU embeddings which preferentially optimizes the heuristics associated with goals in the embedding while maintaining their admissibility. We demonstrate unmatched reductions in search time across several non-trivial A* benchmark search problems. Through this work, we bridge the gap between the manifold learning literature and heuristic search, which have been regarded as fundamentally different, leading to cross-fertilization of both fields. Deep learning has made a big splash in the machine learning community with its superior accuracy. However, this comes at the price of huge model sizes that may involve billions of parameters, which poses great challenges for deployment on mobile and embedded devices. To achieve compactness, we propose HashedNets, a general approach to compressing neural network models that leverages feature hashing. At its core, HashedNets randomly groups parameters using a low-cost hash function and shares a single parameter value within each group. According to our empirical results, a neural network can be made 32x smaller with little drop in accuracy. We further introduce Frequency-Sensitive Hashed Nets (FreshNets) to extend this hashing technique to convolutional neural networks by compressing parameters in the frequency domain. Compared with many AI applications, neural networks do not seem to be gaining as much popularity as they should in traditional data mining tasks. For these tasks, categorical features must first be converted to a numerical representation before neural networks can process them. We show that a naïve use of the classic one-hot encoding may result in gigantic weight matrices and therefore lead to prohibitively expensive memory cost in neural networks. Inspired by word embedding, we advocate a compellingly simple, yet effective, neural network architecture with category embedding.
It is capable of directly handling both numerical and categorical features and provides visual insights into feature similarities. Finally, we conduct a comprehensive empirical evaluation that showcases the efficacy and practicality of our approach and provides surprisingly good visualization and clustering for categorical features.
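A small sketch of the parameter-sharing idea behind HashedNets, assuming illustrative sizes and a stand-in hash function: the full weight matrix is never stored, and each virtual entry W[i, j] is instead looked up in a small shared parameter vector via a low-cost hash:

```python
# Virtual weights share storage: W[i, j] = real_params[bucket(i, j)], so a
# 64x16 layer can be backed by only 32 trainable parameters (32x smaller).
import numpy as np

def bucket(i, j, n_real, seed=0x9E3779B9):
    """Cheap deterministic hash of an index pair into [0, n_real)."""
    return ((i * 73856093) ^ (j * 19349663) ^ seed) % n_real

def expand(real_params, rows, cols):
    """Materialize the virtual weight matrix from the shared parameters."""
    W = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            W[i, j] = real_params[bucket(i, j, real_params.size)]
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.standard_normal(32)        # 32 shared weights ...
    W = expand(real, 64, 16)              # ... back a 64x16 virtual layer
    x = rng.standard_normal(16)
    print((W @ x)[:4])                    # forward pass uses the virtual matrix
```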

    An effective Chinese indexing method based on partitioned signature files.

    Wong Chi Yin. Thesis (M.Phil.)--Chinese University of Hong Kong, 1998. Includes bibliographical references (leaves 107-114). Abstract also in Chinese.
    Contents:
    Abstract --- p.ii
    Acknowledgements --- p.vi
    Chapter 1 --- Introduction --- p.1
    Chapter 1.1 --- Introduction to Chinese IR --- p.1
    Chapter 1.2 --- Contributions --- p.3
    Chapter 1.3 --- Organization of this Thesis --- p.5
    Chapter 2 --- Background --- p.6
    Chapter 2.1 --- Indexing methods --- p.6
    Chapter 2.1.1 --- Full-text scanning --- p.7
    Chapter 2.1.2 --- Inverted files --- p.7
    Chapter 2.1.3 --- Signature files --- p.9
    Chapter 2.1.4 --- Clustering --- p.10
    Chapter 2.2 --- Information Retrieval Models --- p.10
    Chapter 2.2.1 --- Boolean model --- p.11
    Chapter 2.2.2 --- Vector space model --- p.11
    Chapter 2.2.3 --- Probabilistic model --- p.13
    Chapter 2.2.4 --- Logical model --- p.14
    Chapter 3 --- Investigation of Segmentation on the Vector Space Retrieval Model --- p.15
    Chapter 3.1 --- Segmentation of Chinese Texts --- p.16
    Chapter 3.1.1 --- Character-based segmentation --- p.16
    Chapter 3.1.2 --- Word-based segmentation --- p.18
    Chapter 3.1.3 --- N-Gram segmentation --- p.21
    Chapter 3.2 --- Performance Evaluation of Three Segmentation Approaches --- p.23
    Chapter 3.2.1 --- Experimental Setup --- p.23
    Chapter 3.2.2 --- Experimental Results --- p.24
    Chapter 3.2.3 --- Discussion --- p.29
    Chapter 4 --- Signature File Background --- p.32
    Chapter 4.1 --- Superimposed coding --- p.34
    Chapter 4.2 --- False drop probability --- p.36
    Chapter 5 --- Partitioned Signature File Based On Chinese Word Length --- p.39
    Chapter 5.1 --- Fixed Weight Block (FWB) Signature File --- p.41
    Chapter 5.2 --- Overview of PSFC --- p.45
    Chapter 5.3 --- Design Considerations --- p.50
    Chapter 6 --- New Hashing Techniques for Partitioned Signature Files --- p.59
    Chapter 6.1 --- Direct Division Method --- p.61
    Chapter 6.2 --- Random Number Assisted Division Method --- p.62
    Chapter 6.3 --- Frequency-based hashing method --- p.64
    Chapter 6.4 --- Chinese character-based hashing method --- p.68
    Chapter 7 --- Experiments and Results --- p.72
    Chapter 7.1 --- Performance evaluation of partitioned signature file based on Chinese word length --- p.74
    Chapter 7.1.1 --- Retrieval Performance --- p.75
    Chapter 7.1.2 --- Signature Reduction Ratio --- p.77
    Chapter 7.1.3 --- Storage Requirement --- p.79
    Chapter 7.1.4 --- Discussion --- p.81
    Chapter 7.2 --- Performance evaluation of different dynamic signature generation methods --- p.82
    Chapter 7.2.1 --- Collision --- p.84
    Chapter 7.2.2 --- Retrieval Performance --- p.86
    Chapter 7.2.3 --- Discussion --- p.89
    Chapter 8 --- Conclusions and Future Work --- p.91
    Chapter 8.1 --- Conclusions --- p.91
    Chapter 8.2 --- Future work --- p.95
    Chapter A --- Notations of Signature Files --- p.96
    Chapter B --- False Drop Probability --- p.98
    Chapter C --- Experimental Results --- p.103
    Bibliography --- p.10