1,973 research outputs found

    Handling Massive N-Gram Datasets Efficiently

    Get PDF
    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, that have emerged as the de-facto choice for language modeling in both academia and industry, thanks to their relatively low perplexity performance. Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step thanks to exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach.Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2

    The Case for Learned Index Structures

    Full text link
    Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible

    Signature file access methodologies for text retrieval: a literature review with additional test cases

    Get PDF
    Signature files are extremely compressed versions of text files which can be used as access or index files to facilitate searching documents for text strings. These access files, or signatures, are generated by storing hashed codes for individual words. Given the possible generation of similar codes in the hashing or storing process, the primary concern in researching signature files is to determine the accuracy of retrieving information. Inaccuracy is always represented by the false signaling of the presence of a text string. Two suggested ways to alter false drop rates are: 1) to determine if either of the two methologies for storing hashed codes, by superimposing them or by concatenating them, is more efficient; and 2) to determine if a particular hashing algorithm has any impact. To assess these issues, the history of suprimposed coding is traced from its development as a tool for compressing information onto punched cards in the 1950s to its incorporation into proposed signature file methodologies in the mid-1980\u27 s. Likewise, the concept of compressing individual words by various algorithms, or by hashing them is traced through the research literature. Following this literature review, benchmark trials are performed using both superimposed and concatenated methodologies while varying hashing algorithms. It is determined that while one combination of hashing algorithm and storage methodology is better, all signature file mehods can be considered viable

    Cayley Graphs of Semigroups and Applications to Hashing

    Full text link
    In 1994, Tillich and Zemor proposed a scheme for a family of hash functions that uses products of matrices in groups of the form SL2(F2n)SL_2(F_{2^n}). In 2009, Grassl et al. developed an attack to obtain collisions for palindromic bit strings by exploring a connection between the Tillich-Zemor functions and maximal length chains in the Euclidean algorithm for polynomials over F2F_2. In this work, we present a new proposal for hash functions based on Cayley graphs of semigroups. In our proposed hash function, the noncommutative semigroup of linear functions under composition is considered as platform for the scheme. We will also discuss its efficiency, pseudorandomness and security features. Furthermore, we generalized the Fit-Florea and Matula\u27s algorithm (2004) that finds the discrete logarithm in the multiplicative group of integers modulo 2k2^k by establishing a connection between semi-primitive roots modulo 2k2^k where k≄3k\geq 3 and the logarithmic base used in the algorithm

    Bloom Filters in Adversarial Environments

    Get PDF
    Many efficient data structures use randomness, allowing them to improve upon deterministic ones. Usually, their efficiency and correctness are analyzed using probabilistic tools under the assumption that the inputs and queries are independent of the internal randomness of the data structure. In this work, we consider data structures in a more robust model, which we call the adversarial model. Roughly speaking, this model allows an adversary to choose inputs and queries adaptively according to previous responses. Specifically, we consider a data structure known as "Bloom filter" and prove a tight connection between Bloom filters in this model and cryptography. A Bloom filter represents a set SS of elements approximately, by using fewer bits than a precise representation. The price for succinctness is allowing some errors: for any x∈Sx \in S it should always answer `Yes', and for any x∉Sx \notin S it should answer `Yes' only with small probability. In the adversarial model, we consider both efficient adversaries (that run in polynomial time) and computationally unbounded adversaries that are only bounded in the number of queries they can make. For computationally bounded adversaries, we show that non-trivial (memory-wise) Bloom filters exist if and only if one-way functions exist. For unbounded adversaries we show that there exists a Bloom filter for sets of size nn and error Δ\varepsilon, that is secure against tt queries and uses only O(nlog⁥1Δ+t)O(n \log{\frac{1}{\varepsilon}}+t) bits of memory. In comparison, nlog⁥1Δn\log{\frac{1}{\varepsilon}} is the best possible under a non-adaptive adversary

    Lempel Ziv Welch data compression using associative processing as an enabling technology for real time application

    Get PDF
    Data compression is a term that refers to the reduction of data representation requirements either in storage and/or in transmission. A commonly used algorithm for compression is the Lempel-Ziv-Welch (LZW) method proposed by Terry A. Welch[l]. LZW is an adaptive, dictionary based, lossless algorithm. This provides for a general compression mechanism that is applicable to a broad range of inputs. Furthermore, the lossless nature of LZW implies that it is a reversible process which results in the original file/message being fully recoverable from compression. A variant of this algorithm is currently the foundation of the UNIX compress program. Additionally, LZW is one of the compression schemes defined in the TIFF standard[12], as well as in the CCITT V.42bis standard. One of the challenges in designing an efficient compression mechanism, such as LZW, which can be used in real time applications, is the speed of the search into the data dictionary. In this paper an Associative Processing implementation of the LZW algorithm is presented. This approach provides an efficient solution to this requirement. Additionally, it is shown that Associative Processing (ASP) allows for rapid and elegant development of the LZW algorithm that will generally outperform standard approaches in complexity, readability, and performance

    Social Fingerprinting: detection of spambot groups through DNA-inspired behavioral modeling

    Full text link
    Spambot detection in online social networks is a long-lasting challenge involving the study and design of detection techniques capable of efficiently identifying ever-evolving spammers. Recently, a new wave of social spambots has emerged, with advanced human-like characteristics that allow them to go undetected even by current state-of-the-art algorithms. In this paper, we show that efficient spambots detection can be achieved via an in-depth analysis of their collective behaviors exploiting the digital DNA technique for modeling the behaviors of social network users. Inspired by its biological counterpart, in the digital DNA representation the behavioral lifetime of a digital account is encoded in a sequence of characters. Then, we define a similarity measure for such digital DNA sequences. We build upon digital DNA and the similarity between groups of users to characterize both genuine accounts and spambots. Leveraging such characterization, we design the Social Fingerprinting technique, which is able to discriminate among spambots and genuine accounts in both a supervised and an unsupervised fashion. We finally evaluate the effectiveness of Social Fingerprinting and we compare it with three state-of-the-art detection algorithms. Among the peculiarities of our approach is the possibility to apply off-the-shelf DNA analysis techniques to study online users behaviors and to efficiently rely on a limited number of lightweight account characteristics
    • 

    corecore