3,400 research outputs found

    Managing Compressed Structured Text

    Get PDF
    [Definition]: Compressing structured text is the problem of creating a reduced-space representation from which the original data can be re-created exactly. Compared to plain text compression, the goal is to take advantage of the structural properties of the data. A more ambitious goal is to be able to manipulate the text in compressed form, without decompressing it. This entry focuses on compressing, navigating, and searching structured text, as those are the areas where the most progress has been made.
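
    Navigation without decompression is easiest to see on the structure alone. Below is a minimal sketch, not taken from the entry itself, of one classic idea: storing a tree's shape as a balanced-parentheses sequence (two symbols per node instead of pointers) and navigating it directly. The function names are illustrative.

    ```python
    # Minimal sketch: a tree's shape as a balanced-parentheses string,
    # navigable without rebuilding any pointer-based representation.

    def find_close(bp: str, i: int) -> int:
        """Index of the ')' matching the '(' at position i."""
        depth = 0
        for j in range(i, len(bp)):
            depth += 1 if bp[j] == '(' else -1
            if depth == 0:
                return j
        raise ValueError("unbalanced sequence")

    def first_child(bp: str, i: int):
        """First child of the node opening at i, or None if it is a leaf."""
        return i + 1 if bp[i + 1] == '(' else None

    def next_sibling(bp: str, i: int):
        """Next sibling of the node opening at i, or None."""
        j = find_close(bp, i) + 1
        return j if j < len(bp) and bp[j] == '(' else None

    # The tree a(b, c(d)) encoded as "(()(()))":
    bp = "(()(()))"
    b = first_child(bp, 0)
    print(b, next_sibling(bp, b))  # -> 1 3  (b opens at 1, sibling c at 3)
    ```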

    A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm

    Full text link
    Computing problems that handle large amounts of data necessitate the use of lossless data compression for efficient storage and transmission. We present a novel lossless universal data compression algorithm that uses parallel computational units to increase the throughput. The length-N input sequence is partitioned into B blocks. Processing each block independently of the other blocks can accelerate the computation by a factor of B, but degrades the compression quality. Instead, our approach is to first estimate the minimum description length (MDL) context tree source underlying the entire input, and then encode each of the B blocks in parallel based on the MDL source. With this two-pass approach, the compression loss incurred by using more parallel units is insignificant. Our algorithm is work-efficient, i.e., its computational complexity is O(N/B). Its redundancy is approximately B log(N/B) bits above Rissanen's lower bound on universal compression performance, with respect to any context tree source whose maximal depth is at most log(N/B). We improve the compression by using different quantizers for states of the context tree based on the number of symbols corresponding to those states. Numerical results from a prototype implementation suggest that our algorithm offers a better trade-off between compression and throughput than competing universal data compression algorithms.
    Comment: Accepted to the Journal of Selected Topics in Signal Processing special issue on Signal Processing for Big Data (expected publication date June 2015). 10 pages double column, 6 figures, and 2 tables. arXiv admin note: substantial text overlap with arXiv:1405.6322. Version: Mar 2015: Corrected a typo.
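
    The two-pass structure is easy to sketch. The toy below replaces the paper's MDL context tree with a fixed-depth context model and reports code lengths instead of running an actual arithmetic coder; K, B, and all names are illustrative assumptions, not the paper's.

    ```python
    # Toy sketch of the two-pass idea only: a shared model is estimated
    # over the whole input (pass 1), then blocks are coded independently
    # against it (pass 2), so the block loop parallelizes trivially.
    from collections import defaultdict
    from math import log2

    K = 3  # fixed context depth (the paper selects tree depth by MDL)

    def build_model(data: bytes):
        """Pass 1: count symbol occurrences per length-K context, globally."""
        counts = defaultdict(lambda: defaultdict(int))
        for i in range(K, len(data)):
            counts[data[i - K:i]][data[i]] += 1
        return counts

    def block_code_length(block: bytes, counts) -> float:
        """Pass 2: bits needed for one block under the shared model."""
        bits = 0.0
        for i in range(K, len(block)):
            ctx = counts[block[i - K:i]]
            total = sum(ctx.values()) + 256   # Laplace smoothing over bytes
            bits += -log2((ctx[block[i]] + 1) / total)
        return bits

    data = open(__file__, "rb").read()        # any byte sequence works
    model = build_model(data)                 # first pass: entire input
    B, n = 4, len(data)
    blocks = [data[i * n // B:(i + 1) * n // B] for i in range(B)]
    print(sum(block_code_length(b, model) for b in blocks), "bits")
    ```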

    Distinct encoded records join operator for distributed query processing

    Get PDF
    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2012. Includes bibliographical references (leaves: 41-43). Text in English; abstract in Turkish and English. ix, 49 leaves.
    Distributing data among different locations is very popular nowadays due to the needs of the business environment. In today's business environment, accessible, reliable, and scalable data is a critical need, and distributed database systems provide these advantages. Processing a query in a distributed database system requires transferring data between sites; if the connection speed between sites is low, transmitting data is very time consuming. Optimizing distributed query processing differs from optimizing query processing in a local database system: most algorithms for distributed query processing focus on reducing the amount of data transferred between sites. The join operation combines different tables on a common join attribute value; if the tables involved in a join are at different locations, some of them must be transferred between sites. Join optimization algorithms for distributed database systems therefore reduce the amount of data transfer by eliminating redundant tuples from a relation before transmitting it to the other site. This thesis introduces a new distributed query processing technique, the distinct encoded records join operation (DERjoin), which detects duplicated join attribute values in a relation and eliminates them before sending the relation to another site, as sketched below.
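
    As a rough illustration of that duplicate-elimination step (the thesis's exact DERjoin encoding is not reproduced here), the hedged sketch below ships each distinct join value across the network only once; all names are illustrative.

    ```python
    # Hedged sketch: eliminate duplicate join attribute values before
    # anything crosses the network, semijoin-style.

    def distinct_join_values(relation, key):
        """Site A: the only thing shipped first -- each join value once."""
        return {row[key] for row in relation}

    def matching_tuples(values, relation, key):
        """Site B: return only tuples whose join value was requested."""
        return [row for row in relation if row[key] in values]

    def der_join(r, s_site, key):
        shipped = distinct_join_values(r, key)  # duplicates never leave site A
        s_matches = matching_tuples(shipped, s_site, key)
        return [(a, b) for a in r for b in s_matches if a[key] == b[key]]

    R = [{"id": 1, "dept": "cs"}, {"id": 2, "dept": "cs"}, {"id": 3, "dept": "ee"}]
    S = [{"dept": "cs", "name": "Ada"}, {"dept": "math", "name": "Emmy"}]
    print(der_join(R, S, "dept"))
    # -> both 'cs' rows of R join with Ada; 'cs' crossed the network once
    ```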

    Mechanically tunable optofluidic distributed feedback dye laser

    Get PDF
    A continuously tunable optofluidic distributed feedback (DFB) dye laser was demonstrated on a monolithic replica-molded poly(dimethylsiloxane) (PDMS) chip. The optical feedback was provided by a phase-shifted higher-order Bragg grating embedded in the liquid core of a single-mode buried channel waveguide. Owing to the soft elastomeric nature of PDMS, the laser frequency could be tuned by mechanically stretching the grating period. In principle, the mechanical tuning range is limited only by the gain bandwidth. A tuning range of nearly 60 nm was demonstrated from a single dye laser chip by combining two common dye molecules, Rhodamine 6G and Rhodamine 101. Single-mode operation was maintained with less than 0.1 nm linewidth. Because of the higher-order grating, a single laser, when operated with different dye solutions, can provide tunable light output covering the entire spectrum from near UV to near IR in which efficient laser dyes are available. An array of five DFB dye lasers with different grating periods was also demonstrated on a chip. Such tunable integrated laser arrays are expected to become key components in inexpensive advanced spectroscopy chips.
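
    To make the tuning mechanism concrete, the back-of-the-envelope sketch below applies the standard DFB Bragg condition lambda = 2 * n_eff * Lambda / m; every numerical value in it is assumed for illustration, not taken from the paper.

    ```python
    # Back-of-the-envelope sketch of mechanical tuning via the standard
    # Bragg condition for an order-m DFB grating. All numbers assumed.

    def bragg_wavelength(n_eff: float, period_nm: float, order: int) -> float:
        """Lasing wavelength (nm) of an order-m DFB grating."""
        return 2.0 * n_eff * period_nm / order

    n_eff = 1.40      # assumed effective index of the liquid-core waveguide
    period = 3000.0   # assumed higher-order grating period, in nm
    m = 15            # assumed Bragg order

    base = bragg_wavelength(n_eff, period, m)
    stretched = bragg_wavelength(n_eff, period * 1.10, m)  # 10% strain
    print(f"{base:.0f} nm -> {stretched:.0f} nm "
          f"({stretched - base:.0f} nm tuning range)")
    ```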

    Parallel Wavelet Tree Construction

    Full text link
    We present parallel algorithms for wavelet tree construction with polylogarithmic depth, improving upon the linear depth of the recent parallel algorithms by Fuentes-Sepulveda et al. We experimentally show on a 40-core machine with two-way hyper-threading that we outperform the existing parallel algorithms by 1.3--5.6x and achieve up to 27x speedup over the sequential algorithm on a variety of real-world and artificial inputs. Our algorithms show good scalability with increasing thread count, input size and alphabet size. We also discuss extensions to variants of the standard wavelet tree.
    Comment: This is a longer version of the paper that appears in the Proceedings of the IEEE Data Compression Conference, 2015.
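
    For readers unfamiliar with the data structure itself, here is a minimal sequential wavelet tree (build plus rank); the paper's contribution is parallelizing this construction, which the sketch below does not attempt.

    ```python
    # Minimal wavelet tree: each level partitions the alphabet in half and
    # stores one bit per symbol; rank queries walk down the tree.

    class WaveletTree:
        def __init__(self, seq, lo=None, hi=None):
            if lo is None:
                lo, hi = min(seq), max(seq)
            self.lo, self.hi = lo, hi
            if lo == hi:                 # single symbol: leaf, no bits
                self.bits = None
                return
            mid = (lo + hi) // 2
            self.bits = [c > mid for c in seq]   # one bit per symbol
            self.left = WaveletTree([c for c in seq if c <= mid], lo, mid)
            self.right = WaveletTree([c for c in seq if c > mid], mid + 1, hi)

        def rank(self, c, i):
            """Occurrences of symbol c in seq[:i]."""
            if self.bits is None:
                return i
            mid = (self.lo + self.hi) // 2
            ones = sum(self.bits[:i])
            if c <= mid:
                return self.left.rank(c, i - ones)
            return self.right.rank(c, ones)

    wt = WaveletTree([ord(c) for c in "mississippi"])
    print(wt.rank(ord("s"), 6))  # -> 3 occurrences of 's' in "missis"
    ```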

    Compressed Text Indexes: From Theory to Practice!

    Full text link
    A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology.
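
    The actual Pizza&Chili API is a C interface; the Python rendering below only illustrates the three operations such an API standardizes (count, locate, extract), backed here by a naive scan rather than a real compressed self-index.

    ```python
    # Illustrative self-index interface: count / locate / extract.
    # A real self-index stores the text compressed; this stand-in does not.

    class NaiveSelfIndex:
        def __init__(self, text: str):
            self.text = text

        def count(self, pattern: str) -> int:
            """Number of occurrences of pattern in the text."""
            return len(self.locate(pattern))

        def locate(self, pattern: str):
            """Positions of all occurrences of pattern."""
            hits, i = [], self.text.find(pattern)
            while i != -1:
                hits.append(i)
                i = self.text.find(pattern, i + 1)
            return hits

        def extract(self, start: int, end: int) -> str:
            """Recover text[start:end]; a self-index needs no original file."""
            return self.text[start:end]

    idx = NaiveSelfIndex("abracadabra")
    print(idx.count("abra"), idx.locate("abra"), idx.extract(0, 4))
    # -> 2 [0, 7] abra
    ```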

    Distributed search based on self-indexed compressed text

    Get PDF
    Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated across a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, based on a combination of self-indexed compressed text and posting-list caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that, within the space of the compressed document collection, one can carry out posting-list generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space.
    Affiliations: Arroyuelo, Diego (not specified); Gil Costa, Graciela Verónica (Universidad Nacional de San Luis, Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Centro Científico Tecnológico Conicet - San Luis, Argentina); González, Senén (not specified); Marin, Mauricio (Universidad de Santiago de Chile, Chile); Oyarzún, Mauricio (Universidad de Santiago de Chile, Chile).
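
    A hedged sketch of the combination: answer queries from a posting-list cache where possible, generating lists on demand from the (stand-in) self-index otherwise. Class and function names below are illustrative, not the paper's.

    ```python
    # Toy query pipeline: cached posting lists over a text collection,
    # with snippet extraction from the same structure.
    from functools import lru_cache

    # A real system stores the collection as a compressed self-index;
    # a plain list of documents stands in for it here.
    DOCS = ["compressed text index", "distributed text search", "plain text"]

    @lru_cache(maxsize=1024)          # the posting-list cache
    def posting_list(term: str):
        """Generated on demand by scanning the collection, then cached."""
        return tuple(i for i, d in enumerate(DOCS) if term in d.split())

    def query(terms):
        """Intersect posting lists, then take snippets from the same store."""
        docs = set(posting_list(terms[0]))
        for t in terms[1:]:
            docs &= set(posting_list(t))
        return [(i, DOCS[i][:40]) for i in sorted(docs)]  # snippet extraction

    print(query(["text", "search"]))  # -> [(1, 'distributed text search')]
    ```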