Search CORE

3,400 research outputs found

Managing Compressed Structured Text

Author: Diego Arroyuelo
G Gottlob
G Navarro
Gonzalo Navarro
J Barbay
M Lohrey
NR Brisaboa
P Ferragina
Paolo Ferragina
R Baeza-Yates
S Sakr
V Mäkinen
Publication venue: Springer Nature
Publication date: 07/12/2018
Field of study

[Definition]: Compressing structured text is the problem of creating a reduced-space representation from which the original data can be re-created exactly. Compared to plain text compression, the goal is to take advantage of the structural properties of the data. A more ambitious goal is to manipulate this text in compressed form, without decompressing it. This entry focuses on compressing, navigating, and searching structured text, as those are the areas where the most advances have been made.

Repositorio da Universidade da Coruña

Crossref

A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm

Author: Baron Dror
Krishnan Nikhil
Publication venue
Publication date: 21/03/2015
Field of study

Computing problems that handle large amounts of data necessitate the use of lossless data compression for efficient storage and transmission. We present a novel lossless universal data compression algorithm that uses parallel computational units to increase the throughput. The length-$N$ input sequence is partitioned into $B$ blocks. Processing each block independently of the other blocks can accelerate the computation by a factor of $B$, but degrades the compression quality. Instead, our approach is to first estimate the minimum description length (MDL) context tree source underlying the entire input, and then encode each of the $B$ blocks in parallel based on the MDL source. With this two-pass approach, the compression loss incurred by using more parallel units is insignificant. Our algorithm is work-efficient, i.e., its computational complexity is $O(N/B)$. Its redundancy is approximately $B\log(N/B)$ bits above Rissanen's lower bound on universal compression performance, with respect to any context tree source whose maximal depth is at most $\log(N/B)$. We improve the compression by using different quantizers for states of the context tree based on the number of symbols corresponding to those states. Numerical results from a prototype implementation suggest that our algorithm offers a better trade-off between compression and throughput than competing universal data compression algorithms. Comment: Accepted to Journal of Selected Topics in Signal Processing special issue on Signal Processing for Big Data (expected publication date June 2015). 10 pages double column, 6 figures, and 2 tables. arXiv admin note: substantial text overlap with arXiv:1405.6322. Version: Mar 2015: Corrected a typ
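The two-pass idea in this abstract (learn one model from the whole input, then encode the blocks in parallel against that shared model) can be illustrated with a rough Python sketch. This is not the authors' MDL context tree algorithm: as a stand-in, it uses zlib's preset-dictionary feature as the "shared model", and the helper names `split_blocks`, `build_shared_dict`, and `two_pass_compress` are hypothetical.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data: bytes, b: int) -> list[bytes]:
    """Partition data into at most b contiguous blocks of near-equal size."""
    size = max(1, -(-len(data) // b))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_shared_dict(data: bytes, max_tokens: int = 64) -> bytes:
    """Pass 1 (stand-in for model estimation): collect frequent tokens
    from the whole input. zlib favors entries near the end of the
    dictionary, so the most frequent tokens are placed last."""
    counts = Counter(data.split())
    common = [tok for tok, _ in counts.most_common(max_tokens)]
    return b" ".join(reversed(common))


def compress_block(block: bytes, zdict: bytes) -> bytes:
    """Compress one block against the shared preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(block) + c.flush()


def decompress_block(payload: bytes, zdict: bytes) -> bytes:
    """Invert compress_block using the same preset dictionary."""
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(payload) + d.flush()


def two_pass_compress(data: bytes, b: int):
    """Pass 1 builds the shared dictionary; pass 2 encodes blocks in parallel."""
    zdict = build_shared_dict(data)
    blocks = split_blocks(data, b)
    with ThreadPoolExecutor(max_workers=b) as pool:
        payloads = list(pool.map(lambda blk: compress_block(blk, zdict), blocks))
    return zdict, payloads
```

Each block can be decoded independently with `decompress_block(payload, zdict)`, which mirrors the property that every parallel unit only needs the shared model and its own block.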

arXiv.org e-Print Archive


Parallel data compression

Author: Hirschberg Daniel S.
Stauffer Lynn M.
Publication venue: eScholarship, University of California
Publication date: 01/05/1991
Field of study

Data compression schemes remove data redundancy in communicated and stored data and increase the effective capacities of communication and storage devices. Parallel algorithms and implementations for textual data compression are surveyed. Related concepts from parallel computation and information theory are briefly discussed. Static and dynamic methods for codeword construction and transmission on various models of parallel computation are described. Included are parallel methods which boost system speed by coding data concurrently, and approaches which employ multiple compression techniques to improve compression ratios. Theoretical and empirical comparisons are reported and areas for future research are suggested

eScholarship - University of California

Distinct encoded records join operator for distributed query processing

Author: Öztürk Ahmet Cumhur
Publication venue: Izmir Institute of Technology
Publication date: 01/01/2012
Field of study

Thesis (Master), Izmir Institute of Technology, Computer Engineering, Izmir, 2012. Includes bibliographical references (leaves 41-43). Text in English; abstract in Turkish and English. ix, 49 leaves. Distributing data across multiple locations is now common because of business needs: accessible, reliable, and scalable data is a critical requirement, and distributed database systems provide those advantages. Processing a query in a distributed database system requires transferring data between sites, and when the connection between sites is slow, that transfer is very time consuming. Optimizing distributed query processing differs from optimizing query processing in a local database system: most algorithms for distributed query processing focus on reducing the amount of data transferred between sites. A join combines tables that share a common join attribute value; if the tables being joined reside at different locations, some of them must be transferred between sites. Join optimization algorithms for distributed database systems reduce the amount of data transferred by eliminating redundant tuples from a relation before transmitting it to the other site. This thesis introduces a new distributed query processing technique, the distinct encoded records join operation (DERjoin), which detects duplicated join attributes in a relation and eliminates them before sending the relation to another site.
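The abstract does not spell out how DERjoin encodes records, so the following Python sketch only illustrates the general idea of removing duplicated join attribute values before shipping a relation. The functions `encode_distinct` and `join_at_remote` and the tuple-based row layout are assumptions made for this example, not the thesis's design.

```python
from collections import defaultdict


def encode_distinct(rows, key_index):
    """Group rows by join key so each distinct key is sent only once.

    Returns {key: [row_without_key, ...]}; this is the payload that
    would be transmitted to the other site in this sketch.
    """
    grouped = defaultdict(list)
    for row in rows:
        key = row[key_index]
        rest = row[:key_index] + row[key_index + 1:]
        grouped[key].append(rest)
    return dict(grouped)


def join_at_remote(encoded, local_rows, key_index):
    """Join the received payload with the local relation on the key."""
    result = []
    for row in local_rows:
        for rest in encoded.get(row[key_index], []):
            result.append(row + rest)
    return result


# Site A holds orders keyed by customer id; site B holds customers.
orders = [(1, "pen"), (1, "ink"), (2, "pad")]
customers = [(1, "Ana"), (3, "Bo")]
payload = encode_distinct(orders, 0)  # key 1 appears once, not twice
joined = join_at_remote(payload, customers, 0)
```

The saving grows with the number of rows that share a join key, since the key bytes are sent once per distinct value rather than once per row.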

Mechanically tunable optofluidic distributed feedback dye laser

Author: Emery Teresa
Li Zhenyu
Psaltis Demetri
Scherer Axel
Zhang Zhaoyu
Publication venue: Optical Society of America
Publication date: 01/01/2006
Field of study

A continuously tunable optofluidic distributed feedback (DFB) dye laser was demonstrated on a monolithic replica molded poly(dimethylsiloxane) (PDMS) chip. The optical feedback was provided by a phase-shifted higher order Bragg grating embedded in the liquid core of a single mode buried channel waveguide. Due to the soft elastomeric nature of PDMS, the laser frequency could be tuned by mechanically stretching the grating period. In principle, the mechanical tuning range is only limited by the gain bandwidth. A tuning range of nearly 60nm was demonstrated from a single dye laser chip by combining two common dye molecules Rhodamine 6G and Rhodamine 101. Single-mode operation was maintained with less than 0.1nm linewidth. Because of the higher order grating, a single laser, when operated with different dye solutions, can provide tunable light output covering the entire spectrum from near UV to near IR in which efficient laser dyes are available. An array of five DFB dye lasers with different grating periods was also demonstrated on a chip. Such tunable integrated laser arrays are expected to become key components in inexpensive advanced spectroscopy chips

Crossref

Caltech Authors

Parallel Wavelet Tree Construction

Author: Shun Julian
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/04/2015
Field of study

We present parallel algorithms for wavelet tree construction with polylogarithmic depth, improving upon the linear depth of the recent parallel algorithms by Fuentes-Sepulveda et al. We experimentally show on a 40-core machine with two-way hyper-threading that we outperform the existing parallel algorithms by 1.3-5.6x and achieve up to 27x speedup over the sequential algorithm on a variety of real-world and artificial inputs. Our algorithms show good scalability with increasing thread count, input size and alphabet size. We also discuss extensions to variants of the standard wavelet tree. Comment: This is a longer version of the paper that appears in the Proceedings of the IEEE Data Compression Conference, 201
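For readers unfamiliar with the data structure being built, here is a minimal sequential wavelet tree in Python. It is a pointer-based teaching sketch over a non-empty sequence of integers, not the parallel construction from the paper; the class name `WaveletTree` and the use of plain prefix-count lists instead of succinct bitvectors are simplifications chosen for this example.

```python
class WaveletTree:
    """Sequential, pointer-based wavelet tree over integer symbols."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)  # requires a non-empty sequence
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        # prefix[i] = number of symbols in seq[:i] routed to the left child
        self.prefix = [0]
        if lo == hi:
            return  # leaf: a single symbol value, nothing to split
        mid = (lo + hi) // 2
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def rank(self, symbol, i):
        """Return the number of occurrences of symbol in seq[:i]."""
        if symbol < self.lo or symbol > self.hi:
            return 0
        if self.lo == self.hi:
            return i
        mid = (self.lo + self.hi) // 2
        went_left = self.prefix[i]
        if symbol <= mid:
            return self.left.rank(symbol, went_left)
        return self.right.rank(symbol, i - went_left)


seq = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
wt = WaveletTree(seq)
```

With this sequence, `wt.rank(1, 4)` counts the occurrences of `1` among the first four symbols. The recursive split by symbol range is the part whose depth the parallel algorithms in the paper improve.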

arXiv.org e-Print Archive

Crossref

Compressed Text Indexes: From Theory to Practice!

Author: Ferragina Paolo
Gonzalez Rodrigo
Navarro Gonzalo
Venturini Rossano
Publication venue
Publication date: 01/01/2007
Field of study

A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology
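To make the notion of a full-text self-index concrete, the sketch below counts pattern occurrences with FM-index-style backward search over the Burrows-Wheeler transform. It is an uncompressed illustration only: practical self-indexes such as those collected on the Pizza&Chili site replace the plain tables used here with compressed rank structures, and the function names `build_index` and `count` are hypothetical rather than part of that site's API.

```python
def build_index(text: str):
    """Build the BWT-based tables for backward search.

    Assumes text contains no "\0", which is appended as a sentinel
    that sorts before every other character.
    """
    text += "\0"
    # Naive suffix array construction; fine for a small illustration.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    # c_table[ch] = number of characters in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return c_table, occ, len(bwt)


def count(pattern: str, c_table, occ, n: int) -> int:
    """Count occurrences of a non-empty pattern via backward search."""
    lo, hi = 0, n  # half-open range of suffix array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


c_table, occ, n = build_index("abracadabra")
```

The query touches only `c_table` and `occ`, never the original text, which is the property that lets a self-index replace the text it indexes.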

arXiv.org e-Print Archive

CiteSeerX

Archivio della Ricerca - Università di Pisa

Distributed search based on self-indexed compressed text

Author: Arroyuelo Diego
Gil Costa Graciela Verónica
González Senén
Marin Mauricio
Oyarzún Mauricio
Publication venue: Pergamon-Elsevier Science Ltd
Publication date: 01/03/2012
Field of study

Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text and posting lists caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that within the space of the compressed document collection, one can carry out the posting lists generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space. Affiliations: Arroyuelo, Diego: not specified. Gil Costa, Graciela Verónica: Universidad Nacional de San Luis, Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Centro Científico Tecnológico Conicet - San Luis, Argentina. González, Senén: not specified. Marin, Mauricio: Universidad de Santiago de Chile, Chile. Oyarzún, Mauricio: Universidad de Santiago de Chile, Chil

CONICET Digital