Managing Compressed Structured Text
[Definition]: Compressing structured text is the problem of creating a reduced-space representation from which the original
data can be re-created exactly. Compared to plain text compression, the goal is to take advantage of the structural
properties of the data. A more ambitious goal is to manipulate this text in compressed form,
without decompressing it. This entry focuses on compressing, navigating, and searching structured text, as those
are the areas where the most advances have been made.
A Universal Parallel Two-Pass MDL Context Tree Compression Algorithm
Computing problems that handle large amounts of data necessitate the use of
lossless data compression for efficient storage and transmission. We present a
novel lossless universal data compression algorithm that uses parallel
computational units to increase the throughput. The input sequence
is partitioned into blocks. Processing each block independently of the
other blocks can accelerate the computation by a factor equal to the number of blocks, but degrades
the compression quality. Instead, our approach is to first estimate the minimum
description length (MDL) context tree source underlying the entire input, and
then encode each of the blocks in parallel based on the MDL source. With
this two-pass approach, the compression loss incurred by using more parallel
units is insignificant. Our algorithm is work-efficient, i.e., its
computational complexity is . Its redundancy is approximately
bits above Rissanen's lower bound on universal compression
performance, with respect to any context tree source whose maximal depth is at
most . We improve the compression by using different quantizers for
states of the context tree based on the number of symbols corresponding to
those states. Numerical results from a prototype implementation suggest that
our algorithm offers a better trade-off between compression and throughput than
competing universal data compression algorithms.
Comment: Accepted to Journal of Selected Topics in Signal Processing special
issue on Signal Processing for Big Data (expected publication date June
2015). 10 pages double column, 6 figures, and 2 tables. arXiv admin note:
substantial text overlap with arXiv:1405.6322. Version: Mar 2015: Corrected a
typ
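The two-pass idea in the abstract above (build one shared model from the whole input, then encode the blocks in parallel against it) can be illustrated with a minimal sketch. This is not the authors' algorithm: a zlib preset dictionary stands in for the MDL context tree, and the names `build_shared_model`, `two_pass_compress`, and the choice of an input prefix as the model are assumptions made only for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data: bytes, num_blocks: int) -> list[bytes]:
    """Split data into at most num_blocks contiguous blocks."""
    size = max(1, -(-len(data) // num_blocks))  # ceiling division, never 0
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_shared_model(data: bytes, max_size: int = 32768) -> bytes:
    # Pass 1 (stand-in): use a prefix of the input as a shared preset
    # dictionary. The paper estimates an MDL context tree instead.
    return data[:max_size]


def compress_block(block: bytes, model: bytes) -> bytes:
    # Each block is coded on its own, but against the same shared model.
    compressor = zlib.compressobj(zdict=model)
    return compressor.compress(block) + compressor.flush()


def decompress_block(payload: bytes, model: bytes) -> bytes:
    decompressor = zlib.decompressobj(zdict=model)
    return decompressor.decompress(payload) + decompressor.flush()


def two_pass_compress(data: bytes, num_blocks: int):
    model = build_shared_model(data)            # pass 1: shared model
    blocks = split_blocks(data, num_blocks)
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        # Pass 2: blocks are independent given the model, so they can
        # be encoded concurrently.
        payloads = list(pool.map(lambda b: compress_block(b, model), blocks))
    return model, payloads


if __name__ == "__main__":
    data = b"structured text compression " * 2000
    model, payloads = two_pass_compress(data, 4)
    restored = b"".join(decompress_block(p, model) for p in payloads)
    print(restored == data, [len(p) for p in payloads])
```

In this sketch the decoder needs the shared model as well as the block payloads, so the model would have to be stored or transmitted alongside them.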
Parallel data compression
Data compression schemes remove data redundancy in communicated and stored data and increase the effective capacities of communication and storage devices. Parallel algorithms and implementations for textual data compression are surveyed. Related concepts from parallel computation and information theory are briefly discussed. Static and dynamic methods for codeword construction and transmission on various models of parallel computation are described. Included are parallel methods which boost system speed by coding data concurrently, and approaches which employ multiple compression techniques to improve compression ratios. Theoretical and empirical comparisons are reported and areas for future research are suggested
Distinct encoded records join operator for distributed query processing
Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2012. Includes bibliographical references (leaves: 41-43). Text in English; abstract in Turkish and English. ix, 49 leaves.
Distributing data across different locations is now common practice because of business needs. Today's business environment requires accessible, reliable, and scalable data, and a distributed database system provides those advantages. Processing a query in a distributed database system requires transferring data between sites, and when the connection between sites is slow this transfer is very time consuming. Optimizing distributed query processing therefore differs from optimizing query processing in a local database system: most algorithms for distributed query processing focus on reducing the amount of data transferred between sites. A join operation combines tables that share a common join attribute value; if the tables being joined reside at different locations, some of them must be transferred between sites. Join optimization algorithms for distributed database systems reduce the amount of data transferred by eliminating redundant tuples from a relation before transmitting it to the other site. This thesis introduces a new distributed query processing technique, the distinct encoded records join operation (DERjoin), which detects duplicated join attribute values in a relation and eliminates them before sending the relation to another site.
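The abstract does not describe DERjoin's actual encoding, so the following is only a hedged sketch of the general idea it names: send each distinct join attribute value once instead of once per tuple, then join at the receiving site. The function names `encode_distinct` and `join_encoded` and the tuple-based relation layout are assumptions for illustration.

```python
from collections import defaultdict


def encode_distinct(relation, key_index):
    """Group rows by join key so each key value is shipped only once.

    relation: list of tuples; key_index: position of the join attribute.
    Returns {key: [row_without_key, ...]} (hypothetical wire format).
    """
    grouped = defaultdict(list)
    for row in relation:
        rest = row[:key_index] + row[key_index + 1:]
        grouped[row[key_index]].append(rest)
    return dict(grouped)


def join_encoded(encoded, local_relation, key_index):
    """Join the received encoding with a relation stored at this site."""
    result = []
    for row in local_relation:
        for rest in encoded.get(row[key_index], []):
            result.append(row + rest)
    return result


if __name__ == "__main__":
    orders = [(1, "pen"), (1, "ink"), (2, "cup")]   # remote site
    customers = [(1, "Ada"), (3, "Bob")]            # local site
    encoded = encode_distinct(orders, key_index=0)
    print(join_encoded(encoded, customers, key_index=0))
```

Here the key value `1` appears twice in `orders` but only once in the encoded form, which is the kind of duplicate elimination the abstract refers to.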
Mechanically tunable optofluidic distributed feedback dye laser
A continuously tunable optofluidic distributed feedback (DFB) dye laser was demonstrated on a monolithic replica molded poly(dimethylsiloxane) (PDMS) chip. The optical feedback was provided by a phase-shifted higher order Bragg grating embedded in the liquid core of a single mode buried channel waveguide. Due to the soft elastomeric nature of PDMS, the laser frequency could be tuned by mechanically stretching the grating period. In principle, the mechanical tuning range is only limited by the gain bandwidth. A tuning range of nearly 60 nm was demonstrated from a single dye laser chip by combining two common dye molecules, Rhodamine 6G and Rhodamine 101. Single-mode operation was maintained with less than 0.1 nm linewidth. Because of the higher order grating, a single laser, when operated with different dye solutions, can provide tunable light output covering the entire spectrum from near UV to near IR in which efficient laser dyes are available. An array of five DFB dye lasers with different grating periods was also demonstrated on a chip. Such tunable integrated laser arrays are expected to become key components in inexpensive advanced spectroscopy chips.
Parallel Wavelet Tree Construction
We present parallel algorithms for wavelet tree construction with
polylogarithmic depth, improving upon the linear depth of the recent parallel
algorithms by Fuentes-Sepulveda et al. We experimentally show on a 40-core
machine with two-way hyper-threading that we outperform the existing parallel
algorithms by 1.3--5.6x and achieve up to 27x speedup over the sequential
algorithm on a variety of real-world and artificial inputs. Our algorithms show
good scalability with increasing thread count, input size and alphabet size. We
also discuss extensions to variants of the standard wavelet tree.
Comment: This is a longer version of the paper that appears in the Proceedings
of the IEEE Data Compression Conference, 201
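For readers unfamiliar with the data structure, a minimal sequential sketch of a pointer-based wavelet tree over integer symbols is given here. It is a plain recursive baseline, not the parallel polylogarithmic-depth construction from the abstract; the class name `WaveletTree` and its layout (plain Python lists instead of succinct bitvectors) are illustrative assumptions.

```python
class WaveletTree:
    """Toy wavelet tree over a list of integers, supporting rank queries."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = (min(seq), max(seq)) if seq else (0, 0)
        self.lo, self.hi = lo, hi
        self.mid = (lo + hi) // 2
        self.left = self.right = None
        # prefix_zeros[i] = number of symbols <= mid among the first i
        # symbols of this node (i.e. how many go to the left child).
        self.prefix_zeros = [0]
        if lo == hi or not seq:
            return  # leaf (single symbol) or empty node
        for x in seq:
            self.prefix_zeros.append(self.prefix_zeros[-1] + (x <= self.mid))
        self.left = WaveletTree([x for x in seq if x <= self.mid], lo, self.mid)
        self.right = WaveletTree([x for x in seq if x > self.mid], self.mid + 1, hi)

    def rank(self, symbol, i):
        """Return the number of occurrences of symbol in the first i positions."""
        if symbol < self.lo or symbol > self.hi:
            return 0
        if self.lo == self.hi:
            return i  # every symbol stored at a leaf equals self.lo
        if self.left is None:
            return 0  # empty internal node
        zeros = self.prefix_zeros[i]
        if symbol <= self.mid:
            return self.left.rank(symbol, zeros)
        return self.right.rank(symbol, i - zeros)


if __name__ == "__main__":
    seq = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
    tree = WaveletTree(seq)
    print(tree.rank(5, len(seq)), tree.rank(1, 3))
```

Each level of the recursion touches every symbol once, which is the sequential work that parallel constructions aim to spread across threads.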
Compressed Text Indexes: From Theory to Practice!
A compressed full-text self-index represents a text in a compressed form and
still answers queries efficiently. This technology represents a breakthrough
over the text indexing techniques of the previous decade, whose indexes
required several times the size of the text. Although it is relatively new,
this technology has matured up to a point where theoretical research is giving
way to practical developments. Nonetheless this requires significant
programming skills, a deep engineering effort, and a strong algorithmic
background to dig into the research results. To date only isolated
implementations and focused comparisons of compressed indexes have been
reported, and they missed a common API, which prevented their re-use or
deployment within other applications.
The goal of this paper is to fill this gap. First, we present the existing
implementations of compressed indexes from a practitioner's point of view.
Second, we introduce the Pizza&Chili site, which offers tuned implementations
and a standardized API for the most successful compressed full-text
self-indexes, together with effective testbeds and scripts for their automatic
validation and test. Third, we show the results of our extensive experiments on
these codes with the aim of demonstrating the practical relevance of this novel
and exciting technology
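To make "answers queries efficiently" concrete, here is a toy sketch of the backward-search counting query used by FM-index-style self-indexes. It is uncompressed and built naively by sorting suffixes, so it only illustrates the query logic; it is not the Pizza&Chili API, and the names `build_fm_index` and `count` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy FM-index (C array and occurrence tables) for text.

    Assumes text contains no NUL character; NUL is used as the sentinel.
    """
    text += "\0"
    # Naive suffix sort; real indexes use far more efficient construction.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)

    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1

    # C[ch] = number of characters in the text strictly smaller than ch.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return C, occ, len(bwt)


def count(index, pattern):
    """Count occurrences of a non-empty pattern via backward search."""
    C, occ, n = index
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = build_fm_index("abracadabra mississippi banana")
    print(count(index, "ana"), count(index, "ssi"), count(index, "xyz"))
```

A practical self-index replaces the explicit `occ` tables with compressed rank structures (for example wavelet trees), which is where the space savings come from.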
Distributed search based on self-indexed compressed text
Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text and posting lists caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that within the space of the compressed document collection, one can carry out the posting lists generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space.
Affiliation: Arroyuelo, Diego. Not specified.
Affiliation: Gil Costa, Graciela Verónica. Universidad Nacional de San Luis; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
Affiliation: González, Senén. Not specified.
Affiliation: Marin, Mauricio. Universidad de Santiago de Chile; Chile
Affiliation: Oyarzún, Mauricio. Universidad de Santiago de Chile; Chil