Dv2v: A Dynamic Variable-to-Variable Compressor
We present Dv2v, a new dynamic (one-pass) variable-to-variable compressor. Variable-to-variable compression uses a modeler that gathers variable-length input symbols and a variable-length statistical coder that assigns shorter codewords to the more frequent symbols. In Dv2v, we process the input text word-wise to gather variable-length symbols that can be either terminals (new words) or non-terminals (subsequences of words seen earlier in the input text). These input symbols are stored in a vocabulary that is kept sorted by frequency, so they can be easily encoded with dense codes. Dv2v permits real-time transmission of data, i.e., compression/transmission can begin as soon as data become available. Our experiments show that Dv2v surpasses the compression ratios of v2vDC, the state-of-the-art semi-static variable-to-variable compressor, and almost reaches those of p7zip. It also achieves competitive performance at both compression and decompression.

Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
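As a concrete illustration of the dense codes mentioned above, the following sketch encodes a frequency rank with an End-Tagged Dense Code, a well-known dense code; it is a minimal illustration of the coding idea, not Dv2v's actual implementation, and the toy vocabulary is an assumption for the example.

```python
def etdc_encode(rank: int) -> bytes:
    """End-Tagged Dense Code: rank 0 (the most frequent symbol) gets
    the shortest codeword. The high bit tags the last byte, and all
    128 values of every other byte are usable, so no codeword space
    is wasted -- this is what makes the code 'dense'."""
    code = [0x80 | (rank % 128)]      # tagged final byte
    rank //= 128
    while rank > 0:
        rank -= 1                     # dense numbering of longer codewords
        code.insert(0, rank % 128)    # untagged leading bytes
        rank //= 128
    return bytes(code)

# Toy vocabulary kept sorted by frequency: a symbol's codeword is
# simply the dense code of its current rank.
vocab = ["the", "of", "compression"]          # rank 0 = most frequent
print(etdc_encode(vocab.index("the")).hex())  # '80'   (one byte)
print(etdc_encode(200).hex())                 # '00c8' (two bytes)
```

Because the vocabulary stays sorted by frequency, frequent symbols keep small ranks and hence short codewords, which is the property the abstract relies on.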
Tight and simple Web graph compression
Analysing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is, however, hampered by the necessity of storing a major part of huge graphs in external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented to represent Web graphs succinctly while still providing random access. Those techniques are usually based on differential encodings of the adjacency lists, finding repeating nodes or node regions in successive lists, more general grammar-based transformations, or 2-dimensional representations of the binary matrix of the graph. In this paper we present two Web graph compression algorithms. The first can be seen as an engineering refinement of the Boldi and Vigna (2004) method: we extend the notion of similarity between link lists and use a more compact encoding of residuals. The algorithm works on blocks of varying size (in the number of input lines) and sacrifices access time for better compression ratio, achieving a more succinct graph representation than other algorithms reported in the literature. The second algorithm works on blocks of the same size (in the number of input lines), and its key mechanism is merging the block into a single ordered list. This method achieves much more attractive space-time tradeoffs.
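To make these differential techniques concrete, here is a minimal sketch, with illustrative names rather than the paper's actual code, of gap-encoding an adjacency list and of the Boldi-Vigna-style idea of copying targets from a similar reference list:

```python
def gap_encode(adj: list[int]) -> list[int]:
    """Differential encoding: keep the first target, then the gaps
    between consecutive sorted link targets (small, compressible)."""
    return [adj[0]] + [b - a for a, b in zip(adj, adj[1:])]

def encode_with_reference(adj: list[int], ref: list[int]):
    """Copy-list idea: a bitmask marks which targets of a similar
    reference list recur in this list; only the residual targets
    are gap-encoded explicitly."""
    adj_set = set(adj)
    copy_mask = [1 if x in adj_set else 0 for x in ref]
    residuals = [x for x in adj if x not in set(ref)]
    return copy_mask, gap_encode(residuals) if residuals else []

prev_list = [15, 16, 17, 22, 74]     # adjacency list of node k-1
cur_list  = [15, 16, 22, 30, 31]     # adjacency list of node k
print(encode_with_reference(cur_list, prev_list))
# -> ([1, 1, 0, 1, 0], [30, 1])
```

Successive pages often share many outgoing links, so the bitmask plus a short residual list is typically far smaller than the raw adjacency list.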
Representation and Exploitation of Event Sequences
[Abstract]
The Ten Commandments, the thirty best smartphones on the market, and the five most wanted people by the FBI. Our life is ruled by sequences: thought sequences, number sequences, event sequences… A history book is nothing more than a compilation of events, and our favorite film is just a sequence of scenes. All of them have something in common: relevant information can be extracted from them. Frequently, by accumulating some data over the elements of each sequence we may access hidden information (e.g., the number of passengers transported by a bus on a journey is the sum of those who boarded over the sequence of stops made); other times, reordering the elements by one of their characteristics facilitates access to the elements of interest (e.g., the books published in 2019 can be ordered chronologically, by author, by literary genre, or even by a combination of characteristics); but we will always seek to store them in the smallest space possible.
Thus, this thesis proposes technological solutions for the storage and subsequent processing of events, focusing specifically on three fundamental aspects that can be found in any application that needs to manage them: compressed and dynamic storage, aggregation or accumulation of data over the elements of the sequence, and reordering of the elements of the sequence by their different characteristics or dimensions.
The first contribution of this work is a compact structure for the dynamic compression of event sequences. This structure allows any sequence to be compressed in a single pass; that is, it is capable of compressing in real time as elements arrive. This contribution is a milestone in the field of compression since, to date, it is the first proposal of a general-purpose dynamic variable-to-variable compressor.
Regarding aggregation, a data-warehouse-like proposal is presented that is capable of storing information on any characteristic of the events in a sequence in an aggregated, compact, and accessible way. Following the philosophy of current data warehouses, we avoid repeating cumulative operations and speed up aggregate queries by preprocessing the information and keeping it in this separate structure.
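As a minimal illustration of why precomputing aggregates pays off (plain prefix sums here; the thesis's structure is compact and supports more than sums), the bus example above becomes a constant-time query:

```python
from itertools import accumulate

class PrefixSums:
    """Data-warehouse-style precomputation: store cumulative counts
    once, then answer any range aggregate in O(1) without
    re-summing at query time."""
    def __init__(self, values):
        self.acc = [0] + list(accumulate(values))

    def range_sum(self, i, j):
        """Sum of values[i..j] (inclusive, 0-based)."""
        return self.acc[j + 1] - self.acc[i]

boarded = [5, 3, 0, 7, 2, 4]      # passengers boarding at each stop
ps = PrefixSums(boarded)
print(ps.range_sum(1, 4))         # boarded over stops 1..4 -> 12
```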
Finally, this thesis addresses the problem of indexing event sequences considering their different characteristics and possible reorderings. A new approach for simultaneously keeping the elements of a sequence ordered by different characteristics, built on compact structures, is presented. Thus, it is possible to consult the information and perform operations on the elements of the sequence using any possible reordering in a simple and efficient way.
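A toy sketch of that idea using plain permutation arrays, assuming nothing about the thesis's actual compact structures: the sequence is stored once, and one permutation per characteristic yields each ordering.

```python
# Events stored once, in arrival order: (author, publication date).
events = [("B", "2019-03-01"), ("A", "2019-01-15"), ("C", "2019-02-10")]

# One permutation per characteristic: the positions of the events
# sorted by that dimension.
by_author = sorted(range(len(events)), key=lambda i: events[i][0])
by_date   = sorted(range(len(events)), key=lambda i: events[i][1])

print([events[i] for i in by_author])   # ordered by author: A, B, C
print([events[i] for i in by_date])     # ordered chronologically
```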
Enabling Fine-Grain Restricted Coset Coding Through Word-Level Compression for PCM
Phase change memory (PCM) has recently emerged as a promising technology to meet the fast-growing demand for large-capacity memory in computer systems, replacing DRAM, which is impeded by physical limitations. Multi-level cell (MLC) PCM offers high density with low per-byte fabrication cost. However, despite many advantages, such as scalability and low leakage, the energy for programming intermediate states is considerably larger than for programming single-level cells. In this paper, we study encoding techniques to reduce write energy for MLC PCM when the encoding granularity is lowered below the typical cache line size. We observe that encoding data blocks at small granularity, intended to reduce write energy, actually increases it because of the auxiliary encoding bits. We mitigate this adverse effect by 1) designing suitable codeword mappings that use fewer auxiliary bits and 2) proposing a new Word-Level Compression (WLC) scheme, which compresses more than 91% of the memory lines and provides enough room to store the auxiliary data using a novel restricted coset encoding applied at small data block granularities.

Experimental results show that the proposed encoding at 16-bit data granularity reduces the write energy by 39%, on average, versus the leading encoding approach for write energy reduction. Furthermore, it improves endurance by 20% and is more reliable than the leading approach. Hardware synthesis evaluation shows that the proposed encoding can be implemented on-chip with only a nominal area overhead.
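As a rough illustration of coset coding at word granularity (the masks, the two-bit selector, and the flipped-cell cost proxy below are assumptions for the sketch, not the paper's restricted codebook or MLC energy model):

```python
# Illustrative coset masks: two auxiliary bits select among four
# cosets (a real restricted design chooses the codebook so the
# auxiliary-bit overhead stays small).
COSETS = [0x0000, 0xFFFF, 0xAAAA, 0x5555]

def write_cost(new: int, old: int) -> int:
    """Proxy for write energy: number of cells whose content changes."""
    return bin((new ^ old) & 0xFFFF).count("1")

def coset_encode(word: int, old: int) -> tuple[int, int]:
    """Store the cheapest coset translate of the 16-bit word, plus
    the 2-bit selector needed to invert it on reads."""
    best = min(range(len(COSETS)),
               key=lambda k: write_cost(word ^ COSETS[k], old))
    return word ^ COSETS[best], best

def coset_decode(stored: int, selector: int) -> int:
    return stored ^ COSETS[selector]      # XOR is self-inverse

old_content = 0x0F0F                      # what the cells hold now
data        = 0xF0FA                      # what we want to write
stored, sel = coset_encode(data, old_content)
assert coset_decode(stored, sel) == data
print(write_cost(data, old_content), "->", write_cost(stored, old_content))
# 14 -> 2 changed cells in this toy example
```

The compression step in the abstract is what frees the bits that hold the selector, so the auxiliary data need not inflate the memory line.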
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor even whether a clear versioning structure exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower.

Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
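A toy contrast between classical gap encoding and run-length compression of the gaps, sketching the principle rather than the paper's actual structures: in a versioned collection a term tends to occur in a document and in all its near-copies, so the gaps collapse into long runs of 1s.

```python
from itertools import groupby

def gaps(postings: list[int]) -> list[int]:
    """Classical gap encoding of a sorted inverted list."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(seq):
    """Run-length encode the gaps: the long runs of gap=1 typical of
    near-copy collections collapse to single (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(seq)]

# A term occurring in one document and its seven near-copies:
postings = [4, 5, 6, 7, 8, 9, 10, 11]
print(gaps(postings))               # [4, 1, 1, 1, 1, 1, 1, 1]
print(run_length(gaps(postings)))   # [(4, 1), (1, 7)]
```

Plain gap encoding still spends one code per posting; run-length (and, analogously, Lempel-Ziv or grammar compression) spends codes per run, which is where the space savings in repetitive settings come from.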
A new word-based compression model allowing compressed pattern matching
In this study, a new semistatic data compression model that has a fast coding process and allows compressed pattern matching (CPM) is introduced. The proposed model is named the tagged word-based compression algorithm (TWBCA), since it uses word-based coding and a word-based compressed matching algorithm. The model has two phases. In the first phase, a dictionary is constructed by adding phrases while paying attention to word boundaries, and in the second phase, compression is performed using the codewords of the phrases in this dictionary. The first byte of each codeword determines whether the word is compressed or not. Thanks to this rule, the CPM process can be conducted word-wise. In addition, the proposed method makes it possible to search for groups of consecutively compressed words. Any previous pattern matching algorithm can be used as a black box in the compressed pattern matching. The duration of the CPM process is always less than the duration of the same process on texts coded by the Gzip tool. While matching longer patterns, compressed pattern matching takes more time than on texts coded by compress and End-Tagged Dense Code (ETDC); however, searching for shorter patterns takes less time on texts coded by our approach than on texts compressed with compress. Besides this, the compression ratio of our algorithm outperforms ETDC only on a file written in Turkish. The compression performance of TWBCA is stable, varying by no more than 6% across different text files.
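The sketch below illustrates why byte-tagged codes allow an off-the-shelf byte matcher to run directly on compressed text. It uses the end-tag convention of ETDC (high bit set on the last byte of a codeword), which the abstract compares against; TWBCA's first-byte tag plays an analogous synchronising role. The byte values are invented for the example.

```python
def find_codeword(text: bytes, codeword: bytes) -> list[int]:
    """Run a stock byte matcher (bytes.find, the 'black box') over
    the compressed text; a hit is a true occurrence iff it starts at
    a codeword boundary, i.e. at position 0 or right after a tagged
    (high-bit-set) byte that ends the previous codeword."""
    hits, pos = [], text.find(codeword)
    while pos != -1:
        if pos == 0 or text[pos - 1] & 0x80:
            hits.append(pos)
        pos = text.find(codeword, pos + 1)
    return hits

# Toy compressed text: codewords 0x81, (0x05 0x82), 0x81; the tag
# bit 0x80 marks the last byte of each codeword.
data = bytes([0x81, 0x05, 0x82, 0x81])
print(find_codeword(data, bytes([0x81])))  # [0, 3]
print(find_codeword(data, bytes([0x82])))  # []  (inside another codeword)
```

The boundary test is what prevents false matches inside longer codewords, so no decompression is needed during the search.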
Compressed self-indexed XML representation with efficient XPath evaluation
[Abstract] The popularity of the eXtensible Markup Language (XML) has been continuously growing since its first introduction, and it is today acknowledged as the de facto standard for semi-structured data representation and data exchange on the World Wide Web. In this scenario, several query languages have been proposed to exploit the expressiveness of XML data, as well as systems to support them efficiently. At the same time, as research in compression became more and more relevant, works also focused their efforts on studying new approaches that provide efficient solutions using the minimum amount of space. Today, however, there is a lack of practical available tools that join both efficient query support and minimum space requirements.
In this thesis we address this problem and propose a new approach for storing, processing and querying XML documents in a time- and space-efficient way, focusing especially on XPath queries. We have developed a new compressed self-indexed representation of XML documents that obtains compression ratios of about 30%-40%, over which a query module providing efficient XPath query evaluation has also been developed. As a whole, both parts make up a complete system, which we call XXS, for the efficient evaluation of XPath queries over compressed self-indexed XML documents. Experimental results show the outstanding performance of our proposal, which can successfully compete with some of the best-known solutions and largely outperforms them in terms of space.
Efficient compression of biological sequences using a neural network
Background: The increasing production of genomic data has led to
an intensified need for models that can cope efficiently with the lossless
compression of biosequences. Important applications include long-term
storage and compression-based data analysis. In the literature, only a
few recent articles propose the use of neural networks for biosequence
compression. However, they fall short when compared with specific
DNA compression tools, such as GeCo2. This limitation is due to the
absence of models specifically designed for DNA sequences. In this
work, we combine the power of neural networks with specific DNA and
amino acids models. For this purpose, we created GeCo3 and AC2, two
new biosequence compressors. Both use a neural network for mixing
the opinions of multiple specific models.
Findings: We benchmark GeCo3 as a reference-free DNA compressor
in five datasets, including a balanced and comprehensive dataset
of DNA sequences, the Y-chromosome and human mitogenome, two
compilations of archaeal and virus genomes, four whole genomes, and
two collections of FASTQ data of a human virome and ancient DNA.
GeCo3 achieves a solid improvement in compression over the previous
version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively.
As a reference-based DNA compressor, we benchmark GeCo3 in four
datasets constituted by the pairwise compression of the chromosomes
of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7× to 3.0× slower than GeCo2). RAM usage is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. For AC2, the improvements and costs over AC are similar, which allows the tool to also outperform the state of the art.
Conclusions: GeCo3 and AC2 are biosequence compressors with a neural-network mixing approach that provides additional gains over top specific biocompressors. The proposed mixing method is portable,
requiring only the probabilities of the models as inputs, providing easy
adaptation to other data compressors or compression-based data analysis
tools. GeCo3 and AC2 are released under GPLv3 and are available
for free download at https://github.com/cobilab/geco3 and
https://github.com/cobilab/ac2.
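A minimal sketch of the mixing idea, not GeCo3's actual network (its architecture and inputs are richer): each specific model emits a probability distribution over the next DNA base, and a tiny softmax layer trained online by gradient descent learns how to weight them. The class name and learning rate are illustrative.

```python
import math

class NeuralMixer:
    """One-layer softmax mixer over the predictions of several
    specific models, updated online so the mixed distribution
    shortens the code length of the bases actually observed."""
    def __init__(self, n_models: int, lr: float = 0.1):
        self.lr = lr
        # w[s][m][t]: weight of model m's P(base t) in base s's logit;
        # initialised so each model's P(s) feeds base s's logit.
        self.w = [[[1.0 if s == t else 0.0 for t in range(4)]
                   for _ in range(n_models)] for s in range(4)]

    def mix(self, preds):
        """preds[m][t] = model m's probability of base t (ACGT)."""
        logits = [sum(self.w[s][m][t] * preds[m][t]
                      for m in range(len(preds)) for t in range(4))
                  for s in range(4)]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def update(self, preds, mixed, actual: int):
        """One SGD step on the cross-entropy of the observed base."""
        for s in range(4):
            err = (1.0 if s == actual else 0.0) - mixed[s]
            for m in range(len(preds)):
                for t in range(4):
                    self.w[s][m][t] += self.lr * err * preds[m][t]

mixer = NeuralMixer(n_models=2)
preds = [[0.7, 0.1, 0.1, 0.1],        # e.g. a high-order context model
         [0.25, 0.25, 0.25, 0.25]]    # e.g. a uniform fallback model
mixed = mixer.mix(preds)
mixer.update(preds, mixed, actual=0)  # observed base: A
```

Because the mixer consumes only the models' probabilities, it is portable in the sense the abstract describes: any set of predictors, for DNA, amino acids or other data, can be plugged in unchanged.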