27 research outputs found
Delta bloom filter compression using stochastic learning-based weak estimation
Substantial research has been done, and still continues, on reducing bandwidth requirements and on reliable, space-efficient access to stored and transmitted data. Bloom filters and their variants have achieved widespread acceptance in many fields due to their ability to satisfy these requirements.
As this need has grown, especially for applications that make heavy use of transmission bandwidth, such as distributed database or proxy-server environments, and even applications that are sensitive to access to frequently modified information, this thesis proposes a solution in the form of a compressed delta Bloom filter.
Specifically, the thesis proposes delta Bloom filter compression using stochastic learning-based weak estimation and prediction with partial matching, achieving lossless compression with high compression gain and reducing the large volumes of data transferred frequently.
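The core idea can be sketched in a few lines. The illustrative Python fragment below is not the thesis's implementation: zlib stands in for the SLWE/PPM coders it actually studies, and all names are hypothetical. It shows why a delta Bloom filter compresses well: the XOR difference between successive filter states is sparse, so far fewer bytes need to be transferred than for the full filter.

```python
# Illustrative sketch only: zlib stands in for the SLWE/PPM coders the thesis
# actually studies, and all names here are hypothetical.
import hashlib, zlib

def bloom_bits(items, m=1024, k=3):
    """Build an m-bit Bloom filter with k hash functions."""
    bits = bytearray(m // 8)
    for item in items:
        for i in range(k):
            h = int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
            bits[h // 8] |= 1 << (h % 8)
    return bits

old = bloom_bits(["alice", "bob"])
new = bloom_bits(["alice", "bob", "carol"])

# The delta between successive filter states is sparse (mostly zero bytes),
# so it compresses far better than the full filter would.
delta = bytes(a ^ b for a, b in zip(old, new))
payload = zlib.compress(delta)
print(len(payload), "bytes sent instead of", len(new))
# The receiver reconstructs the update as: new = old XOR delta.
```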
Compressed self-indexed XML representation with efficient XPath evaluation
[Abstract] The popularity of the eXtensible Markup Language (XML) has been continuously growing since its first introduction, being today acknowledged as the de facto standard for semi-structured data representation and data exchange on the World Wide Web. In this scenario, several query languages were proposed to exploit the expressiveness of XML data, as well as systems to provide efficient support for them. At the same time, as research in compression became more and more relevant, works also focused their efforts on studying new approaches to provide efficient solutions using the minimum amount of space. Today, however, there is a lack of practical available tools that join both efficient query support and minimum space requirements.
In this thesis we address this problem, and propose a new approach for storing, processing and querying XML documents in a time- and space-efficient way, focusing especially on XPath queries. We have developed a new compressed self-indexed representation of XML documents that obtains compression ratios of about 30%-40%, over which a query module providing efficient XPath query evaluation has also been developed. As a whole, both parts make up a complete system, called XXS, for the efficient evaluation of XPath queries over compressed self-indexed XML documents. Experimental results show the outstanding performance of our proposal, which can successfully compete with some of the best-known solutions, and largely outperforms them in terms of space.
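For intuition about what "compressed self-indexed" means here, the toy Python sketch below (hypothetical code, not the XXS system) stores an XML element tree as a balanced-parentheses string plus a preorder tag array, a representation commonly used by succinct XML indexes, and answers a simple structural XPath step by navigating that string directly.

```python
# Hypothetical toy, not the XXS system: the element tree of
# <a><b/><c><b/></c></a> stored as a balanced-parentheses string plus a
# preorder tag array; structural steps are answered on that string directly.
# Practical self-indexes add rank/select structures for O(1) navigation.

def find_close(bp, i):
    """Position of the ')' matching the '(' at position i."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == '(' else -1
        if depth == 0:
            return j

def children(bp, i):
    """Positions of the '(' opening each child of the node opened at i."""
    j, end = i + 1, find_close(bp, i)
    while j < end:
        yield j
        j = find_close(bp, j) + 1

bp, tags = "(()(()))", ["a", "b", "c", "b"]   # tree shape + preorder tags
tag_of = dict(zip((i for i, ch in enumerate(bp) if ch == '('), tags))

# XPath-like step /a/c: children of the root whose tag is 'c'
print([c for c in children(bp, 0) if tag_of[c] == "c"])  # -> [3]
```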
New data structures and algorithms for the efficient management of large spatial datasets
[Abstract] In this thesis we study the efficient representation of multidimensional grids, presenting new compact data structures to store and query grids in different application domains. We propose several static and dynamic data structures for the representation of binary grids and grids of integers, and study applications to the representation of raster data in Geographic Information Systems, RDF databases, etc.
We first propose a collection of static data structures for the representation of binary grids and grids of integers: 1) a new representation of bi-dimensional binary grids with large clusters of uniform values, with applications to the representation of binary raster data; 2) a new data structure to represent multidimensional binary grids; 3) a new data structure to represent grids of integers with support for top-k range queries. We also propose a new dynamic representation of binary grids, a new data structure that provides the same functionalities as our static representations of binary grids but also supports changes in the grid.
Our data structures can be used in several application domains. We propose specific variants and combinations of our generic proposals to represent temporal graphs, RDF databases, OLAP databases, binary or general raster data, and temporal raster data. We also propose a new algorithm to jointly query a raster dataset (stored using our representations) and a vectorial dataset stored in a classic data structure, showing that our proposal can be faster and require less space than the usual alternatives. Our representations provide interesting trade-offs and are competitive in terms of space and query times with the usual representations in the different domains.
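As a rough illustration of the kind of structure involved, the hypothetical Python sketch below decomposes a binary grid quadtree-style, collapsing uniform quadrants to a single bit; this is the intuition behind compact representations of grids with large clusters of uniform values, although the thesis's actual structures are succinct bitmap-based variants of the idea.

```python
# Hypothetical sketch, not the thesis's structures: a quadtree-style
# decomposition of a (power-of-two) binary grid in which uniform quadrants
# collapse to a single bit.

def build(grid, r0, c0, size):
    """Return 0/1 for a uniform quadrant, else a tuple of four children."""
    vals = {grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)}
    if len(vals) == 1:
        return vals.pop()
    h = size // 2
    return (build(grid, r0, c0, h),     build(grid, r0, c0 + h, h),
            build(grid, r0 + h, c0, h), build(grid, r0 + h, c0 + h, h))

def query(node, r, c, size):
    """Report one cell by descending only the relevant quadrant."""
    while isinstance(node, tuple):
        size //= 2
        node = node[(r >= size) * 2 + (c >= size)]
        r %= size
        c %= size
    return node

grid = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]]
t = build(grid, 0, 0, 4)
print(query(t, 3, 3, 4))  # -> 1; the all-ones top-left quadrant is one leaf
```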
Enhanced coding, clock recovery and detection for a magnetic credit card
This thesis describes the background, investigation and construction of a system for storing data on the magnetic stripe of a standard three-inch plastic credit card. Investigation shows that the information storage limit within a 3.375 in by 0.11 in rectangle of the stripe is bounded to about 20 kBytes. Practical issues limit the data storage to around 300 Bytes with a low raw error rate: a four-fold density increase over the standard. Removal of the timing jitter (that is probably caused by the magnetic medium particle size) would increase the limit to 1500 Bytes with no other system changes. This is enough capacity for either a small digital passport photograph or a digitized signature, making it possible to remove printed versions from the surface of the card.
Achieving even these modest gains has required the development of a new variable-rate code that is more resilient to timing errors than other codes in its efficiency class. Tabulating the effects of timing errors required the construction of a new code metric and self-recovering decoders. In addition, a new method of timing recovery, based on signal 'snatches', has been invented to increase the rapidity with which a Bayesian decoder can track the changing velocity of a hand-swiped card. The timing recovery and Bayesian detector have been integrated into one computational (software) unit that is self-contained and can decode a general class of (d, k) constrained codes. Additionally, the unit has a signal truncation mechanism to alleviate some of the effects of non-linear distortion that are present when a magnetic card is read with a magneto-resistive sensor that has been driven beyond its bias magnetization.
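To make the (d, k) constraint concrete: in a (d, k)-constrained binary sequence, successive 1s are separated by at least d and at most k 0s, bounding both intersymbol interference and the time between transitions available for timing recovery. The checker below is an illustrative Python sketch, not part of the thesis's decoder.

```python
# Illustrative sketch only, not part of the thesis's decoder: verify that a
# binary sequence satisfies a (d, k) run-length constraint, i.e. that every
# pair of consecutive 1s is separated by at least d and at most k 0s.

def satisfies_dk(bits, d, k):
    zeros = None  # 0s seen since the last 1; None before the first 1
    for b in bits:
        if b == 1:
            if zeros is not None and not (d <= zeros <= k):
                return False
            zeros = 0
        elif zeros is not None:
            zeros += 1
            if zeros > k:  # too long without a transition for clock recovery
                return False
    return True

print(satisfies_dk([1, 0, 0, 1, 0, 0, 0, 1], d=2, k=7))  # True
print(satisfies_dk([1, 1, 0, 1], d=1, k=3))              # False: adjacent 1s
```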
While the storage density is low and the total storage capacity is meagre in comparison with contemporary storage devices, the high-density card may still have a niche role to play in society. Nevertheless, in the face of the Smart card, its long-term outlook is uncertain. However, several areas of coding and detection under short-duration extreme conditions have brought new decoding methods to light. The scope of these methods is not limited just to the credit card.
MediaSync: Handbook on Multimedia Synchronization
This book provides an approachable overview of the most recent advances in the fascinating field of media synchronization (mediasync), gathering contributions from the most representative and influential experts. Understanding the challenges of this field in the current multi-sensory, multi-device, and multi-protocol world is not an easy task. The book revisits the foundations of mediasync, including theoretical frameworks and models, highlights ongoing research efforts, like hybrid broadband broadcast (HBB) delivery and users' perception modeling (i.e., Quality of Experience or QoE), and paves the way for the future (e.g., towards the deployment of multi-sensory and ultra-realistic experiences). Although many advances around mediasync have been devised and deployed, this area of research is receiving renewed attention to overcome the remaining challenges in the next-generation (heterogeneous and ubiquitous) media ecosystem. Given the significant advances in this research area, its current relevance and the multiple disciplines it involves, a reference book on mediasync has become necessary; this book fills that gap. In particular, it addresses key aspects and reviews the most relevant contributions within the mediasync research space from different perspectives. MediaSync: Handbook on Multimedia Synchronization is the perfect companion for scholars and practitioners who want to acquire strong knowledge about this research area, and also to approach the challenges of ensuring the best mediated experiences by providing adequate synchronization between the media elements that constitute those experiences.
Content-aware compression for big textual data analysis
A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limits. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content, in which only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries, which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two-layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated on a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
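The CaPC idea can be sketched briefly. In the hypothetical Python fragment below (zlib and base85 stand in for the scheme's actual coders, and short fields would not really shrink under zlib), record and field delimiters, the functional content, are kept verbatim so record-oriented tools still work, while only each field's informational content is encoded.

```python
# Hypothetical illustration of the CaPC idea, not the thesis's implementation:
# functional content (record/field delimiters) is kept verbatim so that
# record-oriented tooling such as input splitters still works, while only the
# informational content of each field is encoded. Base85 output contains no
# tab or newline characters, so the delimiters stay unambiguous.
import base64, zlib

SEP, EOL = "\t", "\n"   # functional content: left uncompressed

def capc_encode(text):
    return EOL.join(
        SEP.join(base64.b85encode(zlib.compress(f.encode())).decode()
                 for f in line.split(SEP))
        for line in text.split(EOL))

def capc_decode(enc):
    return EOL.join(
        SEP.join(zlib.decompress(base64.b85decode(f)).decode()
                 for f in line.split(SEP))
        for line in enc.split(EOL))

rec = "2015-06-01\tGET /index.html\t200\n2015-06-01\tGET /x.png\t404"
enc = capc_encode(rec)
assert capc_decode(enc) == rec
# enc still splits cleanly on "\n" and "\t", so line- and field-oriented
# tools can operate on it without decompressing whole files.
```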
Local Editing in Lempel-Ziv Compressed Data
This thesis explores the problem of editing data while it remains compressed under a variant of Lempel-Ziv compression. We show that the random-access properties of LZ-End compression allow random edits, and present the first algorithm to achieve this. The thesis goes on to adapt the LZ-End parsing so that the random-access properties become local-access properties, which have tighter memory bounds. Furthermore, the new parsing allows a much improved algorithm to edit the compressed data.
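For context, LZ-End (due to Kreft and Navarro) restricts each phrase so that its earlier copy ends exactly at a previous phrase boundary; that alignment between copies and phrase boundaries is what makes random, and here local, access tractable. The Python sketch below is a naive, purely illustrative parser and decoder for this rule; real parsers use suffix-based structures and run in near-linear time.

```python
# Naive, purely illustrative LZ-End parse (not the thesis's algorithms):
# each phrase is the longest prefix of the unparsed suffix whose earlier
# copy ends exactly at a previous phrase boundary, plus one new character.

def lz_end_parse(text):
    """Return (source_phrase_index, copy_len, next_char) triples."""
    phrases, boundaries = [], []   # boundaries: end position of each phrase
    i = 0
    while i < len(text):
        best_len, best_src = 0, None
        for p, b in enumerate(boundaries):
            # try every copy length l whose source ends exactly at boundary b
            for l in range(1, min(b + 1, len(text) - i - 1) + 1):
                if l > best_len and text[i:i + l] == text[b - l + 1:b + 1]:
                    best_len, best_src = l, p
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
        boundaries.append(i - 1)
    return phrases

def lz_end_decode(phrases):
    out, ends = [], []
    for src, l, ch in phrases:
        if l:
            b = ends[src]                      # copy ends at a phrase boundary
            out.extend(out[b - l + 1:b + 1])
        out.append(ch)
        ends.append(len(out) - 1)
    return "".join(out)

s = "alabaralalabarda"
assert lz_end_decode(lz_end_parse(s)) == s
```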
Second generation sparse models
Sparse data models, where data is assumed to be well represented as a linear combination of a few elements from a learned dictionary, have gained considerable attention in recent years, and their use has led to state-of-the-art results in many applications. The success of these models is largely attributed to two critical features: the use of sparsity as a robust mechanism for regularizing the linear coefficients that represent the data, and the flexibility provided by overcomplete dictionaries that are learned from the data. These features are controlled by two critical hyper-parameters: the desired sparsity of the coefficients, and the size of the dictionaries to be learned. However, lacking theoretical guidelines for selecting these critical parameters, applications based on sparse models often require hand-tuning and cross-validation to select them for each application and each data set. This can be both inefficient and ineffective. On the other hand, there are multiple scenarios in which imposing additional constraints on the produced representations, including the sparse codes and the dictionary itself, can result in further improvements. This thesis is about improving and/or extending current sparse models by addressing the two issues discussed above, providing the elements for a new generation of more powerful and flexible sparse models. First, we seek to gain a better understanding of sparse models as data modeling tools, so that critical parameters can be selected automatically, efficiently, and in a principled way. Second, we explore new sparse modeling formulations for effectively exploiting the prior information present in different scenarios. In order to achieve these goals, we combine ideas and tools from information theory, statistics, machine learning, and optimization theory. The theoretical contributions are complemented with applications in audio, image and video processing.
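As background for the models being extended, the illustrative Python sketch below solves the basic sparse coding problem min_x 0.5*||y - D x||^2 + lam*||x||_1 over a fixed dictionary D using plain ISTA; the penalty lam is exactly the kind of hyper-parameter the thesis seeks to select automatically. All names here are hypothetical.

```python
# Illustrative sketch of basic sparse coding with ISTA; the thesis goes well
# beyond this (automatic parameter selection, structured priors).
import numpy as np

def ista(D, y, lam=0.1, iters=200):
    """Minimize 0.5*||y - Dx||^2 + lam*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(iters):
        g = D.T @ (D @ x - y)              # gradient of the smooth data term
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0)  # soft threshold
    return x

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms, as in dictionary learning
x_true = np.zeros(256)
x_true[[3, 70, 200]] = [1.5, -2.0, 1.0]    # a 3-sparse planted code
y = D @ x_true
# Typically recovers the planted support {3, 70, 200}.
print(np.nonzero(np.round(ista(D, y, lam=0.05), 2))[0])
```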