Digital Image Access & Retrieval
The 33rd Annual Clinic on Library Applications of Data Processing, held at the University of Illinois at Urbana-Champaign in March 1996, addressed the theme of "Digital Image Access & Retrieval." The papers from this conference cover a wide range of topics concerning digital imaging technology for visual resource collections. Papers covered three general areas: (1) systems, planning, and implementation; (2) automatic and semi-automatic indexing; and (3) preservation, with the bulk of the conference focusing on indexing and retrieval.
Efficient storage of versioned matrices
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. Cataloged from the student-submitted PDF version of the thesis. Includes bibliographical references (p. 95-96). Versioned-matrix storage is increasingly important in scientific applications. Various computer-based scientific research, from astronomy observations to weather predictions to mechanical finite-element analyses, results in the generation of large matrices that must be stored and retrieved. Such matrices are often versioned; an initial matrix is stored, then a subsequent matrix based on the first is produced, then another subsequent matrix after that. For large databases of matrices, available disk storage can be a substantial constraint. I propose a framework and programming interface for storing such versioned matrices, and consider a variety of intra-matrix and inter-matrix approaches to data storage and compression, taking into account disk-space usage, performance for inserting data, and performance for retrieving data from the database. For inter-matrix "delta" compression, I explore and compare several differencing algorithms, and several means of selecting which arrays are differenced against each other, with the aim of optimizing both disk-space usage and insert and retrieve performance. This work shows that substantial disk-space savings and performance improvements can be achieved by judicious use of these techniques. In particular, with a combination of Lempel-Ziv compression and a proposed form of delta compression, it is possible to both decrease disk usage by a factor of 10 and increase query performance by a factor of two or more, for particular data sets and query workloads.
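The inter-matrix "delta" idea can be sketched in a few lines. The toy store below uses hypothetical names and is not the thesis's actual framework or interface; JSON serialization and zlib stand in for the real on-disk format and the Lempel-Ziv stage. The first version is kept whole, and each later version is stored only as an element-wise difference against its predecessor, so near-identical versions compress to almost nothing:

```python
import json
import zlib

class VersionedMatrixStore:
    """Toy versioned-matrix store (illustrative sketch): the base version
    is stored whole; each later version is stored as an element-wise
    delta against its predecessor. Both are zlib-compressed."""

    def __init__(self):
        self._blobs = []   # one compressed payload per version
        self._last = None  # uncompressed latest version, kept for diffing

    def insert(self, matrix):
        if self._last is None:
            payload = matrix  # base version: store the whole matrix
        else:                 # delta version: mostly zeros if little changed
            payload = [[a - b for a, b in zip(row, prev)]
                       for row, prev in zip(matrix, self._last)]
        self._blobs.append(zlib.compress(json.dumps(payload).encode()))
        self._last = [row[:] for row in matrix]

    def retrieve(self, version):
        """Reconstruct a version by replaying deltas from the base
        (O(version) work; real systems bound the delta-chain length)."""
        matrix = None
        for blob in self._blobs[:version + 1]:
            payload = json.loads(zlib.decompress(blob))
            if matrix is None:
                matrix = payload
            else:
                matrix = [[a + d for a, d in zip(row, drow)]
                          for row, drow in zip(matrix, payload)]
        return matrix
```

For two versions that differ in a single cell, the second compressed blob is far smaller than the first, which is precisely the disk-space saving the abstract describes.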
Various other strategies can dramatically improve query performance in particular edge cases; for example, a technique called "chunking", where a matrix is broken up and saved as several files on disk, can cause query runtime to be approximately linear in the amount of data requested rather than the size of the raw matrix on disk. by Adam B. Seering. M.Eng.
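The "chunking" strategy admits a similarly small sketch (hypothetical helper names; zlib and JSON again stand in for the on-disk format): rows are compressed in fixed-size chunks, and a range query decompresses only the chunks it overlaps, so query time grows with the amount of data requested rather than with the full matrix:

```python
import json
import zlib

CHUNK_ROWS = 100  # rows per chunk; on disk, each chunk would be its own file

def chunk_matrix(matrix):
    """Compress a matrix as a list of independently readable row-chunks."""
    return [zlib.compress(json.dumps(matrix[i:i + CHUNK_ROWS]).encode())
            for i in range(0, len(matrix), CHUNK_ROWS)]

def read_rows(chunks, start, stop):
    """Return rows [start, stop), decompressing only overlapping chunks."""
    rows = []
    first, last = start // CHUNK_ROWS, (stop - 1) // CHUNK_ROWS
    for idx in range(first, last + 1):
        base = idx * CHUNK_ROWS
        part = json.loads(zlib.decompress(chunks[idx]))
        for offset, row in enumerate(part):
            if start <= base + offset < stop:
                rows.append(row)
    return rows
```

A query for 10 rows touches at most two chunks here, regardless of how large the matrix is.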
Advances in Genomic Data Compression
The rapid growth in the number of individual whole genome sequences and metagenomic
datasets is generating an unprecedented volume of genomic data. This is partly due to the
continuous drop in the cost of sequencing as well as growth in the utility of sequencing for
research and clinical purposes. We are now reaching a point where the lion's share of the
cost is shifting from the actual sequencing to processing and storing the resulting data.
With genomic datasets reaching the petabyte scale in hospitals and medium to large
research groups, it is clear that there is an urgent need to store the data more efficiently - not
only to reduce current costs, but also to make sequencing even more affordable to an even
larger set of use cases, thereby accelerating the pace of adoption of genomic data for a
widening range of research projects and clinical applications.
In Chapter 1 of this thesis, I lay the groundwork for a new approach to compressing genomic
data—one that is based on an extensible software platform, which I called Genozip. This
initial proof of concept allows compression of data in a widely used format, namely the
Variant Call Format, or VCF (Danecek et al. 2011). In Chapter 2, I expand on the work of
Chapter 1, showing how the software architecture is designed to support the addition of
genomic file formats, compression methods, and codecs. Benchmarking results show that
Genozip generally performs better and faster than the leading tools for compression of
common genomic data formats such as VCF, SAM (Li et al. 2009) and FASTQ (Cock et al. 2010).
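Genozip's actual codecs are considerably more elaborate, but the core intuition behind compressing record-oriented formats such as VCF or SAM can be sketched simply (hypothetical helper names; zlib stands in for the real codecs): values within one field are far more self-similar than whole records, so splitting records into per-field streams lets a generic compressor do better:

```python
import zlib

def compress_columnar(lines):
    """Split tab-separated records into per-column streams and compress
    each stream separately. Values in one column (e.g. chromosome names,
    positions) are self-similar, which generic compressors exploit well."""
    columns = list(zip(*(line.split("\t") for line in lines)))
    return [zlib.compress("\n".join(col).encode()) for col in columns]

def decompress_columnar(blobs):
    """Invert compress_columnar: rebuild the original records."""
    columns = [zlib.decompress(b).decode().split("\n") for b in blobs]
    return ["\t".join(fields) for fields in zip(*columns)]
```

This sketch assumes every record has the same number of tab-separated fields; real genomic compressors handle ragged records, typed fields, and per-field codec selection.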
In Chapter 3, I take a detour from compression and demonstrate how Genozip, with its detailed internal data structures for genomic file processing, could potentially be used for other
types of data manipulation. As an example, I introduce DVCF, or Dual-coordinate VCF—an
extension of the VCF format that allows representation of genetic variants concurrently in
two coordinate systems defined by two different reference genomes (Lan 2021). It is
possible to use a DVCF file in a pipeline where each step of the pipeline accesses the data
in either of the coordinate systems. I also developed novel methods for lifting over data from one coordinate system to another, and I show that my methods outperform the two leading tools in that space, namely GATK LiftoverVCF (McKenna et al. 2010) and CrossMap (Zhao et al. 2014). Overall, the Genozip software package is a high-quality and versatile bioinformatic tool that has already been adopted by dozens of research and clinical laboratories worldwide. Through
reduction of the cost of whole genome sequencing data processing and storage, Genozip is
likely to further encourage the use of genomics in research and clinical settings. Thesis (Ph.D.) -- University of Adelaide, School of Biological Sciences, 202
Effective reorganization and self-indexing of big semantic data
In this thesis we analyze the structural redundancy present in RDF graphs and propose a preprocessing technique, RDF-Tr, which groups, reorganizes, and recodes the triples, addressing two sources of structural redundancy underlying the nature of the RDF scheme. We integrated RDF-Tr into HDT and k2-triples, reducing the sizes obtained by the original compressors and outperforming the most prominent state-of-the-art techniques. We call the results of applying RDF-Tr to each compressor HDT++ and k2-triples++, respectively.
In the field of RDF compression, compact data structures are used to build RDF self-indexes, which provide efficient access to the data without decompressing it. HDT-FoQ is used to publish and consume large RDF data collections. We extended HDT++, calling the result iHDT++, to resolve SPARQL triple patterns while consuming less memory than HDT-FoQ, speeding up the resolution of most queries and improving on the space-time trade-off of the other self-indexes. Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos). Doctorado en Informática
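The structural redundancy such reorganization exploits is easy to see in miniature. The sketch below is illustrative only, not the actual RDF-Tr recoding: regrouping triples by predicate stores each predicate once per group instead of once per triple, which is one of the redundancies that RDF compressors remove before entropy coding:

```python
from collections import defaultdict

def group_by_predicate(triples):
    """Regroup (subject, predicate, object) triples into one list of
    (subject, object) pairs per predicate. The repeated predicate is
    factored out of every triple -- a simple form of the structural
    redundancy that RDF reorganization schemes exploit."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[p].append((s, o))
    return dict(groups)

def ungroup(groups):
    """Invert group_by_predicate (order normalized by sorting)."""
    return sorted((s, p, o) for p, pairs in groups.items() for s, o in pairs)
```

On real datasets the grouped representation is then recoded and compressed; here the point is only that the regrouping is lossless.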
Succinct and Self-Indexed Data Structures for the Exploitation and Representation of Moving Objects
Programa Oficial de Doutoramento en Computación. 5009V01 [Abstract]
This thesis deals with the efficient representation and exploitation of trajectories of
objects that move in space without any type of restriction (airplanes, birds, boats,
etc.). Currently, this is a very relevant problem due to the proliferation of GPS
devices, which makes it possible to collect a large number of trajectories. However,
until now there has been no efficient way to properly store and exploit them.
In this thesis, we propose eight structures that meet two fundamental objectives.
First, they are capable of storing space-time data, describing the trajectories, in a
reduced space, so that their exploitation takes advantage of the memory hierarchy.
Second, those structures support exploiting the information through two kinds of queries: object queries, which, given an object, retrieve its position or trajectory over a time interval; and spatio-temporal range queries, which, given a region of space and a time interval, obtain the objects that are within the region at that time. It should be noted that state-of-the-art solutions are only capable of efficiently answering one of the two types of queries.
All of these data structures share a common core: they all use two elements, snapshots and logs. Each snapshot works as a spatial index that periodically indexes
the absolute position of each object or the Minimum Bounding Rectangle (MBR) of
its trajectory. They serve to speed up the spatio-temporal range queries. We have
implemented two types of snapshots: based on k2-trees or R-trees.
With respect to the log, it represents the trajectory (sequence of movements) of
each object. It is the main element of the structures, and facilitates the resolution
of object and spatio-temporal range queries. Four strategies have been implemented
to represent the log in a compressed form: ScdcCT, GraCT, ContaCT and RCT.
With the combination of these two elements we build eight different structures for
the representation of trajectories. All of them have been implemented and evaluated
experimentally, showing that they reduce the space required by traditional methods
by up to two orders of magnitude. Furthermore, they are all competitive in solving
object queries as well as spatio-temporal ones.
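The snapshot-plus-log organization can be sketched as follows (hypothetical names and a toy integer grid; the actual structures compress the log with ScdcCT, GraCT, ContaCT, or RCT and index snapshots with k2-trees or R-trees). Absolute positions are recorded every few time steps, and an object query replays only the short run of relative movements after the nearest snapshot:

```python
class TrajectoryStore:
    """Illustrative snapshot + log store for 2-D object trajectories."""

    SNAP_EVERY = 4  # record an absolute snapshot every 4 time steps

    def __init__(self):
        self.snapshots = {}  # time -> {obj: (x, y)} absolute positions
        self.logs = {}       # obj -> [(dx, dy), ...], one delta per step
        self._prev = {}      # last seen absolute position per object
        self._t = 0

    def append(self, positions):
        """Record the positions of all objects at the next time step."""
        if self._t % self.SNAP_EVERY == 0:
            self.snapshots[self._t] = dict(positions)
        for obj, (x, y) in positions.items():
            px, py = self._prev.get(obj, (0, 0))
            self.logs.setdefault(obj, []).append((x - px, y - py))
            self._prev[obj] = (x, y)
        self._t += 1

    def position(self, obj, t):
        """Object query: start at the latest snapshot <= t and replay
        the few logged relative movements up to time t."""
        snap_t = (t // self.SNAP_EVERY) * self.SNAP_EVERY
        x, y = self.snapshots[snap_t][obj]
        for dx, dy in self.logs[obj][snap_t + 1:t + 1]:
            x += dx
            y += dy
        return (x, y)
```

Relative movements are small integers, which is what makes the log highly compressible; the snapshots bound how much log must be replayed per query.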
Managing tail latency in large scale information retrieval systems
As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem - how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high percentile latency that is observed from a system - in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence, the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience such as "how many queries take more than 200ms to return answers?" or "what is the worst case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. 
We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
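The shift from median questions to tail questions is straightforward to operationalize. A minimal nearest-rank percentile helper (a sketch, not the dissertation's measurement methodology) answers both "what is the p99 latency?" and "how many queries exceeded the budget?":

```python
def tail_latency(latencies_ms, percentile=99):
    """Nearest-rank percentile (whole-number percentiles): the latency
    that `percentile`% of queries do not exceed. High percentiles such
    as p99 capture worst-case experiences that medians hide."""
    ranked = sorted(latencies_ms)
    rank = (percentile * len(ranked) + 99) // 100  # ceil(p * n / 100)
    return ranked[rank - 1]

def slow_query_count(latencies_ms, budget_ms=200):
    """How many queries blew a latency budget (e.g. 200 ms)."""
    return sum(1 for latency in latencies_ms if latency > budget_ms)
```

With these two numbers a search operator can track exactly the questions posed above, rather than a single average.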
Graph Processing in Main-Memory Column Stores
Increasingly, both novel and traditional business applications leverage the advantages of a graph data model, such as the offered schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access.
Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To make matters worse, graph algorithms exhibit a tremendous variety in structure and functionality, caused by their often domain-specific implementations, and therefore can hardly be integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries.
Traversal operations are a basic ingredient of graph queries and algorithms, and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration into the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language.
In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data to be processed seamlessly alongside relational data in the same system. We propose a columnar storage representation for graph data to leverage the existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals.
Our design is driven by the observation that different graph topologies expose different algorithmic requirements to the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators.
Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, security and user management, effectively providing a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.
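A flavor of the columnar approach can be given in a few lines (an illustrative sketch, not GRAPHITE's engine): the edge table is two parallel column arrays, from which a CSR-style index is built so a traversal operator can expand vertex neighborhoods with simple array slices:

```python
from collections import deque

def build_csr(sources, targets, n_vertices):
    """Columnar edge table (two parallel arrays) -> CSR adjacency:
    adjacency[offsets[v]:offsets[v+1]] lists v's out-neighbors."""
    counts = [0] * n_vertices
    for s in sources:
        counts[s] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    adjacency = [0] * len(sources)
    cursor = offsets[:-1]  # next free slot per vertex (fresh list)
    for s, t in zip(sources, targets):
        adjacency[cursor[s]] = t
        cursor[s] += 1
    return offsets, adjacency

def bfs_levels(offsets, adjacency, start):
    """Level-synchronous traversal: hop distance for every reached vertex."""
    level = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adjacency[offsets[v]:offsets[v + 1]]:
            if u not in level:
                level[u] = level[v] + 1
                queue.append(u)
    return level
```

Because neighborhoods are contiguous slices, the traversal scans memory sequentially, which is the kind of access pattern a main-memory column store is built around.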
Automated Testing and Debugging for Big Data Analytics
The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks such as Google MapReduce and Apache Spark that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which are falling behind compared to DISC framework design and implementation. In practice, big data applications often fail because users are unable to test all behaviors emerging from interleaving dataflow operators, user-defined functions, and the framework's code. Testing based on a random sample rarely guarantees reliability, and "trial and error" and "print" debugging methods are expensive and time-consuming. Thus, the current practice of developing a big data application must be improved, and the tools built to enhance developer productivity must adapt to the distinct characteristics of data-intensive scalable computing. By synthesizing ideas from software engineering and database systems, our hypothesis is that we can design effective and scalable testing and debugging algorithms for big data analytics without compromising the performance and efficiency of the underlying DISC framework. To design such techniques, we investigate how we can build interactive and responsive debugging primitives that significantly reduce debugging time yet do not impose much performance overhead on big data applications. Furthermore, we investigate how we can leverage data provenance techniques from databases and fault-isolation algorithms from software engineering to efficiently pinpoint the minimal subset of failure-inducing inputs.
To improve the reliability of big data analytics, we investigate how we can abstract the semantics of dataflow operators and use them in tandem with the semantics of user-defined functions to generate a minimum set of synthetic test inputs capable of revealing more defects than the entire input dataset. To examine the first hypothesis, we introduce interactive, real-time debugging primitives for big data analytics through innovative and scalable debugging features such as simulated breakpoints, dynamic watchpoints, and crash culprit identification. Second, we design a new automated fault localization approach that combines insights from both the software engineering and database literature to bring delta debugging closer to reality for big data applications, by leveraging data provenance and by constructing systems optimizations for debugging provenance queries. Lastly, we devise a new symbolic-execution-based white-box testing algorithm for big data applications that abstracts the implementation of dataflow operators using logical specifications instead of modeling their implementations, and combines them with the semantics of any arbitrary user-defined function. We instantiate the idea of an interactive debugging algorithm as BigDebug, the idea of an automated debugging algorithm as BigSift, and the idea of symbolic-execution-based testing as BigTest. Our investigation shows that the interactive debugging primitives can scale to terabytes---our record-level tracing incurs less than 25% overhead on average and provides up to 100% time saving compared to the baseline replay debugger. Second, we observe that by combining data provenance with delta debugging, we can identify the minimum faulty input in just under 30% of the original job execution time.
Lastly, we verify that by abstracting dataflow operators using logical specifications, we can efficiently generate the most concise test data suitable for local testing while revealing twice as many faults as prior approaches. Our investigations collectively demonstrate that developer productivity can be significantly improved through effective and scalable testing and debugging techniques for big data analytics, without impacting the DISC framework's performance. This dissertation affirms the feasibility of automated debugging and testing techniques for big data analytics---techniques that were previously considered infeasible for large-scale data processing.
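The delta-debugging idea underlying the fault-localization work can be sketched in its classic form (a simplified ddmin over input records; the contribution described above is making this tractable at scale by pruning the search with data provenance):

```python
def ddmin(records, fails):
    """Simplified delta debugging: shrink `records` to a small subset
    on which `fails` still returns True, by repeatedly trying to drop
    one chunk (halves, then quarters, ...) of the remaining records."""
    n = 2
    while len(records) >= 2:
        chunk = max(1, len(records) // n)
        reduced = False
        for i in range(0, len(records), chunk):
            candidate = records[:i] + records[i + chunk:]  # drop one chunk
            if candidate and fails(candidate):
                records = candidate        # smaller failing input found
                n = max(n - 1, 2)          # coarsen granularity again
                reduced = True
                break
        if not reduced:
            if n >= len(records):          # finest granularity exhausted
                break
            n = min(n * 2, len(records))   # refine granularity
    return records
```

Each call to `fails` corresponds to re-running the job on a candidate input, which is why provenance-based pruning of the candidate space matters so much in a DISC setting.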