Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays
Local moments are used for local regression, to compute statistical measures
such as sums, averages, and standard deviations, and to approximate probability
distributions. We consider the case where the data source is a very large I/O
array of size n and we want to compute the first N local moments, for some
constant N. Without precomputation, this requires O(n) time. We develop a
sequence of algorithms of increasing sophistication that use precomputation and
additional buffer space to speed up queries. The simpler algorithms partition
the I/O array into consecutive ranges called bins, and they are applicable not
only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM,
etc.). With N buffers of size sqrt(n), the time complexity drops to O(sqrt(n)). A
more sophisticated approach uses hierarchical buffering and achieves logarithmic
time complexity, O(b log_b n), when using N hierarchical buffers of size n/b.
Using Overlapped Bin Buffering, we show that only a single buffer is needed, as
with wavelet-based algorithms, but using much less storage. Applications exist
in multidimensional and statistical databases over massive data sets,
interactive image processing, and visualization.
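The simpler bin-buffering scheme can be sketched in a few lines. The following is an illustrative Python sketch, not the paper's I/O-aware implementation: the array is partitioned into bins of roughly sqrt(n) elements, one buffer holds each bin's sum, and a range-sum query then touches at most two partial bins plus O(sqrt(n)) buffer entries.

```python
import math

class BinBufferedSum:
    """Bin buffering for range-sum queries (illustrative sketch).

    The array is split into bins of size ~sqrt(n); a buffer stores each
    bin's sum, so a range query scans at most two partial bins plus
    O(sqrt(n)) whole-bin entries instead of O(n) elements.
    """

    def __init__(self, data):
        self.data = list(data)
        n = len(self.data)
        self.b = max(1, math.isqrt(n))          # bin size ~ sqrt(n)
        self.bins = [sum(self.data[i:i + self.b])
                     for i in range(0, n, self.b)]

    def range_sum(self, lo, hi):
        """Sum of data[lo:hi] in O(sqrt(n)) time."""
        total = 0
        # leading partial bin, element by element
        while lo < hi and lo % self.b != 0:
            total += self.data[lo]
            lo += 1
        # whole bins, answered from the buffer
        while hi - lo >= self.b:
            total += self.bins[lo // self.b]
            lo += self.b
        # trailing partial bin
        total += sum(self.data[lo:hi])
        return total
```

The same layout supports any algebraic aggregate (MAX, AVERAGE, etc.) by storing the corresponding per-bin value in the buffer.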
How to evaluate multiple range-sum queries progressively
Decision support system users typically submit batches of range-sum queries simultaneously rather than issuing individual, unrelated queries. We propose a wavelet-based technique that exploits I/O sharing across a query batch to evaluate the set of queries progressively and efficiently. The challenge is that controlling the structure of errors across query results now becomes more critical than minimizing the error of each individual query. Consequently, we define a class of structural error penalty functions and show how our technique controls them. Experiments demonstrate that our technique is efficient as an exact algorithm, and that the progressive estimates are accurate even after less than one I/O per query.
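The progressive flavor of wavelet evaluation can be illustrated with a toy Haar example. The sketch below is not the paper's batch-aware technique; it only shows the underlying idea that reading the k largest Haar coefficients first yields a range-sum estimate that converges to the exact answer as more coefficients arrive.

```python
def haar(values):
    """Orthonormal Haar transform of a power-of-two-length sequence."""
    v = [float(x) for x in values]
    coeffs = []
    s = 2 ** 0.5
    while len(v) > 1:
        avgs = [(v[i] + v[i + 1]) / s for i in range(0, len(v), 2)]
        dets = [(v[i] - v[i + 1]) / s for i in range(0, len(v), 2)]
        coeffs = dets + coeffs       # coarse details end up first
        v = avgs
    return v + coeffs                # [overall coeff, coarse..fine details]

def inverse_haar(coeffs):
    """Inverse of haar(): rebuild the signal level by level."""
    v = [coeffs[0]]
    pos = 1
    s = 2 ** 0.5
    while pos < len(coeffs):
        dets = coeffs[pos:pos + len(v)]
        nxt = []
        for a, d in zip(v, dets):
            nxt += [(a + d) / s, (a - d) / s]
        v, pos = nxt, pos + len(dets)
    return v

def progressive_range_sum(data, lo, hi, k):
    """Estimate sum(data[lo:hi]) using only the k largest coefficients."""
    c = haar(data)
    keep = set(sorted(range(len(c)), key=lambda i: abs(c[i]),
                      reverse=True)[:k])
    truncated = [c[i] if i in keep else 0.0 for i in range(len(c))]
    return sum(inverse_haar(truncated)[lo:hi])
```

With k equal to the data length the estimate is exact; smaller k trades accuracy for fewer coefficient reads.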
Representation and Exploitation of Event Sequences
Official Doctoral Programme in Computing (5009V01). [Abstract]
The Ten Commandments, the thirty best smartphones in the market and
the five most wanted people by the FBI. Our life is ruled by sequences:
thought sequences, number sequences, event sequences. . . a history book
is nothing more than a compilation of events and our favorite film is
just a sequence of scenes. All of them have something in common: relevant
information can be extracted from them. Frequently, by accumulating data over
the elements of a sequence we can uncover hidden information (e.g., the number
of passengers transported by a bus on a journey is the sum of the passengers
who boarded at each stop in the sequence of stops made); at other times,
reordering the elements by one of their characteristics eases access to the
elements of interest (e.g., the books published in 2019 can be ordered
chronologically, by author, by literary genre, or even by a combination of
characteristics); but in every case the goal is to store them in the smallest
possible space.
Thus, this thesis proposes technological solutions for the storage and
subsequent processing of events, focusing specifically on three fundamental
aspects found in any application that needs to manage them: compressed and
dynamic storage, aggregation or accumulation of data over the elements of the
sequence, and reordering of the sequence elements by their different
characteristics or dimensions.
The first contribution of this work is a compact structure for the
dynamic compression of event sequences. This structure allows any
sequence to be compressed in a single pass, that is, it is capable of
compressing in real time as elements arrive. This contribution is a milestone
in the field of compression since, to date, it is the first proposal of a
general-purpose variable-to-variable dynamic compressor.
Regarding aggregation, a data-warehouse-like structure is proposed, capable of
storing information on any characteristic of the events in a sequence in an
aggregated, compact, and accessible way. Following the
philosophy of current data warehouses, we avoid repeating cumulative
operations and speed up aggregate queries by preprocessing the
information and keeping it in this separate structure.
Finally, this thesis addresses the problem of indexing event sequences
considering their different characteristics and possible reorderings. A new
approach for simultaneously keeping the elements of a sequence ordered
by different characteristics is presented through compact structures.
Thus, it is possible to consult the information and perform operations
on the elements of the sequence using any possible rearrangement in a
simple and efficient way.
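As a toy illustration of the aggregation theme (not the thesis's compact structures), a Fenwick tree keeps cumulative counts over a dynamic event sequence, supporting both real-time updates as events arrive and prefix aggregates in O(log n):

```python
class FenwickTree:
    """Binary indexed tree: O(log n) point update and prefix sum."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta to element i (0-based)."""
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i              # jump to the next covering node

    def prefix_sum(self, i):
        """Sum of elements [0, i)."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i              # drop the lowest set bit
        return total

# passengers boarding at each of 8 bus stops, arriving as events
boarded = FenwickTree(8)
for stop, count in [(0, 3), (1, 5), (3, 2), (5, 4)]:
    boarded.add(stop, count)
# passengers picked up over the first 4 stops
print(boarded.prefix_sum(4))  # 3 + 5 + 2 = 10
```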
Cube data model for multilevel statistics computation of live execution traces
Execution trace logs are used to analyze system run-time behaviour and detect
problems. Trace analysis tools usually read the input logs and gather either a
detailed or a brief summary to process and inspect later, during the analysis
steps. However, the continuous and lengthy trace streams produced in live
tracing mode make it impractical to record all events indefinitely, or even a
detailed summary of the whole stream. The situation is further complicated
when the system aims to compare different parts of the trace and provide a
multilevel and multidimensional analysis.
This paper presents an architecture, with corresponding data structures and
algorithms, to process stream events, generate an adequate summary (detailed
enough for recent data and succinct enough for old data), and organize them to
enable efficient multilevel and multidimensional analysis, similar to OLAP
analyses in database applications. The proposed solution arranges data
compactly using interval forms and supports range queries over arbitrary time
durations. Since this feature makes it possible to compare different system
parameters over different time ranges, it significantly improves the system's
ability to provide comprehensive trace analysis. Although Linux operating
system trace logs are used to evaluate the solution, we propose a generic
architecture that can be used to summarize various types of stream data.
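The "detailed for recent data, succinct for old data" idea can be sketched with a cascading bucket scheme. This is an illustrative simplification, not the paper's interval-based structure: each level holds a fixed number of buckets, and when the fine level overflows, its oldest buckets are merged into a coarser level.

```python
from collections import deque

class DecayingHistory:
    """Keep recent event counts at fine granularity while older counts
    are merged into coarser buckets (illustrative sketch only)."""

    def __init__(self, levels=3, per_level=4):
        self.levels = [deque() for _ in range(levels)]  # level 0 = finest
        self.per_level = per_level

    def append(self, count):
        self.levels[0].append(count)
        # cascade: when a level overflows, merge its two oldest
        # buckets into one bucket at the next (coarser) level
        for lvl in range(len(self.levels) - 1):
            if len(self.levels[lvl]) > self.per_level:
                merged = (self.levels[lvl].popleft()
                          + self.levels[lvl].popleft())
                self.levels[lvl + 1].append(merged)

    def total(self):
        """Aggregates are preserved across merges: the grand total
        is unchanged no matter how buckets have been coarsened."""
        return sum(sum(level) for level in self.levels)
```

Merging loses per-event detail for old data but keeps the aggregate exact, which is what makes range comparisons across old and recent time areas possible at bounded memory.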
High throughput prediction of the long term stability of pharmaceutical macromolecules from short term multi-instrument spectroscopic data
The field of pharmaceutical chemistry is currently struggling with the question of how to relate changes in the physical form of a macromolecular biopharmaceutical, such as a therapeutic protein, to changes in the drug's efficacy, safety, and long term stability (ESS). A great number of experimental methods are typically utilized to investigate the differences between forms of a macromolecule, yet conclusions regarding changes in ESS are frequently tentative. An opportunity exists, however, to relate changes in form to changes in ESS. At least once during the development of a new drug, a study is undertaken (at great expense) of the ESS of the drug upon perturbation by multiple manufacturing, formulation, storage and transportation variables. The data acquired is then used to build a model that relates changes in ESS to manufacturing, formulation, storage and transportation variables. It is not common in the pharmaceutical industry, however, to relate changes in comprehensive ESS data sets to comprehensive measurements of changes in macromolecular form. We bridge the gap between physical measurements of a macromolecule's form and measurements of its long term stability, utilizing two data sets collected in a collaboration between our group at the University of Kansas and a group at the Ludwig Maximilians University in Munich, Germany. The long term stability data, collected by the team in Germany, contains measurements of the chemical and conformational stability of Granulocyte Colony Stimulating Factor (GCSF) over a period of two years in 16 different liquid formulations. The short term physical data, collected in our lab, comprises spectroscopic characterization of the response of GCSF to thermal unfolding. The same 16 liquid formulations of GCSF were used in each study, allowing us to fit models predicting the long term stability of GCSF from short term measurements. We first apply a novel data reduction method to the short term data.
This method selects data in the neighborhood of thermal unfolding transitions and automates traditional comparative analyses. We then model the long term stability measurements using a linear technique, least squares fits, and a nonlinear one, radial basis function networks (RBFN). Using a Pearson correlation coefficient permutation test, we find that many of the fitted results have less than a 1 percent probability of occurring by chance.
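A Pearson correlation permutation test of the kind mentioned can be sketched as follows; this is a generic implementation of the statistical procedure, not the authors' exact pipeline.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(x, y, trials=10_000, seed=0):
    """Fraction of random label shuffles whose |r| meets or exceeds the
    observed |r| -- an estimate of the probability that the observed
    correlation arose by chance."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y = list(y)                      # shuffle a private copy
    hits = 0
    for _ in range(trials):
        rng.shuffle(y)
        if abs(pearson(x, y)) >= observed:
            hits += 1
    return hits / trials
```

A fitted result passes the 1 percent criterion when `permutation_p_value` returns a value below 0.01.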
Query Optimization and Execution for Multi-Dimensional OLAP
Online Analytical Processing (OLAP) is a database paradigm that supports the
rich analysis of multi-dimensional data. While current OLAP tools are primarily
constructed as extensions to conventional relational databases, the unique modeling
and processing requirements of OLAP systems often make for a relatively awkward
fit with relational DBMSs in general, and with their embedded string-based query
languages in particular. In this thesis, we discuss the design, implementation,
and evaluation of a robust multi-dimensional OLAP server, focusing on several
distinct but related themes. To begin, we investigate the integration of an
open-source embedded
storage engine with our own OLAP-specific indexing and access methods. We then
present a comprehensive OLAP query algebra that ultimately allows developers to
create expressive OLAP queries in native client languages such as Java. By utilizing
a formal algebraic model, we are able to support an intuitive Object Oriented query
API, as well as a powerful query optimization and execution engine. The thesis
describes both the optimization methodology and the related algorithms for the
efficient execution of the associated query plans. The end result of our research is a
comprehensive OLAP DBMS prototype that clearly demonstrates new opportunities
for improving the accessibility, functionality, and performance of current OLAP database
management systems.
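The kind of native-language query construction described can be suggested with a hypothetical fluent builder. All names here (Query, slice, dice, rollup) are illustrative inventions, not the thesis's actual API; the point is that an algebraic model lets query plans compose as expressions in the host language rather than as strings.

```python
class Query:
    """Hypothetical fluent builder over an algebraic OLAP model. Each
    method returns a new Query, so plans compose like algebra
    expressions and can be inspected or optimized before execution."""

    def __init__(self, cube, ops=()):
        self.cube, self.ops = cube, tuple(ops)

    def _with(self, op):
        return Query(self.cube, self.ops + (op,))

    def slice(self, dim, value):        # fix one dimension member
        return self._with(("SLICE", dim, value))

    def dice(self, dim, values):        # restrict to a member subset
        return self._with(("DICE", dim, tuple(values)))

    def rollup(self, dim):              # aggregate away a dimension
        return self._with(("ROLLUP", dim))

    def plan(self):
        """A plan summary an optimizer could reorder before running."""
        return " -> ".join(op[0] for op in self.ops)

q = (Query("sales")
     .slice("year", 2008)
     .dice("region", ["EU", "NA"])
     .rollup("product"))
print(q.plan())  # SLICE -> DICE -> ROLLUP
```

Because the plan is a data structure rather than a string, an optimizer can rewrite it (e.g., push selections below aggregations) before execution.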
Integrating OLAP and Ranking: The Ranking-Cube Methodology
Recent years have witnessed an enormous growth of data in business, industry, and Web applications. Database search often returns a large collection of results, which poses challenges to both efficient query processing and effective digestion of the query results. To address this problem, ranked search has been introduced to database systems. We study the problem of On-Line Analytical Processing (OLAP) of ranked queries, where ranked queries are conducted over arbitrary subsets of the data defined by multi-dimensional selections. While pre-computation and multi-dimensional aggregation is the standard solution for OLAP, materializing dynamic ranking results is unrealistic because the ranking criteria are not known until query time. To overcome this difficulty, this thesis develops a new ranking-cube method that performs semi-online materialization and semi-online computation. Its complete life cycle, including cube construction, incremental maintenance, and query processing, is also discussed. We further extend the ranking cube in three directions: first, answering queries over high-dimensional data; second, answering queries that involve joins over multiple relations; and third, answering general preference queries beyond ranked queries, such as skyline queries. Our performance studies show that the ranking cube is orders of magnitude faster than previous approaches.
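The difficulty can be made concrete with the naive baseline that the ranking cube is designed to beat: since the scoring function arrives only at query time, a plain evaluator must scan the data, apply the multi-dimensional selection, and rank the survivors on the fly.

```python
import heapq

def ranked_query(records, selection, score, k):
    """Naive baseline: filter by a multi-dimensional selection, then
    rank by a query-time scoring function with a top-k heap. The
    ranking cube avoids this full scan via semi-online
    materialization."""
    return heapq.nlargest(k, (r for r in records if selection(r)),
                          key=score)

rows = [{"region": "EU", "price": p, "rating": r}
        for p, r in [(10, 4.5), (25, 3.9), (7, 4.9), (40, 4.1)]]
best = ranked_query(rows,
                    selection=lambda r: r["price"] < 30,  # dimension predicate
                    score=lambda r: r["rating"],          # known only at query time
                    k=2)
print([r["price"] for r in best])  # [7, 10]
```

The scan cost grows with the data size regardless of k, which is exactly why purely online evaluation does not scale and some materialization is needed.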
A Framework for Real-time Analysis in OLAP Systems
OLAP systems are designed to quickly answer multi-dimensional queries against large data warehouse systems. Constructing data cubes and their associated indexes is time-consuming and computationally expensive, and for this reason data cubes are refreshed only periodically. Increasingly, organizations are demanding both historical and predictive analysis based on the most current data. This trend has also placed a requirement on OLAP systems to merge updates at a much faster rate than before.
In this thesis, we propose a framework for OLAP systems that enables updates to be merged with data cubes in soft real time. We apply a strategy of local partitioning of the data cube, maintaining a "hot" partition for each materialized view to merge update data. We augment this strategy with multi-core processing, using the OpenMP library to accelerate data cube construction and query resolution.
Experiments using a data cube with 10,000,000 tuples and an update set of 100,000 tuples show that our framework achieves a 99% performance improvement when updating the data cube, a 76% performance increase when constructing a new data cube, and a 72% performance increase when resolving a range query against a data cube with 1,000,000 tuples.
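The hot-partition strategy can be sketched as follows. This is an illustrative simplification of the idea, not the thesis's OpenMP-based implementation: each materialized view keeps a small delta that absorbs updates cheaply, queries combine base and delta, and the delta is folded into the base periodically.

```python
from collections import Counter

class HotPartitionView:
    """Materialized sum-view with a small 'hot' partition that absorbs
    updates, so the large base cube is rebuilt only occasionally
    (illustrative sketch)."""

    def __init__(self, base):
        self.base = Counter(base)   # cell -> aggregated measure
        self.hot = Counter()        # recent, unmerged updates

    def update(self, cell, delta):
        self.hot[cell] += delta     # cheap: touches only the hot partition

    def query(self, cell):
        # soft real time: results reflect updates before any merge
        return self.base[cell] + self.hot[cell]

    def merge(self):
        self.base.update(self.hot)  # fold the hot partition into the base
        self.hot.clear()
```

Because queries always consult both partitions, the merge can be deferred to idle periods without queries ever seeing stale results.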
Sorting improves word-aligned bitmap indexes
Bitmap indexes must be compressed to reduce input/output costs and minimize
CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use
techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid
(WAH) compression. These techniques are sensitive to the order of the rows: a
simple lexicographical sort can divide the index size by 9 and make indexes
several times faster. We investigate row-reordering heuristics. Simply
permuting the columns of the table can increase the sorting efficiency by 40%.
Secondary contributions include efficient algorithms to construct and aggregate
bitmaps. The effect of word length is also reviewed by constructing 16-bit,
32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are
slightly faster than 32-bit indexes, despite being nearly twice as large.
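The sensitivity to row order is easy to demonstrate with a toy model: counting runs per bitmap column, a rough proxy for RLE-compressed size, before and after sorting the rows.

```python
def rle_runs(bits):
    """Number of runs in one bit column -- a proxy for its RLE size."""
    return 1 + sum(b1 != b2 for b1, b2 in zip(bits, bits[1:]))

def bitmap_runs(rows):
    """Total runs across the bitmap index (one 0/1 column per distinct
    attribute value)."""
    values = sorted(set(rows))
    return sum(rle_runs([int(r == v) for r in rows]) for v in values)

rows = ["b", "a", "b", "a", "c", "b", "a", "c"]
unsorted_runs = bitmap_runs(rows)
sorted_runs = bitmap_runs(sorted(rows))
print(unsorted_runs, sorted_runs)  # 17 7
assert sorted_runs < unsorted_runs
```

Sorting clusters equal values, so each column collapses to a few long runs; real word-aligned schemes like WAH exploit exactly this.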