    Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays

    Local moments are used for local regression, to compute statistical measures such as sums, averages, and standard deviations, and to approximate probability distributions. We consider the case where the data source is a very large I/O array of size n and we want to compute the first N local moments, for some constant N. Without precomputation, this requires O(n) time. We develop a sequence of algorithms of increasing sophistication that use precomputation and additional buffer space to speed up queries. The simpler algorithms partition the I/O array into consecutive ranges called bins, and they are applicable not only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM, etc.). With N buffers of size √n, the time complexity drops to O(√n). A more sophisticated approach uses hierarchical buffering and has logarithmic time complexity, O(b log_b n), when using N hierarchical buffers of size n/b. Using Overlapped Bin Buffering, we show that only a single buffer is needed, as with wavelet-based algorithms, but using much less storage. Applications exist in multidimensional and statistical databases over massive data sets, interactive image processing, and visualization.
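
    As an illustration of the simpler, non-hierarchical scheme, the following sketch partitions an array into bins of width about √n and buffers one SUM per bin; a range SUM then reads whole bins from the buffer and only the ragged edges from the raw array, for O(√n) work. The function names and the in-memory list standing in for the I/O array are ours, not the paper's.

```python
import math

def build_bin_buffer(data):
    """Precompute one SUM buffer: the total of each bin of width ~sqrt(n)."""
    n = len(data)
    b = max(1, math.isqrt(n))               # bin width ~ sqrt(n)
    bins = [sum(data[i:i + b]) for i in range(0, n, b)]
    return b, bins

def range_sum(data, b, bins, lo, hi):
    """SUM over data[lo:hi] in O(sqrt(n)): whole bins come from the
    buffer, partial bins at the edges from the raw array."""
    total = 0
    i = lo
    while i < hi:
        if i % b == 0 and i + b <= hi:      # bin fully inside the range
            total += bins[i // b]
            i += b
        else:                               # ragged edge: read raw cells
            total += data[i]
            i += 1
    return total

data = list(range(100))
b, bins = build_bin_buffer(data)
assert range_sum(data, b, bins, 7, 93) == sum(data[7:93])
```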

    How to evaluate multiple range-sum queries progressively

    Decision support system users typically submit batches of range-sum queries simultaneously rather than issuing individual, unrelated queries. We propose a wavelet-based technique that exploits I/O sharing across a query batch to evaluate the set of queries progressively and efficiently. The challenge is that controlling the structure of errors across query results now becomes more critical than minimizing the error of each individual query. Consequently, we define a class of structural error penalty functions and show how they are controlled by our technique. Experiments demonstrate that our technique is efficient as an exact algorithm, and that the progressive estimates are accurate even after less than one I/O per query.
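
    The sketch below illustrates only the general wavelet idea behind progressive estimation, not the paper's batch I/O-sharing algorithm: the data is Haar-transformed once, and a range sum is estimated from just the k largest-magnitude coefficients, refining as k grows. All names are ours, and the reconstruction here is O(n) rather than I/O-efficient.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=float).copy()
    out = np.empty(len(x))
    pos = len(x)
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)
        det = (x[0::2] - x[1::2]) / np.sqrt(2.0)
        pos -= len(det)
        out[pos:pos + len(det)] = det       # finest details at the end
        x = avg
    out[0] = x[0]                           # overall (scaled) average
    return out

def haar_inverse(c):
    """Invert haar_transform."""
    c = np.asarray(c, dtype=float)
    x = c[:1].copy()
    pos = 1
    while pos < len(c):
        det = c[pos:pos + len(x)]
        nxt = np.empty(2 * len(x))
        nxt[0::2] = (x + det) / np.sqrt(2.0)
        nxt[1::2] = (x - det) / np.sqrt(2.0)
        pos += len(det)
        x = nxt
    return x

def progressive_range_sum(c, lo, hi, k):
    """Estimate sum(data[lo:hi]) from the k largest-magnitude coefficients."""
    keep = np.zeros_like(c)
    top = np.argsort(np.abs(c))[-k:]
    keep[top] = c[top]
    return haar_inverse(keep)[lo:hi].sum()

data = np.arange(16, dtype=float)
c = haar_transform(data)
for k in (4, 8, 16):                        # estimates refine as k grows
    print(k, progressive_range_sum(c, 3, 13, k), data[3:13].sum())
```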

    Representation and Exploitation of Event Sequences

    The Ten Commandments, the thirty best smartphones on the market, and the five most wanted people by the FBI. Our lives are ruled by sequences: sequences of thoughts, sequences of numbers, sequences of events... a history book is nothing more than a compilation of events, and our favorite film is just a sequence of scenes. All of them have something in common: relevant information can be extracted from them. Frequently, by accumulating some value over the elements of a sequence we can reach hidden information (e.g., the number of passengers transported by a bus on a journey is the sum of the passengers who boarded across the sequence of stops made); at other times, reordering the elements by one of their characteristics eases access to the elements of interest (e.g., the books published in 2019 can be ordered chronologically, by author, by literary genre, or even by a combination of characteristics); but we will always seek to store them in the smallest possible space. This thesis therefore proposes technological solutions for the storage and subsequent processing of event sequences, focusing on three fundamental aspects found in any application that needs to manage them: compressed and dynamic storage, aggregation or accumulation of data over the elements of the sequence, and reordering of the elements by their different characteristics or dimensions. The first contribution of this work is a compact structure for the dynamic compression of event sequences. This structure can compress any sequence in a single pass, that is, it compresses in real time as elements arrive. This contribution is a milestone in the world of compression since, to date, it is the first proposal for a general-purpose variable-to-variable dynamic compressor. Regarding aggregation, a data-warehouse-like proposal is presented that stores information about any characteristic of the events in a sequence in an aggregated, compact, and accessible way. Following the philosophy of current data warehouses, we avoid repeating cumulative operations and speed up aggregate queries by preprocessing the information and keeping it in this separate structure. Finally, this thesis addresses the problem of indexing event sequences by their different characteristics and possible reorderings. A new approach is presented that simultaneously keeps the elements of a sequence ordered by different characteristics using compact structures, making it possible to query the information and operate on the elements of the sequence under any possible reordering in a simple and efficient way.
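
    The thesis's own structures are compact (succinct) ones and beyond an abstract-sized sketch, but the cumulative-query pattern it accelerates can be illustrated with a standard Fenwick (binary indexed) tree, which supports online updates and prefix aggregates in O(log n); the bus-passenger example is the one from the abstract.

```python
class FenwickTree:
    """Binary indexed tree: O(log n) point updates and prefix sums.
    A standard structure illustrating the aggregation idea; the thesis
    itself uses compact/succinct structures instead."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta at 0-based position i."""
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Sum of positions 0..i-1."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

# Passengers boarding at each stop of a bus route:
boarded = [5, 3, 0, 7, 2]
ft = FenwickTree(len(boarded))
for stop, p in enumerate(boarded):
    ft.add(stop, p)
assert ft.prefix_sum(4) == 15   # boarded over the first four stops
```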

    Cube data model for multilevel statistics computation of live execution traces

    Execution trace logs are used to analyze system run-time behaviour and detect problems. Trace analysis tools usually read the input logs and gather either a detailed or a brief summary of them to process and inspect in later analysis steps. However, in live tracing mode, continuous and lengthy trace streams make it difficult to record all events indefinitely, or even a detailed summary of the whole stream. This situation is further complicated when the system aims to compare different parts of the trace and provide a multilevel, multidimensional analysis. This paper presents an architecture, with corresponding data structures and algorithms, to process stream events, generate an adequate summary (detailed enough for recent data and succinct enough for old data), and organize them to enable efficient multilevel and multidimensional analysis, similar to OLAP analyses in database applications. The proposed solution arranges data compactly in interval form and enables range queries over arbitrary time durations. Since this feature makes it possible to compare different system parameters across different time ranges, it significantly improves the system's ability to provide a comprehensive trace analysis. Although Linux operating system trace logs are used to evaluate the solution, we propose a generic architecture that can be used to summarize various types of stream data.
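
    A drastically simplified sketch of the "detailed for recent, succinct for old" idea: per-interval statistics are kept at the finest level, and when a level overflows, its two oldest intervals are merged into a coarser one, so old data survives only in aggregate. This is our illustration, not the paper's architecture.

```python
from collections import deque

class DecayingHistory:
    """Fine-grained stats for recent data; older intervals are merged
    into ever-coarser ones. A toy stand-in for the paper's
    interval-based multilevel summary."""
    def __init__(self, capacity=4, levels=3):
        self.capacity = capacity
        self.levels = [deque() for _ in range(levels)]  # [0] = finest

    def add(self, start, end, value):
        self._push(0, (start, end, value))

    def _push(self, lvl, interval):
        q = self.levels[lvl]
        q.append(interval)
        # on overflow, merge the two oldest intervals into the next level
        # (the coarsest level simply keeps growing in this sketch)
        if len(q) > self.capacity and lvl + 1 < len(self.levels):
            (s1, _, v1), (_, e2, v2) = q.popleft(), q.popleft()
            self._push(lvl + 1, (s1, e2, v1 + v2))

    def range_sum(self, lo, hi):
        """Total over [lo, hi); coarse intervals count whole once they
        overlap, so old ranges are answered only approximately."""
        return sum(v for level in self.levels
                     for (s, e, v) in level if s < hi and e > lo)

h = DecayingHistory()
for t in range(10):
    h.add(t, t + 1, t)        # one measurement per second
print(h.range_sum(0, 10))     # 45: totals survive even after merging
```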

    High throughput prediction of the long term stability of pharmaceutical macromolecules from short term multi-instrument spectroscopic data

    The field of pharmaceutical chemistry is currently struggling with the question of how to relate changes in the physical form of a macromolecular biopharmaceutical, such as a therapeutic protein, to changes in the drug's efficacy, safety, and long-term stability (ESS). A great number of experimental methods are typically used to investigate the differences between forms of a macromolecule, yet conclusions regarding changes in ESS are frequently tentative. An opportunity exists, however, to relate changes in form to changes in ESS. At least once during the development of a new drug, a study is undertaken (at great expense) of the ESS of the drug upon perturbation by multiple manufacturing, formulation, storage, and transportation variables. The acquired data are then used to build a model that relates changes in ESS to those variables. It is not common in the pharmaceutical industry, however, to relate changes in comprehensive ESS data sets to comprehensive measurements of changes in macromolecular form. We bridge the gap between physical measurements of a macromolecule's form and measurements of its long-term stability, using two data sets collected in a collaboration between our group at the University of Kansas and a group at the Ludwig Maximilians University in Munich, Germany. The long-term stability data, collected by the team in Germany, contain measurements of the chemical and conformational stability of Granulocyte Colony Stimulating Factor (GCSF) over a period of two years in 16 different liquid formulations. The short-term physical data, collected in our lab, comprise spectroscopic characterization of the response of GCSF to thermal unfolding. The same 16 liquid formulations of GCSF were used in each study, allowing us to fit models that predict the long-term stability of GCSF from short-term measurements. We first apply a novel data reduction method to the short-term data; it selects data in the neighborhood of thermal unfolding transitions and automates traditional comparative analyses. We then model the long-term stability measurements using a linear technique, least squares fits, and a nonlinear one, radial basis function networks (RBFN). Using a Pearson correlation coefficient permutation test, we find that many of the fitted results have less than a 1 percent probability of occurring by chance.
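
    The permutation test mentioned at the end is standard and easy to sketch: the observed Pearson correlation is compared against correlations obtained after randomly permuting one variable, and the p-value is the fraction of permutations that do at least as well. The data below are synthetic; only the count of 16 formulations comes from the abstract.

```python
import numpy as np

def pearson_permutation_test(x, y, n_perm=10000, seed=None):
    """p-value: fraction of permutations of y whose |r| with x is at
    least as large as the observed |r| (add-one smoothed)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    r_obs = abs(np.corrcoef(x, y)[0, 1])
    hits = sum(abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= r_obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Synthetic stand-in: 16 formulations, as in the study.
rng = np.random.default_rng(0)
short_term = rng.normal(size=16)
long_term = 0.8 * short_term + rng.normal(scale=0.5, size=16)
print(pearson_permutation_test(short_term, long_term, seed=1))  # small p
```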

    Query Optimization and Execution for Multi-Dimensional OLAP

    Online Analytical Processing (OLAP) is a database paradigm that supports rich analysis of multi-dimensional data. While current OLAP tools are primarily constructed as extensions to conventional relational databases, the unique modeling and processing requirements of OLAP systems often make for a relatively awkward fit with RDBMSs in general, and with their embedded string-based query languages in particular. In this thesis, we discuss the design, implementation, and evaluation of a robust multi-dimensional OLAP server. Specifically, we focus on several distinct but related themes. To begin, we investigate the integration of an open source embedded storage engine with our own OLAP-specific indexing and access methods. We then present a comprehensive OLAP query algebra that allows developers to create expressive OLAP queries in native client languages such as Java. By utilizing a formal algebraic model, we are able to support an intuitive object-oriented query API, as well as a powerful query optimization and execution engine. The thesis describes both the optimization methodology and the related algorithms for the efficient execution of the associated query plans. The end result of our research is a comprehensive OLAP DBMS prototype that clearly demonstrates new opportunities for improving the accessibility, functionality, and performance of current OLAP database management systems.
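
    As a hypothetical illustration (in Python rather than the thesis's Java, with names that are entirely ours) of what an algebraic, string-free query API can look like: each operator returns a new immutable query object, so plans compose like algebra terms and can be inspected or optimized before execution.

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class CubeQuery:
    """Tiny algebraic query builder: every operator yields a new
    immutable query, so plans compose instead of being string-built."""
    slices: tuple = field(default_factory=tuple)   # (dimension, value)
    group_by: tuple = field(default_factory=tuple)
    measure: str = "sales"

    def slice(self, dim, value):
        return replace(self, slices=self.slices + ((dim, value),))

    def roll_up(self, *dims):
        return replace(self, group_by=self.group_by + dims)

    def run(self, rows):
        """Naive executor over dicts; a real engine would derive index
        access and aggregation order from the same algebra tree."""
        rows = [r for r in rows if all(r[d] == v for d, v in self.slices)]
        out = {}
        for r in rows:
            key = tuple(r[d] for d in self.group_by)
            out[key] = out.get(key, 0) + r[self.measure]
        return out

rows = [{"year": 2008, "region": "EU", "sales": 10},
        {"year": 2008, "region": "NA", "sales": 7},
        {"year": 2009, "region": "EU", "sales": 4}]
q = CubeQuery().slice("year", 2008).roll_up("region")
print(q.run(rows))   # {('EU',): 10, ('NA',): 7}
```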

    Integrating OLAP and Ranking: The Ranking-Cube Methodology

    Recent years have witnessed an enormous growth of data in business, industry, and Web applications. Database search often returns a large collection of results, which poses challenges to both efficient query processing and effective digestion of the query results. To address this problem, ranked search has been introduced to database systems. We study the problem of On-Line Analytical Processing (OLAP) of ranked queries, where ranked queries are conducted over arbitrary subsets of the data defined by multi-dimensional selections. While pre-computation and multi-dimensional aggregation is the standard solution for OLAP, materializing dynamic ranking results is unrealistic because the ranking criteria are not known until query time. To overcome this difficulty, in this thesis we develop a new ranking cube method that performs semi-offline materialization and semi-online computation. Its complete life cycle, including cube construction, incremental maintenance, and query processing, is also discussed. We further extend the ranking cube in three directions: first, answering queries over high-dimensional data; second, answering queries that involve joins over multiple relations; and third, answering general preference queries beyond ranked queries, such as skyline queries. Our performance studies show that the ranking cube is orders of magnitude faster than previous approaches.
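
    A toy version of the underlying tension: the selection dimensions can be bucketed offline, but the scoring function arrives only at query time, so the top-k must be computed online over the selected cells. This drastically simplifies the ranking-cube machinery; all names are ours.

```python
import heapq
from collections import defaultdict

def build_cells(rows, dims):
    """Offline: bucket tuples by their dimensional cell, so a query
    only touches cells matching its selection."""
    cells = defaultdict(list)
    for r in rows:
        cells[tuple(r[d] for d in dims)].append(r)
    return cells

def top_k(cells, dims, selection, score, k):
    """Top-k under a query-time scoring function, restricted to the
    subset selected on the cube's dimensions."""
    candidates = (r
                  for cell, bucket in cells.items()
                  if all(selection.get(d, c) == c for d, c in zip(dims, cell))
                  for r in bucket)
    return heapq.nlargest(k, candidates, key=score)

rows = [{"city": "NY", "type": "A", "x": 3, "y": 9},
        {"city": "NY", "type": "B", "x": 8, "y": 1},
        {"city": "LA", "type": "A", "x": 5, "y": 5}]
cells = build_cells(rows, ("city", "type"))
# The ranking criterion arrives only at query time:
print(top_k(cells, ("city", "type"), {"city": "NY"},
            lambda r: 2 * r["x"] + r["y"], k=1))
```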

    A Framework for Real-time Analysis in OLAP Systems

    OLAP systems are designed to quickly answer multi-dimensional queries against large data warehouse systems. Constructing data cubes and their associated indexes is time consuming and computationally expensive, and for this reason data cubes are only refreshed periodically. Increasingly, organizations are demanding both historical and predictive analysis based on the most current data. This trend has also placed the requirement on OLAP systems to merge updates at a much faster rate than before. In this thesis, we propose a framework for OLAP systems that enables updates to be merged with data cubes in soft real-time. We apply a strategy of local partitioning of the data cube, and maintain a "hot" partition for each materialized view into which update data is merged. We augment this strategy with multi-core processing using the OpenMP library to accelerate data cube construction and query resolution. Experiments using a data cube with 10,000,000 tuples and an update set of 100,000 tuples show that our framework achieves a 99% performance improvement when updating the data cube, a 76% performance increase when constructing a new data cube, and a 72% performance increase when resolving a range query against a data cube with 1,000,000 tuples.
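
    A minimal sketch of the hot-partition strategy, assuming a cube stored as a cell-to-aggregate map: updates accumulate in a small delta, queries combine both partitions, and a periodic merge folds the delta into the base. The OpenMP parallelization is omitted; all names are ours.

```python
class HotPartitionCube:
    """'Hot partition' idea: updates land in a small in-memory delta;
    queries combine it with the big, periodically rebuilt cube."""
    def __init__(self, base):
        self.base = dict(base)   # materialized cube: cell -> aggregate
        self.hot = {}            # recent updates not yet merged

    def update(self, cell, delta):
        self.hot[cell] = self.hot.get(cell, 0) + delta

    def query(self, cell):
        return self.base.get(cell, 0) + self.hot.get(cell, 0)

    def merge(self):
        """Periodic merge of the hot partition into the base cube."""
        for cell, delta in self.hot.items():
            self.base[cell] = self.base.get(cell, 0) + delta
        self.hot.clear()

cube = HotPartitionCube({("2008", "EU"): 100})
cube.update(("2008", "EU"), 5)       # soft real-time update
print(cube.query(("2008", "EU")))    # 105, visible before the merge
cube.merge()
```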

    Sorting improves word-aligned bitmap indexes

    Bitmap indexes must be compressed to reduce input/output costs and minimize CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid (WAH) compression. These techniques are sensitive to the order of the rows: a simple lexicographical sort can divide the index size by 9 and make indexes several times faster. We investigate row-reordering heuristics. Simply permuting the columns of the table can increase the sorting efficiency by 40%. Secondary contributions include efficient algorithms to construct and aggregate bitmaps. The effect of word length is also reviewed by constructing 16-bit, 32-bit, and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are slightly faster than 32-bit indexes despite being nearly twice as large.
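
    The connection between row order and index size is easy to demonstrate: with one bitmap per distinct value, counting runs (a proxy for RLE/WAH-compressed size) before and after sorting shows how sorting creates long runs. This toy sketch ignores word alignment entirely.

```python
from itertools import groupby

def rle_runs(bits):
    """Number of runs in a bitmap: a proxy for its RLE-compressed size."""
    return sum(1 for _ in groupby(bits))

def index_size(rows):
    """A bitmap index stores one bitmap per distinct column value."""
    return sum(rle_runs([int(r == v) for r in rows]) for v in set(rows))

rows = ["b", "a", "b", "a", "b", "a"]
print(index_size(rows))          # many short runs: 6 + 6 = 12
print(index_size(sorted(rows)))  # sorting makes long runs: 2 + 2 = 4
```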