
    On-line analytical processing in distributed data warehouses

    The concepts of 'data warehousing' and 'on-line analytical processing' have attracted growing interest in the research and commercial product communities. Today, the trend is moving away from complex centralized data warehouses towards distributed data marts integrated under a common conceptual schema. However, as the first part of this paper demonstrates, there are many problems and few solutions for large distributed decision support systems in corporations operating worldwide. After presenting the benefits and problems of the distributed approach, the paper outlines possibilities for achieving good performance in distributed on-line analytical processing. Finally, the architectural framework of the prototypical distributed OLAP system CUBESTAR is outlined.
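    The paper's own architecture (CUBESTAR) is not reproduced here, but the core idea of distributed OLAP, fanning a single aggregate query out to several data marts and merging their partial results at a coordinator, can be sketched as follows. The mart names, the query_mart() stand-in, and the figures it returns are purely illustrative assumptions.

```python
# Minimal sketch of distributed OLAP fan-out (not CUBESTAR itself):
# send the same aggregate query to every data mart in parallel and
# merge the partial sums at the coordinator. All names/values are made up.
from concurrent.futures import ThreadPoolExecutor

MARTS = ["emea-mart", "apac-mart", "amer-mart"]   # hypothetical data marts

def query_mart(mart: str) -> dict[str, float]:
    # Stand-in for a remote call that would run the aggregate locally
    # (e.g. a SUM per product) and return only the partial result.
    fake_partials = {
        "emea-mart": {"product_a": 10.0, "product_b": 3.0},
        "apac-mart": {"product_a": 4.0},
        "amer-mart": {"product_a": 6.0, "product_b": 9.0},
    }
    return fake_partials[mart]

def distributed_sum() -> dict[str, float]:
    totals: dict[str, float] = {}
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(query_mart, MARTS):
            for key, value in partial.items():
                totals[key] = totals.get(key, 0.0) + value
    return totals

if __name__ == "__main__":
    print(distributed_sum())   # {'product_a': 20.0, 'product_b': 12.0}
```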

    Improving query performance using distributed computing

    Data warehouses are used to store large amounts of data. This data is often used for On-Line Analytical Processing (OLAP), where short response times are essential for on-line decision support. One of the most important requirements of a data warehouse server is query performance. The principal aspect from the user perspective is how quickly the server processes a given query: “the data warehouse must be fast”. The main focus of our research is finding adequate solutions to improve the query response time of typical OLAP queries and to improve scalability using a distributed computation environment that takes advantage of characteristics specific to the OLAP context. Our proposal provides very good performance and scalability even on huge data warehouses.
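    The abstract does not spell out the distributed computation scheme, but a common way to obtain OLAP speed-ups of this kind is to exploit decomposable aggregates: each node computes a small partial result over its local fragment, and only those partials are merged. The sketch below, with made-up fragments, shows the idea for AVG via (sum, count) pairs; it illustrates the general technique, not the authors' system.

```python
# Decomposable-aggregate sketch: an exact AVG from per-node (sum, count)
# partials, so raw rows never leave the nodes. Fragments are made up.

def local_partial(rows: list[float]) -> tuple[float, int]:
    # Computed on each node over its local fragment of the measure column.
    return sum(rows), len(rows)

def merge_average(partials: list[tuple[float, int]]) -> float:
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0

# Three made-up fragments standing in for three nodes of the cluster.
fragments = ([10.0, 20.0], [30.0], [40.0, 50.0, 60.0])
partials = [local_partial(frag) for frag in fragments]
print(merge_average(partials))   # 35.0 -- exact AVG, no raw rows shipped
```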

    CubiST++: Evaluating Ad-Hoc CUBE Queries Using Statistics Trees

    We report on a new, efficient encoding for the data cube which results in a drastic speed-up of OLAP queries that aggregate along any combination of dimensions over numerical and categorical attributes. We focus on a class of queries called cube queries, which return aggregated values rather than sets of tuples. Our approach, termed CubiST++ (Cubing with Statistics Trees Plus Families), represents a drastic departure from existing relational (ROLAP) and multi-dimensional (MOLAP) approaches in that it does not use the view lattice to compute and materialize new views from existing views in some heuristic fashion. Instead, CubiST++ encodes all possible aggregate views in the leaves of a new data structure called a statistics tree (ST) during a one-time scan of the detailed data. In order to optimize queries involving constraints on hierarchy levels of the underlying dimensions, we select and materialize a family of candidate trees, which represent superviews over the different hierarchical levels of the dimensions. Given a query, our evaluation algorithm selects the smallest tree in the family that can provide the answer. Extensive evaluations of our prototype implementation have demonstrated its superior run-time performance and scalability when compared with existing MOLAP and ROLAP systems.
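    As a rough illustration of the statistics-tree idea (each level corresponds to a dimension, each node has one branch per domain value plus an ALL branch, and leaves hold aggregates), the toy sketch below builds such a tree in one scan and answers cube queries with a single root-to-leaf walk. It is a simplification with made-up dimensions, not the ST data structure or CubiST++ implementation from the paper.

```python
# Toy statistics-tree sketch: dimension levels as nested dicts, a '*' child
# meaning "all values" at each level, and running sums in the leaves.
# Dimensions ("region", "product") and the data are made up.

STAR = "*"

def insert(node: dict, coords: list, amount: float) -> None:
    if not coords:
        node["sum"] = node.get("sum", 0.0) + amount
        return
    head, rest = coords[0], coords[1:]
    for key in (head, STAR):              # update both the value and the ALL branch
        insert(node.setdefault(key, {}), rest, amount)

def query(node: dict, coords: list) -> float:
    for key in coords:
        node = node.get(key, {})
    return node.get("sum", 0.0)

root: dict = {}
for region, product, amount in [("EU", "a", 5.0), ("EU", "b", 7.0), ("US", "a", 3.0)]:
    insert(root, [region, product], amount)

print(query(root, ["EU", STAR]))   # 12.0 : all products in EU
print(query(root, [STAR, "a"]))    # 8.0  : product a across all regions
```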

    Active Ontology: An Information Integration Approach for Dynamic Information Sources

    In this paper we describe an ontology-based information integration approach that is suitable for highly dynamic distributed information sources, such as those available in Grid systems. The main challenges addressed are: 1) information changes frequently and information requests have to be answered quickly in order to provide up-to-date information; and 2) the most suitable information sources have to be selected from a set of different distributed sources that can provide the information needed. To deal with the first challenge, we use an information cache that works with an update-on-demand policy. To deal with the second, we add an information source selection step to the usual architecture used for ontology-based information integration. To illustrate our approach, we have developed an information service that aggregates metadata available in hundreds of information services of the EGEE Grid infrastructure.
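    The update-on-demand caching policy mentioned above can be illustrated with a short sketch: a cached entry is refreshed only when it is requested and judged stale, so rapidly changing sources are not polled continuously. The TTL value, the fetch_from_source() helper, and the attributes it returns are assumptions for illustration, not the EGEE information service API.

```python
# Minimal update-on-demand cache sketch: refresh an entry only when it is
# requested and older than a freshness bound. All names/values are made up.
import time

TTL_SECONDS = 30.0   # assumed freshness bound

class OnDemandCache:
    def __init__(self, fetch):
        self._fetch = fetch                                   # callable hitting the source
        self._entries: dict[str, tuple[float, object]] = {}   # key -> (timestamp, value)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None or time.time() - entry[0] > TTL_SECONDS:
            value = self._fetch(key)                          # refresh only now, on demand
            self._entries[key] = (time.time(), value)
            return value
        return entry[1]

def fetch_from_source(key: str):
    return {"resource": key, "free_slots": 42}   # stand-in for a Grid info query

cache = OnDemandCache(fetch_from_source)
print(cache.get("site-A"))   # first request triggers a fetch; repeats hit the cache
```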

    Data Warehouse Design and Management: Theory and Practice

    The need to store data and information permanently, for reuse at later stages, is a highly relevant problem in the modern world and now affects a large number of people and economic agents. The storage and subsequent use of data can indeed be a valuable source for decision making or for increasing commercial activity. The next step beyond data storage is the efficient and effective use of information, particularly through Business Intelligence, which rests on the implementation of a Data Warehouse. In the present paper we analyze Data Warehouses and their theoretical models, and illustrate a practical implementation in a specific case study on a pharmaceutical distribution company. Keywords: data warehouse, database, data model.
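    To make the Data Warehouse design discussion concrete, the sketch below creates a minimal star schema in SQLite: a shipments fact table surrounded by product, pharmacy, and date dimensions, plus a typical roll-up query. The table and column names are illustrative assumptions, not the schema of the paper's case study.

```python
# Minimal star-schema sketch: one fact table plus three dimension tables,
# and an OLAP-style roll-up query. Names are illustrative only.
import sqlite3

ddl = """
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_pharmacy (pharmacy_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_shipment (
    product_id  INTEGER REFERENCES dim_product(product_id),
    pharmacy_id INTEGER REFERENCES dim_pharmacy(pharmacy_id),
    date_id     INTEGER REFERENCES dim_date(date_id),
    quantity    INTEGER,
    revenue     REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# A typical roll-up: revenue by region and year.
query = """
SELECT p.region, d.year, SUM(f.revenue)
FROM fact_shipment f
JOIN dim_pharmacy p ON f.pharmacy_id = p.pharmacy_id
JOIN dim_date d     ON f.date_id = d.date_id
GROUP BY p.region, d.year;
"""
print(conn.execute(query).fetchall())   # empty until the tables are populated
```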

    Scalable Model-Based Management of Correlated Dimensional Time Series in ModelarDB+

    To monitor critical infrastructure, high-quality sensors sampled at a high frequency are increasingly used. However, as they produce huge amounts of data, only simple aggregates are stored. This removes outliers and fluctuations that could indicate problems. As a remedy, we present a model-based approach for managing time series with dimensions that exploits correlation in and among time series. Specifically, we propose compressing groups of correlated time series using an extensible set of model types within a user-defined error bound (possibly zero). We name this new category of model-based compression methods for time series Multi-Model Group Compression (MMGC). We present the first MMGC method, GOLEMM, and extend model types to compress time series groups. We propose primitives for users to effectively define groups for differently sized data sets and, based on these, an automated grouping method using only the time series dimensions. We propose algorithms for executing simple and multi-dimensional aggregate queries on models. Lastly, we implement our methods in the Time Series Management System (TSMS) ModelarDB (ModelarDB+). Our evaluation shows that compared to widely used formats, ModelarDB+ provides up to 13.7 times faster ingestion due to high compression, 113 times better compression due to the adaptivity of GOLEMM, 630 times faster aggregates by using models, and close to linear scalability. It is also extensible and supports online query processing. Comment: 12 pages, 28 figures, and 1 table.
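    A toy sketch of what "compressing within a user-defined error bound" can look like for a single series: fit the simplest possible model (a constant) to as long a run of points as the bound allows, then start a new segment. This only conveys the principle; it is not GOLEMM, multi-model selection, or the group compression implemented in ModelarDB+.

```python
# Toy lossy model-based compression with a constant model per segment:
# every stored value deviates from its model by at most error_bound.

def compress_constant(values: list[float], error_bound: float) -> list[tuple[float, int]]:
    """Return (model_value, segment_length) pairs covering the series."""
    segments: list[tuple[float, int]] = []
    lo = hi = values[0]
    length = 0
    for v in values:
        new_lo, new_hi = min(lo, v), max(hi, v)
        if (new_hi - new_lo) / 2.0 <= error_bound:
            lo, hi, length = new_lo, new_hi, length + 1   # extend current segment
        else:
            segments.append(((lo + hi) / 2.0, length))    # emit model, start new segment
            lo = hi = v
            length = 1
    segments.append(((lo + hi) / 2.0, length))
    return segments

series = [20.0, 20.1, 19.9, 20.0, 25.0, 25.2, 25.1]
print(compress_constant(series, error_bound=0.5))
# e.g. [(20.0, 4), (25.1, 3)]: two models instead of seven raw values
```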

    Data Mining the SDSS SkyServer Database

    An earlier paper (Szalay et al., "Designing and Mining MultiTerabyte Astronomy Archives: The Sloan Digital Sky Survey," ACM SIGMOD 2000) described the Sloan Digital Sky Survey's (SDSS) data management needs by defining twenty database queries and twelve data visualization tasks that a good data management system should support. We built a database and interfaces to support both the query load and a website for ad-hoc access. This paper reports on the database design, describes the data loading pipeline, and reports on the query implementation and performance. The queries typically translated to a single SQL statement. Most queries run in less than 20 seconds, allowing scientists to interactively explore the database. This paper is an in-depth tour of those queries. Readers should first have studied the companion overview paper, Szalay et al., "The SDSS SkyServer, Public Access to the Sloan Digital Sky Server Data," ACM SIGMOD 2002. Comment: 40 pages. Original source is at http://research.microsoft.com/~gray/Papers/MSR_TR_O2_01_20_queries.do
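    As an example of the "single SQL statement" style of query the paper discusses, the snippet below shows a simple photometric colour cut. The table and column names follow the commonly documented SDSS schema but are used here purely as an illustration, and run_query() is a hypothetical placeholder rather than a SkyServer API.

```python
# Illustrative single-SQL-statement astronomy query: a colour cut over the
# photometric object table. Schema details are shown only as an example.

QUERY = """
SELECT TOP 100 objID, ra, dec, u, g, r
FROM PhotoObj
WHERE u - g > 2.2        -- a simple colour cut selecting red objects
  AND r BETWEEN 15 AND 19
"""

def run_query(sql: str):
    # Placeholder: submit the SQL to a SkyServer-like HTTP endpoint or database.
    raise NotImplementedError("wire this to an actual SkyServer interface")

if __name__ == "__main__":
    print(QUERY.strip())
```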