
    Caching in Multidimensional Databases

    One application area of multidimensional databases is On-line Analytical Processing (OLAP). Applications in this area are designed to make the analysis of shared multidimensional information fast [9]. On the one hand, speed can be achieved by specially devised data structures and algorithms. On the other hand, the analytical process is cyclic: the user of the OLAP application runs queries one after the other, and the output of the latest query may already be present, at least partly, in one of the previous results. Caching therefore also plays an important role in the operation of these systems. However, caching by itself may not be enough to ensure acceptable performance. Size matters: the more memory is available, the more we gain by loading and keeping information in it. Often the cache size is fixed, which limits the performance of the multidimensional database unless we compress the data so that a greater proportion of it fits in memory. Caching combined with suitable compression methods promises further performance improvements. In this paper, we investigate how caching influences the speed of OLAP systems. Different physical representations (multidimensional and table) are evaluated, and models are proposed for a thorough comparison. We draw conclusions based on these models, and the conclusions are verified with empirical data.
    Comment: 14 pages, 5 figures, 8 tables. Paper presented at the Fifth Conference of PhD Students in Computer Science, Szeged, Hungary, 27-30 June 2006. For further details, please refer to http://www.inf.u-szeged.hu/~szepkuti/papers.html#cachin
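
    To make the caching argument concrete, here is a minimal sketch (an illustration, not the paper's implementation) of an OLAP result cache combining LRU eviction with compressed entries, so that a fixed memory budget holds a greater proportion of the data; all names and the zlib/pickle choices are assumptions:

```python
# Minimal sketch: LRU cache for OLAP aggregate results with a fixed
# byte budget and zlib-compressed entries. Illustrative only.
import zlib, pickle
from collections import OrderedDict

class CompressedQueryCache:
    def __init__(self, max_bytes=1 << 20):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()          # query key -> compressed blob

    def get(self, key):
        blob = self.entries.get(key)
        if blob is None:
            return None                       # miss: caller recomputes the query
        self.entries.move_to_end(key)         # mark as most recently used
        return pickle.loads(zlib.decompress(blob))

    def put(self, key, result):
        old = self.entries.pop(key, None)     # replace an existing entry cleanly
        if old is not None:
            self.used -= len(old)
        blob = zlib.compress(pickle.dumps(result))
        self.entries[key] = blob
        self.used += len(blob)
        while self.used > self.max_bytes:     # evict least recently used entries
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)

cache = CompressedQueryCache(max_bytes=64 * 1024)
key = ("sales_sum", ("2006", "Hungary"))      # dimension coordinates of a query
if cache.get(key) is None:
    cache.put(key, {"sum": 1234.5})           # stand-in for a computed aggregate
print(cache.get(key))
```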

    Attribute Value Reordering For Efficient Hybrid OLAP

    The normalization of a data cube is the ordering of the attribute values. For large multidimensional arrays where dense and sparse chunks are stored differently, proper normalization can lead to improved storage efficiency. We show that it is NP-hard to compute an optimal normalization even for 1x3 chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are nearly statistically independent, we show that dimension-wise attribute frequency sorting is an optimal normalization and takes time O(d n log(n)) for data cubes of size n^d. When dimensions are not independent, we propose and evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is already 19%-30% more efficient than ROLAP, but normalization can improve it further by 9%-13%, for a total gain of 29%-44% over ROLAP.
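
    The dimension-wise attribute frequency sorting mentioned above can be sketched in a few lines; the following toy implementation (an illustration under assumed names, not the authors' code) renumbers each dimension's attribute values in decreasing order of frequency, matching the stated O(d n log n) cost:

```python
# Minimal sketch: dimension-wise attribute frequency sorting of a sparse
# data cube given as a list of d-dimensional coordinate tuples.
from collections import Counter

def frequency_sort_normalize(cells):
    """Remap cells so that, in every dimension, attribute values are
    renumbered in decreasing order of frequency."""
    d = len(cells[0])
    remaps = []
    for dim in range(d):
        freq = Counter(cell[dim] for cell in cells)
        ordered = sorted(freq, key=freq.get, reverse=True)   # O(n log n) per dim
        remaps.append({value: rank for rank, value in enumerate(ordered)})
    return [tuple(remaps[dim][cell[dim]] for dim in range(d)) for cell in cells]

# Allocated cells of a 2-d cube; frequent values move toward index 0,
# which tends to cluster dense chunks together.
cells = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "x"), ("c", "z")]
print(frequency_sort_normalize(cells))
```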

    NOSQL design for analytical workloads: Variability matters

    Big Data has recently gained popularity and has strongly questioned relational databases as universal storage systems, especially in the presence of analytical workloads. As a result, co-relational alternatives, commonly known as NOSQL (Not Only SQL) databases, are extensively used for Big Data. Since the primary focus of NOSQL is performance, NOSQL databases are designed directly at the physical level, and consequently the resulting schema is tailored to the dataset and access patterns of the problem at hand. However, we believe that NOSQL design can also benefit from traditional design approaches. In this paper we present a method to design databases for analytical workloads. Starting from the conceptual model and adopting the classical 3-phase design used for relational databases, we propose a novel design method that takes the new features brought by NOSQL into account and encompasses relational and co-relational design altogether.
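
    As a hypothetical illustration of designing at the physical level for a known access pattern (an assumed example, not the method proposed in the paper), a conceptual sales fact can be denormalized into one document per store and month so that the dominant analytical query is answered in a single read:

```python
# One document per (store, month): an assumed dominant query "monthly
# sales per store" becomes a single key lookup instead of a join.
sales_by_store_month = {
    "_id": "store:42|2024-06",                  # key derived from the access pattern
    "store": {"id": 42, "city": "Barcelona"},   # dimension embedded, not joined
    "month": "2024-06",
    "sales": [                                  # facts pre-grouped for one read
        {"product": "P-17", "date": "2024-06-03", "amount": 120.0},
        {"product": "P-99", "date": "2024-06-12", "amount": 75.5},
    ],
    "total_amount": 195.5,                      # pre-aggregated for the workload
}
print(sales_by_store_month["total_amount"])
```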

    Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays

    Local moments are used for local regression, to compute statistical measures such as sums, averages, and standard deviations, and to approximate probability distributions. We consider the case where the data source is a very large I/O array of size n and we want to compute the first N local moments, for some constant N. Without precomputation, this requires O(n) time. We develop a sequence of algorithms of increasing sophistication that use precomputation and additional buffer space to speed up queries. The simpler algorithms partition the I/O array into consecutive ranges called bins, and they are applicable not only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM, etc.). With N buffers of size sqrt(n), the time complexity drops to O(sqrt(n)). A more sophisticated approach uses hierarchical buffering and has logarithmic time complexity, O(b log_b n), when using N hierarchical buffers of size n/b. Using Overlapped Bin Buffering, we show that only a single buffer is needed, as with wavelet-based algorithms, but using much less storage. Applications exist in multidimensional and statistical databases over massive data sets, interactive image processing, and visualization.
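
    The simple (non-hierarchical) bin-buffering scheme for algebraic queries is easy to sketch; the toy version below for SUM (illustrative names, bin size b = sqrt(n) as in the abstract) answers range sums in O(sqrt(n)) time, and an update stays cheap because it touches only one bin:

```python
# Minimal sketch of bin buffering for range-SUM queries: precompute the
# sum of each bin of size b, then answer a prefix sum by adding whole
# bins plus a partial scan, in O(b + n/b) time (O(sqrt(n)) for b=sqrt(n)).
import math

class BinBufferedSums:
    def __init__(self, data):
        self.data = list(data)
        n = len(self.data)
        self.b = max(1, math.isqrt(n))
        self.bins = [sum(self.data[i:i + self.b])
                     for i in range(0, n, self.b)]

    def prefix_sum(self, k):
        """Sum of data[0:k]: whole bins plus a partial scan."""
        full, _ = divmod(k, self.b)
        return sum(self.bins[:full]) + sum(self.data[full * self.b:k])

    def range_sum(self, lo, hi):
        return self.prefix_sum(hi) - self.prefix_sum(lo)

    def update(self, i, value):
        self.bins[i // self.b] += value - self.data[i]   # only one bin changes
        self.data[i] = value

a = BinBufferedSums(range(100))
print(a.range_sum(10, 20))   # 10 + 11 + ... + 19 = 145
a.update(15, 0)
print(a.range_sum(10, 20))   # 130
```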

    Analyzing the solutions of DEA through information visualization and data mining techniques: SmartDEA framework

    Data envelopment analysis (DEA) has proven to be a useful tool for assessing the efficiency or productivity of organizations, which is of vital practical importance in managerial decision making. DEA provides a significant amount of information from which analysts and managers derive insights and guidelines to improve existing performance. Given this, effective and methodical analysis and interpretation of DEA solutions is critical. The main objective of this study is therefore to develop a general decision support system (DSS) framework to analyze the solutions of basic DEA models. The paper formally shows how the solutions of DEA models should be structured so that they can be examined and interpreted effectively by analysts through information visualization and data mining techniques. An innovative and convenient DEA solver, SmartDEA, is designed and developed in accordance with the proposed analysis framework. The developed software provides a DEA solution that is consistent with the framework and, through a table-based structure, is ready to analyze with data mining tools. The framework is tested and applied in a real-world project for benchmarking the vendors of a leading Turkish automotive company. The results show the effectiveness and efficacy of the proposed framework.
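
    For readers unfamiliar with the basic DEA models such a framework analyzes, the following is a minimal sketch of the input-oriented CCR model in multiplier form, one linear program per decision-making unit (made-up data and an assumed scipy formulation; this is not SmartDEA itself):

```python
# Minimal sketch: input-oriented CCR efficiency of each DMU, solved as
# one LP in multiplier form. Data are invented for illustration.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 4.0], [3.0, 2.0], [4.0, 5.0]])  # inputs,  one row per DMU
Y = np.array([[1.0], [1.5], [1.2]])                  # outputs, one row per DMU
n, m, s = X.shape[0], X.shape[1], Y.shape[1]

for o in range(n):
    # Variables: output weights u (length s), then input weights v (length m).
    c = np.concatenate([-Y[o], np.zeros(m)])         # maximize u . y_o
    A_ub = np.hstack([Y, -X])                        # u.y_j - v.x_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = [np.concatenate([np.zeros(s), X[o]])]     # v . x_o = 1 (normalization)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m))
    print(f"DMU {o}: efficiency = {-res.fun:.3f}")   # 1.0 means efficient
```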

    Index filtering and view materialization in ROLAP environment


    Managing Metadata in Data Warehouses: Pitfalls and Possibilities

    This paper motivates a comprehensive academic study of metadata and the roles that metadata plays in organizational information systems. While the benefits of metadata and challenges in implementing metadata solutions are widely addressed in practitioner publications, explicit discussion of metadata in academic literature is rare. Metadata, when discussed, is perceived primarily as a technology solution. Integrated management of metadata and its business value are not well addressed. This paper discusses both the benefits offered by and the challenges associated with integrating metadata. It also describes solutions for addressing some of these challenges. The inherent complexity of an integrated metadata repository is demonstrated by reviewing the metadata functionality required in a data warehouse: a decision support environment where its importance is acknowledged. Comparing this required functionality with metadata management functionalities offered by data warehousing software products identifies crucial gaps. Based on these analyses, topics for further research on metadata are proposed

    A Biased Topic Modeling Approach for Case Control Study from Health Related Social Media Postings

    Online social networks are the hubs of social activity in cyberspace, and using them to exchange knowledge, experiences, and opinions is common. In this work, an advanced topic modeling framework is designed to analyze complex longitudinal health information from social media with minimal human annotation, and Adverse Drug Event and Reaction (ADR) information is extracted and automatically processed using a biased topic modeling method. This framework improves and extends existing topic modeling algorithms that incorporate background knowledge. Using this approach, background knowledge such as ADR terms and other biomedical knowledge can be incorporated during the text mining process, generating scores that indicate the presence of ADRs. A case control study has been performed on a data set of Twitter timelines of women who announced their pregnancy; the goal of the study is to compare the ADR risk of medication usage from each medication category during pregnancy. In addition, to evaluate the predictive power of this approach, another important aspect of personalized medicine was addressed: the prediction of medication usage through the identification of risk groups. During the prediction process, health information from the Twitter timelines, such as diseases, symptoms, treatments, and effects, is summarized by the topic modeling processes, and the summarization results are used for prediction. Dimension reduction and topic similarity measurement are integrated into this framework for timeline classification and prediction. This work could be applied to provide guidelines for FDA drug risk categories, a process currently based on laboratory results and reported cases. Finally, a multi-dimensional text data warehouse (MTD) to manage the output of the topic modeling is proposed, and some attempts have also been made to incorporate topic structure (ontology) and the MTD hierarchy. Results demonstrate that the proposed methods show promise and that this system represents a low-cost approach to drug safety early warning.
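
    The idea of biasing a topic model with background knowledge can be sketched with an off-the-shelf LDA by skewing the topic-word prior toward a seed lexicon; the example below (hypothetical data and ADR seeds, with gensim's eta parameter as an assumed biasing mechanism, not the dissertation's actual framework) scores each timeline by the weight of the ADR-biased topic:

```python
# Minimal sketch: seed-biased LDA. Topic 0's word prior is inflated on
# known ADR terms, so its per-document weight acts as an ADR score.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

# Tiny, made-up "timelines" and a hypothetical ADR seed lexicon.
docs = [["headache", "nausea", "took", "ibuprofen"],
        ["morning", "coffee", "walk", "park"],
        ["dizzy", "rash", "after", "amoxicillin"]]
adr_seeds = {"headache", "nausea", "dizzy", "rash"}

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

num_topics = 2
eta = np.full((num_topics, len(dictionary)), 0.01)   # symmetric base prior
for term in adr_seeds & set(dictionary.token2id):
    eta[0, dictionary.token2id[term]] = 5.0          # bias topic 0 toward ADR terms

lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
               eta=eta, passes=100, random_state=0)

# ADR presence score = weight of the biased topic in each timeline.
for doc, bow in zip(docs, corpus):
    topics = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    print(doc, round(topics.get(0, 0.0), 2))
```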