

    A common problem with OnLine Analytical Processing (OLAP) databases is data explosion: data size multiplies when it is loaded from the source data into multidimensional cubes. Data explosion is not an issue for small databases, but it can be a serious problem for large databases. In this paper we discuss the sparsity and data explosion phenomena in the multidimensional data model, which lies at the core of OLAP systems. Our research across five companies in different branches of business confirms the observation that in reality most cubes are extremely sparse. We also consider the different methods that relational and multidimensional servers apply to reduce the data explosion and sparsity problems, such as compression and indexing techniques, partitioning, and preliminary aggregations.
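
The sparsity described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical fact table with three dimensions (the function name `cube_density` and the sample data are invented for illustration and do not come from the paper). Density is taken here as the number of populated cells divided by the product of the dimension cardinalities.

```python
from math import prod

def cube_density(facts, num_dims):
    """Return (populated_cells, possible_cells, density) for a list of facts.

    Each fact is a tuple of `num_dims` dimension values followed by a measure.
    """
    # Distinct coordinate tuples that actually hold data.
    cells = {fact[:num_dims] for fact in facts}
    # Number of distinct members observed in each dimension.
    cardinalities = [len({fact[d] for fact in facts}) for d in range(num_dims)]
    # Size of the full cross product, i.e. the dense cube.
    possible = prod(cardinalities)
    return len(cells), possible, len(cells) / possible

# Hypothetical data: (product, store, month, sales)
facts = [
    ("p1", "s1", "jan", 10.0),
    ("p2", "s2", "feb", 5.0),
    ("p3", "s3", "mar", 7.5),
    ("p1", "s2", "mar", 2.0),
]
populated, possible, density = cube_density(facts, 3)
print(populated, possible, round(density, 3))
```

With three members per dimension the dense cube has 27 cells, of which only 4 are populated. Adding dimensions multiplies the number of possible cells while the number of facts stays fixed, which is why real cubes tend toward extreme sparsity.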

    Frequent itemset mining on multiprocessor systems

    Frequent itemset mining is an important building block in many data mining applications such as market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow to hundreds of gigabytes or even terabytes. Hence, efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed, which, however, (1) often have high memory requirements and (2) do not exploit the high degree of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., they have to access secondary memory, which leads to serious performance degradation. Exploiting available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads and also other kinds of parallelism (e.g., vector instruction sets) besides thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined because they must be kept in main memory during several mining invocations and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, i.e., the size of the datasets is reduced by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. 
Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their cost by providing parallel encodings that achieve high throughput for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the size of the intermediate data by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even when run single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they further scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms that are used for mining other types of itemsets.
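
One of the hot spots named above, intersecting lists of sorted integers, can be sketched as follows. This is a plain sequential merge-style intersection, not the parallel or vectorized variant developed in the work. The vertical layout (one sorted list of transaction ids per item), the helper names, and the sample data are assumptions made for illustration.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of transaction ids in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur in b anymore
        else:
            j += 1  # b[j] cannot occur in a anymore
    return out

def support(tidlists, itemset):
    """Count the transactions that contain every item of `itemset`."""
    items = iter(itemset)
    common = tidlists[next(items)]
    for item in items:
        common = intersect_sorted(common, tidlists[item])
    return len(common)

# Hypothetical vertical database: item -> sorted ids of transactions containing it.
tidlists = {
    "bread": [1, 2, 4, 5],
    "milk": [2, 3, 5],
    "eggs": [2, 5, 6],
}
print(support(tidlists, ["bread", "milk"]))
print(support(tidlists, ["bread", "milk", "eggs"]))
```

An itemset is frequent when its support reaches a chosen minimum. Because each intersection reads only the two input lists, independent itemsets can in principle be processed by separate threads.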

    Sentinel Mining


    Query Workload-Aware Index Structures for Range Searches in 1D, 2D, and High-Dimensional Spaces

    abstract: Most current database management systems are optimized for single query execution. Yet queries often arrive as part of a query workload. There is therefore a need for index structures that take into account the existence of multiple queries in a workload and efficiently produce accurate results for the entire workload. These index structures should scale to large amounts of data as well as large query workloads. The main objective of this dissertation is to design scalable index structures that are optimized for range query workloads. Range queries are an important type of query with wide-ranging applications. There are no existing index structures that are optimized for efficient execution of range query workloads. There are also unique challenges that need to be addressed for range queries in 1D, 2D, and high-dimensional spaces. In this work, I introduce novel cost models, index selection algorithms, and storage mechanisms that tackle these challenges and efficiently process a given range query workload in 1D, 2D, and high-dimensional spaces. In particular, I introduce the index structures HCS (for 1D spaces), cSHB (for 2D spaces), and PSLSH (for high-dimensional spaces), which are designed specifically to handle range query workloads efficiently and to address the unique challenges arising from their respective spaces. I experimentally show the effectiveness of the proposed index structures by comparing them with state-of-the-art techniques. Dissertation/Thesis. Doctoral Dissertation, Computer Science 201

    Correlation indices : a new access method to exploit correlated attributes

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 69-71). In relational query processing, one generally chooses between two classes of access paths when performing a predicate lookup for which no clustered index is available. One option is to use an unclustered index. Another is to perform a complete sequential scan of the table. Online analytical processing (OLAP) workloads often do not benefit from the availability of unclustered indices; the cost of random disk I/O becomes prohibitive for all but the most selective queries. Unfortunately, this means that data warehouses and other OLAP systems frequently perform sequential scans, unless they can satisfy nearly all of the queries posed to them by a single clustered index [7], or unless they have available specialized data structures - like bitmap indices, materialized views, or cubes - to answer queries directly. This thesis presents a new index data structure called a correlation index (CI) that enables OLAP databases to answer a wider range of queries from a single clustered index or sorted file. The CI exploits correlations between the key attribute of a clustered index and other unclustered attributes in the table. In order to predict when CIs will exhibit wins over alternative access methods, the thesis describes an analytical cost model that is suitable for integration with existing query optimizers. An implementation compares CI performance against sequential scans and unclustered B+Tree indices in the popular Berkeley DB [22] library. Experimental results over three different data sets validate the accuracy of the cost model and establish numerous cases where CIs accelerate lookup times by 5 to 20 times over both unclustered B+Trees and sequential scans. The strong experimental results suggest that CIs offer practical and substantial benefits in a variety of useful query scenarios. By George Huo. M.Eng
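
The idea of exploiting a correlation between a clustered key and an unclustered attribute can be illustrated with a simplified sketch. This is an assumption about the general principle, not the thesis's actual data structure or its Berkeley DB implementation: the in-memory map, the function names, and the sample table are all hypothetical. The map records, for each value of the unclustered attribute, the clustered key values it co-occurs with, so a lookup visits only those key ranges in the sorted file.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_correlation_map(rows, key_col, attr_col):
    """Map each unclustered attribute value to the clustered keys it co-occurs with."""
    cmap = defaultdict(set)
    for row in rows:
        cmap[row[attr_col]].add(row[key_col])
    return cmap

def lookup(rows_sorted, keys, cmap, attr_col, value):
    """Return rows with row[attr_col] == value, scanning only the correlated key ranges."""
    result = []
    for key in sorted(cmap.get(value, ())):
        lo = bisect_left(keys, key)   # first row with this clustered key
        hi = bisect_right(keys, key)  # one past the last row with this key
        # Rows in the range share the key but may differ in the attribute.
        result.extend(r for r in rows_sorted[lo:hi] if r[attr_col] == value)
    return result

# Hypothetical table (state, city), stored sorted by the clustered key `state`.
rows = [
    ("MA", "Boston"),
    ("MA", "Springfield"),
    ("MO", "Springfield"),
    ("NY", "Albany"),
    ("OH", "Toledo"),
]
keys = [r[0] for r in rows]
cmap = build_correlation_map(rows, key_col=0, attr_col=1)
print(lookup(rows, keys, cmap, attr_col=1, value="Springfield"))
```

The stronger the correlation, the fewer clustered keys each attribute value maps to, and the smaller the portion of the sorted file that a lookup has to read.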

    Is Semantic Query Optimization Worthwhile?

    The term "semantic query optimization" (SQO) denotes a methodology whereby queries against databases are optimized using semantic information about the database objects being queried. The result of semantically optimizing a query is another query which is syntactically different from the original but semantically equivalent, and which may be answered more efficiently than the original. SQO is distinctly different from the work performed by the conventional SQL optimizer. The SQL optimizer generates a set of logically equivalent alternative execution paths based ultimately on the rules of relational algebra. However, only a small proportion of the readily available semantic information is utilised by current SQL optimizers. Researchers in SQO agree that SQO can be very effective. However, after some twenty years of research into SQO, there is still no commercial implementation. In this thesis we argue that we need to quantify the conditions under which SQO is worthwhile. We investigate what these conditions are and apply this knowledge to relational database management systems (RDBMS) with static schemas and infrequently updated data. Any semantic query optimizer requires the ability to reason using the semantic information available, in order to draw conclusions which ultimately facilitate the recasting of the original query into a form which can be answered more efficiently. This reasoning engine is currently not part of any commercial RDBMS implementation. We show how a practical semantic query optimizer may be built utilising readily available semantic information, much of it already captured by meta-data typically stored in commercial RDBMS. We develop cost models which predict an upper bound on the amount of optimization one can expect when queries are pre-processed by a semantic optimizer. 
We present a series of empirical results to confirm the effectiveness or otherwise of various types of SQO and demonstrate the circumstances under which SQO can be effective.
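
A minimal sketch of the kind of reasoning a semantic optimizer performs is given below. It assumes a single numeric column whose integrity constraint and query predicate are both closed intervals. The function names, the returned labels, and the salary example are hypothetical and are not taken from the thesis.

```python
def intersect(a, b):
    """Intersect two closed intervals (lo, hi); return None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def rewrite_predicate(predicate, constraint):
    """Recast a range predicate using a known constraint on the same column.

    Returns "EMPTY" if no row can qualify, "DROP_PREDICATE" if every row
    qualifies, and otherwise the tightened interval to evaluate.
    """
    overlap = intersect(predicate, constraint)
    if overlap is None:
        return "EMPTY"           # contradiction: answer without touching the data
    if overlap == constraint:
        return "DROP_PREDICATE"  # the predicate is implied by the constraint
    return overlap               # restriction narrowed to the feasible range

INF = float("inf")
# Hypothetical constraint: CHECK (salary BETWEEN 0 AND 100000)
salary_constraint = (0, 100000)

print(rewrite_predicate((150000, INF), salary_constraint))    # salary >= 150000
print(rewrite_predicate((-INF, 200000), salary_constraint))   # salary <= 200000
print(rewrite_predicate((50000, 200000), salary_constraint))  # BETWEEN 50000 AND 200000
```

The first query is recognised as unsatisfiable, the second loses a redundant predicate, and the third is narrowed. In each case the rewritten query is semantically equivalent to the original for any data that satisfies the constraint.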
