269 research outputs found

    XWeB: the XML Warehouse Benchmark

    Full text link
    With the emergence of XML as a standard for representing business data, new decision support applications are being developed. These XML data warehouses aim at supporting On-Line Analytical Processing (OLAP) operations that manipulate irregular XML data. To ensure feasibility of these new tools, important performance issues must be addressed. Performance is customarily assessed with the help of benchmarks. However, decision support benchmarks do not currently support XML features. In this paper, we introduce the XML Warehouse Benchmark (XWeB), which aims at filling this gap. XWeB derives from the relational decision support benchmark TPC-H. It is mainly composed of a test data warehouse that is based on a unified reference model for XML warehouses and that features XML-specific structures, and its associate XQuery decision support workload. XWeB's usage is illustrated by experiments on several XML database management systems

    Benchmarking Summarizability Processing in XML Warehouses with Complex Hierarchies

    Full text link
    Business Intelligence plays an important role in decision making. Based on data warehouses and Online Analytical Processing, a business intelligence tool can be used to analyze complex data. Still, summarizability issues in data warehouses cause ineffective analyses that may become critical problems to businesses. To settle this issue, many researchers have studied and proposed various solutions, both in relational and XML data warehouses. However, they find difficulty in evaluating the performance of their proposals since the available benchmarks lack complex hierarchies. In order to contribute to summarizability analysis, this paper proposes an extension to the XML warehouse benchmark (XWeB) with complex hierarchies. The benchmark enables us to generate XML data warehouses with scalable complex hierarchies as well as summarizability processing. We experimentally demonstrated that complex hierarchies can definitely be included into a benchmark dataset, and that our benchmark is able to compare two alternative approaches dealing with summarizability issues.Comment: 15th International Workshop on Data Warehousing and OLAP (DOLAP 2012), Maui : United States (2012

    Benchmarking Big Data OLAP NoSQL Databases

    Get PDF
    With the advent of Big Data, new challenges have emerged regarding the evaluation of decision support systems (DSS). Existing evaluation benchmarks are not configured to handle a massive data volume and wide data diversity. In this paper, we introduce a new DSS benchmark that supports multiple data storage systems, such as relational and Not Only SQL (NoSQL) systems. Our scheme recognizes numerous data models (snowflake, star and flat topologies) and several data formats (CSV, JSON, TBL, XML, etc.). It entails complex data generation characterized within “volume, variety, and velocity” framework (3 V). Next, our scheme enables distributed and parallel data generation. Furthermore, we exhibit some experimental results with KoalaBench

    Data generator for evaluating ETL process quality

    Get PDF
    Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is known as a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, having a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing the plethora of possible test cases. To facilitate such demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework and here we report our experimental findings showing the effectiveness and scalability of our approach.Peer ReviewedPostprint (author's final draft

    Implementing Multidimensional Data Warehouses into NoSQL

    Get PDF
    International audienceNot only SQL (NoSQL) databases are becoming increasingly popular and have some interesting strengths such as scalability and flexibility. In this paper, we investigate on the use of NoSQL systems for implementing OLAP (On-Line Analytical Processing) systems. More precisely, we are interested in instantiating OLAP systems (from the conceptual level to the logical level) and instantiating an aggregation lattice (optimization). We define a set of rules to map star schemas into two NoSQL models: columnoriented and document-oriented. The experimental part is carried out using the reference benchmark TPC. Our experiments show that our rules can effectively instantiate such systems (star schema and lattice). We also analyze differences between the two NoSQL systems considered. In our experiments, HBase (columnoriented) happens to be faster than MongoDB (document-oriented) in terms of loading time

    A New Big Data Benchmark for OLAP Cube Design Using Data Pre-Aggregation Techniques

    Get PDF
    In recent years, several new technologies have enabled OLAP processing over Big Data sources. Among these technologies, we highlight those that allow data pre-aggregation because of their demonstrated performance in data querying. This is the case of Apache Kylin, a Hadoop based technology that supports sub-second queries over fact tables with billions of rows combined with ultra high cardinality dimensions. However, taking advantage of data pre-aggregation techniques to designing analytic models for Big Data OLAP is not a trivial task. It requires very advanced knowledge of the underlying technologies and user querying patterns. A wrong design of the OLAP cube alters significantly several key performance metrics, including: (i) the analytic capabilities of the cube (time and ability to provide an answer to a query), (ii) size of the OLAP cube, and (iii) time required to build the OLAP cube. Therefore, in this paper we (i) propose a benchmark to aid Big Data OLAP designers to choose the most suitable cube design for their goals, (ii) we identify and describe the main requirements and trade-offs for effectively designing a Big Data OLAP cube taking advantage of data pre-aggregation techniques, and (iii) we validate our benchmark in a case study.This work has been funded by the ECLIPSE project (RTI2018-094283-B-C32) from the Spanish Ministry of Science, Innovation and Universities

    A Comparison of Leading Database Storage Engines in Support of Online Analytical Processing in an Open Source Environment

    Get PDF
    Online Analytical Processing (OLAP) has become the de facto data analysis technology used in modern decision support systems. It has experienced tremendous growth, and is among the top priorities for enterprises. Open source systems have become an effective alternative to proprietary systems in terms of cost and function. The purpose of the study was to investigate the performance of two leading database storage engines in an open source OLAP environment. Despite recent upgrades in performance features for the InnoDB database engine, the MyISAM database engine is shown to outperform the InnoDB database engine under a standard benchmark. This result was demonstrated in tests that included concurrent user sessions as well as asynchronous user sessions using data sets ranging from 6GB to 12GB. Although MyISAM outperformed InnoDB in all test performed, InnoDB provides ACID compliant transaction technologies are beneficial in a hybrid OLAP/OLTP system

    Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

    Full text link
    Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes
    • …
    corecore