XWeB: the XML Warehouse Benchmark
With the emergence of XML as a standard for representing business data, new
decision support applications are being developed. These XML data warehouses
aim at supporting On-Line Analytical Processing (OLAP) operations that
manipulate irregular XML data. To ensure feasibility of these new tools,
important performance issues must be addressed. Performance is customarily
assessed with the help of benchmarks. However, decision support benchmarks do
not currently support XML features. In this paper, we introduce the XML
Warehouse Benchmark (XWeB), which aims at filling this gap. XWeB derives from
the relational decision support benchmark TPC-H. It is mainly composed of a
test data warehouse that is based on a unified reference model for XML
warehouses and that features XML-specific structures, together with its
associated XQuery decision support workload. XWeB's usage is illustrated by
experiments on several XML database management systems.
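The benchmark's XQuery workload expresses OLAP-style aggregations over XML facts. As a rough illustration only (the element names, measure, and dimension below are invented, not XWeB's actual schema or query set), the kind of roll-up such a workload query performs can be sketched in Python:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment of an XML fact document; element and attribute
# names are illustrative, not XWeB's actual schema.
doc = """
<facts>
  <sale part="P1" qty="5"/>
  <sale part="P2" qty="3"/>
  <sale part="P1" qty="2"/>
</facts>
"""

def total_qty_by_part(xml_text):
    """Roll up the qty measure along the part dimension."""
    totals = defaultdict(int)
    for sale in ET.fromstring(xml_text).iter("sale"):
        totals[sale.get("part")] += int(sale.get("qty"))
    return dict(totals)

print(total_qty_by_part(doc))  # {'P1': 7, 'P2': 3}
```

An actual XWeB query would state the same roll-up declaratively in XQuery against the test warehouse.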
Benchmarking Summarizability Processing in XML Warehouses with Complex Hierarchies
Business Intelligence plays an important role in decision making. Based on
data warehouses and Online Analytical Processing, a business intelligence tool
can be used to analyze complex data. Still, summarizability issues in data
warehouses cause ineffective analyses that may become critical problems to
businesses. To settle this issue, many researchers have studied and proposed
various solutions, both in relational and XML data warehouses. However, they
find difficulty in evaluating the performance of their proposals since the
available benchmarks lack complex hierarchies. In order to contribute to
summarizability analysis, this paper proposes an extension to the XML warehouse
benchmark (XWeB) with complex hierarchies. The benchmark enables us to generate
XML data warehouses with scalable complex hierarchies as well as
summarizability processing. We experimentally demonstrated that complex
hierarchies can indeed be included in a benchmark dataset, and that our
benchmark is able to compare two alternative approaches dealing with
summarizability issues. (15th International Workshop on Data Warehousing and
OLAP, DOLAP 2012, Maui, United States, 2012.)
Benchmarking Big Data OLAP NoSQL Databases
With the advent of Big Data, new challenges have emerged regarding the evaluation of decision support systems (DSS). Existing evaluation benchmarks are not configured to handle a massive data volume and wide data diversity. In this paper, we introduce a new DSS benchmark that supports multiple data storage systems, such as relational and Not Only SQL (NoSQL) systems. Our scheme recognizes numerous data models (snowflake, star and flat topologies) and several data formats (CSV, JSON, TBL, XML, etc.). It entails complex data generation characterized within the “volume, variety, and velocity” (3V) framework. Our scheme also enables distributed and parallel data generation. Furthermore, we exhibit experimental results obtained with KoalaBench.
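As a hedged sketch of what multi-format output means in practice (the field names and records below are invented, not the benchmark's schema), the same in-memory fact rows can be serialized to two of the listed formats, CSV and JSON:

```python
import csv
import io
import json

# Invented star-topology fact rows; not KoalaBench's actual schema.
rows = [
    {"order_id": 1, "customer": "C42", "amount": 99.5},
    {"order_id": 2, "customer": "C7", "amount": 12.0},
]

def to_csv(records):
    """Serialize records to CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def to_json(records):
    """Serialize the same records to JSON."""
    return json.dumps(records)
```

Generating every target format from one shared model is what keeps the datasets comparable across storage systems.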
Data generator for evaluating ETL process quality
Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is a notoriously labor-intensive task that needs to take various combinations of process parameters into account. More importantly, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing the plethora of possible test cases. To facilitate this demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework and here we report our experimental findings showing the effectiveness and scalability of our approach.
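One core constraint such generation must respect is operation selectivity: the fraction of input rows an operation lets through. A minimal sketch of generating data for a target selectivity (a single numeric column and an invented threshold predicate, not Bijoux's actual algorithm):

```python
import random

def generate_for_selectivity(n, selectivity, threshold=100, seed=0):
    """Generate n values so that exactly round(n * selectivity) of them
    satisfy the filter predicate `value < threshold`."""
    rng = random.Random(seed)
    k = round(n * selectivity)
    rows = ([rng.randrange(0, threshold) for _ in range(k)]                # pass the filter
            + [rng.randrange(threshold, 2 * threshold) for _ in range(n - k)])  # fail it
    rng.shuffle(rows)
    return rows

rows = generate_for_selectivity(1000, 0.3)
```

Repeating this per operation, over the constraints extracted from the process model, is the general shape of the approach.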
Implementing Multidimensional Data Warehouses into NoSQL
Not only SQL (NoSQL) databases are becoming increasingly popular and have some interesting strengths, such as scalability and flexibility. In this paper, we investigate the use of NoSQL systems for implementing OLAP (On-Line Analytical Processing) systems. More precisely, we are interested in instantiating OLAP systems (from the conceptual level to the logical level) and instantiating an aggregation lattice (optimization). We define a set of rules to map star schemas into two NoSQL models: column-oriented and document-oriented. The experimental part is carried out using the reference benchmark TPC. Our experiments show that our rules can effectively instantiate such systems (star schema and lattice). We also analyze differences between the two NoSQL systems considered. In our experiments, HBase (column-oriented) happens to be faster than MongoDB (document-oriented) in terms of loading time.
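The flavor of such mapping rules can be sketched as follows; the schema, keys, and column-family names are invented for illustration and are not the paper's actual rule set:

```python
# Hypothetical star-schema instance: one fact row with foreign keys into
# two dimension tables.
fact = {"sale_id": 1, "qty": 5, "date_fk": "D1", "prod_fk": "P1"}
dims = {
    "date": {"D1": {"day": 1, "month": "Jan"}},
    "product": {"P1": {"name": "widget", "brand": "Acme"}},
}

def to_document(fact, dims):
    """Document-oriented mapping: embed each dimension as a sub-document."""
    doc = {"sale_id": fact["sale_id"], "qty": fact["qty"]}
    doc["date"] = dims["date"][fact["date_fk"]]
    doc["product"] = dims["product"][fact["prod_fk"]]
    return doc

def to_column_families(fact, dims):
    """Column-oriented mapping: one family per dimension plus a measures
    family, with family-qualified column names as in an HBase row."""
    cells = {"measures:qty": fact["qty"]}
    for attr, value in dims["date"][fact["date_fk"]].items():
        cells["date:" + attr] = value
    for attr, value in dims["product"][fact["prod_fk"]].items():
        cells["product:" + attr] = value
    return {"row_key": fact["sale_id"], "cells": cells}
```

Either target denormalizes the star join away, which is what makes fact scans cheap in the resulting NoSQL store.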
A New Big Data Benchmark for OLAP Cube Design Using Data Pre-Aggregation Techniques
In recent years, several new technologies have enabled OLAP processing over Big Data sources. Among these technologies, we highlight those that allow data pre-aggregation because of their demonstrated performance in data querying. This is the case of Apache Kylin, a Hadoop-based technology that supports sub-second queries over fact tables with billions of rows combined with ultra-high-cardinality dimensions. However, taking advantage of data pre-aggregation techniques to design analytic models for Big Data OLAP is not a trivial task. It requires very advanced knowledge of the underlying technologies and user querying patterns. A wrong design of the OLAP cube significantly affects several key performance metrics, including: (i) the analytic capabilities of the cube (time and ability to provide an answer to a query), (ii) the size of the OLAP cube, and (iii) the time required to build the OLAP cube. Therefore, in this paper we (i) propose a benchmark to aid Big Data OLAP designers in choosing the most suitable cube design for their goals, (ii) identify and describe the main requirements and trade-offs for effectively designing a Big Data OLAP cube taking advantage of data pre-aggregation techniques, and (iii) validate our benchmark in a case study. This work was funded by the ECLIPSE project (RTI2018-094283-B-C32) of the Spanish Ministry of Science, Innovation and Universities.
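Why cube size and build time are such sensitive metrics follows from the aggregation lattice itself: a full cube over d dimensions has 2^d cuboids, one per group-by combination, so pre-aggregation cost grows exponentially with the dimensions included. A small sketch (dimension names invented):

```python
from itertools import combinations

def cuboids(dimensions):
    """Enumerate every group-by combination in a full aggregation
    lattice: 2 ** len(dimensions) cuboids in total, from the apex
    (no dimensions) down to the base (all dimensions)."""
    for r in range(len(dimensions) + 1):
        yield from combinations(dimensions, r)

lattice = list(cuboids(["time", "product", "region"]))
len(lattice)  # 2 ** 3 == 8
```

Choosing which of these cuboids to materialize is exactly the design trade-off the benchmark is meant to expose.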
A Comparison of Leading Database Storage Engines in Support of Online Analytical Processing in an Open Source Environment
Online Analytical Processing (OLAP) has become the de facto data analysis technology used in modern decision support systems. It has experienced tremendous growth, and is among the top priorities for enterprises. Open source systems have become an effective alternative to proprietary systems in terms of cost and function. The purpose of the study was to investigate the performance of two leading database storage engines in an open source OLAP environment. Despite recent upgrades in performance features for the InnoDB database engine, the MyISAM database engine is shown to outperform the InnoDB database engine under a standard benchmark. This result was demonstrated in tests that included concurrent user sessions as well as asynchronous user sessions using data sets ranging from 6GB to 12GB. Although MyISAM outperformed InnoDB in all tests performed, InnoDB provides ACID-compliant transaction technologies that are beneficial in a hybrid OLAP/OLTP system.
Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads
Filtering data based on predicates is one of the most fundamental operations
for any modern data warehouse. Techniques to accelerate the execution of filter
expressions include clustered indexes, specialized sort orders (e.g., Z-order),
multi-dimensional indexes, and, for high selectivity queries, secondary
indexes. However, these schemes are hard to tune and their performance is
inconsistent. Recent work on learned multi-dimensional indexes has introduced
the idea of automatically optimizing an index for a particular dataset and
workload. However, the performance of that work suffers in the presence of
correlated data and skewed query workloads, both of which are common in real
applications. In this paper, we introduce Tsunami, which addresses these
limitations to achieve up to 6X faster query performance and up to 8X smaller
index size than existing learned multi-dimensional indexes, in addition to up
to 11X faster query performance and 170X smaller index size than
optimally-tuned traditional indexes.
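Of the specialized sort orders mentioned above, Z-ordering is the easiest to illustrate: interleave the bits of each coordinate so that points close in several dimensions tend to stay close in the resulting one-dimensional order. A minimal 2-D sketch (unrelated to Tsunami's learned layout):

```python
def morton2(x, y, bits=16):
    """Z-order (Morton) key for a 2-D point: bit i of x goes to
    position 2*i of the key, bit i of y to position 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

morton2(3, 3)  # 15: the four low bits of the key are all set
```

Sorting rows by this key is a static layout decision; learned indexes like Tsunami instead adapt the layout to the observed data and query distribution.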