
    XWeB: the XML Warehouse Benchmark

    With the emergence of XML as a standard for representing business data, new decision support applications are being developed. These XML data warehouses aim at supporting On-Line Analytical Processing (OLAP) operations that manipulate irregular XML data. To ensure the feasibility of these new tools, important performance issues must be addressed. Performance is customarily assessed with the help of benchmarks. However, decision support benchmarks do not currently support XML features. In this paper, we introduce the XML Warehouse Benchmark (XWeB), which aims at filling this gap. XWeB derives from the relational decision support benchmark TPC-H. It is mainly composed of a test data warehouse that is based on a unified reference model for XML warehouses and features XML-specific structures, together with its associated XQuery decision support workload. XWeB's usage is illustrated by experiments on several XML database management systems.
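    The workload itself is specified in XQuery; as a rough illustration only, the sketch below performs the same kind of roll-up (grouping facts by a dimension and summing a measure) in Python over a tiny, made-up XML fact document. Element and attribute names are hypothetical and not taken from XWeB.

```python
# A minimal sketch, not the XWeB workload: group hypothetical sales facts
# by a dimension attribute and sum a measure, the basic OLAP roll-up pattern.
import xml.etree.ElementTree as ET
from collections import defaultdict

xml = """
<warehouse>
  <Sale customer="C1" nation="FR" amount="120.0"/>
  <Sale customer="C2" nation="FR" amount="80.5"/>
  <Sale customer="C3" nation="DE" amount="200.0"/>
</warehouse>
"""

root = ET.fromstring(xml)
totals = defaultdict(float)
for sale in root.findall(".//Sale"):          # group facts by the 'nation' dimension
    totals[sale.get("nation")] += float(sale.get("amount"))

for nation, total in sorted(totals.items()):
    print(nation, round(total, 2))            # DE 200.0, then FR 200.5
```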

    Benchmarking Summarizability Processing in XML Warehouses with Complex Hierarchies

    Business Intelligence plays an important role in decision making. Based on data warehouses and Online Analytical Processing, a business intelligence tool can be used to analyze complex data. Still, summarizability issues in data warehouses cause ineffective analyses that may become critical problems for businesses. To address this issue, many researchers have studied and proposed various solutions, both in relational and XML data warehouses. However, they find it difficult to evaluate the performance of their proposals, since the available benchmarks lack complex hierarchies. In order to contribute to summarizability analysis, this paper proposes an extension to the XML warehouse benchmark (XWeB) with complex hierarchies. The benchmark enables us to generate XML data warehouses with scalable complex hierarchies as well as summarizability processing. We experimentally demonstrated that complex hierarchies can indeed be included in a benchmark dataset, and that our benchmark is able to compare two alternative approaches dealing with summarizability issues. (Presented at the 15th International Workshop on Data Warehousing and OLAP (DOLAP 2012), Maui, United States, 2012.)
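    To make the summarizability problem concrete, the following minimal Python sketch (not taken from the benchmark) shows how a non-strict hierarchy, in which one member rolls up to two parents, makes a naive roll-up double-count a measure.

```python
# A non-strict hierarchy: product P1 rolls up to two categories, so a naive
# roll-up counts its measure twice. Names and values are made up.
facts = [("P1", 100.0), ("P2", 50.0)]                      # (product, amount)
parents = {"P1": ["Food", "Organic"], "P2": ["Food"]}      # product -> categories

rollup = {}
for product, amount in facts:
    for category in parents[product]:
        rollup[category] = rollup.get(category, 0.0) + amount

print(rollup)                # {'Food': 150.0, 'Organic': 100.0}
print(sum(rollup.values()))  # 250.0, although the facts only total 150.0
```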

    DIPBench: An Independent Benchmark for Data-Intensive Integration Processes

    The integration of heterogeneous data sources is one of the main challenges within the area of data engineering. Due to the absence of an independent and universal benchmark for data-intensive integration processes, we propose a scalable benchmark, called DIPBench (Data-intensive integration Process Benchmark), for evaluating the performance of integration systems. The benchmark can be applied to subscription systems, such as replication servers and distributed and federated DBMSs, and to message-oriented middleware platforms, such as Enterprise Application Integration (EAI) servers and Extraction-Transformation-Loading (ETL) tools. In order to reach this universal view of integration processes, the benchmark is designed in a conceptual, process-driven way. The benchmark comprises 15 integration process types. We specify the source and target data schemas and provide a tool suite for initializing the external systems, executing the benchmark, and monitoring the integration system's performance. The core benchmark execution may be influenced by three scale factors. Finally, we discuss the metric used for evaluating the measured performance of the integration system, and we illustrate our reference benchmark implementation for federated DBMSs.
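    As a loose illustration of the driver idea (process types executed under scale factors, with elapsed times recorded and summarized), the sketch below is a hypothetical harness; the scale-factor names, the placeholder process body and the summary figure are assumptions, not the DIPBench specification.

```python
# Hypothetical harness: run 15 integration process types under some scale
# factors and record elapsed times. The scale-factor names, the placeholder
# process body and the summary figure are assumptions, not the DIPBench spec.
import time

def run_process(process_id: int, data_volume: int) -> None:
    # Stand-in for invoking the system under test (ETL tool, EAI server, ...).
    time.sleep(data_volume / 1_000_000)

scale_factors = {"data_volume": 10_000, "parallelism": 2, "time_interval": 1}

elapsed = {}
for process_id in range(1, 16):               # the benchmark defines 15 process types
    start = time.perf_counter()
    run_process(process_id, scale_factors["data_volume"])
    elapsed[process_id] = time.perf_counter() - start

print("total elapsed (s):", round(sum(elapsed.values()), 3))
```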

    Data quality problems in TPC-DI based data integration processes

    Many data-driven organisations need to integrate data from multiple, distributed and heterogeneous sources for advanced data analysis. A data integration system is an essential component for collecting data into a data warehouse or other data analytics systems. Various alternative data integration systems exist, whether built in-house or provided by vendors, so it is necessary for an organisation to compare and benchmark them when choosing a suitable one to meet its requirements. Recently, TPC-DI was proposed as the first industry benchmark for evaluating data integration systems. When using this benchmark, we found typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which could delay or even cause the data integration process to fail. This paper explains the processes of the benchmark and summarises the typical data quality problems identified in the TPC-DI data source. Furthermore, in order to prevent data quality problems and proactively manage data quality, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management when using the TPC-DI benchmark.
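    One concrete form such a guideline can take is a schema-consistency check run on a source extract before loading; the Python sketch below is a hypothetical example, and the file layout and expected column names are illustrative rather than part of TPC-DI.

```python
# Hypothetical schema check on a source extract before loading; the expected
# column set and file name are illustrative, not part of TPC-DI.
import csv

EXPECTED_COLUMNS = {"customer_id", "name", "dob", "phone"}

def check_schema(path: str) -> list[str]:
    """Return a list of schema problems found in the CSV header of `path`."""
    with open(path, newline="") as f:
        header = set(next(csv.reader(f)))
    issues = []
    if EXPECTED_COLUMNS - header:
        issues.append(f"missing columns: {sorted(EXPECTED_COLUMNS - header)}")
    if header - EXPECTED_COLUMNS:
        issues.append(f"unexpected columns: {sorted(header - EXPECTED_COLUMNS)}")
    return issues

# Example usage (file name is hypothetical):
# for problem in check_schema("customer_extract.csv"):
#     print("data quality issue:", problem)
```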

    Benchmarking Big Data OLAP NoSQL Databases

    With the advent of Big Data, new challenges have emerged regarding the evaluation of decision support systems (DSS). Existing evaluation benchmarks are not configured to handle massive data volume and wide data diversity. In this paper, we introduce a new DSS benchmark that supports multiple data storage systems, such as relational and Not Only SQL (NoSQL) systems. Our scheme recognizes numerous data models (snowflake, star and flat topologies) and several data formats (CSV, JSON, TBL, XML, etc.). It entails complex data generation characterized within the "volume, variety, and velocity" (3V) framework. Moreover, our scheme enables distributed and parallel data generation. Finally, we present experimental results obtained with KoalaBench.
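    As a rough sketch of what distributed, multi-format generation can look like, the Python fragment below writes the same made-up fact chunks both as JSON (NoSQL-friendly) and as CSV (relational-friendly) across parallel workers; the schema and file layout are assumptions, not KoalaBench's generator.

```python
# Hypothetical generator: each worker writes one chunk of made-up sales facts
# in two formats, JSON (NoSQL-friendly) and CSV (relational-friendly).
import csv, json, random
from concurrent.futures import ProcessPoolExecutor

def generate_chunk(chunk_id: int, rows: int) -> None:
    facts = [{"sale_id": chunk_id * rows + i,
              "store": f"S{random.randint(1, 10)}",
              "amount": round(random.uniform(1, 500), 2)}
             for i in range(rows)]
    with open(f"facts_{chunk_id}.json", "w") as f:
        json.dump(facts, f)
    with open(f"facts_{chunk_id}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["sale_id", "store", "amount"])
        writer.writeheader()
        writer.writerows(facts)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:       # generate chunks in parallel
        for chunk_id in range(4):
            pool.submit(generate_chunk, chunk_id, 1_000)
```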

    Cost-based Optimization of Multistore Query Plans

    Multistores are data management systems that enable query processing across different and heterogeneous databases; besides the distribution of data, complexity factors like schema heterogeneity and data replication must be resolved through integration and data fusion activities. Our multistore solution relies on a dataspace to provide the user with an integrated view of the available data and enables the formulation and execution of GPSJ (generalized projection, selection, and join) queries. In this paper, we propose a technique to optimize the execution of GPSJ queries by formulating and evaluating different execution plans on the multistore. In particular, we outline different strategies to carry out joins and data fusion by relying on different schema representations; then, a self-learning black-box cost model is used to estimate execution times and select the most efficient plan. The experiments assess the effectiveness of the cost model in choosing the best execution plan for the given queries and rely on multiple multistore benchmarks to investigate the factors that influence the performance of different plans.
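    The plan-selection idea can be sketched as follows: a black-box regression model trained on previously observed executions estimates the running time of each candidate plan, and the cheapest one is chosen. The plan features, training data and plan names below are invented for illustration and do not reproduce the paper's cost model.

```python
# Hypothetical plan selection with a black-box cost model: a regression fitted
# on past executions ranks candidate plans. Features and numbers are invented.
from sklearn.linear_model import LinearRegression

# Observed history: [n_joins, rows_scanned, n_fusion_steps] -> seconds.
history_X = [[1, 10_000, 0], [2, 50_000, 1], [3, 200_000, 2], [2, 120_000, 0]]
history_y = [0.4, 2.1, 9.5, 4.0]
cost_model = LinearRegression().fit(history_X, history_y)

candidate_plans = {
    "join on relational store": [2, 80_000, 1],
    "join on document store":   [3, 60_000, 2],
    "broadcast then fuse":      [1, 150_000, 1],
}
estimates = {name: float(cost_model.predict([feats])[0])
             for name, feats in candidate_plans.items()}

print({name: round(cost, 2) for name, cost in estimates.items()})
print("selected plan:", min(estimates, key=estimates.get))
```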

    Business Intelligence on Non-Conventional Data

    The revolution in digital communications witnessed over the last decade has had a significant impact on the world of Business Intelligence (BI). In the big data era, the amount and diversity of data that can be collected and analyzed for the decision-making process transcend the restricted and structured set of internal data that BI systems are conventionally limited to. This thesis investigates the unique challenges imposed by three specific categories of non-conventional data: social data, linked data and schemaless data. Social data comprises the user-generated content published through websites and social media, which can provide a fresh and timely perception of people’s tastes and opinions. In Social BI (SBI), the analysis focuses on topics, meant as specific concepts of interest within the subject area. In this context, this thesis proposes the meta-star, an alternative strategy to the traditional star schema for modeling hierarchies of topics to enable OLAP analyses. The thesis also presents an architectural framework of a real SBI project and a cross-disciplinary benchmark for SBI. Linked data employ the Resource Description Framework (RDF) to provide a public network of interlinked, structured, cross-domain knowledge. In this context, this thesis proposes an interactive and collaborative approach to build aggregation hierarchies from linked data. Schemaless data refers to the storage of data in NoSQL databases that do not force a predefined schema, but let database instances embed their own local schemata. In this context, this thesis proposes an approach to determine the schema profile of a document-based database; the goal is to facilitate users in a schema-on-read analysis process by understanding the rules that drove the usage of the different schemata. A final and complementary contribution of this thesis is an innovative technique in the field of recommendation systems to overcome user disorientation in the analysis of a large and heterogeneous wealth of data.
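    For the schemaless-data contribution, a minimal sketch of schema profiling (assuming a simple, flat notion of a document's schema, which is a simplification of the thesis's approach) is to group the documents of a collection by the set of fields they actually use and count each variant:

```python
# Hypothetical schema profile of a schemaless collection: count how many
# documents use each distinct set of fields. The documents are made up.
from collections import Counter

documents = [
    {"user": "a", "ts": 1, "text": "hi"},
    {"user": "b", "ts": 2, "text": "hello", "geo": [45.4, 10.9]},
    {"user": "c", "ts": 3, "text": "hey"},
]

profile = Counter(tuple(sorted(doc)) for doc in documents)
for schema, count in profile.most_common():
    print(count, "documents with fields", schema)
```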

    Process management (BPM) using continuous improvement and business process reengineering

    This work aims to present the process-management paradigm, which plays a central role in organizational strategy. This concept, known today as BPM (Business Process Management), consists of two fundamental parts: management and technology. The present work emphasizes the first part, presenting two methodologies required for management, continuous improvement and reengineering, and applying them to two cases in real organizations (Telecom and DataSec). -- Keywords: processes, Business Process Management (BPM), technology, continuous improvement, reengineering.