    A Generalized Approach to Optimization of Relational Data Warehouses Using Hybrid Greedy and Genetic Algorithms

    To our knowledge, the open scientific literature offers no generalized framework for the optimization of relational data warehouses that covers view selection, index selection, and vertical view fragmentation. In this paper we offer such a framework. We propose a formalized multidimensional model, based on relational schemas, which provides complete vertical view fragmentation and presents an approach for transforming a fragmented snowflake schema into a defragmented star schema through denormalization. We define the generalized system of relational data warehouse optimization by including vertical fragmentation of the implementation schema (F), indexes (I), and view selection (S) for materialization. We consider the Genetic Algorithm as an optimization method and introduce the technique of "recessive bits" for handling the infeasible solutions that a Genetic Algorithm produces. We also present two novel hybrid algorithms that combine Greedy and Genetic Algorithms. Finally, we present our experimental results, which show the performance improvements and benefits of the generalized approach (SFI) and show that our novel algorithms significantly improve the efficiency of the optimization process for different input parameters.

    Query-Time Data Integration

    Today, data is collected in ever-increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods are compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent but mutually diverse alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.


    A common problem with OnLine Analytical Processing (OLAP) databases is data explosion: data size multiplies when source data is loaded into multidimensional cubes. Data explosion is not an issue for small databases, but it can be a serious problem for large ones. In this paper we discuss sparsity and the data explosion phenomenon in the multidimensional data model, which lies at the core of OLAP systems. Our study of five companies in different lines of business confirms the observation that, in practice, most cubes are extremely sparse. We also consider the different methods that relational and multidimensional servers apply to reduce data explosion and sparsity, such as compression and indexing techniques, partitioning, and preliminary aggregation.
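
    As a rough illustration of the sparsity and data explosion described above, the following sketch compares the number of populated cells with the size of the full cross product of the dimensions. The dimension sizes, the fact-row count, and the function names are hypothetical and are not taken from the paper; the second function assumes the common convention of adding one "ALL" member per dimension to count the cells of a fully pre-aggregated cube.

```python
from math import prod


def cube_density(dimension_sizes, populated_cells):
    """Fraction of cells in the full cross product that actually hold data."""
    return populated_cells / prod(dimension_sizes)


def cells_with_aggregates(dimension_sizes):
    """Cell count of a fully pre-aggregated cube.

    Assumption: each dimension gains one extra 'ALL' member, so the
    potential cell count is the product of (size + 1) over all dimensions.
    """
    return prod(size + 1 for size in dimension_sizes)


# Hypothetical cube: 1,000 products x 365 days x 200 stores,
# loaded from 2,000,000 fact rows (one populated cell per row).
dimensions = [1000, 365, 200]
density = cube_density(dimensions, 2_000_000)
sparsity = 1.0 - density
potential_cells = cells_with_aggregates(dimensions)
```

    With these assumed numbers, fewer than 3% of the base cells are populated, while the space of potential cells (including aggregates) is far larger than the number of source rows, which is the gap that compression, partitioning, and selective pre-aggregation try to control.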

    Development of new data partitioning and allocation algorithms for query optimization of distributed data warehouse systems

    Distributed databases, and distributed data warehousing in particular, are becoming an increasingly important technology for information integration and data analysis. Data Warehouse (DW) systems are used by decision makers for performance measurement and decision support. However, although data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, OLAP query response time is strongly affected by the volume of data that needs to be accessed from storage disks. Data partitioning is one of the physical design techniques that may be used to optimize query processing cost in DWs. It is a non-redundant optimization technique because it does not replicate data, in contrast to redundant techniques such as materialized views and indexes. The warehouse partitioning problem is concerned with determining the set of dimension tables to be partitioned and using them to generate the fact table fragments. In this work, an enhanced grouping algorithm that avoids the limitations of some existing vertical partitioning algorithms is proposed. Furthermore, a static partitioning algorithm that allows fragmentation at early stages of schema design is presented. The thesis also investigates the performance of the data warehouse after applying a combination of Genetic Algorithm (GA) and Simulated Annealing (SA) techniques to horizontally partition the data warehouse star schema. It then presents the experimentation and implementation results of the proposed algorithm. This research presents different approaches to optimizing data fragment allocation cost, using a greedy mathematical model and a combination of simulated annealing and genetic algorithm to determine the site-by-site allocation leading to optimal solutions for fragment distribution. Throughout this thesis, the terms fragmentation and partitioning are used interchangeably.


    Selecting a proper set of views to materialize plays an important role in database performance. Many view selection methods exist, using different techniques and frameworks to select an efficient set of views for materialization. In this paper, we present a new efficient, scalable method for view selection under given storage constraints using a tree mining approach and evolutionary optimization. The tree mining algorithm is designed to determine the exact frequency of (sub)queries in the historical SQL dataset. The query cost model aims to maximize the performance benefit of the final view set, which is derived from the frequent view set produced by the tree mining algorithm. The performance benefit of a query is defined as a function of query frequency, query creation cost, and query maintenance cost. The experimental results show that the proposed method succeeds in recommending a solution that is fairly close to the optimal solution.
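
    To make the storage-constrained selection problem concrete, here is a minimal sketch of a greedy baseline. It is not the tree mining and evolutionary method of the paper. The abstract does not give the benefit function, so the formula used here (frequency times creation cost, minus maintenance cost) is purely an assumption, as are the `CandidateView` fields and the sample numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateView:
    name: str
    frequency: int           # how often the (sub)query occurs in the workload
    creation_cost: float     # cost of computing the query result from scratch
    maintenance_cost: float  # cost of keeping the materialized view up to date
    size: float              # storage needed; assumed to be strictly positive


def benefit(view):
    # Assumed benefit function (not from the paper): every occurrence of the
    # query saves its creation cost, at the price of maintaining the view.
    return view.frequency * view.creation_cost - view.maintenance_cost


def select_views(candidates, storage_budget):
    """Greedy baseline: pick views by benefit per unit of storage."""
    chosen, used = [], 0.0
    ranked = sorted(candidates, key=lambda v: benefit(v) / v.size, reverse=True)
    for view in ranked:
        if benefit(view) > 0 and used + view.size <= storage_budget:
            chosen.append(view)
            used += view.size
    return chosen


# Hypothetical candidates and budget.
candidates = [
    CandidateView("v1", frequency=50, creation_cost=4.0, maintenance_cost=20.0, size=10.0),
    CandidateView("v2", frequency=5, creation_cost=30.0, maintenance_cost=100.0, size=40.0),
    CandidateView("v3", frequency=20, creation_cost=10.0, maintenance_cost=50.0, size=30.0),
]
selected = select_views(candidates, storage_budget=45.0)
```

    A greedy pass like this can miss the best combination of views under a tight budget, which is one reason to search the space with an evolutionary method instead.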

    Automatic physical database design : recommending materialized views

    This work discusses physical database design while focusing on the problem of selecting materialized views for improving the performance of a database system. We first address the satisfiability and implication problems for mixed arithmetic constraints. The results are used to support the construction of a search space for view selection problems. We propose an approach for constructing a search space based on identifying maximum commonalities among queries and on rewriting queries using views. These commonalities are used to define candidate views for materialization from which an optimal or near-optimal set can be chosen as a solution to the view selection problem. Using a search space constructed this way, we address a specific instance of the view selection problem that aims at minimizing the view maintenance cost of multiple materialized views using multi-query optimization techniques. Further, we study this same problem in the context of a commercial database management system in the presence of memory and time restrictions. We also suggest a heuristic approach for maintaining the views while guaranteeing that the restrictions are satisfied. Finally, we consider a dynamic version of the view selection problem where the workload is a sequence of query and update statements. In this case, the views can be created (materialized) and dropped during the execution of the workload. We have implemented our approaches to the dynamic view selection problem and performed extensive experimental testing. Our experiments show that, in most cases, our approaches perform better than previous ones in terms of effectiveness and efficiency.

    Multi-Objective Materialized View Selection in Data-Intensive Flows

    In this thesis we present Forge, a tool for automating multi-objective materialization of intermediate results in data-intensive flows, driven by a set of different quality objectives. We report initial evaluation results, showing the feasibility and efficiency of our approach

    Keyword search in graphs, relational databases and social networks

    Keyword search, a well known mechanism for retrieving relevant information from a set of documents, has recently been studied for extracting information from structured data (e.g., relational databases and XML documents). It offers an alternative way to query languages (e.g., SQL) to explore databases, which is effective for lay users who may not be familiar with the database schema or the query language. This dissertation addresses some issues in keyword search in structured data. Namely, novel solutions to existing problems in keyword search in graphs or relational databases are proposed. In addition, a problem related to graph keyword search, team formation in social networks, is studied. The dissertation consists of four parts. The first part addresses keyword search over a graph which finds a substructure of the graph containing all or some of the query keywords. Current methods for keyword search over graphs may produce answers in which some content nodes (i.e., nodes that contain input keywords) are not very close to each other. In addition, current methods explore both content and non-content nodes while searching for the result and are thus both time and memory consuming for large graphs. To address the above problems, we propose algorithms for finding r-cliques in graphs. An r-clique is a group of content nodes that cover all the input keywords and the distance between each pair of nodes is less than or equal to r. Two approximation algorithms that produce r-cliques with a bounded approximation ratio in polynomial delay are proposed. In the second part, the problem of duplication-free and minimal keyword search in graphs is studied. Current methods for keyword search in graphs may produce duplicate answers that contain the same set of content nodes. In addition, an answer found by these methods may not be minimal in the sense that some of the nodes in the answer may contain query keywords that are all covered by other nodes in the answer. 
Removing these nodes does not change the coverage of the answer but can make the answer more compact. We define the problem of finding duplication-free and minimal answers, and propose algorithms for finding such answers efficiently. Meaningful keyword search in relational databases is the subject of the third part of this dissertation. Keyword search over relational databases returns a join tree spanning tuples containing the query keywords. As many answers of varying quality can be found, and the user is often only interested in seeing the top-k answers, how to gauge the relevance of answers to rank them is of paramount importance. This becomes more pertinent for databases with large and complex schemas. We focus on the relevance of join trees as the fundamental means to rank the answers. We devise means to measure relevance of relations and foreign keys in the schema over the information content of the database. The problem of keyword search over graph data is similar to the problem of team formation in social networks. In this setting, keywords represent skills and the nodes in a graph represent the experts that possess skills. Given an expert network, in which a node represents an expert that has a cost for using the expert service and an edge represents the communication cost between the two corresponding experts, we tackle the problem of finding a team of experts that covers a set of required skills and also minimizes the communication cost as well as the personnel cost of the team. We propose two types of approximation algorithms to solve this bi-criteria problem in the fourth part of this dissertation.
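
    The r-clique definition given in the first part of this abstract can be checked directly. The sketch below only verifies whether a given node set is an r-clique; it is not one of the approximation algorithms proposed in the dissertation. It assumes an unweighted, undirected graph stored as an adjacency dictionary and measures distance in hops, whereas the original work may use weighted distances. All names and the sample graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def is_r_clique(graph, node_keywords, nodes, query_keywords, r):
    """True if `nodes` cover all query keywords and are pairwise within r hops."""
    covered = set()
    for node in nodes:
        covered |= node_keywords.get(node, set())
    if not set(query_keywords) <= covered:
        return False
    for a, b in combinations(nodes, 2):
        d = bfs_distances(graph, a).get(b)
        if d is None or d > r:  # unreachable or too far apart
            return False
    return True


# Hypothetical path graph a - x - b - c; x is a non-content node.
graph = {"a": ["x"], "x": ["a", "b"], "b": ["x", "c"], "c": ["b"]}
node_keywords = {"a": {"olap"}, "b": {"index"}, "c": {"view"}}

two_nodes_ok = is_r_clique(graph, node_keywords, ["a", "b"], {"olap", "index"}, r=2)
three_nodes_ok = is_r_clique(graph, node_keywords, ["a", "b", "c"], {"olap", "index", "view"}, r=2)
```

    In this example, nodes a and b form a 2-clique for the keywords "olap" and "index", while adding c breaks the bound because a and c are three hops apart.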