
    Automatic Physical Design for XML Databases

    Database systems employ physical structures such as indexes and materialized views to improve query performance, potentially by orders of magnitude. It is therefore important for a database administrator to choose the appropriate configuration of these physical structures (i.e., the appropriate physical design) for a given database. Deciding on the physical design of a database is not an easy task, and a considerable amount of research exists on automatic physical design tools for relational databases. XML database systems are increasingly used to manage highly structured XML data, and support for XML data is being added to commercial relational database systems. This raises the important question of how to choose the appropriate physical design (i.e., the appropriate set of physical structures) for an XML database. Relational automatic physical design tools are not adequate for this task, so new research is needed in this area. In this thesis, we address the problem of automatic physical design for XML databases, which is the process of automatically selecting the best set of physical structures for a given database and a given query workload representing the client application's usage patterns of this data. We focus on recommending two types of physical structures: XML indexes and relational materialized views of XML data. For each of these structures, we study the recommendation process and present a design advisor that automatically recommends a configuration of physical structures given an XML database and a workload of XML queries. 
The recommendation process is divided into four main phases: (1) enumerating candidate physical structures, (2) generalizing candidate structures to generate additional candidates that are useful to queries that do not appear in the given workload but are similar to the workload queries, (3) estimating the benefit of the various candidate structures, and (4) selecting the best set of candidate structures for the given database and workload. We present a design advisor for recommending XML indexes, one for recommending materialized views, and an integrated design advisor that recommends both indexes and materialized views. A key characteristic of our advisors is that they are tightly coupled with the query optimizer of the database system and rely on the optimizer for enumerating and evaluating physical designs whenever possible. This characteristic makes our techniques suitable for any database system that complies with a set of minimum requirements listed in the thesis. We have implemented the index, materialized view, and integrated advisors in a prototype version of IBM DB2 V9, which supports both relational and XML data, and we experimentally demonstrate the effectiveness of their recommendations using this implementation.
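The four phases described in the abstract above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the function names, the path-only workload, the made-up index sizes, the benefit measure (a count of served queries standing in for optimizer cost estimates), and the greedy benefit-per-size selection rule.

```python
def enumerate_candidates(workload):
    """Phase 1: one candidate index pattern per distinct path in the workload."""
    return sorted(set(workload))

def generalize(patterns):
    """Phase 2: add a wildcard variant of each pattern (last step replaced by '*')."""
    general = set(patterns)
    for p in patterns:
        head = p.rpartition("/")[0]
        general.add(head + "/*")
    return sorted(general)

def matches(pattern, query):
    """True if the (hypothetical) index on `pattern` can serve `query`."""
    if pattern.endswith("/*"):
        return query.rpartition("/")[0] == pattern[:-2]
    return pattern == query

def estimate_benefit(pattern, workload, covered):
    """Phase 3: stand-in for an optimizer call; counts not-yet-covered queries served."""
    return sum(1 for q in workload if q not in covered and matches(pattern, q))

def select(patterns, sizes, workload, budget):
    """Phase 4: greedy choice by marginal benefit per unit size within the budget."""
    chosen, covered, used = [], set(), 0
    remaining = list(patterns)
    while True:
        best, best_ratio = None, 0.0
        for p in remaining:
            if used + sizes[p] > budget:
                continue  # candidate does not fit in the remaining space
            ratio = estimate_benefit(p, workload, covered) / sizes[p]
            if ratio > best_ratio:
                best, best_ratio = p, ratio
        if best is None:
            return chosen
        chosen.append(best)
        used += sizes[best]
        covered |= {q for q in workload if matches(best, q)}
        remaining.remove(best)

# Hypothetical workload of three path queries.
workload = [
    "/site/people/person/name",
    "/site/people/person/email",
    "/site/regions/item/price",
]
patterns = generalize(enumerate_candidates(workload))
# Made-up sizes: wildcard indexes are assumed to be larger than exact ones.
sizes = {p: (15 if p.endswith("/*") else 10) for p in patterns}
recommended = select(patterns, sizes, workload, budget=25)
print(recommended)
```

With a budget of 25 units, the sketch picks the generalized `person/*` pattern first because it serves two queries, then fills the remaining space with the exact `price` pattern. A real advisor would replace `estimate_benefit` with cost estimates obtained from the query optimizer.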

    Accelerating data retrieval steps in XML documents

    Techniques efficaces basées sur des vues matérialisées pour la gestion des données du Web (algorithmes et systèmes)

    XML, recommended by the W3C in 1998 as a markup language for device- and system-independent representation of information, is now used as a data model for storing and querying large volumes of data in database systems. Despite significant research and systems development, processing very large amounts of XML data still raises performance problems, owing to the complexity and heterogeneity of the data and to the complexity of current XML query languages. Materialized views have been used in databases for decades to speed up queries. They can be seen as precomputed query results that are reused to avoid recomputing (completely or partially) a new query, and they have been a topic of intensive research, in particular in the context of relational data warehousing. This thesis investigates the applicability of materialized view techniques to optimize the performance of Web data management systems, in particular for XML data and queries in distributed settings. We make three contributions. 
First, we consider the problem of choosing the best views to materialize within a given space budget in order to improve the performance of a query workload. Our work is the first to address the view selection problem for a rich subset of XQuery, extended with the ability to select multiple nodes at multiple levels of granularity. The challenges stem from the expressive power and features of both the query and view languages and from the size of the search space of candidate views to materialize. While the general problem has prohibitive complexity, we propose and study a heuristic algorithm and demonstrate its superior performance compared to the state of the art. 
Second, we consider the management of large XML corpora in peer-to-peer networks based on distributed hash tables (DHTs). We consider the ViP2P platform, in which distributed materialized XML views, defined by arbitrary XML queries, are filled with data published anywhere in the network and exploited to efficiently answer queries issued by any network peer. We contributed important scalability-oriented optimizations, as well as a comprehensive set of experiments deployed in a country-wide WAN. These experiments exceed similar competitor systems by orders of magnitude in terms of data volumes and data dissemination throughput. They are thus the most advanced to date in understanding the performance behavior of DHT-based XML content management in real settings. 
Finally, we present a novel approach for scalable content-based publish/subscribe (pub/sub) in the presence of constraints on the CPU and network resources available to data publishers; this approach is implemented in our Delta platform. We achieve scalability by off-loading subscriptions from the publisher and leveraging view-based query rewriting to feed these subscriptions from the data accumulated in others. Our main contribution is a novel algorithm that organizes subscriptions in a multi-level dissemination network, computed with linear programming techniques, in order to serve large numbers of subscriptions, respect capacity constraints, and minimize latency. The efficiency and effectiveness of our algorithm are confirmed through extensive experiments and a large deployment in a WAN. 
PARIS11-SCD-Bib. électronique (914719901) / Sudoc, France

    Hybrid Database for XML Resource Management

    Although XML has been used in software applications for a considerable amount of time, managing XML files is not a common skill in backend software design. This is primarily because JSON has become a more prevalent file format and is supported by numerous SQL and NoSQL databases. In this thesis, we examine the fundamentals and implementation of a web application that uses a hybrid database, with the goal of determining whether such a database is suitable for managing XML resources. Upon closer examination of the existing architecture, the client discovered a problem with upgrading their project. Further investigation revealed that the current approach of storing XML files in a single folder had serious flaws. As a result, a decision was made to revamp the entire web application, with hybrid databases chosen as the preferred solution because of the application's XML storage concept. There also exists a type of database designed specifically for XML resources, known as native XML databases. The development team therefore reviewed all the requirements provided by the product owner, Niko Siltala, and assessed the suitability of both native XML databases and hybrid databases for the new application. Based on this analysis, we concluded that the hybrid database is the most suitable option for the project. The changes were successfully designed and implemented, and the development team determined that hybrid databases are a viable option for managing a significant number of XML file dependencies. No significant obstacles were encountered that would hinder the use of this type of database. The observed advantages of hybrid databases include streamlined XML file storage, the ability to mix XPath/XQuery in SQL queries, and simplified codebases.

    AIRSPACE PLANNING FOR OPTIMAL CAPACITY, EFFICIENCY, AND SAFETY USING ANALYTICS

    Air Navigation Service Providers (ANSPs) worldwide have made considerable efforts to develop better methods for planning optimal airspace capacity, efficiency, and safety. These goals require separation and sequencing of aircraft before they depart. Prior approaches have achieved these goals tactically to some extent. However, dealing with increasingly congested airspace and new environmental factors with high levels of uncertainty remains a challenge when a deterministic approach is used. Hence, given the nature of these uncertainties, we take a stochastic approach and propose a suite of analytics models for (1) Flight Time Prediction, (2) Aircraft Trajectory Clustering, (3) Aircraft Trajectory Prediction, and (4) Aircraft Conflict Detection and Resolution long before aircraft depart. The suite of data-driven models runs on a scalable Data Management System that continuously processes massive streaming flight data to achieve strategic airspace planning for optimal capacity, efficiency, and safety. 
(1) Flight Time Prediction. Unlike other systems that collect and use features only for the arrival airport to build a data-driven model for predicting flight times, we use a richer set of features along the potential route, such as weather parameters and air traffic data, in addition to those that are particular to the arrival airport. Our feature engineering process generates an extensive set of multidimensional time series data, which goes through Time Series Clustering with Dynamic Time Warping (DTW) to generate a single set of representative features at each time instance. The features are fed into various regression and deep learning models, and the best-performing models with the most accurate ETA predictions are selected. Evaluations on an extensive set of real trajectory, weather, and airport data in Europe verify that our prediction system generates more accurate ETAs with far less variance than those of the European ANSP, EUROCONTROL. This translates to more accurately predicted flight arrival times, enabling airlines to allocate ground resources more cost-effectively and ANSPs to schedule flights more efficiently. 
(2) Aircraft Trajectory Clustering. The novel divide-cluster-merge (DICLERGE) system clusters aircraft trajectories by dividing them into the three standard major flight phases: climb, en-route, and descent. Trajectory segments in each phase are clustered in isolation, then merged together. Our approach also discovers a representative trajectory, the model for the entire trajectory set. 
(3) Aircraft Trajectory Prediction. Our approach considers airspace as a 3D grid network, where each grid point is the location of a weather observation. We hypothetically build cubes around these grid points, so the entire airspace can be considered as a set of cubes. Each cube is defined by its centroid, the original grid point, and associated weather parameters that remain homogeneous within the cube during a period of time. Then, we align raw trajectories to a set of cube centroids, which are fixed 3D positions independent of trajectory data. This creates a new form of trajectories, 4D joint cubes, where each cube is a segment associated not only with spatio-temporal attributes but also with weather parameters. Next, we exploit machine learning techniques to train inference models from historical data and apply a stochastic model, a Hidden Markov Model (HMM), to predict trajectories taking environmental uncertainties into account. During the process, we apply time series clustering to generate input observations from an excessive set of weather parameters to feed into the Viterbi algorithm. The experiments use a real trajectory dataset with pertinent weather observations and demonstrate the effectiveness of our approach to trajectory prediction for Air Traffic Management. 
(4) Aircraft Conflict Detection. We propose a novel data-driven system to address the long-range aircraft conflict detection and resolution (CDR) problem. Given a set of predicted trajectories, the system declares a conflict when the protected zone of an aircraft on its trajectory is infringed upon by another aircraft. The system resolves the conflict by prescribing an alternative solution that is optimized by perturbing at least one of the trajectories involved in the conflict. To achieve this, the system learns from descriptive patterns of historical trajectories and pertinent weather observations and builds a Hidden Markov Model (HMM). Using a variant of the Viterbi algorithm, the system avoids the airspace volume in which the conflict is detected and generates a new optimal trajectory that is conflict-free. The key concept upon which the system is built is the assumption that the airspace is nothing more than a horizontally and vertically concatenated set of spatio-temporal data cubes, where each cube is considered an atomic unit. We evaluate the system using real trajectory datasets with pertinent weather observations from two continents and demonstrate its effectiveness for strategic CDR. 
Overall, in this thesis, we develop a suite of analytics models and algorithms to accurately identify current patterns in massive flight data and use these patterns to predict future behaviors in the airspace. Upon prediction of a non-ideal outcome, we prescribe a solution to plan airspace for optimal capacity, efficiency, and safety.
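The conflict rule stated in the abstract above (a conflict is declared when one aircraft's protected zone is infringed upon by another) can be sketched as a simple check over two predicted trajectories. This is only an illustration under stated assumptions: the function name, the time-step dictionary layout, and the cylinder dimensions (5 NM horizontally, 1000 ft vertically, passed as parameters) are not taken from the thesis, and the HMM-based prediction and resolution steps are not modeled.

```python
import math

def detect_conflicts(traj_a, traj_b, horiz_nm=5.0, vert_ft=1000.0):
    """Return the time steps at which aircraft B is inside A's protected zone.

    Each trajectory maps a time step to (x_nm, y_nm, altitude_ft). The
    protected zone is modeled as a cylinder around A; its default dimensions
    are illustrative parameters, not values from the thesis.
    """
    conflicts = []
    # Only time steps present in both predicted trajectories can be compared.
    for t in sorted(traj_a.keys() & traj_b.keys()):
        xa, ya, za = traj_a[t]
        xb, yb, zb = traj_b[t]
        too_close_horizontally = math.hypot(xa - xb, ya - yb) < horiz_nm
        too_close_vertically = abs(za - zb) < vert_ft
        if too_close_horizontally and too_close_vertically:
            conflicts.append(t)
    return conflicts

# Two hypothetical predicted trajectories sampled at three time steps.
flight_a = {0: (0.0, 0.0, 30000.0), 1: (10.0, 0.0, 30000.0), 2: (20.0, 0.0, 30000.0)}
flight_b = {0: (20.0, 10.0, 30000.0), 1: (12.0, 3.0, 30500.0), 2: (20.0, 0.0, 32000.0)}
conflict_steps = detect_conflicts(flight_a, flight_b)
print(conflict_steps)
```

In this example only time step 1 is flagged: at step 0 the aircraft are far apart horizontally, and at step 2 they share a horizontal position but are separated by 2000 ft vertically.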