
    Comparative Analysis of Decision Tree Algorithms for Data Warehouse Fragmentation

    One of the main problems faced by Data Warehouse designers is fragmentation. Several studies have proposed data mining-based horizontal fragmentation methods; however, no existing horizontal fragmentation technique uses a decision tree. This paper presents an analysis of different decision tree algorithms to select the best one for implementing the fragmentation method. The analysis was performed with version 3.9.4 of Weka, considering four evaluation metrics (Precision, ROC Area, Recall and F-measure) on data sets selected from the Star Schema Benchmark. The results showed that in most cases the two best algorithms were J48 and Random Forest; nevertheless, J48 was selected because it is more efficient at building the model.
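    A minimal sketch of this style of comparison, assuming scikit-learn in place of Weka 3.9.4: DecisionTreeClassifier (CART) stands in for J48 (a C4.5 implementation), a synthetic dataset stands in for the Star Schema Benchmark workloads, and 10-fold cross-validation reports the four metrics named above plus the model-building time used as the tie-breaker.

```python
# Sketch only: scikit-learn instead of Weka, synthetic data instead of the
# Star Schema Benchmark; CART approximates J48 (C4.5).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
metrics = ["precision", "recall", "f1", "roc_auc"]  # the four metrics above

for name, clf in [("J48-like tree", DecisionTreeClassifier(random_state=0)),
                  ("Random Forest", RandomForestClassifier(random_state=0))]:
    scores = cross_validate(clf, X, y, cv=10, scoring=metrics)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in metrics}
    print(name, summary, f"mean fit time {scores['fit_time'].mean():.3f}s")
```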

    Database Optimization Aspects for Information Retrieval

    There is a growing need for systems that can process queries combining both structured data and text. One way to provide such functionality is to integrate information retrieval (IR) techniques into a database management system (DBMS). However, IR and database research have been separate fields for decades, resulting in different - even conflicting - approaches to data management. Each DBMS has a component called the "query optimizer", which plays a crucial role in the efficiency and flexibility of the system. So, for successful integration, the IR techniques and data structures, as well as the DBMS query optimizer, should be adapted to enable mutual cooperation. The author concentrates on top-N queries - a common class of IR queries. An IR top-N query asks for the N best documents given a set of keywords. The author proposes processing the data in batches as a compromise between IR and DBMS query processing. Experiments with this technique show that porting IR optimization techniques is (still) not a promising option due to the additional administrative overhead. Two new mathematical models are introduced to eliminate this overhead: a model that predicts selectivity, a crucial factor in the execution costs, and a model that predicts the quality of the top-N results.
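    A minimal sketch of the batch-wise top-N idea, under illustrative assumptions (scored postings already merged across the query keywords, an arbitrary batch size): each batch is folded into a score accumulator, and the N best documents are extracted at the end.

```python
# Hypothetical shape of batch-wise top-N processing; identifiers, scores and
# the batch size are invented for illustration.
import heapq
from collections import defaultdict

def top_n(postings, n=10, batch_size=1000):
    """postings: iterable of (doc_id, score) pairs for the query keywords."""
    scores = defaultdict(float)

    def apply(batch):
        for doc_id, score in batch:  # fold one batch into the accumulator
            scores[doc_id] += score

    batch = []
    for pair in postings:
        batch.append(pair)
        if len(batch) == batch_size:
            apply(batch)
            batch = []
    apply(batch)  # flush the final, partial batch
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

print(top_n([("d1", 0.5), ("d2", 0.9), ("d1", 0.3), ("d3", 0.1)], n=2))
# -> [('d2', 0.9), ('d1', 0.8)]
```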

    The use of alternative data models in data warehousing environments

    Data Warehouses are increasing their data volume at an accelerated rate; high disk space consumption, slow query response times and complex database administration are common problems in these environments. The lack of a proper data model and an adequate architecture specifically targeted at these environments are the root causes of these problems. Inefficient management of stored data includes duplicate values at the column level and poor handling of data sparsity, which derives from low data density and inflates the final size of Data Warehouses. It has been demonstrated that the Relational Model and relational technology are not the best techniques for managing duplicates and data sparsity. The novelty of this research is to compare several data models with respect to their data density and their data sparsity management in order to optimise Data Warehouse environments. The Binary-Relational, the Associative/Triple Store and the Transrelational models have been investigated, and based on the research results a novel Alternative Data Warehouse Reference architectural configuration has been defined. For the Transrelational model, no database implementation existed, so it was necessary to develop an instantiation of its storage mechanism; as far as could be determined, this is the first public-domain instantiation of the storage mechanism for the Transrelational model.
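    To make the duplicate-management point concrete, here is a minimal sketch of column-level dictionary encoding in the spirit of the Binary-Relational model (it is not the Transrelational storage mechanism the thesis implements): each column stores every distinct value once plus a vector of small integer codes, so heavily duplicated columns shrink.

```python
# Illustrative sketch: duplicate elimination at column level via a dictionary.
def dictionary_encode(column):
    values = sorted(set(column))             # each distinct value stored once
    code_of = {v: i for i, v in enumerate(values)}
    codes = [code_of[v] for v in column]     # one small integer code per row
    return values, codes

country = ["DE", "US", "US", "FR", "DE", "US"]
values, codes = dictionary_encode(country)
print(values)  # ['DE', 'FR', 'US']
print(codes)   # [0, 2, 2, 1, 0, 2]
```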

    Cache Conscious Data Layouting for In-Memory Databases

    Many applications with manually implemented data management exhibit a data storage pattern in which semantically related data items are stored closer in memory than unrelated ones. The strong semantic relationship between these data items commonly induces accesses to them that are close together in time. This is called the principle of data locality; it has been recognized by hardware vendors and is commonly exploited to improve hardware performance. General-purpose Database Management Systems (DBMSs), whose main goal is to simplify optimal data storage and processing, generally fall short of this claim because the usage pattern of the stored data cannot be anticipated when the system is designed. The current interest in column-oriented databases indicates that one strategy does not fit all applications. A DBMS that automatically adapts its storage strategy to the workload of the database promises a significant performance increase by maximizing the benefit of hardware optimizations that are based on the principle of data locality. This thesis gives an overview of optimizations based on the principle of data locality and the effect they have on the data access performance of applications. Based on these findings, a model is introduced that allows the costs of data accesses to be estimated from the arrangement of the data in main memory. This model is evaluated through a series of experiments and incorporated into an automatic layouting component for a DBMS, which allows the calculation of an analytically optimal storage layout. The performance benefits brought by this component are evaluated in an application benchmark.
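    A back-of-the-envelope version of such a cost model, under strong simplifying assumptions (64-byte cache lines, aligned rows at least one line long, a 4-byte attribute scanned in full): it only counts cache lines touched, yet already shows why a columnar arrangement wins for this access pattern.

```python
# Toy locality cost model: count cache lines touched by a full scan of one
# attribute. Sizes are illustrative assumptions, not measured values.
CACHE_LINE = 64  # bytes

def lines_touched_column(n_rows, attr_bytes):
    # Columnar layout: the attribute is stored contiguously.
    return -(-n_rows * attr_bytes // CACHE_LINE)  # ceiling division

def lines_touched_row(n_rows, row_bytes):
    # Row layout: with aligned rows of at least one cache line, reading one
    # attribute still pulls in (at least) one line per row.
    assert row_bytes >= CACHE_LINE
    return n_rows

n = 1_000_000
print("column layout:", lines_touched_column(n, 4))  # 62500 lines
print("row layout:   ", lines_touched_row(n, 128))   # 1000000 lines
```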

    Maintenance-cost view-selection in large data warehouse systems: algorithms, implementations and evaluations.

    Choi Chi Hon. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references (leaves 120-126). Abstracts in English and Chinese. Contents:
    Chapter 1. Introduction: Maintenance Cost View Selection Problem; Previous Research Works; Major Contributions; Thesis Organization
    Chapter 2. Literature Review: Data Warehouse and OLAP Systems (data warehouse architecture, multidimensional data model, star and snowflake schemas, data cube, ROLAP and MOLAP, query optimization); Materialized Views (role in OLAP, challenges, view maintenance); View Selection (selection strategies)
    Chapter 3. Problem Definition: View Selection Under Constraint; The Lattice Framework for the Maintenance Cost View Selection Problem; The Difficulties of the Maintenance Cost View Selection Problem
    Chapter 4. What Difference Heuristics Make: Motivation; Example; Existing Algorithms (A*-Heuristic, Inverted-Tree Greedy, Two-Phase Greedy, Integrated Greedy); A Performance Study
    Chapter 5. Materialized View Selection as Constrained Evolutionary Optimization: Constraint Handling (Penalty vs. Stochastic Ranking); The New Stochastic Ranking Evolutionary Algorithm; Experimental Studies
    Chapter 6. Dynamic Materialized View Management Based On Predicates: Related Work (Static Prepartitioning-Based Materialized View Management); A New Dynamic Predicate-based Partitioning Approach (System Overview, Partition Advisor, View Manager); A Performance Study (query locality, effectiveness of disk size, scalability)
    Chapter 7. Conclusions and Future Work
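    As a rough illustration of the benefit-per-maintenance-cost greedy selection compared in Chapter 4 (e.g., two-phase and integrated greedy), here is a minimal sketch; the benefit and cost numbers are invented for the example, not taken from the thesis.

```python
# Greedy view selection sketch: pick the view with the best query benefit per
# unit of maintenance cost until the maintenance budget is exhausted.
def greedy_select(views, budget):
    """views: dict name -> (query_benefit, maintenance_cost)."""
    chosen, spent = [], 0
    remaining = dict(views)
    while remaining:
        name, (benefit, cost) = max(remaining.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1])
        del remaining[name]
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

views = {"v1": (100, 10), "v2": (80, 40), "v3": (60, 15), "v4": (30, 30)}
print(greedy_select(views, budget=50))  # -> (['v1', 'v3'], 25)
```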

    Efficiency in Cluster Database Systems - Dynamic and Workload-Aware Scaling and Allocation

    Database systems have been vital to all forms of data processing for a long time. In recent years, the amount of processed data has been growing dramatically, even in small projects. Nevertheless, database management systems tend to be static in terms of size and performance, which makes scaling a difficult and expensive task. Because of performance and especially cost advantages, more and more installed systems have a shared-nothing cluster architecture. Due to the massive parallelism of the hardware, programming paradigms from high performance computing are being carried over into data processing, and database research struggles to keep up with this trend. A key feature of traditional database systems is to provide transparent access to the stored data. This introduces data dependencies and increases system complexity and inter-process communication. Therefore, many developers exchange this feature for better scalability. However, explicitly managing the data distribution and data flow requires a deep understanding of the distributed system and reduces the possibilities for automatic and autonomic optimization. In this thesis we present an approach to database system scaling and allocation that features good scalability although it keeps the data distribution transparent. The first part of this thesis analyzes the challenges and opportunities for self-scaling database management systems in cluster environments. Scalability is a major concern of Internet-based applications. Access peaks that overload the application are a financial risk; therefore, systems are usually configured to be able to process peaks at any given moment. As a result, server systems often have a very low utilization. In distributed systems the efficiency can be increased by adapting the number of nodes to the current workload. We propose a processing model and an architecture that allow efficient self-scaling of cluster database systems. In the second part we consider different allocation approaches. To increase the efficiency we present a workload-aware, query-centric model. The approach is formalized, and optimal and heuristic algorithms are presented. The algorithms optimize the data distribution for local query execution and balance the workload according to the query history. We present different query classification schemes for different forms of partitioning. The approach is evaluated for OLTP- and OLAP-style workloads, and it is shown that variants of the approach scale well for both fields of application. The third part of the thesis considers benchmarks for large, adaptive systems. First, we present a data generator for cloud-sized applications. Due to its architecture, the data generator can easily be extended and configured. A key feature is the high degree of parallelism that makes linear speedup for arbitrary numbers of nodes possible. To simulate systems with user interaction, we have analyzed a productive online e-learning management system. Based on our findings, we present a model for workload generation that considers the temporal dependency of user interaction.
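    A minimal sketch of a workload-aware allocation heuristic in this spirit, assuming per-partition access counts derived from the query history: partitions are placed, heaviest first, on the currently least-loaded node (the classic longest-processing-time rule); the thesis's algorithms additionally optimize the distribution for local query execution.

```python
# Illustrative load-balancing allocation from an assumed query history.
import heapq

def allocate(partition_load, n_nodes):
    """partition_load: dict partition -> access count from the query history."""
    nodes = [(0, i, []) for i in range(n_nodes)]  # (load, node id, partitions)
    heapq.heapify(nodes)
    for part, load in sorted(partition_load.items(), key=lambda kv: -kv[1]):
        node_load, node_id, parts = heapq.heappop(nodes)  # least-loaded node
        parts.append(part)
        heapq.heappush(nodes, (node_load + load, node_id, parts))
    return {node_id: parts for _, node_id, parts in nodes}

history = {"p1": 120, "p2": 90, "p3": 60, "p4": 30, "p5": 30}
print(allocate(history, n_nodes=2))
# -> {0: ['p1', 'p4', 'p5'], 1: ['p2', 'p3']} (node order may vary)
```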

    Elastic Dataflow Processing on the Cloud

    Clouds have become an attractive platform for the large-scale processing of modern applications on Big Data, especially due to the concept of elasticity, which characterizes them: resources can be leased on demand and used for as much time as needed, offering the ability to create virtual infrastructures that change dynamically over time. Such applications often require processing of complex queries that are expressed in a high-level language and are typically transformed into data processing flows (dataflows). A logical question that arises is whether elasticity affects dataflow execution and in which way. It seems reasonable that the execution is faster when more resources are used; however, the monetary cost is higher. This gives rise to the concept of eco-elasticity, an additional kind of elasticity that comes from economics and captures the trade-offs between the response time of the system and the amount of money we pay for it, as influenced by the use of different amounts of resources. In this thesis, we approach the elasticity of clouds in a unified way that combines both the traditional notion and eco-elasticity. This unified elasticity concept is essential for the development of auto-tuned systems in cloud environments. First, we demonstrate that eco-elasticity exists in several common tasks that appear in practice and that it can be discovered using a simple, yet highly scalable and efficient algorithm. Next, we present two cases of auto-tuned algorithms that use the unified model of elasticity in order to adapt to the query workload: 1) processing analytical queries in the form of tree execution plans in order to maximize profit, and 2) automated index management taking into account compute and storage resources. Finally, we describe EXAREME, a system for elastic data processing on the cloud that has been used and extended in this work. The system offers declarative languages that are based on SQL with user-defined functions (UDFs), extended with parallelism primitives. EXAREME exploits both elasticities of clouds by dynamically allocating and deallocating compute resources in order to adapt to the query workload.
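    A minimal sketch of the time/money trade-off behind eco-elasticity, under an assumed cost model (perfectly divisible work plus a fixed per-query overhead, nodes billed pro rata) rather than the model of the thesis: sweeping the node count exposes the configurations among which an auto-tuned system would choose.

```python
# Assumed cost model for illustration: time(n) = work/n + overhead,
# money(n) = n * price * time(n). All parameters are invented.
def tradeoff(work=3600.0, overhead=60.0, price_per_node_hour=0.5, max_nodes=16):
    for n in range(1, max_nodes + 1):
        time_s = work / n + overhead                       # execution time
        money = n * price_per_node_hour * time_s / 3600.0  # rental cost
        yield n, time_s, money

for n, t, m in tradeoff():
    print(f"nodes={n:2d}  time={t:7.1f}s  cost=${m:.3f}")
```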

    Metadata-driven data integration

    Cotutelle: Universitat Politècnica de Catalunya and Université Libre de Bruxelles, IT4BI-DC programme for the joint Ph.D. degree in computer science. Data has an undoubtable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. Nonetheless, we are recently witnessing a change represented by huge and heterogeneous amounts of data. Indeed, 90% of the data in the world has been generated in the last two years. Thus, in order to carry out these data exploitation tasks, organizations must first perform data integration, combining data from multiple sources to yield a unified view over them. Yet, the integration of massive and heterogeneous amounts of data requires revisiting the traditional integration assumptions to cope with the new requirements posed by such data-intensive settings. This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate for an integration process consisting of sequential activities governed by a semantic layer, implemented via a shared repository of metadata. From a stewardship perspective, these activities are the deployment of a data integration architecture, followed by the population of such shared metadata. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities. We begin by proposing a software reference architecture for semantic-aware data-intensive systems. Such an architecture serves as a blueprint to deploy a stack of systems, its core being the metadata repository. Next, we propose a graph-based metadata model as a formalism for metadata management. We focus on supporting schema and data source evolution, a predominant factor in the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the previously proposed metadata model. We additionally consider semantic heterogeneities in the data sources, which the proposed algorithms are capable of automatically resolving. Finally, the thesis focuses on the materialized integration activity and, to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis serve as a contribution to the field of data integration in contemporary data-intensive ecosystems.
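    A minimal sketch of the semantic-layer idea, with invented names: a small graph of mappings from global concepts to source attributes, used to rewrite the attribute list of a query over the unified view into per-source fragments. The thesis's rewriting algorithms handle full queries and resolve semantic heterogeneities; this only shows the shape of mapping-driven rewriting.

```python
# Hypothetical metadata graph: global concept -> (source, local attribute).
mappings = {
    "customer.id":   [("crm_db", "cust_id"), ("web_logs", "user")],
    "customer.name": [("crm_db", "full_name")],
}

def rewrite(global_attrs):
    """Group requested global attributes by the sources that can serve them."""
    per_source = {}
    for attr in global_attrs:
        for source, local in mappings.get(attr, []):
            per_source.setdefault(source, []).append((attr, local))
    # FROM clauses omitted for brevity; real algorithms rewrite full queries.
    return {s: "SELECT " + ", ".join(f'{local} AS "{g}"' for g, local in cols)
            for s, cols in per_source.items()}

for source, fragment in rewrite(["customer.id", "customer.name"]).items():
    print(source, "->", fragment)
```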

    IDEAS-1997-2021-Final-Programs

    This document records the final program for each of the 26 meetings of the International Database Engineering & Applications Symposium (IDEAS) from 1997 through 2021. These meetings were organized in various locations on three continents. Most of the papers published during these years are in the digital libraries of IEEE (1997-2007) or ACM (2008-2021).

    From Network to Web dimension in supply chain management

    This thesis argues that the network dimension, the current scope of supply chain management, is confining the evolution of this field and restricting the conceptual and fundamental innovations required for addressing the major challenges imposed by the evolution of markets and the increased intricacy of business relationships. The concepts of chain and network are limiting when attempting to represent the complexity of physical, informational and financial flows resulting from the interactions occurring in overlapping networks. They lack the theoretical foundations necessary to explain and encompass initiatives that go beyond a single chain or network. They also lead to overlooking substantial opportunities that require more than a single-network vision. Therefore, the "web" dimension, networks of networks, is proposed as an extension to the network dimension in supply chain management. This new scope is the natural next step in the progression from the internal operations management level to the supply chain level and then to the supply network level. After a theoretical investigation of why and how the web dimension should be integrated into the supply chain management field, the thesis studies and discusses important implications of this integration for inter-organisational collaboration and for decision-making processes in logistic web environments. It demonstrates, through the example of open interconnected logistic webs, some of the potential that cannot be imagined without a web vision. A methodology for designing a simulation model to assess the impact of such open webs versus existing webs is proposed. Since decision support is a key element in supply chain management, the thesis contributes to determining the needs of supply chain managers and identifying the important axes for designing decision support systems that provide adequate assistance in dealing with the challenges and complexity presented by logistic web environments. The identified elements result in a foundation for designing the software solutions required to handle the challenges revealed by the web dimension. This conceptual framework is applied to the prototyping of four applications that have the potential to provide practitioners and researchers with the appropriate understanding and necessary tools to deal with the complexity of logistics webs.