111 research outputs found

    A data transformation model for relational and non-relational data

    Get PDF
    The information systems that support small, medium, and large organisations need data transformation solutions that draw on multiple data sources to meet the requirements of new applications and of decision-making, and so stay competitive. Relational data underpins the majority of existing application programs, whereas non-relational data underpins the majority of newly developed applications. The relational model is the more rigorously defined of the two; nonetheless, relational databases struggle to manage very large volumes of data. Because they can handle massive data volumes, non-relational databases have evolved into substitutes for relational ones. The key issue is that the rules governing data transformation processes across these data types are poorly defined, leading to a steady decline in data quality. An empirical model is therefore required that handles both relational and non-relational data while satisfying data quality requirements. This study develops such a data transformation model, named Data Transformation with Two ETL Phases and Central-Library (DTTEPC), covering the transformation processes between the relational and non-relational models. The stages and methods of the model transform both metadata and stored data from relational to non-relational systems, and vice versa. The model was developed and validated through expert review, and a prototype based on the final version was employed in two case studies, in education and healthcare. The results of the usability test demonstrate that the model transforms metadata and stored data across systems, enhancing the information systems of various organizations through data transformation solutions. The DTTEPC model improved the integrity and completeness of the data transformation processes, and it supports decision-makers by utilizing information from various sources and systems on demand in real time.
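
    As a concrete illustration of the kind of transformation DTTEPC automates, the following Python sketch converts a relational table and its schema metadata into JSON-like documents. It is a minimal sketch, using SQLite and a hypothetical student table as stand-ins, and is not the authors' implementation; the two phases below only loosely mirror the model's two ETL phases and central library.

    # Minimal relational -> document transformation sketch (hypothetical schema).
    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
    conn.execute("INSERT INTO student VALUES (1, 'Ada', 3.9), (2, 'Alan', 3.7)")

    # Phase 1: extract the metadata (schema) so types survive the transformation.
    columns = [
        {"name": row[1], "type": row[2], "primary_key": bool(row[5])}
        for row in conn.execute("PRAGMA table_info(student)")
    ]

    # Phase 2: transform the stored data, one row -> one document.
    documents = [
        dict(zip((c["name"] for c in columns), row))
        for row in conn.execute("SELECT id, name, gpa FROM student")
    ]

    # A central "library" keeps enough schema information to reverse the mapping.
    library = {"collection": "student", "columns": columns}
    print(json.dumps({"library": library, "documents": documents}, indent=2))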

    Time Series Management Systems: A 2022 Survey

    Get PDF

    Adaptive Management of Multimodel Data and Heterogeneous Workloads

    Get PDF
    Data management systems are facing a growing demand for tighter integration of heterogeneous data from different applications and sources, for both operational and analytical purposes, in real time. However, the vast diversification of the data management landscape has led to a trade-off between high operational performance and tight integration of data. The gap between the growth of data volume and the growth of computational power demands a new approach to managing multimodel data and handling heterogeneous workloads. With PolyDBMS, we present a novel class of database management systems that bridges the gap between multimodel databases and polystore systems. This new kind of database system combines the operational capabilities of traditional database systems with the flexibility of polystore systems, including support for data modifications, transactions, and schema changes at runtime. With native support for multiple data models and query languages, a PolyDBMS presents a holistic solution for the management of heterogeneous data. This not only enables tight integration of data across different applications but also allows more efficient usage of resources. By leveraging and combining highly optimized database systems as storage and execution engines, this novel class of database system takes advantage of decades of database systems research and development. In this thesis, we present the conceptual foundations and models for building a PolyDBMS. This includes a holistic model for maintaining and querying multiple data models in one logical schema, enabling cross-model queries. With the PolyAlgebra, we present a solution for representing queries based on one or multiple data models while preserving their semantics. Furthermore, we introduce a concept for the adaptive planning and decomposition of queries across heterogeneous database systems with different capabilities and features. The conceptual contributions presented in this thesis materialize in Polypheny-DB, the first implementation of a PolyDBMS. Supporting the relational, document, and labeled property graph data models, Polypheny-DB is a suitable solution for structured, semi-structured, and unstructured data. This is complemented by an extensive type system that includes support for binary large objects. With support for multiple query languages, industry-standard query interfaces, and a rich set of domain-specific data stores and data sources, Polypheny-DB offers a flexibility unmatched by existing data management solutions.
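
    To make the idea of one logical schema over specialized engines concrete, here is a small Python sketch of cross-model query routing. The catalog, the engines, and the generated query fragments are hypothetical illustrations of the concept, not Polypheny-DB's actual API or planner.

    # Hypothetical sketch: route a logical query to the engine holding the entity.
    from dataclasses import dataclass

    @dataclass
    class LogicalEntity:
        name: str    # entity name in the logical schema
        model: str   # "relational" or "document"
        engine: str  # underlying store that holds the entity

    CATALOG = {
        "orders":   LogicalEntity("orders", "relational", "postgres"),
        "sessions": LogicalEntity("sessions", "document", "mongodb"),
    }

    def plan(entity_name: str, predicate: str) -> str:
        """Decompose a logical query into an engine-native fragment."""
        e = CATALOG[entity_name]
        if e.model == "relational":
            return f"[{e.engine}] SELECT * FROM {e.name} WHERE {predicate}"
        return f"[{e.engine}] db.{e.name}.find({{{predicate}}})"

    print(plan("orders", "total > 100"))
    print(plan("sessions", "'user': 42"))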

    Physical database design in document stores

    Get PDF
    Thesis in cotutelle between the Universitat Politècnica de Catalunya and the Université libre de Bruxelles. NoSQL is an umbrella term used to classify alternative storage systems to the traditional Relational Database Management Systems (RDBMSs). Among these, document stores have gained popularity mainly due to their semi-structured data storage model and rich query capabilities. They encourage users to take a data-first approach as opposed to a design-first one. Database design on document stores is mainly carried out in a trial-and-error or ad-hoc rule-based manner instead of through a formal process such as normalization in an RDBMS. However, these approaches can easily lead to a non-optimal design, resulting in additional costs in the long run. This PhD thesis aims to provide a novel multi-criteria approach to database design in document stores. Most existing approaches optimize query performance alone, yet other factors, such as storage requirements and the complexity of the stored documents, are specific to each use case. There is a large solution space of alternative designs due to the different combinations of referencing and nesting of data; we therefore believe multi-criteria optimization is ideal for this problem. To apply it, we must address several issues. First, we evaluate the impact of alternative storage representations of semi-structured data. There are multiple, equivalent ways to physically represent semi-structured data, but there is a lack of evidence about the potential impact on space and query performance. Thus, we embark on the task of quantifying that precisely for document stores: we empirically compare multiple ways of representing semi-structured data, allowing us to derive a set of guidelines for efficient physical database design that considers both JSON and relational options in the same palette. Second, we need a formal canonical model that can represent alternative designs. We propose a hypergraph-based approach for representing heterogeneous data store designs: we extend and formalize an existing common programming interface to NoSQL systems as hypergraphs, define design constraints and query transformation rules for representative data store types, propose a simple query rewriting algorithm, and provide a prototype implementation together with a storage statistics estimator. Third, we require a formal query cost model to estimate and evaluate query performance on alternative document store designs. Document stores use primitive approaches to query processing, such as relying on the end user to specify the usage of indexes, instead of a formal cost model; a reliable approach is needed to compare how alternative designs perform on a specific query. For this, we define a generic storage and query cost model based on disk access and memory allocation. As all document stores carry out data operations in memory, we first estimate memory usage by considering the characteristics of the stored documents, their access patterns, and the memory management algorithms. Then, using this estimation and the metadata storage size, we introduce a cost model for random-access queries. We validate our work on two well-known document store implementations, MongoDB and Couchbase. The results show that the memory usage estimates have an average precision of 91% and that predicted costs are highly correlated with actual execution times.
    During this work, we also suggested several improvements to document stores. Finally, we implement the automated database design solution using multi-criteria optimization. We introduce an algebra of transformations that can systematically modify a design in our canonical representation. Using these transformations, we implement a local search algorithm driven by a loss function that can propose near-optimal designs with high probability. We compare our prototype against an existing document store data design solution; our proposed designs have better performance and are more compact, with less redundancy.
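
    The local search described above can be sketched as follows: candidate designs are scored by a weighted loss over query cost, storage, and document complexity, and transformations generate neighbouring designs. This is a minimal sketch; the transformation effects, weights, and cost numbers are hypothetical stand-ins for the thesis's algebra of transformations and cost model.

    # Toy multi-criteria local search over document store designs.
    import random

    def loss(design, w_query=0.5, w_storage=0.3, w_complexity=0.2):
        return (w_query * design["query_cost"]
                + w_storage * design["storage"]
                + w_complexity * design["complexity"])

    def neighbors(design):
        """Apply one transformation: nesting trades storage for query cost."""
        nest = {"query_cost": design["query_cost"] * 0.8,
                "storage": design["storage"] * 1.3,
                "complexity": design["complexity"] + 1}
        ref = {"query_cost": design["query_cost"] * 1.2,
               "storage": design["storage"] * 0.8,
               "complexity": max(0, design["complexity"] - 1)}
        return [nest, ref]

    def local_search(design, steps=100):
        best = design
        for _ in range(steps):
            candidate = random.choice(neighbors(best))
            if loss(candidate) < loss(best):
                best = candidate  # keep the design with the lower loss
        return best

    best = local_search({"query_cost": 100.0, "storage": 50.0, "complexity": 3})
    print(best, loss(best))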

    The Evolution of Cloud Data Architectures: Storage, Compute, and Migration

    Get PDF
    Recent advances in data architectures have shifted from on-premises systems to the cloud. However, new challenges emerge as data volumes continue to expand at an exponential rate. My Ph.D. research therefore addresses the following challenges. First, cloud data warehouses such as Snowflake, BigQuery, and Redshift often rely on storage systems such as distributed file systems or object stores to store massive amounts of data. The growth of data volumes is accompanied by an increase in the number of objects stored and in the amount of metadata such systems must manage. By treating metadata management like data management, we built FileScale, an HDFS-based file system that replaces metadata management in HDFS with a three-tiered distributed architecture incorporating a high-throughput, distributed main-memory database system at the lowest layer, with distributed caching and routing functionality above it. FileScale performs comparably to the single-machine architecture at small scale, while enabling linear scalability as the file system metadata grows. Second, Function as a Service, or FaaS, is a new type of cloud computing service that executes code in response to events without the complex infrastructure typically associated with building and launching microservice applications. FaaS offers cloud functions, billed at millisecond granularity, that can be scaled automatically, independently, and instantaneously as needed. We built Flock, the first practical cloud-native SQL query engine that supports event stream processing on FaaS with heterogeneous hardware (x86 and Arm), able to shuffle and aggregate data without requiring a centralized coordinator or remote storage such as Amazon S3. This architecture is more cost-effective than traditional systems, especially for dynamic workloads and continuous queries. Third, Software as a Service, or SaaS, is a method of delivering software products to end users over the internet with pay-as-you-go pricing, in which the software is centrally hosted and managed by the cloud service provider. Continuous Deployment (CD) in SaaS, an aspect of DevOps, is the increasingly popular practice of frequent, automated deployment of software changes. To realize the benefits of CD, it must be straightforward to deploy updates to both front-end code and the database, even when the database's schema has changed. Unfortunately, this is where current practices run into difficulty. We therefore built BullFrog, a PostgreSQL extension that is the first system to use lazy schema migration to support single-step, online schema evolution without downtime, achieving efficient, exactly-once physical migration of data under contention.
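
    The lazy schema migration idea behind BullFrog can be illustrated with a toy in-memory model: tuples are rewritten to the new schema only when first touched, so the switch-over itself is instantaneous. This is a conceptual Python sketch, not the PostgreSQL extension itself, and the split of a name column into two fields is a hypothetical migration.

    # Toy model of lazy, exactly-once schema migration.
    old_table = {1: {"name": "Ada Lovelace"}, 2: {"name": "Alan Turing"}}
    migrated: dict[int, dict] = {}  # tuples already in the new schema

    def migrate_tuple(row: dict) -> dict:
        first, last = row["name"].split(" ", 1)
        return {"first_name": first, "last_name": last}

    def read(key: int) -> dict:
        # Exactly-once physical migration: move the tuple on first access.
        if key not in migrated:
            migrated[key] = migrate_tuple(old_table.pop(key))
        return migrated[key]

    print(read(1))    # triggers migration of row 1
    print(read(1))    # already migrated; no further work
    print(old_table)  # row 2 stays in the old format until touched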

    Logging Statements Analysis and Automation in Software Systems with Data Mining and Machine Learning Techniques

    Get PDF
    Log files are widely used to record runtime information of software systems, such as the timestamp of an event, the name or ID of the component that generated the log, and parts of the state of a task execution. The rich information in logs enables system developers (and operators) to monitor the runtime behavior of their systems and to track down system problems in development and production settings. With the ever-increasing scale and complexity of modern computing systems, the volume of logs is rapidly growing. For example, eBay reported that the rate of log generation on their servers was on the order of several petabytes per day in 2018 [17]. The traditional way of analyzing logs, which relies largely on manual inspection (e.g., searching for error/warning keywords or grep), has therefore become an inefficient, labor-intensive, error-prone, and outdated practice. The growth of logs has spurred the emergence of automated tools and approaches for log mining and analysis. In parallel, embedding logging statements in the source code is a manual and error-prone task, and developers might forget to add a logging statement in the software's source code. To address this challenge, many efforts have aimed to automate logging statements in the source code, and many tools have been proposed to perform large-scale log file analysis using machine learning and data mining techniques. However, the current logging process is still mostly manual, so proper placement and content of logging statements remain challenges. To overcome them, methods that automate log placement and content prediction, i.e., 'where and what to log', are of high interest, as are approaches that can automatically mine and extract insight from large-scale logs. Thus, in this research, we focus on predicting log statements, and for this purpose we perform an experimental study on open-source Java projects. We introduce a log-aware code-clone detection method to predict the location and description of logging statements. Additionally, we incorporate natural language processing (NLP) and deep learning methods to further improve the prediction of log statement descriptions. We also introduce deep-learning-based approaches for the automated analysis of software logs. In particular, we analyze execution logs and extract natural language characteristics of logs to enable the application of natural language models to automated log file analysis. We then propose automated tools for analyzing log files and for measuring the information gain from logs for different log analysis tasks, such as anomaly detection. Finally, we extend our NLP-enabled approach by leveraging state-of-the-art language models, i.e., Transformers, to perform automated log parsing.
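
    As a minimal illustration of automated log parsing, the following Python sketch reduces raw messages to templates by masking their variable parts, the usual first step before clone detection or language-model-based analysis. The masking rules are illustrative and far simpler than the Drain- or Transformer-based parsers discussed above.

    # Reduce log messages to templates by masking variable tokens.
    import re

    MASKS = [
        (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),   # IPv4 before <NUM>
        (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
        (re.compile(r"\b\d+\b"), "<NUM>"),
    ]

    def template(message: str) -> str:
        for pattern, token in MASKS:
            message = pattern.sub(token, message)
        return message

    logs = [
        "Connection from 10.0.0.5 failed after 3 retries",
        "Connection from 192.168.1.9 failed after 12 retries",
    ]
    # Both lines collapse to one template, exposing a single logging statement.
    print({template(line) for line in logs})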

    Metadata-driven Data Migration from Object-relational Database to NoSQL Document-oriented Database

    Get PDF
    Object-relational databases (ORDBs) are powerful for managing complex data, but they suffer from scalability problems when managing large-scale data. The importance of migrating an ORDB to NoSQL derives from the fact that large volumes of data are best handled with high scalability and availability. This paper reports our metadata-driven approach for migrating an ORDB to a document-oriented NoSQL database. Our data migration approach involves three major stages: a preprocessing stage, to extract the data and the schema components; a processing stage, to perform the data transformation; and a post-processing stage, to store the migrated data as BSON documents. The approach maintains the benefits of the Oracle ORDB in NoSQL MongoDB by supporting integrity constraint checking. To validate our approach, we developed the OR2DOD (Object-Relational to Document-Oriented Databases) system, and the experimental results confirm the effectiveness of our proposal.
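
    The three-stage pipeline can be sketched in Python as follows, using SQLite as a stand-in for the Oracle ORDB source. The table, the integrity check, and the database and collection names are hypothetical, and the final insert assumes a reachable MongoDB instance with the pymongo driver installed.

    # Sketch of the preprocessing / processing / post-processing migration stages.
    import sqlite3
    from pymongo import MongoClient

    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)")
    src.execute("INSERT INTO patient VALUES (1, 'Eve', 34), (2, 'Bob', 57)")

    # Stage 1 (preprocessing): extract the schema components and the data.
    cols = [row[1] for row in src.execute("PRAGMA table_info(patient)")]
    rows = src.execute("SELECT id, name, age FROM patient").fetchall()

    # Stage 2 (processing): transform rows into documents, keeping the primary
    # key as _id and checking a NOT NULL integrity constraint along the way.
    docs = []
    for row in rows:
        doc = dict(zip(cols, row))
        assert doc["name"] is not None, "integrity constraint violated"
        doc["_id"] = doc.pop("id")
        docs.append(doc)

    # Stage 3 (post-processing): store the migrated documents as BSON.
    client = MongoClient("mongodb://localhost:27017")
    client["hospital"]["patient"].insert_many(docs)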

    Second International Conference on Sustainable Futures: Environmental, Technological, Social and Economic Matters (ICSF 2021). Kryvyi Rih, Ukraine, May 19-21, 2021

    Get PDF
    Second International Conference on Sustainable Futures: Environmental, Technological, Social and Economic Matters (ICSF 2021). Kryvyi Rih, Ukraine, May 19-21, 2021.