34 research outputs found

    Physical database design in document stores

    Get PDF
    Tesi en modalitat de cotutela, Universitat Politècnica de Catalunya i Université libre de BruxellesNoSQL is an umbrella term used to classify alternate storage systems to the traditional Relational Database Management Systems (RDBMSs). Among these, Document stores have gained popularity mainly due to the semi-structured data storage model and the rich query capabilities. They encourage users to use a data-first approach as opposed to a design-first one. Database design on document stores is mainly carried out in a trial-and-error or ad-hoc rule-based manner instead of a formal process such as normalization in an RDBMS. However, these approaches could easily lead to a non-optimal design resulting additional costs in the long run. This PhD thesis aims to provide a novel multi-criteria-based approach to database design in document stores. Most of such existing approaches are based on optimizing query performance. However, other factors include storage requirement and complexity of the stored documents specific to each use case. There is a large solution space of alternative designs due to the different combinations of referencing and nesting of data. Thus, we believe multi-criteria optimization is ideal to solve this problem. To achieve this, we need to address several issues that will enable us to apply multi-criteria optimization for the data design problem. First, we evaluate the impact of alternate storage representations of semi-structured data. There are multiple and equivalent ways to physically represent semi-structured data, but there is a lack of evidence about the potential impact on space and query performance. Thus, we embark on the task of quantifying that precisely for document stores. We empirically compare multiple ways of representing semi-structured data, allowing us to derive a set of guidelines for efficient physical database design considering both JSON and relational options in the same palette. Then, we need a formal canonical model that can represent alternative designs. We propose a hypergraph-based approach for representing heterogeneous datastore designs. We extend and formalize an existing common programming interface to NoSQL systems as hypergraphs. We define design constraints and query transformation rules for representative data store types. Next, we propose a simple query rewriting algorithm and provide a prototype implementation together with storage statistics estimator. Next, we require a formal query cost model to estimate and evaluate query performance on alternative document store designs. Document stores use primitive approaches to query processing, such as relying on the end-user to specify the usage of indexes instead of a formal cost model. But we require a reliable approach to compare alternative designs on how they perform on a specific query. For this, we define a generic storage and query cost model based on disk access and memory allocation. As all document stores carry out data operations in memory, we first estimate the memory usage by considering the characteristics of the stored documents, their access patterns, and memory management algorithms. Then, using this estimation and metadata storage size, we introduce a cost model for random access queries. We validate our work on two well-known document store implementations. The results show that the memory usage estimates have an average precision of 91% and predicted costs are highly correlated to the actual execution times. During this work, we also managed to suggest several improvements to document stores. Finally, we implement the automated database design solution using multi-criteria optimization. We introduce an algebra of transformations that can systematically modify a design of our canonical representation. Then, using them, we implement a local search algorithm driven by a loss function that can propose near-optimal designs with high probability. We compare our prototype against an existing document store data design solution. Our proposed designs have better performance and are more compact with less redundancy.NoSQL descriu sistemes d'emmagatzematge alternatius als tradicionals de gestió de bases de dades relacionals (RDBMS). Entre aquests, els magatzems de documents han guanyat popularitat principalment a causa del model de dades semiestructurat i les riques capacitats de consulta. Animen els usuaris a utilitzar un enfocament de dades primer, en lloc d'un enfocament de disseny primer. El disseny de dades en magatzems de documents es porta a terme principalment en forma d'assaig-error o basat en regles ad-hoc en lloc d'un procés formal i sistemàtic com ara la normalització en un RDBMS. Aquest enfocament condueix fàcilment a un disseny no òptim que generarà costos addicionals a llarg termini. La majoria dels enfocaments existents es basen en l'optimització del rendiment de les consultes. Aquesta tesi pretén, en canvi, proporcionar un nou enfocament basat en diversos criteris per al disseny de bases de dades en magatzems de documents, inclouen el requisit d'espai i la complexitat dels documents emmagatzemats específics per a cada cas d'ús. En general, hi ha un gran espai de solucions de dissenys alternatives. Per tant, creiem que l'optimització multicriteri és ideal per resoldre aquest problema. Per aconseguir-ho, hem d'abordar diversos problemes que ens permetran aplicar l'optimització multicriteri. En primer, avaluem l'impacte de les representacions alternatives de dades semiestructurades. Hi ha maneres múltiples i equivalents de representar dades semiestructurades, però hi ha una manca d'evidència sobre l'impacte potencial en l'espai i el rendiment de les consultes. Així, ens embarquem en la tasca de quantificar-ho. Comparem empíricament múltiples representacions de dades semiestructurades, cosa que ens permet derivar directrius per a un disseny eficient tenint en compte les opcions dels JSON i relacionals alhora. Aleshores, necessitem un model canònic que pugui representar dissenys alternatius i proposem un enfocament basat en hipergrafs. Estenem i formalitzem una interfície de programació comuna existent als sistemes NoSQL com a hipergrafs. Definim restriccions de disseny i regles de transformació de consultes per a tipus de magatzem de dades representatius. A continuació, proposem un algorisme de reescriptura de consultes senzill i proporcionem una implementació juntament amb un estimador d'estadístiques d'emmagatzematge. Els magatzems de documents utilitzen enfocaments primitius per al processament de consultes, com ara confiar en l'usuari final per especificar l'ús d'índexs en lloc d'un model de cost. Conseqüentment, necessitem un model de cost de consulta per estimar i avaluar el rendiment en dissenys alternatius. Per això, definim un model genèric propi basat en l'accés a disc i l'assignació de memòria. Com que tots els magatzems de documents duen a terme operacions de dades a memòria, primer estimem l'ús de la memòria tenint en compte les característiques dels documents emmagatzemats, els seus patrons d'accés i els algorismes de gestió de memòria. A continuació, utilitzant aquesta estimació i la mida d'emmagatzematge de metadades, introduïm un model de costos per a consultes d'accés aleatori. Validem el nostre treball en dues implementacions conegudes. Els resultats mostren que les estimacions d'ús de memòria tenen una precisió mitjana del 91% i els costos previstos estan altament correlacionats amb els temps d'execució reals. Finalment, implementem la solució de disseny automatitzat de bases de dades mitjançant l'optimització multicriteri. Introduïm una àlgebra de transformacions que pot modificar sistemàticament un disseny en la nostra representació canònica. A continuació, utilitzant-la, implementem un algorisme de cerca local impulsat per una funció de pèrdua que pot proposar dissenys gairebé òptims amb alta probabilitat. Comparem el nostre prototip amb una solució de disseny de dades de magatzem de documents existent. Els nostres dissenys proposats tenen un millor rendiment i són més compactes, amb menys redundànciaNoSQL est un terme générique utilisé pour classer les systèmes de stockage alternatifs aux systèmes de gestion de bases de données relationnelles (SGBDR) traditionnels. Au moment de la rédaction de cet article, il existe plus de 200 systèmes NoSQL disponibles qui peuvent être classés en quatre catégories principales sur le modèle de stockage de données : magasins de valeurs-clés, magasins de documents, magasins de familles de colonnes et magasins de graphiques. Les magasins de documents ont gagné en popularité principalement en raison du modèle de stockage de données semi-structuré et des capacités de requêtes riches par rapport aux autres systèmes NoSQL, ce qui en fait un candidat idéal pour le prototypage rapide. Les magasins de documents encouragent les utilisateurs à utiliser une approche axée sur les données plutôt que sur la conception. La conception de bases de données sur les magasins de documents est principalement effectuée par essais et erreurs ou selon des règles ad hoc plutôt que par un processus formel tel que la normalisation dans un SGBDR. Cependant, ces approches pourraient facilement conduire à une conception de base de données non optimale entraînant des coûts supplémentaires de traitement des requêtes, de stockage des données et de refonte. Cette thèse de doctorat vise à fournir une nouvelle approche multicritère de la conception de bases de données dans les magasins de documents. La plupart des approches existantes de conception de bases de données sont basées sur l’optimisation des performances des requêtes. Cependant, d’autres facteurs incluent les exigences de stockage et la complexité des documents stockés spécifique à chaque cas d’utilisation. De plus, il existe un grand espace de solution de conceptions alternatives en raison des différentes combinaisons de référencement et d’imbrication des données. Par conséquent, nous pensons que l’optimisation multicritères est idéale par l’intermédiaire d’une expérience éprouvée dans la résolution de tels problèmes dans divers domaines. Cependant, pour y parvenir, nous devons résoudre plusieurs problèmes qui nous permettront d’appliquer une optimisation multicritère pour le problème de conception de données. Premièrement, nous évaluons l’impact des représentations alternatives de stockage des données semi-structurées. Il existe plusieurs manières équivalentes de représenter physiquement des données semi-structurées, mais il y a un manque de preuves concernant l’impact potentiel sur l’espace et sur les performances des requêtes. Ainsi, nous nous lançons dans la tâche de quantifier cela précisément pour les magasins de documents. Nous comparons empiriquement plusieurs façons de représenter des données semi-structurées, ce qui nous permet de dériver un ensemble de directives pour une conception de base de données physique efficace en tenant compte à la fois des options JSON et relationnelles dans la même palette. Ensuite, nous avons besoin d’un modèle canonique formel capable de représenter des conceptions alternatives. Dans cette mesure, nous proposons une approche basée sur des hypergraphes pour représenter des conceptions de magasins de données hétérogènes. Prenant une interface de programmation commune existante aux systèmes NoSQL, nous l’étendons et la formalisons sous forme d’hypergraphes. Ensuite, nous définissons les contraintes de conception et les règles de transformation des requêtes pour trois types de magasins de données représentatifs. Ensuite, nous proposons un algorithme de réécriture de requête simple à partir d’un algorithme générique dans un magasin de données sous-jacent spécifique et fournissons une implémentation prototype. De plus, nous introduisons un estimateur de statistiques de stockage sur les magasins de données sous-jacents. Enfin, nous montrons la faisabilité de notre approche sur un cas d’utilisation d’un système polyglotte existant ainsi que son utilité dans les calculs de métadonnées et de chemins de requêtes physiques. Ensuite, nous avons besoin d’un modèle de coûts de requêtes formel pour estimer et évaluer les performances des requêtes sur des conceptions alternatives de magasin de documents. Les magasins de documents utilisent des approches primitives du traitement des requêtes, telles que l’évaluation de tous les plans de requête possibles pour trouver le plan gagnant et son utilisation dans les requêtes similaires ultérieures, ou l’appui sur l’usager final pour spécifier l’utilisation des index au lieu d’un modèle de coûts formel. Cependant, nous avons besoin d’une approche fiable pour comparer deux conceptions alternatives sur la façon dont elles fonctionnent sur une requête spécifique. Pour cela, nous définissons un modèle de coûts de stockage et de requête générique basé sur l’accès au disque et l’allocation de mémoire qui permet d’estimer l’impact des décisions de conception. Étant donné que tous les magasins de documents effectuent des opérations sur les données en mémoire, nous estimons d’abord l’utilisation de la mémoire en considérant les caractéristiques des documents stockés, leurs modèles d’accès et les algorithmes de gestion de la mémoire. Ensuite, en utilisant cette estimation et la taille de stockage des métadonnées, nous introduisons un modèle de coûts pour les requêtes à accès aléatoire. Il s’agit de la première tenta ive d’une telle approche au meilleur de notre connaissance. Enfin, nous validons notre travail sur deux implémentations de magasin de documents bien connues : MongoDB et Couchbase. Les résultats démontrent que les estimations d’utilisation de la mémoire ont une précision moyenne de 91% et que les coûts prévus sont fortement corrélés aux temps d’exécution réels. Au cours de ce travail, nous avons réussi à proposer plusieurs améliorations aux systèmes de stockage de documents. Ainsi, ce modèle de coûts contribue également à identifier les discordances entre les implémentations de stockage de documents et leurs attentes théoriques. Enfin, nous implémentons la solution de conception automatisée de bases de données en utilisant l’optimisation multicritères. Tout d’abord, nous introduisons une algèbre de transformations qui peut systématiquement modifier une conception de notre représentation canonique. Ensuite, en utilisant ces transformations, nous implémentons un algorithme de recherche locale piloté par une fonction de perte qui peut proposer des conceptions quasi optimales avec une probabilité élevée. Enfin, nous comparons notre prototype à une solution de conception de données de magasin de documents existante uniquement basée sur le coût des requêtes. Nos conceptions proposées ont de meilleures performances et sont plus compactes avec moins de redondancePostprint (published version

    Scalable Informative Rule Mining

    Get PDF
    In this thesis we present SIRUM: a system for Scalable Informative RUle Mining from multi-dimensional data. Informative rules have recently been studied in several contexts, including data summarization, data cube exploration and data quality. The objective is to produce a concise set of rules (patterns) over the values of the dimension attributes that provide the most information about the distribution of a numeric measure attribute. SIRUM optimizes this task for big, wide and distributed datasets. We implemented SIRUM in Spark and observed significant performance improvements on real data due to our optimizations

    Just-in-time Analytics Over Heterogeneous Data and Hardware

    Get PDF
    Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of datasets to gain insights. At the same time, data variety increases continuously across multiple axes. First, data comes in multiple formats, such as the binary tabular data of a DBMS, raw textual files, and domain-specific formats. Second, different datasets follow different data models, such as the relational and the hierarchical one. Data location also varies: Some datasets reside in a central "data lake", whereas others lie in remote data sources. In addition, users execute widely different analysis tasks over all these data types. Finally, the process of gathering and integrating diverse datasets introduces several inconsistencies and redundancies in the data, such as duplicate entries for the same real-world concept. In summary, heterogeneity significantly affects the way data analysis is performed. In this thesis, we aim for data virtualization: Abstracting data out of its original form and manipulating it regardless of the way it is stored or structured, without a performance penalty. To achieve data virtualization, we design and implement systems that i) mask heterogeneity through the use of heterogeneity-aware, high-level building blocks and ii) offer fast responses through on-demand adaptation techniques. Regarding the high-level building blocks, we use a query language and algebra to handle multiple collection types, such as relations and hierarchies, express transformations between these collection types, as well as express complex data cleaning tasks over them. In addition, we design a location-aware compiler and optimizer that masks away the complexity of accessing multiple remote data sources. Regarding on-demand adaptation, we present a design to produce a new system per query. The design uses customization mechanisms that trigger runtime code generation to mimic the system most appropriate to answer a query fast: Query operators are thus created based on the query workload and the underlying data models; the data access layer is created based on the underlying data formats. In addition, we exploit emerging hardware by customizing the system implementation based on the available heterogeneous processors â CPUs and GPGPUs. We thus pair each workload with its ideal processor type. The end result is a just-in-time database system that is specific to the query, data, workload, and hardware instance. This thesis redesigns the data management stack to natively cater for data heterogeneity and exploit hardware heterogeneity. Instead of centralizing all relevant datasets, converting them to a single representation, and loading them in a monolithic, static, suboptimal system, our design embraces heterogeneity. Overall, our design decouples the type of performed analysis from the original data layout; users can perform their analysis across data stores, data models, and data formats, but at the same time experience the performance offered by a custom system that has been built on demand to serve their specific use case

    Set up of automated user interface testing system

    Get PDF
    Automation has evolved dramatically in the past decade, one aspect of this being software automation testing. The system used in this thesis is Nightwatch.js which is a Node.js-based framework solution for web applications and websites. With the demonstration conducted in this paper, the author aims to specify the vital role of automation testing framework in the software industry. The commissioning party was Quux Oy, a software company located in Valkeakoski (Finland). The testing in the company are still done manually and there is an imperative need for the setup of an automated user interface testing system. The thesis project included a theoretical review using online sources such as articles, forums, electronic sources, and the main website of the Nightwatch itself. The theory was focused on the definition of automation and on defining Nightwatch.js system along with all the required features to set up the system. The implementation part was where the whole set up process was being examined and recorded. The outcome of the thesis project is a test suite where all the components required to be tested were covered. The targets of the thesis were achieved, with a justification for using this testing system in Quux Oy and its whole set up process

    SciQL, A query language for science applications

    Get PDF
    Scientific applications are still poorly served by contemporary relational database systems. At best, the system provides a bridge towards an external library using user-defined functions, explicit import/export facilities or linked-in Java/C# interpreters. Time has come to rectify this with SciQL, a SQL-query language for science applications with arrays as first class citizens. It provides a seamless symbiosis of array-, set-, and sequence- interpretation using a clear separation of the mathematical object from its underlying storage representation. The language extends value-based grouping in SQL with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between its index attributes. It leads to a generalization of window-based query processing. The SciQL architecture benefits from a column store system with an adaptive storage scheme, including keeping multiple representations around for reduced impedance mismatch. This paper is focused on the language features, its architectural consequences and extensive examples of its intended use

    Arquitectura, técnicas y modelos para posibilitar la Ciencia de Datos en el Archivo de la Misión Gaia

    Get PDF
    Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadores y Automática, leída el 26/05/2017.The massive amounts of data that the world produces every day pose new challenges to modern societies in terms of how to leverage their inherent value. Social networks, instant messaging, video, smart devices and scientific missions are just mere examples of the vast number of sources generating data every second. As the world becomes more and more digitalized, new needs arise for organizing, archiving, sharing, analyzing, visualizing and protecting the ever-increasing data sets, so that we can truly develop into a data-driven economy that reduces inefficiencies and increases sustainability, creating new business opportunities on the way. Traditional approaches for harnessing data are not suitable any more as they lack the means for scaling to the larger volumes in a timely and cost efficient manner. This has somehow changed with the advent of Internet companies like Google and Facebook, which have devised new ways of tackling this issue. However, the variety and complexity of the value chains in the private sector as well as the increasing demands and constraints in which the public one operates, needs an ongoing research that can yield newer strategies for dealing with data, facilitate the integration of providers and consumers of information, and guarantee a smooth and prompt transition when adopting these cutting-edge technological advances. This thesis aims at providing novel architectures and techniques that will help perform this transition towards Big Data in massive scientific archives. It highlights the common pitfalls that must be faced when embracing it and how to overcome them, especially when the data sets, their transformation pipelines and the tools used for the analysis are already present in the organizations. Furthermore, a new perspective for facilitating a smoother transition is laid out. It involves the usage of higher-level and use case specific frameworks and models, which will naturally bridge the gap between the technological and scientific domains. This alternative will effectively widen the possibilities of scientific archives and therefore will contribute to the reduction of the time to science. The research will be applied to the European Space Agency cornerstone mission Gaia, whose final data archive will represent a tremendous discovery potential. It will create the largest and most precise three dimensional chart of our galaxy (the Milky Way), providing unprecedented position, parallax and proper motion measurements for about one billion stars. The successful exploitation of this data archive will depend to a large degree on the ability to offer the proper architecture, i.e. infrastructure and middleware, upon which scientists will be able to do exploration and modeling with this huge data set. In consequence, the approach taken needs to enable data fusion with other scientific archives, as this will produce the synergies leading to an increment in scientific outcome, both in volume and in quality. The set of novel techniques and frameworks presented in this work addresses these issues by contextualizing them with the data products that will be generated in the Gaia mission. All these considerations have led to the foundations of the architecture that will be leveraged by the Science Enabling Applications Work Package. Last but not least, the effectiveness of the proposed solution will be demonstrated through the implementation of some ambitious statistical problems that will require significant computational capabilities, and which will use Gaia-like simulated data (the first Gaia data release has recently taken place on September 14th, 2016). These ambitious problems will be referred to as the Grand Challenge, a somewhat grandiloquent name that consists in inferring a set of parameters from a probabilistic point of view for the Initial Mass Function (IMF) and Star Formation Rate (SFR) of a given set of stars (with a huge sample size), from noisy estimates of their masses and ages respectively. This will be achieved by using Hierarchical Bayesian Modeling (HBM). In principle, the HBM can incorporate stellar evolution models to infer the IMF and SFR directly, but in this first step presented in this thesis, we will start with a somewhat less ambitious goal: inferring the PDMF and PDAD. Moreover, the performance and scalability analyses carried out will also prove the suitability of the models for the large amounts of data that will be available in the Gaia data archive.Las grandes cantidades de datos que se producen en el mundo diariamente plantean nuevos retos a la sociedad en términos de cómo extraer su valor inherente. Las redes sociales, mensajería instantánea, los dispositivos inteligentes y las misiones científicas son meros ejemplos del gran número de fuentes generando datos en cada momento. Al mismo tiempo que el mundo se digitaliza cada vez más, aparecen nuevas necesidades para organizar, archivar, compartir, analizar, visualizar y proteger la creciente cantidad de datos, para que podamos desarrollar economías basadas en datos e información que sean capaces de reducir las ineficiencias e incrementar la sostenibilidad, creando nuevas oportunidades de negocio por el camino. La forma en la que se han manejado los datos tradicionalmente no es la adecuada hoy en día, ya que carece de los medios para escalar a los volúmenes más grandes de datos de una forma oportuna y eficiente. Esto ha cambiado de alguna manera con la llegada de compañías que operan en Internet como Google o Facebook, ya que han concebido nuevas aproximaciones para abordar el problema. Sin embargo, la variedad y complejidad de las cadenas de valor en el sector privado y las crecientes demandas y limitaciones en las que el sector público opera, necesitan una investigación continua en la materia que pueda proporcionar nuevas estrategias para procesar las enormes cantidades de datos, facilitar la integración de productores y consumidores de información, y garantizar una transición rápida y fluida a la hora de adoptar estos avances tecnológicos innovadores. Esta tesis tiene como objetivo proporcionar nuevas arquitecturas y técnicas que ayudarán a realizar esta transición hacia Big Data en archivos científicos masivos. La investigación destaca los escollos principales a encarar cuando se adoptan estas nuevas tecnologías y cómo afrontarlos, principalmente cuando los datos y las herramientas de transformación utilizadas en el análisis existen en la organización. Además, se exponen nuevas medidas para facilitar una transición más fluida. Éstas incluyen la utilización de software de alto nivel y específico al caso de uso en cuestión, que haga de puente entre el dominio científico y tecnológico. Esta alternativa ampliará de una forma efectiva las posibilidades de los archivos científicos y por tanto contribuirá a la reducción del tiempo necesario para generar resultados científicos a partir de los datos recogidos en las misiones de astronomía espacial y planetaria. La investigación se aplicará a la misión de la Agencia Espacial Europea (ESA) Gaia, cuyo archivo final de datos presentará un gran potencial para el descubrimiento y hallazgo desde el punto de vista científico. La misión creará el catálogo en tres dimensiones más grande y preciso de nuestra galaxia (la Vía Láctea), proporcionando medidas sin precedente acerca del posicionamiento, paralaje y movimiento propio de alrededor de mil millones de estrellas. Las oportunidades para la explotación exitosa de este archivo de datos dependerán en gran medida de la capacidad de ofrecer la arquitectura adecuada, es decir infraestructura y servicios, sobre la cual los científicos puedan realizar la exploración y modelado con esta inmensa cantidad de datos. Por tanto, la estrategia a realizar debe ser capaz de combinar los datos con otros archivos científicos, ya que esto producirá sinergias que contribuirán a un incremento en la ciencia producida, tanto en volumen como en calidad de la misma. El conjunto de técnicas e infraestructuras innovadoras presentadas en este trabajo aborda estos problemas, contextualizándolos con los productos de datos que se generarán en la misión Gaia. Todas estas consideraciones han conducido a los fundamentos de la arquitectura que se utilizará en el paquete de trabajo de aplicaciones que posibilitarán la ciencia en el archivo de la misión Gaia (Science Enabling Applications). Por último, la eficacia de la solución propuesta se demostrará a través de la implementación de dos problemas estadísticos que requerirán cantidades significativas de cómputo, y que usarán datos simulados en el mismo formato en el que se producirán en el archivo de la misión Gaia (la primera versión de datos recogidos por la misión está disponible desde el día 14 de Septiembre de 2016). Estos ambiciosos problemas representan el Gran Reto (Grand Challenge), un nombre grandilocuente que consiste en inferir una serie de parámetros desde un punto de vista probabilístico para la función de masa inicial (Initial Mass Function) y la tasa de formación estelar (Star Formation Rate) dado un conjunto de estrellas (con una muestra grande), desde estimaciones con ruido de sus masas y edades respectivamente. Esto se abordará utilizando modelos jerárquicos bayesianos (Hierarchical Bayesian Modeling). Enprincipio,losmodelospropuestos pueden incorporar otros modelos de evolución estelar para inferir directamente la función de masa inicial y la tasa de formación estelar, pero en este primer paso presentado en esta tesis, empezaremos con un objetivo algo menos ambicioso: la inferencia de la función de masa y distribución de edades actual (Present-Day Mass Function y Present-Day Age Distribution respectivamente). Además, se llevará a cabo el análisis de rendimiento y escalabilidad para probar la idoneidad de la implementación de dichos modelos dadas las enormes cantidades de datos que estarán disponibles en el archivo de la misión Gaia...Depto. de Arquitectura de Computadores y AutomáticaFac. de InformáticaTRUEunpu

    Automated Injection of Curated Knowledge Into Real-Time Clinical Systems: CDS Architecture for the 21st Century

    Get PDF
    abstract: Clinical Decision Support (CDS) is primarily associated with alerts, reminders, order entry, rule-based invocation, diagnostic aids, and on-demand information retrieval. While valuable, these foci have been in production use for decades, and do not provide a broader, interoperable means of plugging structured clinical knowledge into live electronic health record (EHR) ecosystems for purposes of orchestrating the user experiences of patients and clinicians. To date, the gap between knowledge representation and user-facing EHR integration has been considered an “implementation concern” requiring unscalable manual human efforts and governance coordination. Drafting a questionnaire engineered to meet the specifications of the HL7 CDS Knowledge Artifact specification, for example, carries no reasonable expectation that it may be imported and deployed into a live system without significant burdens. Dramatic reduction of the time and effort gap in the research and application cycle could be revolutionary. Doing so, however, requires both a floor-to-ceiling precoordination of functional boundaries in the knowledge management lifecycle, as well as formalization of the human processes by which this occurs. This research introduces ARTAKA: Architecture for Real-Time Application of Knowledge Artifacts, as a concrete floor-to-ceiling technological blueprint for both provider heath IT (HIT) and vendor organizations to incrementally introduce value into existing systems dynamically. This is made possible by service-ization of curated knowledge artifacts, then injected into a highly scalable backend infrastructure by automated orchestration through public marketplaces. Supplementary examples of client app integration are also provided. Compilation of knowledge into platform-specific form has been left flexible, in so far as implementations comply with ARTAKA’s Context Event Service (CES) communication and Health Services Platform (HSP) Marketplace service packaging standards. Towards the goal of interoperable human processes, ARTAKA’s treatment of knowledge artifacts as a specialized form of software allows knowledge engineers to operate as a type of software engineering practice. Thus, nearly a century of software development processes, tools, policies, and lessons offer immediate benefit: in some cases, with remarkable parity. Analyses of experimentation is provided with guidelines in how choice aspects of software development life cycles (SDLCs) apply to knowledge artifact development in an ARTAKA environment. Portions of this culminating document have been further initiated with Standards Developing Organizations (SDOs) intended to ultimately produce normative standards, as have active relationships with other bodies.Dissertation/ThesisDoctoral Dissertation Biomedical Informatics 201

    The Architecture of an Autonomic, Resource-Aware, Workstation-Based Distributed Database System

    Get PDF
    Distributed software systems that are designed to run over workstation machines within organisations are termed workstation-based. Workstation-based systems are characterised by dynamically changing sets of machines that are used primarily for other, user-centric tasks. They must be able to adapt to and utilize spare capacity when and where it is available, and ensure that the non-availability of an individual machine does not affect the availability of the system. This thesis focuses on the requirements and design of a workstation-based database system, which is motivated by an analysis of existing database architectures that are typically run over static, specially provisioned sets of machines. A typical clustered database system -- one that is run over a number of specially provisioned machines -- executes queries interactively, returning a synchronous response to applications, with its data made durable and resilient to the failure of machines. There are no existing workstation-based databases. Furthermore, other workstation-based systems do not attempt to achieve the requirements of interactivity and durability, because they are typically used to execute asynchronous batch processing jobs that tolerate data loss -- results can be re-computed. These systems use external servers to store the final results of computations rather than workstation machines. This thesis describes the design and implementation of a workstation-based database system and investigates its viability by evaluating its performance against existing clustered database systems and testing its availability during machine failures.Comment: Ph.D. Thesi

    Efficient Online Processing for Advanced Analytics

    Get PDF
    With the advent of emerging technologies and the Internet of Things, the importance of online data analytics has become more pronounced. Businesses and companies are adopting approaches that provide responsive analytics to stay competitive in the global marketplace. Online analytics allow data analysts to promptly react to patterns or to gain preliminary insights from early results that aid in research, decision making, and effective strategy planning. The growth of data-velocity in a variety of domains including, high-frequency trading, social networks, infrastructure monitoring, and advertising require adopting online engines that can efficiently process continuous streams of data. This thesis presents foundations, techniques, and systems' design that extend the state-of-the-art in online query processing to efficiently support relational joins with arbitrary join-predicates (beyond traditional equi-joins); and to support other data models (beyond relational) that target machine learning and graph computations. The thesis is divided into two parts: We first present a brief overview of Squall, our open-source online query processing engine that supports SQL-like queries on top of streams. Then, we focus on extending Squall to support efficient theta-join processing. Scalable distributed join processing requires a partitioning policy that evenly distributes the processing load while minimizing the size of maintained state and duplicated messages. Efficient load-balance demands apriori-statistics which are not available in the online setting. We propose a novel operator that continuously adjusts itself to the data dynamics, through adaptive dataflow routing and state repartitioning. It is also resilient to data-skew, maintains high throughput rates, avoids blocking during state repartitioning, and behaves as a black-box dataflow operator with provable performance guarantees. Our evaluation demonstrates that the proposed operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time up to 7x. In the second part, we present a novel framework that supports the Incremental View Maintenance (IVM) of workloads expressed as linear algebra programs. Linear algebra represents a concrete substrate for advanced analytical tasks including, machine learning, scientific computation, and graph algorithms. Previous works on relational calculus IVM are not applicable to matrix algebra workloads. This is because a single entry change to an input-matrix results in changes all over the intermediate views, rendering IVM useless in comparison to re-evaluation. We present Lago, a unified modular compiler framework that supports the IVM of a broad class of linear algebra programs. Lago automatically derives and optimizes incremental trigger programs of analytical computations, while freeing the user from erroneous manual derivations, low-level implementation details, and performance tuning. We present a novel technique that captures Δ\Delta changes as low-rank matrices. Low-rank matrices are representable in a compressed factored form that enables cheaper computations. Lago automatically propagates the factored representation across program statements to derive an efficient trigger program. Moreover, Lago extends its support to other domains that use different semi-ring configurations, e.g., graph applications. Our evaluation results demonstrate orders of magnitude (10x-1
    corecore