12 research outputs found

    Bridging the ML-Human Gap in Scientific Data Navigation

    Off-the-shelf ML libraries, combined with accessible scientific computing infrastructure, continue to open new avenues for automating and augmenting researchers' work in the lab. Widely applicable pre-trained neural networks have greatly lowered the barrier to entry for applying classification models, leaving the translation of domain expert knowledge into machine intelligence as the main challenge. I have developed several specialized models that solve specific lab problems with minimal training regimens by building atop published general-purpose frameworks. Applications include reinforcement-guided molecular dynamics simulations, human reaction-based dataset navigation through machine-readable P300 brain waves, and floating-zone furnace user guidance through classification of live boron-carbide crystal growth video. These purpose-built models, constructed with limited and expensive training data, are evaluated using a combination of established domain metrics and statistical techniques.

    OctopusDB : flexible and scalable storage management for arbitrary database engines

    We live in a dynamic age with the economy, the technology, and the people around us changing faster than ever before. Consequently, the data management needs of our modern world are very different from those envisioned by the early database inventors in the 1970s. Today, enterprises face the challenge of managing ever-growing dataset sizes with dynamically changing query workloads. As a result, modern data managing systems, including relational as well as big data management systems, can no longer afford to be carved-in-stone solutions. Instead, data managing systems must inherently provide flexible data management techniques in order to cope with constantly changing business needs. The current practice for dealing with changing query workloads is to have a different specialized product for each workload type, e.g., row stores for OLTP workloads, column stores for OLAP workloads, streaming systems for streaming workloads, and scan-oriented systems for shared query processing. However, this means that enterprises now have to glue different data managing products together and copy data from one product to another in order to support several query workloads. This has the additional penalty of managing a zoo of data managing systems in the first place, which is tedious, expensive, and counter-productive for modern enterprises. This thesis presents an alternative approach to supporting several query workloads in a data managing system. We observe that each specialized database product has a different data store, indicating that different query workloads work well with different data layouts. A key requirement for supporting several query workloads is therefore to support several data layouts. In this thesis, we study ways to inject different data layouts into existing (and familiar) data managing systems. The goal is to develop a flexible storage layer which can support several query workloads in a single data managing system.
We present a set of non-invasive techniques, coined Trojan Techniques, to inject different data layouts into a data managing system. The core idea of Trojan Techniques is to drop the assumption of having one fixed data store per data managing system. Trojan Techniques are non-invasive in the sense that they do not make heavy, untenable changes to the system. Rather, they affect the data managing system from inside, almost at the core. As a result, Trojan Techniques bring significant improvements in query performance. It is interesting to note that our approach follows a design pattern that has been used in other non-invasive research works as well, such as PAX, fractal prefetching B+-trees, and RowCol. We propose four Trojan Techniques. First, Trojan Indexes add an additional index access path in Hadoop MapReduce. Second, Trojan Joins allow for co-partitioned joins in Hadoop MapReduce. Third, Trojan Layouts allow for row, column, or column-grouped layouts in Hadoop MapReduce. Together, these three techniques provide a highly flexible data storage layer for Hadoop MapReduce. Our final proposal, Trojan Columns, introduces columnar functionality in row-oriented relational databases, including closed-source commercial databases, thus bridging the gap between row-oriented and column-oriented databases. Our experimental results show that Trojan Techniques can improve the performance of Hadoop MapReduce by a factor of up to 18, and that of a top-notch commercial database product by a factor of up to 17.
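
The abstract's observation that different query workloads favor different data layouts can be illustrated with a minimal sketch. It is not the thesis's implementation and does not touch Hadoop MapReduce; the records, field names, and helper functions are hypothetical and serve only to contrast a row layout with a column layout for an analytical scan.

```python
# Hypothetical toy records; none of these names or values come from the thesis.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 4.5},
]


def to_columns(records):
    """Convert a row layout (list of dicts) into a column layout (dict of lists)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_amount_rows(records):
    """Analytical scan over the row layout: every whole record is visited."""
    return sum(record["amount"] for record in records)


def sum_amount_columns(columns):
    """The same scan over the column layout: only the 'amount' column is read."""
    return sum(columns["amount"])


columns = to_columns(rows)
total_rows = sum_amount_rows(rows)
total_columns = sum_amount_columns(columns)
```

Both scans return the same total, but the column layout reads a single attribute, which is why OLAP-style queries tend to prefer it, while fetching or updating one complete record is more natural in the row layout.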

    A Data-driven Methodology Towards Mobility- and Traffic-related Big Spatiotemporal Data Frameworks

    The human population is increasing at unprecedented rates, particularly in urban areas. This increase, along with the rise of a more economically empowered middle class, brings new and complex challenges to the mobility of people within urban areas. To tackle such challenges, transportation and mobility authorities and operators are trying to adopt innovative Big Data-driven mobility- and traffic-related solutions. Such solutions will help decision-making processes that aim to ease the load on an already overloaded transport infrastructure. The information collected from day-to-day mobility and traffic can help to mitigate some of these mobility challenges in urban areas. Road infrastructure and traffic management operators (RITMOs) face several limitations in effectively extracting value from the exponentially growing volumes of mobility- and traffic-related Big Spatiotemporal Data (MobiTrafficBD) that are being acquired and gathered. Research on Big Data, Spatiotemporal Data and especially MobiTrafficBD is scattered, and the existing literature does not offer a concrete, common methodological approach to set up, configure, deploy and use a complete Big Data-based framework that manages the lifecycle of mobility-related spatiotemporal data, mainly geo-referenced time series (GRTS) and spatiotemporal events (ST Events), extracts value from it and supports the decision-making processes of RITMOs. This doctoral thesis proposes a data-driven, prescriptive methodological approach towards the design, development and deployment of MobiTrafficBD Frameworks focused on GRTS and ST Events.
Besides a thorough literature review on Spatiotemporal Data, Big Data and the merging of these two fields through MobiTrafficBD, the methodological approach comprises a set of general characteristics, technical requirements, logical components, data flows and technological infrastructure models, as well as guidelines and best practices that aim to guide researchers, practitioners and stakeholders, such as RITMOs, throughout the design, development and deployment phases of any MobiTrafficBD Framework. This work is intended to be a supporting methodological guide, based on widely used Reference Architectures and guidelines for Big Data, but enriched with the inherent characteristics and concerns brought about by Big Spatiotemporal Data, such as in the case of GRTS and ST Events. The proposed methodology was evaluated and demonstrated in various real-world use cases that deployed MobiTrafficBD-based Data Management, Processing, Analytics and Visualisation methods, tools and technologies, under the umbrella of several research projects funded by the European Commission and the Portuguese Government.

    Adaptive Data Storage and Placement in Distributed Database Systems

    Distributed database systems are widely used to provide scalable storage, update and query facilities for application data. Distributed databases primarily use data replication and data partitioning to spread load across nodes or sites. The presence of hotspots in workloads, however, can result in imbalanced load on the distributed system resulting in performance degradation. Moreover, updates to partitioned and replicated data can require expensive distributed coordination to ensure that they are applied atomically and consistently. Additionally, data storage formats, such as row and columnar layouts, can significantly impact latencies of mixed transactional and analytical workloads. Consequently, how and where data is stored among the sites in a distributed database can significantly affect system performance, particularly if the workload is not known ahead of time. To address these concerns, this thesis proposes adaptive data placement and storage techniques for distributed database systems. This thesis demonstrates that the performance of distributed database systems can be improved by automatically adapting how and where data is stored by leveraging online workload information. A two-tiered architecture for adaptive distributed database systems is proposed that includes an adaptation advisor that decides at which site(s) and how transactions execute. The adaptation advisor makes these decisions based on submitted transactions. 
This design is used in three adaptive distributed database systems presented in this thesis: (i) DynaMast, which efficiently transfers data mastership to guarantee single-site transactions while maintaining well-understood and established transactional semantics, (ii) MorphoSys, which selectively and adaptively replicates, partitions and remasters data based on a learned cost model to improve transaction processing, and (iii) Proteus, which uses learned workload models to predictively and adaptively change storage layouts to support both high transactional throughput and low-latency analytical queries. Collectively, this thesis is a concrete step towards autonomous database systems that allow users to specify only the data to store and the queries to execute, leaving the system to judiciously choose the storage and execution mechanisms to deliver high performance.

    Optimization Techniques for Complex Multi-query Applications

    Ph.D. (Doctor of Philosophy)

    Density-Aware Linear Algebra in a Column-Oriented In-Memory Database System

    Linear algebra operations appear in nearly every application in advanced analytics, machine learning, and various science domains. To this day, many data analysts and scientists tend to use statistics software packages or hand-crafted solutions for their analysis. In the era of data deluge, however, external statistics packages and custom analysis programs that often run on single workstations are incapable of keeping up with the vast increase in data volume and size. In particular, there is an increasing demand among scientists for large-scale data manipulation, orchestration, and advanced data management capabilities. These are among the key features of a mature relational database management system (DBMS). With the rise of main-memory database systems, it has now become feasible to also consider applications that build upon linear algebra. This thesis presents a deep integration of linear algebra functionality into an in-memory column-oriented database system. In particular, this work shows that it has become feasible to execute linear algebra queries on large data sets directly in a DBMS-integrated engine (LAPEG), without the need to transfer data and without being restricted by hard disk latencies. From various application examples that are cited in this work, we deduce a number of requirements that are relevant for a database system that includes linear algebra functionality. Besides the deep integration of matrices and numerical algorithms, these include optimization of expressions, transparent matrix handling, scalability and data-parallelism, and data manipulation capabilities. These requirements are addressed by our linear algebra engine. In particular, the core contributions of this thesis are as follows. First, we show that the columnar storage layer of an in-memory DBMS allows an easy adoption of efficient sparse matrix data types and algorithms.
Furthermore, we show that the execution of linear algebra expressions benefits significantly from different techniques that are inspired by database technology. In a novel way, we implemented several of these optimization strategies in LAPEG's optimizer (SpMachO), which uses an advanced density estimation method (SpProdest) to predict the matrix density of intermediate results. Moreover, we present an adaptive matrix data type, AT Matrix, so that scientists need not select appropriate matrix representations themselves. The tiled substructure of AT Matrix is exploited by our matrix multiplication to saturate the different sockets of a multicore main-memory platform, reaching a speed-up of up to 6x compared to alternative approaches. Finally, a major part of this thesis is devoted to the topic of data manipulation, where we propose a matrix manipulation API and present different mutable matrix types to enable fast insertions and deletions. We conclude that our linear algebra engine is well-suited to process dynamic, large matrix workloads in an optimized way. In particular, the DBMS-integrated LAPEG fills the linear algebra gap and makes columnar in-memory DBMSs attractive as an efficient, scalable ad-hoc analysis platform for scientists.
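
To make the idea of predicting the density of an intermediate result concrete, here is a minimal sketch of a textbook average-case estimate. It is not SpProdest, whose details the abstract does not give. It assumes that nonzeros are placed independently and uniformly at random, in which case an entry of C = A·B is nonzero with probability 1 - (1 - ρ_A ρ_B)^k for inner dimension k. All function names are hypothetical.

```python
import random


def estimate_product_density(density_a, density_b, inner_dim):
    """Expected nonzero fraction of C = A * B under the independence assumption."""
    return 1.0 - (1.0 - density_a * density_b) ** inner_dim


def random_pattern(n_rows, n_cols, density, rng):
    """Random sparsity pattern, stored as a set of (row, col) positions."""
    return {
        (i, j)
        for i in range(n_rows)
        for j in range(n_cols)
        if rng.random() < density
    }


def product_pattern_density(a, b, n_rows, n_cols):
    """Structural density of A * B computed from the two sparsity patterns."""
    b_cols_by_row = {}
    for k, j in b:
        b_cols_by_row.setdefault(k, set()).add(j)
    nonzeros = set()
    for i, k in a:
        for j in b_cols_by_row.get(k, ()):
            nonzeros.add((i, j))
    return len(nonzeros) / (n_rows * n_cols)


rng = random.Random(0)
n = 60
pattern_a = random_pattern(n, n, 0.1, rng)
pattern_b = random_pattern(n, n, 0.1, rng)

estimated = estimate_product_density(0.1, 0.1, n)
observed = product_pattern_density(pattern_a, pattern_b, n, n)
```

An optimizer can use such an estimate to decide, before multiplying, whether an intermediate result is better stored in a sparse or a dense representation. Real matrices often violate the uniformity assumption, which is presumably why the thesis uses a more advanced estimator.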

    Data warehousing technologies for large-scale and right-time data


    Efficient Incremental Data Analysis

    Many data-intensive applications require real-time analytics over streaming data. In a growing number of domains -- sensor network monitoring, social web applications, clickstream analysis, high-frequency algorithmic trading, and fraud detection, to name a few -- applications continuously monitor stream events to react promptly to certain data conditions. These applications demand responsive analytics even when faced with a high volume and velocity of incoming changes, large numbers of users, and complex processing requirements. Developing a suitable online analytics engine that meets these requirements is challenging. In this thesis, we study techniques for efficient online processing of complex analytical queries, ranging from standard database queries to complex machine learning and digital signal processing workflows. First, we focus on the problem of efficient incremental computation for database queries. We have developed a system, called DBToaster, that compiles declarative queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. At the heart of our system is a recursive query compilation algorithm that materializes a set of supporting higher-order delta views to achieve a substantially lower view maintenance cost. We study the trade-offs between single-tuple and batch incremental processing in local execution, and we present a novel approach for compiling view maintenance code into data-parallel programs optimized for distributed execution. DBToaster supports millions of complete view refreshes per second for a broad range of queries and outperforms commercial database and stream engines by orders of magnitude. We also study incremental computation for queries written as iterative linear algebra, which can capture many machine learning and scientific calculations.
We have developed a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change and make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our last research question concerns the integration of general-purpose query processors and domain-specific operations to enable deep data exploration in both online and offline analysis. We advocate a deep integration of signal processing operations and general-purpose query processors. We demonstrate that in-situ processing of tempo-relational and signal data through a unified query language empowers users to express end-to-end workflows more succinctly inside one system while at the same time offering orders of magnitude better performance than existing popular data management systems.
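
The benefit of keeping a delta in factored form can be sketched for the simplest case, a materialized product C = A·B. This is an illustrative assumption-laden example, not the LINVIEW algorithm: if A changes by a rank-one update u vᵀ, then the change to C is u (vᵀ B), which can be applied without re-evaluating the full product. The helper names are hypothetical.

```python
def matmul(a, b):
    """Plain dense matrix product on lists of lists (full re-evaluation)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for row in a
    ]


def apply_rank_one_delta(c, u, v, b):
    """Maintain C = A * B in place after A += u v^T, using delta C = u (v^T B)."""
    # v^T B is a single row vector; computing it costs one pass over B.
    vt_b = [sum(v[p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
    for i, u_i in enumerate(u):
        if u_i != 0:  # rows of C that the update does not touch stay as they are
            for j, w in enumerate(vt_b):
                c[i][j] += u_i * w
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
view = matmul(a, b)  # materialized view of A * B

# Rank-one change to A: add 2 to the entry in row 0, column 1.
u, v = [1, 0], [0, 2]
a_new = [[a[i][j] + u[i] * v[j] for j in range(2)] for i in range(2)]

view = apply_rank_one_delta(view, u, v, b)
recomputed = matmul(a_new, b)
```

For n-by-n matrices the factored update costs on the order of n² operations, against n³ for re-evaluation. The avalanche effect described in the abstract appears when such deltas are propagated through longer or iterative programs, where a naively expanded delta quickly becomes dense.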

    Linked Open Data - Creating Knowledge Out of Interlinked Data: Results of the LOD2 Project

    Database Management; Artificial Intelligence (incl. Robotics); Information Systems and Communication Service

    Tackling the veracity and variety of big data

    This thesis tackles the veracity and variety challenges of big data, focusing especially on graphs and relational data. We start by proposing a class of graph association rules (GARs) to specify regularities between entities in graphs, which capture both missing links and inconsistencies. A GAR is a combination of a graph pattern and a dependency; it may take machine learning classifiers for link prediction as predicates. We formalize association deduction with GARs in terms of the chase, and prove its Church-Rosser property. We show that the satisfiability, implication and association deduction problems for GARs are coNP-complete, NP-complete and NP-complete, respectively. The incremental deduction problem is DP-complete for GARs. In addition, we provide parallel algorithms for association deduction and incremental deduction. We next develop a parallel algorithm to discover GARs, which applies an application-driven strategy to cut back rules and data that are irrelevant to users' interest, by training a machine learning model to identify data pertaining to a given application. Moreover, we introduce a sampling method to reduce a big graph G to a set H of small sample graphs. Given expected support and recall bounds, this method is able to deduce samples in H and mine rules from H to satisfy the bounds in the entire G. Then we propose a class of temporal association rules (TACOs) for event prediction in temporal graphs. TACOs are defined on temporal graphs in terms of change patterns and (temporal) conditions, and may carry machine learning predicates for temporal event prediction. We settle the complexity of reasoning about TACOs, including their satisfiability, implication and prediction problems. We develop a system that discovers TACOs by iteratively training a rule creator based on generative models in a creator-critic framework, and predicts events by applying the discovered TACOs in parallel.
Finally, we propose an approach to querying relations D and graphs G taken together in SQL. The key idea is that if a tuple t in D and a vertex v in G are determined to refer to the same real-world entity, then we join t and v, correlate their information and complement tuple t with additional attributes of v from graphs. We show how to do this in SQL extended with only syntactic sugar, for both static joins when t is a tuple in D and dynamic joins when t comes from intermediate results of sub-queries on D. To support the semantic joins, we propose an attribute extraction scheme based on K-means clustering to identify and fetch graph properties that are linked to v via paths. Moreover, we develop a scheme to extract a relation schema for entities in graphs, and a heuristic join method based on the schema to strike a balance between the complexity and accuracy of dynamic joins.