304 research outputs found

    GraphflowDB: Scalable Query Processing on Graph-Structured Relations

    Finding patterns over graph-structured datasets is ubiquitous and integral to a wide range of analytical applications, e.g., recommendation and fraud detection. When expressed in the high-level query languages of database management systems (DBMSs), these patterns correspond to many-to-many join computations, which generate very large intermediate relations during query processing and degrade the performance of existing systems. This thesis argues that modern query processors need to adopt two novel techniques to be efficient on growing many-to-many joins: (i) worst-case optimal join algorithms; and (ii) factorized representations. Traditional query processors generate join plans that use binary joins, which in each iteration take two relations, base or intermediate, and join them to produce a new relation. The theory of worst-case optimal joins has shown that this style of join processing can be provably suboptimal and hence generate unnecessarily large intermediate results. This can be avoided on cyclic join queries if the join is performed in a multi-way fashion, one join attribute at a time. As its first contribution, this thesis proposes the design and implementation of a query processor and optimizer that can generate plans mixing worst-case optimal joins, i.e., attribute-at-a-time joins, and binary joins, i.e., table-at-a-time joins. In contrast to prior approaches with novel join optimizers that require solving hard computational problems, such as computing low-width hypertree decompositions of queries, our join optimizer is cost-based and uses a traditional dynamic programming approach with a new cost metric. On acyclic queries, or acyclic parts of queries, the generation of large intermediate results sometimes cannot be avoided. Yet, the theory of factorization has shown that such intermediate results are often highly compressible if they contain multi-valued dependencies between join attributes. Factorization proposes two relation representation schemes, called f- and d-representations, to represent the large intermediate results generated under many-to-many joins in a compressed format. Existing proposals to adopt factorized representations require designing processing on fully materialized general tries and novel operators that operate on entire tries, which are not easy to adopt in existing systems. As a second contribution, we describe the implementation of a novel query processing approach we call factorized vector execution that adopts f-representations. Factorized vector execution extends traditional vectorized query processors to use multiple blocks of vectors instead of a single block, allowing us to factorize intermediate results and delay or even avoid Cartesian products. Importantly, our design ensures that every core operator in the system still performs computations on vectors. As a third contribution, we further describe how to extend our factorized vector execution model with novel operators to adopt d-representations, which extend f-representations with cached and reused sub-relations. Our design here is based on nested hash tables that can point to sub-relations instead of copying them, and on directed acyclic graph-based query plans. All of our techniques are implemented in the GraphflowDB system, which was developed over the years to facilitate the research in this thesis. We demonstrate that GraphflowDB's query processor can outperform existing approaches and systems by orders of magnitude on both micro-benchmarks and end-to-end benchmarks.
The designs proposed in this thesis adopt the common-wisdom query processing techniques of pipelining, vector-based execution, and morsel-driven parallelism to ensure easy adoption in existing systems. We believe the design can serve as a blueprint for how to adopt these techniques in existing DBMSs to make them more efficient on workloads with many-to-many joins.
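
To make the contrast between the two join styles concrete, the sketch below evaluates the triangle query Q(a, b, c) :- R(a, b), S(b, c), T(c, a) with a binary-join plan that materializes an intermediate relation first, and with an attribute-at-a-time plan in the general worst-case optimal style. It is a minimal Python illustration of the technique under invented toy data, not GraphflowDB's implementation.

```python
# Minimal sketch (not GraphflowDB code): the triangle query
#   Q(a, b, c) :- R(a, b), S(b, c), T(c, a)
# evaluated with a binary-join plan versus an attribute-at-a-time
# (worst-case optimal) plan that intersects candidate values per attribute.
from collections import defaultdict

R = {(1, 2), (1, 3), (2, 3)}   # toy many-to-many relations
S = {(2, 3), (3, 1), (3, 4)}
T = {(3, 1), (1, 2), (4, 1)}

def index(rel):
    """Index a binary relation: first attribute -> set of second attributes."""
    idx = defaultdict(set)
    for x, y in rel:
        idx[x].add(y)
    return idx

R_idx, S_idx = index(R), index(S)
T_rev = index({(a, c) for (c, a) in T})   # a -> set of c with T(c, a)

def triangles_binary():
    """Binary-join plan: materialize R join S first, then filter with T."""
    intermediate = [(a, b, c) for (a, b) in R for c in S_idx.get(b, ())]
    return [(a, b, c) for (a, b, c) in intermediate if (c, a) in T]

def triangles_attribute_at_a_time():
    """Bind a, then b, then intersect the candidate c values from S and T."""
    out = []
    for a in R_idx:                                  # attribute a
        for b in R_idx[a]:                           # attribute b, from R(a, b)
            for c in S_idx.get(b, set()) & T_rev.get(a, set()):   # attribute c
                out.append((a, b, c))
    return out

assert sorted(triangles_binary()) == sorted(triangles_attribute_at_a_time())
print(sorted(triangles_attribute_at_a_time()))
```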

    LASSO – an observatorium for the dynamic selection, analysis and comparison of software

    Mining software repositories at the scale of 'big code' (i.e., big data) is a challenging activity. As well as finding a suitable software corpus and making it programmatically accessible through an index or database, researchers and practitioners have to establish an efficient analysis infrastructure and precisely define the metrics and data extraction approaches to be applied. Moreover, for analysis results to be generalisable, these tasks have to be applied at a large enough scale to have statistical significance, and if they are to be repeatable, the artefacts need to be carefully maintained and curated over time. Today, however, a lot of this work is still performed by human beings on a case-by-case basis, with the level of effort involved often having a significant negative impact on the generalisability and repeatability of studies, and thus on their overall scientific value. The general-purpose 'code mining' repositories and infrastructures that have emerged in recent years represent a significant step forward because they automate many software mining tasks at an ultra-large scale and allow researchers and practitioners to focus on defining the questions they would like to explore at an abstract level. However, they are currently limited to static analysis and data extraction techniques, and thus cannot support (i.e., help automate) any studies which involve the execution of software systems. This includes experimental validations of techniques and tools that hypothesise about the behaviour (i.e., semantics) of software, or data analysis and extraction techniques that aim to measure dynamic properties of software. In this thesis a platform called LASSO (Large-Scale Software Observatorium) is introduced that overcomes this limitation by automating the collection of dynamic (i.e., execution-based) information about software alongside static information. It features a single, ultra-large-scale corpus of executable software systems created by amalgamating existing Open Source software repositories, and a dedicated DSL for defining abstract selection and analysis pipelines. Its key innovations are integrated capabilities for searching for and selecting software systems based on their exhibited behaviour, and an 'arena' that allows their responses to software tests to be compared in a purely data-driven way. We call the platform a 'software observatorium' since it is a place where the behaviour of large numbers of software systems can be observed, analysed and compared.
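
The 'arena' idea lends itself to a small illustration: execute several candidate implementations against the same stimuli and compare them purely by the response data they produce. The sketch below is a hypothetical Python rendering of that idea, not LASSO's actual platform or DSL; all function and variable names are invented for illustration.

```python
# Hypothetical sketch of the 'arena' idea: run several candidate implementations
# against the same stimuli and compare them purely by the responses they produce.
# This is not LASSO's platform or DSL; all names here are invented for illustration.
import base64
import binascii
from typing import Any, Callable, Dict, List

def encode_a(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")

def encode_b(data: bytes) -> str:
    # A different (still correct) candidate implementing the same behaviour.
    return binascii.b2a_base64(data, newline=False).decode("ascii")

def arena(candidates: Dict[str, Callable[[bytes], Any]],
          stimuli: List[bytes]) -> Dict[str, List[Any]]:
    """Record each candidate's responses to the shared stimuli."""
    return {name: [fn(s) for s in stimuli] for name, fn in candidates.items()}

def behaviourally_equivalent(observations: Dict[str, List[Any]]) -> bool:
    """Candidates 'behave alike' here iff their response columns are identical."""
    columns = list(observations.values())
    return all(col == columns[0] for col in columns[1:])

stimuli = [b"hello", b"\x00\xff", b"lasso"]
obs = arena({"candidate_a": encode_a, "candidate_b": encode_b}, stimuli)
print(obs)
print("behaviourally equivalent:", behaviourally_equivalent(obs))
```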

    Mechatronics and optimization development for wind tunnel tests

    In the realm of the automotive industry, the development of vehicles entails the fulfillment of numerous requirements such as appealing design, comfort, safety, and efficiency. Notably, in recent years, the significance of efficiency has grown due to increasing environmental concerns regarding internal combustion engine (ICE) vehicles and limitations on the range of battery electric vehicles (BEVs). Of the various engineering aspects, aerodynamics assumes a pivotal role in determining the performance of cars, exerting a substantial influence on vehicle efficiency. To investigate and enhance aerodynamics, automotive companies adopt a combined approach involving both digital and real-world testing. The former is accomplished through the utilization of Computational Fluid Dynamics (CFD) analyses, while the latter entails wind tunnel testing of clay car models. This thesis covers the current approach to the study of aerodynamics, focusing on the issues that affect the existing workflow, including downtime and inaccuracies. In response to these challenges, a novel workflow based on automated mechatronics optimization is introduced and a prototype is tested, thereby showcasing a fresh and more efficient way of working with clay car models tested in wind tunnel facilities. The proposed workflow aims to enhance the aerodynamic optimization of vehicles by implementing a scalable, plug-and-play system that expedites the process and yields advanced, efficient designs. This endeavor has led to remarkable results, such as the development of an innovative diffuser configuration that enhances efficiency during side-wind conditions, as well as a 73.4% reduction in time within the current wind tunnel workflow through the application of automated mechatronics.

    Performance, memory efficiency and programmability: the ambitious triptych of combining vertex-centricity with HPC

    The field of graph processing has grown significantly due to the flexibility and wide applicability of the graph data structure. In the meantime, so has interest from the community in developing new approaches to graph processing applications. In 2010, Google introduced the vertex-centric programming model through their framework Pregel. This consists of expressing computation from the perspective of a vertex, whilst inter-vertex communications are achieved via data exchanges along incoming and outgoing edges, using the message-passing abstraction provided. Pregel's high-level programming interface, designed around a set of simple functions, provides ease of programmability to the user. The aim is to enable the development of graph processing applications without requiring expertise in optimisation or parallel programming. Such challenges are instead abstracted from the user and offloaded to the underlying framework. However, fine-grained synchronisation, unpredictable memory access patterns and multiple sources of load imbalance make it difficult to implement the vertex-centric model efficiently on high-performance computing platforms without sacrificing programmability. This research focuses on combining vertex-centricity and High-Performance Computing (HPC), resulting in the development of a shared-memory framework, iPregel, which demonstrates that performance and memory efficiency similar to that of non-vertex-centric approaches can be achieved while preserving the programmability benefits of vertex-centricity. Non-volatile memory is then explored to extend single-node capabilities, during which multiple versions of iPregel are implemented to experiment with various data movement strategies. Then, distributed-memory parallelism is investigated to overcome the resource limitations of single-node processing. A second framework, named DiP, ports iPregel's applicable optimisations to distributed memory and prioritises performance alongside high scalability. This research has resulted in a set of techniques and optimisations illustrated through a shared-memory framework, iPregel, and a distributed-memory framework, DiP. The former closes a gap of several orders of magnitude in both performance and memory efficiency, and is even able to process a graph of 750 billion edges using non-volatile memory. The latter has proved that this competitiveness can also be scaled beyond a single node, enabling the processing of the largest graph generated in this research, comprising 1.6 trillion edges. Most importantly, both frameworks achieved these performance and capability gains whilst also preserving programmability, which is the cornerstone of the vertex-centric programming model. This research therefore demonstrates that by combining vertex-centricity and High-Performance Computing (HPC), it is possible to maintain performance, memory efficiency and programmability.
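
As a rough illustration of what the vertex-centric abstraction gives the programmer, the sketch below expresses PageRank as a per-vertex compute step with message passing between synchronous supersteps, in the general Pregel style. It is a single-threaded Python toy with invented names, not the API of Pregel, iPregel or DiP.

```python
# Toy vertex-centric PageRank in the general Pregel style: each superstep,
# every vertex folds its incoming messages into local state and sends new
# messages along its outgoing edges. Not the API of Pregel, iPregel or DiP.
from collections import defaultdict

def pagerank_vertex_centric(out_edges, num_supersteps=20, damping=0.85):
    """out_edges: dict mapping each vertex to its list of destination vertices."""
    vertices = list(out_edges)
    n = len(vertices)
    value = {v: 1.0 / n for v in vertices}
    inbox = {v: [] for v in vertices}

    for _ in range(num_supersteps):
        outbox = defaultdict(list)
        for v in vertices:
            # compute(): combine incoming messages, update state, send messages.
            if inbox[v]:
                value[v] = (1 - damping) / n + damping * sum(inbox[v])
            if out_edges[v]:
                share = value[v] / len(out_edges[v])
                for dst in out_edges[v]:
                    outbox[dst].append(share)
        inbox = {v: outbox.get(v, []) for v in vertices}   # barrier between supersteps
    return value

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank_vertex_centric(graph))
```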

    Modeling, Simulation and Data Processing for Additive Manufacturing

    Additive manufacturing (AM) or, more commonly, 3D printing is one of the fundamental elements of Industry 4.0 and the fourth industrial revolution. It has shown its potential, for example, in the medical, automotive, aerospace, and spare part sectors. Personal manufacturing, complex and optimized parts, short-series manufacturing and local on-demand manufacturing are some of the current benefits. Businesses based on AM have experienced double-digit growth in recent years. Accordingly, we have witnessed considerable efforts in developing processes and materials in terms of speed, costs, and availability. These continually open up new applications and business possibilities that did not previously exist. Most research has focused on material and AM process development, or on efforts to utilize existing materials and processes for industrial applications. However, improving the understanding and simulation of materials and AM processes, as well as of the effect of different steps in the AM workflow, can increase performance even further. The best way to benefit from AM is to understand all the steps involved, from design and simulation to additive manufacturing and post-processing, through to the actual application. The objective of this Special Issue was to provide a forum for researchers and practitioners to exchange their latest achievements and identify critical issues and challenges for future investigations on "Modeling, Simulation and Data Processing for Additive Manufacturing". The Special Issue consists of 10 original full-length articles on the topic.

    Integrating Data Science and Earth Science

    This open access book presents the results of three years of collaboration between earth scientists and data scientists in developing and applying data science methods for scientific discovery. The book will be highly beneficial for other researchers at senior and graduate level who are interested in applying visual data exploration, computational approaches and scientific workflows.

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing, combined with artificial intelligence, has led to new exciting application scenarios, where embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately foster the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.

    Women in Artificial intelligence (AI)

    This Special Issue, entitled "Women in Artificial Intelligence", includes 17 papers from leading women scientists. The papers cover a broad scope of research areas within Artificial Intelligence, including machine learning, perception, reasoning or planning, among others. The papers have applications to relevant fields, such as human health, finance, or education. It is worth noting that the Issue includes three papers that deal with different aspects of gender bias in Artificial Intelligence. All the papers have a woman as the first author. We can proudly say that these women are from countries worldwide, such as France, the Czech Republic, the United Kingdom, Australia, Bangladesh, Yemen, Romania, India, Cuba and Spain. In conclusion, apart from its intrinsic scientific value as a Special Issue combining interesting research works, this Special Issue intends to increase the visibility of women in AI, showing where they are, what they do, and how they contribute to developments in Artificial Intelligence from their different places, positions, research branches and application fields. We planned to issue this book on Ada Lovelace Day (11/10/2022), a date internationally dedicated to the first computer programmer, a woman who had to fight the gender difficulties of her times, in the nineteenth century. We also thank the publisher for making this possible, thus allowing this book to become a part of the international activities dedicated to celebrating the value of women in ICT all over the world. With this book, we want to pay homage to all the women who have contributed over the years to the field of AI.

    Artificial intelligence driven anomaly detection for big data systems

    The main goal of this thesis is to contribute to the research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially for Big Data platforms within cloud computing environments. The late detection and manual resolution of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. New precise and efficient performance management methods are therefore the key to handling performance anomalies and interference impacts, and to improving the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning (ML) algorithms, as well as on four different monitoring datasets. The results show that our proposed method outperforms other ML methods, typically achieving 98–99% F-scores. Moreover, we show that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model revolves around using artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters to efficiently train the anomaly detection model to achieve high accuracy. The objective is to accelerate the search for the training dataset size, optimize neural network configurations, and improve the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system is performed, demonstrating that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments by up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting the completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within the system. Our interference detection model can estimate and alleviate the task slowdown caused by the interference.
This model assists the system operators in making accurate decisions to optimize job placement. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish the completion time prediction of batch jobs within cloud environments. We compare our model with three other baseline models (a queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4500 experiments using the DaCapo benchmarking suite was carried out, confirming the predictive efficiency and capabilities of the proposed model, which achieves up to 10% MAPE compared with the other models.
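
As a purely illustrative sketch of the kind of neural-network anomaly classifier described above, the snippet below trains a small multilayer perceptron on synthetic monitoring metrics and reports its F-score. The features, data, and model sizes are invented stand-ins; the thesis's actual methods (RDD characteristics, OS metrics, Bayesian Optimization of training configurations) are considerably richer.

```python
# Illustrative sketch only: a small neural-network classifier over synthetic
# monitoring metrics (e.g., CPU, memory, GC time, shuffle bytes), standing in for
# the kind of ANN-based performance anomaly detector described above. All feature
# names, data distributions, and hyperparameters here are invented.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic "normal" and "anomalous" windows of four monitoring metrics.
normal = rng.normal(loc=[0.4, 0.5, 0.1, 0.3], scale=0.05, size=(500, 4))
anomalous = rng.normal(loc=[0.9, 0.8, 0.6, 0.7], scale=0.10, size=(100, 4))
X = np.vstack([normal, anomalous])
y = np.array([0] * len(normal) + [1] * len(anomalous))   # 1 = anomaly

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("F1 on held-out windows:", f1_score(y_test, clf.predict(X_test)))
```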

    Intelligent Biosignal Analysis Methods

    This book describes recent efforts in improving intelligent systems for automatic biosignal analysis. It focuses on machine learning and deep learning methods used for the classification of different organism states and disorders based on biomedical signals such as EEG, ECG, HRV, and others.
