228,907 research outputs found

    Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

    Full text link
    Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because there is lesser need for feature engineering for recent deep learning approaches, but instead more need for large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems

    A theoretical and methodological framework for machine learning in survival analysis: Enabling transparent and accessible predictive modelling on right-censored time-to-event data

    Get PDF
    Survival analysis is an important field of Statistics concerned with mak- ing time-to-event predictions with ‘censored’ data. Machine learning, specifically supervised learning, is the field of Statistics concerned with using state-of-the-art algorithms in order to make predictions on unseen data. This thesis looks at unifying these two fields as current research into the two is still disjoint, with ‘classical survival’ on one side and su- pervised learning (primarily classification and regression) on the other. This PhD aims to improve the quality of machine learning research in survival analysis by focusing on transparency, accessibility, and predic- tive performance in model building and evaluation. This is achieved by examining historic and current proposals and implementations for models and measures (both classical and machine learning) in survival analysis and making novel contributions. In particular this includes: i) a survey of survival models including a crit- ical and technical survey of almost all supervised learning model classes currently utilised in survival, as well as novel adaptations; ii) a survey of evaluation measures for survival models, including key definitions, proofs and theorems for survival scoring rules that had previously been missing from the literature; iii) introduction and formalisation of composition and reduction in survival analysis, with a view on increasing transparency of modelling strategies and improving predictive performance; iv) imple- mentation of several R software packages, in particular mlr3proba for machine learning in survival analysis; and v) the first large-scale bench- mark experiment on right-censored time-to-event data with 24 survival models and 66 datasets. Survival analysis has many important applications in medical statistics, engineering and finance, and as such requires the same level of rigour as other machine learning fields such as regression and classification; this thesis aims to make this clear by describing a framework from prediction and evaluation to implementation

    Culture and E-Learning: Automatic Detection of a Users’ Culture from Survey Data

    Get PDF
    Knowledge about the culture of a user is especially important for the design of e-learning applications. In the experiment reported here, questionnaire data was used to build machine learning models to automatically predict the culture of a user. This work can be applied to automatic culture detection and subsequently to the adaptation of user interfaces in e-learning

    A Survey on Compiler Autotuning using Machine Learning

    Full text link
    Since the mid-1990s, researchers have been trying to use machine-learning based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grain classification among different approaches and finally, the influential papers of the field.Comment: version 5.0 (updated on September 2018)- Preprint Version For our Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated quarterly here (Send me your new published papers to be added in the subsequent version) History: Received November 2016; Revised August 2017; Revised February 2018; Accepted March 2018

    SANTO: Social Aerial NavigaTion in Outdoors

    Get PDF
    In recent years, the advances in remote connectivity, miniaturization of electronic components and computing power has led to the integration of these technologies in daily devices like cars or aerial vehicles. From these, a consumer-grade option that has gained popularity are the drones or unmanned aerial vehicles, namely quadrotors. Although until recently they have not been used for commercial applications, their inherent potential for a number of tasks where small and intelligent devices are needed is huge. However, although the integrated hardware has advanced exponentially, the refinement of software used for these applications has not beet yet exploited enough. Recently, this shift is visible in the improvement of common tasks in the field of robotics, such as object tracking or autonomous navigation. Moreover, these challenges can become bigger when taking into account the dynamic nature of the real world, where the insight about the current environment is constantly changing. These settings are considered in the improvement of robot-human interaction, where the potential use of these devices is clear, and algorithms are being developed to improve this situation. By the use of the latest advances in artificial intelligence, the human brain behavior is simulated by the so-called neural networks, in such a way that computing system performs as similar as possible as the human behavior. To this end, the system does learn by error which, in an akin way to the human learning, requires a set of previous experiences quite considerable, in order for the algorithm to retain the manners. Applying these technologies to robot-human interaction do narrow the gap. Even so, from a bird's eye, a noticeable time slot used for the application of these technologies is required for the curation of a high-quality dataset, in order to ensure that the learning process is optimal and no wrong actions are retained. Therefore, it is essential to have a development platform in place to ensure these principles are enforced throughout the whole process of creation and optimization of the algorithm. In this work, multiple already-existing handicaps found in pipelines of this computational gauge are exposed, approaching each of them in a independent and simple manner, in such a way that the solutions proposed can be leveraged by the maximum number of workflows. On one side, this project concentrates on reducing the number of bugs introduced by flawed data, as to help the researchers to focus on developing more sophisticated models. On the other side, the shortage of integrated development systems for this kind of pipelines is envisaged, and with special care those using simulated or controlled environments, with the goal of easing the continuous iteration of these pipelines.Thanks to the increasing popularity of drones, the research and development of autonomous capibilities has become easier. However, due to the challenge of integrating multiple technologies, the available software stack to engage this task is restricted. In this thesis, we accent the divergencies among unmanned-aerial-vehicle simulators and propose a platform to allow faster and in-depth prototyping of machine learning algorithms for this drones
    • …
    corecore