11,413 research outputs found

    HoloDetect: Few-Shot Learning for Error Detection

    Full text link
    We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches.Comment: 18 pages

    An ADMM Based Framework for AutoML Pipeline Configuration

    Full text link
    We study the AutoML problem of automatically configuring machine learning pipelines by jointly selecting algorithms and their appropriate hyper-parameters for all steps in supervised learning pipelines. This black-box (gradient-free) optimization with mixed integer & continuous variables is a challenging problem. We propose a novel AutoML scheme by leveraging the alternating direction method of multipliers (ADMM). The proposed framework is able to (i) decompose the optimization problem into easier sub-problems that have a reduced number of variables and circumvent the challenge of mixed variable categories, and (ii) incorporate black-box constraints along-side the black-box optimization objective. We empirically evaluate the flexibility (in utilizing existing AutoML techniques), effectiveness (against open source AutoML toolkits),and unique capability (of executing AutoML with practically motivated black-box constraints) of our proposed scheme on a collection of binary classification data sets from UCI ML& OpenML repositories. We observe that on an average our framework provides significant gains in comparison to other AutoML frameworks (Auto-sklearn & TPOT), highlighting the practical advantages of this framework

    Active Learning With Complementary Sampling for Instructing Class-Biased Multi-Label Text Emotion Classification

    Get PDF
    High-quality corpora have been very scarce for the text emotion research. Existing corpora with multi-label emotion annotations have been either too small or too class-biased to properly support a supervised emotion learning. In this paper, we propose a novel active learning method for efficiently instructing the human annotations for a less-biased and high-quality multi-label emotion corpus. Specifically, to compensate annotation for the minority-class examples, we propose a complementary sampling strategy based on unlabeled resources by measuring a probabilistic distance between the expected emotion label distribution in a temporary corpus and an uniform distribution. Qualitative evaluations are also given to the unlabeled examples, in which we evaluate the model uncertainties for multi-label emotion predictions, their syntactic representativeness for the other unlabeled examples, and their diverseness to the labeled examples, for a high-quality sampling. Through active learning, a supervised emotion classifier gets progressively improved by learning from these new examples. Experiment results suggest that by following these sampling strategies we can develop a corpus of high-quality examples with significantly relieved bias for emotion classes. Compared to the learning procedures based on traditional active learning algorithms, our learning procedure indicates the most efficient learning curve and estimates the best multi-label emotion predictions

    The Challenge of Machine Learning in Space Weather Nowcasting and Forecasting

    Get PDF
    The numerous recent breakthroughs in machine learning (ML) make imperative to carefully ponder how the scientific community can benefit from a technology that, although not necessarily new, is today living its golden age. This Grand Challenge review paper is focused on the present and future role of machine learning in space weather. The purpose is twofold. On one hand, we will discuss previous works that use ML for space weather forecasting, focusing in particular on the few areas that have seen most activity: the forecasting of geomagnetic indices, of relativistic electrons at geosynchronous orbits, of solar flares occurrence, of coronal mass ejection propagation time, and of solar wind speed. On the other hand, this paper serves as a gentle introduction to the field of machine learning tailored to the space weather community and as a pointer to a number of open challenges that we believe the community should undertake in the next decade. The recurring themes throughout the review are the need to shift our forecasting paradigm to a probabilistic approach focused on the reliable assessment of uncertainties, and the combination of physics-based and machine learning approaches, known as gray-box.Comment: under revie

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
    corecore