
    Exploiting Class Learnability in Noisy Data

    In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real-world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set and guide supervised training to spend time on each class in proportion to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with a classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes. (Comment: Accepted to AAAI 201)
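    The per-class selection strategy sketched in the abstract could look roughly like the following; note that the sampling rule, the use of per-class validation accuracy as a stand-in for "learnability", and all names are illustrative assumptions, not the paper's actual algorithm:

```python
import random
from collections import defaultdict

def learnability_weighted_sample(train_by_class, val_accuracy, batch_size, rng=random):
    """Sample a training batch with per-class probability proportional to the
    model's current validation accuracy on that class (a crude proxy for how
    learnable the class appears to be)."""
    classes = list(train_by_class)
    # Floor weights so no class's probability collapses to exactly zero.
    weights = [max(val_accuracy.get(c, 0.0), 1e-6) for c in classes]
    total = sum(weights)
    probs = [w / total for w in weights]
    batch = []
    for _ in range(batch_size):
        c = rng.choices(classes, weights=probs, k=1)[0]
        batch.append(rng.choice(train_by_class[c]))
    return batch
```

    In an online loop one would re-estimate `val_accuracy` after each training round, so classes the model fails to generalize on gradually receive less training time.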

    Large-scale medical image annotation with quality-controlled crowdsourcing

    Accurate annotations of medical images are essential for various clinical applications. The remarkable advances in machine learning, especially deep learning based techniques, show great potential for automatic image segmentation. However, these solutions require a huge amount of accurately annotated reference data for training. Especially in the domain of medical image analysis, the availability of domain experts for reference data generation is becoming a major bottleneck for machine learning applications. In this context, crowdsourcing has gained increasing attention as a tool for low-cost and large-scale data annotation. As a method to outsource cognitive tasks to anonymous non-expert workers over the internet, it has evolved into a valuable tool for data annotation in various research fields. Major challenges in crowdsourcing remain the high variance in annotation quality as well as the individual workers' lack of domain-specific knowledge. Current state-of-the-art methods for quality control usually induce further costs, as they rely on a redundant distribution of tasks or perform additional annotations on tasks with an already known reference outcome. The aim of this thesis is to apply common crowdsourcing techniques to large-scale medical image annotation and to create a cost-effective quality control method for crowd-sourced image annotation. The problem of large-scale medical image annotation is addressed by introducing a hybrid crowd-algorithm approach that allowed expert-level organ segmentation in CT scans. A pilot study performed on the case of liver segmentation in abdominal CT scans showed that the proposed approach is able to create organ segmentations matching the quality of those created by medical experts. Recording the behavior of individual non-expert online workers during the annotation process in clickstreams enabled the derivation of an annotation quality measure that could successfully be used to merge crowd-sourced segmentations.
A comprehensive validation study performed with various object classes from publicly available data sets demonstrated that the presented quality control measure generalizes well over different object classes and clearly outperforms state-of-the-art methods in terms of costs and segmentation quality. In conclusion, the methods introduced in this thesis are an essential contribution to reducing annotation costs and further improving the quality of crowd-sourced image segmentation.
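The quality-weighted fusion of crowd segmentations described above might look schematically like this; the pixel-wise weighted vote and the `quality_scores` input (standing in for the clickstream-derived quality measure) are illustrative assumptions, not the thesis's actual merging method:

```python
def merge_segmentations(masks, quality_scores):
    """Merge binary segmentation masks from several crowd workers by
    quality-weighted pixel-wise majority voting: a pixel is foreground in
    the merged mask if workers voting for it carry at least half of the
    total quality weight."""
    total = sum(quality_scores)
    h, w = len(masks[0]), len(masks[0][0])
    merged = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vote = sum(q * m[r][c] for q, m in zip(quality_scores, masks))
            merged[r][c] = 1 if vote >= total / 2 else 0
    return merged
```

With such a scheme, a single high-quality worker can outvote several low-quality ones, which is the intended effect of weighting by an annotation quality measure rather than counting heads.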

    Efficiently and Effectively Learning Models of Similarity from Human Feedback

    Vital to the success of many machine learning tasks is the ability to reason about how objects relate. For this, machine learning methods utilize a model of similarity that describes how objects are to be compared. While traditional methods commonly compare objects as feature vectors by standard measures such as the Euclidean distance or cosine similarity, other models of similarity can be used that include auxiliary information beyond what is conveyed through features. To build such models, information must be given about object relationships that is beneficial to the task being considered. In many tasks, such as object recognition, ranking, product recommendation, and data visualization, a model based on human perception can lead to high performance. Other tasks require models that reflect certain domain expertise. In both cases, humans are able to provide information that can be used to build useful models of similarity. It is this reason that motivates similarity-learning methods that use human feedback to guide the construction of models of similarity. Associated with the task of learning similarity from human feedback are many practical challenges that must be considered. In this dissertation we explicitly define these challenges as being those of efficiency and effectiveness. Efficiency deals both with making the most of obtained feedback and with reducing the computational run time of the learning algorithms themselves. Effectiveness concerns itself with producing models that accurately reflect the given feedback, but also with ensuring the queries posed to humans are those they can answer easily and without errors. After defining these challenges, we create novel learning methods that explicitly focus on one or more of these challenges as a means to improve on the state of the art in similarity learning. Specifically, we develop methods for learning models of perceptual similarity, as well as models that reflect domain expertise.
In doing so, we enable similarity-learning methods to be practically applied in more real-world problem settings.
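One simple instance of learning similarity from human feedback is fitting a weighted distance to triplet judgments of the form "anchor a is more similar to p than to n". The diagonal-metric parameterization and hinge-style update below are an illustrative sketch of that general idea, not the dissertation's actual methods:

```python
def learn_diagonal_metric(triplets, dim, lr=0.05, margin=1.0, epochs=20):
    """Learn per-feature weights w so that, for each human-provided triplet
    (anchor, positive, negative), the weighted squared distance
    d_w(x, y) = sum_i w_i * (x_i - y_i)^2 satisfies
    d_w(a, p) + margin < d_w(a, n)."""
    w = [1.0] * dim

    def dist(x, y):
        return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y))

    for _ in range(epochs):
        for a, p, n in triplets:
            if dist(a, p) + margin > dist(a, n):  # triplet constraint violated
                for i in range(dim):
                    # Gradient of the hinge loss w.r.t. w[i].
                    grad = (a[i] - p[i]) ** 2 - (a[i] - n[i]) ** 2
                    w[i] = max(w[i] - lr * grad, 0.0)  # keep weights nonnegative
    return w
```

The learned weights downweight features that contradict the human judgments, so the resulting distance reflects the feedback rather than treating all features as equally informative, as plain Euclidean distance would.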

    Plastic Rain in Protected Areas of the United States

    Eleven billion metric tons of plastic are projected to accumulate in the environment by 2025. Because plastics are persistent, they fragment into pieces that are susceptible to wind entrainment. Using high-resolution spatial and temporal data, we tested whether plastics deposited in wet versus dry conditions have distinct atmospheric life histories. Further, we report on the rates and sources of deposition to remote U.S. conservation areas. We show that urban centers and resuspension from soils or water are principal sources for wet-deposited plastics. By contrast, plastics deposited under dry conditions were smaller in size, and the rates of deposition were related to indices that suggest longer-range or global transport. Deposition rates averaged 132 plastics per square meter per day, which amounts to >1000 metric tons of plastic deposition to western U.S. protected lands annually.