
    Leveraging the Potentials of Dedicated Collaborative Interactive Learning: Conceptual Foundations to Overcome Uncertainty by Human-Machine Collaboration

    When a learning system learns from data that was previously assigned to categories, we say that the learning system learns in a supervised way. By supervised, we mean that a higher entity, for example a human, has arranged the data into categories. Fully categorizing the data is cost-intensive and time-consuming. Moreover, the categories (labels) provided by humans may be subject to uncertainty, as humans are prone to error. This is where dedicated collaborative interactive learning (D-CIL) comes in: the learning system can decide from which data it learns, copes with uncertainty regarding the categories, and does not require a fully labeled dataset. Against this background, we lay the foundations for two central challenges at this early development stage of D-CIL: task complexity and uncertainty. We present an approach to crowdsourcing traffic sign labels with self-assessment that will support leveraging the potentials of D-CIL.
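    The abstract does not specify how self-assessment enters label aggregation; a minimal sketch, assuming each annotator reports a confidence in [0, 1] alongside a label, is to weight votes by that self-assessed confidence (the function name and vote format are illustrative, not from the paper):

    ```python
    from collections import defaultdict

    def aggregate_labels(votes):
        """Aggregate crowdsourced labels for one item, weighting each vote
        by the annotator's self-assessed confidence in [0, 1].

        votes: list of (label, confidence) pairs.
        Returns the label with the highest total confidence weight.
        """
        weights = defaultdict(float)
        for label, confidence in votes:
            weights[label] += confidence
        return max(weights, key=weights.get)

    # Three annotators label a traffic sign; the confident majority wins.
    votes = [("stop", 0.9), ("yield", 0.4), ("stop", 0.7)]
    print(aggregate_labels(votes))  # -> stop
    ```

    Confidence weighting like this lets the learner down-weight uncertain human labels instead of treating every annotation as ground truth.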

    Active Learning with Label Proportions


    Exploiting Counter-Examples for Active Learning with Partial labels

    This paper studies a new problem, active learning with partial labels (ALPL). In this setting, an oracle annotates the query samples with partial labels, relieving the oracle of the demanding accurate labeling process. To address ALPL, we first build an intuitive baseline that can be seamlessly incorporated into existing AL frameworks. Though effective, this baseline is still susceptible to overfitting and falls short in selecting representative partial-label-based samples during the query process. Drawing inspiration from human inference in cognitive science, where accurate inferences can be explicitly derived from counter-examples (CEs), our objective is to leverage this human-like learning pattern to tackle overfitting while enhancing the process of selecting representative samples in ALPL. Specifically, we construct CEs by reversing the partial labels for each instance, and then propose a simple but effective WorseNet to learn directly from this complementary pattern. By leveraging the distribution gap between WorseNet and the predictor, this adversarial evaluation enhances both the performance of the predictor itself and the sample selection process, allowing the predictor to capture more accurate patterns in the data. Experimental results on five real-world datasets and four benchmark datasets show that our proposed method achieves comprehensive improvements over ten representative AL frameworks, highlighting the superiority of WorseNet. The source code will be available at https://github.com/Ferenas/APLL.
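    The construction of counter-examples by "reversing the partial labels" can be sketched concretely: a partial label is a candidate set guaranteed to contain the true class, so its complement is a set of provably wrong labels. A minimal illustration (the function name is ours, not from the paper):

    ```python
    def make_counter_example(partial_labels, num_classes):
        """Construct a counter-example (CE) by reversing a partial label set.

        partial_labels: set of candidate class indices; the true class is
        guaranteed to be inside this set.
        Returns the complementary set: classes that are provably NOT the
        true label, usable as a complementary training signal.
        """
        return set(range(num_classes)) - set(partial_labels)

    # With 5 classes, candidate set {0, 2} reverses to wrong-label set {1, 3, 4}.
    print(make_counter_example({0, 2}, 5))  # -> {1, 3, 4}
    ```

    A network trained on these reversed sets (WorseNet, in the paper's terminology) learns which classes an instance does not belong to, complementing the ordinary predictor.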

    New Paradigms for Active Learning

    In traditional active learning, learning algorithms (or learners) mainly focus on the performance of the final model built and the total number of queries needed for learning a good model. However, in many real-world applications, active learners have to focus on the learning process itself to achieve finer goals, such as minimizing the number of mistakes in predicting unlabeled examples. These learning goals are common and important in real-world applications. For example, in direct marketing, a sales agent (learner) has to focus on the process of selecting customers to approach, and tries to make correct predictions (i.e., fewer mistakes) on the customers who will buy the product. However, traditional active learning algorithms cannot achieve these finer learning goals due to their different focus. In this thesis, we study how to control the learning process in active learning such that those goals can be accomplished. According to various learning tasks and goals, we introduce four new active learning paradigms as follows. The first paradigm is learning actively and conservatively. Under this paradigm, the learner actively selects and predicts the most certain example (thus, conservatively) iteratively during the learning process. The goal of this paradigm is to minimize the number of mistakes in predicting unlabeled examples during active learning. Intuitively, the conservative strategy is less likely to make mistakes, i.e., more likely to achieve the learning goal. We apply this new learning strategy in educational software, as well as direct marketing. The second paradigm is learning actively and aggressively. Under this paradigm, unlabeled examples and multiple oracles are available. The learner actively selects the best multiple oracles to label the most uncertain example (thus, aggressively) iteratively during the learning process. The learning goal is to learn a good model with guaranteed label quality.
    The third paradigm is learning actively with a conservative-aggressive tradeoff. Under this learning paradigm, firstly, unlabeled examples are available and learners are allowed to select examples actively to learn. Secondly, to obtain the labels, two actions can be considered: querying oracles and making predictions. Lastly, a cost has to be paid for querying oracles or for making wrong predictions. The tradeoff between the two actions is necessary for achieving the learning goal: minimizing the total cost of obtaining the labels. The last paradigm is learning actively with minimal/maximal effort. Under this paradigm, the labels of the examples are all provided and learners are allowed to select examples actively to learn. The learning goal is to control the learning process by selecting examples actively such that the learning can be accomplished with minimal effort, or a good model can be built quickly with maximal effort. For each of the four learning paradigms, we propose effective learning algorithms and demonstrate empirically that related learning problems in real applications can be solved well and the learning goals can be accomplished. In summary, this thesis focuses on controlling the learning process to achieve fine goals in active learning. According to various real application tasks, we propose four novel learning paradigms, and for each paradigm we propose efficient learning algorithms to solve the learning problems. The experimental results show that our learning algorithms outperform other state-of-the-art learning algorithms.
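    The first paradigm's core step, selecting and predicting the most certain unlabeled example, can be sketched in a few lines; this is a minimal illustration of the selection rule as described, not the thesis's actual algorithm:

    ```python
    import numpy as np

    def conservative_step(probs):
        """One step of 'learning actively and conservatively': among the
        unlabeled pool, pick the example the current model is MOST certain
        about and predict its label, minimizing the chance of a mistake
        during the learning process.

        probs: (n_samples, n_classes) array of predicted class probabilities.
        Returns (index of the most certain example, its predicted class).
        """
        certainty = probs.max(axis=1)      # confidence of the top class per example
        i = int(certainty.argmax())        # most certain example in the pool
        return i, int(probs[i].argmax())   # conservative prediction

    probs = np.array([[0.55, 0.45],   # uncertain
                      [0.95, 0.05],   # very certain -> selected
                      [0.60, 0.40]])
    print(conservative_step(probs))  # -> (1, 0)
    ```

    This is the mirror image of classic uncertainty sampling: instead of querying the least certain example, the learner commits to the most certain one, which is exactly why it makes fewer mistakes during the process.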

    User-Centric Active Learning for Outlier Detection

    Outlier detection searches for unusual, rare observations in large, often high-dimensional data sets. One of the fundamental challenges of outlier detection is that "unusual" typically depends on the perception of a user, the recipient of the detection result. This makes it difficult to find a formal definition of "unusual" that matches user expectations. One way to deal with this issue is active learning, i.e., methods that ask users to provide auxiliary information, such as class label annotations, to return algorithmic results that are more in line with the user input. Active learning is well suited for outlier detection, and many such methods have been proposed in recent years. However, existing methods build upon strong assumptions. One example is the assumption that users can always provide accurate feedback, regardless of how algorithmic results are presented to them -- an assumption which is unlikely to hold when data is high-dimensional. It is an open question to what extent existing assumptions stand in the way of realizing active learning in practice. In this thesis, we study this question from different perspectives with a differentiated, user-centric view on active learning. In the beginning, we structure and unify the research area of active learning for outlier detection. Specifically, we present a rigorous specification of the learning setup, structure the basic building blocks, and propose novel evaluation standards. Throughout our work, this structure has turned out to be essential for selecting a suitable active learning method and for assessing novel contributions in this field. We then present two algorithmic contributions to make active learning for outlier detection user-centric. First, we bring together two research areas that have so far been studied independently: outlier detection in subspaces and active learning.
    Subspace outlier detection methods improve outlier detection quality in high-dimensional data and make detection results easier to interpret. Our approach combines them with active learning such that one can balance detection quality against annotation effort. Second, we address one of the fundamental difficulties of adapting active learning to specific applications: selecting good hyperparameter values. Existing methods to estimate hyperparameter values are heuristics, and it is unclear in which settings they work well. In this thesis, we therefore propose the first principled method to estimate hyperparameter values. Our approach relies on active learning to estimate hyperparameter values and returns a quality estimate of the values selected. In the last part of the thesis, we look at validating active learning for outlier detection in practice. There, we have identified several technical and conceptual challenges which we have experienced firsthand in our research. We structure and document them, and finally derive a roadmap towards validating active learning for outlier detection with user studies.
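    The basic feedback loop described here, asking the user to label the observations the detector is least sure about, can be sketched with a common query heuristic: show the user the point whose outlier score lies closest to the decision threshold. This is a generic illustration of the setup, not the thesis's specific method:

    ```python
    def query_for_feedback(scores, threshold):
        """Pick the observation whose outlier score is closest to the
        decision threshold -- the case the detector is least sure about,
        and hence the most informative one to show the user for labeling.

        scores: list of outlier scores, one per observation.
        Returns the index of the observation to query.
        """
        return min(range(len(scores)), key=lambda i: abs(scores[i] - threshold))

    scores = [0.1, 0.48, 0.9, 0.7]   # hypothetical outlier scores
    print(query_for_feedback(scores, threshold=0.5))  # -> 1
    ```

    The thesis's point is that such loops implicitly assume the user can actually judge the queried point; in high-dimensional data, presenting an interpretable subspace view alongside the query is what makes that assumption plausible.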

    Active Learning from Knowledge-Rich Data

    With the ever-increasing demand for the quality and quantity of training samples, it is difficult to replicate the success of modern machine learning models in knowledge-rich domains, where labeled data for training is scarce and labeling new data is expensive. While machine learning and AI have achieved significant progress in many common domains, the lack of large-scale labeled data samples poses a grand challenge for the wide application of advanced statistical learning models in key knowledge-rich domains, such as medicine, biology, physical science, and more. Active learning (AL) offers a promising and powerful learning paradigm that can significantly reduce the data-annotation burden by allowing the model to sample only the informative objects to learn from human experts. Previous AL models leverage simple criteria to explore the data space and achieve fast convergence of AL. However, those active sampling methods are less effective in exploring knowledge-rich data spaces and result in slow convergence of AL. In this thesis, we propose novel AL methods to address knowledge-rich data exploration challenges with respect to different types of machine learning tasks. Specifically, for multi-class tasks, we propose three approaches that leverage different types of sparse kernel machines to better capture the data covariance and use them to guide effective data exploration in a complex feature space. For multi-label tasks, it is essential to capture label correlations, and we model them in three different approaches to guide effective data exploration in a large and correlated label space. For data exploration in a very high-dimensional feature space, we present novel uncertainty measures to better control the exploration behavior of deep learning models and leverage a uniquely designed regularizer to achieve effective exploration in high-dimensional space.
    Our proposed models not only exhibit good exploration behavior on different types of knowledge-rich data but also manage to achieve an optimal exploration-exploitation balance with strong theoretical underpinnings. In the end, we study active learning in a more realistic scenario where human annotators provide noisy labels. We propose a re-sampling paradigm that leverages the machine's awareness to reduce the noise rate. We theoretically prove the effectiveness of the re-sampling paradigm and design a novel spatial-temporal active re-sampling function by leveraging the critical spatial and temporal properties of maximum-margin kernel classifiers.
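    The idea of using "the machine's awareness" to reduce label noise can be illustrated with a simple rule: re-query an annotation when a confident model disagrees with the oracle's label. This is our own minimal sketch of a re-sampling trigger, not the thesis's spatial-temporal re-sampling function:

    ```python
    def should_resample(model_probs, oracle_label, confidence=0.9):
        """Decide whether to re-query a possibly noisy annotation.

        model_probs: predicted class probabilities for one instance.
        oracle_label: class index returned by the (noisy) annotator.
        Returns True when the model is highly confident AND disagrees
        with the oracle -- the cases where re-sampling the label is most
        likely to catch annotation noise.
        """
        predicted = max(range(len(model_probs)), key=lambda c: model_probs[c])
        confident = model_probs[predicted] >= confidence
        return confident and predicted != oracle_label

    print(should_resample([0.95, 0.05], oracle_label=1))  # -> True
    print(should_resample([0.60, 0.40], oracle_label=1))  # -> False
    ```

    Re-querying only on confident disagreement keeps the extra annotation cost focused on the labels most likely to be wrong.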

    Advanced Entity Resolution Techniques

    Entity resolution is the task of determining which records in one or more data sets correspond to the same real-world entities. Entity resolution is an important problem with a range of applications for government agencies, commercial organisations, and research institutions. Due to the important practical applications and many open challenges, entity resolution is an active area of research, and a variety of techniques have been developed for each part of the entity resolution process. This thesis is about improving the viability of sophisticated entity resolution techniques for real-world entity resolution problems. Collective entity resolution techniques are a subclass of entity resolution approaches that incorporate relationships into the entity resolution process and introduce dependencies between matching decisions. Group linkage techniques match multiple related records at the same time. Temporal entity resolution techniques incorporate changing attribute values and relationships into the entity resolution process. Population reconstruction techniques match records with different entity roles and very limited information in the presence of domain constraints. Sophisticated entity resolution techniques such as these produce good results when applied to small data sets in an academic environment. However, they suffer from a number of limitations which make them harder to apply to real-world problems. In this thesis, we aim to address several of these limitations with the goal of enabling such advanced entity resolution techniques to see more use in practical applications. One of the main limitations of existing advanced entity resolution techniques is poor scalability. We propose a novel size-constrained blocking framework that allows the user to set minimum and maximum block-size thresholds, and then generates blocks where the number of records in each block is within the size range.
    This allows efficiency requirements to be met, improves parallelisation, and allows expensive techniques with poor scalability, such as Markov logic networks, to be used. Another significant limitation of advanced entity resolution techniques in practice is a lack of training data. Collective entity resolution techniques make use of relationship information, so a bootstrapping process is required in order to generate initial relationships. Many techniques for temporal entity resolution, group linkage, and population reconstruction also require training data. In this thesis, we propose a novel approach for automatically generating high-quality training data using a combination of domain constraints and ambiguity. We also show how we can incorporate these constraints and ambiguity measures into active learning to further improve the training data set. We also address the problem of parameter tuning and evaluation. Advanced entity resolution approaches typically have a large number of parameters that need to be tuned for good performance. We propose a novel approach using transitive closure that eliminates unsound parameter choices in the blocking and similarity calculation steps and reduces the number of iterations of the entity resolution process and the corresponding evaluation. Finally, we present a case study where we extend our training data generation approach to situations where relationships exist between records. We make use of the relationship information to validate the matches generated by our technique, and we also extend the concept of ambiguity to cover groups, allowing us to increase the size of the generated set of matches. We apply this approach to a very complex and challenging data set of population registry data and demonstrate that we can still create high-quality training data when other approaches are inadequate.
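    The size-constrained blocking idea can be sketched simply: split each blocking-key group into chunks no larger than the maximum, and discard (or, in a real system, merge) chunks below the minimum. This is a simplified illustration under our own assumptions; the thesis's framework handles small blocks more carefully than dropping them:

    ```python
    def size_constrained_blocks(groups, min_size, max_size):
        """Sketch of size-constrained blocking.

        groups: lists of record ids sharing a blocking key.
        Splits each group into chunks of at most max_size records and
        keeps only chunks with at least min_size records, so every
        emitted block satisfies min_size <= len(block) <= max_size.
        """
        blocks = []
        for records in groups:
            for i in range(0, len(records), max_size):
                chunk = records[i:i + max_size]
                if len(chunk) >= min_size:
                    blocks.append(chunk)
        return blocks

    groups = [list(range(7)), [10, 11], [20]]
    print(size_constrained_blocks(groups, min_size=2, max_size=3))
    # -> [[0, 1, 2], [3, 4, 5], [10, 11]]
    ```

    Bounding block sizes caps the number of pairwise comparisons per block (at most max_size choose 2), which is what makes expensive matchers such as Markov logic networks tractable and lets blocks be processed in parallel with balanced load.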

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II.