33 research outputs found

    Learning Active Learning from Data

    In this paper, we suggest a novel data-driven approach to active learning (AL). The key idea is to train a regressor that predicts the expected error reduction for a candidate sample in a particular learning state. By formulating the query selection procedure as a regression problem we are not restricted to working with existing AL heuristics; instead, we learn strategies based on experience from previous AL outcomes. We show that a strategy can be learnt either from simple synthetic 2D datasets or from a subset of domain-specific data. Our method yields strategies that work well on real data from a wide range of domains.
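
The query selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `State` and `Candidate` feature layouts, the `select_query` helper, and the `toy_regressor` stand-in are all assumptions made for the example. A real system would plug in a regressor trained on previous AL outcomes.

```python
from typing import Callable, Sequence

# Hypothetical types: a "state" summarises the current learner (for example,
# labelled-set size and validation accuracy); a "candidate" holds the features
# of one unlabelled sample.
State = Sequence[float]
Candidate = Sequence[float]


def select_query(
    regressor: Callable[[State, Candidate], float],
    state: State,
    candidates: Sequence[Candidate],
) -> int:
    """Return the index of the candidate with the largest predicted error reduction."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    scores = [regressor(state, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def toy_regressor(state: State, candidate: Candidate) -> float:
    # Stand-in for a trained regressor (assumption): favour candidates whose
    # single feature is close to 0.5, an imagined decision boundary.
    return -abs(candidate[0] - 0.5)


pool = [(0.1,), (0.45,), (0.9,)]
chosen = select_query(toy_regressor, state=(10.0, 0.8), candidates=pool)
```

Treating selection as an argmax over predicted error reductions keeps the strategy independent of any particular heuristic; only the regressor changes.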

    Active Learning for NLP with Large Language Models

    Human annotation of training samples is expensive, laborious, and sometimes challenging, especially for Natural Language Processing (NLP) tasks. To reduce labeling cost and improve sample efficiency, Active Learning (AL) can be used to label as few samples as possible while reaching reasonable or comparable results. Given the significant advances in Large Language Models (LLMs), LLMs are good candidates for annotating samples and reducing costs further. This work investigates the accuracy and cost of using LLMs (GPT-3.5 and GPT-4) to label samples on 3 different datasets. A consistency-based strategy is proposed to select samples that are potentially incorrectly labeled so that human annotations can be used for those samples in AL settings; we call this the mixed annotation strategy. We then test the performance of AL under two settings: (1) using human annotations only; (2) using the proposed mixed annotation strategy. The accuracy of AL models under 3 AL query strategies is reported on 3 text classification datasets: AG's News, TREC-6, and Rotten Tomatoes. On AG's News and Rotten Tomatoes, the models trained with the mixed annotation strategy achieve similar or better results compared to those trained with human annotations. The method reveals the great potential of LLMs as annotators in terms of accuracy and cost efficiency in active learning settings.
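
A consistency-based routing step of the kind this abstract describes might look like the sketch below. The abstract does not say how consistency is measured, so the rule used here (agreement among repeated LLM annotations of the same sample) is an assumption, as are the function name `route_by_consistency`, the `min_agreement` parameter, and the sample data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def route_by_consistency(
    llm_votes: Dict[str, List[str]],
    min_agreement: float = 1.0,
) -> Tuple[Dict[str, str], List[str]]:
    """Split samples into LLM-labelled ones and ones sent to human annotators.

    llm_votes maps a sample id to the labels returned by repeated LLM queries.
    A sample keeps its majority LLM label only if the share of votes for that
    label is at least min_agreement; otherwise it is routed to a human.
    """
    auto_labels: Dict[str, str] = {}
    needs_human: List[str] = []
    for sample_id, votes in llm_votes.items():
        if not votes:
            # No LLM answer at all: always ask a human.
            needs_human.append(sample_id)
            continue
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            auto_labels[sample_id] = label
        else:
            needs_human.append(sample_id)
    return auto_labels, needs_human


# Hypothetical votes from three LLM queries per sample.
votes = {
    "s1": ["sports", "sports", "sports"],
    "s2": ["world", "business", "world"],
}
auto_labels, needs_human = route_by_consistency(votes)
```

With the default threshold of full agreement, only unanimously labelled samples skip human review; lowering `min_agreement` trades annotation cost against label noise.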

    Balancing Exploration and Exploitation: A New Algorithm for Active Machine Learning

    Active machine learning algorithms are used when large numbers of unlabeled examples are available and getting labels for them is costly (e.g., requiring consultation with a human expert). Many conventional active learning algorithms focus on refining the decision boundary, at the expense of exploring new regions that the current hypothesis misclassifies. We propose a new active learning algorithm that balances such exploration with refinement of the decision boundary by dynamically adjusting the probability to explore at each step. Our experimental results demonstrate improved performance on data sets that require extensive exploration while remaining competitive on data sets that do not. Our algorithm also shows significant tolerance of noise.
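
The balance between exploration and boundary refinement can be illustrated with a small sketch. The abstract does not give the update rule, so the one below (raise the exploration probability when an exploratory query turns out to be misclassified, lower it otherwise) is an assumption, and the names `choose_index`, `update_p_explore`, and `step` are hypothetical.

```python
import random
from typing import Sequence, Tuple


def choose_index(
    uncertainty: Sequence[float], p_explore: float, rng: random.Random
) -> Tuple[int, bool]:
    """Pick a pool index and report whether the pick was exploratory."""
    if rng.random() < p_explore:
        # Explore: sample uniformly from the unlabelled pool.
        return rng.randrange(len(uncertainty)), True
    # Exploit: query the sample the current model is least sure about.
    best = max(range(len(uncertainty)), key=uncertainty.__getitem__)
    return best, False


def update_p_explore(
    p: float, explored: bool, surprised: bool, step: float = 0.1
) -> float:
    """Assumed rule: adjust p only after exploratory queries, clamped to [0, 1]."""
    if not explored:
        return p
    p = p + step if surprised else p - step
    return min(1.0, max(0.0, p))


rng = random.Random(0)
index, explored = choose_index([0.1, 0.9, 0.3], p_explore=0.0, rng=rng)
```

Here `surprised` stands for "the revealed label contradicted the current hypothesis", which is the signal that unexplored regions still hold errors.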

    Combining active learning and semi-supervised learning techniques to extract protein interaction sentences

    Background: Protein-protein interaction (PPI) extraction has been a focal point of many biomedical research and database curation tools. Both Active Learning (AL) and Semi-supervised SVMs have recently been applied to extract PPI automatically. In this paper, we explore combining AL with semi-supervised learning (SSL) to improve performance on the PPI task. Methods: We propose a novel PPI extraction technique called PPISpotter, which combines Deterministic Annealing-based SSL with an AL technique to extract protein-protein interactions. In addition, we extract a comprehensive set of features from MEDLINE records using Natural Language Processing (NLP) techniques, which further improve the SVM classifiers. Our feature selection technique incorporates syntactic, semantic, and lexical properties of text, which boosts system performance significantly. Results: In experiments with three different PPI corpora, we show that PPISpotter is superior to the other techniques incorporated into semi-supervised SVMs, such as Random Sampling, Clustering, and Transductive SVMs, in precision, recall, and F-measure. Conclusions: Our system is a novel, state-of-the-art technique for efficiently extracting protein-protein interaction pairs.

    Semantic Labeling of Mobile LiDAR Point Clouds via Active Learning and Higher Order MRF

    Using mobile Light Detection and Ranging (LiDAR) point clouds to accomplish road scene labeling tasks shows promise for a variety of applications. Most existing methods for semantic labeling of point clouds require a huge number of fully supervised point cloud scenes, where each point needs to be manually annotated with a specific category. Manually annotating each point in point cloud scenes is labor intensive and hinders practical usage of those methods. To alleviate this burden of manual annotation, in this paper we introduce an active learning method that avoids annotating whole point cloud scenes by iteratively annotating a small portion of unlabeled supervoxels and creating a minimal manually annotated training set. To avoid the biased sampling present in traditional active learning methods, a neighbor-consistency prior is exploited to select potentially misclassified samples into the training set, improving the accuracy of the statistical model. Furthermore, many methods consider only short-range contextual information when conducting semantic labeling tasks and ignore the long-range contexts among local variables. In this paper, we use a higher order Markov random field model to take more context into account when refining the labeling results, despite the lack of fully supervised scenes. Evaluations on three data sets show that our proposed framework achieves high accuracy in labeling point clouds although only a small portion of labels is provided. Moreover, comparative experiments demonstrate that our proposed framework is superior to traditional sampling methods and exhibits performance comparable to that of fully supervised models. Funding: National Natural Science Foundation of China (10.13039/501100001809); Collaborative Innovation Center of Haixi Government Affairs Big Data Sharin

    Semi-automated image analysis for the identification of bivalve larvae from a Cape Cod estuary

    Author Posting. © Association for the Sciences of Limnology and Oceanography, 2012. This article is posted here by permission of Association for the Sciences of Limnology and Oceanography for personal use, not for redistribution. The definitive version was published in Limnology and Oceanography: Methods 10 (2012): 538-554, doi:10.4319/lom.2012.10.538. Machine-learning methods for identifying planktonic organisms are becoming well-established. Although similar morphologies among species make traditional image identification methods difficult for larval bivalves, species-specific shell birefringence patterns under polarized light permit identification by color and texture-based features. This approach uses cross-polarized images of bivalve larvae, extracts Gabor and color angle features from each image, and classifies images using a Support Vector Machine. We adapted this method, which was established on hatchery-reared larvae, to identify bivalve larvae from a series of field samples from a Cape Cod estuary in 2009. This method had 98% identification accuracy for four hatchery-reared species. We used a multiplex polymerase chain reaction (PCR) method to confirm field identifications and to compare accuracies to the software classifications. Image classification of larvae collected in the field had lower accuracies than both the classification of hatchery species and PCR-based identification due to error in visually classifying unknown larvae and variability in larval images from the field. A six-species field training set had the best correspondence to our visual classifications with 75% overall agreement and individual species agreements from 63% to 88%. Larval abundance estimates for a time-series of field samples showed good correspondence with visual methods after correction. Overall, this approach represents a cost- and time-saving alternative to molecular-based identifications and can produce sufficient results to address long-term abundance and transport-based questions on a species-specific level, a rarity in studies of bivalve larvae. This project was supported by an award to S. Gallager and C. Mingione Thompson from the Estuarine Reserves Division, Office of Ocean and Coastal Resource Management, National Ocean Service, National Oceanic and Atmospheric Administration and a grant from Woods Hole Oceanographic Institution’s Coastal Ocean Institute.