
    Estimation of errors in text and data processing

    Get PDF
    The company Adiss Lab Ltd. obtained 1 000 000 medical reports that are either in free-form text or in XML format. One of the main goals of its development is to integrate an algorithm for information extraction (IE) into its platform. The verification of the algorithm’s output for a report is done by a medical doctor (MD) for a certain fee, so validating the correctness of all the data would be overwhelming and very expensive. Hence, the problem, as presented by the company, is to provide a method (algorithm) that determines the minimum number of reports needed to validate the correctness of the IE algorithm, together with a procedure for selecting these reports. To solve the problem, we have considered an algorithm-centric approach that uses active learning and semi-supervised learning.
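    The abstract leaves the sample-size calculation implicit. As a minimal sketch, assuming each sampled report gets a binary correct/incorrect verdict from the MD and reports are drawn uniformly at random, a Hoeffding-style bound gives the number of verifications needed; the function names and the epsilon/delta values below are illustrative assumptions, not the paper’s method.

        import math
        import random

        def min_validation_sample(epsilon: float, delta: float) -> int:
            """Smallest n (Hoeffding bound) such that the error rate observed
            on n MD-checked reports is within +/- epsilon of the true error
            rate of the IE algorithm with probability at least 1 - delta."""
            return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

        def select_reports(report_ids: list, n: int, seed: int = 0) -> list:
            """Uniform random choice of which reports to send for MD review."""
            return random.Random(seed).sample(report_ids, n)

        # For a 2% margin at 95% confidence, only 4612 of the 1 000 000
        # reports need to be verified.
        n = min_validation_sample(epsilon=0.02, delta=0.05)
        chosen = select_reports(list(range(1_000_000)), n)

    The active-learning refinement the abstract mentions would replace the uniform draw with a sampler biased toward the reports the IE model is least certain about.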

    ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation

    Full text link
    The incredible capabilities of generative artificial intelligence models have inevitably led to their application in the domain of drug discovery. Within this domain, the vastness of chemical space motivates the development of more efficient methods for identifying regions with molecules that exhibit desired characteristics. In this work, we present a computationally efficient active learning methodology that requires evaluation of only a subset of the generated data in the constructed sample space to successfully align a generative model with respect to a specified objective. We demonstrate the applicability of this methodology to targeted molecular generation by fine-tuning a GPT-based molecular generator toward a protein with FDA-approved small-molecule inhibitors, c-Abl kinase. Remarkably, the model learns to generate molecules similar to the inhibitors without prior knowledge of their existence, and even reproduces two of them exactly. We also show that the methodology is effective for a protein without any commercially available small-molecule inhibitors, the HNH domain of the CRISPR-associated protein 9 (Cas9) enzyme. We believe that the inherent generality of this method ensures that it will remain applicable as the exciting field of in silico molecular generation evolves. To facilitate implementation and reproducibility, we have made all of our software available through the open-source ChemSpaceAL Python package.
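    The methodology described above reduces to a generate / score-a-subset / fine-tune cycle. The sketch below shows one such round with the generator, scorer, and fine-tuning step passed in as callables; these names, the subset rule, and the threshold are hypothetical stand-ins, not the ChemSpaceAL package’s actual API.

        import random
        from typing import Callable

        def active_learning_round(
            generate: Callable[[int], list],     # sample SMILES strings from the generator
            score: Callable[[str], float],       # e.g. a target-specific docking score
            fine_tune: Callable[[list], None],   # one fine-tuning pass on the kept molecules
            n_generate: int = 10_000,
            n_score: int = 500,
            threshold: float = 0.5,
            seed: int = 0,
        ) -> list:
            rng = random.Random(seed)
            candidates = generate(n_generate)
            # Evaluate only a small subset of the constructed sample space...
            subset = rng.sample(candidates, min(n_score, len(candidates)))
            kept = [mol for mol in subset if score(mol) >= threshold]
            # ...and pull the generator toward the high-scoring region.
            fine_tune(kept)
            return kept

    Iterating the round re-generates from the updated model, so the sample space drifts toward the specified objective without the scorer ever evaluating the full generated set.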

    Discrepancy-Based Active Learning for Domain Adaptation

    Full text link
    The goal of the paper is to design active learning strategies which lead to domain adaptation under an assumption of covariate shift in the case of a Lipschitz labeling function. Building on previous work by Mansour et al. (2009), we adapt the concept of discrepancy distance between source and target distributions to restrict the maximization over the hypothesis class to a localized class of functions that perform accurate labeling on the source domain. We derive generalization error bounds for such active learning strategies in terms of the Rademacher average and localized discrepancy for general loss functions which satisfy a regularity condition. A practical K-medoids algorithm that can address the case of large data sets is inferred from the theoretical bounds. Our numerical experiments show that the proposed algorithm is competitive against other state-of-the-art active learning techniques in the context of domain adaptation, in particular on large data sets of around one hundred thousand images. (Comment: 28 pages, 11 figures)
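    For readers unfamiliar with the selection step, a generic PAM-style k-medoids over target-domain features picks representative points to query. The sketch below illustrates that generic step only, not the paper’s exact algorithm; its O(n^2) distance matrix would need subsampling or a mini-batch variant at the hundred-thousand-image scale used in the experiments.

        import numpy as np

        def kmedoids_select(X: np.ndarray, k: int, n_iter: int = 20, seed: int = 0) -> np.ndarray:
            """Return the indices of k medoids of the feature matrix X
            (one row per unlabelled target point) to send for labeling."""
            rng = np.random.default_rng(seed)
            D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
            medoids = rng.choice(len(X), size=k, replace=False)
            for _ in range(n_iter):
                labels = np.argmin(D[:, medoids], axis=1)  # nearest-medoid assignment
                new = medoids.copy()
                for j in range(k):
                    members = np.flatnonzero(labels == j)
                    if members.size:
                        # medoid = cluster member minimizing total within-cluster distance
                        new[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
                if np.array_equal(new, medoids):
                    break
                medoids = new
            return medoids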

    Efficient Methods for Natural Language Processing: A Survey

    Full text link
    Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, and energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim both to provide guidance for conducting NLP under limited resources and to point towards promising research directions for developing more efficient methods. (Comment: Accepted at TACL, pre-publication version)

    Active learning in VAE latent space

    Get PDF

    Reducing the Burden of Aerial Image Labelling Through Human-in-the-Loop Machine Learning Methods

    Get PDF
    This dissertation presents an introduction to human-in-the-loop deep learning methods for remote sensing applications. It is motivated by the need to decrease the time spent by volunteers on semantic segmentation of remote sensing imagery. We look at two human-in-the-loop approaches to speeding up the labelling of remote sensing data: interactive segmentation and active learning. We develop these methods specifically in response to the needs of disaster relief organisations, which require accurately labelled maps of disaster-stricken regions quickly in order to respond to the needs of the affected communities. To begin, we survey the current approaches used within the field. We analyse the shortcomings of these models, which include outputs ill-suited for uploading to mapping databases and an inability to label new regions well when those regions differ from the regions trained on. The methods developed here address these shortcomings. We first develop an interactive segmentation algorithm. Interactive segmentation aims to segment objects with a supervisory signal from a user to assist the model. Work within interactive segmentation has focused largely on segmenting one or a few objects within an image. We make a few adaptations to allow an existing method to scale to remote sensing applications, where there are tens of objects within a single image that need to be segmented. We show quantitative improvements of up to 18% in mean intersection over union, as well as qualitative improvements. The algorithm works well when labelling new regions, and the qualitative improvements show outputs more suitable for uploading to mapping databases. We then investigate active learning in the context of remote sensing. Active learning aims to reduce the number of labelled samples required by a model to achieve an acceptable level of performance. Within the context of deep learning, the utility of the various active learning strategies developed is uncertain, with conflicting results in the literature. We evaluate and compare a variety of sample acquisition strategies on semantic segmentation tasks in scenarios relevant to disaster relief mapping. Our results show that all active learning strategies evaluated provide minimal performance increases over a simple random sample acquisition strategy. However, we present an analysis of the results illustrating how the various strategies work and intuition about when certain active learning strategies might be preferred. This analysis could be used to inform future research. We conclude by providing examples of the synergies of these two approaches and indicate how this work, on reducing the burden of aerial image labelling for the disaster relief mapping community, can be further extended.
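    As an illustration of the kind of acquisition strategy the dissertation compares against the random baseline, the sketch below ranks unlabelled images by mean per-pixel predictive entropy; the (classes, height, width) probability shape and the top-k rule are illustrative assumptions, not the dissertation’s code.

        import numpy as np

        def image_entropy(probs: np.ndarray) -> float:
            """Mean per-pixel predictive entropy of a softmax segmentation
            output with shape (classes, height, width)."""
            eps = 1e-12
            return float(-(probs * np.log(probs + eps)).sum(axis=0).mean())

        def acquire(pool_probs: list, k: int, strategy: str = "entropy", seed: int = 0) -> list:
            """Pick k pool indices to send to volunteer annotators."""
            if strategy == "random":  # the baseline the dissertation found hard to beat
                rng = np.random.default_rng(seed)
                return list(rng.choice(len(pool_probs), size=k, replace=False))
            scores = [image_entropy(p) for p in pool_probs]
            return list(np.argsort(scores)[-k:])  # most uncertain images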

    Potentials and Limitations of Active Learning

    Get PDF
    This article investigates the potential and limitations of using Active Learning (AL) to reduce AI’s carbon footprint and increase the accessibility of machine learning to low-resource projects. First, the paper reviews the recent literature on sustainable AI. The core of the article concerns AL as an emissions-reduction technique. Because AL reduces the data required for model training, one can hypothesize that energy consumption (and, accordingly, carbon emissions) also decreases. This paper tests this assumption. The leading questions are whether AL is more efficient than traditional data sampling strategies and how AL can be optimized for sustainability. The experiments show that the benefit of AL strongly depends on its parameter settings and the data set size. Only in limited scenarios does the reduction in training data outweigh the computational costs of AL itself. For projects with more resources for annotation, AL is beneficial from an ecological perspective and should ideally be paired with model compression techniques. For smaller projects, however, AL can even have a negative impact on machine learning’s carbon footprint.
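    The trade-off the article measures can be made concrete with a back-of-the-envelope cost model: AL pays for repeated retraining on a growing labelled set plus pool-wide inference for query selection in every round, whereas conventional training pays once for the full pool. All constants below are illustrative assumptions, not the article’s measurements.

        def full_training_cost(n_pool: int, epochs: int, c_train: float = 1.0) -> float:
            """Compute units to train once on the fully labelled pool."""
            return c_train * n_pool * epochs

        def al_cost(n_pool: int, rounds: int, batch: int, epochs: int,
                    c_train: float = 1.0, c_infer: float = 0.1) -> float:
            """Cumulative cost of an AL run: retrain on the growing labelled
            set each round, then score the remaining pool to pick queries."""
            cost, labelled = 0.0, 0
            for _ in range(rounds):
                labelled += batch
                cost += c_train * labelled * epochs     # retraining
                cost += c_infer * (n_pool - labelled)   # query selection
            return cost

        print(full_training_cost(100_000, epochs=5))               # 500000.0
        print(al_cost(100_000, rounds=10, batch=1_000, epochs=5))  # 369500.0

    Which side of the comparison wins flips with the round count, batch size, and relative inference cost, matching the article’s finding that AL’s benefit depends strongly on its parameter settings.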