Estimation of errors in text and data processing
The company Adiss Lab Ltd. obtained 1 000 000 medical reports that are either free-form text or XML. One of the main goals of its development is to integrate an information extraction (IE) algorithm into its platform. The algorithm’s output for a report is verified by a medical doctor (MD) for a certain fee, so validating the correctness of all the data would be overwhelming and very expensive. Hence, the problem, as presented by the company, is to provide a method (algorithm) that determines the minimum number of reports needed to validate the correctness of the IE algorithm, together with a procedure for selecting those reports.
To solve the problem we considered an algorithm-centric approach that uses active learning and semi-supervised learning.
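The selection step of such an approach can be sketched as uncertainty sampling: the IE model reports a confidence score per report, and only the least-confident reports are sent to the MD for paid verification. This is a minimal sketch under assumed names; the function and the scoring interface are illustrative, not the company's actual system.

```python
def select_reports_for_review(confidences, budget):
    """Return indices of the `budget` reports the IE model is least sure about.

    `confidences` is a list of per-report confidence scores in [0, 1];
    `budget` is how many MD verifications we can afford.
    """
    # Rank report indices by ascending confidence and keep the cheapest-to-doubt ones.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:budget]

# Example: with a review budget of 2, the two lowest-confidence reports are chosen.
scores = [0.99, 0.42, 0.87, 0.15, 0.93]
print(select_reports_for_review(scores, 2))  # [3, 1]
```

In practice the budget would be grown until the verified sample gives sufficient statistical confidence in the IE algorithm's overall error rate.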
ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation
The incredible capabilities of generative artificial intelligence models have
inevitably led to their application in the domain of drug discovery. Within
this domain, the vastness of chemical space motivates the development of more
efficient methods for identifying regions with molecules that exhibit desired
characteristics. In this work, we present a computationally efficient active
learning methodology that requires evaluation of only a subset of the generated
data in the constructed sample space to successfully align a generative model
with respect to a specified objective. We demonstrate the applicability of this
methodology to targeted molecular generation by fine-tuning a GPT-based
molecular generator toward a protein with FDA-approved small-molecule
inhibitors, c-Abl kinase. Remarkably, the model learns to generate molecules
similar to the inhibitors without prior knowledge of their existence, and even
reproduces two of them exactly. We also show that the methodology is effective
for a protein without any commercially available small-molecule inhibitors, the
HNH domain of the CRISPR-associated protein 9 (Cas9) enzyme. We believe that
the inherent generality of this method ensures that it will remain applicable
as the exciting field of in silico molecular generation evolves. To facilitate
implementation and reproducibility, we have made all of our software available
through the open-source ChemSpaceAL Python package.
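The general loop the abstract describes, scoring only a subset of generated candidates and keeping the high scorers for fine-tuning, can be sketched as follows. This is structure only, with toy stand-ins; ChemSpaceAL's actual pipeline (molecular generation, protein-specific scoring) is more involved.

```python
import random

def active_learning_round(generate, score, n_candidates, n_scored, threshold):
    """One round: generate candidates, score only a subset, keep high scorers."""
    candidates = [generate() for _ in range(n_candidates)]
    subset = random.sample(candidates, n_scored)      # evaluate only a subset
    return [c for c in subset if score(c) >= threshold]  # fine-tuning pool

random.seed(0)
# Toy stand-ins: "molecules" are floats, the objective prefers large values.
pool = active_learning_round(lambda: random.random(), lambda x: x,
                             n_candidates=1000, n_scored=50, threshold=0.8)
print(len(pool), "candidates retained for fine-tuning")
```

The retained pool would then be used to fine-tune the generator, biasing subsequent rounds toward the objective without ever scoring the full sample space.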
Discrepancy-Based Active Learning for Domain Adaptation
The goal of the paper is to design active learning strategies which lead to
domain adaptation under an assumption of covariate shift in the case of
Lipschitz labeling function. Building on previous work by Mansour et al. (2009)
we adapt the concept of discrepancy distance between source and target
distributions to restrict the maximization over the hypothesis class to a
localized class of functions which are performing accurate labeling on the
source domain. We derive generalization error bounds for such active learning
strategies in terms of Rademacher average and localized discrepancy for general
loss functions which satisfy a regularity condition. A practical K-medoids
algorithm that can address the case of large data set is inferred from the
theoretical bounds. Our numerical experiments show that the proposed algorithm
is competitive against other state-of-the-art active learning techniques in the
context of domain adaptation, in particular on large data sets of around one
hundred thousand images.
Comment: 28 pages, 11 figures
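The K-medoids step can be illustrated with a naive PAM-style implementation: alternate between assigning points to their nearest medoid and re-choosing each medoid as the cluster member minimizing total within-cluster distance. This is a generic sketch; the paper's practical algorithm and its localized-discrepancy distance are more involved.

```python
def k_medoids(points, k, dist, n_iter=10):
    """Naive k-medoids: returns sorted indices of the chosen medoids."""
    medoids = list(range(k))  # initialize with the first k points
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: each medoid becomes the member with minimal total distance.
        new_medoids = []
        for members in clusters.values():
            best = min(members, key=lambda c: sum(dist(points[c], points[j])
                                                  for j in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    return sorted(medoids)

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
print(k_medoids(pts, 2, lambda a, b: abs(a - b)))  # [1, 4], one per cluster
```

In the active-learning setting, the selected medoids are the points sent for labelling, giving a batch that covers the target distribution.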
Efficient Methods for Natural Language Processing: A Survey
Recent work in natural language processing (NLP) has yielded appealing
results from scaling model parameters and training data; however, using only
scale to improve performance means that resource consumption also grows. Such
resources include data, time, storage, or energy, all of which are naturally
limited and unevenly distributed. This motivates research into efficient
methods that require fewer resources to achieve similar results. This survey
synthesizes and relates current methods and findings in efficient NLP. We aim
to provide both guidance for conducting NLP under limited resources, and point
towards promising research directions for developing more efficient methods.
Comment: Accepted at TACL, pre-publication version
Reducing the Burden of Aerial Image Labelling Through Human-in-the-Loop Machine Learning Methods
This dissertation presents an introduction to human-in-the-loop deep learning methods for remote sensing applications. It is motivated by the need to decrease the time volunteers spend on semantic segmentation of remote sensing imagery. We look at two human-in-the-loop approaches to speeding up the labelling of remote sensing data: interactive segmentation and active learning. We develop these methods specifically in response to the needs of disaster relief organisations, which require accurately labelled maps of disaster-stricken regions quickly in order to respond to the needs of the affected communities. To begin, we survey the current approaches used within the field. We analyse the shortcomings of these models, which include outputs ill-suited for uploading to mapping databases and an inability to label new regions well when those regions differ from the regions trained on. The methods we develop then address these shortcomings. We first develop an interactive segmentation algorithm. Interactive segmentation aims to segment objects with a supervisory signal from a user to assist the model. Work within interactive segmentation has focused largely on segmenting one or a few objects within an image. We make a few adaptations that allow an existing method to scale to remote sensing applications, where there are tens of objects within a single image that need to be segmented. We show quantitative improvements of up to 18% in mean intersection over union, as well as qualitative improvements. The algorithm works well when labelling new regions, and the qualitative improvements show outputs more suitable for uploading to mapping databases. We then investigate active learning in the context of remote sensing. Active learning aims to reduce the number of labelled samples a model requires to achieve an acceptable performance level.
Within the context of deep learning, the utility of the various active learning strategies developed is uncertain, with conflicting results in the literature. We evaluate and compare a variety of sample acquisition strategies on semantic segmentation tasks in scenarios relevant to disaster relief mapping. Our results show that all of the active learning strategies evaluated provide only minimal performance gains over a simple random sample acquisition strategy. However, we present an analysis of the results that illustrates how the various strategies work and offers intuition for when certain strategies might be preferred; this analysis could inform future research. We conclude by providing examples of the synergies between these two approaches and indicate how this work, on reducing the burden of aerial image labelling for the disaster relief mapping community, can be further extended.
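The two acquisition baselines being compared, random sampling versus uncertainty sampling, can be sketched concretely. Here uncertainty is measured by the entropy of each sample's predicted class probabilities; the function names are illustrative, not from the dissertation.

```python
import math
import random

def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def acquire(probabilities, batch, strategy="entropy"):
    """Pick `batch` sample indices to label next, by the given strategy."""
    idx = list(range(len(probabilities)))
    if strategy == "random":
        return random.sample(idx, batch)  # the simple baseline
    # Uncertainty sampling: take the samples with the highest predictive entropy.
    return sorted(idx, key=lambda i: entropy(probabilities[i]), reverse=True)[:batch]

# Toy per-sample class probabilities from a binary segmentation model.
preds = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
print(acquire(preds, 2))  # [3, 1], the two most uncertain predictions
```

The dissertation's finding is precisely that, for its segmentation tasks, the entropy branch offered little gain over the `random` branch.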
Potentials and Limitations of Active Learning
This article investigates the potential and limitations of using Active Learning (AL) to reduce AI’s carbon footprint and increase the accessibility of machine learning to low-resource projects. First, the paper reviews the recent literature on sustainable AI. The core of the article concerns AL as an emissions reduction technique. Because AL reduces the data required for model training, one can hypothesize that energy consumption, and accordingly carbon emissions, also decrease. This paper tests that assumption. The leading questions are whether AL is more efficient than traditional data sampling strategies and how AL can be optimized for sustainability. The experiments show that the benefit of AL strongly depends on its parameter settings and the data set size. Only in limited scenarios does the reduction in data size outweigh the computational costs of AL itself. For projects with more resources for annotation, AL is beneficial from an ecological perspective and should ideally be paired with model compression techniques. For smaller projects, however, AL can even have a negative impact on machine learning’s carbon footprint.