442,296 research outputs found
Committee-Based Sample Selection for Probabilistic Classifiers
In many real-world learning tasks, it is expensive to acquire a sufficient
number of labeled examples for training. This paper investigates methods for
reducing annotation cost by `sample selection'. In this approach, during
training the learning program examines many unlabeled examples and selects for
labeling only those that are most informative at each stage. This avoids
redundantly labeling examples that contribute little new information. Our work
follows on previous research on Query By Committee, extending the
committee-based paradigm to the context of probabilistic classification. We
describe a family of empirical methods for committee-based sample selection in
probabilistic classification models, which evaluate the informativeness of an
example by measuring the degree of disagreement between several model variants.
These variants (the committee) are drawn randomly from a probability
distribution conditioned by the training set labeled so far. The method was
applied to the real-world natural language processing task of stochastic
part-of-speech tagging. We find that all variants of the method achieve a
significant reduction in annotation cost, although their computational
efficiency differs. In particular, the simplest variant, a two member committee
with no parameters to tune, gives excellent results. We also show that sample
selection yields a significant reduction in the size of the model used by the
tagger
Robust and Fair Machine Learning under Distribution Shift
Machine learning algorithms have been widely used in real world applications. The development of these techniques has brought huge benefits for many AI-related tasks, such as natural language processing, image classification, video analysis, and so forth. In traditional machine learning algorithms, we usually assume that the training data and test data are independently and identically distributed (iid), indicating that the model learned from the training data can be well applied to the test data with good prediction performance. However, this assumption is quite restrictive because the distribution shift can exist from the training data to the test data in many scenarios. In addition, the goal of traditional machine learning model is to maximize the prediction performance, e.g., accuracy, based on the historical training data, which may tend to make unfair predictions for some particular individual or groups. In the literature, researchers either focus on building robust machine learning models under data distribution shift or achieving fairness separately, without considering to solve them simultaneously.
The goal of this dissertation is to solve the above challenging issues in fair machine learning under distribution shift. We start from building an agnostic fair framework in federated learning as the data distribution is more diversified and distribution shift exists from the training data to the test data. Then we build a robust framework to address the sample selection bias for fair classification. Next we solve the sample selection bias issue for fair regression. Finally, we propose an adversarial framework to build a personalized model in the distributed setting where the distribution shift exists between different users.
In this dissertation, we conduct the following research for fair machine learning under distribution shift. • We develop a fairness-aware agnostic federated learning framework (AgnosticFair) to deal with the challenge of unknown testing distribution; • We propose a framework for robust and fair learning under sample selection bias; • We develop a framework for fair regression under sample selection bias when dependent variable values of a set of samples from the training data are missing as a result of another hidden process; • We propose a learning framework that allows an individual user to build a personalized model in a distributed setting, where the distribution shift exists among different users
Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio
Despite significant advancements in deep learning for vision and natural
language, unsupervised domain adaptation in audio remains relatively
unexplored. We, in part, attribute this to the lack of an appropriate benchmark
dataset. To address this gap, we present Synthia's melody, a novel audio data
generation framework capable of simulating an infinite variety of 4-second
melodies with user-specified confounding structures characterised by musical
keys, timbre, and loudness. Unlike existing datasets collected under
observational settings, Synthia's melody is free of unobserved biases, ensuring
the reproducibility and comparability of experiments. To showcase its utility,
we generate two types of distribution shifts-domain shift and sample selection
bias-and evaluate the performance of acoustic deep learning models under these
shifts. Our evaluations reveal that Synthia's melody provides a robust testbed
for examining the susceptibility of these models to varying levels of
distribution shift
Harnessing the Power of Beta Scoring in Deep Active Learning for Multi-Label Text Classification
Within the scope of natural language processing, the domain of multi-label
text classification is uniquely challenging due to its expansive and uneven
label distribution. The complexity deepens due to the demand for an extensive
set of annotated data for training an advanced deep learning model, especially
in specialized fields where the labeling task can be labor-intensive and often
requires domain-specific knowledge. Addressing these challenges, our study
introduces a novel deep active learning strategy, capitalizing on the Beta
family of proper scoring rules within the Expected Loss Reduction framework. It
computes the expected increase in scores using the Beta Scoring Rules, which
are then transformed into sample vector representations. These vector
representations guide the diverse selection of informative samples, directly
linking this process to the model's expected proper score. Comprehensive
evaluations across both synthetic and real datasets reveal our method's
capability to often outperform established acquisition techniques in
multi-label text classification, presenting encouraging outcomes across various
architectural and dataset scenarios.Comment: 7 pages AAAI 202
CohortGPT: An Enhanced GPT for Participant Recruitment in Clinical Study
Participant recruitment based on unstructured medical texts such as clinical
notes and radiology reports has been a challenging yet important task for the
cohort establishment in clinical research. Recently, Large Language Models
(LLMs) such as ChatGPT have achieved tremendous success in various downstream
tasks thanks to their promising performance in language understanding,
inference, and generation. It is then natural to test their feasibility in
solving the cohort recruitment task, which involves the classification of a
given paragraph of medical text into disease label(s). However, when applied to
knowledge-intensive problem settings such as medical text classification, where
the LLMs are expected to understand the decision made by human experts and
accurately identify the implied disease labels, the LLMs show a mediocre
performance. A possible explanation is that, by only using the medical text,
the LLMs neglect to use the rich context of additional information that
languages afford. To this end, we propose to use a knowledge graph as auxiliary
information to guide the LLMs in making predictions. Moreover, to further boost
the LLMs adapt to the problem setting, we apply a chain-of-thought (CoT) sample
selection strategy enhanced by reinforcement learning, which selects a set of
CoT samples given each individual medical report. Experimental results and
various ablation studies show that our few-shot learning method achieves
satisfactory performance compared with fine-tuning strategies and gains superb
advantages when the available data is limited. The code and sample dataset of
the proposed CohortGPT model is available at:
https://anonymous.4open.science/r/CohortGPT-4872/Comment: 16 pages, 10 figure
Combining deep learning with token selection for patient phenotyping from electronic health records.
Artificial intelligence provides the opportunity to reveal important information buried in large amounts of complex data. Electronic health records (eHRs) are a source of such big data that provide a multitude of health related clinical information about patients. However, text data from eHRs, e.g., discharge summary notes, are challenging in their analysis because these notes are free-form texts and the writing formats and styles vary considerably between different records. For this reason, in this paper we study deep learning neural networks in combination with natural language processing to analyze text data from clinical discharge summaries. We provide a detail analysis of patient phenotyping, i.e., the automatic prediction of ten patient disorders, by investigating the influence of network architectures, sample sizes and information content of tokens. Importantly, for patients suffering from Chronic Pain, the disorder that is the most difficult one to classify, we find the largest performance gain for a combined word- and sentence-level input convolutional neural network (ws-CNN). As a general result, we find that the combination of data quality and data quantity of the text data is playing a crucial role for using more complex network architectures that improve significantly beyond a word-level input CNN model. From our investigations of learning curves and token selection mechanisms, we conclude that for such a transition one requires larger sample sizes because the amount of information per sample is quite small and only carried by few tokens and token categories. Interestingly, we found that the token frequency in the eHRs follow a Zipf law and we utilized this behavior to investigate the information content of tokens by defining a token selection mechanism. The latter addresses also issues of explainable AI
Combined optimization of feature selection and algorithm parameters in machine learning of language
Comparative machine learning experiments have become an important methodology in empirical approaches to natural language processing (i) to investigate which machine learning algorithms have the 'right bias' to solve specific natural language processing tasks, and (ii) to investigate which sources of information add to accuracy in a learning approach. Using automatic word sense disambiguation as an example task, we show that with the methodology currently used in comparative machine learning experiments, the results may often not be reliable because of the role of and interaction between feature selection and algorithm parameter optimization. We propose genetic algorithms as a practical approach to achieve both higher accuracy within a single approach, and more reliable comparisons
- …