719 research outputs found
Modelling Instance-Level Annotator Reliability for Natural Language Labelling Tasks
When constructing models that learn from noisy labels produced by multiple
annotators, it is important to accurately estimate the reliability of
annotators. Annotators may provide labels of inconsistent quality due to their
varying expertise and reliability in a domain. Previous studies have mostly
focused on estimating each annotator's overall reliability on the entire
annotation task. However, in practice, the reliability of an annotator may
depend on each specific instance. Only a limited number of studies have
investigated modelling per-instance reliability and these only considered
binary labels. In this paper, we propose an unsupervised model which can handle
both binary and multi-class labels. It can automatically estimate the
per-instance reliability of each annotator and the correct label for each
instance. We specify our model as a probabilistic model which incorporates
neural networks to model the dependency between latent variables and instances.
For evaluation, the proposed method is applied to both synthetic and real data,
including two labelling tasks: text classification and textual entailment.
Experimental results demonstrate our novel method can not only accurately
estimate the reliability of annotators across different instances, but also
achieve superior performance in predicting the correct labels and detecting the
least reliable annotators compared to state-of-the-art baselines.Comment: 9 pages, 1 figures, 10 tables, 2019 Annual Conference of the North
American Chapter of the Association for Computational Linguistics (NAACL2019
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows to leverage weakly labeled data.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas are described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight on how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking and promising avenues for research
A supervised learning framework in the context of multiple annotators
The increasing popularity of crowdsourcing platforms, i.e., Amazon Mechanical Turk, is changing how datasets for supervised learning are built. In these cases, instead of having datasets labeled by one source (which is supposed to be an expert who provided the absolute gold standard), we have datasets labeled by multiple annotators with different and unknown expertise. Hence, we face a multi-labeler scenario, which typical supervised learning models cannot tackle. For such a reason, much attention has recently been given to the approaches that capture multiple annotators’ wisdom. However, such methods residing on two key assumptions: the labeler’s performance does not depend on the input space and independence among the annotators, which are hardly feasible in real-world settings..
Contour Proposal Networks for Biomedical Instance Segmentation
We present a conceptually simple framework for object instance segmentation
called Contour Proposal Network (CPN), which detects possibly overlapping
objects in an image while simultaneously fitting closed object contours using
an interpretable, fixed-sized representation based on Fourier Descriptors. The
CPN can incorporate state of the art object detection architectures as backbone
networks into a single-stage instance segmentation model that can be trained
end-to-end. We construct CPN models with different backbone networks, and apply
them to instance segmentation of cells in datasets from different modalities.
In our experiments, we show CPNs that outperform U-Nets and Mask R-CNNs in
instance segmentation accuracy, and present variants with execution times
suitable for real-time applications. The trained models generalize well across
different domains of cell types. Since the main assumption of the framework are
closed object contours, it is applicable to a wide range of detection problems
also outside the biomedical domain. An implementation of the model architecture
in PyTorch is freely available
Learning with Single View Co-training and Marginalized Dropout
The generalization properties of most existing machine learning techniques are predicated on the assumptions that 1) a sufficiently large quantity of training data is available; 2) the training and testing data come from some common distribution. Although these assumptions are often met in practice, there are also many scenarios in which training data from the relevant distribution is insufficient. We focus on making use of additional data, which is readily available or can be obtained easily but comes from a different distribution than the testing data, to aid learning.
We present five learning scenarios, depending on how the distribution we used to sample the additional training data differs from the testing distribution: 1) learning with weak supervision; 2) domain adaptation; 3) learning from multiple domains; 4) learning from corrupted data; 5) learning with partial supervision.
We introduce two strategies and manifest them in five ways to cope with the difference between the training and testing distribution. The first strategy, which gives rise to Pseudo Multi-view Co-training: PMC) and Co-training for Domain Adaptation: CODA), is inspired by the co-training algorithm for multi-view data. PMC generalizes co-training to the more common single view data and allows us to learn from weakly labeled data retrieved free from the web. CODA integrates PMC with an another feature selection component to address the feature incompatibility between domains for domain adaptation. PMC and CODA are evaluated on a variety of real datasets, and both yield record performance.
The second strategy marginalized dropout leads to marginalized Stacked Denoising Autoencoders: mSDA), Marginalized Corrupted Features: MCF) and FastTag: FastTag). mSDA diminishes the difference between distributions associated with different domains by learning a new representation through marginalized corruption and reconstruciton. MCF learns from a known distribution which is created by corrupting a small set of training data, and improves robustness of learned classifiers by training on ``infinitely\u27\u27 many data sampled from the distribution. FastTag applies marginalized dropout to the output of partially labeled data to recover missing labels for multi-label tasks. These three algorithms not only achieve the state-of-art performance in various tasks, but also deliver orders of magnitude speed up at training and testing comparing to competing algorithms
Learning Algorithms for Fat Quantification and Tumor Characterization
Obesity is one of the most prevalent health conditions. About 30% of the world\u27s and over 70% of the United States\u27 adult populations are either overweight or obese, causing an increased risk for cardiovascular diseases, diabetes, and certain types of cancer. Among all cancers, lung cancer is the leading cause of death, whereas pancreatic cancer has the poorest prognosis among all major cancers. Early diagnosis of these cancers can save lives. This dissertation contributes towards the development of computer-aided diagnosis tools in order to aid clinicians in establishing the quantitative relationship between obesity and cancers. With respect to obesity and metabolism, in the first part of the dissertation, we specifically focus on the segmentation and quantification of white and brown adipose tissue. For cancer diagnosis, we perform analysis on two important cases: lung cancer and Intraductal Papillary Mucinous Neoplasm (IPMN), a precursor to pancreatic cancer. This dissertation proposes an automatic body region detection method trained with only a single example. Then a new fat quantification approach is proposed which is based on geometric and appearance characteristics. For the segmentation of brown fat, a PET-guided CT co-segmentation method is presented. With different variants of Convolutional Neural Networks (CNN), supervised learning strategies are proposed for the automatic diagnosis of lung nodules and IPMN. In order to address the unavailability of a large number of labeled examples required for training, unsupervised learning approaches for cancer diagnosis without explicit labeling are proposed. We evaluate our proposed approaches (both supervised and unsupervised) on two different tumor diagnosis challenges: lung and pancreas with 1018 CT and 171 MRI scans respectively. The proposed segmentation, quantification and diagnosis approaches explore the important adiposity-cancer association and help pave the way towards improved diagnostic decision making in routine clinical practice
- …