Convex Formulation of Multiple Instance Learning from Positive and Unlabeled Bags
Multiple instance learning (MIL) is a variation of traditional supervised
learning problems where data (referred to as bags) are composed of sub-elements
(referred to as instances) and only bag labels are available. MIL has a variety
of applications such as content-based image retrieval, text categorization and
medical diagnosis. Most previous work on MIL assumes that the training bags are fully labeled. However, it is often difficult to obtain enough labeled bags in practical situations, while many unlabeled bags are available. A learning framework called PU learning (positive and unlabeled learning) can address this problem. In this paper, we propose a convex PU
learning method to solve an MIL problem. We experimentally show that the
proposed method achieves better performance with significantly lower
computational costs than an existing method for PU-MIL.
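The PU-learning setting the paper builds on admits a compact illustration. The sketch below is not the paper's convex PU-MIL formulation; it is the standard unbiased PU risk estimator for a linear scorer with the logistic loss, a common convex starting point. Here `prior` is the (assumed known) class prior of the positives, and all names are illustrative:

```python
import numpy as np

def pu_risk(w, X_pos, X_unl, prior):
    """Unbiased PU risk estimate for a linear scorer g(x) = w . x:

        R(g) = prior * E_P[l(g(x))] + E_U[l(-g(x))] - prior * E_P[l(-g(x))]

    The unlabeled term stands in for the negative-class risk; the last
    term removes the positive mass hiding inside the unlabeled data.
    Note the estimate can go negative on finite samples, which is what
    motivates non-negative variants of this estimator.
    """
    loss = lambda z: np.log1p(np.exp(-z))  # logistic loss
    g_pos = X_pos @ w
    g_unl = X_unl @ w
    return (prior * loss(g_pos).mean()
            + loss(-g_unl).mean()
            - prior * loss(-g_pos).mean())
```

With the zero scorer every loss term equals log 2, so the estimate reduces to log 2 regardless of the prior, a quick sanity check on the algebra.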
A Robust AUC Maximization Framework with Simultaneous Outlier Detection and Feature Selection for Positive-Unlabeled Classification
Positive-unlabeled (PU) classification is a common scenario in real-world applications such as healthcare, text classification, and bioinformatics, in which we only observe a few samples labeled as "positive" together with a large volume of "unlabeled" samples that may contain both positive and negative samples. Building a robust classifier for the PU problem is very challenging, especially for complex data where the negative samples overwhelm the positives and mislabeled samples or corrupted features exist. To address these three issues, we propose
a robust learning framework that unifies AUC maximization (a robust metric for
biased labels), outlier detection (for excluding wrong labels), and feature
selection (for excluding corrupted features). The generalization error bounds
are provided for the proposed model that give valuable insight into the
theoretical performance of the method and lead to useful practical guidance,
e.g., we find that the unlabeled samples included in training are sufficient as long as their number is comparable to the number of positive samples. Empirical comparisons and two real-world
applications on surgical site infection (SSI) and EEG seizure detection are
also conducted to show the effectiveness of the proposed model.
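The AUC-maximization component can be made concrete. This is a minimal generic sketch, not the paper's unified framework: the empirical AUC as a pairwise ranking statistic, and the pairwise hinge loss as its usual convex surrogate for training:

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def pairwise_hinge(scores_pos, scores_neg, margin=1.0):
    """Convex surrogate for 1 - AUC: a hinge on every positive-negative pair.
    Minimizing it pushes every positive score above every negative score
    by at least `margin`, which is why AUC is robust to label imbalance."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.maximum(0.0, margin - diff).mean()
```

Because both quantities depend only on score differences across the two classes, they are unaffected by how many negatives dominate the sample, the robustness property the abstract appeals to.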
Understanding and Monitoring Human Trafficking via Social Sensors: A Sociological Approach
Human trafficking is a serious social problem, and it is challenging mainly because of the difficulty of collecting and organizing related information. The increasing popularity of social media platforms provides a novel channel to tackle human trafficking by detecting and analyzing a large amount of trafficking-related information. Existing
supervised learning methods cannot be directly applied to this problem due to
the distinct characteristics of the social media data. First, the short, noisy,
and unstructured textual information makes traditional learning algorithms less
effective in detecting human trafficking related tweets. Second, complex social
interactions lead to a high-dimensional feature space and thus present great
computational challenges. Meanwhile, social science theories such as
homophily have been well established and achieved success in various social
media mining applications. Motivated by the sociological findings, in this
paper, we propose to investigate whether the Network Structure Information
(NSI) could be potentially helpful for the human trafficking problem. In
particular, a novel mathematical optimization framework is proposed to
integrate the network structure into content modeling. Experimental results on
a real-world dataset demonstrate the effectiveness of our proposed framework in
detecting human trafficking related information.
Comment: 8 pages, 3 figures
Combination of multiple Deep Learning architectures for Offensive Language Detection in Tweets
This report contains the details regarding our submission to the OffensEval
2019 (SemEval 2019 - Task 6). The competition was based on the Offensive
Language Identification Dataset. We first discuss the details of the classifier
implemented and the type of input data used and pre-processing performed. We
then move on to critically evaluating our performance. We achieved macro-average F1-scores of 0.76, 0.68, and 0.54 for Task A, Task B, and Task C, respectively, which we believe reflects the level of sophistication of the models implemented. Finally, we discuss the difficulties encountered and possible improvements for the future.
Robust sketching for multiple square-root LASSO problems
Many learning tasks, such as cross-validation, parameter search, or
leave-one-out analysis, involve multiple instances of similar problems, each
instance sharing a large part of learning data with the others. We introduce a
robust framework for solving multiple square-root LASSO problems, based on a
sketch of the learning data that uses low-rank approximations. Our approach
allows a dramatic reduction in computational effort, in effect reducing the
number of observations from n (the number of observations to start with) to k (the number of singular values retained in the low-rank model), while not
sacrificing---sometimes even improving---the statistical performance.
Theoretical analysis, as well as numerical experiments on both synthetic and
real data, illustrate the efficiency of the method in large-scale applications.
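The row-reduction idea can be sketched directly. Assuming a design matrix of low effective rank, a truncated SVD yields a k-row problem whose squared residual differs from the full n-row residual only by a constant, so repeated fits (cross-validation, parameter search) can run on the sketch. The function below is an illustrative sketch, not the paper's solver:

```python
import numpy as np

def sketch(X, y, k):
    """Replace the n-row problem (X, y) with a k-row sketch via truncated SVD.

    With X ~= U_k S_k V_k^T, the residual splits orthogonally as
        ||y - X w||^2 = ||y_sk - X_sk w||^2 + c,
    where X_sk = S_k V_k^T (k x d), y_sk = U_k^T y, and c is the energy
    of y outside the rank-k column space. The constant c does not depend
    on w, so minimizers over the sketch match minimizers over the data.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_sk = s[:k, None] * Vt[:k]      # k x d sketched design
    y_sk = U[:, :k].T @ y            # k-dimensional sketched response
    resid = y - U[:, :k] @ y_sk      # component of y outside the rank-k range
    return X_sk, y_sk, float(resid @ resid)
```

The split is exact when X truly has rank k; for approximately low-rank data it holds up to the discarded singular values, which is the trade-off the robust framework accounts for.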
Fair Classification and Social Welfare
Now that machine learning algorithms lie at the center of many resource
allocation pipelines, computer scientists have been unwittingly cast as partial
social planners. Given this state of affairs, important questions follow. What
is the relationship between fairness as defined by computer scientists and
notions of social welfare? In this paper, we present a welfare-based analysis
of classification and fairness regimes. We translate a loss minimization
program into a social welfare maximization problem with a set of implied
welfare weights on individuals and groups--weights that can be analyzed from a
distributive justice lens. In the converse direction, we ask what the space of
possible labelings is for a given dataset and hypothesis class. We provide an
algorithm that answers this question with respect to linear hyperplanes and analyze its running time. Our main findings on the relationship
between fairness criteria and welfare center on sensitivity analyses of
fairness-constrained empirical risk minimization programs. We characterize the
ranges of perturbations to a fairness parameter
that yield better, worse, and neutral outcomes in utility for individuals and
by extension, groups. We show that applying stricter fairness criteria codified as parity constraints can worsen welfare outcomes for both
groups. More generally, always preferring "more fair" classifiers does not
abide by the Pareto Principle---a fundamental axiom of social choice theory and
welfare economics. Recent work in machine learning has rallied around these
notions of fairness as critical to ensuring that algorithmic systems do not
have disparate negative impact on disadvantaged social groups. By showing that
these constraints often fail to translate into improved outcomes for these
groups, we cast doubt on their effectiveness as a means to ensure justice.
Comment: 23 pages, 2 figures
Learning with Inadequate and Incorrect Supervision
In practice, we often face the dilemma that the labeled data at hand are inadequate to train a reliable classifier and, more seriously, some of these labeled data may be mistakenly labeled due to various human factors.
Therefore, this paper proposes a novel semi-supervised learning paradigm that
can handle both label insufficiency and label inaccuracy. To address label
insufficiency, we use a graph to bridge the data points so that the label
information can be propagated from the scarce labeled examples to unlabeled
examples along the graph edges. To address label inaccuracy, Graph Trend
Filtering (GTF) and Smooth Eigenbase Pursuit (SEP) are adopted to filter out
the initial noisy labels. GTF penalizes the l_0 norm of label difference
between connected examples in the graph and exhibits better local adaptivity
than the traditional l_2 norm-based Laplacian smoother. SEP reconstructs the
correct labels by emphasizing the leading eigenvectors of the Laplacian matrix associated with small eigenvalues, as these eigenvectors reflect real label smoothness and carry rich class separation cues. We term our algorithm `Semi-supervised learning under Inadequate and Incorrect Supervision' (SIIS).
Thorough experimental results on image classification, text categorization, and speech recognition demonstrate that our SIIS is effective in label error correction, leading to performance superior to state-of-the-art methods in the presence of label noise and label scarcity.
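The label-insufficiency half of such a pipeline, propagating scarce labels along graph edges, can be sketched with the classic normalized-Laplacian iteration F ← αSF + (1−α)Y. The GTF and SEP filtering steps of the paper are not included; this is only the generic propagation backbone:

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.9, iters=100):
    """Graph label propagation.

    W : (n, n) symmetric affinity matrix of the data graph.
    Y : (n, c) initial labels, one-hot rows for labeled points, zero rows
        for unlabeled points.
    Iterates F <- alpha * S F + (1 - alpha) * Y with the symmetrically
    normalized affinity S = D^{-1/2} W D^{-1/2}, spreading label mass
    from the few labeled examples to their graph neighbors.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)
```

On a graph made of two disconnected triangles with one labeled node per triangle, the iteration labels every node with its component's class, the behavior the l_2 Laplacian smoother formalizes.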
Triangle Generative Adversarial Networks
A Triangle Generative Adversarial Network (Δ-GAN) is developed for
semi-supervised cross-domain joint distribution matching, where the training
data consists of samples from each domain, and supervision of domain
correspondence is provided by only a few paired samples. Δ-GAN consists
of four neural networks, two generators and two discriminators. The generators
are designed to learn the two-way conditional distributions between the two
domains, while the discriminators implicitly define a ternary discriminative
function, which is trained to distinguish real data pairs and two kinds of fake
data pairs. The generators and discriminators are trained together using
adversarial learning. Under mild assumptions, in theory the joint distributions
characterized by the two generators concentrate to the data distribution. In
experiments, three different kinds of domain pairs are considered, image-label,
image-image and image-attribute pairs. Experiments on semi-supervised image
classification, image-to-image translation and attribute-based image generation
demonstrate the superiority of the proposed approach.
Comment: To appear in NIPS 2017
On Quantifying Qualitative Geospatial Data: A Probabilistic Approach
Living in the era of data deluge, we have witnessed a web content explosion,
largely due to the massive availability of User-Generated Content (UGC). In
this work, we specifically consider the problem of geospatial information
extraction and representation, where one can exploit diverse sources of
information (such as image, audio, and text data), going beyond
traditional volunteered geographic information. Our ambition is to include
available narrative information in an effort to better explain geospatial
relationships: with spatial reasoning being a basic form of human cognition,
narratives expressing such experiences typically contain qualitative spatial
data, i.e., spatial objects and spatial relationships.
To this end, we formulate a quantitative approach for the representation of
qualitative spatial relations extracted from UGC in the form of texts. The
proposed method quantifies such relations based on multiple text observations.
Such observations provide distance and orientation features which are utilized
by a greedy Expectation Maximization-based (EM) algorithm to infer a
probability distribution over predefined spatial relationships; the latter
represent the quantified relationships under user-defined probabilistic
assumptions. We evaluate the applicability and quality of the proposed approach
using real UGC data originating from an actual travel blog text corpus. To
verify the quality of the result, we generate grid-based maps visualizing the
spatial extent of the various relations.
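The paper's greedy EM variant over distance and orientation features is not spelled out in the abstract, but the EM mechanics it builds on can be illustrated with a plain two-component 1-D Gaussian mixture, where the responsibilities play the role of a probability distribution over candidate "relations". Everything here is a generic sketch, not the proposed algorithm:

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture.

    E-step: compute responsibilities p(component | x_i), the inferred
    probability distribution over which component generated each point.
    M-step: re-estimate means, variances, and mixing weights from the
    responsibility-weighted data.
    """
    mu = np.array([x.min(), x.max()])           # crude initialization
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        dens = (pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))
        resp = dens / dens.sum(axis=1, keepdims=True)   # E-step
        nk = resp.sum(axis=0)                           # M-step
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / len(x)
    return mu, var, pi, resp
```

Fed with, say, distance observations clustered around two modes, the fit recovers the modes and assigns each observation a soft membership, the same shape of output the quantified spatial relations take.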
A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU Computing
The past years have witnessed many dedicated open-source projects that built
and maintain implementations of Support Vector Machines (SVM), parallelized for
GPU, multi-core CPUs and distributed systems. Up to this point, no comparable
effort has been made to parallelize the Elastic Net, despite its popularity in
many high impact applications, including genetics, neuroscience and systems
biology. The first contribution in this paper is of theoretical nature. We
establish a tight link between two seemingly different algorithms and prove
that Elastic Net regression can be reduced to SVM with squared hinge loss
classification. Our second contribution is to derive a practical algorithm
based on this reduction. The reduction enables us to utilize prior efforts in
speeding up and parallelizing SVMs to obtain a highly optimized and parallel
solver for the Elastic Net and Lasso. With a simple wrapper, consisting of only
11 lines of MATLAB code, we obtain an Elastic Net implementation that naturally
utilizes GPU and multi-core CPUs. We demonstrate on twelve real-world data sets that our algorithm yields results identical to the popular (and highly optimized) glmnet implementation but is one or several orders of magnitude faster.
Comment: 10 pages
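The paper's SVM reduction is not reproduced here, but a much older identity in the same spirit, the classical augmented-data trick, shows how the Elastic Net can be recast as another, simpler problem (a plain Lasso) by rewriting its quadratic penalty as extra rows of data:

```python
import numpy as np

def augment(X, y, lam2):
    """Augmented-data reduction of the Elastic Net to the Lasso.

    The Elastic Net  min_w ||y - X w||^2 + lam2 ||w||^2 + lam1 ||w||_1
    equals a plain Lasso on (X_aug, y_aug), because appending the rows
    sqrt(lam2) * I with zero targets makes
        ||y_aug - X_aug w||^2 = ||y - X w||^2 + lam2 ||w||^2.
    """
    d = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    return X_aug, y_aug
```

Any Lasso solver applied to the augmented pair therefore solves the Elastic Net exactly; the paper's contribution is an analogous reduction targeting squared-hinge SVM solvers instead, which unlocks existing GPU and multi-core implementations.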