Mining the Demographics of Political Sentiment from Twitter Using Learning from Label Proportions
Opinion mining and demographic attribute inference have many applications in
social science. In this paper, we propose models to infer daily joint
probabilities of multiple latent attributes from Twitter data, such as
political sentiment and demographic attributes. Since it is costly and
time-consuming to annotate data for traditional supervised classification, we
instead propose scalable Learning from Label Proportions (LLP) models for
demographic and opinion inference using U.S. Census, national and state
political polls, and Cook partisan voting index as population level data. In
LLP classification settings, the training data is divided into a set of
unlabeled bags, where only the label distribution of each bag is known,
removing the requirement of instance-level annotations. Our proposed LLP model,
Weighted Label Regularization (WLR), provides a scalable generalization of
prior work on label regularization to support weights for samples inside bags,
which is applicable in this setting where bags are arranged hierarchically
(e.g., county-level bags are nested inside of state-level bags). We apply our
model to Twitter data collected in the year leading up to the 2016 U.S.
presidential election, producing estimates of the relationships among political
sentiment and demographics over time and place. We find that our approach
closely tracks traditional polling data stratified by demographic category,
resulting in error reductions of 28-44% over baseline approaches. We also
provide descriptive evaluations showing how the model may be used to estimate
interactions among many variables and to identify linguistic temporal
variation, capabilities which are typically not feasible using traditional
polling methods.
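
The LLP setting described above (bags with known label proportions, plus per-sample weights inside each bag) can be illustrated with a minimal sketch. This is not the authors' WLR implementation: the function name `bag_label_regularization_loss`, the use of a KL divergence between the known proportions and the weighted mean prediction, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def bag_label_regularization_loss(logits, weights, target_proportions, eps=1e-12):
    """Hypothetical per-bag loss: KL(target proportions || weighted mean prediction).

    logits:             (n_instances, n_classes) model scores for one bag
    weights:            (n_instances,) non-negative sample weights (assumed given)
    target_proportions: (n_classes,) known label distribution of the bag
    """
    probs = softmax(np.asarray(logits, dtype=float))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize weights within the bag
    predicted = w @ probs                # weighted average of instance posteriors
    target = np.asarray(target_proportions, dtype=float)
    return float(np.sum(target * (np.log(target + eps) - np.log(predicted + eps))))

# Toy bag: two confidently classified instances, the first weighted 3x.
logits = np.array([[10.0, -10.0], [-10.0, 10.0]])
weights = np.array([3.0, 1.0])
loss_match = bag_label_regularization_loss(logits, weights, [0.75, 0.25])
loss_mismatch = bag_label_regularization_loss(logits, weights, [0.25, 0.75])
```

No instance-level labels appear anywhere in the loss; only the bag-level proportions do. With these toy inputs the loss is close to zero when the weighted predictions match the bag proportions and grows when they do not.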
Bayesian Semi-supervised Learning with Graph Gaussian Processes
We propose a data-efficient Gaussian process-based Bayesian approach to the
semi-supervised learning problem on graphs. The proposed model shows extremely
competitive performance when compared to the state-of-the-art graph neural
networks on semi-supervised learning benchmark experiments, and outperforms the
neural networks in active learning experiments where labels are scarce.
Furthermore, the model does not require a validation data set for early
stopping to control over-fitting. Our model can be viewed as an instance of
empirical distribution regression weighted locally by network connectivity. We
further motivate the intuitive construction of the model with a Bayesian linear
model interpretation where the node features are filtered by an operator
related to the graph Laplacian. The method can be easily implemented by
adapting off-the-shelf scalable variational inference algorithms for Gaussian
processes.
Comment: To appear in NIPS 2018. Fixed an error in Figure 2. The previous arXiv
version contains two identical sub-figures
A Survey on Metric Learning for Feature Vectors and Structured Data
The need for appropriate ways to measure the distance or similarity between
data is ubiquitous in machine learning, pattern recognition and data mining,
but handcrafting such good metrics for specific problems is generally
difficult. This has led to the emergence of metric learning, which aims at
automatically learning a metric from data and has attracted a lot of interest
in machine learning and related fields for the past ten years. This survey
paper proposes a systematic review of the metric learning literature,
highlighting the pros and cons of each approach. We pay particular attention to
Mahalanobis distance metric learning, a well-studied and successful framework,
but additionally present a wide range of methods that have recently emerged as
powerful alternatives, including nonlinear metric learning, similarity learning
and local metric learning. Recent trends and extensions, such as
semi-supervised metric learning, metric learning for histogram data and the
derivation of generalization guarantees, are also covered. Finally, this survey
addresses metric learning for structured data, in particular edit distance
learning, and attempts to give an overview of the remaining challenges in
metric learning for the years to come.
Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved
presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new
method
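
For readers unfamiliar with the Mahalanobis framework mentioned in this abstract, a minimal sketch follows. It uses the standard form d_M(x, y) = sqrt((x - y)^T M (x - y)) with a positive semidefinite matrix M; the matrix `L`, the points `x` and `y`, and the function name are illustrative assumptions, and nothing here is learned from data as a metric learning algorithm would do.

```python
import numpy as np

def mahalanobis_distance(x, y, M):
    # Generalized Mahalanobis distance; M is assumed positive semidefinite.
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ M @ diff))

# Any PSD matrix factors as M = L^T L, so the distance equals the
# Euclidean distance after the linear map x -> L x.
L = np.array([[2.0, 0.0],
              [0.0, 1.0]])   # hypothetical transform: stretch the first feature
M = L.T @ L

x = np.array([1.0, 0.0])
y = np.array([0.0, 0.0])

d = mahalanobis_distance(x, y, M)
d_euclid_mapped = float(np.linalg.norm(L @ (x - y)))
```

Here `d` and `d_euclid_mapped` agree, which is why learning M is often described as learning a linear projection of the data followed by ordinary Euclidean distance.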
Semi-supervised Learning based on Distributionally Robust Optimization
We propose a novel method for semi-supervised learning (SSL) based on
data-driven distributionally robust optimization (DRO) using optimal transport
metrics. Our proposed method improves generalization error by using the
unlabeled data to restrict the support of the worst-case distribution in our
DRO formulation. We make the DRO formulation practical by proposing a
stochastic gradient descent algorithm that makes the training procedure easy
to implement. We demonstrate that our Semi-supervised DRO
method is able to improve the generalization error over natural supervised
procedures and state-of-the-art SSL estimators. Finally, we include a
discussion on the large sample behavior of the optimal uncertainty region in
the DRO formulation. Our discussion exposes important aspects such as the role
of dimension reduction in SSL.