Scalable Privacy-Compliant Virality Prediction on Twitter
The digital town hall of Twitter has become a preferred medium of communication
for individuals and organizations across the globe. Some of them reach
audiences of millions, while others struggle to get noticed. Given the impact
of social media, one question remains more relevant than ever: how to model the
dynamics of attention on Twitter. Researchers around the world turn to machine
learning to predict the most influential tweets and authors, navigating the
volume, velocity, and variety of social big data, often with many compromises. In
this paper, we revisit content popularity prediction on Twitter. We argue that
strict alignment of data acquisition, storage and analysis algorithms is
necessary to avoid the common trade-offs between scalability, accuracy and
privacy compliance. We propose a new framework for the rapid acquisition of
large-scale datasets, a high-accuracy supervisory signal, and multilingual
sentiment prediction, while respecting every applicable privacy request. We then
apply a novel gradient boosting framework to achieve state-of-the-art results
in virality ranking, even before including tweets' visual or propagation
features. Our Gradient Boosted Regression Tree is the first to offer
explainable, strong ranking performance on benchmark datasets. Since the
analysis focuses on features available early, the model is immediately
applicable to incoming tweets in 18 languages.
Comment: AffCon@AAAI-19 Best Paper Award; Presented at AAAI-19 W1: Affective Content Analysis
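To make the ranking setup concrete, here is a minimal sketch of training a Gradient Boosted Regression Tree on early tweet features and ranking held-out tweets by predicted engagement. The features, the synthetic target, and the scikit-learn model choice are illustrative assumptions, not the paper's actual dataset, feature set, or framework.

# Minimal sketch: rank tweets by predicted engagement with a GBRT.
# All features and the target below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(0, 5000, n),   # author follower count (illustrative)
    rng.random(n),              # sentiment score in [0, 1] (illustrative)
    rng.integers(0, 24, n),     # hour of posting (illustrative)
])
y = np.log1p(rng.poisson(lam=1 + X[:, 0] / 2000, size=n))  # synthetic engagement

model = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
model.fit(X[:800], y[:800])

# Rank held-out tweets by predicted engagement and check rank correlation.
scores = model.predict(X[800:])
rho, _ = spearmanr(scores, y[800:])
print(f"Spearman rank correlation on held-out tweets: {rho:.3f}")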
Axiomatic Interpretability for Multiclass Additive Models
Generalized additive models (GAMs) are favored in many regression and binary
classification problems because they are able to fit complex, nonlinear
functions while still remaining interpretable. In the first part of this paper,
we generalize a state-of-the-art GAM learning algorithm based on boosted trees
to the multiclass setting, and show that this multiclass algorithm outperforms
existing GAM learning algorithms and sometimes matches the performance of full
complexity models such as gradient boosted trees.
In the second part, we turn our attention to the interpretability of GAMs in
the multiclass setting. Surprisingly, the natural interpretability of GAMs
breaks down when there are more than two classes. Naive interpretation of
multiclass GAMs can lead to false conclusions. Inspired by binary GAMs, we
identify two axioms that any additive model must satisfy in order to not be
visually misleading. We then develop a technique called Additive
Post-Processing for Interpretability (API) that provably transforms a
pre-trained additive model to satisfy the interpretability axioms without
sacrificing accuracy. The technique works not just on models trained with our
learning algorithm, but on any multiclass additive model, including multiclass
linear and logistic regression. We demonstrate the effectiveness of API on a
12-class infant mortality dataset.
Comment: KDD 2019
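The interpretability issue can be illustrated with a small numerical sketch: in a multiclass additive model, adding the same function to every class's shape function leaves the softmax predictions unchanged, so raw shape plots are not identifiable. Centering each feature's shape functions across classes is one transform in that spirit; the paper's API procedure and its axioms are more precise, and the code below is only an assumed illustration.

# Sketch: softmax invariance makes raw multiclass shape functions ambiguous;
# centering across classes changes the plots but not the predictions.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n_classes, n_bins = 3, 5
rng = np.random.default_rng(1)
# shape[k, b]: contribution of feature bin b to the logit of class k
shape = rng.normal(size=(n_classes, n_bins))

x_bins = rng.integers(0, n_bins, size=10)   # binned feature for 10 samples
logits_raw = shape[:, x_bins].T             # (10, n_classes)

# Center across classes at every bin: predictions identical, plots differ.
shape_centered = shape - shape.mean(axis=0, keepdims=True)
logits_centered = shape_centered[:, x_bins].T

assert np.allclose(softmax(logits_raw), softmax(logits_centered))
print("Probabilities unchanged after centering the shape functions.")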
A Confidence-Based Approach for Balancing Fairness and Accuracy
We study three classical machine learning algorithms in the context of
algorithmic fairness: adaptive boosting, support vector machines, and logistic
regression. Our goal is to maintain the high accuracy of these learning
algorithms while reducing the degree to which they discriminate against
individuals because of their membership in a protected group.
Our first contribution is a method for achieving fairness by shifting the
decision boundary for the protected group. The method is based on the theory of
margins for boosting. Our method performs comparably to or outperforms previous
algorithms in the fairness literature in terms of accuracy and low
discrimination, while simultaneously allowing for a fast and transparent
quantification of the trade-off between bias and error.
Our second contribution addresses the shortcomings of the bias-error
trade-off studied in most of the algorithmic fairness literature. We
demonstrate that even hopelessly naive modifications of a biased algorithm,
which cannot be reasonably said to be fair, can still achieve low bias and high
accuracy. To help distinguish between these naive algorithms and more
sensible ones, we propose a new measure of fairness, called resilience to
random bias (RRB). We demonstrate that RRB distinguishes well between our naive
and sensible fairness algorithms. RRB, together with bias and accuracy, provides
a more complete picture of the fairness of an algorithm.
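A minimal sketch of the first contribution's flavor: shift the decision boundary for the protected group until positive prediction rates roughly match. The synthetic data, the logistic regression base learner, and the grid search over shifts are placeholder assumptions; the paper derives the shift from the theory of margins for boosting.

# Sketch: group-dependent threshold shift to equalize positive rates.
# Data, model, and the grid search are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def positive_rate(scores, threshold):
    return float(np.mean(scores >= threshold))

rng = np.random.default_rng(2)
n = 2000
group = rng.integers(0, 2, n)   # 1 = protected group (synthetic)
X = np.column_stack([rng.normal(size=n), group + rng.normal(scale=0.5, size=n)])
y = (X[:, 0] - 0.5 * group + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]

# Sweep the boundary shift for the protected group until positive rates align.
best_shift = min(
    np.linspace(-0.3, 0.3, 61),
    key=lambda s: abs(
        positive_rate(scores[group == 1], 0.5 - s)
        - positive_rate(scores[group == 0], 0.5)
    ),
)
print(f"Chosen boundary shift for the protected group: {best_shift:+.3f}")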
Multimodal Machine Learning for Automated ICD Coding
This study presents a multimodal machine learning model to predict ICD-10
diagnostic codes. We developed separate machine learning models that can handle
data from different modalities, including unstructured text, semi-structured
text and structured tabular data. We further employed an ensemble method to
integrate all modality-specific models to generate ICD-10 codes. Key evidence
was also extracted to make our prediction more convincing and explainable. We
used the Medical Information Mart for Intensive Care III (MIMIC-III) dataset
to validate our approach. For ICD code prediction, our best-performing model
(micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms other
baseline models including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and
Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability,
our approach achieves a Jaccard Similarity Coefficient (JSC) of 0.1806 on text
data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780
and 0.5002, respectively.
Comment: Machine Learning for Healthcare 2019
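As an illustration of the ensembling step, the sketch below fits separate models on text and tabular inputs and averages their per-code probabilities. The toy notes, labels, and scikit-learn models are placeholder assumptions, not the paper's architecture or the MIMIC-III data.

# Sketch: combine modality-specific models by averaging per-code probabilities.
# Notes, tabular features, and the two toy "codes" are placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier

notes = ["chest pain and dyspnea", "fever with productive cough",
         "chronic kidney disease follow-up", "acute chest pain at rest"]
tabular = np.array([[67, 1], [54, 0], [71, 1], [59, 1]])   # e.g. age, abnormal lab flag
labels = np.array([[1, 0], [0, 1], [0, 0], [1, 0]])        # multi-label code indicators

# Modality-specific models: unstructured text vs. structured tabular data.
vec = TfidfVectorizer()
X_text = vec.fit_transform(notes)
text_model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_text, labels)
tab_model = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0)).fit(tabular, labels)

# Ensemble: average the per-code probabilities from each modality-specific model.
p_ens = (text_model.predict_proba(X_text) + tab_model.predict_proba(tabular)) / 2
print((p_ens >= 0.5).astype(int))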
Boosting insights in insurance tariff plans with tree-based machine learning methods
Pricing actuaries typically operate within the framework of generalized
linear models (GLMs). With the upswing of data analytics, our study puts focus
on machine learning methods to develop full tariff plans built from both the
frequency and severity of claims. We adapt the loss functions used in the
algorithms such that the specific characteristics of insurance data are
carefully incorporated: highly unbalanced count data with excess zeros and
varying exposure on the frequency side combined with scarce, but potentially
long-tailed data on the severity side. A key requirement is the need for
transparent and interpretable pricing models which are easily explainable to
all stakeholders. We therefore focus on machine learning with decision trees:
starting from simple regression trees, we work towards more advanced ensembles
such as random forests and boosted trees. We show how to choose the optimal
tuning parameters for these models in an elaborate cross-validation scheme,
present visualization tools to obtain insights from the resulting models, and
evaluate the economic value of these new modeling approaches. Boosted trees
outperform the classical GLMs, allowing the insurer to form profitable
portfolios and to guard against potential adverse risk selection.
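To illustrate one of the loss adaptations for frequency modelling, the sketch below fits a gradient boosted model with a Poisson loss while handling varying exposure by modelling the claim rate with exposure as sample weight. This is a common actuarial device, assumed here for illustration; the paper's loss adaptations, severity modelling, and cross-validation scheme are more elaborate, and the data below are synthetic.

# Sketch: claim-frequency model with Poisson loss and varying exposure.
# Features, exposures, and claim counts are synthetic placeholders.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(3)
n = 5000
X = np.column_stack([rng.integers(18, 80, n), rng.integers(0, 3, n)])  # age, region (toy)
exposure = rng.uniform(0.1, 1.0, n)                                    # policy years
lam = np.exp(-2.0 + 0.01 * X[:, 0]) * exposure                         # expected claim counts
claims = rng.poisson(lam)

# Model the claim rate (claims / exposure) with exposure as sample weight,
# a standard way to respect varying exposure under a Poisson deviance loss.
model = HistGradientBoostingRegressor(loss="poisson", max_iter=200, max_depth=3)
model.fit(X, claims / exposure, sample_weight=exposure)

print("Predicted claim frequency for a 40-year-old in region 1:",
      float(model.predict([[40, 1]])[0]))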
Induction of Non-Monotonic Logic Programs to Explain Boosted Tree Models Using LIME
We present a heuristic-based algorithm to induce non-monotonic logic
programs that will explain the behavior of XGBoost trained classifiers. We use
the technique based on the LIME approach to locally select the most important
features contributing to the classification decision. Then, in order to explain
the model's global behavior, we propose LIME-FOLD, a heuristic-based
inductive logic programming (ILP) algorithm capable of learning non-monotonic
logic programs, which we apply to a transformed dataset produced
by LIME. Our proposed approach is agnostic to the choice of the ILP algorithm.
Our experiments with UCI standard benchmarks suggest a significant improvement
in terms of classification evaluation metrics. Meanwhile, the number of induced
rules dramatically decreases compared to ALEPH, a state-of-the-art ILP system.
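A rough sketch of the LIME selection step is given below: for each instance, LIME picks the features that locally drive the XGBoost prediction, yielding the transformed dataset on which a rule learner such as FOLD could then induce non-monotonic programs (the ILP step itself is not shown). The synthetic data and model settings are assumptions for illustration, and the sketch requires the xgboost and lime packages.

# Sketch: per-instance selection of locally important features via LIME.
import numpy as np
from xgboost import XGBClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
feature_names = ["f0", "f1", "f2", "f3"]

clf = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)
explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 class_names=["neg", "pos"], mode="classification")

# Keep only the locally most important features for each instance; the
# resulting feature subsets form the transformed dataset fed to the rule learner.
for row in X[:3]:
    exp = explainer.explain_instance(row, clf.predict_proba, num_features=2)
    top = [name for name, weight in exp.as_list()]
    print("Locally important features:", top)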
The Grammar of Interactive Explanatory Model Analysis
The growing need for in-depth analysis of predictive models leads to a series
of new methods for explaining their local and global properties. Which of these
methods is the best? It turns out that this is an ill-posed question. One
cannot sufficiently explain a black-box machine learning model using a single
method that gives only one perspective. Isolated explanations are prone to
misunderstanding, which inevitably leads to wrong or simplistic reasoning. This
problem is known as the Rashomon effect and refers to diverse, even
contradictory interpretations of the same phenomenon. Surprisingly, the
majority of methods developed for explainable machine learning focus on a
single aspect of the model behavior. In contrast, we showcase the problem of
explainability as an interactive and sequential analysis of a model. This paper
presents how different Explanatory Model Analysis (EMA) methods complement each
other and why it is essential to juxtapose them together. The introduced
process of Interactive EMA (IEMA) derives from the algorithmic side of
explainable machine learning and aims to embrace ideas developed in cognitive
sciences. We formalize the grammar of IEMA to describe potential human-model
dialogues. IEMA is implemented in a human-centered framework that adopts
interactivity, customizability and automation as its main traits. Combined,
these methods enhance the responsible approach to predictive modeling.
Comment: 17 pages, 10 figures, 3 tables
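As a sketch of juxtaposing complementary explanation methods, the snippet below combines a global importance view, a local attribution view, and a profile view for one model. It uses the dalex Python package as an assumed toolbox with a placeholder model and dataset; whether this matches the paper's own human-centered implementation is not claimed here.

# Sketch: one model, several complementary explanatory views.
import dalex as dx
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = dx.Explainer(model, X, y, label="rf")

# Global view: which variables matter overall (permutation importance).
print(explainer.model_parts().result.head())

# Local view for one observation: how variables add up to this prediction.
print(explainer.predict_parts(X.iloc[[0]]).result.head())

# Profile view: how the prediction responds when one variable is varied.
print(explainer.model_profile(variables=["mean radius"]).result.head())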
