Tracking the History and Evolution of Entities: Entity-centric Temporal Analysis of Large Social Media Archives
How did the popularity of the Greek Prime Minister evolve in 2015? How did
the predominant sentiment about him vary during that period? Were there any
controversial sub-periods? What other entities were related to him during these
periods? To answer these questions, one needs to analyze archived documents and
data about the query entities, such as old news articles or social media
archives. In particular, user-generated content posted in social networks, like
Twitter and Facebook, can be seen as a comprehensive documentation of our
society, and thus meaningful analysis methods over such archived data are of
immense value for sociologists, historians and other interested parties who
want to study the history and evolution of entities and events. To this end, in
this paper we propose an entity-centric approach to analyze social media
archives and we define measures that allow studying how entities were reflected
in social media in different time periods and under different aspects, like
popularity, attitude, controversiality, and connectedness with other entities.
A case study using a large Twitter archive of four years illustrates the
insights that can be gained by such an entity-centric and multi-aspect
analysis.
Comment: This is a preprint of an article accepted for publication in the International Journal on Digital Libraries (2018).
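To make the entity-centric measures above concrete, the following Python sketch computes two of them, per-period popularity (the share of a period's tweets that mention the entity) and attitude (the mean sentiment of those tweets), over a toy archive. It is a minimal illustration, not the paper's implementation; the record fields, the monthly bucketing, and the example sentiment scores are assumptions.

from collections import defaultdict
from datetime import datetime

# Toy archive records; real archives carry many more fields (assumed schema).
tweets = [
    {"timestamp": "2015-01-12", "entities": {"Alexis Tsipras"}, "sentiment": 0.6},
    {"timestamp": "2015-01-20", "entities": {"Alexis Tsipras", "Syriza"}, "sentiment": 0.2},
    {"timestamp": "2015-07-05", "entities": {"Alexis Tsipras"}, "sentiment": -0.4},
    {"timestamp": "2015-07-09", "entities": {"Greece"}, "sentiment": -0.1},
]

def period_of(tweet):
    """Bucket a tweet into a monthly period such as '2015-07'."""
    return datetime.strptime(tweet["timestamp"], "%Y-%m-%d").strftime("%Y-%m")

def entity_measures(tweets, entity):
    """Return {period: (popularity, attitude)} for one entity."""
    totals = defaultdict(int)      # number of tweets per period
    mentions = defaultdict(list)   # sentiments of tweets mentioning the entity, per period
    for t in tweets:
        p = period_of(t)
        totals[p] += 1
        if entity in t["entities"]:
            mentions[p].append(t["sentiment"])
    return {
        p: (len(sents) / totals[p], sum(sents) / len(sents))
        for p, sents in mentions.items()
    }

print(entity_measures(tweets, "Alexis Tsipras"))
# {'2015-01': (1.0, 0.4), '2015-07': (0.5, -0.4)}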
RHALE: Robust and Heterogeneity-aware Accumulated Local Effects
Accumulated Local Effects (ALE) is a widely used explainability method for
isolating the average effect of a feature on the output, because it handles
cases with correlated features well. However, it has two limitations. First, it
does not quantify the deviation of instance-level (local) effects from the
average (global) effect, known as heterogeneity. Second, for estimating the
average effect, it partitions the feature domain into user-defined, fixed-size bins, where different bin sizes may lead to inconsistent ALE estimates. To
address these limitations, we propose Robust and Heterogeneity-aware ALE
(RHALE). RHALE quantifies the heterogeneity by considering the standard
deviation of the local effects and automatically determines an optimal
variable-size bin-splitting. In this paper, we prove that to achieve an
unbiased approximation of the standard deviation of local effects within each
bin, bin splitting must follow a set of sufficient conditions. Based on these
conditions, we propose an algorithm that automatically determines the optimal
partitioning, balancing the estimation bias and variance. Through evaluations
on synthetic and real datasets, we demonstrate the superiority of RHALE
compared to other methods, including the advantages of automatic bin splitting,
especially in cases with correlated features.
Comment: Accepted at ECAI 2023 (European Conference on Artificial Intelligence).
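The numpy sketch below illustrates the quantities the abstract refers to; it is a toy under assumptions, not the RHALE implementation, and it keeps fixed-size bins rather than the automatic variable-size splitting that RHALE derives. For one feature it computes the instance-level (local) effects in each bin via finite differences, their per-bin mean (the usual ALE estimate), and their per-bin standard deviation (the heterogeneity RHALE reports).

import numpy as np

def local_effects_per_bin(predict, X, feature, n_bins=10):
    """Return bin edges, per-bin mean (ALE slope) and std (heterogeneity) of local effects."""
    x = X[:, feature]
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    means, stds = np.zeros(n_bins), np.zeros(n_bins)
    for b in range(n_bins):
        lo, hi = edges[b], edges[b + 1]
        upper = (x <= hi) if b == n_bins - 1 else (x < hi)
        in_bin = (x >= lo) & upper
        if not in_bin.any():
            continue
        X_lo, X_hi = X[in_bin].copy(), X[in_bin].copy()
        X_lo[:, feature], X_hi[:, feature] = lo, hi
        # local effect of each instance: finite difference of the prediction across the bin
        effects = (predict(X_hi) - predict(X_lo)) / (hi - lo)
        means[b], stds[b] = effects.mean(), effects.std()
    return edges, means, stds

# Toy model with a feature interaction, so the local effects of x0 vary across instances.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
f = lambda Z: 3 * Z[:, 0] + 2 * Z[:, 0] * Z[:, 1]   # effect of x0 is 3 + 2*x1

edges, mean_eff, std_eff = local_effects_per_bin(f, X, feature=0)
print(mean_eff.round(2))   # close to 3 in every bin: the average effect ALE reports
print(std_eff.round(2))    # close to 2/sqrt(3) ~ 1.15: the heterogeneity a plain ALE plot hides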
AdaCC: Cumulative Cost-Sensitive Boosting for Imbalanced Classification
Class imbalance poses a major challenge for machine learning as most
supervised learning models might exhibit bias towards the majority class and
underperform on the minority class. Cost-sensitive learning tackles this
problem by treating the classes differently, formulated typically via a
user-defined fixed misclassification cost matrix provided as input to the
learner. Such parameter tuning is a challenging task that requires domain knowledge; moreover, wrong adjustments might degrade overall predictive performance. In this work, we propose a novel cost-sensitive
boosting approach for imbalanced data that dynamically adjusts the
misclassification costs over the boosting rounds in response to the model's
performance instead of using a fixed misclassification cost matrix. Our method,
called AdaCC, is parameter-free as it relies on the cumulative behavior of the
boosting model in order to adjust the misclassification costs for the next
boosting round and comes with theoretical guarantees regarding the training
error. Experiments on 27 real-world datasets from different domains with high
class imbalance demonstrate the superiority of our method over 12
state-of-the-art cost-sensitive boosting approaches, exhibiting consistent
improvements in different measures, for instance, in the range of [0.3%-28.56%]
for AUC, [3.4%-21.4%] for balanced accuracy, [4.8%-45%] for gmean and
[7.4%-85.5%] for recall.
Comment: 30 pages.
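As a rough illustration of the idea (a sketch under assumptions, not the AdaCC algorithm, its exact cost update, or its training-error guarantees), the Python snippet below runs an AdaBoost-style loop in which the cost applied to misclassified minority instances is re-derived each round from the cumulative ensemble's error on the minority class, instead of coming from a fixed user-supplied cost matrix. The concrete cost formula (one plus the current false-negative rate) and all names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cumulative_cost_boost(X, y, n_rounds=20):
    """Toy cost-sensitive booster; labels in {-1, +1}, with +1 the minority class."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # instance weights
    minority = (y == 1)
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # standard AdaBoost learner weight
        learners.append(stump)
        alphas.append(alpha)
        # How badly does the cumulative ensemble miss the minority class so far?
        F = np.sign(sum(a * l.predict(X) for a, l in zip(alphas, learners)))
        fnr = float(np.mean(F[minority] != 1))
        cost = 1.0 + fnr                         # cost rises while minority recall is poor
        # AdaBoost-style update, amplified for misclassified minority instances
        miss = (pred != y).astype(float)
        w *= np.exp(alpha * miss * np.where(minority, cost, 1.0))
        w /= w.sum()
    return learners, alphas

def boost_predict(learners, alphas, X):
    return np.sign(sum(a * l.predict(X) for a, l in zip(alphas, learners)))

# Toy imbalanced data: 20 minority instances among 400, shifted so they are learnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(np.arange(400) < 20, 1, -1)
X[y == 1] += 2.0
learners, alphas = cumulative_cost_boost(X, y)
print((boost_predict(learners, alphas, X) == y).mean())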
Explaining text classifiers through progressive neighborhood approximation with realistic samples
The importance of neighborhood construction in local explanation methods has already been highlighted in the literature, and several attempts have been made to improve neighborhood quality for high-dimensional data, such as text, by adopting generative models. Although the generators produce more realistic
samples, the intuitive sampling approaches in the existing solutions leave the
latent space underexplored. To overcome this problem, our work, focusing on
local model-agnostic explanations for text classifiers, proposes a progressive
approximation approach that refines the neighborhood of a to-be-explained
decision with a careful two-stage interpolation using counterfactuals as
landmarks. We explicitly specify the two properties that should be satisfied by
generative models, the reconstruction ability and the locality-preserving
property, to guide the selection of generators for local explanation methods.
Moreover, having observed the opacity of generative models during our study, we propose a second method that implements progressive neighborhood approximation with probability-based edits as an alternative to the generator-based solution. Both methods produce word-level and instance-level explanations that benefit from the realistic neighborhood. Through
exhaustive experiments, we qualitatively and quantitatively demonstrate the
effectiveness of the two proposed methods.
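For readers unfamiliar with the setting, the sketch below shows the generic pipeline this line of work builds on: a local, model-agnostic explanation of a text classifier obtained by sampling a neighborhood of variants of one input and fitting a weighted linear surrogate on the black-box outputs. It uses simple LIME-style word masking, so it reproduces neither the progressive approximation, the counterfactual landmarks, the generator-selection criteria, nor the probability-based edits proposed in the paper; the toy black box and all names are assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline

# A tiny stand-in black box trained on toy sentiment data (assumption).
train_texts = ["great wonderful movie", "awful boring movie",
               "wonderful acting and a great plot", "boring awful plot"]
train_labels = [1, 0, 1, 0]
black_box = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(train_texts, train_labels)

def explain_locally(text, predict_proba, n_samples=500, seed=0):
    """Word-level importances from a weighted linear surrogate fitted on masked copies."""
    rng = np.random.default_rng(seed)
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))   # neighborhood: random word subsets
    masks[0] = 1                                               # keep the original instance itself
    neighbors = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]
    probs = predict_proba(neighbors)[:, 1]                     # black-box outputs on the neighborhood
    weights = np.exp(masks.mean(axis=1) - 1.0)                 # neighbors closer to the original count more
    surrogate = Ridge(alpha=1.0).fit(masks, probs, sample_weight=weights)
    return sorted(zip(words, surrogate.coef_), key=lambda p: -abs(p[1]))

print(explain_locally("great movie but boring plot", black_box.predict_proba))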