36,723 research outputs found

Subjectively Interesting Subgroup Discovery on Real-valued Targets

Author: De Bie Tijl
Duivesteijn Wouter
Kang Bo
Lijffijt Jefrey
Oikarinen Emilia
Puolamäki Kai
Publication date: 01/01/2017

Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the fact that there are exponentially many variable combinations to consider, and infinitely many if we consider weighted combinations, even when restricted to linear ones. Hence, an obvious question is whether we can automate the search for interesting patterns and visualizations. In this paper, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes. For example, a user may want to understand the distribution of crime rates in different geographic areas in terms of other (numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to find subgroups in the data that are maximally informative (in the formal information-theoretic sense) with respect to a single real-valued target attribute or a set of them. The subgroup descriptions are in terms of a succinct set of arbitrarily-typed other attributes. The approach is based on the Subjective Interestingness framework FORSIED, which enables the use of prior knowledge when finding the most informative non-redundant patterns, and hence the method also supports iterative data mining.

Comment: 12 pages, 10 figures, 2 tables, conference submission

arXiv.org e-Print Archive

Repository TU/e

Crossref

Pure OAI Repository

Ghent University Academic Bibliography

Aaltodoc Publication Archive

Interpretable multiclass classification by MDL-based rule lists

Author: Proença Hugo M.
van Leeuwen Matthijs
Publication venue: 'Elsevier BV'
Publication date: 31/10/2019

Interpretable classifiers have recently witnessed an increase in attention from the data mining community because they are inherently easier to understand and explain than their more complex counterparts. Examples of interpretable classification models include decision trees, rule sets, and rule lists. Learning such models often involves optimizing hyperparameters, which typically requires substantial amounts of data and may result in relatively large models. In this paper, we consider the problem of learning compact yet accurate probabilistic rule lists for multiclass classification. Specifically, we propose a novel formalization based on probabilistic rule lists and the minimum description length (MDL) principle. This results in virtually parameter-free model selection that naturally trades off model complexity against goodness of fit, by which overfitting and the need for hyperparameter tuning are effectively avoided. Finally, we introduce the Classy algorithm, which greedily finds rule lists according to the proposed criterion. We empirically demonstrate that Classy selects small probabilistic rule lists that outperform state-of-the-art classifiers when it comes to the combination of predictive performance and interpretability. We show that Classy is insensitive to its only parameter, i.e., the candidate set, and that compression on the training set correlates with classification performance, validating our MDL-based selection criterion.

arXiv.org e-Print Archive

Leiden University Scholarly Publications

Semi-Supervised Approach to Monitoring Clinical Depressive Symptoms in Social Media

Author: Al-Olimat Hussein S.
Bajaj Goonmeet
Banerjee Tanvi
Ebrahimi Monireh
Pathak Jyotishman
Sheth Amit
Thirunarayan Krishnaprasad
Yazdavar Amir Hossein
Publication date: 01/01/2017

With the rise of social media, millions of people are routinely expressing their moods, feelings, and daily struggles with mental health issues on platforms like Twitter. Unlike traditional observational cohort studies conducted through questionnaires and self-reported surveys, we explore the reliable detection of clinical depression from tweets obtained unobtrusively. Based on the analysis of tweets crawled from users with self-reported depressive symptoms in their Twitter profiles, we demonstrate the potential for detecting clinical depression symptoms which emulate the PHQ-9 questionnaire clinicians use today. Our study uses a semi-supervised statistical model to evaluate how the duration of these symptoms and their expression on Twitter (in terms of word usage patterns and topical preferences) align with the medical findings reported via the PHQ-9. Our proactive and automatic screening tool is able to identify clinical depressive symptoms with an accuracy of 68% and precision of 72%.

Comment: 8 pages, Advances in Social Networks Analysis and Mining (ASONAM), 2017 IEEE/ACM International Conference

arXiv.org e-Print Archive

CORE

Learning Interpretable Rules for Multi-label Classification

Author: A Gabriel
AA Freitas
AJ Knobbe
B Liu
B Minnaert
D Malerba
E Gibaja
E Loza Mencía
E Montañés
F Charte
F Herrera
F Janssen
F Thabtah
G Bosc
G Tsoumakas
Grigorios Tsoumakas
H Allahyari
J Arunadevi
J Demšar
J Fürnkranz
J Han
J Hipp
J Read
JN Sulzmann
K Dembczyński
L Chekina
L Raedt De
LE Sucar
M Atzmüller
M Beckerle
M Friedman
M Zhang
Miltiadis Allamanis
MR Boutell
P Kralj Novak
PJ Hayes
R Senge
RM Cameron-Jones
Shantanu Godbole
W Duivesteijn
W Waegeman
WW Cohen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/11/2018

Multi-label classification (MLC) is a supervised learning problem in which, contrary to standard multiclass classification, an instance can be associated with several class labels simultaneously. In this chapter, we advocate a rule-based approach to multi-label classification. Rule learning algorithms are often employed when one is not only interested in accurate predictions, but also requires an interpretable theory that can be understood, analyzed, and qualitatively evaluated by domain experts. Ideally, by revealing patterns and regularities contained in the data, a rule-based theory yields new insights into the application domain. Recently, several authors have started to investigate how rule-based models can be used for modeling multi-label data. Discussing this task in detail, we highlight some of the problems that make rule learning considerably more challenging for MLC than for conventional classification. While mainly focusing on our own previous work, we also provide a short overview of related work in this area.

Comment: Preprint version. To appear in: Explainable and Interpretable Models in Computer Vision and Machine Learning. The Springer Series on Challenges in Machine Learning. Springer (2018). See http://www.ke.tu-darmstadt.de/bibtex/publications/show/3077 for further information

arXiv.org e-Print Archive

TUbiblio

Crossref