Subjectively Interesting Subgroup Discovery on Real-valued Targets
Deriving insights from high-dimensional data is one of the core problems in
data mining. The difficulty mainly stems from the fact that there are
exponentially many variable combinations to potentially consider, and there are
infinitely many if we consider weighted combinations, even for linear
combinations. Hence, an obvious question is whether we can automate the search
for interesting patterns and visualizations. In this paper, we consider the
setting where a user wants to learn as efficiently as possible about
real-valued attributes. For example, the user may want to understand the
distribution of crime rates in different geographic areas in terms of other
(numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to
find subgroups in the data that are maximally informative (in the formal
information-theoretic sense) with respect to one or more real-valued
target attributes. The subgroup descriptions are in terms of a succinct set of
arbitrarily-typed other attributes. The approach is based on the Subjective
Interestingness framework FORSIED to enable the use of prior knowledge when
finding the most informative non-redundant patterns, and hence the method also
supports iterative data mining.
Comment: 12 pages, 10 figures, 2 tables, conference submission
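To make the idea of an informative subgroup on a real-valued target concrete, the following is a minimal sketch and not the paper's method. It assumes a deliberately simple background model (every target value is an independent normal draw with the dataset's mean and variance, which must be positive) and scores a subgroup by the information content of its mean divided by a description length, following the general FORSIED recipe of information per unit of description. The function name, the toy values, and the `description_length` parameter are illustrative assumptions.

```python
import math


def subjective_interestingness(targets, members, description_length):
    """Score a subgroup by information content per unit of description length.

    Assumption (not from the paper): the background model treats every target
    value as an independent draw from a normal distribution with the dataset's
    mean and variance, and the pattern shown to the user is the subgroup mean.
    """
    n_all = len(targets)
    mu = sum(targets) / n_all
    var = sum((t - mu) ** 2 for t in targets) / n_all

    subgroup = [targets[i] for i in members]
    n = len(subgroup)
    subgroup_mean = sum(subgroup) / n

    # Under the assumed background model, the mean of n draws is normal
    # with mean mu and variance var / n.
    var_of_mean = var / n

    # Information content: negative log density of the observed subgroup mean.
    information = (0.5 * math.log(2 * math.pi * var_of_mean)
                   + (subgroup_mean - mu) ** 2 / (2 * var_of_mean))
    return information / description_length


# Made-up toy target values (hypothetical, for illustration only).
crime_rate = [1, 2, 3, 4, 10, 11, 12, 13]
# A subgroup whose mean is far from the overall mean ...
surprising = subjective_interestingness(crime_rate, [4, 5, 6, 7], description_length=1)
# ... versus an equally sized subgroup whose mean equals the overall mean.
unsurprising = subjective_interestingness(crime_rate, [0, 3, 4, 7], description_length=1)
```

In this sketch a subgroup whose mean deviates strongly from what the background model expects receives a higher score, and a longer description lowers the score proportionally. Updating the background model after each shown pattern, which is what enables iterative mining, is left out here.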
Interactive visual data exploration with subjective feedback : an information-theoretic approach
Visual exploration of high-dimensional real-valued datasets is a fundamental task in exploratory data analysis (EDA). Existing methods use predefined criteria to choose the representation of data. There is a lack of methods that (i) elicit from the user what she has learned from the data and (ii) show patterns that she does not know yet. We construct a theoretical model where identified patterns can be input as knowledge to the system. The knowledge syntax here is intuitive, such as "this set of points forms a cluster", and requires no knowledge of maths. This background knowledge is used to find a Maximum Entropy distribution of the data, after which the system presents the user with data projections in which the data and the Maximum Entropy distribution differ the most, hence showing the user aspects of the data that are maximally informative given the user's current knowledge. We provide an open source EDA system with tailored interactive visualizations to demonstrate these concepts. We study the performance of the system and present use cases on both synthetic and real data. We find that the model and the prototype system allow the user to learn information efficiently from various data sources and that the system is sufficiently fast in practice. We conclude that the information-theoretic approach to exploratory data analysis, where patterns observed by a user are formalized as constraints, provides a principled, intuitive, and efficient basis for constructing an EDA system.
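The step of showing projections where the data and the Maximum Entropy distribution differ the most can be illustrated with a small 2D sketch. It is not the paper's system. It assumes the only background knowledge is the per-axis mean and variance, in which case the Maximum Entropy distribution is a product of independent normals. It then swaps in a brute-force scan over directions and a variance-ratio score as a stand-in for whatever projection-finding procedure the actual system uses. All function names and the toy data are hypothetical.

```python
import math


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def most_informative_angle(points, step_degrees=1):
    """Return the direction (degrees, 0 <= angle < 180) along which the data
    variance deviates most from the background model's variance."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Background knowledge assumed here: per-axis mean and variance only.
    # The Maximum Entropy distribution under those constraints is a product
    # of independent normals, so it has no correlation between the axes.
    model_var_x, model_var_y = variance(xs), variance(ys)

    best_angle, best_gap = None, -1.0
    for degrees in range(0, 180, step_degrees):
        a = math.radians(degrees)
        c, s = math.cos(a), math.sin(a)
        # Variance of the data projected onto the direction (c, s).
        data_var = variance([c * x + s * y for x, y in points])
        # Variance of the independent-normal model along the same direction.
        model_var = c * c * model_var_x + s * s * model_var_y
        # Stand-in difference measure: absolute log variance ratio.
        gap = abs(math.log(max(data_var, 1e-12) / model_var))
        if gap > best_gap:
            best_angle, best_gap = degrees, gap
    return best_angle, best_gap


# Toy data (made up): two strongly correlated coordinates.
points = [(i, i + 0.1 * (-1) ** i) for i in range(20)]
angle, gap = most_informative_angle(points)
```

Because the assumed background model knows nothing about the correlation between the two axes, the direction it misjudges most is the one across the diagonal, where the data have almost no spread. Feeding that observation back as a new constraint and recomputing the Maximum Entropy distribution would be the next round of the interactive loop, which this sketch omits.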
Robust subgroup discovery
We introduce the problem of robust subgroup discovery, i.e., finding a set of
interpretable descriptions of subsets that 1) stand out with respect to one or
more target attributes, 2) are statistically robust, and 3) are non-redundant. Many
attempts have been made to mine either locally robust subgroups or to tackle
the pattern explosion, but we are the first to address both challenges at the
same time from a global modelling perspective. First, we formulate the broad
model class of subgroup lists, i.e., ordered sets of subgroups, for univariate
and multivariate targets that can consist of nominal or numeric variables, and
that includes traditional top-1 subgroup discovery in its definition. This
novel model class allows us to formalise the problem of optimal robust subgroup
discovery using the Minimum Description Length (MDL) principle, where we resort
to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and
numeric targets, respectively. Second, as finding optimal subgroup lists is
NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists
and guarantees that the most significant subgroup found according to the MDL
criterion is added in each iteration, which is shown to be equivalent to a
Bayesian one-sample proportions, multinomial, or t-test between the subgroup
and dataset marginal target distributions plus a multiple hypothesis testing
penalty. We empirically show on 54 datasets that SSD++ outperforms previous
subgroup set discovery methods in terms of quality and subgroup list size.
Comment: For associated code, see https://github.com/HMProenca/RuleList ;
submitted to Data Mining and Knowledge Discovery Journal
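The greedy construction of a subgroup list described in the abstract above can be sketched as follows. This is not SSD++ and does not use the RuleList code or its API. The score is a simple stand-in (a squared standardized mean difference against the dataset marginal, minus a constant penalty that grows with the number of candidates) rather than the MDL criterion, and candidates are restricted to single attribute-value conditions. All names and the toy data are hypothetical.

```python
import math


def greedy_subgroup_list(rows, target, attributes, max_rules=5):
    """Greedily build an ordered list of subgroups for a numeric target.

    rows: list of dicts; target: key of the numeric target;
    attributes: keys of categorical attributes used for descriptions.
    """
    values = [r[target] for r in rows]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values) or 1e-12

    # Candidate descriptions: one attribute == value condition each.
    candidates = sorted({(a, r[a]) for r in rows for a in attributes})
    # Stand-in for a multiple-testing penalty (NOT the paper's MDL term).
    penalty = math.log(len(candidates)) if candidates else 0.0

    uncovered = list(rows)
    subgroup_list = []
    while uncovered and len(subgroup_list) < max_rules:
        best = None
        for attr, val in candidates:
            covered = [r for r in uncovered if r[attr] == val]
            if not covered:
                continue
            mean = sum(r[target] for r in covered) / len(covered)
            # Compare the subgroup mean with the dataset marginal mean.
            score = len(covered) * (mean - mu) ** 2 / (2 * var) - penalty
            if best is None or score > best[0]:
                best = (score, attr, val, mean, len(covered))
        if best is None or best[0] <= 0:
            break  # no remaining candidate beats the penalty
        _, attr, val, mean, size = best
        subgroup_list.append({"condition": (attr, val), "mean": mean, "size": size})
        # Later subgroups only describe rows not covered by earlier ones.
        uncovered = [r for r in uncovered if r[attr] != val]
    return subgroup_list


# Toy data (made up for illustration).
rows = (
    [{"area": "north", "type": t, "y": y}
     for t, y in [("urban", 10), ("urban", 11), ("rural", 10), ("rural", 12)]]
    + [{"area": "south", "type": t, "y": y}
       for t, y in [("urban", 1), ("rural", 2), ("urban", 1), ("rural", 2)]]
    + [{"area": "east", "type": t, "y": y}
       for t, y in [("urban", 5), ("rural", 6), ("urban", 5), ("rural", 6)]]
)
subgroups = greedy_subgroup_list(rows, target="y", attributes=["area", "type"])
```

The ordering matters: each subgroup is evaluated only on rows left uncovered by the ones before it, which is what keeps the list non-redundant in this sketch. The penalty plays the role of a guard against accepting weak subgroups, so candidates whose mean is close to the dataset mean are never added.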