1,069 research outputs found
Robust subgroup discovery
We introduce the problem of robust subgroup discovery, i.e., finding a set of
interpretable descriptions of subsets that 1) stand out with respect to one or
more target attributes, 2) are statistically robust, and 3) are non-redundant. Many
attempts have been made to mine either locally robust subgroups or to tackle
the pattern explosion, but we are the first to address both challenges at the
same time from a global modelling perspective. First, we formulate the broad
model class of subgroup lists, i.e., ordered sets of subgroups, for univariate
and multivariate targets that can consist of nominal or numeric variables, and
that includes traditional top-1 subgroup discovery in its definition. This
novel model class allows us to formalise the problem of optimal robust subgroup
discovery using the Minimum Description Length (MDL) principle, where we resort
to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and
numeric targets, respectively. Second, as finding optimal subgroup lists is
NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists
and guarantees that the most significant subgroup found according to the MDL
criterion is added in each iteration, which is shown to be equivalent to a
Bayesian one-sample proportions, multinomial, or t-test between the subgroup
and dataset marginal target distributions plus a multiple hypothesis testing
penalty. We empirically show on 54 datasets that SSD++ outperforms previous
subgroup set discovery methods in terms of quality and subgroup list size.
Comment: For associated code, see https://github.com/HMProenca/RuleList;
submitted to Data Mining and Knowledge Discovery Journal
Subjectively Interesting Subgroup Discovery on Real-valued Targets
Deriving insights from high-dimensional data is one of the core problems in
data mining. The difficulty mainly stems from the fact that there are
exponentially many variable combinations to potentially consider, and there are
infinitely many if we consider weighted combinations, even for linear
combinations. Hence, an obvious question is whether we can automate the search
for interesting patterns and visualizations. In this paper, we consider the
setting where a user wants to learn as efficiently as possible about
real-valued attributes. For example, the user may want to understand the
distribution of crime rates in different geographic areas in terms of other
(numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to
find subgroups in the data that are maximally informative (in the formal
information-theoretic sense) with respect to one or more real-valued
target attributes. The subgroup descriptions are in terms of a succinct set of
arbitrarily-typed other attributes. The approach is based on the Subjective
Interestingness framework FORSIED to enable the use of prior knowledge when
finding the most informative non-redundant patterns, and hence the method also
supports iterative data mining.
Comment: 12 pages, 10 figures, 2 tables, conference submission
CONTRAST SET MINING MENGGUNAKAN SUBGROUP DISCOVERY ( Contrast Set Mining Through Subgroup Discovery )
ABSTRACT: Data mining technology is used to mine or extract knowledge from very large collections of data. Contrast set mining is an alternative data mining approach for finding differences between several contrasting groups. Contrast set mining can be performed using subgroup discovery, and one subgroup discovery algorithm is APRIORI-SD. Based on the test results, rules were obtained that can be used to classify data, and the accuracy of those rules was also measured.
Keywords: data mining, contrast set mining, subgroup discovery, apriori-sd, rule
Explainable subgraphs with surprising densities : a subgroup discovery approach
The connectivity structure of graphs is typically related to the attributes of the nodes. In social networks, for example, the probability of a friendship between any pair of people depends on a range of attributes, such as their age, residence location, workplace, and hobbies. The high-level structure of a graph can thus possibly be described well by means of patterns of the form "the subgroup of all individuals with certain properties X are often (or rarely) friends with individuals in another subgroup defined by properties Y", in comparison to what is expected. Such rules present potentially actionable and generalizable insight into the graph.
We present a method that finds node subgroup pairs between which the edge density is interestingly high or low, using an information-theoretic definition of interestingness. Additionally, the interestingness is quantified subjectively, to contrast with prior information an analyst may have about the connectivity. This view immediately enables iterative mining of such patterns. This is the first method aimed at graph connectivity relations between different subgroups. Our method generalizes prior work on dense subgraphs induced by a subgroup description. Although this setting has been studied already, we demonstrate for this special case considerable practical advantages of our subjective interestingness measure with respect to a wide range of (objective) interestingness measures.
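As a rough illustration of the pattern form described in this abstract, the sketch below selects two node subgroups by attribute values and compares the edge density between them with the overall density of the graph. It is not the paper's subjective, information-theoretic measure: the function names, the toy graph, and the use of a raw density comparison are all assumptions made for illustration, and the two subgroups are assumed to be disjoint.

```python
from typing import Dict, Iterable, Set, Tuple

Node = str
Edge = Tuple[Node, Node]


def select(nodes: Dict[Node, Dict[str, str]], rule: Dict[str, str]) -> Set[Node]:
    """Return the nodes whose attributes match every (attribute, value) in rule."""
    return {n for n, attrs in nodes.items()
            if all(attrs.get(a) == v for a, v in rule.items())}


def density_between(edges: Iterable[Edge], xs: Set[Node], ys: Set[Node]) -> float:
    """Fraction of possible X-Y pairs joined by an undirected edge.

    Assumes xs and ys are disjoint (hypothetical simplification).
    """
    possible = len(xs) * len(ys)
    if possible == 0:
        return 0.0
    edge_set = {frozenset(e) for e in edges}
    hits = sum(1 for x in xs for y in ys if frozenset((x, y)) in edge_set)
    return hits / possible


def overall_density(n_nodes: int, edges: Iterable[Edge]) -> float:
    """Fraction of all unordered node pairs joined by an edge."""
    possible = n_nodes * (n_nodes - 1) // 2
    if possible == 0:
        return 0.0
    return len({frozenset(e) for e in edges}) / possible


# Toy data, invented for this example.
nodes = {
    "a": {"age": "young"}, "b": {"age": "young"},
    "c": {"age": "old"}, "d": {"age": "old"},
}
edges = [("a", "c"), ("a", "d"), ("b", "c")]

young = select(nodes, {"age": "young"})
old = select(nodes, {"age": "old"})
between = density_between(edges, young, old)
baseline = overall_density(len(nodes), edges)
print(f"density(young, old) = {between:.2f}, overall = {baseline:.2f}")
```

In this toy graph the young-old pair is denser than the graph as a whole. A subjective measure of the kind the abstract describes would instead compare against the analyst's prior expectation rather than against this raw baseline.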
Subgroup Discovery in Unstructured Data
Subgroup discovery is a descriptive and exploratory data mining technique to
identify subgroups in a population that exhibit interesting behavior with
respect to a variable of interest. Subgroup discovery has numerous applications
in knowledge discovery and hypothesis generation, yet it remains inapplicable
for unstructured, high-dimensional data such as images. This is because
subgroup discovery algorithms rely on defining descriptive rules based on
(attribute, value) pairs; however, in unstructured data, an attribute is not
well defined. Even in cases where the notion of attribute intuitively exists in
the data, such as a pixel in an image, due to the high dimensionality of the
data, these attributes are not informative enough to be used in a rule. In this
paper, we introduce the subgroup-aware variational autoencoder, a novel
variational autoencoder that learns a representation of unstructured data which
leads to subgroups with higher quality. Our experimental results demonstrate
the effectiveness of the method at learning subgroups with high quality while
supporting the interpretability of the concepts.
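To make the notion of a descriptive rule built from (attribute, value) pairs concrete, here is a minimal sketch that scores one such rule on tabular data. It uses weighted relative accuracy (WRAcc), a commonly used subgroup quality measure; that choice is an assumption for illustration and is not stated in the abstract above, and the helper names and toy records are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Rule = List[Tuple[str, str]]  # conjunction of (attribute, value) conditions


def covers(rule: Rule, row: Row) -> bool:
    """True if the row satisfies every (attribute, value) condition."""
    return all(row.get(attr) == value for attr, value in rule)


def wracc(rule: Rule, rows: List[Row], target: str, positive: str) -> float:
    """Weighted relative accuracy: coverage * (P(pos | rule) - P(pos))."""
    n = len(rows)
    covered = [r for r in rows if covers(rule, r)]
    if n == 0 or not covered:
        return 0.0
    p_all = sum(r[target] == positive for r in rows) / n
    p_sub = sum(r[target] == positive for r in covered) / len(covered)
    return len(covered) / n * (p_sub - p_all)


# Toy records, invented for this example.
rows = [
    {"smoker": "yes", "ill": "yes"},
    {"smoker": "yes", "ill": "yes"},
    {"smoker": "no", "ill": "no"},
    {"smoker": "no", "ill": "no"},
]
score = wracc([("smoker", "yes")], rows, target="ill", positive="yes")
print(f"WRAcc(smoker = yes) = {score:.2f}")
```

A rule like this only works when attributes are meaningful on their own, which is the limitation the abstract points out for raw pixels and other unstructured inputs.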
Subgroup Discovery: Real-World Applications
Subgroup discovery is a data mining technique which extracts interesting rules with respect
to a target variable. An important characteristic of this task is the combination of predictive
and descriptive induction. In this paper, we give an overview of subgroup discovery.
In addition, we present different real-world applications solved through evolutionary algorithms,
which show the suitability and potential of this type of algorithm for the development of
subgroup discovery algorithms.
Subgroup Discovery through Evolutionary Fuzzy Systems applied to Bioinformatic problems
Subgroup discovery is a descriptive data mining technique using supervised learning. This
paper presents a summary of the main properties and elements of the subgroup discovery task.
In addition, we focus on the suitability and potential of the search performed by evolutionary
algorithms for the development of subgroup discovery algorithms, and on the use
of fuzzy logic, a soft computing technique very close to human reasoning. The
hybridisation of both techniques is known as an evolutionary fuzzy system.
The most relevant applications of evolutionary fuzzy systems for subgroup discovery in the
bioinformatics domain are outlined in this work. Specifically, these algorithms are applied to a
problem based on the Influenza A virus and to the acute sore throat problem.