15 research outputs found

    Towards a Multi-Objective Optimization of Subgroups for the Discovery of Materials with Exceptional Performance

    Artificial intelligence (AI) can accelerate the design of materials by identifying correlations and complex patterns in data. However, AI methods commonly attempt to describe the entire, immense materials space with a single model, even though different mechanisms typically govern materials behavior in different regions of that space. The subgroup-discovery (SGD) approach identifies local rules describing exceptional subsets of data with respect to a given target. Thus, SGD can focus on mechanisms leading to exceptional performance. However, identifying appropriate subgroup (SG) rules requires careful consideration of the generality-exceptionality tradeoff. Here, we discuss challenges in advancing the SGD approach in materials science and analyse the tradeoff between exceptionality and generality based on a Pareto front of SGD solutions.
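To make the generality-exceptionality tradeoff concrete, the following is a minimal sketch of extracting a Pareto front from candidate subgroups. It assumes each candidate is summarized by a hypothetical pair `(generality, exceptionality)`, both to be maximized; the numbers are invented for illustration and the function is not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (generality, exceptionality), both maximized


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates."""
    front = []
    for i, (g_i, e_i) in enumerate(points):
        dominated = any(
            # j dominates i: at least as good on both axes, strictly better on one
            (g_j >= g_i and e_j >= e_i) and (g_j > g_i or e_j > e_i)
            for j, (g_j, e_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((g_i, e_i))
    return front


# Illustrative candidates only (not data from the paper).
candidates = [(0.50, 0.10), (0.30, 0.40), (0.10, 0.90), (0.25, 0.35), (0.10, 0.50)]
front = pareto_front(candidates)
print(front)
```

Each point on the front is a subgroup for which no other candidate is both more general and more exceptional; choosing among them is left to the analyst.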

    FSSD - A Fast and Efficient Algorithm for Subgroup Set Discovery

    Subgroup discovery (SD) is the task of discovering interpretable patterns in the data that stand out w.r.t. some property of interest. Discovering patterns that accurately discriminate a class from the others is one of the most common SD tasks. Standard approaches in the literature are based on local pattern discovery, which is known to produce an overwhelmingly large number of redundant patterns. To solve this issue, pattern set mining has been proposed: instead of evaluating the quality of patterns separately, one should consider the quality of a pattern set as a whole. The goal is to provide a small pattern set that is diverse and discriminates the target class well. In this work, we introduce a novel formulation of the task of diverse subgroup set discovery where both discriminative power and diversity of the subgroup set are incorporated in the same quality measure. We propose an efficient and parameter-free algorithm, dubbed FSSD, based on a greedy scheme. FSSD uses several optimization strategies that enable it to provide a high-quality pattern set in a short amount of time.

    Subgroup Discovery: Real-World Applications

    Subgroup discovery is a data mining technique which extracts interesting rules with respect to a target variable. An important characteristic of this task is the combination of predictive and descriptive induction. This paper gives an overview of subgroup discovery. In addition, it presents different real-world applications solved through evolutionary algorithms, which show the suitability and potential of this type of algorithm for the development of subgroup discovery algorithms.

    Subgroup Discovery through Evolutionary Fuzzy Systems applied to Bioinformatic problems

    Subgroup discovery is a descriptive data mining technique using supervised learning. This paper summarizes the main properties and elements of the subgroup discovery task. In addition, we focus on the suitability and potential of the search performed by evolutionary algorithms for the development of subgroup discovery algorithms, and on the use of fuzzy logic, a soft computing technique that is very close to human reasoning. The hybridisation of both techniques is known as an evolutionary fuzzy system. The most relevant applications of evolutionary fuzzy systems for subgroup discovery in bioinformatics domains are outlined in this work. Specifically, these algorithms are applied to a problem based on the Influenza A virus and to the acute sore throat problem.

    DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups

    We strive to find contexts (i.e., subgroups of entities) under which exceptional (dis-)agreement occurs among a group of individuals, in any type of data featuring individuals (e.g., parliamentarians, customers) performing observable actions (e.g., votes, ratings) on entities (e.g., legislative procedures, movies). To this end, we introduce the problem of discovering statistically significant exceptional contextual intra-group agreement patterns. To handle the sparsity inherent to voting and rating data, we use Krippendorff's Alpha measure for assessing the agreement among individuals. We devise a branch-and-bound algorithm, named DEvIANT, to discover such patterns. DEvIANT exploits both closure operators and tight optimistic estimates. We derive analytic approximations for the confidence intervals (CIs) associated with patterns for a computationally efficient significance assessment. We prove that these approximate CIs are nested along specialization of patterns. This allows pruning properties to be incorporated in DEvIANT to quickly discard non-significant patterns. An empirical study on several datasets demonstrates the efficiency and the usefulness of DEvIANT. Technical Report Associated with the ECML/PKDD 2019 Paper entitled: "DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups"

    Diskretointi osajoukkojen haussa (Discretization in Subgroup Discovery)

    Subgroup discovery is a data mining technique for discovering interesting subgroups in a selected population. It seeks to discover interesting relationships between different objects in a set with respect to a specific property. The discovered patterns are called subgroups and they are represented in the form of rules. Discretization is a technique that replaces numerical attributes with nominal ones, making it possible to use them with algorithms that do not support numerical attributes. In this thesis, two datasets are discretized for the application of subgroup discovery. Four different discretization methods were used, each with three different numbers of bins. The datasets are the heart disease and the Australian credit approval datasets from the UCI Machine Learning Repository. The subgroup discovery technique produced eleven subgroup sets as a result, eight from the heart disease dataset and three from the Australian credit approval dataset. We observed that the number of bins greatly affects the results. Also, with binary discretization there are subgroup sets with a high share of subgroups containing discretized attributes. In addition, the importance of expert guidance is emphasized.
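As an illustration of how discretization turns a numerical attribute into a nominal one, here is a minimal sketch of two common schemes, equal-width and equal-frequency binning. The abstract does not name the four methods used in the thesis, so these two are assumptions chosen for illustration; the function names and the `ages` values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_width_cuts(values: Sequence[float], n_bins: int) -> List[float]:
    """Cut points that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]


def equal_frequency_cuts(values: Sequence[float], n_bins: int) -> List[float]:
    """Cut points that put roughly the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for k in range(1, n_bins):
        idx = k * n // n_bins
        # Place the cut halfway between the two neighbouring values.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def assign_bins(values: Sequence[float], cuts: Sequence[float]) -> List[int]:
    """Map each value to the index of the bin it falls into (0-based)."""
    return [bisect_right(cuts, v) for v in values]


# Hypothetical numerical attribute, e.g. patient age.
ages = [29, 35, 41, 44, 52, 54, 57, 63, 77]
width_bins = assign_bins(ages, equal_width_cuts(ages, 3))
freq_bins = assign_bins(ages, equal_frequency_cuts(ages, 3))
print(width_bins)
print(freq_bins)
```

The two schemes assign the value 44 to different bins, which is one simple way the choice of method and number of bins can change the rules a subgroup discovery algorithm later finds.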

    Comprehensible and Robust Knowledge Discovery from Small Datasets

    Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. Data may represent a set of measurements from a real process or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarizes the data as accurately as possible and (2) comes in a readily comprehensible form. Decision trees and subgroup discovery methods deliver knowledge summaries in the form of hyperrectangles, which are considered easy to understand. To demonstrate the importance of a comprehensible data summary, we study Decentral Smart Grid Control, a new system that implements demand response in power grids without substantial changes to the infrastructure. The conventional analysis of this system carried out so far was limited to identical participants and therefore did not reflect reality sufficiently well. We run many simulations with different input values and apply decision trees to the resulting data. With the resulting comprehensible data summaries, we gained new insights into the behavior of Decentral Smart Grid Control. Decision trees make it possible to describe the system behavior for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but in finding regions that lead to certain outputs (so-called subgroups). Existing subgroup discovery algorithms usually require large amounts of data to produce stable and accurate output. The data acquisition process, however, is often costly.
    Our main contribution is improving subgroup discovery from datasets with few observations. Subgroup discovery in simulated data is referred to as scenario discovery. A commonly used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. For REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS works much better than PRIM alone: it reduces the number of required simulation runs by 75% on average. With simulated data, one has perfect knowledge of the input distribution, which is a prerequisite of REDS. To make REDS applicable to real measurement data, we combined it with sampling from an estimated multivariate distribution of the data. We experimentally evaluated the resulting method in combination with various data generation methods. We did this for PRIM and BestInterval, another representative subgroup discovery method. In most cases, our methodology increased the quality of the discovered subgroups.

    Tight Optimistic Estimates for Fast Subgroup Discovery

    Subgroup discovery is the task of finding subgroups of a population which exhibit both distributional unusualness and high generality. Due to the non-monotonicity of the corresponding evaluation functions, standard pruning techniques cannot be used for subgroup discovery, requiring the use of optimistic estimate techniques instead. So far, however, optimistic estimate pruning has only been considered for the extremely simple case of a binary target attribute, and no attempt has been made to move beyond suboptimal heuristic optimistic estimates. In this paper, we show that optimistic estimate pruning can be developed into a sound and highly effective pruning approach for subgroup discovery. Based on a precise definition of optimality, we show that previous estimates have been tight only in special cases. Thereafter, we present tight optimistic estimates for the most popular binary and multi-class quality functions, and present a family of increasingly efficient approximations to these optimal functions. As we show in empirical experiments, the use of our newly proposed optimistic estimates can lead to a speedup of an order of magnitude compared to previous approaches.
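The idea of optimistic estimate pruning can be sketched for one common binary quality function, weighted relative accuracy (WRAcc). This is an illustrative assumption rather than a reproduction of the paper's definitions: WRAcc is taken here as (n/N)(p - p0), and the bound used is the quality of the hypothetical refinement that keeps every positive example of the subgroup and drops every negative one. The helper names and the toy data are invented.

```python
from itertools import combinations
from typing import Sequence


def wracc(labels: Sequence[int], n_total: int, p0: float) -> float:
    """Weighted relative accuracy: coverage times lift over the base rate."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # share of positives inside the subgroup
    return (n / n_total) * (p - p0)


def optimistic_estimate(labels: Sequence[int], n_total: int, p0: float) -> float:
    """Upper bound on the WRAcc of any refinement of the subgroup.

    Assumes the best conceivable refinement keeps all positives and no
    negatives, giving (tp / N) * (1 - p0).
    """
    tp = sum(labels)
    return (tp / n_total) * (1.0 - p0)


# Toy population: 3 positives among 10 examples (illustrative only).
population = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
n_total = len(population)
p0 = sum(population) / n_total

# Labels of the examples covered by some candidate subgroup.
subgroup = [1, 1, 0, 0]
quality = wracc(subgroup, n_total, p0)
bound = optimistic_estimate(subgroup, n_total, p0)

# Brute-force check: no subset of the subgroup beats the bound.
best_refinement = max(
    wracc(list(subset), n_total, p0)
    for r in range(len(subgroup) + 1)
    for subset in combinations(subgroup, r)
)
print(quality, bound, best_refinement)
```

In a search, a branch can be discarded whenever its bound is below the quality of the best subgroup found so far, since no refinement in that branch can do better.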