10 research outputs found

    PRZEGLĄD METOD SELEKCJI CECH UŻYWANYCH W DIAGNOSTYCE CZERNIAKA (A Review of Feature Selection Methods Used in Melanoma Diagnosis)

    Currently, a large number of feature selection methods are in use, and they attract growing interest among researchers; some methods are, of course, used more frequently than others. The article describes the basics of selection-based algorithms. Feature selection (FS) methods fall into three categories: filter methods, wrapper methods, and embedded methods. Particular attention was paid to finding examples of applications of the described methods in the diagnosis of skin melanoma.
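    To make the three categories concrete, here is a minimal scikit-learn sketch on synthetic data (the dataset, feature counts, and estimator choices are placeholders, not taken from the article):

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE, SelectFromModel
        from sklearn.linear_model import LogisticRegression

        # Synthetic stand-in for a dermoscopy feature matrix (e.g. shape/colour descriptors).
        X, y = make_classification(n_samples=200, n_features=30, n_informative=8, random_state=0)

        # Filter: rank features by a model-agnostic score (here mutual information).
        filt = SelectKBest(score_func=mutual_info_classif, k=8).fit(X, y)

        # Wrapper: search feature subsets by repeatedly refitting a classifier (RFE).
        wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X, y)

        # Embedded: selection happens inside training, via L1-regularised weights.
        emb = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")).fit(X, y)

        for name, sel in [("filter", filt), ("wrapper", wrap), ("embedded", emb)]:
            print(name, np.flatnonzero(sel.get_support()))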

    Semantic variation operators for multidimensional genetic programming

    Multidimensional genetic programming represents candidate solutions as sets of programs, and thereby provides an interesting framework for exploiting building block identification. Towards this goal, we investigate the use of machine learning as a way to bias which components of programs are promoted, and propose two semantic operators to choose where useful building blocks are placed during crossover. A forward stagewise crossover operator we propose leads to significant improvements on a set of regression problems, and produces state-of-the-art results in a large benchmark study. We discuss this architecture and others in terms of their propensity for allowing heuristic search to utilize information during the evolutionary process. Finally, we look at the collinearity and complexity of the data representations that result from these architectures, with a view towards disentangling factors of variation in application. Comment: 9 pages, 8 figures, GECCO 2019
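    The forward stagewise idea can be pictured as a greedy pass over program outputs ("semantics") that promotes the programs most correlated with the residual error. The toy sketch below illustrates that idea only; it is not the paper's operator, and all names and constants are invented:

        import numpy as np

        def forward_stagewise_pick(semantics, y, k, eps=0.1, iters=1000):
            """Toy forward-stagewise pass over program outputs.

            semantics: (n_samples, n_programs) matrix, one column per candidate program.
            Returns indices of the k programs whose coefficients grew largest, i.e. the
            building blocks a semantic crossover operator could promote into the child.
            """
            Z = (semantics - semantics.mean(0)) / (semantics.std(0) + 1e-12)
            resid, coef = y - y.mean(), np.zeros(Z.shape[1])
            for _ in range(iters):
                corr = Z.T @ resid                   # correlation of each program with residual
                j = np.argmax(np.abs(corr))
                step = eps * np.sign(corr[j])
                coef[j] += step
                resid -= step * Z[:, j]              # update residual after the small step
            return np.argsort(-np.abs(coef))[:k]

        rng = np.random.default_rng(0)
        sem = rng.normal(size=(100, 12))             # outputs of 12 programs from two parents
        y = 2 * sem[:, 3] - sem[:, 7] + rng.normal(scale=0.1, size=100)
        print(forward_stagewise_pick(sem, y, k=2))   # should favour programs 3 and 7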

    Consistent Feature Construction with Constrained Genetic Programming for Experimental Physics

    A good feature representation is a determinant factor to achieve high performance for many machine learning algorithms in terms of classification. This is especially true for techniques that do not build complex internal representations of data (e.g. decision trees, in contrast to deep neural networks). To transform the feature space, feature construction techniques build new high-level features from the original ones. Among these techniques, Genetic Programming is a good candidate to provide interpretable features required for data analysis in high energy physics. Classically, original features or higher-level features based on physics first principles are used as inputs for training. However, physicists would benefit from an automatic and interpretable feature construction for the classification of particle collision events. Our main contribution consists in combining different aspects of Genetic Programming and applying them to feature construction for experimental physics. In particular, to be applicable to physics, dimensional consistency is enforced using grammars. Results of experiments on three physics datasets show that the constructed features can bring a significant gain to the classification accuracy. To the best of our knowledge, it is the first time a method is proposed for interpretable feature construction with units of measurement, and that experts in high-energy physics validate the overall approach as well as the interpretability of the built features. Comment: Accepted in this version to CEC 2019
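    Dimensional consistency of constructed features can be pictured as bookkeeping on unit exponents: a grammar-constrained search only derives expressions that a checker like the following accepts. This is an illustrative sketch, not the authors' grammar; the dimension vectors are placeholders:

        import numpy as np

        # Units as exponent vectors over SI base dimensions (M, L, T); e.g. energy ~ M L^2 T^-2.
        ENERGY, MASS, VELOCITY = np.array([1, 2, -2]), np.array([1, 0, 0]), np.array([0, 1, -1])

        def mul(u, v): return u + v                  # multiplication adds exponents
        def div(u, v): return u - v                  # division subtracts them
        def add(u, v):                               # addition requires identical units
            if not np.array_equal(u, v):
                raise ValueError("dimensionally inconsistent sum")
            return u

        # m * v^2 has the dimensions of energy, so 'E + m*v**2' is a legal feature...
        assert np.array_equal(add(ENERGY, mul(MASS, mul(VELOCITY, VELOCITY))), ENERGY)
        # ...while 'E + v' would be rejected during derivation:
        try:
            add(ENERGY, VELOCITY)
        except ValueError as e:
            print("rejected:", e)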

    Interpretable Dimensionally-Consistent Feature Extraction from Electrical Network Sensors

    Electrical power networks are heavily monitored systems, requiring operators to perform intricate information synthesis before understanding the underlying network state. Our study aims at helping this synthesis step by automatically creating features from the sensor data. We propose a supervised feature extraction approach using grammar-guided evolution, which outputs interpretable and dimensionally consistent features. Restrictions on operations over physical dimensions are introduced into the learning process through context-free grammars. They ensure coherence with physical laws and dimensional consistency, and also introduce technical expertise into the created features. We compare our approach to other state-of-the-art feature extraction methods on a real dataset taken from the French electrical network sensors.
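    A hypothetical miniature of such a context-free grammar, with a random derivation routine standing in for the evolutionary search (the production rules and signal names are invented for illustration):

        import random

        # A toy grammar restricting which sensor operations may combine.
        GRAMMAR = {
            "<expr>": [["<expr>", "+", "<expr>"], ["mean(", "<signal>", ")"],
                       ["max(", "<signal>", ")", "-", "min(", "<signal>", ")"]],
            "<signal>": [["voltage"], ["current"], ["diff(", "<signal>", ")"]],
        }

        def derive(symbol, rng, depth=0):
            """Expand a nonterminal into a feature expression by random rule choice."""
            if symbol not in GRAMMAR:
                return symbol                        # terminal: emit as-is
            rules = GRAMMAR[symbol]
            if depth > 4:                            # force the least-recursive rule to terminate
                rule = min(rules, key=lambda r: sum(s in GRAMMAR for s in r))
            else:
                rule = rng.choice(rules)
            return "".join(derive(s, rng, depth + 1) for s in rule)

        rng = random.Random(1)
        for _ in range(3):
            print(derive("<expr>", rng))             # e.g. mean(voltage)+max(diff(current))-min(current)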

    Feature clustering for PSO-based feature construction on high-dimensional data

    Feature construction (FC) refers to a process that uses the original features to construct new features with better discrimination ability. Particle Swarm Optimisation (PSO) is an effective search technique that has been successfully utilised in FC. However, applying PSO to feature construction on high-dimensional data has been a challenge due to the large search space and high computational cost. Moreover, unnecessary features that are irrelevant, redundant, or noisy are constructed when PSO is applied to the whole feature set. Therefore, the main purpose of this paper is to select the most informative features and construct new features from the selected features for better classification performance. Feature clustering methods aggregate similar features into clusters, and the dimensionality of the data is lowered by choosing representative features from every cluster to form the final feature subset. Feature clustering has proven accurate in feature selection (FS); however, only one study has investigated its application in FC for classification, and it identified limitations such as supporting only binary classification and decreasing accuracy on some data. This paper proposes a cluster-based PSO feature construction approach called ClusPSOFC. The Redundancy-Based Feature Clustering (RFC) algorithm is applied to choose the most informative features from the original data, while PSO is used to construct new features from those selected by RFC. Experimental results on six UCI datasets and six high-dimensional datasets demonstrate the efficiency of the proposed method when compared to the original full feature set, other PSO-based FC methods, and standard genetic programming based feature construction (GPFC). Hence, the ClusPSOFC method is effective for feature construction in the classification of high-dimensional data.
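    A minimal sketch of the cluster-then-represent step, using plain hierarchical clustering on feature correlations as a stand-in for RFC (the threshold, data, and relevance criterion are illustrative assumptions):

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import squareform

        rng = np.random.default_rng(0)
        X = rng.normal(size=(50, 40))
        X[:, 20:] = X[:, :20] + 0.1 * rng.normal(size=(50, 20))   # 20 redundant copies
        y = (X[:, 0] + X[:, 5] > 0).astype(int)

        # 1) Cluster features by absolute correlation (stand-in for the RFC step).
        dist = 1 - np.abs(np.corrcoef(X.T))
        np.fill_diagonal(dist, 0.0)
        labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                          t=0.3, criterion="distance")

        # 2) Keep one representative per cluster: the feature most correlated with y.
        rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
        reps = [max(np.flatnonzero(labels == c), key=lambda j: rel[j])
                for c in np.unique(labels)]
        print(len(reps), "representatives from", X.shape[1], "features:", sorted(reps))
        # A PSO run would then construct new features from X[:, reps] only.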

    Comprehensible and Robust Knowledge Discovery from Small Datasets

    Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. Data may represent a set of measurements from a real-world process, or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarizes the data as accurately as possible and (2) comes in a well-understandable form. Decision trees and subgroup discovery methods deliver knowledge summaries in the form of hyperrectangles, which are considered easy to understand. To demonstrate the importance of a comprehensible data summary, we study Decentral Smart Grid Control ("Dezentrale intelligente Netzsteuerung"), a new system that implements demand response in power grids without substantial changes to the infrastructure. The conventional analysis of this system carried out so far was limited to identical participants and therefore did not reflect reality sufficiently well. We run many simulations with different input values and apply decision trees to the resulting data. The resulting comprehensible data summaries yielded new insights into the behavior of Decentral Smart Grid Control. Decision trees make it possible to describe the system behavior for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but in finding regions that lead to particular outputs (so-called subgroups). Existing subgroup discovery algorithms usually require large amounts of data to produce stable and accurate output, yet the data collection process is often costly. Our main contribution is improving subgroup discovery from datasets with few observations. Subgroup discovery in simulated data is called scenario discovery. A frequently used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. In REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS works much better than PRIM on its own: it reduces the number of required simulation runs by 75% on average. With simulated data one has perfect knowledge of the input distribution, which is a prerequisite of REDS. To make REDS applicable to real measurement data, we combined it with sampling from an estimated multivariate distribution of the data. We evaluated the resulting method experimentally in combination with different data-generation methods, for PRIM and for BestInterval, another representative subgroup discovery method. In most cases, our methodology increased the quality of the discovered subgroups.
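    A compressed sketch of the REDS pipeline described above, under stated assumptions: the "simulation" is a cheap placeholder, the input distribution is uniform, and the peeling routine is a greatly simplified stand-in for PRIM (real PRIM also pastes and lets the analyst pick a box from the peeling trajectory):

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def expensive_simulation(X):                 # placeholder for the costly model
            return (X[:, 0] > 0.7) & (X[:, 1] < 0.3)

        rng = np.random.default_rng(0)
        X_small = rng.uniform(size=(80, 3))          # few affordable simulation runs
        y_small = expensive_simulation(X_small)

        # REDS idea: fit an intermediate model, then label cheap synthetic inputs
        # drawn from the (known) input distribution and hand those to PRIM.
        model = RandomForestRegressor(random_state=0).fit(X_small, y_small)
        X_big = rng.uniform(size=(20000, 3))
        y_big = model.predict(X_big)

        def peel(X, y, alpha=0.05, min_support=0.05):
            """Greatly simplified PRIM-style peeling: shrink a box to raise mean(y)."""
            box = np.column_stack([X.min(0), X.max(0)])
            inside = np.ones(len(X), bool)
            while inside.mean() > min_support:
                best = None
                for d in range(X.shape[1]):
                    for lo in (True, False):
                        q = np.quantile(X[inside, d], alpha if lo else 1 - alpha)
                        trial = inside & (X[:, d] >= q if lo else X[:, d] <= q)
                        if trial.any() and (best is None or y[trial].mean() > best[0]):
                            best = (y[trial].mean(), d, lo, q, trial)
                if best is None or best[0] <= y[inside].mean():
                    break                            # no peel improves the box any more
                _, d, lo, q, inside = best
                box[d, 0 if lo else 1] = q
            return box

        print(peel(X_big, y_big))                    # box should approach x0>0.7, x1<0.3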

    Integrating Machine Learning Paradigms for Predictive Maintenance in the Fourth Industrial Revolution era

    In the last decade, manufacturing companies have faced two significant challenges. First, digitalization requires adopting Industry 4.0 technologies and enables the creation of smart, connected, self-aware, and self-predictive factories. Second, the focus on sustainability requires evaluating and reducing the impact of the implemented solutions from economic and social points of view. In manufacturing companies, the maintenance of physical assets plays a critical role. Increasing the reliability and availability of production systems minimizes system downtime; in addition, proper system functioning avoids production waste and potentially catastrophic accidents. Digitalization and new ICT technologies have taken on a relevant role in maintenance strategies. They make it possible to assess the health condition of machinery at any point in time and to predict its future behavior, so that maintenance interventions can be planned and the useful life of components exploited until just before their failure. This dissertation provides insights into the goals and tools of Predictive Maintenance in Industry 4.0 and proposes a novel data acquisition, processing, sharing, and storage framework that addresses typical issues machine producers and users encounter. The research elaborates on two research questions that narrow down the potential approaches to data acquisition, processing, and analysis for fault diagnostics in evolving environments. The research activity is developed according to a research framework in which the research questions are addressed by research levers that are explored according to research topics. Each topic requires a specific set of methods and approaches; however, the overarching methodological approach presented in this dissertation includes three fundamental aspects: maximizing the quality of the input data, using Machine Learning methods for data analysis, and using case studies from both controlled environments (laboratory) and real-world instances.

    A corpus-based study of academic-collocation use and patterns in postgraduate Computer Science students’ writing

    Collocation has been considered a problematic area for L2 learners. Various studies have been conducted to investigate native speakers' (NS) and non-native speakers' (NNS) use of different types of collocations (e.g., Durrant and Schmitt, 2009; Laufer and Waldman, 2011). These studies have indicated that, unlike NS, NNS rely on a limited set of collocations and tend to overuse them. This raises the question: if NNS tend to overuse a limited set of collocations in their academic writing, would their use of academic collocations in a specific discipline (Computer Science in this study) vary from that of NS and expert writers? This study has three main aims. First, it investigates the use of lexical academic collocations in NNS and NS Computer Science students' MSc dissertations and compares their uses with those by expert writers in published research articles. Second, it explores the factors behind the over/underuse of the 24 shared lexical collocations among corpora. Third, it develops awareness-raising activities that could be used to help non-expert NNS students with collocation over/underuse problems. For this purpose, a corpus of 600,000 words was compiled from 55 dissertations (26 written by NS and 29 by NNS). For comparison purposes, a reference corpus of 600,269 words was compiled from 63 research articles from prestigious high impact factor Computer Science academic journals. The Academic Word List (AWL) (Coxhead, 2000) was used to develop lists of the most frequent academic words in the student corpora, whose collocations were examined. Quantitative analysis was then carried out by comparing the 100 most frequent noun and verb collocations from each of the student corpora with the reference corpus. The results reveal that both NNS (52%) and NS (78%) students overuse noun collocations compared to the expert writers in the reference corpus. They underuse only a small number of noun collocations (8%). Surprisingly, neither NNS nor NS students significantly over/underused verb collocations compared to the reference corpus. In order to achieve the second aim, a mixed-methods approach was adopted. First, the variant patterns of the 24 shared noun collocations between the NNS and NS corpora were identified to determine whether over/underuse of these collocations could be explained by differences in the number of patterns used. Approximately half of the 24 collocations used more patterns, including Noun + preposition + Noun and Noun + adjective + Noun, which were rarely found in the writing of experts. Second, a categorisation judgement task and semi-structured interviews were carried out with three Computer Scientists to elicit their views on the various factors likely influencing noun collocation choices by the writers across the corpora. Results demonstrate that three main factors could explain the variation: sub-discipline, topic, and genre. To achieve the third, pedagogical aim, a sample of awareness-raising activities was designed for the problematic over/underuse of some noun collocations. Using the corpus-based Data Driven Learning (DDL) approach (Johns, 1991), three types of awareness-raising activities were developed: noticing collocation, noticing and identifying different patterns of the same collocation, and comparing and contrasting patterns between the NNS students' corpora and the reference corpus.
    Results of this study suggest that academic collocation use in an ESP context (Computer Science) is related to factors other than students' lack of knowledge of collocations. Expertness, genre variation, topic, and discipline-specific collocations proved to be important factors to consider in ESP. Thus, ESP teachers have to alert their students to the effect of these factors on academic collocation use in subject-specific disciplines. This has tangible implications for Applied Linguistics and for teaching practices.
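    As a rough illustration of the frequency-comparison step, the sketch below counts windowed co-occurrences of node words in two toy corpora; it omits POS tagging, lemmatisation, and significance testing, and all words and texts are invented:

        from collections import Counter
        import re

        def collocations(text, node_words, window=2):
            """Count words co-occurring within +/-window of each node (AWL) word."""
            tokens = re.findall(r"[a-z]+", text.lower())
            counts = Counter()
            for i, tok in enumerate(tokens):
                if tok in node_words:
                    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                        if j != i:
                            counts[(tok, tokens[j])] += 1
            return counts

        student = "the proposed method achieves high accuracy and the method analysis shows gains"
        expert = "our analysis of the method indicates that the approach achieves accuracy"
        node_words = {"method", "analysis"}       # e.g. frequent AWL items

        s, e = collocations(student, node_words), collocations(expert, node_words)
        for colloc in sorted(set(s) | set(e)):
            print(colloc, "student:", s[colloc], "expert:", e[colloc])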