9 research outputs found

    An improved multiple classifier combination scheme for pattern classification

    Combining multiple classifiers is considered a promising direction in pattern recognition for improving classification performance. The main problem in multiple classifier combination is that there is no standard guideline for constructing an accurate and diverse classifier ensemble, owing to the difficulty of identifying the appropriate number of homogeneous classifiers and of deciding how to combine their outputs. The most commonly used ensemble method is the random strategy, with majority voting as the combiner. However, the random strategy cannot determine the number of classifiers, and majority voting does not consider the strength of each classifier, resulting in low classification accuracy. In this study, an improved multiple classifier combination scheme is proposed. The ant system (AS) algorithm is used to partition the feature set into feature subsets, whose number determines the number of classifiers. A compactness measure is introduced as a parameter for constructing an accurate and diverse classifier ensemble. A weighted voting technique combines the classifier outputs, taking the strength of each classifier into account prior to voting. Experiments were performed on benchmark datasets using four base classifiers, namely the Nearest Mean Classifier (NMC), Naive Bayes Classifier (NBC), k-Nearest Neighbour (k-NN) and Linear Discriminant Analysis (LDA), to test the credibility of the proposed scheme. The average classification accuracies of the homogeneous NMC, NBC, k-NN and LDA ensembles are 97.91%, 98.06%, 98.09% and 98.12% respectively, higher than those obtained with other approaches to multiple classifier combination. The proposed scheme should help in developing other multiple classifier combinations for pattern recognition and classification.
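A weighted voting combiner of the kind described above can be sketched in a few lines of Python. Treating each classifier's validation accuracy as its voting weight is an assumption here, since the abstract does not specify the exact strength measure:

```python
import numpy as np

def weighted_vote(predictions, weights, n_classes):
    """Combine classifier outputs by weighted voting.

    predictions: (n_classifiers, n_samples) array of predicted class labels
    weights: per-classifier strengths, e.g. validation accuracies
    """
    n_samples = predictions.shape[1]
    scores = np.zeros((n_samples, n_classes))
    for preds, w in zip(predictions, weights):
        for i, label in enumerate(preds):
            scores[i, label] += w          # each vote counts its classifier's weight
    return scores.argmax(axis=1)

# Three classifiers, four samples: the two stronger classifiers outvote the weak one.
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 0, 1, 1]])
combined = weighted_vote(preds, weights=[0.9, 0.8, 0.5], n_classes=2)
# combined -> [0, 1, 1, 0]
```

Note that with equal weights this reduces exactly to the majority voting baseline the abstract criticises.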

    An Enhanced Random Linear Oracle Ensemble Method using Feature Selection Approach based on Naïve Bayes Classifier

    The Random Linear Oracle (RLO) ensemble replaces each classifier with a mini-ensemble of two classifiers, allowing base classifiers to be trained on different subsets of the data and improving the variety of trained classifiers. The Naïve Bayes (NB) classifier was chosen as the base classifier for this research due to its simplicity and computational inexpensiveness. Different feature selection algorithms were applied to the RLO ensemble to investigate the effect of differently sized data on its performance. Experiments were carried out using 30 data sets from the UCI repository and 6 learning algorithms, namely the NB classifier, the RLO ensemble, the RLO ensemble trained with Genetic Algorithm (GA) feature selection using NB classifier accuracy as the fitness function, the RLO ensemble trained with GA feature selection using RLO ensemble accuracy as the fitness function, the RLO ensemble trained with t-test feature selection, and the RLO ensemble trained with Kruskal-Wallis test feature selection. The results showed that the RLO ensemble could significantly improve the diversity of the NB classifier in dealing with distinctively selected feature sets through its fusion-selection paradigm. Consequently, feature selection algorithms can greatly benefit the RLO ensemble: with a properly selected number of features from the filter approach, or GA natural selection from the wrapper approach, it achieved a large improvement in classification accuracy as well as growth in diversity.
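The core RLO idea, a random hyperplane that routes each instance to one of two sub-classifiers trained on the corresponding halves of the data, can be illustrated with a minimal sketch. A toy nearest-mean classifier stands in for the NB base learner, and the blob data and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class NearestMean:
    """Toy base classifier: predicts the class with the nearest mean."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.means_[None]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]

class RandomLinearOracle:
    """One RLO unit: a random hyperplane bisecting two training points
    routes each instance to one of two sub-classifiers."""
    def fit(self, X, y):
        a, b = X[rng.choice(len(X), size=2, replace=False)]
        self.w_ = b - a                        # normal of the bisecting hyperplane
        self.t_ = self.w_ @ (a + b) / 2.0      # threshold at the midpoint
        side = X @ self.w_ > self.t_
        self.models_ = [NearestMean().fit(X[side == s], y[side == s])
                        for s in (False, True)]
        return self
    def predict(self, X):
        side = (X @ self.w_ > self.t_).astype(int)
        out = np.empty(len(X), dtype=int)
        for s in (0, 1):
            mask = side == s
            if mask.any():
                out[mask] = self.models_[s].predict(X[mask])
        return out

# Two well-separated Gaussian blobs as synthetic data.
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
rlo = RandomLinearOracle().fit(X, y)
accuracy = (rlo.predict(X) == y).mean()
```

The full method replaces every ensemble member with such a unit, so diversity comes both from the random hyperplanes and from each sub-classifier seeing different training data.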

    Grammar-based evolutionary approach for automated workflow composition with domain-specific operators and ensemble diversity

    The process of extracting valuable and novel insights from raw data involves a series of complex steps. In the realm of Automated Machine Learning (AutoML), a significant research focus is on automating aspects of this process, specifically tasks like selecting algorithms and optimising their hyper-parameters. A particularly challenging task in AutoML is automatic workflow composition (AWC). AWC aims to identify the most effective sequence of data preprocessing and ML algorithms, coupled with their best hyper-parameters, for a specific dataset. However, existing AWC methods are limited in how many and in what ways they can combine algorithms within a workflow. Addressing this gap, this paper introduces EvoFlow, a grammar-based evolutionary approach for AWC. EvoFlow enhances the flexibility of workflow structure design, empowering practitioners to select algorithms that best fit their specific requirements. EvoFlow stands out by integrating two innovative features. First, it employs a suite of genetic operators, designed specifically for AWC, to optimise both the structure of workflows and their hyper-parameters. Second, it implements a novel updating mechanism that enriches the variety of predictions made by different workflows. Promoting this diversity helps prevent the algorithm from overfitting. With this aim, EvoFlow builds an ensemble whose workflows differ in their misclassified instances. To evaluate EvoFlow's effectiveness, we carried out an empirical validation on a set of classification benchmarks. We began with an ablation study to demonstrate the performance gains attributable to EvoFlow's unique components, and then compared EvoFlow with other AWC approaches, encompassing both evolutionary and non-evolutionary techniques. Our findings show that EvoFlow's specialised genetic operators and updating mechanism substantially outperform current leading methods [...] (Comment: 32 pages, 7 figures, 6 tables, journal paper)
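One simple way to realise "an ensemble whose workflows differ in their misclassified instances" is to select members greedily so that the set of instances missed by every member shrinks. The sketch below is an illustration of that principle only, not EvoFlow's actual updating mechanism:

```python
import numpy as np

def pick_diverse(error_masks, k):
    """Greedily pick k members whose misclassified-instance sets overlap least.

    error_masks: (n_models, n_samples) boolean array, True where a model errs.
    Returns the indices of the chosen members.
    """
    chosen = [int(error_masks.sum(axis=1).argmin())]     # seed with the most accurate model
    while len(chosen) < k:
        covered = error_masks[chosen].all(axis=0)        # instances every member misses
        best, best_score = None, None
        for i in range(len(error_masks)):
            if i in chosen:
                continue
            # prefer the candidate that leaves the fewest jointly-missed instances
            score = np.logical_and(covered, error_masks[i]).sum()
            if best_score is None or score < best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

# Models 0 and 1 err on the same instances; model 2 errs elsewhere,
# so the pair (2, 0) leaves no instance missed by every member.
errors = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 1, 0],
                   [1, 0, 0, 1]], dtype=bool)
selected = pick_diverse(errors, 2)   # -> [2, 0]
```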

    Integrated control of shop-floor operations and product quality: application to the ACTA Mobilier company

    This CIFRE thesis is the result of a collaboration between Acta-Mobilier, a manufacturer of high-end lacquered panels, and the Centre de Recherche en Automatique de Nancy. The idea is to take advantage of the Product Driven System concept in an industrial environment disturbed by many production loops and by a non-negligible rework rate (non-quality), causing significant loss of parts, missed deadlines, unstable workloads, and so on; the impossibility of linking each product to an infotronic identifier further complicates traceability. Work on scheduling and its optimisation is hampered by these disturbances on the production line, which make schedules untenable. Priority processing of defective parts ensures a service rate that remains remarkable given the percentage of parts to repair, but it also leads to loss of parts that prevents full delivery of the order. The scientific problem revolves around the control of flows in a production context disturbed by rework, and around quality control, assessed through its impact on congestion.

    The quality-control issue was addressed using neural networks, each of which predicts the occurrence of the defect to which it is dedicated from production and environmental parameters. This anticipation makes it possible to propose an alternative machine program, or to postpone the planned task. The adaptation of the forecasting model to drifts of the physical model, whose behaviour is considered erratic, is performed on-line using control charts that detect a drift and its start date.

    Despite this simplification of the flows, flow control remains complex owing to normal production loops and residual non-quality. There are different system saturation states for which the most suitable control rule is not always the same. This analysis is presented as a two-dimensional map, each axis of which carries a key indicator of the non-quality rate and/or the disruption of flows. Although, unlike algorithms, the map will not always highlight the most suitable control rule, it has other advantages, such as simplifying control, giving all users important information about the workshop state at a glance, and encouraging homogenisation of the overall state of the production unit.

    In this context, the intelligent container offers interesting perspectives: tracing a group of products sharing the same routing sheet rather than products one by one; sharing information such as the delivery date and the degree of urgency; knowing which paths the products should take through the workshop and what the possible alternatives are; communicating with machines and other systems, including the quality forecasting system; and retaining information throughout the manufacture of the products. The proposed system is therefore interactive, with the container at the heart of the decision. A container reports its presence to the scheduling system only if the quality requirements are met, simplifying the scheduler's work and allowing a simple traditional linear programming algorithm to accomplish a task that seems particularly complicated at first. It remains the scheduler's responsibility to choose the control rule to apply and to request the relevant information from the available lots. The contribution of this thesis is a methodology for simplifying complex problems through a division of work between different acting subsystems, applied to the case of a manufacturer of high-end lacquered panels.
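The on-line drift detection via control charts described in this thesis can be illustrated with a one-sided CUSUM chart over the forecast residuals; the slack and threshold values below are illustrative assumptions, not the thesis's actual settings:

```python
def drift_start(residuals, k=0.5, h=5.0):
    """One-sided CUSUM control chart.

    Accumulates positive deviations beyond the slack k; when the cumulative
    sum exceeds the decision threshold h, returns the index where the current
    excursion began (the estimated drift start date). Returns None if no
    drift is detected.
    """
    s, start = 0.0, None
    for i, r in enumerate(residuals):
        s = max(0.0, s + r - k)      # reset to zero while in control
        if s > 0 and start is None:
            start = i                # candidate start of the excursion
        elif s == 0:
            start = None
        if s > h:
            return start
    return None

# 20 in-control residuals, then a sustained upward shift starting at index 20.
assert drift_start([0.0] * 20 + [1.2] * 10) == 20
```

The appeal of CUSUM here is exactly what the abstract needs: it flags not only that the forecasting model has drifted from the physical process, but also when the drift began.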

    The impact of training data characteristics on ensemble classification of land cover

    Supervised classification of remote sensing imagery has long been recognised as an essential technology for large area land cover mapping. Remote sensing derived land cover and forest classification maps are important sources of information for understanding environmental processes and informing natural resource management decision making. In recent years, the supervised transformation of remote sensing data into thematic products has been advanced through the introduction and development of machine learning classification techniques. Applied to a variety of science and engineering problems over the past twenty years (Lary et al., 2016), machine learning provides greater accuracy and efficiency than traditional parametric classifiers, and is capable of dealing with large data volumes across complex measurement spaces. The random forest (RF) classifier in particular has become popular in the remote sensing community, with a range of commonly cited advantages, including its low parameterisation requirements, excellent classification results, and ability to handle noisy observation data and outliers in a complex measurement space, as well as training data that are small relative to the study area size. In the context of large area land cover classification for forest cover, using multisource remote sensing and geospatial data, this research sets out to examine two proposed advantages of the RF classifier: insensitivity to training data noise (mislabelling) and handling of training data class imbalance. Through margin theory, the research also investigates the utility of ensemble learning, in which multiple base classifiers are combined to reduce generalisation error, as a means of designing more efficient classifiers, improving classification performance, and reducing reference (training and test) data redundancy.
    The first part of the thesis (chapters 2 and 3) introduces the experimental setting and data used in the research, including a description (in chapter 2) of the sampling framework for the reference data used in the classification experiments that follow. Chapter 3 evaluates the performance of the RF classifier applied across a 7.2 million hectare public land study area in Victoria, Australia. This chapter describes an open-source framework for deploying the RF classifier over large areas and processing significant volumes of multi-source remote sensing and ancillary spatial data. The second part of the thesis (research chapters 4 through 6) examines the effect of training data characteristics (class imbalance and mislabelling) on the performance of RF, and explores the application of the ensemble margin as a means of both examining RF classification performance and informing training data sampling to improve classification accuracy. Results of the binary and multiclass experiments described in chapter 4 provide insights into the behaviour of RF when training data are not evenly distributed among classes and contain systematically mislabelled instances. Results show that while the error rate of the RF classifier is relatively insensitive to mislabelled training data (in the multiclass experiment, overall Kappa falls from 78.3% with no mislabelled instances to 70.1% with 25% mislabelling in each class), the associated confidence falls faster than overall accuracy as the rate of mislabelled training data increases. This section also demonstrates that imbalanced training data can be introduced to reduce error in the classes that are most difficult to classify. The relationship between per-class and overall classification performance and the diversity of members in a RF ensemble classifier is explored through experiments presented in chapter 5.
    This research examines ways of targeting particular training data samples to induce RF ensemble diversity and improve per-class and overall classification performance and efficiency. Through use of the ensemble margin, this study offers insights into the trade-off between ensemble classification accuracy and diversity. The research shows that boosting diversity among RF ensemble members, by emphasising the contribution of lower margin training instances in the learning process, is an effective means of improving classification performance, particularly for more difficult or rarer classes, and is a way of reducing information redundancy and improving the efficiency of classification problems. Research chapter 6 looks at the application of the RF classifier for calculating Landscape Pattern Indices (LPIs) from classification prediction maps, and examines the sensitivity of these indices to training data characteristics and to sampling based on the ensemble margin. This research reveals that a range of commonly used LPIs are significantly sensitive to training data mislabelling in RF classification, as well as to margin-based training data sampling. In conclusion, this thesis examines two proposed advantages of the popular machine learning classifier Random forests: relative insensitivity to training data noise (mislabelling) and the ability to handle class imbalance. The research also explores the utility of the ensemble margin for designing more efficient classifiers, measuring and improving classification performance, and designing ensemble classification systems that use reference data more efficiently and effectively, with less data redundancy. These findings have practical applications and implications for large area land cover classification, for which the generation of high quality reference data is often a time-consuming, subjective and expensive exercise.
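The ensemble margin used throughout this thesis is typically computed, for a labelled instance, as the fraction of member votes for the true class minus the largest fraction for any other class. A minimal sketch of that computation, with an assumed toy vote matrix:

```python
import numpy as np

def ensemble_margins(votes, y_true, n_classes):
    """Per-instance ensemble margin in [-1, 1].

    votes: (n_members, n_samples) array of class labels predicted by each member
    y_true: (n_samples,) array of true class labels
    """
    n_members, n_samples = votes.shape
    counts = np.zeros((n_samples, n_classes))
    for member in votes:
        counts[np.arange(n_samples), member] += 1     # tally each member's vote
    counts /= n_members                               # vote fractions
    true_frac = counts[np.arange(n_samples), y_true]
    counts[np.arange(n_samples), y_true] = -1.0       # mask the true class
    return true_frac - counts.max(axis=1)             # true minus best rival

# Three ensemble members voting on two instances whose true class is 0.
votes = np.array([[0, 1],
                  [0, 0],
                  [1, 0]])
margins = ensemble_margins(votes, y_true=np.array([0, 0]), n_classes=2)
# Each instance gets 2/3 of the votes for its true class -> margin 1/3 each.
```

Low-margin instances are the difficult ones; emphasising them in training data sampling is the mechanism chapter 5 uses to boost diversity.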