BagStack Classification for Data Imbalance Problems with Application to Defect Detection and Labeling in Semiconductor Units

Publication date: 01/01/2019

Although machine learning supports the development of computer vision applications by shortening the development cycle, finding a general learning algorithm that solves a wide range of applications is still bounded by the "no free lunch theorem". The search for the right algorithm to solve a specific problem is driven by the problem itself, the data availability and many other requirements. Automated visual inspection (AVI) systems represent a major part of these challenging computer vision applications. They are gaining growing interest in the manufacturing industry as a way to detect defective products and keep them from reaching customers. Defect detection and classification in semiconductor units is challenging because the manufacturing process introduces a range of acceptable variations. Optical inspection systems typically introduce further variations through changes in lighting conditions and misalignment of the imaged units, which makes defect detection more challenging still. In this thesis, a BagStack classification framework is proposed, which makes use of stacking and bagging concepts to handle both variance and bias errors. The classifier is designed to handle the data imbalance and overfitting problems by adaptively transforming the multi-class classification problem into multiple binary classification problems, applying a bagging approach to train a set of base learners for each specific problem, adaptively specifying the number of base learners assigned to each problem, adaptively specifying the number of samples to use from each class, applying a novel data-imbalance aware cross-validation technique to generate the meta-data while taking into account the data imbalance problem at the meta-data level and, finally, using a multi-response random forest regression classifier as a meta-classifier. The BagStack classifier makes use of multiple features to solve the defect classification problem.
To detect defects, a locally adaptive statistical background modeling method is proposed. The proposed BagStack classifier outperforms state-of-the-art image classification techniques on our dataset in terms of overall classification accuracy and average per-class classification accuracy. The proposed detection method achieves high performance on the considered dataset in terms of recall and precision.
Dissertation/Thesis Doctoral Dissertation Computer Engineering 201
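The abstract above names two ingredients, turning a multi-class problem into binary ones and drawing a fixed number of samples from each class for every bagged base learner. The sketch below illustrates only those two steps under simple assumptions; it is not the thesis's BagStack implementation. The helper names `one_vs_rest_labels` and `balanced_bootstrap`, the toy defect labels, and the fixed `n_per_class` are all hypothetical (the thesis chooses sample counts adaptively).

```python
import random
from collections import defaultdict


def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class problem as binary: 1 for positive_class, 0 otherwise."""
    return [1 if y == positive_class else 0 for y in labels]


def balanced_bootstrap(labels, n_per_class, rng):
    """Draw n_per_class indices with replacement from every class.

    Each bag therefore holds the same number of samples per class,
    regardless of how imbalanced the original labels are.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    sample = []
    for cls in sorted(by_class):
        sample.extend(rng.choices(by_class[cls], k=n_per_class))
    return sample


# Hypothetical, heavily imbalanced toy labels (not data from the thesis).
labels = ["ok"] * 8 + ["scratch"] * 2 + ["void"] * 1
binary = one_vs_rest_labels(labels, "scratch")

rng = random.Random(0)
# One bag of training indices per base learner of the "scratch vs. rest" problem.
bags = [balanced_bootstrap(binary, n_per_class=2, rng=rng) for _ in range(3)]
```

Each entry of `bags` would be used to train one base learner; the out-of-fold predictions of those learners would then form the meta-data that a stacking meta-classifier is trained on.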

ASU Digital Repository

Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography

Author: Lior Rokach
Publication venue: 'Elsevier BV'

Crossref

On pruning and feature engineering in Random Forests.

Author: Fawagreh Khaled
Publication date: 31/10/2016

Random Forest (RF) is an ensemble classification technique that was developed by Leo Breiman over a decade ago. Compared with other ensemble techniques, it has proved its accuracy and superiority. Many researchers, however, believe that there is still room to optimize RF further by improving its accuracy. This explains why there have been many extensions of RF, each employing a variety of techniques and strategies to improve certain aspect(s) of RF. The main focus of this dissertation is to develop new extensions of RF using new optimization techniques that, to the best of our knowledge, have never been used before to optimize RF. These techniques are clustering, the local outlier factor, diversified weighted subspaces, and replicator dynamics. Applying these techniques to RF produced four extensions, which we have termed CLUB-DRF, LOFB-DRF, DSB-RF, and RDB-DR, respectively. Experimental studies on 15 real datasets showed favorable results, demonstrating the potential of the proposed methods. Performance-wise, CLUB-DRF is ranked first in terms of accuracy and classification speed, making it ideal for real-time applications and for machines/devices with limited memory and processing power.

Open Access Institutional Repository at Robert Gordon University

Optimized classification predictions with a new index combining machine learning algorithms

Author: Anagnostopoulos Christos-Nikolaos
Niros Antonios D.
Spatharis Sofie
Tamvakis Androniki
Tsirtsis George
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/05/2018

Voting is a commonly used ensemble method that aims to optimize classification predictions by combining results from individual base classifiers. However, the selection of appropriate classifiers to participate in a voting algorithm is currently an open issue. In this study we developed a novel Dissimilarity-Performance (DP) index which incorporates two important criteria for the selection of base classifiers to participate in voting: their differential response in classification (dissimilarity) when combined in triads, and their individual performance. To develop this empirical index, we first used a range of different datasets to evaluate the relationship between voting results and measures of dissimilarity among classifiers of different types (rules, trees, lazy classifiers, functions and Bayes). Second, we computed the combined effect on voting performance of classifiers with different individual performance and/or diverse results. Our DP index was able to rank the classifier combinations according to their voting performance and thus to suggest the optimal combination. The proposed index is recommended for individual machine learning users as a preliminary tool to identify which classifiers to combine in order to achieve more accurate classification predictions while avoiding a computationally intensive and time-consuming search.
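The two quantities the abstract builds on, the result of a vote over a triad of classifiers and the dissimilarity between their predictions, can be illustrated with a small sketch. The abstract does not give the DP index formula, so none is attempted here; the code only computes majority voting, per-classifier accuracy and a plain pairwise disagreement rate. The classifier names and predictions are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier prediction lists by hard majority voting.

    With three voters and binary labels there are no ties.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def disagreement(a, b):
    """Fraction of samples on which two classifiers predict different labels."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical predictions of three base classifiers on six samples.
truth = [1, 0, 1, 1, 0, 1]
preds = {
    "tree": [1, 0, 1, 0, 0, 1],
    "rules": [1, 1, 1, 1, 0, 0],
    "bayes": [0, 0, 1, 1, 0, 1],
}

voted = majority_vote(list(preds.values()))
individual = {name: accuracy(p, truth) for name, p in preds.items()}
tree_vs_rules = disagreement(preds["tree"], preds["rules"])
```

In this toy case every classifier makes at least one mistake, yet the vote recovers all six labels because the mistakes fall on different samples; that interplay between individual performance and dissimilarity is what a selection index of this kind has to capture.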

Enlighten

Learning to Build a Semantic Thesaurus from Free Text Corpora without External Help

Author: Katia Lida Kermanidis
Publication venue: 'IntechOpen'
Publication date: 01/01/2009

IntechOpen

An Effective Multi-Resolution Hierarchical Granular Representation based Classifier using General Fuzzy Min-Max Neural Network

Author: Chen F
Gabrys B
Khuat TT
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019

Motivated by the practical demands for simplification of data towards being consistent with human thinking and problem solving as well as tolerance of uncertainty, information granules are becoming important entities in data processing at different levels of data abstraction. This paper proposes a method to construct classifiers from multi-resolution hierarchical granular representations (MRHGRC) using hyperbox fuzzy sets. The proposed approach forms a series of granular inferences hierarchically through many levels of abstraction. An attractive characteristic of our classifier is that it can maintain a high accuracy in comparison to other fuzzy min-max models at a low degree of granularity by reusing the knowledge learned from lower levels of abstraction. In addition, our approach can reduce the data size significantly as well as handle the uncertainty and incompleteness associated with data in real-world applications. The construction process of the classifier consists of two phases. The first phase formulates the model at the greatest level of granularity, while the second phase aims to reduce the complexity of the constructed model and deduce it from data at higher abstraction levels. Comprehensive experimental analyses on both synthetic and real datasets indicated the efficiency of our method in terms of training time and predictive performance in comparison to other types of fuzzy min-max neural networks and common machine learning algorithms.

arXiv.org e-Print Archive

Crossref

OPUS - University of Technology Sydney

Searching for Needles in the Cosmic Haystack

Author: Devine Thomas Ryan
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2020

Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Majority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance.
Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection.
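SMOTE, the imbalance treatment named above, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The following is a minimal stdlib-only sketch of that interpolation step, not the dissertation's pipeline and not a full SMOTE implementation. The function name `smote_like`, the parameters, and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_new, k, rng):
    """Generate n_new synthetic points from a list of minority-class points.

    For each new point: pick a minority sample, pick one of its k nearest
    minority neighbors, and interpolate at a random position between them.
    Requires at least two minority points.
    """
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical 2-D feature vectors for the rare class.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like(minority, n_new=5, k=2, rng=random.Random(42))
```

The synthetic points are added to the training set only, so that the classifier sees a less skewed class ratio; evaluation should still use the original, imbalanced data.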

The Research Repository @ WVU (West Virginia University)

A Model-driven Visual Analytic Framework for Local Pattern Analysis

Author: Zhao Kaiyu
Publication venue: Digital WPI
Publication date: 09/02/2016

The ultimate goal of any visual analytic task is to make sense of the data and gain insights. Unfortunately, the process of discovering useful information is becoming more challenging due to the growing scale of data. In particular, human cognitive capabilities remain constant whereas the scale and complexity of data do not. Meanwhile, visual analytics relies largely on a human analyst in the loop, which poses a challenge to the traditional human-driven workflow. It is almost impossible to show every detail to the user while diving into a local region of the data to explain phenomena hidden in it. For example, while exploring data subsets, it is always important to determine which partitions of the data contain more important information. Also, determining the subset of features is vital before any further analysis. Furthermore, modeling these subsets of data locally can yield valuable findings but also introduces bias. In this work, a model-driven visual analytic framework is proposed to help identify interesting local patterns from the above three aspects. This dissertation aims to tackle these subproblems in the following three topics: model-driven data exploration, model-driven feature analysis and local model diagnosis. First, model-driven data exploration focuses on the problem of modeling subsets of data to identify the co-movement of time-series data within certain time partitions, which is an important application in a number of domains such as medical science, finance, business and engineering. Second, model-driven feature analysis discovers the important subset of interesting features while analyzing local feature similarities. Within the financial risk dataset collected by a domain expert, we discover that the feature correlations among different data partitions (i.e., small and large companies) are very different.
Third, local model diagnosis provides a tool to identify interesting local regression models in local regions of the data space, which makes it possible for analysts to model the whole data space with a set of local models while knowing their strengths and weaknesses. The three tools provide an integrated solution for identifying interesting patterns within local subsets of data.

DigitalCommons@WPI

Ensemble deep learning: A review

Author: Ganaie M. A.
Hu Minghui
Malik A. K.
Suganthan P. N.
Tanveer M.
Publication date: 06/04/2021

Ensemble learning combines several individual models to obtain better generalization performance. Currently, deep learning models with multilayer processing architectures are showing better performance than shallow or traditional classification models. Deep ensemble learning models combine the advantages of both deep learning and ensemble learning, such that the final model has better generalization performance. This paper reviews the state-of-the-art deep ensemble models and hence serves as an extensive summary for researchers. The ensemble models are broadly categorised into ensemble models like bagging, boosting and stacking, negative correlation based deep ensemble models, explicit/implicit ensembles, homogeneous/heterogeneous ensembles, decision fusion strategies, unsupervised, semi-supervised, reinforcement learning and online/incremental, multilabel based deep ensemble models. Applications of deep ensemble models in different domains are also briefly discussed. Finally, we conclude this paper with some future recommendations and research directions.
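One of the simplest decision fusion strategies covered by reviews of this kind is unweighted averaging of the class-probability vectors produced by the individual models. The sketch below shows only that idea, assuming each model already outputs a probability vector for one input. The function names and the numbers are hypothetical and are not taken from the paper.

```python
def average_probabilities(model_outputs):
    """Average per-class probabilities across models (unweighted soft voting)."""
    n_models = len(model_outputs)
    return [sum(column) / n_models for column in zip(*model_outputs)]


def predict(model_outputs):
    """Return the index of the class with the highest averaged probability."""
    avg = average_probabilities(model_outputs)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical softmax outputs of three networks for one input, three classes.
outputs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.4, 0.3],
]

fused = average_probabilities(outputs)
label = predict(outputs)
```

Unlike hard majority voting, averaging keeps each model's confidence, so a model that is only marginally wrong contributes less to the final decision than one that is confidently right.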

arXiv.org e-Print Archive

Qatar University Institutional Repository