
    Improved Weighted Random Forest for Classification Problems

    Several studies have shown that combining machine learning models appropriately improves on the individual predictions made by the base models. The key to a well-performing ensemble model is diversity among the base models. Among the most common approaches for introducing diversity into decision trees are bagging and random forest. Bagging enhances diversity by sampling with replacement to generate many training data sets, while random forest additionally selects a random subset of features at each split. This has made the random forest a winning candidate for many machine learning applications. However, assuming equal weights for all base decision trees does not seem reasonable, as the randomization of sampling and input feature selection may lead to different levels of decision-making ability across base decision trees. Therefore, we propose several algorithms that modify the weighting strategy of the regular random forest and consequently make better predictions. The designed weighting frameworks include optimal weighted random forest based on accuracy, optimal weighted random forest based on the area under the curve (AUC), performance-based weighted random forest, and several stacking-based weighted random forest models. The numerical results show that the proposed models introduce significant improvements over the regular random forest.
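    As an illustration of the general idea (a minimal sketch, not the paper's exact algorithms), a weighted majority vote can replace the uniform vote of a regular random forest; here the weights are assumed to come from each tree's out-of-bag accuracy:

```python
import numpy as np

# Hypothetical sketch: weight each tree's vote by its out-of-bag accuracy
# instead of the uniform weights a regular random forest uses.
# `tree_preds` holds integer class predictions (rows: trees, cols: samples);
# `oob_acc` holds each tree's out-of-bag accuracy.
def weighted_forest_predict(tree_preds, oob_acc, n_classes):
    weights = oob_acc / oob_acc.sum()              # normalize weights to sum to 1
    votes = np.zeros((tree_preds.shape[1], n_classes))
    for w, preds in zip(weights, tree_preds):
        votes[np.arange(preds.size), preds] += w   # each tree casts a weighted vote
    return votes.argmax(axis=1)                    # class with the heaviest vote wins
```

    Trees with higher out-of-bag accuracy then contribute more to the final prediction, which is the intuition behind the performance-based weighting frameworks described above.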

    An efficiency curve for evaluating imbalanced classifiers considering intrinsic data characteristics: Experimental analysis

    Balancing the accuracy rates of the majority and minority classes is challenging in imbalanced classification. Furthermore, data characteristics have a significant impact on the performance of imbalanced classifiers, yet they are generally neglected by existing evaluation methods. The objective of this study is to introduce a new criterion for comprehensively evaluating imbalanced classifiers. Specifically, we introduce an efficiency curve, established using data envelopment analysis without explicit inputs (DEA-WEI), to determine the trade-off between the benefit of improved minority-class accuracy and the cost of reduced majority-class accuracy. We then analyze the impact of the imbalance ratio and typical imbalanced data characteristics on the efficiency of the classifiers. Empirical analyses using 68 imbalanced data sets reveal that traditional classifiers such as C4.5 and the k-nearest neighbor are more effective on disjunct data, whereas ensemble and undersampling techniques are more effective on overlapping and noisy data. The efficiency of cost-sensitive classifiers decreases dramatically as the imbalance ratio increases. Finally, we investigate the reasons for the different efficiencies of classifiers on imbalanced data and recommend steps for selecting appropriate classifiers based on data characteristics.
    National Natural Science Foundation of China (NSFC) 71874023 71725001 71771037 7197104
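    A minimal sketch of the DEA-WEI building block, under the assumption that each classifier is a decision-making unit with two outputs (majority- and minority-class accuracy) and a unit dummy input; a unit's efficiency is then the optimal value of a small linear program (score 1 means the unit lies on the efficient frontier):

```python
import numpy as np
from scipy.optimize import linprog

# Hedged sketch of DEA without explicit inputs (DEA-WEI): for unit k,
# maximize the weighted sum of its outputs subject to no unit's weighted
# sum exceeding 1, with non-negative output weights.
def dea_wei_efficiency(outputs, k):
    Y = np.asarray(outputs, dtype=float)   # rows: units, cols: outputs
    c = -Y[k]                              # linprog minimizes, so negate the objective
    res = linprog(c, A_ub=Y, b_ub=np.ones(len(Y)),
                  bounds=[(0, None)] * Y.shape[1])
    return -res.fun                        # efficiency score of unit k
```

    A dominated classifier (worse on both classes than some mixture of the others) scores below 1, mirroring how the efficiency curve separates efficient from inefficient imbalanced classifiers.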

    Strategies for Combining Tree-Based Ensemble Models

    Ensemble models have proved effective in a variety of classification tasks. These models combine the predictions of several base models to achieve higher out-of-sample classification accuracy than the base models. Base models are typically trained using different subsets of training examples and input features. Ensemble classifiers are particularly effective when their constituent base models are diverse in terms of their prediction accuracy in different regions of the feature space. This dissertation investigated methods for combining ensemble models, treating them as base models. The goal is to develop a strategy for combining ensemble classifiers that yields higher classification accuracy than the constituent ensemble models. Three of the best-performing tree-based ensemble methods (random forest, extremely randomized trees, and the eXtreme gradient boosting model) were used to generate a set of base models, and the outputs of the resulting classifiers were combined to create an ensemble classifier. The dissertation systematically investigated methods for (1) selecting a set of diverse base models and (2) combining the selected base models. The methods were evaluated using public-domain data sets that have been used extensively for benchmarking classification models. The research established that applying random forest as the final ensemble method, integrating the selected base models and factor scores from multiple correspondence analysis, was the best ensemble approach.
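    A minimal scikit-learn sketch of this kind of strategy, with GradientBoostingClassifier standing in for XGBoost so that only scikit-learn is required, and with a random forest as the final combiner as the dissertation recommends (the data set here is synthetic, for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the benchmarking data sets used in the dissertation.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three tree-based ensembles as base models; another random forest
# combines their cross-validated predictions as the final estimator.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=5)
stack.fit(X_tr, y_tr)
print(round(stack.score(X_te, y_te), 2))
```

    The `cv=5` argument makes the meta-features out-of-fold predictions, which keeps the final random forest from simply memorizing the base models' training-set fit.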

    BagStack Classification for Data Imbalance Problems with Application to Defect Detection and Labeling in Semiconductor Units

    Despite the fact that machine learning shortens the development cycle of computer vision applications, finding a general learning algorithm that solves a wide range of applications is still bounded by the "no free lunch" theorem. The search for the right algorithm to solve a specific problem is driven by the problem itself, data availability, and many other requirements. Automated visual inspection (AVI) systems represent a major part of these challenging computer vision applications. They are gaining growing interest in the manufacturing industry as a way to detect defective products and keep them from reaching customers. Defect detection and classification in semiconductor units is challenging because of the acceptable variations that the manufacturing process introduces. Further variations are typically introduced by optical inspection systems through changes in lighting conditions and misalignment of the imaged units, which makes defect detection more challenging still. In this thesis, a BagStack classification framework is proposed, which uses stacking and bagging to handle both variance and bias errors. The classifier addresses data imbalance and overfitting by adaptively transforming the multi-class classification problem into multiple binary classification problems; applying a bagging approach to train a set of base learners for each problem; adaptively specifying the number of base learners assigned to each problem and the number of samples to use from each class; applying a novel data-imbalance-aware cross-validation technique to generate the meta-data while accounting for imbalance at the meta-data level; and, finally, using a multi-response random forest regression classifier as the meta-classifier.
    The BagStack classifier uses multiple features to solve the defect classification problem. To detect defects, a locally adaptive statistical background model is proposed. The proposed BagStack classifier outperforms state-of-the-art image classification techniques on our dataset in terms of overall classification accuracy and average per-class classification accuracy, and the proposed detection method achieves high recall and precision on the considered dataset.
    Doctoral Dissertation, Computer Engineering, 201
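    One ingredient of this design, class-balanced sampling inside each bag, can be sketched as follows (a simplified illustration, not the thesis's full BagStack pipeline):

```python
import numpy as np

# Hedged sketch: build a class-balanced bootstrap sample so that each base
# learner in a bag sees the minority class as often as the majority class.
def balanced_bootstrap(y, n_per_class, rng):
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)                     # indices of class c
        idx.append(rng.choice(members, size=n_per_class,     # draw with replacement
                              replace=True))
    return np.concatenate(idx)

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)          # a 9:1 imbalanced label vector
sample = balanced_bootstrap(y, 30, rng)    # each class contributes exactly 30 draws
```

    Training each base learner on such a sample counters the bias toward the majority class that a plain bootstrap would inherit from the imbalanced data.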

    Otimização de modelos SARIMA-DEA com ensembles e delineamento de misturas

    Accurate forecasting is crucial for several areas of knowledge, such as Economics, Management, Engineering, and Statistics. There are several approaches to forecasting: time series analysis, regression analysis, artificial neural networks, and others. However, researchers and analysts must guard against overfitting when applying any of these techniques: overfitting occurs when a model has so many parameters that it fits the training set well but predicts the test set poorly. Recently, model combination techniques have become widespread, since ensembles of models demonstrably improve forecast metrics; however, overfitting may still occur in these cases. To overcome this, this thesis proposes an intermediate step between selecting the models for the ensemble and optimizing their weights: the application of a Data Envelopment Analysis (DEA) model suited to the presence of fractional variables, so as not to violate the convexity assumption. To analyze this method, the thesis applies Box & Jenkins models. Decision Making Units (DMUs) are therefore created through a full factorial design, varying the computational parameters. Super-efficiency analysis is applied, and the 4 DMUs with the highest efficiency indexes are retained for later combination through Response Surface Methodology (RSM) optimization in the context of Mixture Design. The application of multivariate statistical techniques for dimensionality reduction is also proposed, to make the problem computationally smaller. To validate the proposed method, a simulation study was created comparing the results with the Naive method. The simulation showed that the proposed method presents, on average, better results.
    Finally, the method was applied to electricity demand series from Brazil and its five geographic regions.
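    As a simplified illustration of convex forecast combination (not the thesis's DEA/RSM mixture-design procedure), mixture weights are non-negative and sum to one; for two models, a coarse grid search over the single free weight looks like:

```python
import numpy as np

# Hedged sketch: choose the convex weight w in [0, 1] for combining two
# forecast series that minimizes the in-sample mean squared error.
def best_convex_pair(f1, f2, actual, steps=101):
    ws = np.linspace(0.0, 1.0, steps)
    errs = [np.mean((w * f1 + (1 - w) * f2 - actual) ** 2) for w in ws]
    return ws[int(np.argmin(errs))]
```

    With more than two models the same idea becomes a constrained optimization over the weight simplex, which is where mixture-design techniques come into play.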

    Constructing ensembles from data envelopment analysis

    It has been shown in prior work in management science, statistics, and machine learning that using an ensemble of models often yields better performance than using a single 'best' model. This paper proposes a novel Data Envelopment Analysis (DEA) based approach to combining models. We prove that for 2-class classification problems, DEA models identify the same convex hull as the popular ROC analysis used for model combination. We further develop two DEA-based methods for combining k-class classifiers. Experiments demonstrate that the two methods outperform other benchmark methods and suggest that DEA can be a powerful tool for model combination.
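    The ROC convex hull that the paper relates to DEA can be sketched directly: given each 2-class classifier's (false-positive rate, true-positive rate) point, keep only the upper-left frontier from (0, 0) to (1, 1):

```python
# Sketch: upper convex hull of classifier points in ROC space.
# Classifiers strictly below the hull are dominated by some randomized
# mixture of the hull classifiers, which is the frontier DEA also finds.
def roc_convex_hull(points):
    # include the trivial always-negative and always-positive classifiers
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # drop the last kept point while it lies on or below the new chord
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```

    Only the classifiers on this frontier are worth retaining when combining 2-class models, which matches the efficient units a DEA model would identify.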