
    Optimized classification predictions with a new index combining machine learning algorithms

    Voting is a commonly used ensemble method that aims to optimize classification predictions by combining the results of individual base classifiers. However, selecting appropriate classifiers to participate in the voting algorithm remains an open issue. In this study we developed a novel Dissimilarity-Performance (DP) index that incorporates two important criteria for selecting base classifiers to participate in voting: their differential response in classification (dissimilarity) when combined in triads, and their individual performance. To develop this empirical index, we first used a range of datasets to evaluate the relationship between voting results and measures of dissimilarity among classifiers of different types (rules, trees, lazy classifiers, functions, and Bayes). Second, we computed the combined effect on voting performance of classifiers with different individual performance and/or diverse outputs. Our DP index was able to rank classifier combinations according to their voting performance and thus to suggest the optimal combination. The proposed index is recommended for individual machine learning users as a preliminary tool for identifying which classifiers to combine in order to achieve more accurate classification predictions while avoiding a computer-intensive and time-consuming search.
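
    A minimal sketch of the kind of selection procedure the abstract describes: score every triad of base classifiers by combining their pairwise disagreement (dissimilarity) with their mean individual accuracy, then build a voting ensemble from the top-ranked triad. The exact DP index formula is not given in the abstract, so the score below (disagreement times accuracy) is only an illustrative placeholder.

```python
# Sketch: rank classifier triads for voting by a dissimilarity/performance score.
# The scoring function is an assumption, not the paper's DP index.
from itertools import combinations
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

base = {
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=5000),
}

# Out-of-fold predictions and individual accuracies for each base classifier.
preds = {name: cross_val_predict(clf, X, y, cv=5) for name, clf in base.items()}
accs = {name: (preds[name] == y).mean() for name in base}

def dp_score(triad):
    """Illustrative dissimilarity-performance score for a triad of classifiers."""
    disagreement = np.mean([
        (preds[a] != preds[b]).mean() for a, b in combinations(triad, 2)
    ])
    performance = np.mean([accs[name] for name in triad])
    return disagreement * performance  # placeholder combination of the two criteria

ranked = sorted(combinations(base, 3), key=dp_score, reverse=True)
best = ranked[0]
voter = VotingClassifier([(name, base[name]) for name in best], voting="hard")
print("best triad:", best,
      "voting accuracy:", cross_val_score(voter, X, y, cv=5).mean())
```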

    Large-scale Multi-label Text Classification - Revisiting Neural Networks

    Neural networks have recently been proposed for multi-label classification because they are able to capture and model label dependencies in the output layer. In this work, we investigate limitations of BP-MLL, a neural network (NN) architecture that aims at minimizing pairwise ranking error. Instead, we propose to use a comparatively simple NN approach with recently proposed learning techniques for large-scale multi-label text classification tasks. In particular, we show that BP-MLL's ranking loss minimization can be efficiently and effectively replaced with the commonly used cross entropy error function, and demonstrate that several advances in neural network training that have been developed in the realm of deep learning can be effectively employed in this setting. Our experimental results show that simple NN models equipped with advanced techniques such as rectified linear units, dropout, and AdaGrad perform as well as or even outperform state-of-the-art approaches on six large-scale textual datasets with diverse characteristics. Comment: 16 pages, 4 figures, submitted to ECML 201
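
    A minimal sketch of the kind of simple multi-label NN the abstract argues for: one hidden ReLU layer, dropout, per-label cross entropy (binary cross entropy over sigmoid outputs) instead of a pairwise ranking loss, and AdaGrad. Layer sizes and hyperparameters are assumptions for illustration, not the paper's settings.

```python
# Sketch of a plain multi-label text classifier with ReLU, dropout, cross entropy,
# and AdaGrad. Dimensions and learning rate are illustrative assumptions.
import torch
import torch.nn as nn

n_features, n_labels = 10_000, 100  # e.g. bag-of-words size and label count (assumed)

model = nn.Sequential(
    nn.Linear(n_features, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, n_labels),   # raw logits; the sigmoid is folded into the loss
)
criterion = nn.BCEWithLogitsLoss()                     # per-label cross entropy
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

def train_step(x_batch, y_batch):
    """One optimization step on a batch of documents and 0/1 label matrices."""
    optimizer.zero_grad()
    loss = criterion(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch to show usage.
x = torch.rand(32, n_features)
y = (torch.rand(32, n_labels) < 0.05).float()
print(train_step(x, y))
```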

    Efficient learning of neighbor representations for boundary trees and forests

    We introduce a semiparametric approach to neighbor-based classification. We build off the recently proposed Boundary Trees algorithm by Mathy et al. (2015), which enables fast neighbor-based classification, regression, and retrieval in large datasets. While boundary trees use a Euclidean measure of similarity, the Differentiable Boundary Tree algorithm by Zoran et al. (2017) was introduced to learn low-dimensional representations of complex input data, on which semantic similarity can be computed to train boundary trees. As its authors point out, the differentiable boundary tree approach has a few limitations that prevent it from scaling to large datasets. In this paper, we introduce Differentiable Boundary Sets, an algorithm that overcomes the computational issues of the differentiable boundary tree scheme and also improves its classification accuracy and data representability. Our algorithm can be implemented efficiently with existing tools and offers a significant reduction in training time. We test and compare the algorithms on the well-known MNIST handwritten digits dataset and the newer Fashion-MNIST dataset by Xiao et al. (2017). Comment: 9 pages, 2 figures
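
    For context, a minimal sketch of the basic Boundary Tree procedure (Mathy et al., 2015) that this line of work builds on, using a plain Euclidean metric: queries greedily descend to the nearest node, and a new example is added only when the tree currently mislabels it. This is a simplified illustration, not the paper's Differentiable Boundary Sets algorithm.

```python
# Sketch of a Boundary Tree with Euclidean similarity; branching details simplified.
import numpy as np

class BoundaryTree:
    def __init__(self, x, y, k=None):
        self.x, self.y, self.k = x, y, k   # stored example, its label, max children
        self.children = []

    def _closest(self, query):
        """Greedily descend: at each node move to the nearest child, or stop."""
        node = self
        while True:
            candidates = list(node.children)
            # the current node competes unless it already has k children
            if node.k is None or len(node.children) < node.k:
                candidates.append(node)
            nxt = min(candidates, key=lambda n: np.linalg.norm(n.x - query))
            if nxt is node:
                return node
            node = nxt

    def query(self, x):
        return self._closest(x).y

    def train(self, x, y):
        """Add (x, y) as a new child only if the tree currently mislabels x."""
        node = self._closest(x)
        if node.y != y:
            node.children.append(BoundaryTree(x, y, self.k))

# Tiny usage example on two shifted Gaussian clusters.
rng = np.random.default_rng(0)
labels = np.arange(200) % 2
X = rng.normal(size=(200, 2)) + np.array([2.0, 0.0]) * labels[:, None]
tree = BoundaryTree(X[0], labels[0])
for xi, yi in zip(X[1:], labels[1:]):
    tree.train(xi, yi)
print("train accuracy:", np.mean([tree.query(xi) == yi for xi, yi in zip(X, labels)]))
```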

    Two-Stage Bagging Pruning for Reducing the Ensemble Size and Improving the Classification Performance

    Ensemble methods, such as the traditional bagging algorithm, can usually improve the performance of a single classifier. However, they typically require large storage space and relatively time-consuming predictions. Many approaches have been developed to reduce the ensemble size and improve the classification performance by pruning the traditional bagging algorithm. In this article, we propose a two-stage strategy to prune the traditional bagging algorithm by combining two simple approaches: accuracy-based pruning (AP) and distance-based pruning (DP). These two methods, as well as their two combinations, "AP+DP" and "DP+AP", as the two-stage pruning strategy, were all examined. Compared with the single pruning methods, we found that the two-stage pruning methods can further reduce the ensemble size and improve the classification performance. The "AP+DP" method generally performs better than the "DP+AP" method when using four base classifiers: decision tree, Gaussian naive Bayes, K-nearest neighbor, and logistic regression. Moreover, compared to traditional bagging, the two-stage method "AP+DP" improved the classification accuracy by 0.88%, 4.06%, 1.26%, and 0.96%, respectively, averaged over 28 datasets under the four base classifiers. "AP+DP" also outperformed three other existing algorithms, Brag, Nice, and TB, assessed on 8 common datasets. In summary, the proposed two-stage pruning methods are simple and promising approaches that both reduce the ensemble size and improve the classification accuracy.
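
    A minimal sketch of what a two-stage "AP+DP" pruning pass over a bagging ensemble could look like. The abstract does not spell out its exact pruning criteria, so here AP keeps the estimators with the highest validation accuracy, and DP then keeps those whose prediction vectors lie farthest from the survivors' mean prediction; both criteria and the retained fractions are assumptions for illustration.

```python
# Sketch: prune a bagging ensemble in two stages (accuracy-based, then distance-based).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X_tr, y_tr)
preds = np.array([est.predict(X_val) for est in bag.estimators_])  # (50, n_val)

# Stage 1 (AP): keep the top half by individual validation accuracy.
acc = (preds == y_val).mean(axis=1)
ap_idx = np.argsort(acc)[-25:]

# Stage 2 (DP): among AP survivors, keep the half whose prediction vectors are
# farthest from the survivors' mean prediction (illustrative diversity criterion).
mean_pred = preds[ap_idx].mean(axis=0)
dist = np.linalg.norm(preds[ap_idx] - mean_pred, axis=1)
keep = ap_idx[np.argsort(dist)[-12:]]

pruned_vote = (preds[keep].mean(axis=0) > 0.5).astype(int)
print("full bagging accuracy:  ", bag.score(X_val, y_val))
print("pruned (AP+DP) accuracy:", (pruned_vote == y_val).mean())
```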

    Classification of blazar candidates of unknown type in Fermi 4LAC by unanimous voting from multiple Machine Learning Algorithms

    The Fermi fourth catalog of active galactic nuclei (AGNs) data release 3 (4LAC-DR3) contains 3407 AGNs, of which 755 are flat spectrum radio quasars (FSRQs), 1379 are BL Lacertae objects (BL Lacs), 1208 are blazars of unknown type (BCUs), and 65 are non-AGNs. Accurate categorization of many unassociated blazars still remains a challenge due to the lack of sufficient optical spectral information. The aim of this work is to use high-precision, optimized machine learning (ML) algorithms to classify BCUs into BL Lacs and FSRQs. To address this, we selected the 4LAC-DR3 Clean sample (i.e., sources with no analysis flags) containing 1115 BCUs. We employ five different supervised ML algorithms, namely, random forest, logistic regression, XGBoost, CatBoost, and neural network, with seven features: Photon index, synchrotron-peak frequency, Pivot Energy, Photon index at Pivot Energy, Fractional variability, νFν at the synchrotron-peak frequency, and Variability index. Combining results from all models leads to better accuracy and more robust predictions. These five methods together classified 610 BCUs as BL Lacs and 333 BCUs as FSRQs with a classification metric area under the curve > 0.96. Our results are also in good agreement with recent studies. The output from this study provides a larger blazar sample with many new targets that could be used for forthcoming multi-wavelength surveys. This work can be further extended by adding features in X-rays, UV, visible, and radio wavelengths. Comment: 22 pages, 10 figures, 3 tables, Accepted in Ap
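
    A minimal sketch of the unanimous-voting step the abstract describes: train several classifiers on the associated blazars (BL Lacs vs. FSRQs) and assign a class to a BCU only when every model agrees. The feature column names are assumed placeholders, sklearn gradient-boosting models stand in for the paper's XGBoost/CatBoost, and loading of the 4LAC-DR3 tables is left out.

```python
# Sketch: classify BCUs only where five stand-in models vote unanimously.
import numpy as np
import pandas as pd
from sklearn.ensemble import (GradientBoostingClassifier,
                              HistGradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed column names for the seven features used in the abstract.
FEATURES = ["photon_index", "nu_syn", "pivot_energy", "photon_index_pivot",
            "frac_variability", "nu_fnu_syn", "variability_index"]

def classify_bcus(train_df: pd.DataFrame, bcu_df: pd.DataFrame) -> pd.Series:
    """Return 'bll'/'fsrq' for BCUs where all models agree, None otherwise."""
    X, y = train_df[FEATURES], train_df["class"]          # 'bll' or 'fsrq'
    models = [
        RandomForestClassifier(n_estimators=300, random_state=0),
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
        GradientBoostingClassifier(random_state=0),        # stand-in for XGBoost
        HistGradientBoostingClassifier(random_state=0),    # stand-in for CatBoost
        make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
    ]
    votes = np.array([m.fit(X, y).predict(bcu_df[FEATURES]) for m in models])
    unanimous = (votes == votes[0]).all(axis=0)
    return pd.Series(np.where(unanimous, votes[0], None), index=bcu_df.index)
```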