Optimized classification predictions with a new index combining machine learning algorithms
Voting is a commonly used ensemble method aiming to optimize classification predictions by combining results from individual base classifiers. However, the selection of appropriate classifiers to participate in the voting algorithm remains an open issue. In this study we developed a novel Dissimilarity-Performance (DP) index which incorporates two important criteria for the selection of base classifiers to participate in voting: their differential response in classification (dissimilarity) when combined in triads, and their individual performance. To develop this empirical index, we first used a range of different datasets to evaluate the relationship between voting results and measures of dissimilarity among classifiers of different types (rules, trees, lazy classifiers, functions and Bayes). Second, we computed the combined effect on voting performance of classifiers with different individual performance and/or diverse results. Our DP index was able to rank the classifier combinations according to their voting performance and thus to suggest the optimal combination. The proposed index is recommended for individual machine learning users as a preliminary tool to identify which classifiers to combine in order to achieve more accurate classification predictions while avoiding computationally intensive and time-consuming searches.
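A minimal sketch of what such a select-then-vote pipeline could look like with scikit-learn. The dataset, the disagreement-rate dissimilarity measure, and the simple sum used as the index score are illustrative assumptions, not the paper's exact DP formula:

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = {
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "lr": LogisticRegression(max_iter=5000),
}

# Cross-validated predictions give each classifier's individual accuracy
# and support a pairwise disagreement (dissimilarity) measure.
preds = {n: cross_val_predict(c, X_tr, y_tr, cv=5) for n, c in base.items()}
acc = {n: (p == y_tr).mean() for n, p in preds.items()}

def dissimilarity(a, b):
    # Fraction of samples on which two classifiers disagree (assumed proxy).
    return (preds[a] != preds[b]).mean()

def dp_score(triad):
    # Illustrative stand-in for the DP index: mean individual accuracy
    # plus mean pairwise dissimilarity within the triad.
    pairs = list(combinations(triad, 2))
    return np.mean([acc[n] for n in triad]) + np.mean(
        [dissimilarity(a, b) for a, b in pairs]
    )

best = max(combinations(base, 3), key=dp_score)
vote = VotingClassifier([(n, base[n]) for n in best], voting="hard")
vote.fit(X_tr, y_tr)
print(best, vote.score(X_te, y_te))
```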
Large-scale Multi-label Text Classification - Revisiting Neural Networks
Neural networks have recently been proposed for multi-label classification
because they are able to capture and model label dependencies in the output
layer. In this work, we investigate limitations of BP-MLL, a neural network
(NN) architecture that aims at minimizing pairwise ranking error. Instead, we
propose to use a comparably simple NN approach with recently proposed learning
techniques for large-scale multi-label text classification tasks. In
particular, we show that BP-MLL's ranking loss minimization can be efficiently
and effectively replaced with the commonly used cross entropy error function,
and demonstrate that several advances in neural network training that have been
developed in the realm of deep learning can be effectively employed in this
setting. Our experimental results show that simple NN models equipped with
advanced techniques such as rectified linear units, dropout, and AdaGrad
perform as well as or even outperform state-of-the-art approaches on six
large-scale textual datasets with diverse characteristics.
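A compact sketch of the kind of model the abstract describes: a feed-forward network with rectified linear units and dropout, trained with per-label cross entropy (in place of BP-MLL's pairwise ranking loss) and AdaGrad. The layer sizes and the random batch are placeholders, not the paper's setup:

```python
import torch
from torch import nn

num_features, num_labels = 10000, 100  # placeholder dimensions

# One hidden ReLU layer with dropout; one output logit per label.
model = nn.Sequential(
    nn.Linear(num_features, 512),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, num_labels),
)

# BCEWithLogitsLoss applies a per-label sigmoid plus cross entropy,
# the usual multi-label replacement for a pairwise ranking loss.
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)

# Dummy batch: bag-of-words-style features, multi-hot label vectors.
x = torch.rand(32, num_features)
y = (torch.rand(32, num_labels) > 0.95).float()

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
print(loss.item())
```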
Efficient learning of neighbor representations for boundary trees and forests
We introduce a semiparametric approach to neighbor-based classification. We
build off the recently proposed Boundary Trees algorithm by Mathy et al.(2015)
which enables fast neighbor-based classification, regression and retrieval in
large datasets. While boundary trees use a Euclidean measure of similarity,
the Differentiable Boundary Tree algorithm by Zoran et al.(2017) was introduced
to learn low-dimensional representations of complex input data, on which
semantic similarity can be calculated to train boundary trees. As is pointed
out by its authors, the differentiable boundary tree approach has a few
limitations that prevent it from scaling to large datasets. In this paper, we
introduce Differentiable Boundary Sets, an algorithm that overcomes the
computational issues of the differentiable boundary tree scheme and also
improves its classification accuracy and data representability. Our algorithm
is efficiently implementable with existing tools and offers a significant
reduction in training time. We test and compare the algorithms on the
well-known MNIST handwritten digits dataset and the newer Fashion-MNIST
dataset by Xiao et al. (2017).
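The core boundary-tree idea is simple enough to sketch: each node stores a training example, a query descends greedily toward the closest node, and a training example is inserted as a new child only when the tree's current prediction for it is wrong. This is a minimal Euclidean version in the spirit of Mathy et al. (2015), not the differentiable variant or the paper's boundary-set algorithm:

```python
import numpy as np

class BoundaryTree:
    """Minimal Euclidean boundary tree (illustrative sketch)."""

    def __init__(self, x, label):
        self.x, self.label, self.children = x, label, []

    def _closest(self, q):
        # Greedy descent: repeatedly move to the nearest node among the
        # current node and its children; stop when the node itself wins.
        node = self
        while True:
            best = min(node.children + [node],
                       key=lambda n: np.linalg.norm(n.x - q))
            if best is node:
                return node
            node = best

    def query(self, q):
        return self._closest(q).label

    def train(self, x, label):
        node = self._closest(x)
        if node.label != label:  # only boundary-defining points are kept
            node.children.append(BoundaryTree(x, label))

# Toy usage on two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
tree = BoundaryTree(X[0], y[0])
for xi, yi in zip(X[1:], y[1:]):
    tree.train(xi, yi)
print(tree.query(np.array([4.0, 4.0])))  # likely 1 for this toy data
```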
Two-Stage Bagging Pruning for Reducing the Ensemble Size and Improving the Classification Performance
Ensemble methods, such as the traditional bagging algorithm, can usually improve the performance of a single classifier. However, they typically require large storage space as well as relatively time-consuming predictions. Many approaches have been developed to reduce the ensemble size and improve the classification performance by pruning the traditional bagging algorithm. In this article, we propose a two-stage strategy to prune the traditional bagging algorithm by combining two simple approaches: accuracy-based pruning (AP) and distance-based pruning (DP). These two methods, as well as their two combinations, “AP+DP” and “DP+AP”, as two-stage pruning strategies, were all examined. Compared with the single pruning methods, we found that the two-stage pruning methods can further reduce the ensemble size and improve the classification performance. The “AP+DP” method generally performs better than the “DP+AP” method when using four base classifiers: decision tree, Gaussian naive Bayes, K-nearest neighbor, and logistic regression. Moreover, compared to traditional bagging, the two-stage method “AP+DP” improved the classification accuracy by 0.88%, 4.06%, 1.26%, and 0.96%, respectively, averaged over 28 datasets under the four base classifiers. It was also observed that “AP+DP” outperformed three other existing algorithms, Brag, Nice, and TB, assessed on 8 common datasets. In summary, the proposed two-stage pruning methods are simple and promising approaches that can both reduce the ensemble size and improve the classification accuracy.
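A rough illustration of the “AP+DP” two-stage idea with scikit-learn bagging. The 50% accuracy cut, the use of Hamming distance between validation prediction vectors as the distance criterion, and the diversity threshold are assumptions made for the sketch, not the article's exact definitions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X_tr, y_tr)
members = list(bag.estimators_)

# Stage 1 -- accuracy-based pruning (AP): keep the top half of members
# ranked by validation accuracy (the 50% cut is an illustrative choice).
accs = [m.score(X_val, y_val) for m in members]
order = np.argsort(accs)[::-1]
kept = [members[i] for i in order[: len(members) // 2]]

# Stage 2 -- distance-based pruning (DP): greedily keep members whose
# validation predictions differ enough (Hamming distance) from those
# already selected, assuming diversity is measured on prediction vectors.
preds = [m.predict(X_val) for m in kept]
selected = [0]
for i in range(1, len(kept)):
    dists = [np.mean(preds[i] != preds[j]) for j in selected]
    if min(dists) > 0.02:  # illustrative diversity threshold
        selected.append(i)
pruned = [kept[i] for i in selected]

def majority_vote(models, X):
    # Binary labels: round the mean of the individual 0/1 predictions.
    votes = np.stack([m.predict(X) for m in models])
    return np.round(votes.mean(axis=0)).astype(int)

print(len(pruned), np.mean(majority_vote(pruned, X_val) == y_val))
```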
Classification of blazar candidates of unknown type in Fermi 4LAC by unanimous voting from multiple Machine Learning Algorithms
The Fermi fourth catalog of active galactic nuclei (AGNs) data release 3
(4LAC-DR3) contains 3407 AGNs, of which 755 are flat spectrum radio quasars
(FSRQs), 1379 are BL Lacertae objects (BL Lacs), 1208 are blazar candidates of
unknown type (BCUs), and 65 are non-AGNs. Accurate categorization of many
unassociated blazars remains a challenge due to the lack of sufficient
optical spectral information. The aim of this work is to use high-precision,
optimized machine learning (ML) algorithms to classify BCUs into BL Lacs and
FSRQs. To address this, we selected the 4LAC-DR3 Clean sample (i.e., sources
with no analysis flags) containing 1115 BCUs. We employ five different
supervised ML algorithms, namely, random forest, logistic regression, XGBoost,
CatBoost, and a neural network, with seven features: Photon index,
synchrotron-peak frequency, Pivot Energy, Photon index at Pivot Energy,
Fractional variability, $\nu F_{\nu}$ at synchrotron-peak frequency, and
Variability index. Combining results from all models leads to better accuracy
and more robust predictions. These five methods together classified 610 BCUs as
BL Lacs and 333 BCUs as FSRQs, with an area under the curve (AUC) of 0.96.
Our results are largely consistent with recent studies.
The output from this study provides a larger blazar sample with many new
targets that could be used for forthcoming multi-wavelength surveys. This work
can be further extended by adding features in X-rays, UV, visible, and radio
wavelengths.
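The unanimous-voting step itself is straightforward to sketch: train several classifiers and accept a label for a source only when every model agrees, leaving everything else unclassified. The synthetic data and the particular models below are stand-ins for the paper's tuned pipeline and the seven 4LAC features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data standing in for the seven 4LAC features.
X, y = make_classification(n_samples=1000, n_features=7, random_state=0)
X_tr, X_bcu, y_tr, _ = train_test_split(X, y, random_state=0)

models = [
    RandomForestClassifier(random_state=0),
    LogisticRegression(max_iter=5000),
    MLPClassifier(max_iter=2000, random_state=0),
]
# XGBoost and CatBoost would slot in here the same way if installed.
for m in models:
    m.fit(X_tr, y_tr)

preds = np.stack([m.predict(X_bcu) for m in models])  # (n_models, n_sources)
unanimous = (preds == preds[0]).all(axis=0)  # True where all models agree

labels = np.where(unanimous, preds[0], -1)  # -1 = left unclassified
print(f"classified {unanimous.sum()} of {len(labels)} sources unanimously")
```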