4,002 research outputs found
ICA as a preprocessing technique for classification
In this paper we propose the use of the independent component analysis (ICA) [1] technique to improve the classification rate of decision trees and multilayer perceptrons [2], [3]. Using ICA in the preprocessing stage makes the structure of both classifiers simpler and therefore improves their generalization properties. The hypothesis behind the proposed preprocessing is that ICA will transform the feature space into one whose components are independent and aligned with the axes, and which is therefore better adapted to the way a decision tree is constructed. Likewise, inferring the weights of a multilayer perceptron becomes much easier, because the gradient search in weight space follows independent trajectories. The result is that the classifiers are less complex and, on some databases, the error rate is lower. This idea is also applicable to regression.
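The preprocessing described above can be illustrated with a minimal scikit-learn sketch, assuming FastICA as the ICA implementation and a synthetic dataset, since the paper's actual databases are not specified here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import FastICA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with correlated features (stand-in for the paper's databases)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

# Baseline: a decision tree on the raw features
baseline = DecisionTreeClassifier(random_state=0)

# ICA-preprocessed: rotate the features toward statistically independent,
# axis-aligned components before the tree's axis-parallel splits
ica_tree = make_pipeline(FastICA(n_components=10, random_state=0, max_iter=1000),
                         DecisionTreeClassifier(random_state=0))

base_acc = cross_val_score(baseline, X, y, cv=5).mean()
ica_acc = cross_val_score(ica_tree, X, y, cv=5).mean()
print(f"raw tree:   {base_acc:.3f}")
print(f"ICA + tree: {ica_acc:.3f}")
```

Whether the ICA variant wins depends on the dataset; the abstract claims lower error only "on some databases".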
Machine learning with the hierarchy‐of‐hypotheses (HoH) approach discovers novel patterns in studies on biological invasions
Research synthesis on simple yet general hypotheses and ideas is challenging in scientific disciplines that study highly context‐dependent systems, such as the medical, social, and biological sciences. This study shows that machine learning, a form of equation‐free statistical modeling from artificial intelligence, is a promising synthesis tool for discovering novel patterns and the sources of controversy around a general hypothesis. We apply a decision tree algorithm, assuming that evidence from various contexts can be adequately integrated in a hierarchically nested structure. As a case study, we analyzed 163 articles on a prominent hypothesis in invasion biology, the enemy release hypothesis. We explored, as a classification problem, whether any of the nine attributes describing each study can differentiate its conclusions. The results corroborated that machine learning can be useful for research synthesis, as the algorithm detected patterns that previous narrative reviews had already highlighted. Compared with a previous synthesis study that assessed the same evidence collection based on experts' judgement, the algorithm newly suggested that studies focusing on Asian regions mostly supported the hypothesis, implying that more detailed investigations in these regions could enhance our understanding of it. We suggest that machine learning algorithms can be a promising synthesis tool especially where studies (a) reformulate a general hypothesis from different perspectives, (b) use different methods or variables, or (c) report insufficient information for conducting meta‐analyses.
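The attribute-based classification of study conclusions described above can be sketched with a decision tree on a toy evidence table. The attribute names and rows below are hypothetical illustrations, not the paper's actual nine attributes or its 163 articles:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical evidence table: each row is one study, the columns are
# contextual attributes, and 'support' encodes the study's conclusion
studies = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", "Americas", "Asia", "Europe"],
    "method": ["field", "field", "lab", "field", "lab", "field"],
    "support": [1, 1, 0, 0, 1, 0],
})

# One-hot encode the categorical attributes and fit a shallow tree,
# so the learned splits form a readable, hierarchically nested rule set
X = pd.get_dummies(studies[["region", "method"]])
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, studies["support"])
print(export_text(tree, feature_names=list(X.columns)))
```

In this toy table the region attribute perfectly separates the conclusions, so the printed rules surface it at the root, mirroring how the algorithm flagged Asian-region studies in the actual synthesis.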
TreeGrad: Transferring Tree Ensembles to Neural Networks
Gradient Boosted Decision Trees (GBDTs) are popular machine learning algorithms, with dedicated implementations such as LightGBM and implementations in popular machine learning toolkits like Scikit-Learn. Many implementations can only produce trees offline and greedily. We explore ways to convert existing GBDT implementations to known neural network architectures with minimal performance loss, in order to allow decision splits to be updated in an online manner, and provide extensions that allow split points to be altered as a neural architecture search problem. We provide learning bounds for our neural network.
Comment: Technical report on an implementation of the Deep Neural Decision Forests algorithm. To accompany the implementation here: https://github.com/chappers/TreeGrad. Update: Please cite as: Siu, C. (2019). "Transferring Tree Ensembles to Neural Networks". International Conference on Neural Information Processing. Springer, 2019. arXiv admin note: text overlap with arXiv:1909.1179
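The core idea of making decision splits updatable online can be illustrated by relaxing a hard tree split into a steep sigmoid, so the threshold becomes a differentiable parameter. This is a generic soft-split sketch, not TreeGrad's exact parameterization:

```python
import numpy as np

def hard_split(x, feature, threshold):
    # Classic tree routing: 1.0 sends a sample to the right child, 0.0 left
    return (x[:, feature] > threshold).astype(float)

def soft_split(x, feature, threshold, temperature=0.1):
    # Differentiable relaxation of the same decision: a steep sigmoid,
    # so the threshold can be updated by gradient descent; as the
    # temperature shrinks, the soft routing approaches the hard one
    return 1.0 / (1.0 + np.exp(-(x[:, feature] - threshold) / temperature))

x = np.array([[0.2], [0.49], [0.51], [0.9]])
print(hard_split(x, 0, 0.5))                  # [0. 0. 1. 1.]
print(np.round(soft_split(x, 0, 0.5), 3))     # [0.047 0.475 0.525 0.982]
```

Samples far from the threshold are routed almost deterministically, while those near it receive intermediate weights that carry gradient signal back to the split point.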
Mixing hetero- and homogeneous models in weighted ensembles
The effectiveness of ensembling for improving classification performance is well documented. Broadly speaking, ensemble design can be expressed as a spectrum: at one end, a set of heterogeneous classifiers model the same data; at the other, homogeneous models derived from the same classification algorithm are diversified through data manipulation. The cross-validation accuracy weighted probabilistic ensemble is a heterogeneous weighted ensemble scheme that needs reliable estimates of error from its base classifiers. It estimates error through a cross-validation process and raises the estimates to a power to accentuate differences. We study the effects of retaining all models trained during cross-validation on the final ensemble's predictive performance, and on the base models' and resulting ensembles' variance and robustness across datasets and resamples. We find that augmenting the ensemble by retaining all trained models provides a consistent and significant improvement, despite reductions in the reliability of the base models' performance estimates.
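The weighting scheme described above, cross-validation accuracy raised to a power and then normalised, can be sketched in a few lines. The function name and the exponent value 4 are illustrative choices, not taken from the paper:

```python
import numpy as np

def cawpe_weights(cv_accuracies, alpha=4):
    # Raise each base classifier's cross-validation accuracy estimate to a
    # power alpha to accentuate differences between classifiers, then
    # normalise so the ensemble weights sum to one
    acc = np.asarray(cv_accuracies, dtype=float)
    w = acc ** alpha
    return w / w.sum()

accs = [0.70, 0.75, 0.90]
print(np.round(cawpe_weights(accs), 3))  # [0.198 0.261 0.541]
```

Note how exponentiation turns a modest accuracy gap (0.75 vs 0.90) into a large weight gap, which is exactly why the scheme depends on those accuracy estimates being reliable.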
Child Mortality in Mozambique: a Review of Recent Trends and Attributable Causes
Data regarding the main causes of death among children in Mozambique are patchy, outdated, and in many cases based on methodologies with underlying limitations that make them unreliable. More robust postmortem methodologies for studying the underlying causes of mortality, currently being introduced at a sentinel surveillance site in the country, will surely help improve our understanding of what is really killing children in this country.
Improving adaptive bagging methods for evolving data streams
We propose two new improvements for bagging methods on evolving data streams. Recently, two new variants of bagging were proposed: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. ASHT Bagging uses trees of different sizes, and ADWIN Bagging uses ADWIN as a change detector to decide when to discard underperforming ensemble members. We improve ADWIN Bagging by using Hoeffding Adaptive Trees, trees that can adaptively learn from data streams that change over time. To speed up ASHT Bagging's adaptation to change, we add an error change detector to each classifier. We test our improvements in an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
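A per-classifier error change detector of the kind added to each ensemble member can be sketched as a simplified two-window comparison over the member's error stream. This is an illustrative stand-in, not the actual ADWIN algorithm:

```python
from collections import deque

class ErrorChangeDetector:
    """Simplified sketch of a per-classifier error change detector:
    compare the error rate in a recent sliding window against an older
    one, and flag change when recent error rises sharply. (Not ADWIN,
    which adapts its window size with statistical guarantees.)"""

    def __init__(self, window=50, threshold=0.2):
        self.old = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def add(self, error):
        # error: 1 if the ensemble member misclassified this example, else 0
        if len(self.recent) == self.recent.maxlen:
            self.old.append(self.recent.popleft())
        self.recent.append(error)
        if len(self.old) < self.old.maxlen:
            return False  # not enough history yet
        return (sum(self.recent) / len(self.recent)
                - sum(self.old) / len(self.old)) > self.threshold

# A member that is accurate for 40 steps, then starts failing after drift
det = ErrorChangeDetector(window=20, threshold=0.3)
drift = [det.add(e) for e in [0] * 40 + [1] * 25]
print(drift.index(True))  # first time step at which change is flagged
```

Once such a detector fires for a member, the ensemble can reset or replace that member instead of waiting for its error to slowly wash out of a global statistic.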
Dengue Fever Outbreak in a Recreation Club, Dhaka, Bangladesh
An outbreak of dengue fever occurred among employees of a recreation club in Bangladesh. Occupational transmission was characterized by a 12% attack rate, no dengue among family contacts, and Aedes vectors in club areas. Early recognition of the outbreak likely limited its impact.
Random forests with random projections of the output space for high dimensional multi-label classification
We adapt the idea of random projections, applied to the output space, to enhance tree-based ensemble methods in the context of multi-label classification. We show how the time complexity of learning can be reduced without affecting the computational complexity or accuracy of prediction. We also show that random output-space projections can be used to reach different bias-variance tradeoffs over a broad panel of benchmark problems, and that this may lead to improved accuracy while significantly reducing the computational burden of the learning stage.
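The output-space projection scheme can be sketched with scikit-learn: compress the label matrix with a random Gaussian projection, fit the forest on the compressed targets, and decode predictions back to label space. This is a simplified version using pseudo-inverse decoding; the paper's actual decoding strategy may differ:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=40, random_state=0)

# Random Gaussian projection of the 40-dimensional label space down to 10
m = 10
G = rng.normal(size=(Y.shape[1], m)) / np.sqrt(m)
Y_proj = Y @ G

# Fit the ensemble on the compressed targets: growing trees against 10
# outputs instead of 40 is what cuts the learning-stage cost
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, Y_proj)

# Decode: map predictions back through the pseudo-inverse, then threshold
Y_hat = (forest.predict(X) @ np.linalg.pinv(G)) > 0.5
hamming = np.mean(Y_hat != Y)
print(f"training-set Hamming loss: {hamming:.3f}")
```

The projection dimension m is the knob behind the bias-variance tradeoff mentioned above: smaller m means cheaper learning but lossier label reconstruction.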
Chains of infinite order, chains with memory of variable length, and maps of the interval
We show how to construct a topological Markov map of the interval whose invariant probability measure is the stationary law of a given stochastic chain of infinite order. In particular, we characterize the maps corresponding to stochastic chains with memory of variable length. The problem treated here is the converse of the classical construction of the Gibbs formalism for Markov expanding maps of the interval.
Computer aided diagnosis for cardiovascular diseases based on ECG signals : a survey
The interpretation of electrocardiography (ECG) signals is difficult, because even subtle changes in the waveform can indicate a serious heart disease. Furthermore, these waveform changes might not be present all the time. As a consequence, it takes years of training for a medical practitioner to become an expert in ECG-based cardiovascular disease diagnosis. That training is a major investment in a specific skill. Even with expert ability, signal interpretation takes time. In addition, human interpretation of ECG signals suffers from interoperator and intraoperator variability. ECG-based Computer-Aided Diagnosis (CAD) holds the promise of improving diagnostic accuracy and reducing cost: the same ECG signal will yield the same diagnostic support regardless of time and place. This paper introduces both the techniques used to realize CAD functionality and the methods used to assess that functionality. The survey aims to instill trust in CAD of cardiovascular diseases using ECG signals by providing both a conceptual overview of such systems and the necessary assessment methods.