Individualized and Global Feature Attributions for Gradient Boosted Trees in the Presence of Regularization
While regularization is widely used in training gradient boosted trees, popular individualized feature attribution methods for trees, such as Saabas and TreeSHAP, overlook the training procedure. We propose Prediction Decomposition Attribution (PreDecomp), a novel individualized feature attribution for gradient boosted trees trained with regularization. Theoretical analysis shows that the inner product between PreDecomp and the labels on in-sample data is essentially the total gain of a tree, and that PreDecomp can faithfully recover additive models in the population case when features are independent. Inspired by this connection between PreDecomp and total gain, we also propose TreeInner, a family of debiased global feature attributions defined, for each tree, as the inner product between any individualized feature attribution and the labels on out-of-sample data. Numerical experiments on a simulated dataset and a genomic ChIP dataset show that TreeInner achieves state-of-the-art feature selection performance. Code reproducing the experiments is available at https://github.com/nalzok/TreeInner .Comment: 43 pages, 29 figures
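The TreeInner idea above, scoring each feature by the inner product between its individualized attribution and the labels on held-out data, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's implementation; the `global_importance` function, the label centering, and the synthetic attribution matrix are all assumptions.

```python
import numpy as np

def global_importance(phi, y):
    # Sketch of a TreeInner-style global attribution: for each feature j,
    # take the inner product between the per-sample attribution phi[:, j]
    # and the (centered) labels y. Centering is an assumption here.
    y = np.asarray(y, dtype=float)
    return phi.T @ (y - y.mean())

# Synthetic out-of-sample attributions: feature 0 drives the label,
# feature 1 is pure noise.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 2))
y = 2.0 * phi[:, 0] + 0.1 * rng.normal(size=200)

scores = global_importance(phi, y)
# The informative feature receives a much larger score than the noise one.
```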
Developing a Data-Driven Statistical Model for Accurately Predicting the Superconducting Critical Temperature of Materials using Multiple Regression and Gradient-Boosted Methods
This study develops a statistical model for estimating the superconducting critical temperature (Tc) of materials using a data-driven strategy. It analyzes 21,263 superconductors and uses a combination of multiple regression and gradient-boosted models to make predictions. The analysis includes a descriptive analysis of the distribution of Tc, feature selection via backward selection, and model diagnostics. The results show that the gradient-boosted method outperformed multiple linear regression, reaching an RMSE of 12.01 and an R2 value of 88.23 after fine-tuning its hyperparameters. The study concludes that the gradient-boosted method is an effective approach for accurately predicting Tc in superconducting materials.
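As a hedged illustration of why gradient boosting can beat a single linear fit, here is a minimal boosting loop that fits decision stumps to residuals on synthetic data. The stump fitter, the candidate thresholds, the learning rate, and the data are all assumptions for the sketch, not the study's actual setup.

```python
import numpy as np

def fit_stump(x, r, thresholds):
    # Find the threshold split of x minimizing squared error on residuals r.
    best = (np.inf, thresholds[0], 0.0, 0.0)
    for t in thresholds:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + 0.1 * rng.normal(size=300)

# Gradient boosting for squared loss: repeatedly fit a stump to residuals.
thresholds = np.quantile(x, np.linspace(0.05, 0.95, 19))
base = y.mean()
pred = np.full_like(y, base)
lr = 0.1
for _ in range(200):
    t, lv, rv = fit_stump(x, y - pred, thresholds)
    pred += lr * np.where(x <= t, lv, rv)

baseline_rmse = np.sqrt(np.mean((y - base) ** 2))
rmse = np.sqrt(np.mean((y - pred) ** 2))
# The boosted ensemble fits the nonlinear signal far better than the mean.
```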
End-to-end Feature Selection Approach for Learning Skinny Trees
Joint feature selection and tree ensemble learning is a challenging task. Popular tree ensemble toolkits, e.g., Gradient Boosted Trees and Random Forests, support feature selection post-training based on feature importances, which are known to be misleading and can significantly hurt performance. We propose Skinny Trees: a toolkit for feature selection in tree ensembles in which feature selection and tree ensemble learning occur simultaneously. It is based on an end-to-end optimization approach that considers feature selection in differentiable trees with Group regularization. We optimize with a first-order proximal method and present convergence guarantees for a non-convex and non-smooth objective. Interestingly, dense-to-sparse regularization scheduling can lead to more expressive and sparser tree ensembles than the vanilla proximal method. On 15 synthetic and real-world datasets, Skinny Trees can achieve high feature compression rates, leading to faster inference over dense trees without any loss in performance. Skinny Trees also lead to superior feature selection over many existing toolkits, e.g., in terms of AUC at a fixed feature budget, outperforming both LightGBM and Random Forests.Comment: Preprint
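The abstract describes a first-order proximal method with group regularization that zeroes out features during training. As a stand-in sketch, the group-lasso proximal operator below shows how a single proximal step can eliminate an entire feature at once: the matrix layout (trees in rows, features in columns) and the group-lasso penalty are assumptions, and group-lasso is a simplification of the paper's regularizer, not its actual method.

```python
import numpy as np

def group_prox(W, lam):
    # Group-lasso proximal step: shrink each column (one feature's weights
    # across all trees) toward zero. Columns whose norm falls below lam are
    # zeroed entirely, which drops that feature from the whole ensemble.
    norms = np.linalg.norm(W, axis=0)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return W * scale

# Toy weights: 2 trees x 3 features; feature 1 is barely used.
W = np.array([[0.9,  0.01, -0.5],
              [1.1, -0.02,  0.4]])
W_sparse = group_prox(W, lam=0.1)
selected = np.linalg.norm(W_sparse, axis=0) > 0
# Feature 1 is removed as a group; features 0 and 2 survive (shrunk).
```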
Impact of Feature Extraction Combined with Data Sampling Methods on Heartbeat Categorization
Dealing with class-imbalanced datasets in data analytics poses challenges, especially with high-dimensional data. To handle this issue, researchers often use preprocessing methods such as feature selection, which attempts to create a more informative and condensed feature set, while data sampling helps alleviate class imbalance. In our study, the aim is to explore the effectiveness of data sampling techniques combined with feature extraction on an ECG heartbeat dataset. We evaluate ensemble classifiers for feature extraction: Decision Trees, Random Forests (RF), and Gradient-Boosted Trees (GBT). For data sampling, we assess the effectiveness of two methods: Random Undersampling (RUS) and the Synthetic Minority Oversampling Technique (SMOTE). Performance is measured using sensitivity and specificity, two important accuracy-related metrics. Our findings show that the combination of RUS and GBT yields the highest performance for ECG heartbeat detection.
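Random undersampling (RUS), one of the two sampling methods evaluated, simply subsamples each class down to the size of the minority class. A minimal sketch follows; the function name and API are assumptions, not the study's code.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    # Balance classes by randomly subsampling every class down to the
    # minority class size (without replacement).
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

# Toy imbalanced data: 8 majority samples vs 2 minority samples.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
Xb, yb = random_undersample(X, y)
# The balanced set has 2 samples per class.
```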
Comparative Analysis of Machine Learning Algorithms for Solar Irradiance Forecasting in Smart Grids
The increasing global demand for clean and environmentally friendly energy
resources has caused increased interest in harnessing solar power through
photovoltaic (PV) systems for smart grids and homes. However, the inherent
unpredictability of PV generation poses problems for smart grid planning
and management, energy trading and market participation, demand response,
reliability, etc. Therefore, solar irradiance forecasting is essential for
optimizing PV system utilization. This study applies machine learning
algorithms such as random forests, Extreme Gradient Boosting (XGBoost),
Light Gradient Boosted Machine (LightGBM) ensembles, CatBoost, and
Multilayer Perceptron Artificial Neural Networks (MLP-ANNs) to forecast
solar irradiance. In addition, Bayesian optimization is applied to
hyperparameter tuning. Unlike tree-based ensemble algorithms, which select
features intrinsically, MLP-ANNs need feature selection as a separate step.
The simulation results indicate that the performance of the MLP-ANNs
improves when feature selection is applied, and that the random forest
outperforms the other learning algorithms.Comment: 6 pages, 4 figures, 3 tables, to appear in the 13th Smart Grid
Conference
Brain Stroke Prediction Model Based on SMOTE and Machine Learning Algorithms
A brain stroke is a critical medical emergency that causes disability and death. Early diagnosis can reduce the complications that affect the brain as a result of the injury. This study presents an analysis of a brain stroke dataset using the KNIME tool, which provides a set of machine learning components such as Random Forest, Decision Tree Learner, Gradient Boosted Trees Learner, and Logistic Regression. The problem of imbalanced data is handled as part of data preprocessing. The factors that affect brain stroke are explored using feature selection approaches such as forward feature selection, backward feature elimination, genetic algorithms, and random selection. The aim is to build a model that helps doctors diagnose the disease accurately based on the results obtained from the study and analysis. The results showed that logistic regression outperformed the other algorithms when applied with forward feature selection and backward feature elimination.
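Forward feature selection, one of the approaches mentioned, is a greedy loop: start from the empty set and repeatedly add the feature that most improves a fit criterion. To stay dependency-free, this sketch scores subsets with ordinary least squares rather than the logistic regression used in the study; the function names and synthetic data are assumptions.

```python
import numpy as np

def r2_score_ols(X, y):
    # In-sample R^2 of an ordinary least-squares fit with an intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, k):
    # Greedily grow the feature set, adding the best single feature each round.
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k:
        scores = [(r2_score_ols(X[:, chosen + [j]], y), j) for j in remaining]
        _, best_j = max(scores)
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 2] - 2 * X[:, 5] + 0.1 * rng.normal(size=300)
picked = forward_select(X, y, k=2)
# The two truly informative features (2 and 5) are recovered.
```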
Boosted Multiple Kernel Learning for First-Person Activity Recognition
Activity recognition from first-person (egocentric) videos has recently
gained attention due to the increasing ubiquity of wearable cameras. There
has been a surge of efforts adapting existing feature descriptors and designing
new descriptors for first-person videos. An effective activity recognition
system requires the selection and use of complementary features and appropriate
kernels for each feature. In this study, we propose a data-driven framework for
first-person activity recognition which effectively selects and combines
features and their respective kernels during training. Our experimental
results show that the use of Multiple Kernel Learning (MKL) and Boosted MKL in
the first-person activity recognition problem yields improved results in
comparison to the state of the art. In addition, these techniques enable the
expansion of the framework with new features in an efficient and convenient
way.Comment: First published in the Proceedings of the 25th European Signal
Processing Conference (EUSIPCO-2017) in 2017, published by EURASIP
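At the core of MKL is the idea of combining one kernel per feature type into a single valid kernel, typically as a nonnegative weighted sum. The sketch below shows only that combination step; real MKL learns the weights from data, which this sketch does not, and the kernel choices and weights here are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Gaussian (RBF) kernel matrix on the rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def combine_kernels(kernels, weights):
    # A nonnegative weighted sum of positive semidefinite kernels is
    # itself a valid (positive semidefinite) kernel.
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0)
    return sum(wi * K for wi, K in zip(w, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K_lin = X @ X.T                     # kernel for one feature type
K_rbf = rbf_kernel(X, gamma=0.5)    # kernel for another feature type
K = combine_kernels([K_lin, K_rbf], [0.3, 0.7])
eigs = np.linalg.eigvalsh(K)
# K stays symmetric positive semidefinite, so it can feed any kernel method.
```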