Individualized and Global Feature Attributions for Gradient Boosted Trees in the Presence of Regularization
While regularization is widely used in training gradient boosted trees, popular individualized feature attribution methods for trees, such as Saabas and TreeSHAP, overlook the training procedure. We propose Prediction Decomposition Attribution (PreDecomp), a novel individualized feature attribution for gradient boosted trees trained with regularization. Theoretical analysis shows that the inner product between PreDecomp and the labels on in-sample data is essentially the total gain of a tree, and that PreDecomp can faithfully recover additive models in the population case when features are independent. Inspired by this connection between PreDecomp and total gain, we also propose TreeInner, a family of debiased global feature attributions defined, for each tree, as the inner product between any individualized feature attribution and the labels on out-of-sample data. Numerical experiments on a simulated dataset and a genomic ChIP dataset show that TreeInner achieves state-of-the-art feature selection performance. Code reproducing the experiments is available at https://github.com/nalzok/TreeInner .Comment: 43 pages, 29 figures
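The TreeInner idea above, scoring each feature by the inner product between its individualized attribution and the labels on held-out data, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's implementation; the `global_importance` function, the label centering, and the synthetic attribution matrix are all assumptions.

```python
import numpy as np

def global_importance(phi, y):
    # Sketch of a TreeInner-style global attribution: for each feature j,
    # take the inner product between the per-sample attribution phi[:, j]
    # and the (centered) labels y. Centering is an assumption here.
    y = np.asarray(y, dtype=float)
    return phi.T @ (y - y.mean())

# Synthetic out-of-sample attributions: feature 0 drives the label,
# feature 1 is pure noise.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 2))
y = 2.0 * phi[:, 0] + 0.1 * rng.normal(size=200)

scores = global_importance(phi, y)
# The informative feature receives a much larger score than the noise one.
```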
Developing a Data-Driven Statistical Model for Accurately Predicting the Superconducting Critical Temperature of Materials using Multiple Regression and Gradient-Boosted Methods
This study develops a statistical model for estimating the superconducting critical temperature (Tc) of materials using a data-driven strategy. It analyzes 21,263 superconductors and uses a combination of multiple regression and gradient-boosted models to make predictions. The analysis includes a descriptive analysis of the distribution of Tc, feature selection via backward selection, and model diagnostics. The results show that the gradient-boosted method outperformed multiple linear regression, reaching an RMSE of 12.01 and an R2 value of 88.23 after fine-tuning its hyperparameters. The study concludes that the gradient-boosted method is an effective approach for accurately predicting Tc in superconducting materials.
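As a hedged illustration of why gradient boosting can beat a single linear fit, here is a minimal boosting loop that fits decision stumps to residuals on synthetic data. The stump fitter, the candidate thresholds, the learning rate, and the data are all assumptions for the sketch, not the study's actual setup.

```python
import numpy as np

def fit_stump(x, r, thresholds):
    # Find the threshold split of x minimizing squared error on residuals r.
    best = (np.inf, thresholds[0], 0.0, 0.0)
    for t in thresholds:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + 0.1 * rng.normal(size=300)

# Gradient boosting for squared loss: repeatedly fit a stump to residuals.
thresholds = np.quantile(x, np.linspace(0.05, 0.95, 19))
base = y.mean()
pred = np.full_like(y, base)
lr = 0.1
for _ in range(200):
    t, lv, rv = fit_stump(x, y - pred, thresholds)
    pred += lr * np.where(x <= t, lv, rv)

baseline_rmse = np.sqrt(np.mean((y - base) ** 2))
rmse = np.sqrt(np.mean((y - pred) ** 2))
# The boosted ensemble fits the nonlinear signal far better than the mean.
```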
End-to-end Feature Selection Approach for Learning Skinny Trees
Joint feature selection and tree ensemble learning is a challenging task. Popular tree ensemble toolkits, e.g., Gradient Boosted Trees and Random Forests, support feature selection post-training based on feature importances, which are known to be misleading and can significantly hurt performance. We propose Skinny Trees: a toolkit for feature selection in tree ensembles in which feature selection and tree ensemble learning occur simultaneously. It is based on an end-to-end optimization approach that considers feature selection in differentiable trees with Group regularization. We optimize with a first-order proximal method and present convergence guarantees for a non-convex and non-smooth objective. Interestingly, dense-to-sparse regularization scheduling can lead to more expressive and sparser tree ensembles than the vanilla proximal method. On 15 synthetic and real-world datasets, Skinny Trees can achieve high feature compression rates, leading to faster inference over dense trees without any loss in performance. Skinny Trees also lead to superior feature selection over many existing toolkits, e.g., in terms of AUC at a fixed feature budget, outperforming both LightGBM and Random Forests.Comment: Preprint
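The abstract describes a first-order proximal method with group regularization that zeroes out features during training. As a stand-in sketch, the group-lasso proximal operator below shows how a single proximal step can eliminate an entire feature at once: the matrix layout (trees in rows, features in columns) and the group-lasso penalty are assumptions, and group-lasso is a simplification of the paper's regularizer, not its actual method.

```python
import numpy as np

def group_prox(W, lam):
    # Group-lasso proximal step: shrink each column (one feature's weights
    # across all trees) toward zero. Columns whose norm falls below lam are
    # zeroed entirely, which drops that feature from the whole ensemble.
    norms = np.linalg.norm(W, axis=0)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return W * scale

# Toy weights: 2 trees x 3 features; feature 1 is barely used.
W = np.array([[0.9,  0.01, -0.5],
              [1.1, -0.02,  0.4]])
W_sparse = group_prox(W, lam=0.1)
selected = np.linalg.norm(W_sparse, axis=0) > 0
# Feature 1 is removed as a group; features 0 and 2 survive (shrunk).
```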
Impact of Feature Extraction Combined with Data Sampling Methods on Heartbeat Categorization
Dealing with class-imbalanced datasets in data analytics poses challenges, especially with high-dimensional data. To handle this issue, researchers often use preprocessing methods such as feature selection, which attempts to create a more informative and condensed feature set, while data sampling helps alleviate class imbalance. In our study, the aim is to explore the effectiveness of data sampling techniques combined with feature extraction on an ECG heartbeat dataset. We evaluate ensemble classifiers for feature extraction: Decision Trees, Random Forests (RF), and Gradient-Boosted Trees (GBT). For data sampling, we assess the effectiveness of two methods: Random Undersampling (RUS) and the Synthetic Minority Oversampling Technique (SMOTE). Performance is measured using sensitivity and specificity, two important accuracy-related metrics. Our findings show that the combination of RUS and GBT yields the highest performance for ECG heartbeat detection.
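Random undersampling (RUS), one of the two sampling methods evaluated, simply subsamples each class down to the size of the minority class. A minimal sketch follows; the function name and API are assumptions, not the study's code.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    # Balance classes by randomly subsampling every class down to the
    # minority class size (without replacement).
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[idx], y[idx]

# Toy imbalanced data: 8 majority samples vs 2 minority samples.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
Xb, yb = random_undersample(X, y)
# The balanced set has 2 samples per class.
```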
Comparative Analysis of Machine Learning Algorithms for Solar Irradiance Forecasting in Smart Grids
The increasing global demand for clean and environmentally friendly energy
resources has caused increased interest in harnessing solar power through
photovoltaic (PV) systems for smart grids and homes. However, the inherent
unpredictability of PV generation poses problems for smart grid planning
and management, energy trading and market participation, demand response,
reliability, etc. Therefore, solar irradiance forecasting is essential for
optimizing PV system utilization. This study applies machine learning
algorithms such as random forests, Extreme Gradient Boosting (XGBoost),
Light Gradient Boosted Machine (LightGBM) ensembles, CatBoost, and
Multilayer Perceptron Artificial Neural Networks (MLP-ANNs) to forecast
solar irradiance. In addition, Bayesian optimization is applied to
hyperparameter tuning. Unlike tree-based ensemble algorithms, which select
features intrinsically, MLP-ANNs need feature selection as a separate step.
The simulation results indicate that the performance of the MLP-ANNs
improves when feature selection is applied, and that the random forest
outperforms the other learning algorithms.Comment: 6 pages, 4 figures, 3 tables, to appear in the 13th Smart Grid
Conference
Brain Stroke Prediction Model Based on SMOTE and Machine Learning Algorithms
A brain stroke is a critical medical emergency that causes disability and death. Early diagnosis can reduce the complications that affect the brain as a result of the injury. This study presents an analysis of a brain stroke dataset using the KNIME tool, which provides a set of machine learning components such as Random Forest, Decision Tree Learner, Gradient Boosted Trees Learner, and Logistic Regression. The problem of imbalanced data is handled as part of data preprocessing. The factors that affect brain stroke are explored using feature selection approaches such as forward feature selection, backward feature elimination, genetic algorithms, and random selection. The aim is to build a model that helps doctors diagnose the disease accurately based on the results obtained from the study and analysis. The results showed that logistic regression outperformed the other algorithms when applied with forward feature selection and backward feature elimination.
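Forward feature selection, one of the approaches mentioned, is a greedy loop: start from the empty set and repeatedly add the feature that most improves a fit criterion. To stay dependency-free, this sketch scores subsets with ordinary least squares rather than the logistic regression used in the study; the function names and synthetic data are assumptions.

```python
import numpy as np

def r2_score_ols(X, y):
    # In-sample R^2 of an ordinary least-squares fit with an intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_select(X, y, k):
    # Greedily grow the feature set, adding the best single feature each round.
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k:
        scores = [(r2_score_ols(X[:, chosen + [j]], y), j) for j in remaining]
        _, best_j = max(scores)
        chosen.append(best_j)
        remaining.remove(best_j)
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 2] - 2 * X[:, 5] + 0.1 * rng.normal(size=300)
picked = forward_select(X, y, k=2)
# The two truly informative features (2 and 5) are recovered.
```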
Boosted Multiple Kernel Learning for First-Person Activity Recognition
Activity recognition from first-person (egocentric) videos has recently
gained attention due to the increasing ubiquity of wearable cameras. There
has been a surge of efforts adapting existing feature descriptors and designing
new descriptors for first-person videos. An effective activity recognition
system requires the selection and use of complementary features and appropriate
kernels for each feature. In this study, we propose a data-driven framework for
first-person activity recognition which effectively selects and combines
features and their respective kernels during training. Our experimental
results show that the use of Multiple Kernel Learning (MKL) and Boosted MKL in
the first-person activity recognition problem yields improved results in
comparison to the state of the art. In addition, these techniques enable the
expansion of the framework with new features in an efficient and convenient
way.Comment: First published in the Proceedings of the 25th European Signal
Processing Conference (EUSIPCO-2017) in 2017, published by EURASIP
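At the core of MKL is the idea of combining one kernel per feature type into a single valid kernel, typically as a nonnegative weighted sum. The sketch below shows only that combination step; real MKL learns the weights from data, which this sketch does not, and the kernel choices and weights here are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Gaussian (RBF) kernel matrix on the rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def combine_kernels(kernels, weights):
    # A nonnegative weighted sum of positive semidefinite kernels is
    # itself a valid (positive semidefinite) kernel.
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0)
    return sum(wi * K for wi, K in zip(w, kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K_lin = X @ X.T                     # kernel for one feature type
K_rbf = rbf_kernel(X, gamma=0.5)    # kernel for another feature type
K = combine_kernels([K_lin, K_rbf], [0.3, 0.7])
eigs = np.linalg.eigvalsh(K)
# K stays symmetric positive semidefinite, so it can feed any kernel method.
```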