High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso
The goal of supervised feature selection is to find a subset of input
features that are responsible for predicting output values. The least absolute
shrinkage and selection operator (Lasso) allows computationally efficient
feature selection based on linear dependency between input features and output
values. In this paper, we consider a feature-wise kernelized Lasso for
capturing non-linear input-output dependency. We first show that, with
particular choices of kernel functions, non-redundant features with strong
statistical dependence on output values can be found in terms of kernel-based
independence measures. We then show that the globally optimal solution can be
efficiently computed; this makes the approach scalable to high-dimensional
problems. The effectiveness of the proposed method is demonstrated through
feature selection experiments with thousands of features. (18 pages.)
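The centered-Gram-matrix idea behind such kernel-based selection can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Gaussian kernels, the regularization strength, and the toy data are all assumptions, and scikit-learn's non-negative Lasso over vectorized centered Gram matrices stands in for a dedicated solver.

```python
# Sketch of feature-wise kernelized Lasso-style selection: one Gram matrix
# per input feature, centered, then a non-negative Lasso against the
# centered output Gram matrix. Hyperparameters are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

def gaussian_gram(v, sigma=1.0):
    d = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d / (2 * sigma ** 2))

def center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only features 0 and 1 drive y, and non-linearly.
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

Ky = center(gaussian_gram(y)).ravel()
Ks = np.column_stack([center(gaussian_gram(X[:, k])).ravel() for k in range(p)])

# Non-zero coefficients mark features with strong kernel-based (HSIC-like)
# dependence on the output; positivity keeps the measure interpretable.
model = Lasso(alpha=0.01, positive=True, fit_intercept=False).fit(Ks, Ky)
selected = np.flatnonzero(model.coef_)
print(selected)
```

Because the problem reduces to a convex Lasso over fixed Gram-matrix "features", the global optimum is computable efficiently, which is the scalability point the abstract makes.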
Reliability and validity in comparative studies of software prediction models
Empirical studies of software prediction models do not converge on the question "which prediction model is best?", and the reason for this lack of convergence is poorly understood. In this simulation study, we examined a frequently used research procedure comprising three main ingredients: a single data sample, an accuracy indicator, and cross validation. Typically, such empirical studies compare a machine learning model with a regression model, and our simulation does the same. The results suggest that it is the research procedure itself that is unreliable, and that this unreliability may strongly contribute to the lack of convergence. Our findings thus cast doubt on the conclusions of any study of competing software prediction models that used this research procedure as the basis of model comparison. We therefore need more reliable research procedures before we can have confidence in the conclusions of comparative studies of software prediction models.
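A toy simulation in this spirit (not the paper's actual design) can show why single-sample cross-validation comparisons are fragile: repeated samples from the same population may disagree about which of two models "wins". The data-generating process, sample size, and model choices below are all assumptions for illustration.

```python
# Draw many independent samples from one population; on each, compare a
# machine learning model (a decision tree) with a regression model via
# 5-fold cross-validated MAE and record which one "wins". Disagreement
# across samples illustrates the procedure's unreliability.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
winners = []
for _ in range(20):
    X = rng.normal(size=(40, 3))
    y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=2.0, size=40)
    tree_mae = -cross_val_score(
        DecisionTreeRegressor(max_depth=3, random_state=0), X, y,
        cv=5, scoring="neg_mean_absolute_error").mean()
    lin_mae = -cross_val_score(
        LinearRegression(), X, y,
        cv=5, scoring="neg_mean_absolute_error").mean()
    winners.append("tree" if tree_mae < lin_mae else "linear")
print(winners)
```

If a single study had drawn only one of these samples, its conclusion about the "best" model would hinge on which sample it happened to get, which is exactly the reliability problem the abstract describes.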
Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection
In high-dimensional model selection problems, penalized simple least-square
approaches have been extensively used. This paper addresses the question of
both robustness and efficiency of penalized model selection methods, and
proposes a data-driven weighted linear combination of convex loss functions,
together with a weighted L1-penalty. It is completely data-adaptive and does
not require prior knowledge of the error distribution. The weighted
L1-penalty is used both to ensure the convexity of the penalty term and to
ameliorate the bias caused by the L1-penalty. In the setting with
dimensionality much larger than the sample size, we establish a strong oracle
property of the proposed method that possesses both the model selection
consistency and estimation efficiency for the true non-zero coefficients. As
specific examples, we introduce a robust composite L1-L2 method and an
optimal composite quantile method, and evaluate their performance in both
simulated and real data examples.
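A rough sketch of the composite-loss idea is below. It uses equal weights over a few quantile levels and a plain L1 penalty, as a toy stand-in for the paper's data-driven weighted combination; the quantile levels, penalty strength, and optimizer are all assumptions, not the paper's estimator.

```python
# Composite quantile loss with an L1 penalty on the shared slopes:
# an average of check (pinball) losses at several quantile levels, with
# one intercept per level. Heavy-tailed errors motivate the robustness.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])  # sparse truth
y = X @ beta_true + rng.standard_t(df=3, size=n)   # heavy-tailed noise

taus = np.array([0.25, 0.5, 0.75])  # equal-weight composite, for illustration
lam = 0.5

def check_loss(r, tau):
    # Pinball loss: tau * r if r >= 0 else (tau - 1) * r.
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

def objective(params):
    b = params[:p]           # slopes shared across quantile levels
    a = params[p:]           # one intercept per quantile level
    comp = sum(check_loss(y - X @ b - a[j], t)
               for j, t in enumerate(taus)) / len(taus)
    return comp + lam * np.sum(np.abs(b)) / n

res = minimize(objective, np.zeros(p + len(taus)), method="Powell")
print(np.round(res.x[:p], 2))
```

The paper's contribution is to choose the combination weights from the data for efficiency while keeping the overall loss convex; the fixed equal weights here only show the shape of the objective.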
Estimating Blood Pressure from Photoplethysmogram Signal and Demographic Features using Machine Learning Techniques
Hypertension is a potentially dangerous health condition that is indicated
directly by blood pressure (BP) and often leads to other health
complications. Continuous monitoring of BP is therefore important; however,
cuff-based BP measurements are discrete and uncomfortable for the user. To
address this need, a cuff-less, continuous, and non-invasive BP measurement
system is proposed that combines Photoplethysmogram (PPG) signals and
demographic features with machine learning (ML) algorithms. PPG signals were
acquired from 219 subjects and underwent pre-processing and feature
extraction steps. Time,
frequency and time-frequency domain features were extracted from the PPG and
their derivative signals. Feature selection techniques were used to reduce the
computational complexity and to decrease the chance of over-fitting the ML
algorithms. The features were then used to train and evaluate ML algorithms.
The best regression models were selected for Systolic BP (SBP) and Diastolic BP
(DBP) estimation individually. Gaussian Process Regression (GPR) combined
with the ReliefF feature selection algorithm outperformed the other
algorithms, estimating SBP and DBP with root-mean-square errors (RMSE) of
6.74 and 3.59, respectively.
This ML model can be implemented in hardware systems to continuously monitor
BP and help avoid critical health conditions caused by sudden changes.
(Accepted for publication in Sensors; 14 figures, 14 tables.)
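The abstract's pipeline (extracted features, feature selection, then GPR) can be sketched schematically on synthetic data. ReliefF is assumed unavailable here, so scikit-learn's mutual-information `SelectKBest` stands in for it; the feature count, kernel, and all numbers are illustrative, not the paper's setup or results.

```python
# Schematic PPG-to-SBP pipeline on synthetic data:
# candidate features -> feature selection -> Gaussian Process Regression.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n, p = 219, 30                     # 219 subjects, 30 candidate features
X = rng.normal(size=(n, p))        # stand-in for extracted PPG features
# Synthetic systolic BP driven by two of the features.
sbp = 120 + 8 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=3, size=n)

Xtr, Xte, ytr, yte = train_test_split(X, sbp, random_state=0)

# Feature selection reduces dimensionality and over-fitting risk
# (mutual information here is a stand-in for ReliefF).
sel = SelectKBest(
    lambda X, y: mutual_info_regression(X, y, random_state=0), k=10
).fit(Xtr, ytr)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(sel.transform(Xtr), ytr)
pred = gpr.predict(sel.transform(Xte))
rmse = mean_squared_error(yte, pred) ** 0.5
print(round(rmse, 2))
```

A deployment would replace the synthetic matrix with the time-, frequency-, and time-frequency-domain features the abstract describes, and would fit separate models for SBP and DBP as the authors do.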