96,952 research outputs found

    Large-scale Nonlinear Variable Selection via Kernel Random Features

    Full text link
    We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions and is not a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model, through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets.
    Comment: Final version for proceedings of ECML/PKDD 2018
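    To make the random-feature idea concrete, here is a minimal numpy sketch (not the authors' implementation): inputs are rescaled by per-variable relevance weights before a random Fourier feature map approximating a Gaussian kernel, and a linear ridge model is fit in the feature space. The `relevance` weights are fixed by hand purely for illustration; in the paper they are learned jointly with the prediction model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: only the first two of five inputs matter.
n, d, D = 500, 5, 200          # samples, input dims, number of random features
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# Per-variable relevance weights (illustrative; learned jointly in the paper).
relevance = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

# Random Fourier features approximating a Gaussian kernel on the rescaled inputs.
W = rng.normal(size=(d, D))            # random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)  # random phases
Z = np.sqrt(2.0 / D) * np.cos((X * relevance) @ W + b)

# Linear ridge regression in the random-feature space (a D x D system, not n x n).
lam = 1e-2
beta = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
print("training RMSE:", np.sqrt(np.mean((Z @ beta - y) ** 2)))
```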

    Model Selection Techniques for Kernel-Based Regression Analysis Using Information Complexity Measure and Genetic Algorithms

    Get PDF
    In statistical modeling, an overparameterized model leads to poor generalization on unseen data points. This issue calls for a model selection technique that appropriately chooses the form and parameters of the proposed model and the independent variables retained for modeling. Model selection is particularly important for linear and nonlinear statistical models, which can be easily overfitted. Recently, support vector machines (SVMs), also known as kernel-based methods, have drawn much attention as the next generation of nonlinear modeling techniques. The model selection issues for SVMs include the selection of the kernel, the corresponding parameters, and the optimal subset of independent variables. In the current literature, k-fold cross-validation is the model selection method most widely used for SVMs by machine learning researchers. However, cross-validation is computationally intensive since one has to fit the model k times. This dissertation introduces the use of a model selection criterion based on the information complexity (ICOMP) measure for kernel-based regression analysis and its applications. ICOMP penalizes both the lack of fit and the complexity of the model to choose the optimal model with good generalization properties. ICOMP provides a simple index for each model and does not require any validation data. It is computationally efficient and has been successfully applied to various linear model selection problems. In this dissertation, we introduce ICOMP to nonlinear kernel-based modeling. Specifically, this dissertation proposes ICOMP and its various forms for kernel ridge regression; kernel partial least squares regression; kernel principal component analysis; kernel principal component regression; relevance vector regression; relevance vector logistic regression; and classification problems. The model selection tasks achieved by our proposed criterion include choosing the form of the kernel function, the parameters of the kernel function, the ridge parameter, the number of latent variables, the number of principal components, and the optimal subset of input variables in a simultaneous fashion for intelligent data mining. The performance of the proposed model selection method is tested on simulation benchmark data sets as well as real data sets. The predictive performance of the proposed model selection criteria is comparable to, and in some cases better than, that of cross-validation, which is far more costly to compute. This dissertation combines the Genetic Algorithm with ICOMP for variable subsetting, which significantly decreases the computational time compared to an exhaustive search of all possible subsets. The GA procedure is shown to be robust and performs well in our repeated simulation examples. Therefore, this dissertation provides researchers with an alternative, computationally efficient model selection approach for data analysis using kernel methods.
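    As a rough illustration of the criterion's structure (not the dissertation's exact formulation), the sketch below scores a ridge regression model with one common form of Bozdogan's ICOMP: a lack-of-fit term (-2 log-likelihood) plus twice the entropic complexity C1 of the estimated parameter covariance matrix. Smaller scores indicate a better fit/complexity trade-off; a genetic algorithm would then search variable subsets by this score.

```python
import numpy as np

def icomp_ridge(X, y, lam):
    """Illustrative ICOMP-style score for a ridge regression model:
    -2 log-likelihood plus twice the entropic complexity C1 of the
    estimated parameter covariance matrix (smaller is better)."""
    n, p = X.shape
    A = X.T @ X + lam * np.eye(p)
    beta = np.linalg.solve(A, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / n
    neg2loglik = n * np.log(2 * np.pi * sigma2) + n

    # Approximate covariance of the ridge estimator.
    A_inv = np.linalg.inv(A)
    cov = sigma2 * A_inv @ (X.T @ X) @ A_inv
    s = cov.shape[0]
    c1 = 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * np.linalg.slogdet(cov)[1]
    return neg2loglik + 2 * c1

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X[:, 0] - 2 * X[:, 1] + 0.3 * rng.normal(size=100)
print(icomp_ridge(X, y, lam=1.0))
```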

    Comparative Analysis of Predictive Performance in Nonparametric Functional Regression: A Case Study of Spectrometric Fat Content Prediction

    Get PDF
    Objective: This research aims to compare two nonparametric functional regression models, the Kernel Model and the K-Nearest Neighbor (KNN) Model, with a focus on predicting scalar responses from functional covariates. Two semi-metrics, one based on second derivatives and the other on Functional Principal Component Analysis, are employed for prediction. The study assesses the accuracy of these models by computing Mean Square Errors (MSE) and provides practical applications for illustration. Method: The study delves into the realm of nonparametric functional regression, where the response variable (Y) is scalar and the covariate variable (x) is a function. The Kernel Model, known as funopare.kernel.cv, and the KNN Model, termed funopare.knn.gcv, are used for prediction. The Kernel Model employs automatic bandwidth selection via Cross-Validation, while the KNN Model employs a global smoothing parameter. The performance of both models is evaluated using MSE, considering two different semi-metrics. Results: The results indicate that the KNN Model outperforms the Kernel Model in terms of prediction accuracy, as supported by the computed MSE. The choice of semi-metric, whether based on second derivatives or Functional Principal Component Analysis, impacts the model's performance. Two real-world applications, Spectrometric Data for predicting fat content and Canadian Weather Station data for predicting precipitation, demonstrate the practicality and utility of the models. Conclusion: This research provides valuable insights into nonparametric functional regression methods for predicting scalar responses from functional covariates. The KNN Model, when compared to the Kernel Model, offers superior predictive performance. The selection of an appropriate semi-metric is essential for model accuracy. Future research may explore the extension of these models to cases involving multivariate responses and consider interactions between response components.
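    The sketch below illustrates the flavor of these estimators (it is not the funopare.kernel.cv or funopare.knn.gcv code): a semi-metric built from second differences of the discretized curves stands in for the second-derivative semi-metric, and a k-nearest-neighbor average of the responses gives the scalar prediction.

```python
import numpy as np

def semimetric_deriv2(curve_a, curve_b):
    """Semi-metric based on second differences of discretized curves
    (a crude stand-in for the second-derivative semi-metric)."""
    d2a = np.diff(curve_a, n=2)
    d2b = np.diff(curve_b, n=2)
    return np.sqrt(np.sum((d2a - d2b) ** 2))

def knn_functional_predict(X_train, y_train, x_new, k=5):
    """Predict a scalar response from a functional covariate by averaging
    the responses of the k nearest training curves under the semi-metric."""
    dists = np.array([semimetric_deriv2(x_new, xi) for xi in X_train])
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# Toy example: curves observed on 50 grid points, a scalar response per curve.
rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
X_train = np.array([np.sin(2 * np.pi * (t + s)) for s in rng.uniform(0, 1, 100)])
y_train = X_train[:, 10] + 0.05 * rng.normal(size=100)
print(knn_functional_predict(X_train, y_train, X_train[0], k=5))
```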

    MODELING THE PERCENTAGE OF POOR POPULATION IN THE REGENCIES AND CITIES OF CENTRAL JAVA WITH A MIXED GEOGRAPHICALLY WEIGHTED REGRESSION APPROACH

    Get PDF
    Regression analysis is a statistical analysis that models the relationship between the response variable and the predictor variables. Geographically Weighted Regression (GWR) is an extension of linear regression that incorporates the geographical location where the response variable is observed, so that the resulting parameters are local. Mixed Geographically Weighted Regression (MGWR) combines a linear regression model with GWR by modeling some variables as local and others as global. The MGWR model parameters are estimated in the same way as for GWR, using Weighted Least Squares (WLS), and the optimum bandwidth is selected by Cross Validation (CV). Applying the MGWR model to the percentage of poor people in the regencies and cities of Central Java shows that the MGWR model differs significantly from the global regression model, and that the models generated for each area differ from one another. Based on the Akaike Information Criterion (AIC) for the global regression, GWR, and MGWR models, the MGWR model with a Gaussian kernel weighting function is the best model for analyzing the percentage of poor people in the regencies and cities of Central Java because it has the smallest AIC value. Keywords: Akaike Information Criterion, Cross Validation, Gaussian Kernel Function, Mixed Geographically Weighted Regression, Weighted Least Square
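    The core computation behind GWR (and the local part of MGWR) is a locally weighted least-squares fit at each location. The sketch below shows that single step with a Gaussian kernel on geographic distances; the bandwidth is fixed by hand here for illustration, whereas the paper selects it by Cross Validation.

```python
import numpy as np

def gwr_fit_at(u, coords, X, y, bandwidth):
    """Local WLS estimate of the GWR coefficients at location u,
    using a Gaussian kernel on the distances to all observations:
    beta(u) = (X' W X)^{-1} X' W y."""
    d = np.linalg.norm(coords - u, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)       # Gaussian kernel weights
    W = np.diag(w)
    XtW = X.T @ W
    return np.linalg.solve(XtW @ X, XtW @ y)

# Toy data: intercept plus one predictor, coefficients drift with location.
rng = np.random.default_rng(3)
coords = rng.uniform(0, 10, size=(200, 2))
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1])
y = (1 + 0.2 * coords[:, 0]) + (2 - 0.1 * coords[:, 1]) * x1 + 0.1 * rng.normal(size=200)
print(gwr_fit_at(coords[0], coords, X, y, bandwidth=2.0))
```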

    Kernel based methods for accelerated failure time model with ultra-high dimensional data

    Get PDF
    Background: Most genomic data have ultra-high dimensions with more than 10,000 genes (probes). Regularization methods with L1 and Lp penalties have been extensively studied in survival analysis with high-dimensional genomic data. However, when the sample size n ≪ m (the number of genes), directly identifying a small subset of genes from ultra-high (m > 10,000) dimensional data is time-consuming and not computationally efficient. In current microarray analysis, what people really do is select a couple of thousands (or hundreds) of genes using univariate analysis or statistical tests, and then apply a LASSO-type penalty to further reduce the number of disease-associated genes. This two-step procedure may introduce bias and inaccuracy and lead us to miss biologically important genes.
    Results: The accelerated failure time (AFT) model is a linear regression model and a useful alternative to the Cox model for survival analysis. In this paper, we propose a nonlinear kernel-based AFT model and an efficient variable selection method with adaptive kernel ridge regression. Our proposed variable selection method is based on the kernel matrix and the dual problem with a much smaller n × n matrix. It is very efficient when the number of unknown variables (genes) is much larger than the number of samples. Moreover, the primal variables are explicitly updated and the sparsity in the solution is exploited.
    Conclusions: Our proposed methods can simultaneously identify survival-associated prognostic factors and predict survival outcomes with ultra-high dimensional genomic data. We have demonstrated the performance of our methods with both simulation and real data. The proposed method performs superbly in our limited computational studies.
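    The sketch below is not the paper's adaptive kernel AFT procedure, but it illustrates why the dual formulation matters when m ≫ n: kernel ridge regression only ever solves an n × n linear system, no matter how many features (genes) there are.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1e-4):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

# Toy "ultra-high dimensional" setting: n samples, m >> n features.
rng = np.random.default_rng(4)
n, m = 100, 10_000
X = rng.normal(size=(n, m))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)   # e.g. log survival times in an AFT-style model

# Dual kernel ridge regression: only an n x n system is solved.
lam = 1.0
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Predict for new samples via the kernel against the training set.
X_new = rng.normal(size=(5, m))
print(gaussian_kernel(X_new, X) @ alpha)
```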

    Sparse Machine Learning Methods for Prediction and Personalized Medicine

    Get PDF
    With growing interest in using black-box machine learning for complex data with many feature variables, it is critical to obtain a prediction model that depends only on a small set of features to maximize generalizability. Therefore, feature selection remains an important and challenging problem in modern applications. Most existing methods for feature selection are based on either parametric or semiparametric models, so the resulting performance can severely suffer from model misspecification when high-order nonlinear interactions among the features are present. A very limited number of approaches for nonparametric feature selection have been proposed, but they are computationally intensive and may not even converge. Thus, nonparametric feature selection for high-dimensional data is an important problem in statistics and machine learning. Furthermore, in the field of precision medicine, machine learning techniques are usually applied to a large health dataset containing patients' information to find an optimal individual treatment rule (ITR), which makes the learning process computationally demanding. Identifying the truly important feature variables shortens the computation time and saves the cost of collecting redundant data. Therefore, in this dissertation we focus on developing machine learning techniques to perform variable selection for both prediction and personalized medicine.
    In the first project, we propose a novel and computationally efficient approach for nonparametric feature selection in regression based on a tensor-product kernel function over the feature space. The importance of each feature is governed by a parameter in the kernel function, which can be efficiently computed iteratively with a modified alternating direction method of multipliers (ADMM) algorithm. We prove the oracle selection property of the proposed method. Finally, we demonstrate the superior performance of our approach compared to existing methods via simulation studies and an application to the prediction of Alzheimer's disease.
    In the second project, we propose a new framework to perform nonparametric feature selection for both regression and classification problems. Under this framework, we learn prediction functions through empirical risk minimization over a reproducing kernel Hilbert space (RKHS). The space is generated by a novel tensor product kernel that depends on a set of parameters determining the importance of the features. Computationally, we minimize the empirical risk with a penalty to estimate the prediction and kernel parameters simultaneously. The solution can be obtained by iteratively solving convex optimization problems. We study the theoretical properties of the kernel feature space and prove the oracle selection property and Fisher consistency of our proposed method. Finally, we demonstrate the superior performance of our approach compared to existing methods via extensive simulation studies and an application to a microarray study of eye disease in animals.
    Finally, we focus on applying the nonparametric feature selection framework to treatment decision making with high-dimensional data. We directly estimate the decision function in a Reproducing Kernel Hilbert Space (RKHS) generated by a novel tensor product kernel with parameters capturing the importance of each variable. Computationally, we adopt two steps to separate the estimating and tuning procedures, which makes the computation faster and more stable. Finally, we demonstrate the superior performance of our approach compared to existing methods via one simulation study and an application to type 2 diabetes.
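    A minimal sketch of the central construction is given below (the importance parameters are fixed by hand here; in the dissertation they are estimated jointly with the prediction function via penalized risk minimization and ADMM): a tensor-product Gaussian kernel in which a nonnegative parameter per feature controls that feature's contribution, with a zero parameter removing the feature entirely.

```python
import numpy as np

def tensor_product_kernel(A, B, theta):
    """Tensor-product Gaussian kernel: a product over features of
    one-dimensional kernels, each scaled by a nonnegative importance
    parameter theta[j]; theta[j] = 0 removes feature j entirely."""
    K = np.ones((A.shape[0], B.shape[0]))
    for j, tj in enumerate(theta):
        diff = A[:, j][:, None] - B[:, j][None, :]
        K *= np.exp(-tj * diff ** 2)
    return K

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 6))
y = np.sin(X[:, 0]) * X[:, 1] + 0.1 * rng.normal(size=150)

# Importance parameters (fixed here for illustration; learned in the dissertation).
theta = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Kernel ridge fit in the induced RKHS.
lam = 0.1
K = tensor_product_kernel(X, X, theta)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
print("training RMSE:", np.sqrt(np.mean((K @ alpha - y) ** 2)))
```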

    Kernel-based Information Criterion

    Full text link
    This paper introduces the Kernel-based Information Criterion (KIC) for model selection in regression analysis. The novel kernel-based complexity measure in KIC efficiently computes the interdependency between parameters of the model using a variable-wise variance and yields selection of better, more robust regressors. Experimental results show superior performance on both simulated and real data sets compared to Leave-One-Out Cross-Validation (LOOCV), kernel-based Information Complexity (ICOMP), and the maximum log marginal likelihood in Gaussian Process Regression (GPR).
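    The abstract does not give enough detail to reproduce KIC itself, but one of its baselines, LOOCV, has a convenient closed form for kernel ridge regression: with hat matrix H = K(K + λI)^{-1}, the leave-one-out residual at point i is (y_i − ŷ_i)/(1 − H_ii), so no refitting is needed. A minimal sketch:

```python
import numpy as np

def loocv_kernel_ridge(K, y, lam):
    """Closed-form leave-one-out CV error for kernel ridge regression.
    With hat matrix H = K (K + lam I)^{-1}, the LOO residual for point i
    is (y_i - yhat_i) / (1 - H_ii), so the model is fit only once."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))
    resid = y - H @ y
    loo = resid / (1.0 - np.diag(H))
    return np.mean(loo ** 2)

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=80)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)

# Pick the ridge parameter with the smallest LOOCV score.
for lam in (0.01, 0.1, 1.0):
    print(lam, loocv_kernel_ridge(K, y, lam))
```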