
    Robust order selection of mixtures of regression models with random effects

    Finite mixtures of regression models with random effects are a very flexible statistical tool, as they model the heterogeneity of the population while accounting for multiple correlated observations from the same individual. Selecting the number of components for these models has been a long-standing challenge in statistics. Moreover, most existing methods for estimating the number of components are not robust and are therefore quite sensitive to outliers. In this article we study robust estimation of the number of components for mixtures of regression models with random effects, comparing the performance of trimmed information and classification criteria with that of the traditional information and classification criteria. A simulation study and a real-world application showcase the superiority of the trimmed criteria in the presence of contaminated data. Funded by Fundação para a Ciência e Tecnologia (FCT).
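    As a rough illustration of the trimming idea, the hypothetical sketch below fits plain Gaussian mixtures (a simplification of the paper's mixtures of regressions with random effects, which it trims after fitting rather than during estimation) and computes a trimmed BIC that discards the worst-fitting fraction of observations before penalising model size. All names and the trimming fraction are assumptions, not the authors' implementation.

```python
# Illustrative trimmed BIC for mixture order selection (simplified sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

def trimmed_bic(X, n_components, trim_frac=0.05):
    # Fit the mixture, then drop the trim_frac worst-fitting observations
    # before computing BIC, so gross outliers cannot inflate the order.
    gm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    ll = np.sort(gm.score_samples(X))[::-1]         # per-point log-likelihoods
    kept = ll[: int(len(X) * (1 - trim_frac))]
    d = X.shape[1]                                   # full-covariance parameter count
    n_params = n_components * (d + d * (d + 1) // 2) + (n_components - 1)
    return -2.0 * kept.sum() + n_params * np.log(len(kept))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(5.0, 1.0, (200, 2)),
               rng.uniform(-20, 20, (20, 2))])       # ~5% gross contamination
best = min(range(1, 6), key=lambda g: trimmed_bic(X, g))
print("components selected by trimmed BIC:", best)
```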

    Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression

    One important issue commonly encountered in the analysis of microarray data is deciding which and how many genes should be selected for further study. For discriminant microarray data analyses based on statistical models, such as logistic regression models, gene selection can be accomplished by comparing the maximum likelihood of the model given the real data, $\hat{L}(D|M)$, with the expected maximum likelihood of the model given an ensemble of surrogate data with randomly permuted labels, $\hat{L}(D_0|M)$. Typically, the computational burden of obtaining $\hat{L}(D_0|M)$ is immense, often exceeding the limits of available computing resources by orders of magnitude. Here, we propose an approach that circumvents such heavy computation by mapping the simulation problem to an extreme-value problem. We present the derivation of an asymptotic distribution of the extreme value as well as its mean, median, and variance. Using this distribution, we propose two gene selection criteria, and we apply them to two microarray datasets and three classification tasks for illustration. Comment: to be published in Journal of Computational Biology (2004).
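    A minimal sketch of the comparison being avoided: fit the model to a modest number of label permutations and summarise the null maximised log-likelihoods with a fitted Gumbel (extreme-value) distribution, instead of exhaustive simulation. The paper derives the asymptotic distribution analytically; the empirical fit below, and every name in it, are illustrative assumptions.

```python
# Sketch: estimate the permutation-null maximised log-likelihood L(D0|M) for a
# one-gene logistic-regression model via an extreme-value (Gumbel) summary.
import numpy as np
from scipy.stats import gumbel_r
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 1))                       # expression of one gene
y = (X[:, 0] + rng.normal(scale=2.0, size=80) > 0).astype(int)

def max_loglik(X, y):
    m = LogisticRegression(C=1e6).fit(X, y)        # near-unpenalised ML fit
    p = np.clip(m.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

L_real = max_loglik(X, y)
L_null = [max_loglik(X, rng.permutation(y)) for _ in range(200)]
loc, scale = gumbel_r.fit(L_null)                  # extreme-value summary of the null
print(f"L(D|M) = {L_real:.2f}, permutation-null mean ~ {gumbel_r.mean(loc, scale):.2f}")
```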

    Permuted Inclusion Criterion: A Variable Selection Technique

    We introduce a new variable selection technique called the Permuted Inclusion Criterion (PIC), based on augmenting the predictor space X with a row-permuted version denoted Xπ. We adopt the linear regression setup with n observations on p variables, so the augmented space has p real predictors and p permuted predictors. This construction has many desirable properties for variable selection: it preserves relations between variables (e.g. squares and interactions), it equates the moments and covariance structure of X and Xπ, and, more importantly, Xπ scales with the size of X. We motivate the idea with forward selection: the first time we select a predictor from Xπ, we stop. As this stopping point depends on the permutation, we simulate many permutations and obtain a distribution of models and stopping points, which has the added benefit of quantifying how certain we are about stopping. Whereas variable selection typically penalizes each additional variable by a prespecified amount, our method uses a data-adaptive penalty. We apply this method to simulated data and compare its predictive performance to other widely used criteria such as Cp, RIC, and the Lasso. Viewing PIC as a selection scheme for greedy algorithms, we extend it to generalized linear models (GLMs) and classification and regression trees (CART).
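    A minimal sketch of the PIC forward-selection loop under simplifying assumptions (roughly standardized predictors, correlation-based entry, no intercept handling); the helper name and details are illustrative, not the authors' code.

```python
# Permuted Inclusion Criterion sketch: augment X with a row-permuted copy,
# run forward selection, and stop as soon as a permuted column would enter.
import numpy as np

def pic_forward(X, y, rng):
    n, p = X.shape
    Z = np.hstack([X, X[rng.permutation(n)]])    # real + permuted predictors
    active, residual = [], y - y.mean()
    for _ in range(p):
        scores = np.abs(Z.T @ residual)          # correlation with residual
        scores[active] = -np.inf                 # exclude already-selected
        j = int(np.argmax(scores))
        if j >= p:                               # first permuted pick: stop
            break
        active.append(j)
        A = Z[:, active]
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ beta                  # refit and update residual
    return active

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=100)
models = [tuple(pic_forward(X, y, rng)) for _ in range(50)]  # many permutations
print("most common selected set:", max(set(models), key=models.count))
```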

    The use of Rasch analysis as a tool in the construction of a preference based measure: the case of AQLQ

    The majority of quality-of-life instruments are not preference-based measures and so cannot be used within cost-utility analysis. The Asthma Quality of Life Questionnaire (AQLQ) is one such instrument. The aim of this study was to develop a health-state classification amenable to valuation from the AQLQ. Rasch models were applied to samples of responders to the AQLQ with the aim of (i) selecting items for a preference-based utility measure (AQL-5D) and (ii) reducing the number of levels per item to a more manageable number for establishing AQL-5D. Selection of items for the evaluation survey was supported by conventional psychometric criteria for item selection (feasibility, internal consistency, floor and ceiling effects, responsiveness, and regression against overall health). Rasch analysis proved unsuccessful at reducing the number of item levels to a preconceived target. However, it proved a useful tool in the initial process of selecting items from an existing HRQL instrument in the construction of AQL-5D. The method is recommended for use alongside conventional psychometric testing to aid the development of preference-based measures.
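    For reference, the dichotomous Rasch model underlying this kind of item analysis gives the probability that person n endorses item i in terms of person ability θ_n and item difficulty b_i; AQLQ items are polytomous, so in practice a rating-scale or partial-credit extension of this form would apply:

    P(X_ni = 1 | θ_n, b_i) = exp(θ_n − b_i) / (1 + exp(θ_n − b_i))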

    Bayesian Example Selection Using BaBiES

    Active learning is widely used to select which examples from a pool should be labeled to give the best results when learning predictive models. It is, however, sometimes desirable to choose examples before any labeling or machine learning has occurred. The optimal experimental design literature offers many theoretically attractive optimality criteria for example selection, but most are intractable when working with large numbers of predictive features. We present the BaBiES criterion, an approximation of Bayesian A-optimal design for linear regression using binary predictors, which is both simple and extremely fast. Empirical evaluations demonstrate that, despite selecting all examples prior to learning, BaBiES is competitive with standard active learning methods on a variety of document classification tasks.
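    The abstract does not spell out the BaBiES computation itself, but the criterion it approximates is standard: Bayesian A-optimality minimises the trace of the posterior covariance (XᵀX + λI)⁻¹ over the selected rows. A hypothetical greedy sketch of that exact objective, for intuition about what BaBiES speeds up (all names are assumptions):

```python
# Greedy Bayesian A-optimal example selection for linear regression:
# repeatedly pick the example that most shrinks tr((X'X + lam*I)^{-1}).
import numpy as np

def greedy_a_optimal(X, k, lam=1.0):
    n, p = X.shape
    chosen, M = [], lam * np.eye(p)              # prior precision matrix
    for _ in range(k):
        best, best_trace = None, np.inf
        for i in range(n):
            if i in chosen:
                continue
            Mi = M + np.outer(X[i], X[i])        # add candidate's contribution
            t = np.trace(np.linalg.inv(Mi))      # A-optimality: tr(posterior cov)
            if t < best_trace:
                best, best_trace = i, t
        chosen.append(best)
        M += np.outer(X[best], X[best])
    return chosen

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 8)).astype(float)  # binary predictors
print("selected examples:", greedy_a_optimal(X, k=10))
```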

    Optimisation based approaches for machine learning

    Machine learning has attracted a great deal of attention in recent years and has become an integral part of many commercial and research projects, with a wide range of applications. With current developments in technology, more data is generated and stored than ever before; identifying patterns, trends, and anomalies in these datasets and summarising them with simple quantitative models is a vital task. This thesis focuses on the development of machine learning algorithms, based on mathematical programming, for datasets that are relatively small in size.

    The first topic is piecewise regression, where a dataset is partitioned into multiple regions and a regression model is fitted to each one. This work takes an existing algorithm from the literature and extends its mathematical formulation to include information criteria. The inclusion of such criteria aims to counter overfitting, a common problem in supervised learning, by balancing predictive performance against model complexity. The improvement in overall performance is demonstrated by comparing the proposed method with various algorithms from the literature on a range of regression datasets.

    Extending the topic of regression, a decision tree regressor is also proposed. Decision trees are powerful, easy-to-understand structures that can be used for both regression and classification. In this work, an optimisation model is used for the binary splitting of nodes, and a statistical test checks whether each split is statistically meaningful, thereby controlling the tree-generation process. Additionally, a novel mathematical formulation performs feature selection to identify the appropriate variable for each split. Comparison with a number of literature algorithms shows that the variable selection model reduces training time without major sacrifices in performance.

    Lastly, a novel decision tree classifier is proposed. This algorithm is based on a mathematical formulation that identifies the optimal splitting variable and break value, applies a linear transformation to the data, and then assigns samples to classes while minimising the number of misclassifications. The linear transformation step reduces the dimensionality of the examined dataset to a single variable, aiding classification accuracy on more complex datasets. Popular classifiers from the literature are used to benchmark the proposed algorithm on both synthetic and publicly available classification datasets.
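    As a concrete illustration of the first topic's overfitting control, a sketch under assumptions: choose the number of segments in a piecewise-linear fit by BIC, trading residual error against parameter count. The thesis embeds this trade-off inside a mathematical-programming formulation; the quantile-based segmentation and all names below are stand-ins.

```python
# Pick the number of piecewise-linear segments by BIC (illustrative only;
# breakpoints are fixed at quantiles rather than optimised).
import numpy as np

def piecewise_bic(x, y, n_segments):
    edges = np.quantile(x, np.linspace(0, 1, n_segments + 1))
    rss, n = 0.0, len(x)
    for a, b in zip(edges[:-1], edges[1:]):
        m = (x >= a) & (x <= b)                  # points in this segment
        if m.sum() >= 2:
            coef = np.polyfit(x[m], y[m], 1)     # slope + intercept per segment
            rss += np.sum((y[m] - np.polyval(coef, x[m])) ** 2)
    k = 2 * n_segments                            # parameters counted by the penalty
    return n * np.log(rss / n) + k * np.log(n)    # Gaussian BIC up to a constant

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = np.where(x < 5, x, 10 - x) + rng.normal(scale=0.3, size=300)
best = min(range(1, 8), key=lambda s: piecewise_bic(x, y, s))
print("segments chosen by BIC:", best)            # expect 2 for this V-shaped signal
```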

    Model Selection Techniques for Kernel-Based Regression Analysis Using Information Complexity Measure and Genetic Algorithms

    In statistical modeling, an overparameterized model leads to poor generalization on unseen data points. This calls for a model selection technique that appropriately chooses the form of the proposed model, its parameters, and the independent variables retained for modeling. Model selection is particularly important for linear and nonlinear statistical models, which can easily be overfitted. Recently, support vector machines (SVMs) and other kernel-based methods have drawn much attention as the next generation of nonlinear modeling techniques. The model selection issues for SVMs include the choice of kernel, the corresponding parameters, and the optimal subset of independent variables. In the current literature, k-fold cross-validation is the model selection method most widely used for SVMs by machine learning researchers; however, cross-validation is computationally intensive, since the model must be fitted k times.

    This dissertation introduces a model selection criterion based on the information complexity (ICOMP) measure for kernel-based regression analysis and its applications. ICOMP penalizes both the lack of fit and the complexity of the model in order to choose the optimal model with good generalization properties. ICOMP provides a single index for each model, requires no validation data, is computationally efficient, and has been successfully applied to various linear model selection problems. Here we bring ICOMP to nonlinear kernel-based modeling. Specifically, this dissertation proposes ICOMP and its various forms for kernel ridge regression; kernel partial least squares regression; kernel principal component analysis; kernel principal component regression; relevance vector regression; and relevance vector logistic regression and classification problems. The model selection tasks achieved by the proposed criterion include choosing the form of the kernel function, the parameters of the kernel function, the ridge parameter, the number of latent variables, the number of principal components, and the optimal subset of input variables, all in a simultaneous fashion for intelligent data mining.

    The performance of the proposed method is tested on simulation benchmark datasets as well as real datasets. The predictive performance of the proposed criteria is comparable to, and sometimes better than, that of cross-validation, at a fraction of the computational cost. This dissertation also combines a genetic algorithm (GA) with ICOMP for variable subsetting, which significantly decreases computation time compared with an exhaustive search of all possible subsets. The GA procedure is shown to be robust and performs well in repeated simulation examples. The dissertation therefore provides researchers with a computationally efficient alternative model selection approach for data analysis using kernel methods.
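    A hedged sketch of the generic ICOMP recipe, here used to pick a ridge parameter: penalise lack of fit plus Bozdogan's C1 entropic complexity of the estimator's covariance. The exact covariance forms and ICOMP variants in the dissertation differ by model; everything below is an illustrative assumption.

```python
# ICOMP-style selection of the ridge parameter (illustrative sketch).
# ICOMP = -2 log L + 2 * C1(cov), with C1 the entropic complexity
# C1(S) = (s/2) log(tr(S)/s) - (1/2) log|S|.
import numpy as np

def icomp_ridge(X, y, lam):
    n, p = X.shape
    G = np.linalg.inv(X.T @ X + lam * np.eye(p))
    beta = G @ X.T @ y
    rss = np.sum((y - X @ beta) ** 2)
    cov = (rss / n) * G @ X.T @ X @ G            # approx covariance of beta_hat
    c1 = (p / 2) * np.log(np.trace(cov) / p) - 0.5 * np.linalg.slogdet(cov)[1]
    return n * np.log(rss / n) + 2 * c1          # -2 log L (up to const) + 2*C1

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=100)
lams = [0.01, 0.1, 1.0, 10.0, 100.0]
best = min(lams, key=lambda l: icomp_ridge(X, y, l))
print("ridge parameter chosen by ICOMP:", best)
```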