158,944 research outputs found

    Correcting for selection bias via cross-validation in the classification of microarray data

    Full text link
    There is increasing interest in the use of diagnostic rules based on microarray data. These rules are formed by considering the expression levels of thousands of genes in tissue samples taken from patients of known classification with respect to a number of classes, representing, say, disease status or treatment strategy. As the final versions of these rules are usually based on a small subset of the available genes, there is a selection bias that has to be corrected for in the estimation of the associated error rates. We consider the problem using cross-validation. In particular, we present explicit formulae that are useful in explaining the layers of validation that have to be performed in order to avoid improperly cross-validated estimates. Comment: Published at http://dx.doi.org/10.1214/193940307000000284 in the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org).
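    A minimal sketch of the distinction drawn in this abstract, in Python with scikit-learn (an illustration under assumed data and classifier choices, not the authors' explicit formulae): the gene-selection step must be repeated inside every cross-validation fold; selecting genes once on the full data and only then cross-validating the classifier leaves the selection bias in the error estimate.

        # Sketch: improper vs. proper cross-validation when the rule uses a
        # selected gene subset. Data are synthetic pure noise, so an honest
        # accuracy estimate should sit near chance (0.50).
        import numpy as np
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(60, 5000))      # 60 tissue samples, 5000 genes, no signal
        y = rng.integers(0, 2, size=60)      # two classes, e.g. disease status

        # Improper: genes selected once on ALL the data, classifier cross-validated after.
        X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
        biased = cross_val_score(LinearSVC(), X_sel, y, cv=5).mean()

        # Proper: the selection step sits inside each training fold (an extra layer of validation).
        pipe = make_pipeline(SelectKBest(f_classif, k=20), LinearSVC())
        honest = cross_val_score(pipe, X, y, cv=5).mean()

        print(f"biased accuracy ~{biased:.2f}, honest accuracy ~{honest:.2f} (chance is 0.50)")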

    Kernel density estimation on the torus

    No full text
    Kernel density estimation for multivariate, circular data has been formulated only when the sample space is the sphere, but theory for the torus would also be useful. For data lying on a d-dimensional torus (d ≥ 1), we discuss kernel estimation of a density, its mixed partial derivatives, and their squared functionals. We introduce a specific class of product kernels whose order is defined in such a way as to obtain L2-risk formulas whose structure can be compared to their Euclidean counterparts. Our kernels are based on circular densities; however, we also discuss smaller-bias estimation involving negative kernels which are functions of circular densities. Practical rules for selecting the smoothing degree, based on cross-validation, bootstrap and plug-in ideas, are derived. Moreover, we provide specific results on the use of kernels based on the von Mises density. Finally, real-data examples and simulation studies illustrate the findings.
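    As a rough illustration of a product kernel built from circular densities (a sketch under simplifying assumptions of my own, in Python; it is not the paper's estimator, and the fixed concentration values stand in for the cross-validation, bootstrap or plug-in smoothing rules mentioned above), a von Mises product-kernel density estimate on the 2-torus might look as follows.

        # Sketch: product von Mises kernel density estimate on the 2-torus.
        import numpy as np
        from scipy.special import i0

        def vm_kernel(theta, kappa):
            """Von Mises density at angular distance theta with concentration kappa."""
            return np.exp(kappa * np.cos(theta)) / (2 * np.pi * i0(kappa))

        def torus_kde(grid, data, kappa=(8.0, 8.0)):
            """KDE on the 2-torus: average of product von Mises kernels centred at the data.

            grid : (m, 2) evaluation angles in [0, 2*pi); data : (n, 2) observed angles.
            kappa acts as an (inverse) smoothing parameter per coordinate.
            """
            diffs = grid[:, None, :] - data[None, :, :]          # (m, n, 2) angular differences
            k1 = vm_kernel(diffs[..., 0], kappa[0])
            k2 = vm_kernel(diffs[..., 1], kappa[1])
            return (k1 * k2).mean(axis=1)                        # product kernel, averaged over data

        rng = np.random.default_rng(1)
        data = rng.vonmises(mu=[0.0, np.pi / 2], kappa=4.0, size=(200, 2)) % (2 * np.pi)
        grid = np.column_stack([g.ravel() for g in np.meshgrid(
            np.linspace(0, 2 * np.pi, 50, endpoint=False),
            np.linspace(0, 2 * np.pi, 50, endpoint=False))])
        density = torus_kde(grid, data)
        print(density.sum() * (2 * np.pi / 50) ** 2)             # should be close to 1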

    Growing Regression Forests by Classification: Applications to Object Pose Estimation

    Full text link
    In this work, we propose a novel node splitting method for regression trees and incorporate it into the regression forest framework. Unlike traditional binary splitting, where the splitting rule is selected from a predefined set of binary splitting rules via trial-and-error, the proposed node splitting method first finds clusters of the training data which at least locally minimize the empirical loss without considering the input space. Then splitting rules which preserve the found clusters as much as possible are determined by casting the problem into a classification problem. Consequently, our new node splitting method enjoys more freedom in choosing the splitting rules, resulting in more efficient tree structures. In addition to the Euclidean target space, we present a variant which can naturally deal with a circular target space by the proper use of circular statistics. We apply the regression forest employing our node splitting to head pose estimation (Euclidean target space) and car direction estimation (circular target space) and demonstrate that the proposed method significantly outperforms state-of-the-art methods (38.5% and 22.5% error reduction, respectively). Comment: Paper accepted by ECCV 2014.
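    Reduced to a single node, the splitting recipe can be sketched as follows (an illustration of the general idea only, not the authors' implementation; k-means and logistic regression are stand-ins chosen for concreteness): first cluster the training targets without looking at the inputs, then cast the split as a classification problem that tries to preserve those clusters.

        # Sketch of one node split: cluster the targets first, then learn an
        # input-space rule that reproduces the clusters.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.linear_model import LogisticRegression

        def split_node(X, y):
            """Return a rule(X) -> boolean mask sending samples to the left child."""
            # Step 1: two clusters of the targets, locally minimising squared loss.
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(y)
            # Step 2: predict the cluster from the inputs; the classifier is the splitting rule.
            clf = LogisticRegression(max_iter=1000).fit(X, labels)
            return lambda Xq: clf.predict(Xq) == 0

        rng = np.random.default_rng(2)
        X = rng.normal(size=(300, 10))
        y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 0])])   # toy 2-D pose-like target
        left = split_node(X, y)(X)
        print(left.sum(), (~left).sum())                          # sizes of the two children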

    Deriving Models for Software Project Effort Estimation By Means of Genetic Programming

    Get PDF
    Keywords: software engineering, effort estimation, genetic programming, symbolic regression. This paper presents the application of a computational intelligence methodology to effort estimation for software projects. Namely, we apply a genetic programming model for symbolic regression, aiming to produce mathematical expressions that (1) are highly accurate and (2) can be used for estimating the development effort by revealing relationships between the project’s features and the required work. We investigated the effectiveness of this methodology in two software engineering domains. The system proved able to generate models in the form of handy mathematical expressions that are more accurate than those found in the literature.
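    One way to reproduce the flavour of this approach (an assumption for illustration: the third-party gplearn package is used here, not the authors' system, and the project features and effort values below are synthetic stand-ins rather than real data) is symbolic regression by genetic programming.

        # Sketch: evolve a symbolic expression relating (synthetic) project
        # features to development effort.
        import numpy as np
        from gplearn.genetic import SymbolicRegressor

        rng = np.random.default_rng(3)
        size_kloc = rng.uniform(2, 200, size=300)             # hypothetical project size
        team_exp = rng.uniform(1, 10, size=300)               # hypothetical team experience
        X = np.column_stack([size_kloc, team_exp])
        effort = 2.8 * size_kloc**1.05 / np.sqrt(team_exp)    # toy, COCOMO-flavoured relation
        effort *= rng.lognormal(sigma=0.1, size=300)          # multiplicative noise

        est = SymbolicRegressor(population_size=2000, generations=20,
                                function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'log'),
                                parsimony_coefficient=0.001, random_state=0)
        est.fit(X, effort)
        print(est._program)                                   # the evolved expression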

    Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression

    Full text link
    In high dimensional settings, sparse structures are crucial for efficiency, in terms of memory, computation and performance. It is customary to consider an ℓ1 penalty to enforce sparsity in such scenarios. Sparsity-enforcing methods, the Lasso being a canonical example, are popular candidates to address high dimension. For efficiency, they rely on tuning a parameter that trades data fitting against sparsity. For the Lasso theory to hold, this tuning parameter should be proportional to the noise level, yet the latter is often unknown in practice. A possible remedy is to jointly optimize over the regression parameter as well as over the noise level. This has been considered under several names in the literature: Scaled-Lasso, Square-root Lasso, Concomitant Lasso estimation, for instance, and could be of interest for confidence sets or uncertainty quantification. In this work, after illustrating numerical difficulties for the Concomitant Lasso formulation, we propose a modification we coined Smoothed Concomitant Lasso, aimed at increasing numerical stability. We propose an efficient and accurate solver leading to a computational cost no more expensive than that of the Lasso. We leverage standard ingredients behind the success of fast Lasso solvers: a coordinate descent algorithm combined with safe screening rules, achieving speed by eliminating irrelevant features early.
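    The joint optimisation over the coefficients and the noise level admits a simple alternating-minimisation sketch (a simplification of my own, in Python; the paper's solver is a coordinate descent method with safe screening rules, not the Lasso-in-a-loop below), where a lower bound sigma_0 on the noise level supplies the smoothing.

        # Sketch: alternate a Lasso step for beta with a closed-form noise update,
        # clipping the noise level below at sigma_0 for numerical stability.
        import numpy as np
        from sklearn.linear_model import Lasso

        def smoothed_concomitant(X, y, lam=0.1, sigma_0=1e-2, n_iter=20):
            n = len(y)
            sigma = max(np.std(y), sigma_0)
            beta = np.zeros(X.shape[1])
            for _ in range(n_iter):
                # beta-step: for fixed sigma this is a Lasso with penalty lam * sigma.
                beta = Lasso(alpha=lam * sigma, fit_intercept=False).fit(X, y).coef_
                # sigma-step: closed form, clipped below at sigma_0.
                sigma = max(np.linalg.norm(y - X @ beta) / np.sqrt(n), sigma_0)
            return beta, sigma

        rng = np.random.default_rng(4)
        X = rng.normal(size=(100, 500))
        beta_true = np.zeros(500)
        beta_true[:5] = 2.0
        y = X @ beta_true + rng.normal(scale=0.5, size=100)
        beta_hat, sigma_hat = smoothed_concomitant(X, y, lam=0.15)
        print((np.abs(beta_hat) > 1e-8).sum(), round(sigma_hat, 3))   # support size, noise estimate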

    Bandwidth choice for nonparametric classification

    Full text link
    It is shown that, for kernel-based classification with univariate distributions and two populations, optimal bandwidth choice has a dichotomous character. If the two densities cross at just one point, where their curvatures have the same signs, then minimum Bayes risk is achieved using bandwidths which are an order of magnitude larger than those which minimize pointwise estimation error. On the other hand, if the curvature signs are different, or if there are multiple crossing points, then bandwidths of conventional size are generally appropriate. The range of different modes of behavior is narrower in multivariate settings. There, the optimal size of bandwidth is generally the same as that which is appropriate for pointwise density estimation. These properties motivate empirical rules for bandwidth choice. Comment: Published at http://dx.doi.org/10.1214/009053604000000959 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
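    To see why classification performance, rather than pointwise density error, should drive the choice, one can score bandwidths directly by the error rate they produce (a toy illustration with Gaussian kernels and two normal populations, not the paper's analysis); even strongly oversmoothed estimates can keep the decision boundary, and hence the risk, nearly optimal.

        # Sketch: kernel-rule error rate across a bandwidth grid, two univariate classes.
        import numpy as np

        def kde(x_eval, sample, h):
            """Gaussian kernel density estimate with bandwidth h, evaluated at x_eval."""
            u = (x_eval[:, None] - sample[None, :]) / h
            return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

        rng = np.random.default_rng(5)
        x0 = rng.normal(0.0, 1.0, size=200)      # class 0 training sample
        x1 = rng.normal(1.5, 1.0, size=200)      # class 1 training sample
        t0 = rng.normal(0.0, 1.0, size=2000)     # test points from class 0
        t1 = rng.normal(1.5, 1.0, size=2000)     # test points from class 1

        for h in (0.1, 0.3, 1.0, 3.0):           # from under- to heavily over-smoothed
            err0 = (kde(t0, x1, h) > kde(t0, x0, h)).mean()     # class 0 sent to class 1
            err1 = (kde(t1, x0, h) >= kde(t1, x1, h)).mean()    # class 1 sent to class 0
            print(f"h={h:<4} error rate ~ {(err0 + err1) / 2:.3f}")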