
    Accelerated Proximal Algorithm for Finding the Dantzig Selector and Source Separation Using Dictionary Learning

    In most applications, signals acquired from different sensors are composite and corrupted by noise. In the presence of noise, separating a composite signal into its components without losing information is quite challenging, and it becomes more difficult when only a few samples of the noisy, undersampled composite signal are available. In this paper, we aim to find the Dantzig selector with overcomplete dictionaries using an Accelerated Proximal Gradient Algorithm (APGA) for the recovery and separation of undersampled composite signals. We have successfully diagnosed leukemia using our model and compared it with the Alternating Direction Method of Multipliers (ADMM). As a test case, we have also recovered an Electrocardiogram (ECG) signal with high accuracy from its noisy version using this model, alongside a Proximity Operator based Algorithm (POA) for comparison. With lower computational complexity than ADMM and POA, APGA also shows good clustering capability, as demonstrated by the leukemia diagnosis.
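    As a rough illustration of the accelerated proximal template the abstract refers to, the sketch below applies a FISTA-style accelerated proximal gradient step to the L1-penalized least-squares surrogate min_x 0.5*||Ax - y||^2 + lam*||x||_1. The paper's exact APGA formulation for the Dantzig selector constraint is not given in the abstract, so the function names and parameters here are illustrative assumptions, not the authors' algorithm.

    ```python
    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of the L1 norm: componentwise shrinkage toward zero.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def apga_l1(A, y, lam, n_iter=500):
        # Accelerated proximal gradient (FISTA-style) for
        #   min_x 0.5 * ||A x - y||^2 + lam * ||x||_1,
        # where A could be an overcomplete dictionary.
        L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
        x = z = np.zeros(A.shape[1])
        t = 1.0
        for _ in range(n_iter):
            grad = A.T @ (A @ z - y)             # gradient of the smooth part at z
            x_new = soft_threshold(z - grad / L, lam / L)
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # Nesterov momentum step
            x, t = x_new, t_new
        return x
    ```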

    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and new challenges to data scientists. On one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive paradigm changes in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; they can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
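    The spurious-correlation point lends itself to a quick numerical check. The snippet below (illustrative only; not from the article) draws a response that is independent of every one of d noise features and shows that the maximum absolute sample correlation is nevertheless far from zero once d is large.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 10_000
    X = rng.standard_normal((n, d))    # d features, all independent of y
    y = rng.standard_normal(n)         # response carries no signal at all

    # Sample correlation of y with each irrelevant feature.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corrs = Xc.T @ yc / n

    # With n = 100 and d = 10,000 the largest spurious correlation is
    # typically around 0.4, despite every true correlation being zero.
    print(np.abs(corrs).max())
    ```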

    Classification of gene expression autism data based on adaptive penalized logistic regression

    A common issue with high-dimensional gene expression data is that many of the genes may not be relevant to the disease under study. Gene selection has proved to be an effective way to improve the results of many classification methods. In this paper, an adaptive penalized logistic regression is proposed, with the aim of identifying the relevant genes and providing high classification accuracy on autism data, by combining logistic regression with a weighted L1-norm penalty. Experimental results show that the proposed method significantly outperforms two competing methods in terms of classification accuracy, G-mean, and area under the curve. The proposed method can thus also be useful for cancer classification using DNA gene expression data in real clinical practice.
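    One standard way to realize a weighted L1 penalty with off-the-shelf solvers is the adaptive-lasso reparametrization: fit an initial model, derive per-gene weights, rescale the features, and fit a plain L1 logistic regression. The sketch below follows that generic recipe; the paper's exact weighting scheme is not specified in the abstract, so the weights used here are an assumption.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def adaptive_l1_logistic(X, y, C=1.0):
        # Stage 1: an initial ridge fit supplies coefficients for adaptive weights.
        init = LogisticRegression(penalty="l2", C=C, max_iter=5000).fit(X, y)
        w = 1.0 / (np.abs(init.coef_.ravel()) + 1e-6)   # larger weight = stronger shrinkage
        # Stage 2: a weighted L1 penalty sum_j w_j * |b_j| is equivalent to a
        # plain L1 penalty after rescaling each feature x_j -> x_j / w_j.
        fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X / w, y)
        beta = fit.coef_.ravel() / w                    # map back to the original scale
        return beta, fit.intercept_[0]
    ```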

    Feature selection when there are many influential features

    Recent discussion of the success of feature selection methods has argued that focusing on a relatively small number of features has been counterproductive. Instead, it is suggested, the number of significant features can be in the thousands or tens of thousands, rather than (as is commonly supposed at present) approximately in the range from five to fifty. This change, of orders of magnitude, in the number of influential features necessitates alterations both to the way in which we choose features and to the manner in which the success of feature selection is assessed. In this paper, we suggest a general approach suited to cases where the number of relevant features is very large, and we consider particular versions of the approach in detail. We propose ways of measuring performance, and we study both theoretical and numerical properties of the proposed methodology. (Published at http://dx.doi.org/10.3150/13-BEJ536 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).)
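    When thousands of features may be influential, selection typically starts from a marginal ranking and then keeps a large top-m set rather than a handful. The snippet below is a generic two-sample t-statistic ranking, offered only as a stand-in for the (unspecified) selection criteria studied in the paper.

    ```python
    import numpy as np

    def rank_features(X, y):
        # Rank features by absolute two-sample t-statistic (marginal screening).
        X0, X1 = X[y == 0], X[y == 1]
        n0, n1 = len(X0), len(X1)
        se = np.sqrt(X0.var(axis=0, ddof=1) / n0 + X1.var(axis=0, ddof=1) / n1)
        t = (X1.mean(axis=0) - X0.mean(axis=0)) / se
        return np.argsort(-np.abs(t))       # most influential features first

    # Keep the top m features, where m may be in the thousands
    # rather than the classical five to fifty:
    # selected = rank_features(X, y)[:m]
    ```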

    Robust and sparse estimation of high-dimensional precision matrices via bivariate outlier detection

    Robust estimation of Gaussian graphical models in the high-dimensional setting is becoming increasingly important, since large real data sets may contain outlying observations. These outliers can lead to drastically wrong inference about the intrinsic graph structure. Several procedures apply univariate transformations to make the data Gaussian distributed; however, these transformations do not work well in the presence of structural bivariate outliers. We propose a precision matrix estimator, under the cellwise contamination mechanism, that is robust against structural bivariate outliers. This estimator exploits robust pairwise weighted correlation coefficient estimates, where the weights are computed from the Mahalanobis distance with respect to an affine equivariant robust correlation coefficient estimator. We show that the convergence rate of the proposed estimator is the same as that of the correlation coefficient used to compute the Mahalanobis distance. We conduct numerical simulations under different contamination settings to compare the graph recovery performance of different robust estimators. Finally, the proposed method is applied to the classification of tumors using gene expression data. We show that our procedure can effectively recover the true graph under cellwise data contamination. (Acknowledgements: the authors acknowledge financial support from the Spanish Ministry of Education and Science, research project MTM2013-44902-P.)
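    The overall pipeline (robust pairwise correlations, then sparse precision estimation) can be sketched as follows. Spearman correlation stands in here for the paper's Mahalanobis-weighted pairwise estimator, which would replace it in a faithful implementation; everything else is a generic graphical-lasso step.

    ```python
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.covariance import graphical_lasso

    def robust_precision(X, alpha=0.1):
        # Pairwise robust correlations; Spearman is a simple stand-in for the
        # Mahalanobis-weighted pairwise estimator proposed in the paper.
        R = np.asarray(spearmanr(X)[0])
        # Pairwise estimators need not yield a positive semidefinite matrix,
        # so project onto the PSD cone before the graphical lasso step.
        w, V = np.linalg.eigh(R)
        R_psd = V @ np.diag(np.clip(w, 1e-6, None)) @ V.T
        _, precision = graphical_lasso(R_psd, alpha=alpha)
        return precision   # zeros encode missing edges in the estimated graph
    ```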

    An efficient gene selection method for high-dimensional microarray data based on sparse logistic regression

    Gene selection in high-dimensional microarray data has become increasingly important in cancer classification. The high dimensionality of microarray data makes the application of many expert classifier systems difficult. To simultaneously perform gene selection and estimate the gene coefficients in the model, sparse logistic regression using the L1-norm has been applied successfully to high-dimensional microarray data. However, when there is high correlation among genes, the L1-norm does not perform effectively. To address this issue, an efficient sparse logistic regression (ESLR) is proposed. Extensive applications to high-dimensional gene expression data show that our proposed method can successfully select highly correlated genes. Furthermore, ESLR is compared with three other methods and exhibits competitive performance in both classification accuracy and Youden's index. We can thus conclude that ESLR makes a significant contribution to sparse logistic regression methods and could be used for cancer classification with high-dimensional microarray data.
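    The abstract does not state ESLR's penalty, but the standard remedy when a pure L1 penalty struggles with highly correlated genes is to blend in an L2 term. The sketch below uses elastic-net logistic regression on synthetic data purely to illustrate that grouping effect; it is not the authors' ESLR.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, p = 120, 2000
    X = rng.standard_normal((n, p))
    X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(n)   # two highly correlated "genes"
    y = (X[:, 0] + X[:, 1] + rng.standard_normal(n) > 0).astype(int)

    # Elastic net tends to keep correlated informative genes together, where a
    # pure L1 penalty tends to pick one of them arbitrarily.
    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.5, C=1.0, max_iter=10_000).fit(X, y)
    selected = np.flatnonzero(clf.coef_.ravel())        # indices of retained genes
    print(selected[:10])
    ```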

    High-dimensional Measurement Error Models for Lipschitz Loss

    Recently emerging large-scale biomedical data pose exciting opportunities for scientific discoveries. However, the ultrahigh dimensionality and non-negligible measurement errors in such data may create difficulties for estimation. Existing methods for high-dimensional covariates with measurement error are limited: they usually require knowledge of the noise distribution and focus on linear or generalized linear models. In this work, we develop high-dimensional measurement error models for a class of Lipschitz loss functions that encompasses logistic regression, hinge loss, and quantile regression, among others. Our estimator is designed to minimize the L1 norm among all estimators belonging to suitable feasible sets, without requiring any knowledge of the noise distribution. Subsequently, we generalize these estimators to a Lasso-analog version that is computationally scalable to higher dimensions. We derive theoretical guarantees in terms of finite-sample statistical error bounds and sign consistency, even when the dimensionality increases exponentially with the sample size. Extensive simulation studies demonstrate superior performance compared with existing methods in classification and quantile regression problems. An application to a gender classification task based on brain functional connectivity in the Human Connectome Project data illustrates improved accuracy under our approach and the ability to reliably identify significant brain connections that drive gender differences.
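    To make the "minimize the L1 norm over a feasible set" template concrete, the sketch below solves a Dantzig-selector-style instance, min ||b||_1 subject to ||Sigma b - rho||_inf <= lam, as a linear program. The paper's feasible sets for general Lipschitz losses and error-corrupted covariates are more involved; Sigma, rho, and lam here are placeholders for whichever corrected moment estimates one plugs in.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    def l1_min_feasible(Sigma, rho, lam):
        # Solve  min ||b||_1  s.t.  ||Sigma @ b - rho||_inf <= lam
        # by splitting b = u - v with u, v >= 0, which linearizes the objective.
        p = Sigma.shape[1]
        c = np.ones(2 * p)                  # sum(u) + sum(v) equals ||b||_1 at optimum
        M = np.hstack([Sigma, -Sigma])      # M @ [u; v] = Sigma @ b
        A_ub = np.vstack([M, -M])           # encodes the two-sided sup-norm constraint
        b_ub = np.concatenate([lam + rho, lam - rho])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p))
        z = res.x
        return z[:p] - z[p:]                # recover b = u - v
    ```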