
    An efficient gene selection method for high-dimensional microarray data based on sparse logistic regression

    Gene selection in high-dimensional microarray data has become increasingly important in cancer classification. The high dimensionality of microarray data makes the application of many expert classifier systems difficult. To simultaneously perform gene selection and estimate the gene coefficients in the model, sparse logistic regression using the L1-norm has been successfully applied to high-dimensional microarray data. However, when there is high correlation among genes, the L1-norm cannot perform effectively. To address this issue, an efficient sparse logistic regression (ESLR) is proposed. Extensive applications using high-dimensional gene expression data show that the proposed method can successfully select highly correlated genes. Furthermore, ESLR is compared with three other methods and exhibits competitive performance in both classification accuracy and Youden's index. Thus, we can conclude that ESLR has a significant impact on sparse logistic regression methods and could be used for cancer classification with high-dimensional microarray data.
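
    The abstract does not spell out ESLR's exact formulation, but the baseline it extends, L1-penalized (sparse) logistic regression for embedded gene selection, can be sketched as follows. This is a minimal illustration in Python/scikit-learn; the data shapes, regularization strength C, and random labels are assumptions, not taken from the paper.

        # L1-penalized logistic regression: the sparse baseline the ESLR abstract builds on.
        # Data shapes and C are illustrative assumptions only.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        X = rng.normal(size=(60, 2000))        # 60 samples x 2000 genes (typical microarray scale)
        y = rng.integers(0, 2, size=60)        # binary tumour / normal labels

        X = StandardScaler().fit_transform(X)  # put genes on a common scale before penalizing

        # The L1 penalty drives most gene coefficients exactly to zero, so the non-zero
        # coefficients double as the selected gene set.
        model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        model.fit(X, y)

        selected_genes = np.flatnonzero(model.coef_[0])
        print(f"{selected_genes.size} genes retained out of {X.shape[1]}")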

    An Elastic-net Logistic Regression Approach to Generate Classifiers and Gene Signatures for Types of Immune Cells and T Helper Cell Subsets

    Background: Host immune response is coordinated by a variety of different specialized cell types that vary in time and location. While host immune response can be studied using conventional low-dimensional approaches, advances in transcriptomics analysis may provide a less biased view. Yet, leveraging transcriptomics data to identify immune cell subtypes presents challenges for extracting informative gene signatures hidden within a high-dimensional transcriptomics space characterized by low sample numbers with noisy and missing values. To address these challenges, we explore using machine learning methods to select gene subsets and estimate gene coefficients simultaneously. Results: Elastic-net logistic regression, a type of machine learning, was used to construct separate classifiers for ten different types of immune cells and for five T helper cell subsets. The resulting classifiers were then used to develop gene signatures that best discriminate among immune cell types and T helper cell subsets using RNA-seq datasets. We validated the approach using single-cell RNA-seq (scRNA-seq) datasets, which gave consistent results. In addition, we classified cell types that were previously unannotated. Finally, we benchmarked the proposed gene signatures against other existing gene signatures. Conclusions: The developed classifiers can be used as priors in predicting the extent and functional orientation of the host immune response in diseases, such as cancer, where transcriptomic profiling of bulk tissue samples and single cells is routinely employed, information that can provide insight into the mechanistic basis of disease and therapeutic response.
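
    A minimal sketch of how an elastic-net logistic regression classifier of the kind described above could be fit and its non-zero coefficients read off as a gene signature. The multinomial setup, hyperparameters, and synthetic data are illustrative assumptions, not the authors' pipeline.

        # Elastic-net multinomial logistic regression as a cell-type classifier (illustration only).
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 5000))       # 200 RNA-seq samples x 5000 genes (assumed)
        y = rng.integers(0, 10, size=200)      # labels for 10 immune cell types (assumed)

        X = StandardScaler().fit_transform(X)

        # saga is the scikit-learn solver that supports the elastic-net penalty
        clf = LogisticRegression(penalty="elasticnet", solver="saga",
                                 l1_ratio=0.5, C=0.5, max_iter=5000)
        clf.fit(X, y)

        # The gene signature for each cell type is the set of genes with non-zero coefficients
        signatures = {cell_type: np.flatnonzero(coefs)
                      for cell_type, coefs in zip(clf.classes_, clf.coef_)}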

    Classification of gene expression autism data based on adaptive penalized logistic regression

    A common issue with high-dimensional gene expression data is that many of the genes may not be relevant to the disease under study. Gene selection has proved to be an effective way to improve the results of many classification methods. In this paper, an adaptive penalized logistic regression is proposed, combining logistic regression with a weighted L1-norm, with the aim of identifying relevant genes and providing high classification accuracy on autism data. Experimental results show that the proposed method significantly outperforms two competitor methods in terms of classification accuracy, G-mean, and area under the curve. Thus, the proposed method can also be useful for other classification tasks, such as cancer classification using DNA gene expression data, in real clinical practice.
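
    The weighted L1-norm in the abstract is not fully specified; a common two-stage "adaptive lasso" construction, in which an initial ridge fit supplies per-gene weights that are then folded into an L1 fit by rescaling the columns, is sketched below. The value of gamma, the regularization strengths, and the data are assumptions for illustration.

        # Two-stage adaptive (weighted L1) logistic regression - a common construction,
        # not necessarily the paper's exact estimator.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(2)
        X = rng.normal(size=(80, 1500))
        y = rng.integers(0, 2, size=80)
        X = StandardScaler().fit_transform(X)

        # Stage 1: ridge (L2) fit gives initial coefficient estimates
        init = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X, y)
        gamma = 1.0
        w = 1.0 / (np.abs(init.coef_[0]) ** gamma + 1e-8)   # large weight -> gene penalized harder

        # Stage 2: weighted L1 fit, implemented by dividing each gene column by its weight
        adaptive = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X / w, y)
        coef = adaptive.coef_[0] / w                         # map back to the original scale
        selected = np.flatnonzero(coef)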

    Relaxed Adaptive Lasso for Classification on High-Dimensional Sparse Data with Multicollinearity

    High-dimensional sparse data with multicollinearity is frequently found in medical data. This problem can lead to poor predictive accuracy when a model is applied to a new data set. The Least Absolute Shrinkage and Selection Operator (Lasso) is a popular machine-learning algorithm for variable selection and parameter estimation. Additionally, the adaptive Lasso method was developed using an adaptive weight on the l1-norm penalty. This adaptive weight is related to the power order of the estimators. Thus, we focus on 1) the power of the adaptive weight in the penalty function, and 2) the two-stage variable selection method. This study aimed to propose the relaxed adaptive Lasso for sparse logistic regression. Moreover, we compared the performance of the different penalty functions using the mean of the predicted mean squared error (MPMSE) for the simulation study and the classification accuracy for a real-data application. The results showed that the proposed method performed best on high-dimensional sparse data with multicollinearity. In addition, when combined with a support vector machine classifier, the proposed method was also the best option for the variable selection process.
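
    The relaxed lasso idea mentioned above can be illustrated as a two-stage fit: a penalized first stage selects the variables, and a second stage refits on the selected subset with little or no shrinkage. The sketch below uses plain L1 selection and a lightly penalized refit; the paper's exact relaxation and adaptive weighting scheme may differ, and all data and hyperparameters are assumed.

        # Relaxed (two-stage) lasso sketch for classification; illustrative only.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(3)
        X = rng.normal(size=(100, 800))
        y = rng.integers(0, 2, size=100)
        X = StandardScaler().fit_transform(X)

        # Stage 1: L1-penalized fit performs the variable selection
        stage1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
        support = np.flatnonzero(stage1.coef_[0])

        # Stage 2: refit on the selected variables only, relaxing the shrinkage (large C)
        stage2 = LogisticRegression(penalty="l2", C=100.0, max_iter=5000).fit(X[:, support], y)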

    Novel Regression Models For High-Dimensional Survival Analysis

    Survival analysis aims to predict the occurrence of specific events of interest at future time points. The presence of incomplete observations due to censoring brings unique challenges to this domain and differentiates survival analysis techniques from other standard regression methods. In this thesis, we propose four models to deal with high-dimensional survival analysis. First, we propose a regularized linear regression model with weighted least squares to handle survival prediction in the presence of censored instances. We employ the elastic net penalty term to induce sparsity in the linear model and effectively handle high-dimensional data. In contrast to existing censored linear models, the parameter estimation of our model does not need any prior estimation of the survival times of censored instances. The second model we propose is a unified model for regularized parametric survival regression for an arbitrary survival distribution. We employ a generalized linear model to approximate the negative log-likelihood and use the elastic net as a sparsity-inducing penalty to effectively deal with high-dimensional data. The proposed model is then formulated as a penalized iteratively reweighted least squares problem and solved using a cyclical coordinate descent-based method. Because popular survival analysis methods such as the Cox proportional hazards model and parametric survival regression rely on strict assumptions that are not realistic in many real-world applications, in the third model we reformulate the survival analysis problem as a multi-task learning problem that predicts the survival time by estimating the survival status at each time interval during the study duration. We propose an indicator matrix to enable the multi-task learning algorithm to handle censored instances, and we incorporate important characteristics of survival problems, such as the non-negative non-increasing list structure, into our model through max-heap projection. The proposed formulation is solved via an Alternating Direction Method of Multipliers (ADMM) based algorithm. Beyond these three methods, which aim at solving the standard survival prediction problem, we also propose a transfer learning model for survival analysis. During our study, we noticed that obtaining sufficient labeled training instances for learning a robust prediction model is a very time-consuming process and can be extremely difficult in practice. Thus, we propose a Cox-based model which uses the L2,1-norm penalty to encourage source predictors and target predictors to share similar sparsity patterns, and hence learns a shared representation across source and target domains to improve performance on the target task. We demonstrate the performance of the proposed models using several real-world high-dimensional biomedical benchmark datasets, and our experimental results indicate that our models outperform other state-of-the-art competing methods and attain very competitive performance on various datasets.
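
    None of the four thesis models is an off-the-shelf routine, but the flavor of elastic-net penalized survival regression they build on can be illustrated with the coxnet implementation in the scikit-survival package. The package choice, data shapes, and hyperparameters are assumptions for illustration, and exact APIs may vary by version.

        # Elastic-net penalized Cox regression (scikit-survival), shown as a reference point
        # for the sparsity-inducing survival models described above; not the thesis models.
        import numpy as np
        from sksurv.linear_model import CoxnetSurvivalAnalysis
        from sksurv.util import Surv

        rng = np.random.default_rng(4)
        X = rng.normal(size=(300, 1000))                    # 300 patients x 1000 features (assumed)
        time = rng.exponential(scale=5.0, size=300)         # observed follow-up times
        event = rng.integers(0, 2, size=300).astype(bool)   # True = event observed, False = censored
        y = Surv.from_arrays(event=event, time=time)

        # l1_ratio mixes the lasso (sparsity) and ridge (grouping) parts of the elastic net
        model = CoxnetSurvivalAnalysis(l1_ratio=0.7, alphas=[0.1])
        model.fit(X, y)

        risk_scores = model.predict(X)                      # higher score = higher predicted hazard
        kept = np.flatnonzero(model.coef_[:, 0])            # features with non-zero coefficients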

    Improved LASSO (ILASSO) for gene selection and classification in high dimensional DNA microarray data

    Classification and gene selection in high-dimensional microarray data have become a challenging problem in molecular biology and genetics. Penalized adaptive likelihood methods have recently been employed for cancer classification to address gene selection consistency and estimation of gene coefficients in high-dimensional data simultaneously. Many studies in the literature have proposed using ordinary least squares (OLS), maximum likelihood estimation (MLE), or the Elastic net as the initial weight in the Adaptive elastic net, but in high-dimensional microarray data MLE and OLS are not suitable. Likewise, using the Elastic net as the initial weight in the Adaptive elastic net yields poor performance, because the ridge penalty in the Elastic net pulls the coefficients of highly correlated genes closer to each other. As a result, the estimator fails to differentiate the coefficients of highly correlated genes that have different signs, since they are grouped together. To tackle this issue, the present study proposes the Improved LASSO (ILASSO) estimator, which adds the ridge penalty to the original LASSO with an adaptive weight on both the l1-norm and the l2-norm simultaneously. Results on real data indicate that ILASSO performs better than other methods in terms of the number of genes selected, classification precision, sensitivity, and specificity.
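
    A sketch of an adaptive elastic-net fit in the spirit of ILASSO is given below: adaptive weights from an initial fit are folded in by rescaling the gene columns before an elastic-net logistic regression. Note that column rescaling puts weight w_j on the l1 term but w_j squared on the l2 term, so this is an approximation of a jointly weighted penalty rather than the paper's exact estimator; all data and hyperparameters are assumed.

        # Adaptive elastic-net sketch (approximation of the ILASSO idea, not its exact penalty).
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(5)
        X = rng.normal(size=(70, 1200))
        y = rng.integers(0, 2, size=70)
        X = StandardScaler().fit_transform(X)

        # Initial ridge fit supplies the adaptive weights (larger |beta| -> smaller penalty weight)
        init = LogisticRegression(penalty="l2", C=1.0, max_iter=5000).fit(X, y)
        w = 1.0 / (np.abs(init.coef_[0]) + 1e-8)

        enet = LogisticRegression(penalty="elasticnet", solver="saga",
                                  l1_ratio=0.5, C=0.1, max_iter=5000).fit(X / w, y)
        coef = enet.coef_[0] / w
        selected_genes = np.flatnonzero(coef)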

    Methodological contributions to the challenges and opportunities of high dimensional clustering in the context of single-cell data

    With the sequencing of single cells it is possible to measure the gene expression of each individual cell, in contrast to bulk sequencing, which yields only average gene expression. This procedure provides access to read counts for each single cell and allows the development of methods by which single cells are automatically allocated to cell types. The determination of cell types is decisive for the analysis of diseases and for understanding human health based on the genetic profile of single cells. It is common practice to allocate cell types using clustering procedures that have been developed explicitly for single-cell data. For that purpose, single-cell consensus clustering (SC3), proposed by Kiselev et al. (Nat Methods 14(5):483-486, 2017), is among the leading clustering methods in this context and is also relevant for the following contributions. This PhD thesis aims at the development of appropriate analysis techniques for the clustering of high-dimensional single-cell data and their reliable validation. It also provides a simulation framework for investigating the influence of distorted measurements of single cells on clustering performance. We further incorporate cluster indices as informative weights into regularized regression, which allows a soft filtering of variables.
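
    SC3 itself is an R/Bioconductor package and is not reproduced here, but the consensus idea it rests on, combining several clusterings into a co-association matrix and then clustering that matrix, can be sketched as follows. The toy expression matrix, the set of PCA dimensionalities, and the number of clusters k are illustrative assumptions.

        # Toy consensus clustering in the spirit of SC3 (not the SC3 implementation itself).
        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.cluster import KMeans, AgglomerativeClustering

        rng = np.random.default_rng(6)
        expr = rng.poisson(lam=2.0, size=(150, 2000)).astype(float)  # 150 cells x 2000 genes (assumed)
        log_expr = np.log1p(expr)

        k = 4                                    # assumed number of cell types
        n_cells = log_expr.shape[0]
        consensus = np.zeros((n_cells, n_cells))
        dims = (5, 10, 15, 20)                   # assumed range of PCA dimensionalities

        for d in dims:
            emb = PCA(n_components=d).fit_transform(log_expr)
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
            consensus += labels[:, None] == labels[None, :]

        consensus /= len(dims)                   # fraction of runs in which two cells co-cluster

        # Cluster the consensus matrix: 1 - consensus acts as a distance between cells.
        # (Older scikit-learn versions name the "metric" parameter "affinity".)
        final = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                        linkage="average").fit_predict(1.0 - consensus)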

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.