363 research outputs found

    Gene and pathway identification with Lp penalized Bayesian logistic regression

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Identifying genes and pathways associated with diseases such as cancer has been a subject of considerable research in recent years in the area of bioinformatics and computational biology. It has been demonstrated that the magnitude of differential expression does not necessarily indicate biological significance. Even a very small change in the expression of particular gene may have dramatic physiological consequences if the protein encoded by this gene plays a catalytic role in a specific cell function. Moreover, highly correlated genes may function together on the same pathway biologically. Finally, in sparse logistic regression with <it>L</it><sub><it>p </it></sub>(<it>p </it>< 1) penalty, the degree of the sparsity obtained is determined by the value of the regularization parameter. Usually this parameter must be carefully tuned through cross-validation, which is time consuming.</p> <p>Results</p> <p>In this paper, we proposed a simple Bayesian approach to integrate the regularization parameter out analytically using a new prior. Therefore, there is no longer a need for parameter selection, as it is eliminated entirely from the model. The proposed algorithm (BLpLog) is typically two or three orders of magnitude faster than the original algorithm and free from bias in performance estimation. We also define a novel similarity measure and develop an integrated algorithm to hunt the regulatory genes with low expression changes but having high correlation with the selected genes. Pathways of those correlated genes were identified with DAVID <url>http://david.abcc.ncifcrf.gov/</url>.</p> <p>Conclusion</p> <p>Experimental results with gene expression data demonstrate that the proposed methods can be utilized to identify important genes and pathways that are related to cancer and build a parsimonious model for future patient predictions.</p

    Efficient Feature Selection and Multiclass Classification with Integrated Instance and Model Based Learning

    Get PDF
    Multiclass classification and feature (variable) selections are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as the K nearest neighbor (KNN) can naturally extend to multiclass problems and usually perform well with unbalanced data, but suffer from the curse of dimensionality. Their performance is degraded when applied to high dimensional data. On the other hand, model-based methods such as logistic regression require the decomposition of the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Even though they can be applied to high dimensional data with L1 or Lp penalized methods, such approaches can only select independent features and the features selected with different binary problems are usually different. They also produce unbalanced classification problems with one vs. the rest scheme even if the original multiclass problem is balanced

    Regularized Machine Learning in the Genetic Prediction of Complex Traits

    Get PDF
    Compared to univariate analysis of genome-wide association (GWA) studies, machine learning&ndash;based models have been shown to provide improved means of learning such multilocus panels of genetic variants and their interactions that are most predictive of complex phenotypic traits. Many applications of predictive modeling rely on effective variable selection, often implemented through model regularization, which penalizes the model complexity and enables predictions in individuals outside of the training dataset. However, the different regularization approaches may also lead to considerable differences, especially in the number of genetic variants needed for maximal predictive accuracy, as illustrated here in examples from both disease classification and quantitative trait prediction. We also highlight the potential pitfalls of the regularized machine learning models, related to issues such as model overfitting to the training data, which may lead to over-optimistic prediction results, as well as identifiability of the predictive variants, which is important in many medical applications. While genetic risk prediction for human diseases is used as a motivating use case, we argue that these models are also widely applicable in nonhuman applications, such as animal and plant breeding, where accurate genotype-to-phenotype modeling is needed. Finally, we discuss some key future advances, open questions and challenges in this developing field, when moving toward low-frequency variants and cross-phenotype interactions.</p

    Penalized estimation in high-dimensional data analysis

    Get PDF
    • …
    corecore