Variable selection and regression analysis for graph-structured covariates with an application to genomics
Graphs and networks are common ways of depicting biological information. In
biology, many different biological processes are represented by graphs, such as
regulatory networks, metabolic pathways and protein--protein interaction
networks. This kind of a priori use of graphs is a useful supplement to the
standard numerical data such as microarray gene expression data. In this paper
we consider the problem of regression analysis and variable selection when the
covariates are linked on a graph. We study a graph-constrained regularization
procedure and its theoretical properties for regression analysis to take into
account the neighborhood information of the variables measured on a graph. This
procedure involves a smoothness penalty on the coefficients that is defined as
a quadratic form of the Laplacian matrix associated with the graph. We
establish estimation and model selection consistency results and provide
estimation bounds for both fixed and diverging numbers of parameters in
regression models. We demonstrate by simulations and a real data set that the
proposed procedure can lead to better variable selection and prediction than
existing methods that ignore the graph information associated with the
covariates.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS332 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
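The smoothness penalty described in this abstract, a quadratic form of the graph Laplacian in the coefficients, can be illustrated with a small sketch. The abstract does not say which Laplacian is used, so the normalized Laplacian of an unweighted graph is an assumption here, and the function name `laplacian_penalty` and the toy path graph are hypothetical. For that Laplacian the quadratic form equals a sum over edges of squared differences of degree-scaled coefficients, which is what the code computes.

```python
from math import sqrt


def laplacian_penalty(beta, edges):
    """Return beta^T L beta for the normalized Laplacian of an unweighted graph.

    Assumption: L = I - D^{-1/2} A D^{-1/2}. For this L the quadratic form
    equals the sum over edges (u, v) of
    (beta_u / sqrt(d_u) - beta_v / sqrt(d_v))^2, where d_u is the degree of u.
    Isolated nodes contribute nothing.
    """
    degree = [0] * len(beta)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    total = 0.0
    for u, v in edges:
        diff = beta[u] / sqrt(degree[u]) - beta[v] / sqrt(degree[v])
        total += diff * diff
    return total


# Hypothetical toy graph: a path 0 - 1 - 2 (degrees 1, 2, 1).
edges = [(0, 1), (1, 2)]

# Coefficients that vary smoothly over the graph (after degree scaling).
smooth = laplacian_penalty([1.0, sqrt(2.0), 1.0], edges)

# Same magnitudes, but the middle coefficient flips sign.
rough = laplacian_penalty([1.0, -sqrt(2.0), 1.0], edges)

print(smooth, rough)
```

Coefficients that agree along edges incur a small penalty, while a sign flip between neighbors is penalized heavily, which is how such a term encourages linked covariates to receive similar estimates.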
Integrative analysis of the inter-tumoral heterogeneity of triple-negative breast cancer.
Triple-negative breast cancers (TNBC) lack estrogen and progesterone receptors and HER2 amplification, and are resistant to therapies that target these receptors. Tumors from TNBC patients are heterogeneous based on genetic variations, tumor histology, and clinical outcomes. We used high throughput genomic data for TNBC patients (n = 137) from TCGA to characterize inter-tumor heterogeneity. Similarity network fusion (SNF)-based integrative clustering combining gene expression, miRNA expression, and copy number variation, revealed three distinct patient clusters. Integrating multiple types of data resulted in more distinct clusters than analyses with a single datatype. Whereas most TNBCs are classified by PAM50 as basal subtype, one of the clusters was enriched in the non-basal PAM50 subtypes, exhibited more aggressive clinical features and had a distinctive signature of oncogenic mutations, miRNAs and expressed genes. Our analyses provide a new classification scheme for TNBC based on multiple omics datasets and provide insight into molecular features that underlie TNBC heterogeneity
The role of network science in glioblastoma
Network science has long been recognized as a well-established discipline across many biological domains. In the particular case of cancer genomics, network discovery is challenged by the multitude of available high-dimensional heterogeneous views of data. Glioblastoma (GBM) is an example of such a complex and heterogeneous disease that can be tackled by network science. Identifying the architecture of molecular GBM networks is essential to understanding the information flow and better informing drug development and pre-clinical studies. Here, we review network-based strategies that have been used in the study of GBM, along with the available software implementations for reproducibility and further testing on new datasets. Promising results have been obtained from both bulk and single-cell GBM data, placing network discovery at the forefront of developing molecularly informed personalized medicine.
This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references CEECINST/00102/2018, CEECIND/00072/2018 and PD/BDE/143154/2019, UIDB/04516/2020, UIDB/00297/2020, UIDB/50021/2020, UIDB/50022/2020, UIDB/50026/2020, UIDP/50026/2020, NORTE-01-0145-FEDER-000013, and NORTE-01-0145-FEDER000023 and projects PTDC/CCI-BIO/4180/2020 and DSAIPA/DS/0026/2019. This project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 951970 (OLISSIPO project)
Bayesian hierarchical graph-structured model for pathway analysis using gene expression data
In genomic analysis, there is growing interest in network structures that represent biochemical interactions. Graph-structured or graph-constrained inference takes advantage of a known relational structure among variables to introduce smoothness and reduce complexity in modeling, especially for high-dimensional genomic data. There has been considerable interest in its application to model regularization and selection. However, prior knowledge of the graphical structure among the variables can be limited and partial. Empirical data may suggest variations and modifications to such a graph, which could lead to new and interesting biological findings. In this paper, we propose a Bayesian random graph-constrained model, rGrace, an extension of the Grace model, to combine a priori network information with empirical evidence, for applications such as pathway analysis. Using both simulations and real data examples, we show that the new method, while leading to improved predictive performance, can identify discrepancies between the data and an a priori known graph structure and suggest modifications and updates
Network-Based Biomarker Discovery: Development of Prognostic Biomarkers for Personalized Medicine by Integrating Data and Prior Knowledge
Advances in genome science and technology offer a deeper understanding of biology while at the same time improving the practice of medicine. The expression profiling of some diseases, such as cancer, allows for identifying marker genes, which may be able to diagnose a disease or predict future disease outcomes. Marker genes (biomarkers) are selected by scoring how well their expression levels can discriminate between different classes of disease or between groups of patients with different clinical outcomes (e.g. therapy response, survival time, etc.). A current challenge is to identify new markers that are directly related to the underlying disease mechanism
CAncer bioMarker Prediction Pipeline (CAMPP) - A standardized framework for the analysis of quantitative biological data
With the improvement of -omics and next-generation sequencing (NGS) methodologies, along with the lowered cost of generating these types of data, the analysis of high-throughput biological data has become standard both for forming and testing biomedical hypotheses. Our knowledge of how to normalize datasets to remove latent undesirable variances has grown extensively, making for standardized data that are easily compared between studies. Here we present the CAncer bioMarker Prediction Pipeline (CAMPP), an open-source R-based wrapper (https://github.com/ELELAB/CAncer-bioMarker-Prediction-Pipeline-CAMPP) intended to aid bioinformatic software-users with data analyses. CAMPP is called from a terminal command line and is supported by a user-friendly manual. The pipeline may be run on a local computer and requires little or no knowledge of programming. To avoid issues relating to R-package updates, an renv.lock file is provided to ensure R-package stability. Data-management includes missing value imputation, data normalization, and distributional checks. CAMPP performs (I) k-means clustering, (II) differential expression/abundance analysis, (III) elastic-net regression, (IV) correlation and co-expression network analyses, (V) survival analysis, and (VI) protein-protein/miRNA-gene interaction networks. The pipeline returns tabular files and graphical representations of the results. We hope that CAMPP will assist in streamlining bioinformatic analysis of quantitative biological data, whilst ensuring an appropriate bio-statistical framework
Aberrantly Expressed CeRNAs Account for Missing Genomic Variability of Cancer Genes via MicroRNA-Mediated Interactions
There is growing evidence that RNAs compete for binding and regulation by a finite pool of microRNAs (miRs), thus regulating each other through a competing endogenous RNA (ceRNA) mechanism. My dissertation work focused on systematically studying ceRNA interactions in cancer by reverse-engineering context-specific miR-RNA interactions and ceRNA regulatory interactions across multiple tumor types and studying the effects of these interactions in cancer. I attempted to use ceRNA interactions to explain how genetic and epigenetic alterations are propagated to target established drivers of tumorigenesis. Using bioinformatics analysis of primary tumor samples and experimental validation in cell lines, I have investigated the roles that mRNAs and noncoding RNAs can play in tumorigenesis via ceRNA interactions. Specifically, I studied how RNAs target tumor-suppressors and oncogenes as ceRNAs, and attempted to account for some of the missing genomic variability in tumors
Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions
Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially.
IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.
Peer reviewed
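The elastic net penalization mentioned in this abstract can be sketched as an objective function. This is only an illustration under stated assumptions, not the authors' implementation: the scaling of the loss by 1/(2n) and the mixing parameter `alpha` follow one common parametrization, and the function name `elastic_net_objective` and the toy presence/absence matrix are hypothetical.

```python
def elastic_net_objective(X, y, beta, lam, alpha):
    """Penalized least-squares loss for a linear model.

    Assumed parametrization:
        RSS / (2n) + lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)
    alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.
    """
    n = len(y)
    rss = 0.0
    for row, target in zip(X, y):
        prediction = sum(x * b for x, b in zip(row, beta))
        rss += (target - prediction) ** 2

    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return rss / (2 * n) + lam * (alpha * l1 + 0.5 * (1 - alpha) * l2)


# Hypothetical toy data: rows are isolates, columns mark presence (1) or
# absence (0) of two sequence elements; y is a numeric phenotype.
X = [[1, 0], [0, 1]]
y = [1.0, 0.0]

fitted = elastic_net_objective(X, y, [1.0, 0.0], lam=0.5, alpha=0.5)
null_model = elastic_net_objective(X, y, [0.0, 0.0], lam=0.5, alpha=0.5)

print(fitted, null_model)
```

The L1 term pushes coefficients of uninformative variants to exactly zero, while the L2 term stabilizes estimates when variants are correlated, which is one reason this penalty suits joint models over many variants.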