The Multiscale Backbone of the Human Phenotype Network Based on Biological Pathways
Background:
Networks are commonly used to represent and analyze large and complex systems of interacting elements. In systems biology, human disease networks show interactions between disorders that share a common genetic background. We built a pathway-based human phenotype network (PHPN) of over 800 physical attributes, diseases, and behavioral traits, based on about 2,300 genes and 1,200 biological pathways. Using GWAS phenotype-to-gene associations and pathway data from Reactome, we connect human traits based on common patterns of human biological pathways, detecting more pleiotropic effects and expanding previous studies from a gene-centric approach to one of shared cell processes.
Results:
The resulting network has a heavily right-skewed degree distribution, placing it in the scale-free region of the network topology spectrum. We extract the multi-scale information backbone of the PHPN based on the local densities of the network, discarding weak connections. Using a standard community detection algorithm, we construct phenotype modules of similar traits without applying expert biological knowledge. These modules can be likened to disease classes; however, we are able to classify phenotypes according to shared biology rather than arbitrary disease classes. We present examples of expected clinical connections identified by the PHPN as proof of principle.
Conclusions:
We unveil a previously uncharacterized connection between phenotype modules and discuss potential mechanistic connections that are obvious only in retrospect. The PHPN shows tremendous potential to become a useful tool both for unveiling the common biology of diseases and for developing diagnoses and treatments.
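The abstract does not name the backbone-extraction procedure. As a minimal sketch, assuming a disparity-filter-style test (a common way to extract a multiscale backbone from a weighted network), the following keeps an edge only when its weight is improbably large relative to the other edges of at least one endpoint. The function name `disparity_backbone`, the `alpha` threshold, and the toy graph are illustrative assumptions, not the authors' code or data.

```python
from collections import defaultdict


def disparity_backbone(edges, alpha=0.05):
    """Return the edges that pass a disparity-filter-style significance test.

    edges: dict mapping an undirected edge (u, v) to a positive weight.
    alpha: significance threshold (an assumed value, not from the paper).
    """
    strength = defaultdict(float)  # total edge weight attached to each node
    degree = defaultdict(int)      # number of edges attached to each node
    for (u, v), w in edges.items():
        strength[u] += w
        strength[v] += w
        degree[u] += 1
        degree[v] += 1

    def p_value(node, w):
        # Probability of a normalized weight at least this large if the
        # node's strength were split uniformly at random over its edges.
        k = degree[node]
        if k <= 1:
            return 1.0  # a single edge cannot be judged from this endpoint
        return (1.0 - w / strength[node]) ** (k - 1)

    # Keep an edge if it is significant for at least one of its endpoints.
    return {
        (u, v): w
        for (u, v), w in edges.items()
        if min(p_value(u, w), p_value(v, w)) < alpha
    }


# Toy example: one strong and nine weak connections around a hub node.
toy_edges = {("hub", "a"): 100.0}
toy_edges.update({("hub", f"x{i}"): 1.0 for i in range(9)})
backbone = disparity_backbone(toy_edges)
print(backbone)
```

In the toy graph only the strong hub-to-a edge survives, which illustrates how weak connections are discarded while locally dominant ones are kept regardless of the overall weight scale.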
Single Cell Analysis of Chromatin Accessibility
The identity of each cell in the human body is established and maintained through a distinct gene expression program, which is regulated in part by chromatin accessibility. Until recently, our understanding of chromatin accessibility has depended largely upon bulk measurements in populations of cells. Recent advances in sequencing techniques have allowed the identification of open chromatin regions in single cells. During my Ph.D., I have developed and used single-cell sequencing techniques to study the diverse gene regulatory programs that underlie the different cell types in complex mammalian tissues. In chapter 1, colleagues and I developed Single Nucleus Assay for Transposase-Accessible Chromatin using Sequencing (snATAC-seq), a combinatorial barcoding-assisted single-cell assay for probing accessible chromatin in single cells. We then used snATAC-seq to generate an epigenomic atlas of the early developing mouse brain. The high level of noise in each single-cell chromatin accessibility profile and the large volume of the datasets pose unique computational challenges. In chapter 2, I developed a comprehensive bioinformatics software package called SnapATAC for analyzing large-scale single-cell ATAC-seq datasets. SnapATAC resolves the heterogeneity in complex tissues and maps the trajectories of cellular states. As a demonstration of its utility, SnapATAC was applied to 55,592 single-nucleus ATAC-seq profiles from the mouse secondary motor cortex. To further determine the target genes of the distal regulatory elements identified using snATAC-seq in different cell types, in chapter 3, colleagues and I developed PLAC-seq, a cost-efficient method that identifies long-range chromatin interactions at kilobase resolution. PLAC-seq improves the efficiency of detecting chromatin conformation by over 10-fold and reduces the input requirement by nearly 100-fold compared to prior techniques.
Finally, to probe the in vivo function of the regulatory sequences, I present a high-throughput CRISPR screening method (CREST-seq) for the unbiased discovery and functional assessment of enhancer sequences in the human genome. We used it to interrogate the 2-Mb POU5F1 locus in human embryonic stem cells and discovered that sequences previously annotated as promoters of functionally unrelated genes can regulate the expression of POU5F1 from a long distance. We anticipate that these studies will help us understand the gene regulatory programs across diverse biological systems ranging from human disease to the evolution of species.
INVESTIGATING INVASION IN DUCTAL CARCINOMA IN SITU WITH TOPOGRAPHICAL SINGLE CELL GENOME SEQUENCING
Synchronous Ductal Carcinoma in situ (DCIS-IDC) is an early stage of breast cancer invasion in which genomic evolution during invasion can be delineated because both in situ and invasive regions are present within the same sample. While laser capture microdissection studies of DCIS-IDC examined the relationship between the paired in situ (DCIS) and invasive (IDC) regions, these studies were either confounded by bulk tissue or limited to a small set of genes or markers. To overcome these challenges, we developed Topographic Single Cell Sequencing (TSCS), which combines laser-catapulting with single cell DNA sequencing to measure genomic copy number profiles from single tumor cells while preserving their spatial context. We applied TSCS to sequence 1,293 single cells from 10 synchronous DCIS patients. We also applied deep-exome sequencing to the in situ, invasive and normal tissues for the DCIS-IDC patients. Previous bulk tissue studies had produced several conflicting models of tumor evolution. Our data support a multiclonal invasion model, in which genome evolution occurs within the ducts and gives rise to multiple subclones that escape the ducts into the adjacent tissues to establish the invasive carcinomas. In summary, we have developed a novel method for single cell DNA sequencing that preserves spatial context, and applied this method to understand clonal evolution during the transition from carcinoma in situ to invasive ductal carcinoma.
Gene Set Enrichment and Projection: A Computational Tool for Knowledge Discovery in Transcriptomes
Explaining the mechanism behind a genetic disease involves two phases: collecting and analyzing data associated with the disease, then interpreting those data in the context of biological systems. The objective of this dissertation was to develop a method of integrating complementary datasets surrounding any single biological process, with the goal of presenting the response to a signal in terms of a set of downstream biological effects. This dissertation specifically tests the hypothesis that computational projection methods overlaid with domain expertise can direct research towards relevant systems-level signals underlying complex genetic disease. To this end, I developed a software algorithm named Geneset Enrichment and Projection Displays (GSEPD) that can visualize multidimensional gene expression to identify the biologically relevant gene sets that are altered in response to a biological process. This dissertation highlights a problem of data interpretation facing the medical research community, and shows how computational sciences can help. By bringing annotation and expression datasets together, a new analytical and software method was produced that helps unravel complicated experimental and biological data. The dissertation presents four coauthored studies in which experts in their fields wanted to attach functional significance to a gene-centric experiment. GSEPD shows inherently high-dimensional data as a simple colored graph, using a subspace vector projection to calculate directly how closely each sample behaves like the test conditions. The end-user medical researcher understands their data as a series of somewhat-independent subsystems, and GSEPD provides a dimensionality reduction for high-throughput experiments of limited sample size. Gene Ontology analyses are accessible on a sample-to-sample level, and this work highlights not just the expected biological systems, but many annotated results available in vast online databases.
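The abstract describes a subspace vector projection that scores how closely each sample resembles the test conditions, without giving the formula. One plausible reading, sketched here as an assumption rather than as GSEPD's actual implementation, projects a sample's expression vector for one gene set onto the line joining the centroids of two conditions. A score near 0 means the sample sits at the first condition's centroid and a score near 1 means it sits at the second. The names `centroid` and `projection_score` and the toy values are hypothetical.

```python
def centroid(samples):
    """Mean expression vector of a list of equal-length sample vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def projection_score(sample, group_a, group_b):
    """Position of `sample` along the axis from group A's to group B's centroid.

    All vectors hold expression values for the genes of one gene set.
    This is an illustrative scoring rule, not the published GSEPD formula.
    """
    a = centroid(group_a)
    b = centroid(group_b)
    axis = [bi - ai for ai, bi in zip(a, b)]
    norm_sq = sum(d * d for d in axis)
    if norm_sq == 0:
        raise ValueError("the two condition centroids coincide")
    offset = [xi - ai for xi, ai in zip(sample, a)]
    # Scalar projection of the offset onto the axis, scaled by axis length.
    return sum(o * d for o, d in zip(offset, axis)) / norm_sq


# Toy two-gene example with made-up expression values.
control = [[0.0, 0.0], [0.0, 2.0]]
treated = [[4.0, 0.0], [4.0, 2.0]]
score = projection_score([1.0, 5.0], control, treated)
print(score)
```

The second gene moves the toy sample away from the axis but does not change its score, which is the sense in which a projection reduces many dimensions to one interpretable value per gene set.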
SuSE : Subspace Selection embedded in an EM algorithm
Subspace clustering is an extension of traditional clustering that seeks to find clusters embedded in different subspaces within a dataset. This is a particularly important challenge with high-dimensional data, where the curse of dimensionality occurs. It also has the benefit of providing smaller descriptions of the clusters found. In this field, we show that using probabilistic models provides many advantages over other existing methods. In particular, we show that the difficult problem of setting the parameters of subspace clustering algorithms can be seen as a model selection problem in the framework of probabilistic models. This allows us to design a method that does not require any input parameter from the user. We also point out the value of allowing the clusters to overlap. Finally, we show that the method is well suited to detecting the noise that may exist in the data, and that this helps to provide a more understandable representation of the clusters found.
Infectious Disease Ontology
Technological developments have resulted in tremendous increases in the volume and diversity of the data and information that must be processed in the course of biomedical and clinical research and practice. Researchers are at the same time under ever greater pressure to share data and to take steps to ensure that data resources are interoperable. The use of ontologies to annotate data has proven successful in supporting these goals and in providing new possibilities for the automated processing of data and information. In this chapter, we describe different types of vocabulary resources and emphasize those features of formal ontologies that make them most useful for computational applications. We describe current uses of ontologies and discuss future goals for ontology-based computing, focusing on its use in the field of infectious diseases. We review the largest and most widely used vocabulary resources relevant to the study of infectious diseases and conclude with a description of the Infectious Disease Ontology (IDO) suite of interoperable ontology modules that together cover the entire infectious disease domain.
Large-scale single-cell transcriptomics of osteosarcoma reveals extensive and different heterogeneity in primary tumors versus murine xenograft model
Heterogeneity within tumors has long been studied as a potential confounding factor for effective therapies, with recent studies pointing to heterogeneity resulting in distinct clonal subtypes, each with varying degrees of fitness and metastatic potential. Studies of heterogeneity have previously been limited to microscopy observations, immunohistochemistry, and flow cytometry. Recently, however, it has become possible to examine heterogeneity at a previously unexplored level: the transcriptome of individual cells.
Osteosarcomas have been known to be highly heterogeneous, so we have selected osteosarcoma as our primary tumor to study as a proof-of-concept. Additionally, we have elected to create a murine patient derived xenograft (PDX) model from a primary osteosarcoma tumor and examine differences between the primary tumor and resulting xenograft at the single-cell level. Through this, we hope to better understand tumor heterogeneity and add to the current discussion in the scientific community regarding the relevance of PDX models for testing promising new therapies and personalized medicine.
Through our examination of single-cell heterogeneity in osteosarcomas, we have confirmed the extensive heterogeneity previously reported, but this time at the level of mRNA. The osteosarcomas were so heterogeneous that our resulting dataset of over 1,000 cells still did not have enough resolution to generate highly differentiated and separate groupings of cells. Upon examining inter-tumor heterogeneity, we observed that cells from different tumors generally clustered separately. However, certain populations of cells from all tumors clustered together. We also generated a PDX model and sequenced the resulting tumor, observing markedly reduced heterogeneity compared to the original primary tumor. Importantly, the cells from the PDX model clustered within the larger group of cells from the original tumor, lending credence to the theory of clonal selection.
This work presents evidence of extensive intra- and inter-tumor heterogeneity at the mRNA level within osteosarcoma tumors. This heterogeneity requires further single-cell sampling to shed light on the biology of tumor diversity. Further, this heterogeneity is significantly reduced in a generated murine PDX model. This difference should serve as a potential warning about additional factors to take into account when evaluating therapies in PDX models, and suggests that further studies examining the cause and effect of this observed heterogeneity are warranted.
Unsupervised Discovery and Representation of Subspace Trends in Massive Biomedical Datasets
The goal of this dissertation is to develop unsupervised algorithms for discovering previously unknown subspace trends in massive multivariate biomedical data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual/progressive changes within an unknown subset of feature dimensions. A fundamental challenge to subspace trend discovery is the presence of irrelevant data dimensions, noise, outliers, and confusion from multiple subspace trends driven by independent factors that are mixed in with each other. These factors can obscure the trends in traditional dimension reduction and projection based data visualizations. To overcome these limitations, we propose a novel graph-theoretic neighborhood similarity measure for sensing concordant progressive changes across data dimensions. Using this measure, we present an unsupervised algorithm for trend-relevant feature selection and visualization. Additionally, we propose to use an efficient online density-based representation to make the algorithm scalable for massive datasets.
The representation not only assists in trend discovery, but also in cluster detection including rare populations. Our method has been successfully applied to diverse synthetic and real-world biomedical datasets, such as gene expression microarray and arbor morphology of neurons and microglia in brain tissue. Derived representations revealed biologically meaningful hidden subspace trend(s) that were obscured by irrelevant features and noise. Although our applications are mostly from the biomedical domain, the proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data including visualization, hypothesis generation, knowledge discovery, and prediction in diverse other applications. Electrical and Computer Engineering, Department o
Pathway level subtyping identifies a slow-cycling biological phenotype associated with poor clinical outcomes in colorectal cancer
Molecular stratification using gene-level transcriptional data has identified subtypes with distinctive genotypic and phenotypic traits, as exemplified by the consensus molecular subtypes (CMS) in colorectal cancer (CRC). Here, rather than gene-level data, we make use of gene ontology and biological activation state information for initial molecular class discovery. In doing so, we defined three pathway-derived subtypes (PDS) in CRC: PDS1 tumors, which are canonical/LGR5+ stem-rich, highly proliferative and display good prognosis; PDS2 tumors, which are regenerative/ANXA1+ stem-rich, with elevated stromal and immune tumor microenvironmental lineages; and PDS3 tumors, which represent a previously overlooked slow-cycling subset of tumors within CMS2 with reduced stem populations and increased differentiated lineages, particularly enterocytes and enteroendocrine cells, yet display the worst prognosis in locally advanced disease. These PDS3 phenotypic traits are evident across numerous bulk and single-cell datasets, and demarcate a series of subtle biological states that are currently under-represented in pre-clinical models and are not identified using existing subtyping classifiers.