15 research outputs found

    Unsupervised machine learning of high dimensional data for patient stratification

    The development mechanisms of numerous complex, rare diseases are largely unknown to scientists, partly due to their multifaceted heterogeneity. Stratifying patients is becoming a very important objective as we further research that inherent heterogeneity, which can be utilised towards personalised medicine. However, considerable difficulties slow down accurate patient stratification, mainly outdated clinical criteria, weak associations and simple symptom categories. Fortunately, immense steps have been taken towards the generation and utilisation of multiple omic data, aiming to produce new insights, as in exploratory machine learning, which has shown the potential to identify the source of disease mechanisms from patient subgroups. This work describes the development of a modular clustering toolkit, named Omada, designed to assist researchers in exploring disease heterogeneity without extensive expertise in the machine learning field. Subsequently, it assesses Omada’s capabilities and validity by testing the toolkit on multiple data modalities from pulmonary hypertension (PH) patients. I first demonstrate the toolkit’s ability to create biologically meaningful subgroups based on whole blood RNA-seq data from H/IPAH patients in the manuscript “Biological heterogeneity in idiopathic pulmonary arterial hypertension identified through unsupervised transcriptomic profiling of whole blood”. Our work on the manuscript titled “Diagnostic miRNA signatures for treatable forms of pulmonary hypertension highlight challenges with clinical classification” aimed to apply the same clustering approach to a PH microRNA dataset as a first step in forming microRNA diagnostic signatures, recognising the potential of microRNA expression to identify diverse disease sub-populations irrespective of pre-existing PH classes. The toolkit’s effectiveness on metabolite data was also tested.
Lastly, a longitudinal clustering approach was explored on activity readouts from wearables worn by COVID-19 patients as part of our manuscript “Unsupervised machine learning identifies and associates trajectory patterns of COVID-19 symptoms and physical activity measured via a smart watch”. Two clusters of high and low activity trajectories were generated and associated with symptom classes, showing a weak but interesting relationship between the two. In summary, this thesis examines the potential of patient stratification based on several data types from patients that represent a new, unseen picture of disease mechanisms. The tools presented provide important indications of distinct patient groups and could generate the insights needed for further targeted research and clinical associations that can help towards understanding rare, complex diseases.

    The Pathway Coexpression Network: Revealing pathway relationships.

    A goal of genomics is to understand the relationships between biological processes. Pathways contribute to functional interplay within biological processes through complex but poorly understood interactions. However, limited functional references for global pathway relationships exist. Pathways from databases such as KEGG and Reactome provide discrete annotations of biological processes. Their relationships are currently either inferred from gene set enrichment within specific experiments, or by simple overlap, linking pathway annotations that have genes in common. Here, we provide a unifying interpretation of functional interaction between pathways by systematically quantifying coexpression between 1,330 canonical pathways from the Molecular Signatures Database (MSigDB) to establish the Pathway Coexpression Network (PCxN). We estimated the correlation between canonical pathways valid in a broad context using a curated collection of 3,207 microarrays from 72 normal human tissues. PCxN accounts for shared genes between annotations to estimate significant correlations between pathways with related functions rather than with similar annotations. We demonstrate that PCxN provides novel insight into mechanisms of complex diseases using an Alzheimer's Disease (AD) case study. PCxN retrieved pathways significantly correlated with an expert curated AD gene list. These pathways have known associations with AD and were significantly enriched for genes independently associated with AD. As a further step, we show how PCxN complements the results of gene set enrichment methods by revealing relationships between enriched pathways, and by identifying additional highly correlated pathways. PCxN revealed that correlated pathways from an AD expression profiling study include functional clusters involved in cell adhesion and oxidative stress. PCxN provides expanded connections to pathways from the extracellular matrix. 
PCxN provides a powerful new framework for interrogation of global pathway relationships. Comprehensive exploration of PCxN can be performed at http://pcxn.org/

    Biological heterogeneity in idiopathic pulmonary arterial hypertension identified through unsupervised transcriptomic profiling of whole blood

    Idiopathic pulmonary arterial hypertension (IPAH) is a rare but fatal disease diagnosed by right heart catheterisation and the exclusion of other forms of pulmonary arterial hypertension, producing a heterogeneous population with varied treatment response. Here we show unsupervised machine learning identification of three major patient subgroups that account for 92% of the cohort, each with unique whole blood transcriptomic and clinical feature signatures. These subgroups are associated with poor, moderate, and good prognosis. The poor prognosis subgroup is associated with upregulation of ALAS2 and downregulation of several immunoglobulin genes, while the good prognosis subgroup is defined by upregulation of the bone morphogenetic protein signalling regulator NOG, and the C/C variant of HLA-DPA1/DPB1 (independently associated with survival). These findings, independently validated, provide evidence for the existence of three major subgroups (endophenotypes) within the IPAH classification, and could improve risk stratification and provide molecular insights into the pathogenesis of IPAH.

    PathSys: An infrastructure for functional molecular data sharing

    No full text
    Data integration at the level of high dimensional molecular interrogation is confounded by the diaspora of platforms and annotations of molecular events. To unify interpretation of functional activity within and between samples, we are developing a suite of tools that confer a highly standardised representation of pathway activity, networked pathway activity correlation, and pathway/disease/drug interaction. We have discovered that, by applying the concept of higher order gene set interactions and using gene sets as the unit of comparison, we are able to unify very large sets of data without a reliance on gene set overlap.
    Pathprint is the most developed of our set of tools: a functional approach that compares gene expression as a tertiary summary statistic for each canonical pathway, generating a set of pathway activities, networks and transcriptionally regulated targets. It compares a sample against a background of thousands of arrays to yield a relative activity for each pathway tested. It can be applied universally to gene expression profiles across species. Integration of large-scale profiling methods and curation of the public repository overcomes platform, species and batch effects to yield a standard measure of functional distance between experiments. Pathprint v2.0, shortly available through Bioconductor, includes 35 platforms, with new additions effectively increasing the number of covered arrays to 446,708, providing a 4x increase in background for pathway comparisons. Pathprint is utilised by the Harvard Stem Cell commons (http://stemcellcommons.org) as part of standardisation for representation and comparisons of stem cell systems. It is being implemented within the Genometranslationcommons (https://beta.genometranslationcommons.org//#/) at the University of Sheffield and the CureADCircuitscommons (in dev) as part of a Harvard/MIT/Sheffield consortium investigating regulation of genes associated with Alzheimer’s.
    PCxN (the Pathway Co-Expression Network) (Hide, Winston (2015): PCxN the Pathway co-activity Map. figshare. https://doi.org/10.6084/m9.figshare.1589792.v4) is an online web resource which allows the discovery of correlation relationships between groups of pathways or gene sets drawn from the MSigDB and Pathprint collections. The tool lets users explore a static extendable network by focusing on single pathways and their most correlated neighbours, as well as identify relationships between groups of pathways shown to be enriched in the collections by gene set enrichment. Analyses can be viewed and exported through a heatmap, a correlation network and gene/network tables. PCxN is employed as part of the CureADCircuits consortium (publication in prep) and is deployed for interpretation of network and pathway relationships by the AMP-AD consortium.
    PDN (Pathway Drug Network), currently in development, relies on a network made up of the expression correlation between each of 16,150 drug, disease and pathway gene signatures across 58,475 publicly available human microarrays (Affymetrix HGU133 Plus2) collected from the Comparative Toxicogenomics Database, PharmGKB, GeneSigDB, Wikipathways, KEGG, Netpath, Reactome, and Connectivity Map. PDN aims to utilise pathway-drug relationships to identify drug leads and to prioritise pathways that can be targeted in relation to disease profiles. Its prototype has been successfully used together with Pathprint at Harvard School of Public Health (Joachim R., Altschuler G., Hutchinson J., Wong H., Hide W., Kobzik L.: Pathway-centered Analysis of the Relative Resistance of Children to Sepsis Mortality, in preparation). We have shown that PDN has a substantially higher rate of positives (p<0.01) when compared to a purely gene-level ConnectivityMap analysis (54% vs. 27%). In direct testing of drug candidates using an endotoxemia model of murine sepsis, 5 of 10 compounds improved survival.
    Taken as a whole, these approaches provide the first standardised approach to representation of systems biology, with significant new insight into the systems level interpretation of gene set activity and correlation between gene sets.

    PathSys: Integrating pathway curation, profiling methods, and public repositories: An infrastructure for functional molecular data sharing

    No full text
    Data integration at the level of high dimensional molecular interrogation is confounded by the diaspora of platforms and annotations of molecular events. To unify interpretation of functional activity within and between samples, we are developing a suite of tools that confer a highly standardised representation of pathway activity, networked pathway activity correlation, and pathway/disease/drug interaction. We have discovered that, by applying the concept of higher order gene set interactions and using gene sets as the unit of comparison, we are able to unify very large sets of data without a reliance on gene set overlap.
    Pathprint is the most developed of our set of tools: a functional approach that compares gene expression as a tertiary summary statistic for each canonical pathway, generating a set of pathway activities, networks and transcriptionally regulated targets. It compares a sample against a background of thousands of arrays to yield a relative activity for each pathway tested. It can be applied universally to gene expression profiles across species. Integration of large-scale profiling methods and curation of the public repository overcomes platform, species and batch effects to yield a standard measure of functional distance between experiments. Pathprint v2.0, shortly available through Bioconductor, includes 35 platforms, with new additions effectively increasing the number of covered arrays to 446,708, providing a 4x increase in background for pathway comparisons. Pathprint is utilised by the Harvard Stem Cell commons (http://stemcellcommons.org) as part of standardisation for representation and comparisons of stem cell systems. It is being implemented within the Genometranslationcommons (https://beta.genometranslationcommons.org//#/) at the University of Sheffield and the CureADCircuitscommons (in dev) as part of a Harvard/MIT/Sheffield consortium investigating regulation of genes associated with Alzheimer’s.
    PCxN (the Pathway Co-Expression Network) (Hide, Winston (2015): PCxN the Pathway co-activity Map. figshare. https://doi.org/10.6084/m9.figshare.1589792.v4) is an online web resource which allows the discovery of correlation relationships between groups of pathways or gene sets drawn from the MSigDB and Pathprint collections. The tool lets users explore a static extendable network by focusing on single pathways and their most correlated neighbours, as well as identify relationships between groups of pathways shown to be enriched in the collections by gene set enrichment. Analyses can be viewed and exported through a heatmap, a correlation network and gene/network tables. PCxN is employed as part of the CureADCircuits consortium (publication in prep) and is deployed for interpretation of network and pathway relationships by the AMP-AD consortium.
    PDN (Pathway Drug Network), currently in development, relies on a network made up of the expression correlation between each of 16,150 drug, disease and pathway gene signatures across 58,475 publicly available human microarrays (Affymetrix HGU133 Plus2) collected from the Comparative Toxicogenomics Database, PharmGKB, GeneSigDB, Wikipathways, KEGG, Netpath, Reactome, and Connectivity Map. PDN aims to utilise pathway-drug relationships to identify drug leads and to prioritise pathways that can be targeted in relation to disease profiles. Its prototype has been successfully used together with Pathprint at Harvard School of Public Health (Joachim R., Altschuler G., Hutchinson J., Wong H., Hide W., Kobzik L.: Pathway-centered Analysis of the Relative Resistance of Children to Sepsis Mortality, in preparation). We have shown that PDN has a substantially higher rate of positives (p<0.01) when compared to a purely gene-level ConnectivityMap analysis (54% vs. 27%). In direct testing of drug candidates using an endotoxemia model of murine sepsis, 5 of 10 compounds improved survival.
    Taken as a whole, these approaches provide the first standardised approach to representation of systems biology, with significant new insight into the systems level interpretation of gene set activity and correlation between gene sets.

    Pathway Coexpression Network (PCxN) overview.

    No full text
    (1) Human gene expression arrays for normal human tissues curated from GEO in Barcode 3.0. (2) The gene expression levels were replaced by their ranks so all arrays share a common scale. (3) For each microarray experiment, we first estimated the pathway expression based on the mean of the expression ranks, then the pathway correlation adjusted for shared genes, and tested the significance of the correlation. (4) We aggregated the experiment-level estimates to get the global pathway correlation and its corresponding significance. (5) We built a pathway coexpression network based on the significant pathway correlations.
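    Steps (2) to (4) of the overview can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not say how the correlation is adjusted for shared genes or how experiment-level estimates are aggregated, so the partial correlation (conditioning on the mean rank of the shared genes) and the sample-size weighted mean used here are assumptions, and all function names are hypothetical. Significance testing is omitted.

```python
from math import sqrt
from statistics import mean


def rank_transform(values):
    """Step 2: replace expression values by ranks (1 = lowest, ties share the average rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def pathway_correlation(arrays, gene_names, set_a, set_b):
    """Step 3 for one experiment.

    arrays: one list of expression values per sample, ordered like gene_names.
    """
    column = {gene: i for i, gene in enumerate(gene_names)}
    ranked = [rank_transform(sample) for sample in arrays]

    def summary(genes):
        # Pathway expression = mean of the expression ranks of its genes.
        return [mean(sample[column[g]] for g in genes) for sample in ranked]

    shared = set_a & set_b
    if not shared:
        return pearson(summary(set_a), summary(set_b))
    # ASSUMPTION: adjust for shared genes by conditioning on their mean rank.
    return partial_correlation(
        summary(set_a - shared), summary(set_b - shared), summary(shared)
    )


def aggregate(estimates):
    """Step 4. ASSUMPTION: sample-size weighted mean of (correlation, n_samples) pairs."""
    total = sum(n for _, n in estimates)
    return sum(r * n for r, n in estimates) / total
```

    In this sketch a pathway pair is scored once per experiment with `pathway_correlation`, and the per-experiment values are then combined with `aggregate`; step (5) would keep only the pairs whose aggregated correlation is significant.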

    Significant correlations between the ribosome pathway and impact of gene overlap.

    No full text
    (A) Boxplots of the correlation estimates between the Ribosome gene sets and random gene sets, and receiver operating characteristic (ROC) curves with the corresponding area under the curve (AUC) values in parentheses under different degrees of overlap: no overlap, low overlap (overlap coefficient 0.0469, AUC = 1), medium overlap (overlap coefficient 0.5517, AUC = 0.9915) and high overlap (overlap coefficient 0.8532, AUC = 0.9528). The shape of the node in the following networks corresponds to the pathway database. For coexpression networks, the edge color indicates the value of the correlation and edge width is proportional to the correlation magnitude. For the overlap networks, the edge width is proportional to the overlap coefficient. (B) Pathway coexpression and overlap network for the KEGG and Reactome annotations of the Cell Cycle and DNA Replication pathways. These pathways have related functions and share genes between them. (C) Pathway coexpression network and overlap network for different versions of the Wnt Signaling pathway. In the coexpression network, missing edges correspond to correlations that are not significant. These pathway annotations are redundant and represent the same function. (D) The stacked bar plot shows the number of pathway pairs with only significant correlations in red, with only significant overlaps in yellow, and with both in orange. The boxplots show the distribution of the correlation coefficients for pathway pairs with only significant correlations (red) and with both significant overlaps and significant correlations (orange). (E) Pathway coexpression network for the Reactome pathways related to the mitotic metaphase of the cell cycle with significant correlations but no shared genes. (F) Overlap network for Reactome pathways related to the mitotic cell cycle with significant overlaps but no significant correlations. (G) Pathway coexpression network and overlap network for cell cycle phases and related processes from Reactome with both significant correlations and significant overlaps.
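    The caption reports overlap coefficients without defining them. A minimal sketch, assuming the standard (Szymkiewicz-Simpson) definition, the size of the intersection divided by the size of the smaller set, is given below; the function name and the toy gene sets are illustrative only.

```python
def overlap_coefficient(set_a, set_b):
    """Shared genes relative to the smaller gene set: |A & B| / min(|A|, |B|)."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        raise ValueError("both gene sets must be non-empty")
    return len(a & b) / min(len(a), len(b))


# Toy gene sets (hypothetical): two of the three genes in the smaller set are shared.
pathway_1 = {"GENE_A", "GENE_B", "GENE_C"}
pathway_2 = {"GENE_B", "GENE_C", "GENE_D", "GENE_E"}
print(overlap_coefficient(pathway_1, pathway_2))
```

    Under this definition a value of 1 means the smaller gene set is fully contained in the larger one, which is why redundant annotations of the same pathway show wide edges in an overlap network.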

    Canonical pathways correlated with the Alzheimer’s disease curated list.

    No full text
    The ADCL is colored in blue. Neighbors without genes in common with the ADCL are highlighted in green. The shape of the node corresponds to the pathway database. For the coexpression network, the edge color indicates the value of the correlation and the edge width is proportional to the correlation magnitude. For the overlap network, the edge width is proportional to the overlap coefficient. (A) Pathway coexpression network for the top pathways correlated with the ADCL (by correlation magnitude). All correlated pathways have established associations with AD: GPVI Mediated Activation Cascade [79], IL-3, 5 and GM-CSF signalling [80], Antigen Processing Cross Presentation [81], PDGFRB Pathway [83], Toll Pathway [84], Regulation of Signaling by CBL [82], Toll-like Receptor Signaling [85], Activation of IRF3/IRF7 Mediated by TBK1/IKK Epsilon [85], Cell Surface Interactions at the Vascular Wall [86], FCER1 Pathway [87]. (B) Shared genes (overlap coefficient) between the top pathways correlated with the ADCL. (C) Correlation magnitude of all canonical pathways correlated with the ADCL, sorted by the magnitude of their correlation and split into bins of increasing size. (D) Proportion of canonical pathways enriched for the genes within the ADCL (p < 0.001, adjusted with FDR) present in the canonical pathways correlated with the ADCL. (E) Proportion of canonical pathways enriched for genes associated with AD from the Genetic Association Database present in the pathways correlated with the ADCL (p < 0.001, adjusted with FDR). The red line indicates the proportion of all 1,330 canonical pathways enriched for genes within the ADCL.
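    Panels (D) and (E) threshold enrichment p-values after FDR adjustment. The caption does not name the adjustment procedure; the sketch below assumes the commonly used Benjamini-Hochberg step-up method, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical enrichment p-values for four pathways.
raw = [0.0002, 0.04, 0.0001, 0.5]
enriched = [q < 0.001 for q in benjamini_hochberg(raw)]
print(enriched)
```

    A pathway would count as enriched in this sketch only if its adjusted value falls below the 0.001 threshold quoted in the caption.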