4 research outputs found

    Single sample pathway analysis in metabolomics: performance evaluation and application

    Background: Single sample pathway analysis (ssPA) transforms molecular-level omics data to the pathway level, enabling the discovery of patient-specific pathway signatures. Compared to conventional pathway analysis, ssPA overcomes its limitations by enabling multi-group comparisons and facilitating numerous downstream analyses, such as pathway-based machine learning. While ssPA is a widely used technique in transcriptomics, there is little literature evaluating its suitability for metabolomics. Here we provide a benchmark of established ssPA methods (ssGSEA, GSVA, SVD (PLAGE), and z-score), alongside an evaluation of two novel methods we propose, ssClustPA and kPCA, using semi-synthetic metabolomics data. We then demonstrate how ssPA can facilitate pathway-based interpretation of metabolomics data by performing a case study on inflammatory bowel disease mass spectrometry data, using clustering to determine subtype-specific pathway signatures. Results: While GSEA-based and z-score methods outperformed the others in terms of recall, clustering/dimensionality reduction-based methods provided higher precision at moderate-to-high effect sizes. A case study applying ssPA to inflammatory bowel disease data demonstrates how these methods yield a much richer depth of interpretation than conventional approaches, for example by clustering pathway scores to visualise a pathway-based patient subtype-specific correlation network. We also developed the sspa Python package (freely available at https://pypi.org/project/sspa/), providing implementations of all the methods benchmarked in this study. Conclusion: This work underscores the value ssPA methods can add to metabolomic studies and provides a useful reference for those wishing to apply ssPA methods to metabolomics data.
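As a rough illustration of what a single-sample pathway score looks like, the sketch below implements one common formulation of the z-score method: each metabolite is standardised across samples, and a sample's pathway score is the sum of its member z-scores divided by the square root of the number of measured members. It does not use the sspa package API. The function name, the data layout (nested dictionaries) and the toy values are assumptions made purely for illustration, and the sketch assumes every sample reports every metabolite.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_pathway_scores(abundances, pathways):
    """Return {sample: {pathway: score}} using a z-score style aggregation.

    abundances: {sample: {metabolite: value}} (hypothetical layout)
    pathways:   {pathway: set of metabolite names}
    """
    samples = list(abundances)
    metabolites = list(abundances[samples[0]])

    # Standardise each metabolite across samples (constant features get z = 0).
    z = {s: {} for s in samples}
    for m in metabolites:
        values = [abundances[s][m] for s in samples]
        mu, sd = mean(values), stdev(values)
        for s in samples:
            z[s][m] = (abundances[s][m] - mu) / sd if sd > 0 else 0.0

    # Aggregate member z-scores per pathway: sum / sqrt(number of measured members).
    scores = {s: {} for s in samples}
    for name, members in pathways.items():
        measured = [m for m in members if m in metabolites]
        if not measured:
            continue  # pathway has no measured metabolites
        for s in samples:
            scores[s][name] = sum(z[s][m] for m in measured) / sqrt(len(measured))
    return scores


# Toy data: three samples, three metabolites, two made-up pathways.
abundances = {
    "s1": {"A": 1.0, "B": 2.0, "C": 5.0},
    "s2": {"A": 2.0, "B": 4.0, "C": 5.0},
    "s3": {"A": 3.0, "B": 6.0, "C": 5.0},
}
pathways = {"P": {"A", "B"}, "Q": {"C", "X"}}
scores = zscore_pathway_scores(abundances, pathways)
```

The resulting samples-by-pathways table is what makes downstream steps such as clustering or pathway-based machine learning possible, since each sample now has its own pathway-level feature vector.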

    Analysis of gene systems across brain disorders

    This thesis analyses data from genetic and epigenetic studies of brain disorders in order to establish potential convergences of mechanisms across different conditions. Current research highlights the common symptoms across a wide range of brain disorders. We analyse the properties of the gene regulator methyl-CpG binding protein 2 (MeCP2), a chromatin-binding protein and modulator of gene expression, and establish a DNA binding model, Matrix-GC, to predict MeCP2 targets. We evaluate Matrix-GC’s performance using receiver operating characteristic curves while varying a determinant binding factor: guanine-cytosine nucleotide enrichment (GC content). We show that, by combining a DNA binding sequence with GC content, Matrix-GC captures genes bound by MeCP2 better than random chance and better than the binding sequence alone. Matrix-GC is applied to various brain disorders associated with MeCP2, followed by downstream enrichment analysis of molecular pathways and processes. We show three main processes to be under the control of MeCP2 across several brain disorders: neuronal transmission, development, and immunoreactivity. We further validate the performance of Matrix-GC at the single-gene level by comparing MeCP2-bound genes with existing high-throughput transcriptome analysis and show that our results are statistically significant. We carry out stringent control analysis by Monte Carlo permutation to strengthen the reliability of our results. We propose Matrix-GC as an in silico procedure to identify putative MeCP2 target genes and shed light on mechanisms overlapping across different brain disorders. Our method of identifying target genes has broad applications and can be implemented with other proteins that influence gene regulation. Importantly, this research provides a framework for analysing genetic data with statistical rigour, which can be applied to downstream gene set analysis.
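The abstract does not describe how the Monte Carlo permutation control is set up, so the following is only a generic sketch of the idea: compare the observed overlap between a predicted target set and a reference gene set against overlaps obtained from randomly drawn gene sets of the same size. The function name, the gene identifiers and the set sizes are all hypothetical, and this is not presented as the thesis's actual procedure.

```python
import random


def permutation_p_value(predicted, reference, background, n_perm=2000, seed=0):
    """Empirical p-value for the overlap between predicted and reference genes.

    Random gene sets of the same size as `predicted` are drawn from `background`;
    the p-value is the fraction of draws whose overlap with `reference` is at
    least as large as the observed overlap (with an add-one correction).
    """
    rng = random.Random(seed)
    predicted, reference = set(predicted), set(reference)
    pool = sorted(background)  # fixed order keeps the draw reproducible
    observed = len(predicted & reference)

    hits = 0
    for _ in range(n_perm):
        drawn = rng.sample(pool, len(predicted))
        if len(reference.intersection(drawn)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy example with made-up gene identifiers.
background = [f"g{i}" for i in range(200)]
reference = set(background[:20])                        # e.g. genes bound in an assay
predicted = set(background[:10]) | set(background[100:105])  # 10 of 15 are in reference
observed, p_value = permutation_p_value(predicted, reference, background)
```

The add-one correction keeps the estimate from ever being exactly zero, which would overstate what a finite number of permutations can support.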

    Using set theory to reduce redundancy in pathway sets

    Background: The consolidation of pathway databases, such as KEGG, Reactome and ConsensusPathDB, has generated widespread biological interest; however, the issue of pathway redundancy impedes the use of these consolidated datasets. Attempts to reduce this redundancy have focused on visualizing pathway overlap or merging pathways, but the resulting pathways may be of heterogeneous sizes and cover multiple biological functions. Efforts have also been made to deal with redundancy in pathway data by consolidating enriched pathways into a number of clusters or concepts. We present an alternative approach, which generates pathway subsets capable of covering all of the genes present within either pathway databases or enrichment results, generating substantial reductions in redundancy. Results: We propose a method that uses set cover to reduce pathway redundancy without merging pathways. The proposed approach considers three objectives: removal of pathway redundancy, control of pathway size, and coverage of the gene set. By applying set cover to the ConsensusPathDB dataset, we were able to produce a reduced set of pathways representing 100% of the genes in the original data set with 74% less redundancy, or 95% of the genes with 88% less redundancy. We also developed an algorithm to simplify enrichment data and applied it to a set of enriched osteoarthritis pathways, revealing that within the top ten pathways, five were redundant subsets of more enriched pathways. Applying set cover to the enrichment results removed these redundant pathways, allowing more informative pathways to take their place. Conclusion: Our method provides an alternative approach for handling pathway redundancy, while ensuring that the pathways are of homogeneous size and gene coverage is maximised. Pathways are not altered from their original form, allowing biological knowledge regarding the data set to be directly applicable. We demonstrate the ability of the algorithms to prioritise redundancy reduction, pathway size control or gene set coverage. The application of set cover to pathway enrichment results produces an optimised summary of the pathways that best represent the differentially regulated gene set.
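To make the set cover idea concrete, here is a minimal sketch of the standard greedy approximation applied to pathways: repeatedly pick the pathway that adds the most not-yet-covered genes until a target fraction of genes is covered. The abstract does not say which set cover algorithm the authors use, and this sketch ignores their pathway-size objective, so treat it as an assumption-laden illustration only. The function name, the `target_coverage` parameter and the toy pathways are hypothetical.

```python
def greedy_set_cover(pathways, target_coverage=1.0):
    """Greedily select pathways until `target_coverage` of all genes is covered.

    pathways: {pathway name: set of genes}
    Returns (list of chosen pathway names, fraction of genes covered).
    """
    universe = set().union(*pathways.values())
    if not universe:
        return [], 1.0

    needed = target_coverage * len(universe)
    remaining = dict(pathways)
    covered, chosen = set(), []

    while len(covered) < needed and remaining:
        # Pick the pathway contributing the most new genes (ties broken by name).
        best = max(sorted(remaining), key=lambda name: len(set(remaining[name]) - covered))
        gain = set(remaining.pop(best)) - covered
        if not gain:
            break  # nothing left can add coverage
        chosen.append(best)
        covered |= gain
    return chosen, len(covered) / len(universe)


# Toy example: P2, P3 and P5 are subsets of other pathways and are never selected.
toy_pathways = {
    "P1": {"a", "b", "c", "d"},
    "P2": {"a", "b"},
    "P3": {"c", "d"},
    "P4": {"e", "f"},
    "P5": {"f"},
}
full_cover, full_fraction = greedy_set_cover(toy_pathways)
partial_cover, partial_fraction = greedy_set_cover(toy_pathways, target_coverage=0.6)
```

Because pathways are selected rather than merged, each chosen pathway keeps its original membership, which mirrors the abstract's point that pathways are not altered from their original form. Lowering `target_coverage` trades gene coverage for a smaller selection, loosely analogous to the 100% versus 95% coverage results reported above.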