14,074 research outputs found

    Sparse integrative clustering of multiple omics data sets

    High-resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation and gene expression associated with a disease. An integrated genomic profiling approach measures multiple omics data types simultaneously in the same set of biological samples. Such an approach renders an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996) 267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 91-108] methods to induce sparsity in the coefficient vectors, revealing important genomic features that have significant contributions to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. In model selection, a uniform design [Monographs on Statistics and Applied Probability (1994) Chapman & Hall] is used to seek "experimental" points that are scattered uniformly across the search domain for efficient sampling of tuning parameter combinations. We compared our method to sparse singular value decomposition (SVD) and penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic and transcriptomic data for subtype analysis in breast and lung cancer data sets. Comment: Published at http://dx.doi.org/10.1214/12-AOAS578 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
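For orientation, the three sparsity penalties named in this abstract are usually written as below. This is a sketch in assumed notation, not taken from the abstract itself: \beta is a coefficient vector of length p, and \lambda, \lambda_1, \lambda_2 are nonnegative tuning parameters. The paper's own formulation may use different symbols or scaling.

```latex
% Standard textbook forms; symbols are assumptions, not the paper's notation.
\begin{align*}
  \text{lasso:}       \quad & P(\beta) = \lambda \sum_{j=1}^{p} |\beta_j| \\
  \text{elastic net:} \quad & P(\beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j|
                              + \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \\
  \text{fused lasso:} \quad & P(\beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j|
                              + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}|
\end{align*}
```

The fused lasso term penalizes differences between neighbouring coefficients, which suits features with a natural ordering such as positions along the genome.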

    A computational pipeline to identify phenotypic manifestations related to genes

    Master's thesis, Bioinformatics and Computational Biology, 2022, Universidade de Lisboa, Faculdade de Ciências.
    In most neurodevelopmental diseases, a proportion of patients carries a known gene mutation directly linked to their illness. Autism Spectrum Disorder (ASD) is a neurodevelopmental pathology with a very heterogeneous clinical presentation (Cummings et al., 2005). ASD is characterized by repetitive patterns of actions or interests and by difficulties or limitations in social interaction and communication that appear from childhood. These symptoms affect more men than women and can vary in severity. Perhaps the greatest advance in understanding the pathophysiology of ASD is the recognition of the genetic contribution to its etiology, helped by the emergence of NGS and WES methods (Daniel H. Geschwind, 2011; Asif et al., 2018). Several genes and mutations are associated with ASD, which points to a heterogeneous origin of the disease. The combination of a complex and poorly understood genetic architecture, phenotypic heterogeneity and the involvement of multiple interacting loci hinders efforts to discover the genes with specific mutations that lead to ASD. Consequently, the genetic etiology of disorders related to ASD remains largely unknown (Gupta et al., 2006). Several studies have demonstrated that duplications or deletions of genome segments called Copy Number Variants (CNVs), single nucleotide polymorphisms (SNPs) and single nucleotide variants (SNVs) are likely to have a causal role in ASD (Chang et al., 2014; Soler et al., 2018). Establishing the relationship between different genes and the ASD phenotype variants may facilitate diagnosis and thus enable patients to obtain the most efficient and specific treatment at a younger age. Due to recent advances in genomic technologies, large-scale genetic studies are revealing large numbers of genetic variants that potentially contribute to disease risk. The global objective of this work was to propose a pipeline to identify the phenotypic manifestation of putative disease-causing genetic variants. For this purpose, two specific objectives were pursued: (1) identification of clusters of functionally similar genes; (2) inference of the disease phenotype for each cluster separately. To achieve these goals, a dataset containing 3707 genes from patients diagnosed with ASD was used. Tools such as DishIn and GoSemSim were applied to this dataset to calculate the semantic similarity between pairs of genes, yielding a square semantic similarity matrix. The tools obtain this value by quantifying the information shared between two GO terms associated with each gene, measured as the information content of the most informative common ancestor of the two terms. The information-content similarity measures used in this work are Lin, Jiang & Conrath and Rel. From the semantic similarity matrix, a distance matrix is calculated, to which the DBSCAN, K-means and hierarchical clustering algorithms are applied to obtain clusters of functionally similar genes. After analyzing the results, it was possible to conclude that genes disrupted by genetic variants in patients can be clustered using semantic similarity measures. Clustered genes were functionally similar, as also indicated by gene interaction networks, and can lead to different ASD sub-phenotypes. Gene clusters were enriched for different pathways and phenotypes related to ASD subtypes.
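The similarity-to-clusters step described in this abstract can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the thesis pipeline: the function names are hypothetical, the similarity values are toy numbers, and single-linkage clustering cut at a fixed distance threshold stands in for the DBSCAN, K-means and hierarchical algorithms the thesis actually applies. The Lin measure is shown in its standard form, 2 * IC(MICA) / (IC(a) + IC(b)).

```python
import math


def information_content(term_prob):
    """IC(t) = -log p(t), where p(t) is the annotation frequency of term t."""
    return -math.log(term_prob)


def lin_similarity(ic_a, ic_b, ic_mica):
    """Lin similarity from the IC of two terms and of their most
    informative common ancestor (MICA)."""
    denom = ic_a + ic_b
    return 0.0 if denom == 0 else 2.0 * ic_mica / denom


def similarity_to_distance(sim):
    """Convert a similarity matrix with values in [0, 1] to distances 1 - s."""
    return [[1.0 - s for s in row] for row in sim]


def single_linkage_clusters(dist, threshold):
    """Single-linkage clusters at a distance cut-off: items end up together
    when a chain of pairwise distances <= threshold connects them."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy gene-by-gene similarity matrix (hypothetical values).
    toy_sim = [
        [1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.2],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.2, 0.8, 1.0],
    ]
    print(single_linkage_clusters(similarity_to_distance(toy_sim), 0.5))
```

Using 1 - s as the distance is one common choice for similarities bounded in [0, 1]; the abstract does not say which transformation the thesis uses.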

    Identification of transcriptional regulatory networks specific to pilocytic astrocytoma.

    Background: Pilocytic Astrocytomas (PAs) are common low-grade central nervous system malignancies for which few recurrent and specific genetic alterations have been identified. In an effort to better understand the molecular biology underlying the pathogenesis of these pediatric brain tumors, we performed higher-order transcriptional network analysis of a large gene expression dataset to identify gene regulatory pathways that are specific to this tumor type, relative to other, more aggressive glial or histologically distinct brain tumours.
    Methods: RNA derived from frozen human PA tumours was subjected to microarray-based gene expression profiling, using Affymetrix U133Plus2 GeneChip microarrays. This data set was compared to similar data sets previously generated from non-malignant human brain tissue and other brain tumour types, after appropriate normalization.
    Results: In this study, we examined gene expression in 66 PA tumors compared to 15 non-malignant cortical brain tissues, and identified 792 genes that demonstrated consistent differential expression between independent sets of PA and non-malignant specimens. From this entire 792 gene set, we used the previously described PAP tool to assemble a core transcriptional regulatory network composed of 6 transcription factor genes (TFs) and 24 target genes, for a total of 55 interactions. A similar analysis of oligodendroglioma and glioblastoma multiforme (GBM) gene expression data sets identified distinct, but overlapping, networks. Most importantly, comparison of each of the brain tumor type-specific networks revealed a network unique to PA that included repressed expression of ONECUT2, a gene frequently methylated in other tumor types, and 13 other uniquely predicted TF-gene interactions.
    Conclusions: These results suggest specific transcriptional pathways that may operate to create the unique molecular phenotype of PA and thus opportunities for corresponding targeted therapeutic intervention. Moreover, this study also demonstrates how integration of gene expression data with TF-gene and TF-TF interaction data is a powerful approach to generating testable hypotheses to better understand cell-type specific genetic programs relevant to cancer.

    Centronuclear myopathy in labrador retrievers: a recent founder mutation in the PTPLA gene has rapidly disseminated worldwide

    Centronuclear myopathies (CNM) are inherited congenital disorders characterized by an excessive number of internalized nuclei. In humans, CNM results from ~70 mutations in three major genes from the myotubularin, dynamin and amphiphysin families. Analysis of animal models with altered expression of these genes revealed common defects in all forms of CNM, paving the way for unified pathogenic and therapeutic mechanisms. Despite these efforts, some CNM cases remain genetically unresolved. We previously identified an autosomal recessive form of CNM in French Labrador retrievers from an experimental pedigree, and showed that a loss-of-function mutation in the protein tyrosine phosphatase-like A (PTPLA) gene segregated with CNM. Around the world, client-owned Labrador retrievers with a similar clinical presentation and histopathological changes in muscle biopsies have been described. We hypothesized that these Labradors share the same PTPLA<sup>cnm</sup> mutation. Genotyping of an international panel of 7,426 Labradors led to the identification of PTPLA<sup>cnm</sup> carriers in 13 countries. Haplotype analysis demonstrated that the PTPLA<sup>cnm</sup> allele resulted from a single and recent mutational event that may have rapidly disseminated through the extensive use of popular sires. PTPLA-deficient Labradors will help define the integrated role of PTPLA in the existing CNM gene network. They will be valuable complementary large animal models to test innovative therapies in CNM

    Towards knowledge-based gene expression data mining

    The field of gene expression data analysis has grown in the past few years from being purely data-centric to integrative, aiming at complementing microarray analysis with data and knowledge from diverse available sources. In this review, we report on the plethora of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues