Sparse integrative clustering of multiple omics data sets
High resolution microarrays and second-generation sequencing platforms are
powerful tools to investigate genome-wide alterations in DNA copy number,
methylation and gene expression associated with a disease. An integrated
genomic profiling approach measures multiple omics data types simultaneously in
the same set of biological samples. Such an approach renders an integrated data
resolution that would not be available with any single data type. In this
study, we use penalized latent variable regression methods for joint modeling
of multiple omics data types to identify common latent variables that can be
used to cluster patient samples into biologically and clinically relevant
disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996)
267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
91-108] methods to induce sparsity in the coefficient vectors, revealing
important genomic features that have significant contributions to the latent
variables. An iterative ridge regression is used to compute the sparse
coefficient vectors. In model selection, a uniform design [Monographs on
Statistics and Applied Probability (1994) Chapman & Hall] is used to seek
"experimental" points that are scattered uniformly across the search domain for
efficient sampling of tuning parameter combinations. We compared our method to
sparse singular value decomposition (SVD) and penalized Gaussian mixture model
(GMM) using both real and simulated data sets. The proposed method is applied
to integrate genomic, epigenomic and transcriptomic data for subtype analysis
in breast and lung cancer data sets.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS578 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
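The abstract states that an iterative ridge regression is used to compute the sparse coefficient vectors. The listing does not give the authors' algorithm, so the following is only a minimal sketch of one common way to do this: a local quadratic approximation of the lasso penalty, in which each step is a ridge solve with coefficient-specific weights. The function name, the plain regression setting and the toy data are assumptions made for illustration.

```python
import numpy as np

def lasso_by_iterative_ridge(X, y, lam, n_iter=200, eps=1e-8, tol=1e-10):
    """Approximate argmin_b 0.5*||y - X b||^2 + lam*||b||_1 by repeated ridge solves.

    Hypothetical helper, not the authors' implementation.
    """
    p = X.shape[1]
    XtX = X.T @ X
    Xty = X.T @ y
    # Start from an ordinary ridge fit so that no coefficient is exactly zero.
    b = np.linalg.solve(XtX + lam * np.eye(p), Xty)
    for _ in range(n_iter):
        # Local quadratic approximation: |b_j| ~ b_j^2 / (2*|b_j_old|) + const,
        # which turns the lasso penalty into a weighted ridge penalty.
        w = lam / (np.abs(b) + eps)
        b_new = np.linalg.solve(XtX + np.diag(w), Xty)
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    # Coefficients driven towards zero never reach it exactly, so threshold them.
    b[np.abs(b) < 1e-6] = 0.0
    return b

# Toy check with an identity design, where the lasso solution is the
# soft-thresholded response: sign(y) * max(|y| - lam, 0).
X_demo = np.eye(3)
y_demo = np.array([3.0, 0.1, -2.0])
b_demo = lasso_by_iterative_ridge(X_demo, y_demo, lam=1.0)
print(b_demo)  # close to the soft-thresholded values [2, 0, -1]
```

The weighted ridge system is cheap to solve when the number of features is moderate; for very wide omics matrices a coordinate-wise solver may be the more practical choice.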
Extending colonic mucosal microbiome analysis - Assessment of colonic lavage as a proxy for endoscopic colonic biopsies
This study was supported through GI Research funds and MRC Grant Ref: MR/M00533X/1 to GH. Peer reviewed.
Broad and thematic remodeling of the surfaceome and glycoproteome on isogenic cells transformed with driving proliferative oncogenes.
The cell surface proteome, the surfaceome, is the interface for engaging the extracellular space in normal and cancer cells. Here we apply quantitative proteomics of N-linked glycoproteins to reveal how a collection of some 700 surface proteins is dramatically remodeled in an isogenic breast epithelial cell line stably expressing any of six of the most prominent proliferative oncogenes, including the receptor tyrosine kinases, EGFR and HER2, and downstream signaling partners such as KRAS, BRAF, MEK, and AKT. We find that each oncogene has somewhat different surfaceomes, but the functions of these proteins are harmonized by common biological themes including up-regulation of nutrient transporters, down-regulation of adhesion molecules and tumor suppressing phosphatases, and alteration in immune modulators. Addition of a potent MEK inhibitor that blocks MAPK signaling brings each oncogene-induced surfaceome back to a common state reflecting the strong dependence of the oncogene on the MAPK pathway to propagate signaling. Cell surface protein capture is mediated by covalent tagging of surface glycans, yet current methods do not afford sequencing of intact glycopeptides. Thus, we complement the surfaceome data with whole cell glycoproteomics enabled by a recently developed technique called activated ion electron transfer dissociation (AI-ETD). We found massive oncogene-induced changes to the glycoproteome and differential increases in complex hybrid glycans, especially for KRAS and HER2 oncogenes. Overall, these studies provide a broad systems-level view of how specific driver oncogenes remodel the surfaceome and the glycoproteome in a cell autologous fashion, and suggest possible surface targets, and combinations thereof, for drug and biomarker discovery
A computational pipeline to identify phenotypic manifestations related to genes
Master's thesis, Bioinformatics and Computational Biology, 2022, Universidade de Lisboa, Faculdade de Ciências
In most neurodevelopmental diseases, a proportion of patients carries a known gene mutation
directly linked to their illness. Autism Spectrum Disorder (ASD) is a neurodevelopmental
pathology with very heterogeneous clinical presentation (Cummings et al., 2005). ASD is
characterized by symptoms of repetitive patterns of actions or interests, difficulties/limitations in
social interactions and communication that appear from childhood. These symptoms affect more
men than women and can vary in severity.
Perhaps the greatest advance in understanding the pathophysiology of ASD is the recognition
of the genetic contribution to the etiology of ASD with the help of the emergence of NGS and
WES methods (Daniel H. Geschwind, 2011; Asif et al., 2018). There are several genes and
mutations associated with ASD which point to a heterogeneous origin of the disease. A
combination of a complex and poorly understood genetic architecture, phenotypic heterogeneity
and the involvement of multiple loci interacting with one another hinders efforts to discover the
genes with specific mutations that lead to ASD. Consequently, the genetic etiology of disorders
related to ASD remains largely unknown (Gupta et al., 2006). Several studies demonstrated that
duplications or deletions of genome segments called Copy Number Variants (CNVs), single
nucleotide polymorphisms (SNPs) and single nucleotide variants (SNVs) are likely to have a
causal role in ASD (Chang et al., 2014; Soler et al., 2018). The establishment of the relationship
between different genes to the ASD phenotype variants may facilitate the diagnosis of patients
and thus enable patients to obtain the correct treatment at a younger age.
Due to recent advances in genomic technologies, large-scale genetic studies are
unraveling large numbers of genetic variants potentially contributing to disease risk. The global
objective of this work was to propose a pipeline to identify the phenotypic manifestation of
putative disease-causing genetic variants. For this purpose, two specific objectives were pursued:
• Identification of clusters of functionally similar genes;
• Inferring the disease phenotype for each cluster separately.
To achieve these goals, a dataset containing 3707 genes from patients diagnosed
with ASD was used. Tools such as DiShIn and GOSemSim were applied to this dataset to calculate
the semantic similarity of each pair of genes, yielding a square semantic similarity
matrix. The tools quantify the semantic similarity of two GO terms through the
amount of information the terms share, for example the information
content of their most informative common ancestor. The information content-based
similarity measures used in this work are Lin, Jiang & Conrath and Rel. The
semantic similarity matrix is converted to a distance matrix, to which the DBSCAN,
k-means and hierarchical clustering algorithms are applied to obtain clusters
of functionally similar genes.
After analyzing the results, it was possible to conclude that genes that were disrupted by
genetic variants in patients can be clustered using semantic similarity measures. Clustered genes
were functionally similar, as also indicated by gene interaction networks, and can lead to different
ASD sub-phenotypes. Gene clusters were enriched for different pathways and phenotypes that
were related to ASD subtypes.
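The similarity-to-clusters step described in this abstract can be sketched as follows. This is not the thesis pipeline: the Lin measure is written out directly from information content values instead of calling DiShIn or GOSemSim, the gene-level similarity matrix is a made-up toy example, and distance is assumed to be one minus similarity, which the abstract does not specify.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def lin_similarity(ic_a, ic_b, ic_mica):
    """Lin similarity of two terms from their information content (IC)
    and the IC of their most informative common ancestor (MICA)."""
    denom = ic_a + ic_b
    return 2.0 * ic_mica / denom if denom > 0 else 0.0

def cluster_from_similarity(sim, n_clusters):
    """Average-linkage hierarchical clustering of a symmetric similarity matrix
    with values in [0, 1]. Hypothetical helper for illustration."""
    dist = 1.0 - np.asarray(sim, dtype=float)  # assumed similarity-to-distance conversion
    np.fill_diagonal(dist, 0.0)                # squareform requires a zero diagonal
    condensed = squareform(dist, checks=True)  # condensed form expected by linkage
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy gene-by-gene similarity matrix (invented values): genes 0 and 1 are
# functionally close, as are genes 2 and 3.
sim_demo = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
labels_demo = cluster_from_similarity(sim_demo, n_clusters=2)
print(labels_demo)                     # genes 0 and 1 share a label; genes 2 and 3 share the other
print(lin_similarity(2.0, 4.0, 1.5))   # 0.5
```

The same precomputed distance matrix could also be passed to a density-based method such as DBSCAN, whereas k-means needs a coordinate representation of the genes and not pairwise distances.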
Identification of transcriptional regulatory networks specific to pilocytic astrocytoma.
Background: Pilocytic Astrocytomas (PAs) are common low-grade central nervous system malignancies for which few recurrent and specific genetic alterations have been identified. In an effort to better understand the molecular biology underlying the pathogenesis of these pediatric brain tumors, we performed higher-order transcriptional network analysis of a large gene expression dataset to identify gene regulatory pathways that are specific to this tumor type, relative to other, more aggressive glial or histologically distinct brain tumours.
Methods: RNA derived from frozen human PA tumours was subjected to microarray-based gene expression profiling, using Affymetrix U133Plus2 GeneChip microarrays. This data set was compared to similar data sets previously generated from non-malignant human brain tissue and other brain tumour types, after appropriate normalization.
Results: In this study, we examined gene expression in 66 PA tumors compared to 15 non-malignant cortical brain tissues, and identified 792 genes that demonstrated consistent differential expression between independent sets of PA and non-malignant specimens. From this entire 792 gene set, we used the previously described PAP tool to assemble a core transcriptional regulatory network composed of 6 transcription factor genes (TFs) and 24 target genes, for a total of 55 interactions. A similar analysis of oligodendroglioma and glioblastoma multiforme (GBM) gene expression data sets identified distinct, but overlapping, networks. Most importantly, comparison of each of the brain tumor type-specific networks revealed a network unique to PA that included repressed expression of ONECUT2, a gene frequently methylated in other tumor types, and 13 other uniquely predicted TF-gene interactions.
Conclusions: These results suggest specific transcriptional pathways that may operate to create the unique molecular phenotype of PA and thus opportunities for corresponding targeted therapeutic intervention. Moreover, this study also demonstrates how integration of gene expression data with TF-gene and TF-TF interaction data is a powerful approach to generating testable hypotheses to better understand cell-type specific genetic programs relevant to cancer.
Centronuclear myopathy in labrador retrievers: a recent founder mutation in the PTPLA gene has rapidly disseminated worldwide
Centronuclear myopathies (CNM) are inherited congenital disorders characterized by an excessive number of internalized nuclei. In humans, CNM results from ~70 mutations in three major genes from the myotubularin, dynamin and amphiphysin families. Analysis of animal models with altered expression of these genes revealed common defects in all forms of CNM, paving the way for unified pathogenic and therapeutic mechanisms. Despite these efforts, some CNM cases remain genetically unresolved. We previously identified an autosomal recessive form of CNM in French Labrador retrievers from an experimental pedigree, and showed that a loss-of-function mutation in the protein tyrosine phosphatase-like A (PTPLA) gene segregated with CNM. Around the world, client-owned Labrador retrievers with a similar clinical presentation and histopathological changes in muscle biopsies have been described. We hypothesized that these Labradors share the same PTPLA<sup>cnm</sup> mutation. Genotyping of an international panel of 7,426 Labradors led to the identification of PTPLA<sup>cnm</sup> carriers in 13 countries. Haplotype analysis demonstrated that the PTPLA<sup>cnm</sup> allele resulted from a single and recent mutational event that may have rapidly disseminated through the extensive use of popular sires. PTPLA-deficient Labradors will help define the integrated role of PTPLA in the existing CNM gene network. They will be valuable complementary large animal models to test innovative therapies in CNM
Towards knowledge-based gene expression data mining
The field of gene expression data analysis has grown in the past few years from being purely data-centric to integrative, aiming at complementing microarray analysis with data and knowledge from diverse available sources. In this review, we report on the plethora of gene expression data mining techniques and focus on their evolution toward knowledge-based data analysis approaches. In particular, we discuss recent developments in gene expression-based analysis methods used in association and classification studies, phenotyping and reverse engineering of gene networks
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues