2 research outputs found

    The landscape of expression and alternative splicing variation across human traits

    Understanding the consequences of individual transcriptome variation is fundamental to deciphering human biology and disease. We implement a statistical framework to quantify the contributions of 21 individual traits as drivers of gene expression and alternative splicing variation across 46 human tissues and 781 individuals from the Genotype-Tissue Expression project. We demonstrate that ancestry, sex, age, and BMI make additive and tissue-specific contributions to expression variability, whereas interactions are rare. Variation in splicing is dominated by ancestry and is under genetic control in most tissues, with ribosomal proteins showing a strong enrichment of tissue-shared splicing events. Our analyses reveal a systemic contribution of types 1 and 2 diabetes to tissue transcriptome variation with the strongest signal in the nerve, where histopathology image analysis identifies novel genes related to diabetic neuropathy. Our multi-tissue and multi-trait approach provides an extensive characterization of the main drivers of human transcriptome variation in health and disease.
    This study was funded by the HumTranscriptom project with reference PID2019-107937GA-I00. R.G.-P. was supported by a Juan de la Cierva fellowship (FJC2020-044119-I) funded by MCIN/AEI/10.13039/501100011033 and "European Union NextGenerationEU/PRTR." J.M.R. was supported by a predoctoral fellowship from "la Caixa" Foundation (ID 100010434) with code LCF/BQ/DR22/11950022. A.R.-C. was supported by a Formación Personal Investigador (FPI) fellowship (PRE2019-090193) funded by MCIN/AEI. R.C.-G. was supported by an FPI fellowship (PRE2020-092510) funded by MCIN/AEI. M.M. was supported by a Ramon y Cajal fellowship (RYC-2017-22249).
    Peer Reviewed. Postprint (published version).

    Clustering and topic modeling for biomedical text mining

    In this work, we study the problem of characterizing an unlabelled corpus of biomedical documents in an unsupervised manner. After a review of the literature on the subject, we propose an integrative approach to the problem. The integration is twofold. On the one hand, we integrate, with multiview learning, different text representations derived from a traditional bag-of-words model, Latent Dirichlet Allocation, and a recurrent neural autoencoder. On the other hand, we integrate topic modeling outputs, clustering outputs, and biomedical word embeddings to generate an intuitive and comprehensive characterization of the corpus. We also propose a semantic graph that supplies a synthetic visualization of the relationships between topics, clusters, and any other biomedical concept, based on semantic similarity. An application to the CORD-19 dataset, a collection of articles on COVID-19, shows that our methodology produces a coherent, meaningful, and informative characterization of the corpus.