
    Sparse reduced-rank regression for imaging genetics studies: models and applications

    We present a novel statistical technique, the sparse reduced-rank regression (sRRR) model, a strategy for multivariate modelling of high-dimensional imaging responses and genetic predictors. By adopting penalisation techniques, the model enforces sparsity in the regression coefficients, identifying subsets of genetic markers that best explain the variability observed in subsets of the phenotypes. To properly exploit the rich structure present in each of the imaging and genetics domains, we additionally propose the use of several structured penalties within the sRRR model. Using simulation procedures that accurately reflect realistic imaging genetics data, we present detailed evaluations of the sRRR method in comparison with the more traditional univariate linear modelling approach. In all settings considered, we show that sRRR has greater power to detect the deleterious genetic variants. Moreover, using a simple genetic model, we demonstrate the potential benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to extracting averages over regions of interest in the brain. Since this entails the use of phenotypic vectors of enormous dimensionality, we suggest the use of a sparse classification model as a de-noising step prior to the imaging genetics study. We then present the application of a data re-sampling technique within the sRRR model for model selection. Using this approach we are able to rank the genetic markers in order of importance of association with the phenotypes, and similarly rank the phenotypes in order of importance to the genetic markers. Finally, we illustrate the application of the proposed statistical models to three real imaging genetics datasets and highlight some potential associations.
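
The idea of a sparse rank-one fit between genetic predictors and imaging responses can be sketched as follows. This is an illustrative simplification, not necessarily the algorithm used in the work above: it assumes the predictor covariance can be treated as the identity, so the fit reduces to alternating soft-thresholding on the cross-product matrix X^T Y. The function names (`soft_threshold`, `sparse_rank_one`) and the penalty parameters `lam_b` and `lam_a` are hypothetical, and the thresholds are expressed in the raw units of X^T Y.

```python
import numpy as np


def soft_threshold(v, lam):
    """Elementwise soft-thresholding: shrink towards zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)


def _unit(v):
    """Scale v to unit Euclidean norm (an all-zero vector is returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def sparse_rank_one(X, Y, lam_b, lam_a, n_iter=100, tol=1e-8):
    """Sparse rank-one approximation of the regression of Y on X.

    Assumption (for illustration only): X^T X is treated as the identity,
    so the updates only involve M = X^T Y.

    Returns (b, a): b holds sparse predictor weights (length p),
    a holds sparse response weights (length q).
    """
    M = X.T @ Y  # p x q cross-product matrix
    # Start from the leading right singular vector of M.
    _, _, vt = np.linalg.svd(M, full_matrices=False)
    a = vt[0]
    b = np.zeros(M.shape[0])
    for _ in range(n_iter):
        # Update predictor weights, then response weights.
        b_new = _unit(soft_threshold(M @ a, lam_b))
        a_new = _unit(soft_threshold(M.T @ b_new, lam_a))
        change = np.linalg.norm(a_new - a) + np.linalg.norm(b_new - b)
        a, b = a_new, b_new
        if change < tol:
            break
    return b, a
```

The nonzero entries of `b` would correspond to selected genetic markers and the nonzero entries of `a` to selected phenotypes; larger penalties give sparser selections. In practice the penalties would need tuning, for example with the re-sampling approach the abstract mentions.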

    Data integration in inflammatory bowel disease

    INTRODUCTION: Inflammatory bowel disease is a complex intestinal disease with several genetic and environmental factors that can influence its course. The etiology and pathophysiology of the disease are not fully understood. There is some evidence that the microbiome can play a role. Finding relationships between the microbiome and the host's mucosa could help advance prevention, diagnosis or treatment. METHODS: We based our analysis on intestinal bacterial 16S rRNA and human transcriptome data from biopsies taken at multiple timepoints and intestinal segments. We extended regularized generalized canonical correlation analysis to find models that are coherent with previous knowledge of the disease, taking into account the samples' information. Multiple inflammatory bowel disease datasets covering different treatments and conditions were analysed, and the models defining those datasets were compared. The results were compared with multiple co-inertia analysis. RESULTS: Splitting sample variables into different blocks results in models of these relationships that differ in the genes and microorganisms selected. The models generated using our new method, inteRmodel, outperformed multiple co-inertia analysis at classifying the samples according to their location. Despite being used on datasets from different sources, the resulting models show similar relationships between variables. DISCUSSION: Comparing multiple models helps uncover the relationships within datasets. Our method finds how strong the relationships between the microbiome, the transcriptome and environmental variables are. The genes selected were common across different datasets. This approach is robust and flexible across different datasets and settings. CONCLUSION: With inteRmodel we found that the microbiome relates more closely to the sample location than to disease, but the transcriptome is highly related to the location of the sample in the intestine. There is a common transcriptome across datasets, whereas the microorganisms depend on the dataset. We can improve sample classification by taking into account both bacterial 16S rRNA and the host transcriptome.

    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views on the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. Comment: Doctoral thesis. 103 pages, 11 figures

    Detecting LINE-1 mediated structural variants from sequencing data: computational characterization of genomic rearrangements occurring in human post-mortem brains in the pathologic context of Alzheimer’s disease and in mouse olfactory epithelium at physiological conditions

    One of the most intriguing discoveries of recent decades is that “the genome is a work in progress”, constantly gaining and losing chunks of sequence in order to provide new, potentially favorable combinations for adaptation. The old genetic concept that the genome is static prevailed until the 1950s, when it was first suggested that there is a lot more to DNA than just genes. Indeed, genetic material is dynamic and the greatest part of most organisms’ genomes is occupied by non-coding DNA, especially DNA fragments deriving from elements capable of moving to new locations: Transposable Elements (TEs). TEs are mobile DNA fragments whose remnants occupy nearly half of the mammalian genome and up to 90% of the genome of some plants (SanMiguel et al., 1996). Since 1951, when Barbara McClintock discovered them in maize (McClintock, 1951), extensive efforts have been devoted to understanding the function of these interspersed repeats. Unfortunately, due to their hidden activity, TEs were long underappreciated and dismissed as ‘junk DNA’. When researchers identified long interspersed element-1 (LINE-1 or L1) insertions as responsible for haemophilia A in 1988 (Kazazian et al., 1988), TEs gained new attention. LINE-1 elements are the only active, autonomous TEs present in the mammalian genome. These elements, able to create polymorphisms among individuals and genomic mosaicism among populations of cells, are major sources of Structural Variations (SVs) in humans and are responsible for 124 genetic diseases (Hancks and Kazazian, 2016). In particular, the discovery of LINE-1 mobilization in neurogenesis (Muotri et al., 2005, Coufal et al., 2009) urged the scientific community to investigate the potential involvement of mobile elements in neuropsychiatric disorders (Bundo et al., 2014, Guffanti et al., 2016, Shpyleva et al., 2017) and neurodegenerative diseases (Li et al., 2012).
LINE-1 activity has now been proven in vitro (Moran et al., 1996) and in vivo (Ostertag et al., 2002), while the real rate of retrotransposition remains an open question. One of the main reasons for this lack of knowledge is the absence of reliable methods to detect elements present in a small minority of cells, or unique to a single cell. This is exacerbated by the technical complexity of deconstructing non-reference, chimeric regions of the genome through experimental or computational means. Until very recently, assays using ligation-mediated PCR techniques were considered the gold standard for proving and quantifying current retrotransposon activity. Unfortunately, both positive and negative changes in the number of repeats detected with these techniques can occur by a multitude of mechanisms not directly related to retrotransposition. Among the most common retrotransposition-independent rearrangements are non-homologous recombination-mediated deletions and duplications. In this thesis, I focus on the effects of LINE-1 elements on genome stability. To this purpose, I describe three different bioinformatics methods for the study of the hallmarks of LINE-1-mediated genome instability: direct insertion, post-insertional rearrangements and Double Strand Breaks (DSBs). The increasing availability of large amounts of sequencing data produced by Next-generation sequencing (NGS) calls for the development of new genomics technologies and bioinformatics pipelines targeted at the study of retrotransposons, to fully exploit the available resources. Therefore a scalable approach, such as the Splinkerette Analysis of Mobile Elements (SPAM) method proposed here, is of substantial interest to assist current and future developments in the study of TEs.
Importantly, SPAM allowed us to target exclusively Full-Length LINE-1 elements (FL-L1) present in Frontal Cortex (FC) and Kidney (K) of Alzheimer’s Disease (AD) and control (CTRL) post-mortem tissues and to test whether LINE-1 polymorphisms can be a relevant source of SVs associated with AD risk. This is accomplished by combining a PCR-based enrichment of FL-L1 elements with an ad hoc bioinformatic pipeline. The strengths of our integrative method are its ability to detect LINE-1 insertion sites with great precision and its scalability. Embedded in the methodology is the flexibility to apply the same technique to different organisms and to different classes of TEs. Using SPAM, we observed for the first time unexpectedly high levels of retrotransposition in K. In association with the SPAM approach, we performed TaqMan-based Copy Number Variation (CNV) analysis to evaluate the content of potentially active L1s in the different tissues of AD and CTRL individuals. Overall, we show that the content of FL-L1 sequences in AD is significantly lower than in CTRL, that de novo integrations are not associated with the disease, but that FL-L1 polymorphisms can be a relevant source of SVs. Then, we investigated which mechanism underlies the regulation of Olfactory Receptor (OR) choice in the mouse Olfactory Epithelium (OE), characterizing Olfr2 locus-specific SVs. To perform this task, we combined whole genome amplification from a small number of cells with PacBio single molecule sequencing and complementary high-fidelity paired-end Illumina sequencing. This approach allowed accurate identification of breakpoints in a locus where a very high concentration of repeats, especially LINE elements, provides more chances for recombination events to occur between retrotransposon fragments. Surprisingly, the analysis revealed hundreds of heterozygous structural variants in the vicinity of the locus, among which deletions are the most abundant.
The presence and characteristics of particular genomic features associated with the observed deletions suggest that Micro-homology Mediated End Joining (MMEJ) of Double Strand Breaks (DSBs) is the main mechanism operating in the formation of deletions. Further experiments will tell us whether the observed SVs are involved in the regulation of OR expression. Intrigued by the idea that OR genes can present somatic SVs, we profiled endogenous DSB distribution in the mouse OE at p6 and 1m and in the liver at p6. To this purpose, we performed a Chromatin ImmunoPrecipitation and Sequencing (ChIP-Seq) analysis of γ-H2AX (an early response marker for DNA DSBs). Little is known about the differential distribution of γ-H2AX throughout the genome under physiological conditions. According to our results, the γ-H2AX signal is stronger in gene-rich, transcribed regions, where it co-localizes with regulatory sites. These results suggest a potential involvement of DSBs in resolving topological stress and promoting interactions between regulatory regions. The research described in this thesis is aimed at enhancing our understanding of the role of LINE-1-mediated SVs in health and disease.

    HIGH-THROUGHPUT METHODS FOR INTERROGATION OF RNA TERTIARY STRUCTURE

    RNAs encode information in not only primary sequence but also higher order structure. Transcriptome-level investigations of RNA structure and function by chemical probing have been largely limited to secondary structure, as current probing methods lack the specificity and sensitivity for high-throughput discovery of RNA tertiary structure. Here we introduce trimethyloxonium (TMO) as a chemical probe that reports on local structure and electrostatics, where nucleotides with enhanced TMO-reactivity are strongly indicative of RNA tertiary structure. We call these positions T-sites. T-site probing yields a robust signal from primary reactivities and detects tertiary structure in model RNAs with as few as a hundred reads. We extend T-site probing to transcriptome-wide discovery of RNA tertiary structure, detecting over 1,400 T-sites that emphasize the potential abundance of higher order RNA structure in eukaryotic transcriptomes. We demonstrate the functional significance of T-sites in two mRNAs; mutations predicted to disrupt tertiary structure resulted in changes in translation. TMO is a rapidly-acting chemical probe; we also apply this characteristic to create a platform for time-resolved, single molecule, through-space structure probing of RNA using the RING correlated chemical probing framework. Time-dependent correlation changes in the RNase P RNA revealed that a long-range tertiary interaction guides native RNA folding for both secondary and tertiary structure, a mechanism directly validated by concise disruption of the through-space interaction. TMO-based experiments are poised to rapidly expand opportunities in RNA biology and remove a key bottleneck in characterization of RNA tertiary structure. Doctor of Philosophy

    Discriminative Representations for Heterogeneous Images and Multimodal Data

    Histology images of tumor tissue are an important diagnostic and prognostic tool for pathologists. Recently developed molecular methods group tumors into subtypes to further guide treatment decisions, but they are not routinely performed on all patients. A lower-cost, repeatable method to predict tumor subtypes from histology could bring benefits to more cancer patients. Further, combining imaging and genomic data types provides a more complete view of the tumor and may improve prognostication and treatment decisions. While molecular and genomic methods capture the state of a small sample of tumor, histological image analysis provides a spatial view and can identify multiple subtypes in a single tumor. This intra-tumor heterogeneity has yet to be fully understood and its quantification may lead to future insights into tumor progression. In this work, I develop methods to learn appropriate features directly from images using dictionary learning or deep learning. I use multiple instance learning to account for intra-tumor variations in subtype during training, improving subtype predictions and providing insights into tumor heterogeneity. I also integrate image and genomic features to learn a projection to a shared space that is also discriminative. This method can be used for cross-modal classification or to improve predictions from images by also learning from genomic data during training, even if only image data is available at test time. Doctor of Philosophy