252 research outputs found

    Identification, improved modeling and integration of signals to predict constitutive and alternative splicing

    The regulation of pre-messenger RNA splicing by the spliceosomal machinery, via interactions between cis-regulatory elements and splicing trans-factors, to generate a specific mRNA (constitutive splicing) or sometimes many distinct mRNA isoforms (alternative splicing) is still a poorly understood process. Progress in illuminating this process is further hampered by the variation of splicing across the multitude of tissues and cell types, by the variation of cis and trans elements in different organisms, and by the possibility that some alternative splicing events present in expressed sequence tag (EST) databases may constitute biochemical 'noise' or transient evolutionary fluctuations. Several studies, mainly computational in nature, addressing different questions regarding constitutive and alternative splicing are described here. They range from improved modeling of splicing signals, through studying the variation of alternative splicing in various tissues and analyzing evolutionary differences of cis and trans elements of splicing in various vertebrates, to utilizing attributes indicative of alternative splicing events conserved in human and mouse to identify novel alternatively spliced exons. In particular: (i) A general approach for improved modeling of short sequence motifs, based on the Maximum Entropy principle, that incorporates local adjacent and non-adjacent position dependencies is introduced and applied to understanding splice site signals. The splice site recognition algorithm, MaxENTScan, performs better than previous models that take sequences of similar length as input; (ii) The first large-scale bioinformatics study is conducted that identifies similarities and differences in candidate cis-regulatory elements and trans-acting splicing (cont.) manipulation of intronic elements that enables fish genes to be spliced properly in mammalian cells; (iii) A computational analysis of tissue-specific alternative splicing (AS) using EST data, genome sequence data, and microarray expression data is conducted, which distinguishes human brain, testis and liver as having unusually high levels of AS, highlights differences in the types of AS occurring commonly in different tissues, and identifies candidate cis-regulatory elements and trans-factors likely to play important roles in tissue-specific AS in human cells; (iv) The identification of a set of discriminatory sequence features and their integration into a statistical machine-learning algorithm, ACEScan, which distinguishes exons subject to evolutionarily conserved alternative splicing from constitutively spliced or lineage-specifically-spliced exons, is described; (v) The genome-wide search for and experimental validation of exon-skipping events using the combination of two silencing cis-elements, UAGG and GGGG.
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2004. Includes bibliographical references. By Gene W. Yeo.

    New approaches to unveil the Transcriptional landscape of dopaminergic neurons

    Recent advances in studying the mammalian transcriptome have raised new questions about how genes are organized and what the function of noncoding RNAs is. Furthermore, the discovery of large amounts of polyA- transcripts and antisense transcription proved that a portion of the transcriptome has still to be characterized. The complex anatomo-functional organization of the brain has prevented a comprehensive analysis of the transcriptional landscape of this tissue. New techniques must be developed to approach neuronal heterogeneity. In this study we combined Laser Capture Microdissection (LCM) and nanoCAGE, based on Cap Analysis of Gene Expression (CAGE), to describe expressed genes and map their transcription start sites (TSS) in two specific populations, A9 and A10, of mouse mesencephalic dopaminergic cells. Although sharing common dopaminergic marker genes, these two populations are part of different midbrain anatomical structures, substantia nigra (SN) for A9 and ventral tegmental area (VTA) for A10, project to relatively distinct areas, participate in distinct ascending dopaminergic pathways, and exhibit different electrophysiological properties and different susceptibility to neurodegeneration in Parkinson's disease. Specific neurons were identified by the expression of Green Fluorescent Protein driven by a cell-type-specific promoter in transgenic mice. High-quality RNAs were purified from 1000-2500 cells collected by LCM. We adapted the CAGE technique to analyze limiting amounts of RNAs (nanoCAGE). We took advantage of the cap-switching properties of the reverse transcriptase to specifically tag the 5' end of transcripts with a sequence containing a class III restriction site for EcoP15I. By creating 32 bp 5' tags, we considerably improved the TSS mapping rate on the genome. A semi-suppressive PCR strategy was used to prevent primer-dimer formation. The use of random priming in the first-strand synthesis allowed the capture of poly(A)- RNAs. 
5' tags were sequenced on the Illumina-Solexa platform. Here we show that this new nanoCAGE technology ensures a true high-throughput coverage of the transcriptome of a small number of identified neurons and can be used as an effective means for gene discovery among noncoding RNAs, to uncover putative alternative promoters associated with variants of protein-coding transcripts, and to detect potentially regulatory antisense transcripts. Further experimental validation by 5' RACE (Rapid Amplification of cDNA Ends) and RT-PCR on a few candidate genes has confirmed the existence in vivo of alternative TSS for key regulatory genes involved in specifying and maintaining the dopaminergic phenotype of these neurons, such as α-synuclein (Snca), dopamine transporter (Dat), vesicular monoamine transporter 2 (Vmat2), and catechol-O-methyltransferase (Comt). Furthermore, the differential expression of an antisense transcript overlapping the polyubiquitin (Ubc) gene was detected, making it a potentially interesting candidate for explaining differences in ubiquitin-proteasome system (UPS) function between the two neuron populations. The potential implications of these newly discovered alternative promoters and transcripts are discussed, considering also the potential consequences for the corresponding protein isoforms.

    Designing synthetic spike-in controls for next-generation sequencing and beyond

    Next-generation sequencing (NGS) is a revolutionary tool that can be used for a myriad of applications, ranging from clinical genome sequencing, to gene expression profiling with RNA sequencing (RNA-seq), to the detection of microbes within environmental samples or isolates. However, significant analytical challenges remain with NGS data due to the complexity of genome architecture, as well as a range of biases introduced during library preparation, sequencing and analysis. These biases and challenges can be understood and mitigated through the use of spike-in controls – DNA or RNA oligonucleotides with known sequence and length that are added to samples prior to library preparation. While spike-in controls have previously been developed for transcriptomics, they were designed for technologies that predated the advent of NGS and consequently suffer from several limitations. In this thesis, I present a novel design framework for synthetic spike-in standards (‘sequins’) that can be applied to a range of NGS applications, and demonstrate how sequins can be used as internal controls to assist in the analysis of accompanying samples. In Chapter 1, I develop a set of spliced synthetic RNA standards that are encoded by artificial gene loci on an accompanying in silico chromosome. RNA sequins enable the assessment of important but previously intractable RNA-seq properties including split-read alignment, alternative splicing, isoform-level quantification and fusion gene detection. In Chapter 2, I present the design of a set of DNA sequins comprising a synthetic community of artificial microbial genomes, which can be used in metagenome sequencing and analysis. Importantly, DNA sequins facilitate the accurate resolution of microbial abundance shifts between samples, which are otherwise imperceptible with NGS. Finally, in Chapter 3, I show how RNA sequins can be used in the analysis of complex brain transcriptomes generated using targeted RNA-seq. 
This includes an assessment of capture efficiency, quantitative accuracy, and the setting of empirical thresholds to distinguish signal from noise. These transcriptomes are presented as an atlas that can be used to link gene expression with neurological phenotypes. The technologies, associated datasets and analytical methods developed herein provide a qualitative and quantitative reference with which to navigate the complexity of genome biology.

    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views to the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. Comment: Doctoral thesis. 103 pages, 11 figures.

    Integrated Spatial Genomics Reveals Organizational Principles of Single-Cell Nuclear Architecture

    Three-dimensional (3D) nuclear architecture plays key roles in many cellular processes such as gene regulation and genome replication. Recent sequencing-based and imaging-based single-cell studies have characterized a high variability of nuclear features in individual cells across a wide range of measurement modalities, such as chromosome structures, subnuclear structures, chromatin states, and nascent transcription. However, the lack of technologies that allow us to interrelate those nuclear features simultaneously in the same single cells limits our understanding of nuclear architecture. To overcome this limitation, a technology that can examine 3D nuclear features across modalities from the same single cells is required. Here, we demonstrate integrated spatial genomics approaches, which enable genome-wide investigation of chromosome structures, subnuclear structures, chromatin states, and transcriptional states in individual cells. In Chapter 2, we introduce the "track first and identify later" approach, which enables multiplexed tracking of genomic loci in live cells by combining CRISPR/Cas9 live imaging and DNA sequential fluorescence in situ hybridization (DNA seqFISH) technologies. We demonstrate our approach by resolving the dynamics of 12 unique subtelomeric loci in mouse embryonic stem (ES) cells. In Chapter 3, we present the intron seqFISH technology, which enables transcriptome-scale profiling of gene expression at nascent transcription active sites in individual nuclei in mouse ES cells and fibroblasts, along with mRNA and lncRNA seqFISH and immunofluorescence. We show that the transcription active sites are positioned at the surfaces of chromosome territories, with variable inter-chromosomal organization in individual nuclei. 
By building upon those technologies, in Chapter 4, we demonstrate integrated spatial genomics in mouse ES cells, which enables imaging of thousands of genomic loci by DNA seqFISH+, along with sequential immunofluorescence and RNA seqFISH in individual cells. We show "fixed loci" that are invariably associated with specific subnuclear structures across hundreds of single cells and that can constrain nuclear architecture in individual nuclei. In addition, we find that individual genomic loci appear to be pre-positioned to specific nuclear compartments with different frequencies, which are independent of the nascent transcriptional states of single cells. Lastly, in Chapter 5, we demonstrate the integrated spatial genomics technology in the mouse brain cortex, enabling the investigation of single-cell nuclear architecture in a cell-type-specific fashion as well as the exploration of common organizational principles of nuclear architecture across cell types. We reveal that inter-chromosomal organization and radial positioning of chromosomes are arranged with cell-type-specific chromatin fixed loci and subnuclear structure organization in diverse cell types. We also uncover the variable organization of chromosome domain structures at the sub-megabase scale in individual cells, which can be obscured in bulk measurements. Together, these results demonstrate the ability of integrated spatial genomics to advance our overall understanding of single-cell nuclear architecture in various biological systems.

    In silico prediction of active RNA genes in legumes

    Accumulating evidence suggests that non-coding RNAs (ncRNAs) play key roles in gene regulation and may form the basis of an inter-gene communication system. MicroRNAs (miRNAs) are a class of small non-coding RNAs found in both plants and animals that regulate the expression of other genes. Identification and analysis of microRNAs enhances our understanding of the important roles that they play in this complex regulatory network. The work presented in this thesis constitutes the first large-scale prediction and characterization of both ncRNAs and miRNAs in the model legumes Medicago truncatula and Lotus japonicus, and provides a basis for further research on elucidating ncRNA function in legume genomics.

    A new computational framework for the classification and function prediction of long non-coding RNAs

    Long non-coding RNAs (lncRNAs) are known to play a significant role in several biological processes. These RNAs are longer than 200 base pairs (bp), and so are often misclassified as protein-coding genes. Most Coding Potential Computation (CPC) tools fail to accurately identify, classify and predict the biological functions of lncRNAs in plant genomes, because previous research has been limited to mammalian genomes. In this thesis, various sequence and codon-bias features for identification of lncRNA sequences have been investigated and extracted to develop a new CPC Framework. For identification of essential features, the framework implements regularisation-based selection. A novel classification algorithm is implemented, which removes the dependency on experimental datasets and provides a coordinate-based solution for sub-classification of lncRNAs. For imputing lncRNA functions, lncRNA-protein interactions were first determined through co-expression of genes and then re-analysed by a sequence similarity-based approach for identification of novel interactions and prediction of lncRNA functions in the genome. The framework integrates a D3-based application for visualisation of lncRNA sequences and their associated functions in the genome. Standard evaluation metrics such as accuracy, sensitivity, and specificity have been used for benchmarking the performance of the framework against leading CPC tools. Case study analyses were conducted with plant RNA-seq datasets to evaluate the effectiveness of the framework using a cross-validation approach. The tests show that the framework can provide significant improvements over existing CPC models for plant genomes: 20-40% greater accuracy. Function prediction analysis demonstrates that the results are consistent with published experimental findings.

    Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.

    Recent advances in high throughput methodologies offer researchers the ability to understand complex systems via high dimensional and multi-relational data. One example is the realm of molecular biology where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high dimensional and multirelational data allows for unprecedented detailed analysis, but also presents challenges in accounting for all the variability. High dimensional data often has a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high dimensional and multirelational data, we developed three feature selection and cross-clustering methods: 1) infinite relational model with feature selection (FIRM) which incorporates the rich information of multirelational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM on categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. 
We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.

    NEW APPROACHES IN UNDERSTANDING DRUG METABOLISM

    Limitations in technology, such as DNA sequencing and appropriate model systems, have made it difficult to understand the genetic and non-genetic factors that influence the liver's role in metabolizing drugs. New approaches are required to overcome these limitations. In this Dissertation, we evaluate 3 such new approaches. Our first new approach relates to the field of pharmacogenetics: using genetics to predict how a patient will respond to medication based on their genetic code. We looked for polymorphisms in a novel target gene, Cytochrome P450 Oxidoreductase (POR). Our results show a mutation in P450 reductase (L577P) that associates with decreased metabolism for 8 of 10 major drug metabolizing enzymes. However, even though we found a statistical association between POR polymorphism and drug metabolism, a wide range of variation in POR activity was still observed among the samples with the L577/P577 genotype, making it difficult to predict POR activity solely on the basis of L577P genotype. POR represents only a single gene amongst the tens of thousands present in the human genome. To investigate how genes and their products interact, a systems approach is necessary. Therefore, in our second new approach, we characterize the transcriptome of our model system, the HepaRG cell line. We found that HepaRG cells globally transcribe genes at levels more similar to human primary hepatocytes and human liver than HepG2 cells do, particularly for genes encoding drug processing proteins. Finally, I describe the third new approach: the use of next-generation DNA sequencing to understand hepatic drug response. This section contains two parts. First, we introduce methods that significantly decrease the false discovery rate of genotyping from RNA-Seq data. With these high fidelity SNPs, we were able to perform a genome-wide pharmacogenomic analysis on HepaRG cells. 
Second, we introduce a new program, called PRUNE, to more accurately quantify gene expression, and compare its performance to that of established programs.