727 research outputs found

    Identification of novel non-coding small RNAs from Streptococcus pneumoniae TIGR4 using high-resolution genome tiling arrays

    Background: The identification of non-coding transcripts in human, mouse, and Escherichia coli has revealed their widespread occurrence and functional importance in both eukaryotic and prokaryotic life. In prokaryotes, studies have shown that non-coding transcripts participate in a broad range of cellular functions such as gene regulation, stress response and virulence. However, very little is known about non-coding transcripts in Streptococcus pneumoniae (pneumococcus), an obligate human respiratory pathogen responsible for significant worldwide morbidity and mortality. Tiling microarrays enable genome-wide mRNA profiling as well as identification of novel transcripts at high resolution. Results: Here, we describe a high-resolution transcription map of the S. pneumoniae clinical isolate TIGR4 using genomic tiling arrays. Our results indicate that approximately 66% of the genome is expressed under our experimental conditions. We identified a total of 50 non-coding small RNAs (sRNAs) from the intergenic regions, of which 36 had no predicted function. Half of the identified sRNA sequences were found to be unique to the S. pneumoniae genome. We identified eight overrepresented sequence motifs among sRNA sequences that correspond to sRNAs in different functional categories. Tiling arrays also identified approximately 202 operon structures in the genome. Conclusions: In summary, the pneumococcal operon structures and novel sRNAs identified in this study enhance our understanding of the complexity and extent of the pneumococcal 'expressed' genome. Furthermore, the results of this study open up new avenues of research for understanding the complex RNA regulatory network governing S. pneumoniae physiology and virulence.

    FUNCTIONAL CLASSIFICATION OF LONG NON-CODING RNAS

    Long non-coding RNAs (lncRNAs) play important roles in mammalian development and health. The relationships between lncRNAs’ sequences and their functions, however, are poorly understood. Unlike proteins, which often contain recognizable functional domains and are evolutionarily conserved, lncRNAs lack significant linear sequence homology. To address this issue and enable the prediction of a lncRNA’s biological properties from its sequence, we developed a non-linear sequence similarity algorithm called SEEKR. SEEKR allows for the comparison of sequences based on the non-linear abundance of short motifs. These short motifs, or k-mers, may represent potential protein binding sites within the lncRNA. We used SEEKR to form communities of similar lncRNAs from both human and mouse transcriptomes. We were then able to demonstrate that these communities predicted the biological properties of the lncRNAs within a given community, including cellular localization and protein binding. We also show that we can predict RNAs’ repressive activity in vivo using SEEKR. Additionally, SEEKR provided evidence of similarity between certain pre-mRNA transcripts and known repressive lncRNAs. We hypothesized that some of these pre-mRNAs may have localized repressive capabilities. We demonstrated that pre-mRNAs are detectable at physiologically relevant levels in human cells and that some pre-mRNAs with repressive-like sequences may also interact with transcription-regulating proteins in cis. Finally, we have packaged SEEKR into a user-friendly command line tool, which is free and open source. Here, we provide an extensive tutorial describing both how we have used SEEKR and how to perform common analysis tasks. Doctor of Philosophy
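    The idea of comparing sequences by the abundance of short k-mers, as described in the abstract above, can be illustrated with a minimal sketch. This is not the SEEKR tool or its API: the function names (`kmer_profile`, `pearson`, `kmer_similarity`), the default `k=2`, and the normalization by total k-mer count are assumptions chosen for illustration, and SEEKR's own normalization and scoring may differ.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer, in a fixed order."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence has no valid k-mers")
    return [counts[w] / total for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def kmer_similarity(seq_a, seq_b, k=2):
    """Compare two sequences by correlating their k-mer frequency profiles."""
    return pearson(kmer_profile(seq_a, k), kmer_profile(seq_b, k))
```

    Because the comparison uses k-mer abundance rather than alignment, two sequences can score as similar without sharing any long stretch of linear homology; larger values of `k` make the profile more specific at the cost of sparser counts.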

    ChIP-Hub provides an integrative platform for exploring plant regulome

    Plant genomes encode a complex and evolutionarily diverse regulatory grammar that forms the basis for most life on earth. A wealth of regulome and epigenome data have been generated in various plant species, but no common, standardized resource is available so far for biologists. Here, we present ChIP-Hub, an integrative web-based platform following ENCODE standards that bundles >10,000 publicly available datasets reanalyzed from >40 plant species, allowing visualization and meta-analysis. We manually curate the datasets by assessing ~540 original publications and comprehensively evaluate their data quality. As a proof of concept, we extensively survey the co-association of different regulators and construct a hierarchical regulatory network under a broad developmental context. Furthermore, we show how our annotation allows investigation of the dynamic activity of tissue-specific regulatory elements (promoters and enhancers) and their underlying sequence grammar. Finally, we analyze the function and conservation of tissue-specific promoters, enhancers and chromatin states using comparative genomics approaches. Taken together, the ChIP-Hub platform and the analysis results provide rich resources for deep exploration of plant ENCODE. ChIP-Hub is available at https://biobigdata.nju.edu.cn/ChIPHub/. Peer Reviewed

    Computational identification of transcriptional regulatory elements in DNA sequence

    Identification and annotation of all the functional elements in the genome, including genes and regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequences from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on recent developments and current challenges.

    Development of computational tools and resources for systems biology of bacterial pathogens

    Bacterial pathogens are a major cause of disease in humans, agricultural plants and farm animals. Even after decades of research they remain a challenge to health care, as they are known to evolve rapidly and develop resistance to existing drugs. Systems biology is an emerging area of research where all of the components of the system, their interactions, and the dynamics can be studied in a comprehensive, quantitative, and integrative fashion to generate predictive models. When applied to bacterial pathogenesis, systems biology approaches will help identify potential novel molecular targets for drug discovery. A prerequisite for conducting systems analysis is the identification of the building blocks of the system, i.e. the individual components of the system (structural annotation), their functions (functional annotation) and the interactions among the individual components (interaction prediction). In the context of bacterial pathogenesis, it is necessary to identify the host-pathogen interactions. This dissertation work describes computational resources that enable comprehensive systems-level study of host-pathogen systems to enhance our understanding of bacterial pathogenesis. It specifically focuses on improving the structural and functional annotation of pathogen genomes as well as identifying host-pathogen interactions at a genome scale. The novel contributions of this dissertation towards systems biology of bacterial pathogens include three computational tools/resources. “TAAPP” (Tiling array analysis pipeline for prokaryotes) is a web-based tool for the analysis of whole-genome tiling array data for bacterial pathogens. TAAPP helps improve the structural annotation of bacterial genomes. “ISO-IEA” (Inferred from sequence orthology - Inferred from electronic annotation) is a tool that can be used for the functional annotation of any sequenced genome. 
“HPIDB” (Host pathogen interaction database) is developed with a data mining capability that includes host-pathogen interaction prediction. The new knowledge gained from the implementation of these tools is the description of the non-coding RNAs as well as a computationally predicted host-pathogen interaction network for the human respiratory pathogen Streptococcus pneumoniae. In summary, the computational tools and resources developed in this dissertation study will enable building systems biology models of bacterial pathogens.

    Analysis, Visualization, and Machine Learning of Epigenomic Data

    The goal of the Encyclopedia of DNA Elements (ENCODE) project has been to characterize all the functional elements of the human genome. These elements include expressed transcripts and genomic regions bound by transcription factors (TFs), occupied by nucleosomes, occupied by nucleosomes with modified histones, or hypersensitive to DNase I cleavage. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an experimental technique for detecting TF binding in living cells, and the genomic regions bound by TFs are called ChIP-seq peaks. ENCODE has performed and compiled results from tens of thousands of experiments, including ChIP-seq, DNase, RNA-seq and Hi-C. These efforts have culminated in two web-based resources from our lab, Factorbook and SCREEN, for the exploration of epigenomic data for both human and mouse. Factorbook is a peak-centric resource presenting data such as motif enrichment and histone modification profiles for transcription factor binding sites computed from ENCODE ChIP-seq data. SCREEN provides an encyclopedia of ~2 million regulatory elements, including promoters and enhancers, identified using ENCODE ChIP-seq and DNase data, with an extensive UI for searching and visualization. While we have successfully utilized the thousands of available ENCODE ChIP-seq experiments to build the Encyclopedia and visualizers, we have also struggled with the practical and theoretical inability to assay every possible experiment on every possible biosample under every conceivable biological scenario. We have used machine learning techniques to predict TF binding sites and enhancer locations, and demonstrate that machine learning is critical to help decipher functional regions of the genome.

    Non-coding genome contributions to the development and evolution of mammalian organs

    Protein-coding sequences only cover 1-2% of a typical mammalian genome. The remaining non-coding space hides thousands of genomic elements, some of which act via their DNA sequence while others are transcribed into non-coding RNAs. Many well-characterized non-coding elements are involved in the regulation of other genes, a process essential for the emergence of different cell types and organs during development. Changes in the expression of conserved genes during development are in turn thought to facilitate evolutionary innovation in form and function. Thus, non-coding genomic elements are hypothesized to play important roles in developmental and evolutionary processes. However, challenges related to the identification and characterization of these elements, in particular in non-model organisms, have limited the study of their overall contributions to mammalian organ development and evolution. During my dissertation work, I addressed this gap by studying two major classes of non-coding elements, long non-coding RNAs (lncRNAs) and cis-regulatory elements (CREs). In the first part of my thesis, I analyzed the expression profiles of lncRNAs during the development of seven major organs in six mammals and a bird. I showed that, unlike protein-coding genes, only a small fraction of lncRNAs is expressed in reproducibly dynamic patterns during organ development. These lncRNAs are enriched for a series of features associated with functional relevance, including increased evolutionary conservation and regulatory complexity, highlighting them as candidates for further molecular characterization. I then associated these lncRNAs with specific genes and functions based on their spatiotemporal expression profiles. My analyses also revealed differences in lncRNA contributions across organs and developmental stages, identifying a developmental transition from broadly expressed and conserved lncRNAs towards an increasing number of lineage- and organ-specific lncRNAs. 
Following up on these global analyses, I then focused on a newly-identified lncRNA in the marsupial opossum, Female Specific on chromosome X (FSX). The broad and likely autonomous female-specific expression of FSX suggests a role in marsupial X-chromosome inactivation (XCI). I showed that FSX shares many expression and sequence features with another lncRNA, RSX — a known regulator of XCI in marsupials. Comparisons to other marsupials revealed that both RSX and FSX emerged in the common marsupial ancestor and have since been preserved in marsupial genomes, while their broad and female-specific expression has been retained for at least 76 million years of evolution. Taken together, my analyses highlighted FSX as a novel candidate for regulating marsupial XCI. In the third part of this work, I shifted my focus to CREs and their cell type-specific activities in the developing mouse cerebellum. After annotating cerebellar cell types and states based on single-cell chromatin accessibility data, I identified putative CREs and characterized their spatiotemporal activity across cell types and developmental stages. Focusing on progenitor cells, I described temporal changes in CRE activity that are shared between early germinal zones, supporting a model of cell fate induction through common developmental cues. By examining chromatin accessibility dynamics during neuronal differentiation, I revealed a gradual divergence in the regulatory programs of major cerebellar neuron types. In the final part, I explored the evolutionary histories of CREs and their potential contributions to gene expression changes between species. By comparing mouse CREs to vertebrate genomes and chromatin accessibility profiles from the marsupial opossum, I identified a temporal decrease in CRE conservation, which is shared across cerebellar cell types. However, I also found differences in constraint between cell types, with microglia having the fastest evolving CREs in the mouse cerebellum. 
Finally, I used deep learning models to study the regulatory grammar of cerebellar cell types in human and mouse, showing that the sequence rules determining CRE activity are conserved across mammals. I then used these models to retrace the evolutionary changes leading to divergent CRE activity between species. Collectively, my PhD work provides insights into the evolutionary dynamics of non-coding genes and regulatory elements, the processes associated with their conservation, and their contributions to the development and evolution of mammalian cell types and organs.

    Cancer genetics research in the era of new sequencing methods (Syöpägenetiikan tutkimus uusien sekvensointimenetelmien aikakaudella)

    Research in cancer genetics aims to detect genetic causes for the excessive growth of cells, which may subsequently form a tumor and further develop into cancer. The Human Genome Project succeeded in mapping the majority of the human DNA sequence, which enabled modern sequencing technologies to emerge, namely next-generation sequencing (NGS). The new era of disease genetics research shifted DNA analyses from the laboratory to computer screens. Since then, the massive growth of sequencing data has been facilitating the detection of novel disease-causing mutations and thus improving the screening and medical treatment of cancer. However, the exponential growth of sequencing data brought new challenges for computing. The sheer size of the data is not only expensive to store and maintain, but also highly demanding to process and analyze. Moreover, not only has the amount of sequencing data increased, but new kinds of functional genomics data, which are instrumental in figuring out the consequences of detected mutations, have also emerged. To this end, continuous software development has become essential to enable the utilization of all produced research data, new and old. This thesis describes software for the analysis and visualization of NGS data (publication I) that allows the integration of genomic data from various sources. The software, BasePlayer, was designed to meet the need for efficient and user-friendly methods for analyzing and visualizing massive variant data and various other types of genomic data. To this end, we developed a multi-purpose tool for the analysis of genomic data, such as DNA, RNA, ChIP-seq, and DNase. The capabilities of BasePlayer in the detection of putatively causative variants and data visualization have already been used in over twenty scientific publications. The applicability of the software is demonstrated in this thesis with two distinct analysis cases - publications II and III. 
The second study considered somatic mutations in colorectal cancer (CRC) genomes. We were able to identify distinct mutation patterns at the CTCF/Cohesin binding sites (CBSs) by analyzing whole-genome sequencing (WGS) data with BasePlayer. The sites were observed to be frequently mutated in CRC, especially in samples with a specific mutational signature. However, the source of the mutation accumulation remained unclear. In contrast, a subset of samples with an ultra-mutator phenotype, caused by a defective polymerase epsilon (POLE) gene, exhibited an inverse pattern at CBSs. We detected the same signal in other, predominantly gastrointestinal, cancers as well. However, we were not able to measure changes in gene expression at mutated sites, so the role of the CBS mutations in tumorigenesis remains to be elucidated. The third study considered esophageal squamous cell carcinoma (ESCC), and the objective was to detect predisposing mutations using the Finnish Cancer Registry (FCR) data. We performed clustering analysis for the FCR data, with additional information obtained from the Population Information System of Finland. We detected an enrichment of ESCC in the Karelia region and were able to collect and sequence 30 formalin-fixed paraffin-embedded (FFPE) samples from the region. We reported several candidate genes, out of which EP300 and DNAH9 were considered the most interesting. The study not only reported putative genes predisposing to ESCC but also worked as a proof of concept for the feasibility of conducting genetic research utilizing both clustering of the FCR data and FFPE exome sequencing in such studies.
The aim of cancer genetics research is to find the underlying causes of the excessive growth of cells, which can lead to tumor formation and develop further into cancer. The large-scale Human Genome Project, which aimed to determine the entire human DNA sequence (the genome), was largely completed at the beginning of the millennium. Determining whole genomes enabled the development and adoption of second-generation sequencing methods (next-generation sequencing, NGS). This began a new era, particularly in disease genetics, and moved analyses from laboratories to computer screens. The enormous volumes of data produced by NGS methods accelerated new genetic discoveries but also brought new challenges, particularly for biological data processing, that is, bioinformatics. In addition to data volumes, the number of different data types has also grown and continues to grow; processing all of the produced data into an analyzable form requires very powerful computers and efficient algorithms. Moreover, combining (integrating) many different samples and data types into a coherent whole demands flexibility and efficiency from analysis software, especially in studies of regulatory regions. Continuous development of bioinformatics software is therefore of paramount importance so that all produced data can be made as useful as possible to researchers. This thesis presents BasePlayer, software developed for large-scale sequence data analysis and visualization (publication I). BasePlayer combines, in a graphical user interface, the features needed for genetic analysis, data integration, and visualization. The software makes it possible, for example, to examine hundreds of tumor samples simultaneously, which helps identify predisposing or cancer-driving mutations in gene regulatory regions. BasePlayer has already been used in more than twenty scientific publications, two of which are included in this thesis (publications II and III). In the second publication, BasePlayer was used to search for cancer-driving mutations in gene regulatory regions using more than two hundred colorectal cancer samples. With whole-genome sequencing data, we observed that in some samples mutations had accumulated abundantly, particularly at cohesin binding sites.
Cohesin is involved in several important functions related to, among other things, DNA structure and gene regulation. We also observed a depletion of mutations in the same regions in samples that were ultra-mutated (a hundredfold mutation count compared with average colorectal tumors). The accumulation of mutations was also observed in other cancers, particularly those of the gastrointestinal tract. The role of the observed phenomenon in tumor development, however, remained unresolved. The third study searched for gene mutations predisposing to esophageal cancer. Using the Finnish Cancer Registry and the Population Information System, we searched for regions where, based on surname, esophageal cancer occurred significantly more often than average. A significantly enriched area was found in the region of ceded Karelia, from which we were able to collect and sequence 30 archived tissue samples. We used BasePlayer's sample comparison features to detect variants enriched in the patients compared with the general population. The most interesting results concerned rare variants in the EP300 and DNAH9 genes. In addition to reporting possible new susceptibility genes, this work showed that disease clusters of the kind described can be found using the cancer registry, and that archived tissue material is usable in sequencing-based studies of this kind.

    Integrative Computational Genomics Based Approaches to Uncover the Tissue-Specific Regulatory Networks in Development and Disease

    Indiana University-Purdue University Indianapolis (IUPUI). Regulatory protein families such as transcription factors (TFs) and RNA-binding proteins (RBPs) are increasingly being appreciated for their role in regulating their respective targeted genomic/transcriptomic elements, resulting in dynamic transcriptional (TRNs) and post-transcriptional regulatory networks (PTRNs) in higher eukaryotes. The mechanistic understanding of these two regulatory network types requires a high-resolution, tissue-specific functional annotation of both the proteins and their target sites. This dissertation addresses the need to uncover the tissue-specific regulatory networks in development and disease. This work establishes multiple computational genomics-based approaches to further enhance our understanding of regulatory circuits and decipher the associated mechanisms at several layers of biological processes. This study potentially contributes to the research community by providing valuable resources, including novel methods, web interfaces and software, that transform our ability to build high-quality regulatory binding maps of RBPs and TFs in a tissue-specific manner using multi-omics datasets. The study deciphered the broad spectrum of temporal and evolutionary dynamics of the transcriptome and its regulation at transcriptional and post-transcriptional levels. It also advances our ability to functionally annotate hundreds of RBPs and their RNA binding sites across tissues in the human genome, which helps in decoding the role of RBPs in the context of disease phenotypes, networks, and pathways. The approaches developed in this dissertation are scalable and adaptable to further investigate the tissue-specific regulators in any biological system. Overall, this study contributes towards accelerating progress in molecular diagnostics and drug target identification using regulatory network analysis methods in disease and pathophysiology.