73 research outputs found

    Starr: Simple Tiling Array Analysis of Affymetrix ChIP-chip data

    Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is an assay for DNA-protein binding or post-translational chromatin/histone modifications. As with all high-throughput technologies, it requires thorough bioinformatic processing of the data, for which there is no standard yet. The primary goal is the reliable identification and localization of genomic regions that bind a specific protein. The second step comprises comparison of binding profiles of functionally related proteins, or of binding profiles of the same protein in different genetic backgrounds or environmental conditions. Ultimately, one would like to gain a mechanistic understanding of the effects of DNA binding events on gene expression. We present a free, open-source R package Starr that, in combination with the package Ringo, facilitates the comparative analysis of ChIP-chip data across experiments and across different microarray platforms. Core features are data import, quality assessment, normalization and visualization of the data, and the detection of ChIP-enriched genomic regions. The use of common Bioconductor classes ensures compatibility with other R packages. Most importantly, Starr provides methods for integration of complementary genomics data, e.g., it enables systematic investigation of the relation between gene expression and DNA binding.

    Supervised learning using routine surveillance data improves outbreak detection of Salmonella and Campylobacter infections in Germany

    The early detection of infectious disease outbreaks is a crucial task to protect population health. To this end, public health surveillance systems have been established to systematically collect and analyse infectious disease data. A variety of statistical tools are available that use these data to detect potential outbreaks as aberrations from an expected endemic level. Here, we present supervised hidden Markov models for disease outbreak detection, which use reported outbreaks that are routinely collected in the German infectious disease surveillance system and have not been leveraged so far. This allows labeled outbreak data to be integrated directly into a statistical time series model for outbreak detection. We evaluate our model using real Salmonella and Campylobacter data, as well as simulations. The proposed supervised learning approach performs substantially better than unsupervised learning and on par with or better than a state-of-the-art approach, which is applied in multiple European countries including Germany. Peer Reviewed

    Genomic data integration with hidden Markov models to understand transcription regulation

    Transcription is a tightly controlled process that involves the recruitment and post-translational modification of DNA-associated protein complexes, which can be mapped to the genome using high-throughput experimental assays. An accurate annotation of genomic elements such as transcription units or cis-regulatory elements such as promoters or enhancers is crucial for the use and interpretation of data generated by these assays. Thus, integrative genomic data analysis of high-throughput assays with hidden Markov models (HMMs) has become a popular tool for genome annotation. However, current algorithms are limited by unrealistic data distribution assumptions and variance models. Moreover, they are not able to assign forward or reverse direction to states or properly integrate strand-specific (e.g., RNA expression) with non-strand-specific (e.g., ChIP) data, which is essential to characterize directed processes such as transcription. In this thesis new HMM-based methods are proposed to overcome these limitations. These include (i) bidirectional HMMs (bdHMMs), which integrate strand-specific with non-strand-specific data to infer directed genomic states de novo, and (ii) GenoSTAN (Genomic STate ANnotation), an HMM using discrete probability distributions to model count data, for genome annotation from Next-Generation-Sequencing data. Both approaches are made available in the R/Bioconductor package STAN (STate ANnotation), which provides an efficient implementation that can be run on large genomes such as human.
STAN is used to derive new and improved annotations of transcription in yeast and human and to generate a map of promoters and enhancers in 127 human cell types and tissues. Integration of transcription factor binding and RNA expression data in yeast recovers the majority of transcribed loci, reveals gene-specific variations in the yeast transcription cycle, and identifies 32 new transcribed loci, a regulated initiation-elongation transition, the absence of elongation factors Ctk1 and Paf1 from a class of genes, a distinct transcription mechanism for highly expressed genes and novel DNA sequence motifs associated with transcription termination. Moreover, promoters and enhancers in 127 human cell types and tissues are mapped by integrating sequencing data from the ENCODE and Roadmap Epigenomics projects, today’s largest compendium of chromatin assays. Promoters and enhancers are identified with consistently higher accuracy and show significantly higher enrichment of complex trait-associated genetic variants than current annotations. Investigation of binding of 101 transcription factors in human K562 cells reveals common and distinctive TF binding properties of enhancers and promoters. Application of STAN to transient transcriptome sequencing (TT-Seq) data in human K562 cells recovers stable mRNAs and long intergenic non-coding RNAs, and additionally maps over 10,000 transient RNAs, including enhancer RNAs, antisense RNAs, and promoter-associated RNAs. Further analyses reveal that transient RNAs such as enhancer RNAs are short and lack U1 motifs and secondary structure. Taken together, the annotations inferred in this thesis give new insights into transcription and its regulation and will be an important resource for future research in genomics. STAN is a valuable tool to create such annotations in other organisms as well and, as more data become available, to improve the existing ones.

    Starr: Simple Tiling ARRay analysis of Affymetrix ChIP-chip data

    Background: Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is an assay used for investigating DNA-protein binding or post-translational chromatin/histone modifications. As with all high-throughput technologies, it requires thorough bioinformatic processing of the data, for which there is no standard yet. The primary goal is to reliably identify and localize genomic regions that bind a specific protein. Further investigation compares binding profiles of functionally related proteins, or binding profiles of the same proteins in different genetic backgrounds or experimental conditions. Ultimately, the goal is to gain a mechanistic understanding of the effects of DNA binding events on gene expression. Results: We present a free, open-source R/Bioconductor package Starr that facilitates comparative analysis of ChIP-chip data across experiments and across different microarray platforms. The package provides functions for data import, quality assessment, data visualization and exploration. Starr includes high-level analysis tools such as the alignment of ChIP signals along annotated features, correlation analysis of ChIP signals with complementary genomic data, peak-finding and comparative display of multiple clusters of binding profiles. It uses standard Bioconductor classes for maximum compatibility with other software. Moreover, Starr automatically updates microarray probe annotation files by a highly efficient remapping of microarray probe sequences to an arbitrary genome. Conclusion: Starr is an R package that covers the complete ChIP-chip workflow from data processing to binding pattern detection. It focuses on high-level data analysis, e.g., it provides methods for the integration and combined statistical analysis of binding profiles and complementary functional genomics data. Starr enables systematic assessment of binding behaviour for groups of genes that are aligned along arbitrary genomic features.

    Simultaneous characterization of sense and antisense genomic processes by the double-stranded hidden Markov model

    Hidden Markov models (HMMs) have been extensively used to dissect the genome into functionally distinct regions using data such as RNA expression or DNA binding measurements. It is a challenge to disentangle processes occurring on complementary strands of the same genomic region. We present the double-stranded HMM (dsHMM), a model for the strand-specific analysis of genomic processes. We applied dsHMM to yeast using strand-specific transcription data, nucleosome data, and protein binding data for a set of 11 factors associated with the regulation of transcription. The resulting annotation recovers the mRNA transcription cycle (initiation, elongation, termination) while correctly predicting strand-specificity and directionality of the transcription process. We find that pre-initiation complex formation is an essentially undirected process, giving rise to a large number of bidirectional promoters and to pervasive antisense transcription. Notably, 12% of all transcriptionally active positions showed simultaneous activity on both strands. Furthermore, dsHMM reveals that antisense transcription is specifically suppressed by Nrd1, a yeast termination factor.

    Global DNA hypomethylation prevents consolidation of differentiation programs and allows reversion to the embryonic stem cell state.

    DNA methylation patterns change dynamically during mammalian development and lineage specification, yet little information is available about how DNA methylation affects gene expression profiles upon differentiation. Here we determine genome-wide transcription profiles during undirected differentiation of severely hypomethylated (Dnmt1⁻/⁻) embryonic stem cells (ESCs) as well as ESCs completely devoid of DNA methylation (Dnmt1⁻/⁻;Dnmt3a⁻/⁻;Dnmt3b⁻/⁻ or TKO) and assay their potential to transit in and out of the ESC state. We find that the expression of only a few genes, mainly associated with germ line function and the X chromosome, is affected in undifferentiated TKO ESCs. Upon initial differentiation as embryoid bodies (EBs), wild type, Dnmt1⁻/⁻ and TKO cells downregulate pluripotency-associated genes and upregulate lineage-specific genes, but their transcription profiles progressively diverge upon prolonged EB culture. While Oct4 protein levels are completely and homogeneously suppressed, transcription of Oct4 and Nanog is not completely silenced even at late stages in both Dnmt1⁻/⁻ and TKO EBs. Despite late wild type and Dnmt1⁻/⁻ EBs showing a much higher degree of concordant expression, after EB dissociation and replating under pluripotency-promoting conditions both Dnmt1⁻/⁻ and TKO cells, but not wild type cells, rapidly revert to expression profiles typical of undifferentiated ESCs. Thus, while DNA methylation seems not to be critical for initial activation of differentiation programs, it is crucial for permanent restriction of developmental fate during differentiation.

    Annotation of genomics data using bidirectional hidden Markov models unveils variations in Pol II transcription cycle

    DNA replication, transcription and repair involve the recruitment of protein complexes that change their composition as they progress along the genome in a directed or strand-specific manner. Chromatin immunoprecipitation in conjunction with hidden Markov models (HMMs) has been instrumental in understanding these processes, as they segment the genome into discrete states that can be related to DNA-associated protein complexes. However, current HMM-based approaches are not able to assign forward or reverse direction to states or properly integrate strand-specific (e.g., RNA expression) with non-strand-specific (e.g., ChIP) data, which is indispensable to accurately characterize directed processes. To overcome these limitations, we introduce bidirectional HMMs which infer directed genomic states from occupancy profiles de novo. Application to RNA polymerase II-associated factors in yeast and chromatin modifications in human T cells recovers the majority of transcribed loci, reveals gene-specific variations in the yeast transcription cycle and indicates the existence of directed chromatin state patterns at transcribed, but not at repressed, regions in the human genome. In yeast, we identify 32 new transcribed loci, a regulated initiation-elongation transition, the absence of elongation factors Ctk1 and Paf1 from a class of genes, a distinct transcription mechanism for highly expressed genes and novel DNA sequence motifs associated with transcription termination. We anticipate bidirectional HMMs to significantly improve the analyses of genome-associated directed processes.

    Accurate Promoter and Enhancer Identification in 127 ENCODE and Roadmap Epigenomics Cell Types and Tissues by GenoSTAN

    Accurate maps of promoters and enhancers are required for understanding transcriptional regulation. Promoters and enhancers are usually mapped by integration of chromatin assays charting histone modifications, DNA accessibility, and transcription factor binding. However, current algorithms are limited by unrealistic data distribution assumptions. Here we propose GenoSTAN (Genomic STate ANnotation), a hidden Markov model overcoming these limitations. We map promoters and enhancers for 127 cell types and tissues from the ENCODE and Roadmap Epigenomics projects, today's largest compendium of chromatin assays. Extensive benchmarks demonstrate that GenoSTAN generally identifies promoters and enhancers with significantly higher accuracy than previous methods. Moreover, GenoSTAN-derived promoters and enhancers showed significantly higher enrichment of complex trait-associated genetic variants than current annotations. Altogether, GenoSTAN provides an easy-to-use tool to define promoters and enhancers in any system, and our annotation of human transcriptional cis-regulatory elements constitutes a rich resource for future research in biology and medicine.

    Geographical differences of carbapenem non-susceptible Enterobacterales and Acinetobacter spp. in Germany from 2017 to 2019

    Background: Since May 2016, infection and colonisation with carbapenem non-susceptible Acinetobacter spp. (CRA) and Enterobacterales (CRE) must be notified to health authorities in Germany. The aim of our study was to assess the epidemiology of CRA and CRE from 2017 to 2019 in Germany, to identify risk groups and to determine geographical differences of CRA and CRE notifications. Methods: Cases were notified from laboratories to local public health authorities and forwarded to state and national level. Non-susceptibility was defined as intermediate or resistant to ertapenem, imipenem, or meropenem, excluding intrinsic bacterial resistance, or the detection of a carbapenemase gene. We analysed CRA and CRE notifications from 2017, 2018 and 2019 per 100,000 inhabitants (notification incidence), regarding their demographic, clinical and laboratory information. The effect of regional hospital density on CRA and CRE notification incidence was estimated using negative binomial regression. Results: From 2017 to 2019, 2278 CRA and 12,282 CRE cases were notified in Germany. CRA and CRE cases did not differ regarding demographic and clinical information, e.g., proportion infected. The notification incidence of CRA declined slightly from 0.95 in 2017 to 0.86 in 2019, whereas that of CRE increased from 4.23 in 2017 to 5.72 in 2019. The highest CRA and CRE notification incidences were found in the age groups above 70 years. Infants below 1 year also showed a high CRE notification incidence. Notification incidences varied between 0.10 and 2.86 for CRA and between 1.49 and 9.99 for CRE by federal state. The notification incidence of CRA and CRE cases increased with each additional hospital per district. Conclusion: The notification incidence of CRA and CRE varied geographically and was correlated with the number of hospitals. The results support the assumption that hospitals are the main driver for higher CRE and CRA incidence.
Preventive strategies and early control measures should target older age groups and newborns and areas with a high incidence. Peer Reviewed