2,909 research outputs found

    Genome Biol.

    No full text
    With genome analysis expanding from the study of genes to the study of gene regulation, 'regulatory genomics' utilizes sequence information, evolution and functional genomics measurements to unravel how regulatory information is encoded in the genome

    Promoter addresses: revelations from oligonucleotide profiling applied to the Escherichia coli genome

    Get PDF
    BACKGROUND: Transcription is the first step in cellular information processing. It is regulated by cis-acting elements such as promoters and operators in the DNA, and trans-acting elements such as transcription factors and sigma factors. Identification of cis-acting regulatory elements on a genomic scale requires computational analysis. RESULTS: We have used oligonucleotide profiling to predict regulatory regions in a bacterial genome. The method has been applied to the Escherichia coli K12 genome and the results analyzed. The information content of the putative regulatory oligonucleotides so predicted is validated through intra-genomic analyses, correlations with experimental data and inter-genome comparisons. Based on the results we have proposed a model for the bacterial promoter. The results show that the method is capable of identifying, in the E.coli genome, cis-acting elements such as TATAAT (sigma70 binding site), CCCTAT (1 base relative of sigma32 binding site), CTATNN (LexA binding site), AGGA-containing hexanucleotides (Shine Dalgarno consensus) and CTAG-containing hexanucleotides (core binding sites for Trp and Met repressors). CONCLUSION: The method adopted is simple yet effective in predicting upstream regulatory elements in bacteria. It does not need any prior experimental data except the sequence itself. This method should be applicable to most known genomes. Profiling, as applied to the E.coli genome, picks up known cis-acting and regulatory elements. Based on the profile results, we propose a model for the bacterial promoter that is extensible even to eukaryotes. The model is that the core promoter lies within a plateau of bent AT-rich DNA. This bent DNA acts as a homing segment for the sigma factor to recognize the promoter. The model thus suggests an important role for local landscapes in prokaryotic and eukaryotic gene regulation

    Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with <it>ad hoc </it>methods, making them unsuitable for general motif searches.</p> <p>Results</p> <p>A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly.</p> <p>Conclusion</p> <p>Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.</p

    Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts

    Get PDF
    Over the last decades a revolution in novel measurement techniques has permeated the biological sciences filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from the vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions in two data-rich fields in biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry. In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov Models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape-readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web-interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools. The second part discusses a statistical attempt to correct out the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods have revolutionized the protein-structure prediction field more than 10 years ago, and, until very recently, have retained their importance as crucial input features to deep neural networks. As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias to the signal can be corrected out in a principled way using a variation of the Felsenstein's tree-pruning algorithm applied in combination with an independent-pair assumption to derive pairwise amino counts that are corrected for the evolutionary history. Unfortunately, the contact prediction derived from our corrected pairwise amino acid counts did not yield a competitive performance.2021-09-2

    Integrating Epigenetic Priors For Improving Computational Identification of Transcription Factor Binding Sites

    Get PDF
    Transcription factors and histone modifications play critical roles in tissue-specific gene expression. Identifying binding sites is key in understanding the regulatory interactions of gene expression. Nave computational approaches uses solely DNA sequence data to construct models known as Position Weight Matrices. However, the various assumptions and the lack of background genomic information leads to a high false positive rate. In an attempt to improve the predictive performance of a PWM, we use a Hidden Markov Model to incorporate chromatin structure, in particular histone modifications. The HMM captures physical interactions between distinct HMs. Indeed, the integration of sequence based PWM models and chromatin modifications improve the predictive ability of the integrative model

    Exploring Patterns of Epigenetic Information With Data Mining Techniques

    Get PDF
    [Abstract] Data mining, a part of the Knowledge Discovery in Databases process (KDD), is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, thus generating great amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information which are mitotically and/or meiotically heritable determining gene expression and cellular differentiation, as well as cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their life and accumulate with ageing. Both defects, either together or individually, can result in losing control over cell growth and, thus, causing cancer development. Data mining techniques could be then used to extract the previous patterns. This work reviews some of the most important applications of data mining to epigenetics.Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo; 209RT-0366Galicia. Consellería de Economía e Industria; 10SIN105004PRInstituto de Salud Carlos III; RD07/0067/000
    • …
    corecore