The Gibbs Centroid Sampler
The Gibbs Centroid Sampler is a software package designed for locating conserved elements in biopolymer sequences. It reports a centroid alignment, i.e., an alignment with the minimum total distance to the set of samples drawn from the a posteriori probability distribution of transcription factor binding-site alignments. In so doing, it garners information from the full ensemble of solutions, rather than only the single most probable point that is the target of many motif-finding algorithms, including its predecessor, the Gibbs Recursive Sampler. Centroid estimators have been shown to yield substantial improvements, in both sensitivity and positive predictive value, in the prediction of RNA secondary structure and in motif finding. The Gibbs Centroid Sampler, along with interactive tutorials, an online user manual, and information on downloading the software, is available at: http://bayesweb.wadsworth.org/gibbs/gibbs.html
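To make the centroid idea concrete: if each sampled solution is represented as a set of predicted binding sites, then under a symmetric-difference (Hamming-style) distance, the alignment minimizing the total distance to all samples keeps exactly those sites present in more than half of them. A minimal Python sketch — the site representation and data are hypothetical, not the package's internals:

```python
from collections import Counter

def centroid(samples):
    """Return the centroid of a set of sampled alignments.

    Each sample is a set of predicted binding sites, e.g. (sequence_id,
    start) pairs.  Under symmetric-difference distance, the alignment
    minimizing the total distance to all samples consists of the sites
    that occur in more than half of the samples (coordinate-wise
    majority vote).
    """
    counts = Counter(site for s in samples for site in s)
    half = len(samples) / 2
    return {site for site, c in counts.items() if c > half}

samples = [
    {("seq1", 10), ("seq2", 40)},
    {("seq1", 10), ("seq2", 40), ("seq3", 5)},
    {("seq1", 10)},
]
print(centroid(samples))  # {('seq1', 10), ('seq2', 40)} (set order may vary)
```

The point of the construction is that a site appearing in, say, 2 of 3 samples survives even if it is absent from the single highest-probability sample — the ensemble, not one point estimate, decides.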
MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes
BACKGROUND: Cis-regulatory modules are combinations of regulatory elements occurring in close proximity to each other that control the spatial and temporal expression of genes. The ability to identify them in a genome-wide manner depends on the availability of accurate models and of search methods able to detect putative regulatory elements with enhanced sensitivity and specificity. RESULTS: We describe the implementation of a search method for putative transcription factor binding sites (TFBSs) based on hidden Markov models built from alignments of known sites. We built 1,079 models of TFBSs using experimentally determined sequence alignments of sites provided by the TRANSFAC and JASPAR databases and used them to scan sequences of the human, mouse, fly, worm and yeast genomes. In several test cases the method correctly identified experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method. CONCLUSION: The search engine, available at , allows the identification, visualization and selection of putative TFBSs occurring in the promoter or other regions of a gene from the human, mouse, fly, worm and yeast genomes. In addition, it allows the user to upload a sequence to query and to build a model by supplying a multiple sequence alignment of binding sites for a transcription factor of interest. Owing to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.
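For reference, the nucleotide weight matrix-based method that the abstract uses as a baseline can be sketched as a log-odds PWM scan. The counts and sequence below are invented for illustration; MAPPER's hidden Markov models are more expressive than this:

```python
import math

# Hypothetical position frequency matrix for a 4-bp site: one dict per
# position, with counts of A, C, G, T observed in an alignment of known
# binding sites (consensus here is ACGT).
pfm = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 8, "T": 1},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds(pfm, pseudocount=0.5):
    """Convert counts to a log-odds weight matrix against the background."""
    pwm = []
    for col in pfm:
        total = sum(col.values()) + 4 * pseudocount
        pwm.append({b: math.log2((col[b] + pseudocount) / total / background[b])
                    for b in "ACGT"})
    return pwm

def scan(seq, pwm):
    """Score every window of the sequence; higher = closer to the motif."""
    w = len(pwm)
    return [(i, sum(pwm[j][seq[i + j]] for j in range(w)))
            for i in range(len(seq) - w + 1)]

pwm = log_odds(pfm)
hits = scan("TTACGTAA", pwm)
best = max(hits, key=lambda t: t[1])  # window at offset 2 ("ACGT") wins
```

A weight matrix scores each position independently; the HMM-based models in the paper can additionally capture insertions, deletions and position-dependent structure in the alignment.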
Genetic networks of antibacterial responses of eukaryotic cells. Bioinformatics analysis and modeling
This work describes the development of new methods for the construction of promoter models, one of the necessary steps in building regulatory networks. Identification of characteristic promoter features reveals the role of specific transcription factors (TFs) in triggering the response, which in turn sheds light on the signaling pathways that activate these TFs. By combining reported results of microarray analyses with other available information about the genes expressed in the cellular systems under consideration, we search for distinguishing features of the promoters of coexpressed genes. Applying such promoter models makes it possible to identify additional candidate genes belonging to the same regulatory network. Four novel approaches are presented in this work: (i) a subtractive approach to matrix generation; (ii) a distance distribution approach; (iii) a "seed" sets approach; (iv) a complementary pairs approach. These approaches help to solve serious problems in promoter model construction, such as the doubtful reliability of positive training sets ("seed" sets approach) and the lack of knowledge about the exact signaling pathways triggering gene expression (complementary pairs approach); the subtractive approach to matrix generation refines positional weight matrices (PWMs) for heterogeneous sets of binding sites, thus improving the PWM search for single TFBSs. A significant improvement in the specificity of promoter analysis was achieved by applying statistical methods to characterize TFBS combinations at over-represented distances, rather than merely identifying single potential TFBSs (distance distribution approach). The newly developed methods were applied to describe four defensive eukaryotic systems in terms of transcription regulation.
The obtained models enabled us to gain better insights into the pathways of the corresponding signaling networks.
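The distance distribution idea can be illustrated with a toy sketch: pool the pairwise spacings between predicted sites of two TFs and look for spacings that recur suspiciously often. The hit positions below are hypothetical, and the actual method applies statistical over-representation tests rather than raw counts:

```python
from collections import Counter

def distance_distribution(hits_a, hits_b, max_dist=100):
    """Count pairwise distances between predicted binding sites of two
    TFs; spacings occurring more often than expected by chance hint at
    a functional TFBS combination."""
    dist = Counter()
    for a in hits_a:
        for b in hits_b:
            d = b - a
            if 0 < d <= max_dist:
                dist[d] += 1
    return dist

# Hypothetical pooled hit positions for TF A and TF B across promoters:
hits_a = [12, 55, 140, 200]
hits_b = [32, 75, 160, 260]
dist = distance_distribution(hits_a, hits_b)
# Spacing 20 occurs three times -> candidate over-represented distance.
```

Requiring a recurrent A-to-B spacing, rather than an A hit and a B hit anywhere in the promoter, is what buys the specificity gain the abstract describes.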
A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data
The interaction between transcription factors (TFs) and DNA plays an important role in gene expression regulation. In the past, experiments on protein–DNA interactions could identify only a handful of sequences that a TF binds with high affinity. In recent years, several high-throughput experimental techniques, such as high-throughput SELEX (HT-SELEX), protein-binding microarrays (PBMs) and ChIP-seq, have been developed to estimate the relative binding affinities of large numbers of DNA sequences both in vitro and in vivo. The large volume of data generated by these techniques proved to be a challenge and prompted the development of novel motif discovery algorithms. These algorithms are based on a range of TF binding models, including the widely used probabilistic model that represents binding motifs as position frequency matrices (PFMs). However, the probabilistic model has limitations, and the PFMs extracted from some of the high-throughput experiments are known to be suboptimal. In this dissertation, we address these limitations and develop a generalized biophysical model and an expectation maximization (EM) algorithm for estimating position weight matrices (PWMs) and other parameters from HT-SELEX data. First, we discuss the inherent limitations of the popular probabilistic model and compare it with a biophysical model that assumes the nucleotides in a binding site contribute independently to its binding energy instead of its binding probability. We use simulations to demonstrate that the biophysical model almost always provides better fits to the data and conclude that it should take the place of the probabilistic model in characterizing TF binding specificity.
Then we describe a generalized biophysical model, which removes the assumption of known binding locations and is particularly suitable for modeling protein–DNA interactions in HT-SELEX experiments, and BEESEM, an EM algorithm capable of estimating the binding model and binding locations simultaneously. BEESEM can also calculate the confidence intervals of the estimated parameters in the binding model, a rare but useful feature among motif discovery algorithms. By comparing BEESEM with 5 other algorithms on HT-SELEX, PBM and ChIP-seq data, we demonstrate that BEESEM provides significantly better fits to in vitro data and is similar to the other methods (with one exception) on in vivo data under the criterion of the area under the receiver operating characteristic curve (AUROC). We also discuss the limitations of the AUROC criterion, which is purely rank-based and thus misses quantitative binding information. Finally, we investigate whether adding DNA shape features can significantly improve the accuracy of binding models. We evaluate the ability of the gradient boosting classifiers generated by DNAshapedTFBS, an algorithm that takes account of DNA shape features, to differentiate ChIP-seq peaks from random background sequences, and compare them with various matrix-based binding models. The results indicate that, compared with optimized PWMs, adding DNA shape features does not produce significantly better binding models and may increase the risk of overfitting on training datasets.
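The contrast between the two model classes can be sketched in a few lines: in the biophysical model each nucleotide contributes additively to the binding *energy*, and occupancy is a sigmoid function of that energy, rather than each nucleotide contributing independently to the binding probability. A minimal Python illustration — the energy matrix and chemical potential are invented values, not parameters fitted in the dissertation:

```python
import math

# Hypothetical mismatch-energy matrix (in kT units) for a 3-bp site:
# 0 for the consensus base at each position, positive penalties otherwise.
energy = [
    {"A": 0.0, "C": 2.0, "G": 1.5, "T": 2.0},
    {"A": 2.0, "C": 0.0, "G": 2.0, "T": 1.0},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 2.0},
]
mu = 1.0  # chemical potential (assumed); sets the overall occupancy scale

def occupancy(site):
    """Fermi-Dirac occupancy: additive energies, sigmoidal binding."""
    e = sum(energy[i][b] for i, b in enumerate(site))
    return 1.0 / (1.0 + math.exp(e - mu))

print(occupancy("ACG"))  # consensus site, e = 0: ~0.731
print(occupancy("TTT"))  # poor site, e = 5:      ~0.018
```

Because the sigmoid saturates, two strong sites with different energies can show nearly identical occupancies — one reason PFMs estimated by treating frequencies as binding probabilities can come out distorted.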
Minimum Description Length codes are critical
In the Minimum Description Length (MDL) principle, learning from the data is equivalent to an optimal coding problem. We show that the codes that achieve optimal compression in MDL are critical in a very precise sense. First, when they are taken as generative models of samples, they generate samples with broad empirical distributions and with a high value of the relevance, defined as the entropy of the empirical frequencies. These results are derived for different statistical models (Dirichlet model, independent and pairwise dependent spin models, and restricted Boltzmann machines). Second, MDL codes sit precisely at a second-order phase transition point where the symmetry between the sampled outcomes is spontaneously broken. The order parameter controlling the phase transition is the coding cost of the samples. The phase transition is a manifestation of the optimality of MDL codes, and it arises because codes that achieve a higher compression do not exist. These results suggest a clear interpretation of the widespread occurrence of statistical criticality as a characterization of samples which are maximally informative on the underlying generative process.
Comment: 23 pages, 5 figures
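The "learning = coding" equivalence can be illustrated with the simplest two-part MDL code: the data are coded with the maximum-likelihood distribution, costing n times the entropy of the empirical frequencies (the "relevance" in the abstract's sense), plus roughly (k-1)/2 * log2(n) bits for the parameters. A toy Python sketch — an illustration of the principle, not one of the paper's models:

```python
import math
from collections import Counter

def two_part_codelength(sample, k):
    """Two-part MDL code length (bits) for an i.i.d. categorical sample
    over k outcomes: n * H(empirical frequencies) bits for the data
    under the ML distribution, plus (k-1)/2 * log2(n) bits to transmit
    the estimated parameters to the stated precision."""
    n = len(sample)
    counts = Counter(sample)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy + (k - 1) / 2 * math.log2(n)

sample = "AABABBBAAB"  # n = 10, five A's and five B's: H = 1 bit exactly
print(two_part_codelength(sample, k=2))  # 10 * 1.0 + 0.5 * log2(10) ~ 11.66
```

The data term is largest for broad (high-entropy) empirical distributions, which is exactly the regime the paper identifies with samples that are maximally informative about the generative process.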