The Gibbs Centroid Sampler
The Gibbs Centroid Sampler is a software package designed for locating conserved elements in biopolymer sequences. It reports a centroid alignment, i.e., an alignment with the minimum total distance to the set of samples drawn from the a posteriori probability distribution of transcription factor binding-site alignments. In so doing, it garners information from the full ensemble of solutions, rather than only the single most probable point that is the target of many motif-finding algorithms, including its predecessor, the Gibbs Recursive Sampler. Centroid estimators have been shown to yield substantial improvements, in both sensitivity and positive predictive value, in the prediction of RNA secondary structure and in motif finding. The Gibbs Centroid Sampler, along with interactive tutorials, an online user manual, and information on downloading the software, is available at: http://bayesweb.wadsworth.org/gibbs/gibbs.html
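To make the centroid idea concrete: if each sampled solution is represented as a set of predicted binding sites, then under a symmetric-difference (Hamming-style) distance, the alignment minimizing the total distance to all samples keeps exactly those sites present in more than half of them. A minimal Python sketch — the site representation and data are hypothetical, not the package's internals:

```python
from collections import Counter

def centroid(samples):
    """Return the centroid of a set of sampled alignments.

    Each sample is a set of predicted binding sites, e.g. (sequence_id,
    start) pairs.  Under symmetric-difference distance, the alignment
    minimizing the total distance to all samples consists of the sites
    that occur in more than half of the samples (coordinate-wise
    majority vote).
    """
    counts = Counter(site for s in samples for site in s)
    half = len(samples) / 2
    return {site for site, c in counts.items() if c > half}

samples = [
    {("seq1", 10), ("seq2", 40)},
    {("seq1", 10), ("seq2", 40), ("seq3", 5)},
    {("seq1", 10)},
]
print(centroid(samples))  # {('seq1', 10), ('seq2', 40)} (set order may vary)
```

The point of the construction is that a site appearing in, say, 2 of 3 samples survives even if it is absent from the single highest-probability sample — the ensemble, not one point estimate, decides.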
MAPPER: a search engine for the computational identification of putative transcription factor binding sites in multiple genomes
BACKGROUND: Cis-regulatory modules are combinations of regulatory elements occurring in close proximity to each other that control the spatial and temporal expression of genes. The ability to identify them in a genome-wide manner depends on the availability of accurate models and of search methods able to detect putative regulatory elements with enhanced sensitivity and specificity. RESULTS: We describe the implementation of a search method for putative transcription factor binding sites (TFBSs) based on hidden Markov models built from alignments of known sites. We built 1,079 models of TFBSs using experimentally determined sequence alignments of sites provided by the TRANSFAC and JASPAR databases and used them to scan sequences of the human, mouse, fly, worm and yeast genomes. In several test cases the method correctly identified experimentally characterized sites, with better specificity and sensitivity than other similar computational methods. Moreover, a large-scale comparison using synthetic data showed that in the majority of cases our method performed significantly better than a nucleotide weight matrix-based method. CONCLUSION: The search engine, available at , allows the identification, visualization and selection of putative TFBSs occurring in the promoter or other regions of a gene from the human, mouse, fly, worm and yeast genomes. In addition, it allows the user to upload a sequence to query and to build a model by supplying a multiple sequence alignment of binding sites for a transcription factor of interest. Owing to its extensive database of models, powerful search engine and flexible interface, MAPPER represents an effective resource for the large-scale computational analysis of transcriptional regulation.
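For reference, the nucleotide weight matrix-based method that the abstract uses as a baseline can be sketched as a log-odds PWM scan. The counts and sequence below are invented for illustration; MAPPER's hidden Markov models are more expressive than this:

```python
import math

# Hypothetical position frequency matrix for a 4-bp site: one dict per
# position, with counts of A, C, G, T observed in an alignment of known
# binding sites (consensus here is ACGT).
pfm = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 8, "T": 1},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds(pfm, pseudocount=0.5):
    """Convert counts to a log-odds weight matrix against the background."""
    pwm = []
    for col in pfm:
        total = sum(col.values()) + 4 * pseudocount
        pwm.append({b: math.log2((col[b] + pseudocount) / total / background[b])
                    for b in "ACGT"})
    return pwm

def scan(seq, pwm):
    """Score every window of the sequence; higher = closer to the motif."""
    w = len(pwm)
    return [(i, sum(pwm[j][seq[i + j]] for j in range(w)))
            for i in range(len(seq) - w + 1)]

pwm = log_odds(pfm)
hits = scan("TTACGTAA", pwm)
best = max(hits, key=lambda t: t[1])  # window at offset 2 ("ACGT") wins
```

A weight matrix scores each position independently; the HMM-based models in the paper can additionally capture insertions, deletions and position-dependent structure in the alignment.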
Genetic networks of antibacterial responses of eukaryotic cells. Bioinformatics analysis and modeling
This work describes the development of new methods for the construction of promoter models, one of the necessary steps in building regulatory networks. Identification of characteristic promoter features reveals the role of specific transcription factors (TFs) in triggering the response, which in turn sheds light on the signaling pathways that activate these TFs. By combining reported results of microarray analyses with other available information about the genes expressed in the cellular systems under consideration, we search for distinguishing features of the promoters of coexpressed genes. Applying such promoter models makes it possible to identify additional candidate genes belonging to the same regulatory network. Four novel approaches are presented in this work: (i) a subtractive approach to matrix generation; (ii) a distance distribution approach; (iii) a "seed" sets approach; (iv) a complementary pairs approach. These approaches help to solve serious problems in promoter model construction, such as the doubtful reliability of positive training sets ("seed" sets approach) and the lack of knowledge about the exact signaling pathways triggering gene expression (complementary pairs approach); the subtractive approach to matrix generation refines positional weight matrices (PWMs) for heterogeneous sets of binding sites, thus improving the PWM search for single TFBSs. A significant improvement in the specificity of promoter analysis was achieved by applying statistical methods to characterize TFBS combinations at over-represented distances, rather than merely identifying single potential TFBSs (distance distribution approach). The newly developed methods were applied to describe four defensive eukaryotic systems in terms of transcription regulation.
The obtained models enabled us to gain better insights into the pathways of the corresponding signaling networks.
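The distance distribution idea can be illustrated with a toy sketch: pool the pairwise spacings between predicted sites of two TFs and look for spacings that recur suspiciously often. The hit positions below are hypothetical, and the actual method applies statistical over-representation tests rather than raw counts:

```python
from collections import Counter

def distance_distribution(hits_a, hits_b, max_dist=100):
    """Count pairwise distances between predicted binding sites of two
    TFs; spacings occurring more often than expected by chance hint at
    a functional TFBS combination."""
    dist = Counter()
    for a in hits_a:
        for b in hits_b:
            d = b - a
            if 0 < d <= max_dist:
                dist[d] += 1
    return dist

# Hypothetical pooled hit positions for TF A and TF B across promoters:
hits_a = [12, 55, 140, 200]
hits_b = [32, 75, 160, 260]
dist = distance_distribution(hits_a, hits_b)
# Spacing 20 occurs three times -> candidate over-represented distance.
```

Requiring a recurrent A-to-B spacing, rather than an A hit and a B hit anywhere in the promoter, is what buys the specificity gain the abstract describes.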
A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data
The interaction between transcription factors (TFs) and DNA plays an important role in gene expression regulation. In the past, experiments on protein–DNA interactions could identify only a handful of sequences that a TF binds with high affinity. In recent years, several high-throughput experimental techniques, such as high-throughput SELEX (HT-SELEX), protein-binding microarrays (PBMs) and ChIP-seq, have been developed to estimate the relative binding affinities of large numbers of DNA sequences both in vitro and in vivo. The large volume of data generated by these techniques proved to be a challenge and prompted the development of novel motif discovery algorithms. These algorithms are based on a range of TF binding models, including the widely used probabilistic model that represents binding motifs as position frequency matrices (PFMs). However, the probabilistic model has limitations, and the PFMs extracted from some of the high-throughput experiments are known to be suboptimal. In this dissertation, we address these limitations and develop a generalized biophysical model and an expectation maximization (EM) algorithm for estimating position weight matrices (PWMs) and other parameters from HT-SELEX data. First, we discuss the inherent limitations of the popular probabilistic model and compare it with a biophysical model that assumes the nucleotides in a binding site contribute independently to its binding energy instead of its binding probability. We use simulations to demonstrate that the biophysical model almost always provides better fits to the data and conclude that it should take the place of the probabilistic model in characterizing TF binding specificity.
Then we describe a generalized biophysical model, which removes the assumption of known binding locations and is particularly suitable for modeling protein–DNA interactions in HT-SELEX experiments, and BEESEM, an EM algorithm capable of estimating the binding model and binding locations simultaneously. BEESEM can also calculate the confidence intervals of the estimated parameters in the binding model, a rare but useful feature among motif discovery algorithms. By comparing BEESEM with 5 other algorithms on HT-SELEX, PBM and ChIP-seq data, we demonstrate that BEESEM provides significantly better fits to in vitro data and is similar to the other methods (with one exception) on in vivo data under the criterion of the area under the receiver operating characteristic curve (AUROC). We also discuss the limitations of the AUROC criterion, which is purely rank-based and thus misses quantitative binding information. Finally, we investigate whether adding DNA shape features can significantly improve the accuracy of binding models. We evaluate the ability of the gradient boosting classifiers generated by DNAshapedTFBS, an algorithm that takes account of DNA shape features, to differentiate ChIP-seq peaks from random background sequences, and compare them with various matrix-based binding models. The results indicate that, compared with optimized PWMs, adding DNA shape features does not produce significantly better binding models and may increase the risk of overfitting on training datasets.
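The contrast between the two model classes can be sketched in a few lines: in the biophysical model each nucleotide contributes additively to the binding *energy*, and occupancy is a sigmoid function of that energy, rather than each nucleotide contributing independently to the binding probability. A minimal Python illustration — the energy matrix and chemical potential are invented values, not parameters fitted in the dissertation:

```python
import math

# Hypothetical mismatch-energy matrix (in kT units) for a 3-bp site:
# 0 for the consensus base at each position, positive penalties otherwise.
energy = [
    {"A": 0.0, "C": 2.0, "G": 1.5, "T": 2.0},
    {"A": 2.0, "C": 0.0, "G": 2.0, "T": 1.0},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 2.0},
]
mu = 1.0  # chemical potential (assumed); sets the overall occupancy scale

def occupancy(site):
    """Fermi-Dirac occupancy: additive energies, sigmoidal binding."""
    e = sum(energy[i][b] for i, b in enumerate(site))
    return 1.0 / (1.0 + math.exp(e - mu))

print(occupancy("ACG"))  # consensus site, e = 0: ~0.731
print(occupancy("TTT"))  # poor site, e = 5:      ~0.018
```

Because the sigmoid saturates, two strong sites with different energies can show nearly identical occupancies — one reason PFMs estimated by treating frequencies as binding probabilities can come out distorted.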
Minimum Description Length codes are critical
In the Minimum Description Length (MDL) principle, learning from the data is equivalent to an optimal coding problem. We show that the codes that achieve optimal compression in MDL are critical in a very precise sense. First, when they are taken as generative models of samples, they generate samples with broad empirical distributions and with a high value of the relevance, defined as the entropy of the empirical frequencies. These results are derived for different statistical models (Dirichlet model, independent and pairwise dependent spin models, and restricted Boltzmann machines). Second, MDL codes sit precisely at a second-order phase transition point where the symmetry between the sampled outcomes is spontaneously broken. The order parameter controlling the phase transition is the coding cost of the samples. The phase transition is a manifestation of the optimality of MDL codes, and it arises because codes that achieve a higher compression do not exist. These results suggest a clear interpretation of the widespread occurrence of statistical criticality as a characterization of samples which are maximally informative on the underlying generative process.
Comment: 23 pages, 5 figures
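The "learning = coding" equivalence can be illustrated with the simplest two-part MDL code: the data are coded with the maximum-likelihood distribution, costing n times the entropy of the empirical frequencies (the "relevance" in the abstract's sense), plus roughly (k-1)/2 * log2(n) bits for the parameters. A toy Python sketch — an illustration of the principle, not one of the paper's models:

```python
import math
from collections import Counter

def two_part_codelength(sample, k):
    """Two-part MDL code length (bits) for an i.i.d. categorical sample
    over k outcomes: n * H(empirical frequencies) bits for the data
    under the ML distribution, plus (k-1)/2 * log2(n) bits to transmit
    the estimated parameters to the stated precision."""
    n = len(sample)
    counts = Counter(sample)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy + (k - 1) / 2 * math.log2(n)

sample = "AABABBBAAB"  # n = 10, five A's and five B's: H = 1 bit exactly
print(two_part_codelength(sample, k=2))  # 10 * 1.0 + 0.5 * log2(10) ~ 11.66
```

The data term is largest for broad (high-entropy) empirical distributions, which is exactly the regime the paper identifies with samples that are maximally informative about the generative process.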