1,235 research outputs found

    Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors

    As the number of sequenced bacterial genomes increases, rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) become increasingly desirable. Promoters are key regulatory elements that recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for the computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to examine experimentally the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while its performance on independent test data is very poor. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that cross-validation might estimate. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data were obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best-performing sequence-based classifiers outperform the best-performing structure-based classifiers in both cross-validation and independent test evaluation experiments.
Finally, we propose a meta-predictor method combining two top-performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods. NPRP grant No. 4-1454-1-233 from the Qatar National Research Fund (a member of Qatar Foundation).
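The contrast this abstract draws between cross-validation and independent-test evaluation can be sketched as follows. This is only an illustration under stated assumptions: the GC-fraction threshold classifier and all function names are hypothetical stand-ins, not the authors' features, algorithms, or data sets.

```python
import random
from statistics import mean


def gc_fraction(seq):
    """Fraction of G/C bases in a DNA string (toy feature, an assumption)."""
    return sum(base in "GC" for base in seq) / len(seq)


def train(seqs, labels):
    """Fit a one-feature threshold classifier and return predict(seq).

    Labels: 1 = promoter, 0 = non-promoter. Both classes must be present.
    """
    pos = mean([gc_fraction(s) for s, y in zip(seqs, labels) if y == 1])
    neg = mean([gc_fraction(s) for s, y in zip(seqs, labels) if y == 0])
    threshold = (pos + neg) / 2
    promoters_are_low_gc = pos < neg

    def predict(seq):
        is_low = gc_fraction(seq) < threshold
        return 1 if is_low == promoters_are_low_gc else 0

    return predict


def accuracy(predict, seqs, labels):
    """Share of sequences whose predicted label matches the true label."""
    hits = sum(1 for s, y in zip(seqs, labels) if predict(s) == y)
    return hits / len(labels)


def kfold_cv_accuracy(seqs, labels, k=5, seed=0):
    """Mean accuracy over k folds (requires k <= len(seqs)).

    Each fold is held out once while the model is trained on the rest.
    """
    idx = list(range(len(seqs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        predict = train([seqs[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        scores.append(accuracy(predict,
                               [seqs[i] for i in fold],
                               [labels[i] for i in fold]))
    return mean(scores)
```

Reporting `kfold_cv_accuracy` on the training collection next to `accuracy` on sequences that were never used for training or model selection makes any gap between the two estimates visible.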

    Single DNA conformations and biological function

    From a nanoscience perspective, cellular processes and their reduced in vitro imitations provide extraordinary examples of highly robust few- or single-molecule reaction pathways. Prime examples are biochemical reactions involving DNA molecules and the coupling of these reactions to the physical conformations of DNA. In this review, we summarise recent results on the following phenomena. We investigate the biophysical properties of DNA looping and the equilibrium configurations of DNA knots, whose relevance to biological processes is increasingly appreciated. We discuss how random DNA looping may be related to the efficiency of the target search process of proteins for their specific binding site on the DNA molecule. We also dwell on the spontaneous formation of intermittent DNA nanobubbles and their importance for biological processes such as transcription initiation. The physical properties of DNA may indeed turn out to be particularly suitable for the use of DNA in nanosensing applications. Comment: 53 pages, 45 figures. Slightly revised version of a review article that is going to appear in the J. Comput. Theoret. Nanoscience; some typos corrected.

    Derivation of Context-free Stochastic L-Grammar Rules for Promoter Sequence Modeling Using Support Vector Machine

    Formal grammars can be used to describe complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and the morphology of a variety of organisms. We believe that parallel grammars can also be used for modeling genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detecting promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions, which make promoter recognition a complex problem. We replace the problem of promoter recognition with the induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from the Drosophila and vertebrate promoter datasets using a genetic programming technique, and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar rules are analyzed and compared with natural promoter sequences.
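To make the notion of a context-free stochastic L-grammar concrete, here is a minimal sketch of parallel rewriting, in which every symbol is replaced in the same step and each replacement is drawn according to its probability. The rule table `EXAMPLE_RULES` and the function names are hypothetical; they are not the rules derived in the paper, and the genetic programming and SVM fitness steps are not shown.

```python
import random

# Hypothetical rule table: symbol -> list of (probability, replacement).
# These are NOT the rules derived in the paper.
EXAMPLE_RULES = {
    "A": [(0.5, "AT"), (0.5, "A")],
    "T": [(0.7, "TA"), (0.3, "G")],
    "G": [(1.0, "GC")],
    "C": [(1.0, "C")],
}


def rewrite_once(seq, rules, rng):
    """Apply one parallel rewriting step: every symbol is replaced at once."""
    out = []
    for symbol in seq:
        productions = rules.get(symbol)
        if not productions:
            out.append(symbol)  # symbols without rules are copied unchanged
            continue
        weights = [p for p, _ in productions]
        replacements = [r for _, r in productions]
        out.append(rng.choices(replacements, weights=weights, k=1)[0])
    return "".join(out)


def derive(axiom, rules, steps, seed=0):
    """Rewrite the axiom `steps` times with a seeded random generator."""
    rng = random.Random(seed)
    seq = axiom
    for _ in range(steps):
        seq = rewrite_once(seq, rules, rng)
    return seq
```

With probability-1 rules the derivation is deterministic, which gives a quick sanity check: the rules `A -> AB`, `B -> A` applied three times to `A` give `ABAAB`.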

    Computational and biochemical analysis of genomic aptamers in multiple species

    Growing evidence that most (if not all) of the genome is transcribed into RNA that does not code for protein presents an interesting challenge to the definition of a gene. Many of these transcripts are only transiently expressed, which complicates the process of finding and characterizing these genes. Genomic SELEX provides a platform for the discovery of such sparse, yet functional, RNAs by enriching those which confer binding activity out of a pool of genomic RNAs of varying length. The screen is an in vitro evolutionary strategy that begins with these genomically encoded RNAs, which are subjected to rounds of selection for an activity of interest, usually binding a protein or small molecule. High-throughput sequencing can provide a more complete picture of the landscape of the resulting "genomic aptamers", binding elements enriched with genomic SELEX. Here, we first describe how high-throughput analysis can be used to monitor the changes in sequences in the presence and absence of a selection step. We propose the "neutral SELEX" control experiment as a means of detecting false positive rates and biases that can be treated as a background signal. We proceed to show an example of positive selection of genomic aptamers binding the pleiotropic protein Hfq, and how genomic analysis and interactive databases uncovered a trend that Hfq genomic aptamers accumulate antisense to intergenic regions in polycistronic genes. Finally, we present an algorithm for detecting minimal motifs required of genomic aptamers, and provide biochemical evidence that the predicted RNAs are sufficient for binding.
This dissertation contributes several advances in the methods used during genomic SELEX: the initial computational analysis and the experimental design for target confirmation. It also provides evidence that genomic SELEX is a powerful tool for the discovery of novel non-coding RNAs.

    Cloning of the T4 polynucleotide kinase gene and amplification of its product


    Cloning, sequencing and expression of glucose dehydrogenase from Thermoplasma acidophilum

    SIGLE. Available from British Library Document Supply Centre (BLDSC), DSC:DX95643. GB, United Kingdom.

    Regulation of transcription factor binding specificity: from binding motifs to local DNA context

    Regulation of transcription factor (TF) binding specificity lies at the heart of transcriptional control, which governs how cells divide, differentiate, and respond to their environments. TFs are known to bind to DNA in a sequence-specific manner, and each such short sequence is known as a transcription factor binding site (TFBS). However, regions bound by a TF in vivo do not always contain a TFBS, and regulatory regions often contain many non-functional TFBSs that have binding potential but remain unbound by a given TF. This dissertation focuses on understanding the principles of TF binding specificity and is divided into two chapters: 1) developing a novel high-throughput method to facilitate the study of TF binding regulation and the resulting functional output; 2) analyzing the roles of local DNA context around TFBSs in specifying TF localization. In the first chapter of this dissertation, we report a tool, Calling Cards Reporter Arrays (CCRA), that measures TF binding and its consequences for gene expression at hundreds of synthetic promoters in yeast. Using Cbf1p and MAX, we demonstrate that the CCRA method is able to detect small changes in binding free energy with a sensitivity comparable to in vitro methods, enabling the measurement of energy landscapes in vivo. We then demonstrate the quantitative analysis of cooperative interactions by measuring Cbf1p binding at synthetic promoters with multiple sites. We find that the cooperativity between Cbf1p dimers varies sinusoidally with a period of 10.65 bp and an energetic cost of 1.37 kBT for sites that are positioned "out of phase". Finally, we characterize the binding and expression of a group of TFs, Tye7p, Gcr1p, and Gcr2p, that act together as a "TF collective", an important but poorly characterized model of TF cooperativity.
We demonstrate that Tye7p often binds promoters without its recognition site because it is recruited by other collective members, whereas these other members require their recognition sites, suggesting a hierarchy in which these factors recruit Tye7p but not vice versa. Our experiments establish CCRA as a useful tool for quantitative investigations into TF binding and function. In the second chapter of this dissertation, we investigate whether predictive information is embedded in the local DNA context (LDC) for a large collection of TFs in Saccharomyces cerevisiae. We identify a general preference for TFs to bind at CG-rich sequences; we then analyze whether this preference is linked to intrinsic nucleosome binding preference and find that the CG preference in LDC for TF binding is independent of nucleosome regulation. We next examine the possible mechanisms by which LDC influences TF binding-site selection: recruiting "licensing" factors or kinetically assisting the TF search for a target site. We show that high-CG LDC is preferred by TFs in vitro, which suggests that this preference involves only TFs and DNA and points to a TF search kinetics mechanism. The CG-rich feature in LDC may act as an energetic funnel that helps a TF recognize a target binding site, and we verify the theoretical validity of this hypothesis with Gillespie simulation. Finally, we reveal that the CG preference is also present in a large group of human TFs, indicating that the use of LDC is a general mechanism for TF binding specificity.
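The sinusoidal spacing dependence reported in this abstract can be illustrated with a short sketch. Only the period (10.65 bp) and the out-of-phase cost (1.37 in units of kBT) come from the abstract; the cosine form, the `in_phase_spacing_bp` reference spacing, and the function names are assumptions, since the abstract gives neither the fitted equation nor its phase.

```python
import math

PERIOD_BP = 10.65     # period reported in the abstract
MAX_COST_KBT = 1.37   # out-of-phase cost reported in the abstract, in kBT


def phase_penalty(spacing_bp, in_phase_spacing_bp=0.0):
    """Assumed cosine penalty in kBT.

    Zero when the two sites are in phase and MAX_COST_KBT half a period
    away. in_phase_spacing_bp is a hypothetical reference spacing.
    """
    phase = 2 * math.pi * (spacing_bp - in_phase_spacing_bp) / PERIOD_BP
    return 0.5 * MAX_COST_KBT * (1 - math.cos(phase))


def boltzmann_weight(spacing_bp, in_phase_spacing_bp=0.0):
    """Relative statistical weight exp(-penalty) of the doubly bound state."""
    return math.exp(-phase_penalty(spacing_bp, in_phase_spacing_bp))
```

Under this assumed form, moving one site by half a period lowers the weight of the cooperative state by a factor of exp(-1.37), roughly a quarter of its in-phase value.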

    Genomic data mining for the computational prediction of small non-coding RNA genes

    The objective of this research is to develop a novel computational prediction algorithm for non-coding RNA (ncRNA) genes using features computable for any genomic sequence without the need for comparative analysis. Existing comparative methods require knowledge of closely related organisms in order to search for sequence and structural similarities. This approach imposes constraints on the type of ncRNAs, the organism, and the regions where the ncRNAs can be found. We have developed a novel approach for ncRNA gene prediction without the limitations of current comparative methods. Our work has established an ncRNA database required for subsequent feature and genomic analysis. Furthermore, we have identified significant features from folding-, structural-, and ensemble-based statistics for use in ncRNA prediction. We have also examined higher-order gene structures, namely operons, to discover potential insights into how ncRNAs are transcribed. The ability to identify ncRNAs automatically on a genome-wide scale is immensely powerful, because it can be incorporated into a pipeline for large-scale genome annotation. This work will contribute to a more comprehensive annotation of ncRNA genes in microbial genomes to meet the demands of functional and regulatory genomic studies. Ph.D. Committee Chair: Dr. G. Tong Zhou; Committee Member: Dr. Arthur Koblasz; Committee Member: Dr. Eberhard Voit; Committee Member: Dr. Xiaoli Ma; Committee Member: Dr. Ying X

    Crystallographic studies on molybdopterin-dependent enzymes

    Dissertation submitted for the degree of Doctor in Biochemistry, specialty Structural Biochemistry. This thesis reports the determination of the crystal structures of two molybdenum-dependent enzymes, as well as their functional interpretation. Chapter 1 gives a general introduction to the use of molybdenum in biological systems, particularly its incorporation into the active site of several enzymes. The same chapter also includes an overview of X-ray protein crystallography, briefly describing its basic principles. Aldehyde oxidases are homodimeric proteins belonging to the xanthine oxidase (XO) family of molybdenum-containing enzymes. The three-dimensional structure of mouse aldehyde oxidase homologue 1 (mAOH1) is reported and described here (Chapter 2). This constitutes the first crystal structure ever obtained for an aldehyde oxidase. The mAOH1 protein was extracted from rat liver and heterologously expressed in E. coli. The recombinant protein made it possible to determine suitable crystallization conditions, which were reproduced using the native enzyme from mouse liver. Suitable crystals were obtained, allowing the protein structure to be solved at 2.9 Å resolution, using bovine milk xanthine oxidase as a search model. Both proteins belong to the XO family of Mo proteins and are very similar, although they catalyze different reactions. The structure of mAOH1 and its comparison with the XO structure allowed important structure-function correlations to be drawn and the different enzyme specificities to be explained. These studies have also contributed to a better understanding of the role of aldehyde oxidase in human health. The enzyme has received considerable attention from several pharmaceutical companies, as it is involved in the detoxification of several drugs and xenobiotics, assuming particular relevance in human health and drug design studies.
Periplasmic nitrate reductase from the bacterium Cupriavidus necator (Cn NapAB) is a heterodimeric protein and belongs to the DMSO reductase family of molybdenum-containing enzymes. The three-dimensional structure of C. necator NapAB was solved at 1.5 Å resolution using crystals obtained from a crystallization robot. Structural, spectroscopic and functional studies of this protein are reported in Chapter 3. The high resolution of the model allowed the true nature of all Mo ligands to be identified. In the first reported nitrate reductase crystal structure (NapA from Desulfovibrio desulfuricans), the 6th Mo ligand had been identified as an oxygen, but in Cn NapAB a sulfur atom could be unambiguously assigned to this position. It is believed that this is a general feature of all nitrate reductases, which has led to the necessary revisions of the reaction mechanism for this family of enzymes. To further characterize C. necator NapAB, spectroscopic and electrochemical studies have also been performed, and have shown unexpected features, particularly regarding the potential of the two c-type hemes. Funding: Fundação para a Ciência e Tecnologia, individual doctoral fellowship SFRH/BD/37948/2007; Programa POCI2010, projects POCI/QUI/57641/2004 and PTDC/QUI/64733/2006; Fundo Europeu de Desenvolvimento Regional.