
UCL Discovery

Open Access Repository of IISc Research Publications

Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors

Author: Abbas Mostafa M.
El-Manzalawy Yasser
Mohie-Eldin Mostafa M.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2015
Field of study

As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model might perform very well in cross-validation experiments while its performance on independent test data is drastically poorer. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data were obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
NPRP grant No. 4-1454-1-233 from the Qatar National Research Fund (a member of Qatar Foundation).
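As a rough illustration of the "data representation" factor and the meta-predictor idea mentioned in this abstract, the sketch below builds a k-mer frequency vector for a DNA sequence and averages two classifier scores. The function names, the choice of k, and the weighted-average combination rule are all assumptions made for illustration; the abstract does not say which features or combination rule the authors used.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies in a fixed (lexicographic) order.

    Hypothetical sequence-based representation; not taken from the paper.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def meta_score(seq_score, struct_score, weight=0.5):
    """Weighted average of two classifier probabilities.

    The averaging rule and the default weight are assumptions.
    """
    return weight * seq_score + (1.0 - weight) * struct_score
```

A feature vector such as `kmer_frequencies(promoter_region, k=3)` could be fed to any classifier. The abstract's main caution still applies: performance should be checked on independent test sequences, not only by cross-validation.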

Qatar University Institutional Repository

Public Library of Science (PLOS)

FigShare

Prediction of Transcriptional Terminators in Bacillus subtilis and Related Species

Author: Chen X Su Z, Dam P, Palenik B, Xu Y, et al
d'Aubenton Carafa Y Brody E, Thermes C
Diana Murray
Kenta Nakai
Michiel J. L. de Hoon
Satoru Miyano
Yuko Makita
Publication venue: Public Library of Science
Publication date: 01/01/2005

In prokaryotes, genes belonging to the same operon are transcribed in a single mRNA molecule. Transcription starts as the RNA polymerase binds to the promoter and continues until it reaches a transcriptional terminator. Some terminators rely on the presence of the Rho protein, whereas others function independently of Rho. Such Rho-independent terminators consist of an inverted repeat followed by a stretch of thymine residues, allowing us to predict their presence directly from the DNA sequence. Unlike in Escherichia coli, the Rho protein is dispensable in Bacillus subtilis, suggesting a limited role for Rho-dependent termination in this organism and possibly in other Firmicutes. We analyzed 463 experimentally known terminating sequences in B. subtilis and derived a decision rule to distinguish Rho-independent transcriptional terminators from non-terminating sequences. The decision rule allowed us to find the boundaries of operons in B. subtilis with a sensitivity and specificity of about 94%. Using the same decision rule, we found an average sensitivity of 94% for 57 bacteria belonging to the Firmicutes phylum, and a considerably lower sensitivity for other bacteria. Our analysis shows that Rho-independent termination is dominant for Firmicutes in general, and that the properties of the transcriptional terminators are conserved. Terminator prediction can be used to reliably predict the operon structure in these organisms, even in the absence of experimentally known operons. Genome-wide predictions of Rho-independent terminators for the 57 Firmicutes are available in the Supporting Information section.
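The structural description in this abstract (an inverted repeat followed by a stretch of thymine residues) can be sketched as a naive sequence scan. This is not the decision rule derived in the paper. The stem length, loop range, tail window and thymine threshold below are hypothetical parameters chosen only to make the idea concrete, and the stem must match perfectly.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminator_candidates(seq, stem_len=5, min_loop=3, max_loop=8,
                               t_window=8, min_t=5):
    """Return (stem_start, stem_end, loop_len) for perfect hairpins
    that are followed by a T-rich tail.

    All thresholds are illustrative assumptions, not published values.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * stem_len - min_loop + 1):
        left = seq[start:start + stem_len]
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            right_start = start + stem_len + loop
            right_end = right_start + stem_len
            if right_end > len(seq):
                break
            if seq[right_start:right_end] != target:
                continue
            # Count thymines in the window right after the hairpin.
            tail = seq[right_end:right_end + t_window]
            if tail.count("T") >= min_t:
                hits.append((start, right_end, loop))
    return hits
```

A realistic predictor would also score imperfect stems and hairpin stability. The scan above only demonstrates why such terminators can be recognized directly from the DNA sequence.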

FigShare

Bacterial Promoter Features Description and Their Application on E. coli in silico Prediction and Recognition Approaches

Author: Scheila de Avila e Silva
Sergio Echeverrigaray
Publication venue: 'IntechOpen'
Publication date: 28/11/2012

IntechOpen

Structural properties of promoters: similarities and differences between prokaryotes and eukaryotes

Author: Bansal Manju
Kanhere Aditi
Publication venue: Oxford University Press
Publication date: 01/01/2005

During the process of transcription, RNA polymerase can exactly locate a promoter sequence in the complex maze of a genome. Several experimental studies and computational analyses have shown that promoter sequences apparently possess some special properties, such as unusual DNA structures and low stability, which make them distinct from the rest of the genome. But most of these studies have been carried out on a particular set of promoter sequences or on promoter sequences from similar organisms. To examine whether the promoters from a wide variety of organisms share these special properties, we have carried out an analysis of sets of promoters from bacteria, vertebrates and plants. These promoters were analyzed with respect to the prediction of three different properties relevant to transcription, namely DNA curvature, bendability and stability. All the promoter sequences are predicted to share certain features, such as stability and bendability profiles, but there are significant differences in DNA curvature profiles and nucleotide composition between the different organisms. These similarities and differences are correlated with some of the known facts about the transcription process in the promoters from the three groups of organisms.

CiteSeerX

University of Birmingham Research Portal

Open Access Repository of IISc Research Publications

UCL Discovery

A functional genomics catalogue of activated transcription factors during pathogenesis of pneumococcal disease

Author: Adelson David L.
Deihim Tahereh
Ebrahimie Esmaeil
Fruzangohar Mario
Mahdi Layla K.
Ogunniyi Abiodun D.
Paton James C.
Zamansani Fatemeh
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

Background: Association analysis is an alternative to conventional family-based methods to detect the location of gene(s) or quantitative trait loci (QTL) and provides relatively high resolution in terms of defining the genome position of a gene or QTL. Seed protein and oil concentration are quantitative traits which are determined by the interaction among many genes with small to moderate genetic effects and their interaction with the environment. In this study, a genome-wide association study (GWAS) was performed to identify QTL controlling seed protein and oil concentration in 298 soybean germplasm accessions exhibiting a wide range of seed protein and oil content. Results: A total of 55,159 single nucleotide polymorphisms (SNPs) were genotyped using various methods including Illumina Infinium and GoldenGate assays, and 31,954 markers with minor allele frequency >0.10 were used to estimate linkage disequilibrium (LD) in heterochromatic and euchromatic regions. In euchromatic regions, the mean LD (r²) rapidly declined to 0.2 within 360 Kbp, whereas the mean LD declined to 0.2 at 9,600 Kbp in heterochromatic regions. The GWAS results identified 40 SNPs in 17 different genomic regions significantly associated with seed protein. Of these, the five SNPs with the highest associations and seven adjacent SNPs were located in the 27.6-30.0 Mbp region of Gm20. A major seed protein QTL has been previously mapped to the same location and potential candidate genes have recently been identified in this region. The GWAS results also detected 25 SNPs in 13 different genomic regions associated with seed oil. Of these markers, seven SNPs had a significant association with both protein and oil. Conclusions: This research indicated that GWAS not only identified most of the previously reported QTL controlling seed protein and oil, but also resulted in narrower genomic regions than the regions reported as containing these QTL. The narrower GWAS-defined genome regions will allow more precise marker-assisted allele selection and will expedite positional cloning of the causal gene(s).
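The LD statistic r² quoted in this abstract is conventionally computed from allele and haplotype frequencies as r² = D² / (p_A(1 − p_A) p_B(1 − p_B)), with D = p_AB − p_A p_B. The sketch below applies that standard definition to phased two-locus haplotypes. The input format (pairs of 0/1 alleles) and the function name are assumptions, and the abstract does not describe how the authors estimated r² from their genotype data.

```python
def ld_r2(haplotypes):
    """Compute r^2 between two biallelic loci.

    `haplotypes` is a list of (a, b) tuples with alleles coded 0/1;
    phased data is an assumption of this sketch.
    """
    n = len(haplotypes)
    if n == 0:
        return 0.0
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one locus is monomorphic
    return d * d / denom
```

Averaging such pairwise values within distance bins gives the kind of LD decay curve the abstract summarizes (mean r² falling to 0.2 at a given distance).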

Adelaide Research & Scholarship

University of Southern Queensland ePrints

University of Melbourne Institutional Repository

Gene prediction in metagenomic fragments: A large scale machine learning approach

Author: A Lukashin
AL Delcher
BE Suzek
Burkhard Morgenstern
CJ van Rijsbergen
CM Bishop
CS Riesenfeld
D Frishman
DA Benson
DJC MacKay
F Sanger
F Wilcoxon
GW Tyson
H Noguchi
HY Ou
IT Nabney
J Besemer
J Handelsman
JC Venter
K Chen
Katharina J Hoff
KE Rudd
L Krause
M Ronaghi
M Tech
Maike Tech
MS Rappe
P Hugenholtz
P Nielson
Peter Meinicke
R Amann
R Daniel
R Development Core Team
RA Edwards
Rolf Daniel
S Altschul
S Voget
SG Tringe
T Hastie
T Jarvie
Thomas Lingner
V Torsvik
VB Bajic
W Streit
Publication venue: BioMed Central
Publication date: 01/04/2008

Background: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (for example, Sanger sequencing yields fragments of about 700 bp on average) and the unknown phylogenetic origin of most fragments require approaches to gene prediction that are different from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions. Results: We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.
Conclusion: Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
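Two of the second-stage inputs named in this abstract, open reading frame length and fragment GC-content, are simple to compute. The sketch below extracts forward-strand ORF candidates from a fragment and flags those that run off the fragment end as incomplete. It is a simplified assumption-laden illustration (ATG as the only start codon, forward strand only, hypothetical `min_len` default), not the authors' algorithm, and it omits the codon-usage discriminants and the neural network entirely.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in the fragment."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def forward_orfs(seq, min_len=6):
    """Return (start, end, complete) for ORF candidates in the three
    forward reading frames.

    A candidate starts at the first ATG after the previous stop and ends at
    the next in-frame stop codon (complete=True) or at the fragment end
    (complete=False; the end may then include a partial codon).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, True))
                start = None
        if start is not None and len(seq) - start >= min_len:
            orfs.append((start, len(seq), False))
    return orfs
```

For each candidate, `end - start` and `gc_content(fragment)` would be two entries of the feature vector passed to a downstream classifier.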

Promoter prediction and annotation of microbial genomes based on DNA sequence and structural responses to superhelical stress

Author: A Kanhere
AG Pedersen
AL Delcher
AM Huerta
CB Harley
CE Shannon
CJ Benham
Craig J Benham
DK Hawley
ES Shpigelman
GZ Hertz
H Salgado
H Wang
Huiquan Wang
J SantaLucia Jr
JD Helmann
L Kozobay-Avraham
M Rosenberg
MG Reese
ML Opel
R Durbin
RA Johnson
RR Sokal
S Lisser
SD Sheridan
U Ohler
WK Olson
WS Hayes
Y Makita
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: In our previous studies, we found that the sites in prokaryotic genomes which are most susceptible to duplex destabilization under the negative superhelical stresses that occur in vivo are statistically highly significantly associated with intergenic regions that are known or inferred to contain promoters. In this report we investigate how this structural property, either alone or together with other structural and sequence attributes, may be used to search prokaryotic genomes for promoters. RESULTS: We show that the propensity for stress-induced DNA duplex destabilization (SIDD) is closely associated with specific promoter regions. The extent of destabilization in promoter-containing regions is found to be bimodally distributed. When compared with DNA curvature, deformability, thermostability or sequence motif scores within the -10 region, SIDD is found to be the most informative DNA property regarding promoter locations in the E. coli K12 genome. SIDD properties alone perform better at detecting promoter regions than other programs trained on this genome. Because this approach has a very low false positive rate, it can be used to predict with high confidence the subset of promoters that are strongly destabilized. When SIDD properties are combined with -10 motif scores in a linear classification function, they predict promoter regions with better than 80% accuracy. When these methods were tested with promoter and non-promoter sequences from Bacillus subtilis, they achieved similar or higher accuracies. We also present a strictly SIDD-based predictor for annotating promoter sequences in complete microbial genomes. CONCLUSION: In this report we show that the propensity to undergo stress-induced duplex destabilization (SIDD) is a distinctive structural attribute of many prokaryotic promoter sequences. 
We have developed methods to identify promoter sequences in prokaryotic genomes that use SIDD either as a sole predictor or in combination with other DNA structural and sequence properties. Although these methods cannot predict all the promoter-containing regions in a genome, they do find large sets of potential regions that have high probabilities of being true positives. This approach could be especially valuable for annotating those genomes about which there is limited experimental data.
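The abstract mentions combining SIDD properties with -10 motif scores in a linear classification function. A minimal sketch of that general idea follows. The motif score here is a crude match fraction against the well-known σ70 -10 consensus TATAAT, the SIDD value is assumed to come from an external calculation (lower meaning more destabilized), and the weights and bias are hypothetical placeholders, not fitted or published values.

```python
MINUS10_CONSENSUS = "TATAAT"


def minus10_score(region):
    """Best fraction of positions matching TATAAT over all 6-mers.

    A simplification; a position weight matrix would normally be used.
    """
    region = region.upper()
    best = 0
    for i in range(len(region) - 5):
        window = region[i:i + 6]
        matches = sum(1 for a, b in zip(window, MINUS10_CONSENSUS) if a == b)
        best = max(best, matches)
    return best / 6


def linear_promoter_score(sidd_value, motif_score,
                          w_sidd=-1.0, w_motif=1.0, bias=0.0):
    """Linear combination of a SIDD value and a motif score.

    Weights and bias are illustrative assumptions only.
    """
    return w_sidd * sidd_value + w_motif * motif_score + bias


def predict_promoter(sidd_value, region):
    """Call a region a promoter candidate when the linear score is positive."""
    return linear_promoter_score(sidd_value, minus10_score(region)) > 0
```

In practice the weights would be fitted on labeled promoter and non-promoter regions, and evaluated on sequences not used for fitting.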