Accurate prediction of gene expression by integration of DNA sequence statistics with detailed modeling of transcription regulation
Gene regulation involves a hierarchy of events that extend from specific
protein-DNA interactions to the combinatorial assembly of nucleoprotein
complexes. The effects of DNA sequence on these processes have typically been
studied based either on its quantitative connection with single-domain binding
free energies or on empirical rules that combine different DNA motifs to
predict gene expression trends on a genomic scale. The middle-point approach
that quantitatively bridges these two extremes, however, remains largely
unexplored. Here, we provide an integrated approach to accurately predict gene
expression from statistical sequence information in combination with detailed
biophysical modeling of transcription regulation by multidomain binding on
multiple DNA sites. For the regulation of the prototypical lac operon, this
approach predicts within 0.3-fold accuracy transcriptional activity over a
10,000-fold range from DNA sequence statistics for different intracellular
conditions. Comment: 15 pages, 5 figures
Inferring a Transcriptional Regulatory Network from Gene Expression Data Using Nonlinear Manifold Embedding
Transcriptional networks consist of multiple regulatory layers corresponding to the activity of global regulators, specialized repressors and activators of transcription, as well as proteins and enzymes shaping the DNA template. Such intrinsic multi-dimensionality makes uncovering connectivity patterns difficult and unreliable, and it calls for methodologies commensurate with the underlying organization of the data source. Here we present a new computational method that predicts interactions between transcription factors and target genes using a compendium of microarray gene expression data and known interactions between genes and transcription factors. The proposed method, called Kernel Embedding of REgulatory Networks (KEREN), is based on the concept of gene-regulon association and captures hidden geometric patterns of the network via manifold embedding. We applied KEREN to reconstruct gene regulatory interactions in the model bacterium E. coli on a genome-wide scale. Our method not only yields accurate predictions of verifiable interactions, outperforming comparable methodologies on certain metrics, but also demonstrates the utility of a geometric approach to the analysis of high-dimensional biological data. We also describe the general application of kernel embedding techniques to some other function and network discovery algorithms.
Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks
Predictive understanding of the myriads of signal transduction pathways in a
cell is an outstanding challenge of systems biology. Such pathways are
primarily mediated by specific but transient protein-protein interactions,
which are difficult to study experimentally. In this study, we dissect the
specificity of protein-protein interactions governing two-component signaling
(TCS) systems ubiquitously used in bacteria. Exploiting the large number of
sequenced bacterial genomes and an operon structure which packages many pairs
of interacting TCS proteins together, we developed a computational approach to
extract a molecular interaction code capturing the preferences of a small but
critical number of directly interacting residue pairs. This code is found to
reflect physical interaction mechanisms, with the strongest signal coming from
charged amino acids. It is used to predict the specificity of TCS interaction:
Our results compare favorably to most available experimental results, including
the prediction of 7 (out of 8 known) interaction partners of orphan signaling
proteins in Caulobacter crescentus. Surveying among the available bacterial
genomes, our results suggest that 15-25% of the TCS proteins could participate in
out-of-operon "crosstalks". Additionally, we predict clusters of crosstalking
candidates, expanding from the anecdotally known examples in model organisms.
The tools and results presented here can be used to guide experimental studies
towards a system-level understanding of two-component signaling. Comment: Supplementary information available on
http://www.plosone.org/article/info:doi/10.1371/journal.pone.001972
Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the trp operon
Interaction between proteins is a fundamental mechanism that underlies
virtually all biological processes. Many important interactions are conserved
across a large variety of species. The need to maintain interaction leads to a
high degree of co-evolution between residues in the interface between partner
proteins. The inference of protein-protein interaction networks from the
rapidly growing sequence databases is one of the most formidable tasks in
systems biology today. We propose here a novel approach based on the
Direct-Coupling Analysis of the co-evolution between inter-protein residue
pairs. We use ribosomal and trp operon proteins as test cases. For the small
and large ribosomal subunits, our approach predicts protein-interaction
partners at true-positive rates of 70% and 90%, respectively, within the first
10 predictions, with areas of 0.69 and 0.81 under the ROC curves for all
predictions. In the trp operon, it assigns the two largest interaction scores
to the only two interactions experimentally known. On the level of residue
interactions we show that for both the small and the large ribosomal subunit
our approach predicts interacting residues in the system with true-positive
rates of 60% and 85%, respectively, in the first 20 predictions. We use artificial data to show
that the performance of our approach depends crucially on the size of the joint
multiple sequence alignments and analyze how many sequences would be necessary
for a perfect prediction if the sequences were sampled from the same model that
we use for prediction. Given the performance of our approach on the test data
we speculate that it can be used to detect new interactions, especially in the
light of the rapid growth of available sequence data.
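The two evaluation measures quoted in this abstract (the fraction of true interactions among the top-ranked predictions, and the area under the ROC curve) can be illustrated with a minimal sketch. The scores and labels below are made-up toy values, not data from the paper, and the function names are hypothetical; the AUC is computed with the standard pairwise-ranking formulation rather than any procedure confirmed by the authors.

```python
def top_k_true_positive_rate(scores, labels, k):
    """Fraction of the k highest-scoring predictions that are true interactions."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen true interaction scores higher than a randomly chosen
    non-interaction (ties count as one half)."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (assumed values): interaction scores for six candidate
# protein pairs, with 1 marking a known interaction.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]

top3_rate = top_k_true_positive_rate(scores, labels, k=3)
auc = roc_auc(scores, labels)
print(f"true-positive rate in top 3: {top3_rate:.3f}")
print(f"ROC AUC: {auc:.3f}")
```

The pairwise formulation is quadratic in the number of predictions, which is acceptable for a few dozen candidate protein pairs; a rank-based computation would be preferable for larger sets.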
Discriminative Topological Features Reveal Biological Network Mechanisms
Recent genomic and bioinformatic advances have motivated the development of
numerous random network models purporting to describe graphs of biological,
technological, and sociological origin. The success of a model has been
evaluated by how well it reproduces a few key features of the real-world data,
such as degree distributions, mean geodesic lengths, and clustering
coefficients. Often pairs of models can reproduce these features with
indistinguishable fidelity despite being generated by vastly different
mechanisms. In such cases, these few target features are insufficient to
distinguish which of the different models best describes real world networks of
interest; moreover, it is not clear a priori that any of the presently-existing
algorithms for network generation offers a predictive description of the
networks inspiring them. To derive discriminative classifiers, we construct a
mapping from the set of all graphs to a high-dimensional (in principle
infinite-dimensional) "word space." This map defines an input space for
classification schemes which allow us for the first time to state unambiguously
which models are most descriptive of the networks they purport to describe. Our
training sets include networks generated from 17 models either drawn from the
literature or introduced in this work, source code for which is freely
available. We anticipate that this new approach to network analysis will be of
broad impact to a number of communities. Comment: supplemental website:
http://www.columbia.edu/itc/applied/wiggins/netclass
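Two of the conventional summary features named in this abstract, the degree distribution and the clustering coefficient, can be computed as in the following minimal sketch. It illustrates only those standard features, not the authors' "word space" mapping; the toy graph and function names are assumptions made for the example.

```python
from collections import Counter


def degree_distribution(adjacency):
    """Map each degree to the fraction of nodes having that degree."""
    counts = Counter(len(neighbors) for neighbors in adjacency.values())
    n = len(adjacency)
    return {degree: count / n for degree, count in sorted(counts.items())}


def local_clustering(adjacency, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = list(adjacency[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: nodes with fewer than two neighbors score 0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adjacency[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))


def mean_clustering(adjacency):
    """Average of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adjacency, v) for v in adjacency) / len(adjacency)


# Toy undirected graph (assumed): a triangle a-b-c with a pendant node d on c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

dist = degree_distribution(graph)
clustering = mean_clustering(graph)
print(dist)
print(round(clustering, 4))
```

Very different generative models can match summary numbers like these equally well, which is the limitation the abstract points to as motivation for richer feature spaces.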
Operon prediction in Pyrococcus furiosus
Identification of operons in the hyperthermophilic archaeon Pyrococcus furiosus represents an important step toward understanding the regulatory mechanisms that enable the organism to adapt and thrive in extreme environments. We have predicted operons in P. furiosus by combining the results from three existing algorithms using a neural network (NN). These algorithms use intergenic distances, phylogenetic profiles, functional categories and gene-order conservation in their operon prediction. Our method takes as inputs the confidence scores of the three programs, and outputs a prediction of whether adjacent genes on the same strand belong to the same operon. In addition, we have applied Gene Ontology (GO) and KEGG pathway information to improve the accuracy of our algorithm. The parameters of this NN predictor are trained on a subset of all experimentally verified operon gene pairs of Bacillus subtilis. It subsequently achieved 86.5% prediction accuracy when applied to a subset of gene pairs for Escherichia coli, which is substantially better than any of the three prediction programs. Using this new algorithm, we predicted 470 operons in the P. furiosus genome. Of these, 349 were validated using DNA microarray data.
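The idea of feeding the confidence scores of three predictors into a learned combiner can be sketched as follows. This is a simplified stand-in, not the authors' network: it uses a single logistic unit trained by batch gradient descent, and the training scores, labels, learning rate, and function names are all made-up assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Probability that an adjacent gene pair belongs to the same operon."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, targets, learning_rate=0.5, epochs=2000):
    """Fit a single logistic unit with batch gradient descent on log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, target in zip(samples, targets):
            error = predict(weights, bias, features) - target
            for j, x in enumerate(features):
                grad_w[j] += error * x / n
            grad_b += error / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Hypothetical confidence scores from three operon predictors for six
# adjacent gene pairs; 1 means the pair is known to share an operon.
train_scores = [
    (0.9, 0.8, 0.7), (0.8, 0.9, 0.6), (0.7, 0.6, 0.9),
    (0.2, 0.1, 0.3), (0.1, 0.3, 0.2), (0.3, 0.2, 0.1),
]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(train_scores, train_labels)
new_pair_probability = predict(weights, bias, (0.85, 0.75, 0.8))
print(f"same-operon probability for a new pair: {new_pair_probability:.3f}")
```

A network with hidden units, as described in the abstract, could additionally capture interactions between the three scores; the single unit here only learns a weighted vote.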
A Method for Improving the Accuracy and Efficiency of Bacteriophage Genome Annotation
Bacteriophages are the most numerous entities on Earth. The number of sequenced phage genomes is approximately 8000 and increasing rapidly. Sequencing of a genome is followed by annotation, where genes, start codons, and functions are putatively identified. The mainstays of phage genome annotation are auto-annotation programs such as Glimmer and GeneMark. Due to the relatively small size of phage genomes, many groups choose to manually curate auto-annotation results to increase accuracy. An additional benefit of manual curation of auto-annotated phage genomes is that the process can be performed by students, and has been shown to improve student recruitment to the sciences. However, despite its greater accuracy and pedagogical value, manual curation suffers from high labor cost, lack of standardization, a degree of subjectivity in decision-making, and susceptibility to mistakes. Here, we present a method developed in our lab that is designed to produce accurate annotations while reducing subjectivity and providing a degree of standardization in decision-making. We show that our method produces genome annotations more accurate than auto-annotation programs while retaining the pedagogical benefits of manual genome curation.
Global Functional Atlas of Escherichia coli Encompassing Previously Uncharacterized Proteins
One-third of the 4,225 protein-coding genes of Escherichia coli K-12 remain functionally unannotated (orphans). Many map to distant clades such as Archaea, suggesting involvement in basic prokaryotic traits, whereas others appear restricted to E. coli, including pathogenic strains. To elucidate the orphans’ biological roles, we performed an extensive proteomic survey using affinity-tagged E. coli strains and generated comprehensive genomic context inferences to derive a high-confidence compendium for virtually the entire proteome consisting of 5,993 putative physical interactions and 74,776 putative functional associations, most of which are novel. Clustering of the respective probabilistic networks revealed putative orphan membership in discrete multiprotein complexes and functional modules together with annotated gene products, whereas a machine-learning strategy based on network integration implicated the orphans in specific biological processes. We provide additional experimental evidence supporting orphan participation in protein synthesis, amino acid metabolism, biofilm formation, motility, and assembly of the bacterial cell envelope. This resource provides a “systems-wide” functional blueprint of a model microbe, with insights into the biological and evolutionary significance of previously uncharacterized proteins.
Genomic data mining for the computational prediction of small non-coding RNA genes
The objective of this research is to develop a novel computational prediction algorithm for non-coding RNA (ncRNA) genes using features computable for any genomic sequence without the need for comparative analysis. Existing comparative-based methods require the knowledge of closely related organisms in order to search for sequence and structural similarities. This approach imposes constraints on the type of ncRNAs, the organism, and the regions where the ncRNAs can be found. We have developed a novel approach for ncRNA gene prediction without the limitations of current comparative-based methods. Our work has established an ncRNA database required for subsequent feature and genomic analysis. Furthermore, we have identified significant features from folding-, structural-, and ensemble-based statistics for use in ncRNA prediction. We have also examined higher-order gene structures, namely operons, to discover potential insights into how ncRNAs are transcribed. Automatic identification of ncRNAs on a genome-wide scale is immensely powerful when incorporated into a pipeline for large-scale genome annotation. This work will contribute to a more comprehensive annotation of ncRNA genes in microbial genomes to meet the demands of functional and regulatory genomic studies. Ph.D. Committee Chair: Dr. G. Tong Zhou; Committee Member: Dr. Arthur Koblasz; Committee Member: Dr. Eberhard Voit; Committee Member: Dr. Xiaoli Ma; Committee Member: Dr. Ying X