275 research outputs found

    A novel method for accurate operon predictions in all sequenced prokaryotes

    We combine comparative genomic measures and the distance separating adjacent genes to predict operons in 124 completely sequenced prokaryotic genomes. Our method automatically tailors itself to each genome using sequence information alone, and thus can be applied to any prokaryote. For Escherichia coli K12 and Bacillus subtilis, our method is 85 and 83% accurate, respectively, which is similar to the accuracy of methods that use the same features but are trained on experimentally characterized transcripts. In Halobacterium NRC-1 and in Helicobacter pylori, our method correctly infers that genes in operons are separated by shorter distances than they are in E. coli, and its predictions using distance alone are more accurate than distance-only predictions trained on a database of E. coli transcripts. We use microarray data from six phylogenetically diverse prokaryotes to show that combining intergenic distance with comparative genomic measures further improves accuracy and that our method is broadly effective. Finally, we survey operon structure across 124 genomes, and find several surprises: H. pylori has many operons, contrary to previous reports; Bacillus anthracis has an unusual number of pseudogenes within conserved operons; and Synechocystis PCC 6803 has many operons even though it has unusually wide spacings between conserved adjacent genes.

    Local feature based pattern classification: from principle to application

    This thesis demonstrates that local feature based approaches are more stable than global feature based approaches for pattern classification problems. Guided by the original theory that a regional matching approach is more robust than a global matching approach for two-dimensional pattern classification, this thesis examines the applications of the theory in one-dimensional and two-dimensional pattern classification. We propose two local feature based approaches for two significant applications of pattern classification, namely start codon prediction and content based image classification. For start codon prediction, which is considered a typical one-dimensional pattern classification problem, we have developed a districted neural network that can be taken as a regional voting version of the conventional neural network. Experiments have been performed on the well-known translation initiation sites (TIS) data sets, and the results show a significant improvement in prediction accuracy. For two-dimensional pattern classification, we propose a differential latent semantic index (DLSI) approach for content based image classification. The feasibility of using local features in the DLSI method is also investigated, and an extensive experimental study on a real image database has proved its effectiveness. The original print copy of this thesis may be available here: http://wizard.unbc.ca/record=b130288

    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    Mitochondial parts, pathways, and pathogenesis

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2009. Cataloged from PDF version of thesis. Includes bibliographical references. In title on title page, the word "Mitochondrial" is spelled "Mitochondial." Mitochondria are cellular compartments that perform essential roles in energy metabolism, ion homeostasis, and apoptosis. Mitochondrial dysfunction causes disease in 1 in 5,000 live births and also has been associated with aging, neurodegeneration, cancer, and diabetes. To systematically explore the function of mitochondria in health and in disease, it is necessary to identify all of the proteins resident in this organelle and to understand how they integrate into pathways. However, traditional molecular and biochemistry methods have identified only half of the estimated 1200 mitochondrial proteins, including the 13 encoded by the tiny mitochondrial genome. Now, newly available genomic technologies make it possible to identify the remainder and explore their roles in cellular pathways and disease. Toward this goal, we performed mass spectrometry, GFP tagging, and machine learning on multiple genomic datasets to create a mitochondrial compendium of 1098 genes and their protein expression across 14 mouse tissues. We linked poorly characterized proteins in this inventory to known mitochondrial pathways by virtue of shared evolutionary history. We additionally used our matched mRNA and protein measurements to demonstrate a widespread role of upstream open reading frames (uORFs) in blunting translation of mitochondrial and other cellular proteins. Next we used the mitochondrial protein inventory to identify genes underlying inherited diseases of mitochondrial dysfunction. In collaboration with clinicians, we identified causal mutations in five genes underlying diseases including hepatocerebral mtDNA depletion syndrome, autosomal dominant mitochondrial myopathy, and several forms of inherited complex I deficiency. These discoveries have enabled the development of diagnostic tests now widely available. More broadly, the mitochondrial compendium provides a foundation for systematically exploring the organelle's contribution to both basic cellular biology and human disease. By Sarah E. Calvo. Ph.D.