9 research outputs found

    Prediction of solvent accessibility and sites of deleterious mutations from protein sequence

    Residues that form the hydrophobic core of a protein are critical for its stability. A number of approaches have been developed to classify residues as buried or exposed. In order to optimize the classification, we have refined a suite of five methods over a large dataset and proposed a metamethod based on an ensemble average of the individual methods, leading to a two-state classification accuracy of 80%. Many studies have suggested that hydrophobic core residues are likely sites of deleterious mutations, so we wanted to see to what extent these sites can be predicted from the putative buried residues. Residues that were most confidently classified as buried were proposed as sites of deleterious mutations. This proposition was tested on six proteins for which sites of deleterious mutations have previously been identified by stability measurement or functional assay. Of the total of 130 residues predicted as sites of deleterious mutations, 104 (or 80%) were correct
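The ensemble-average idea in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not name the five methods, their output scale, or any thresholds, so the scores, the 0.5 decision threshold, and the 0.8 confidence cutoff are hypothetical. Only the 104-of-130 figure comes from the abstract.

```python
from statistics import mean

def ensemble_buried(scores_per_method, threshold=0.5):
    """Average per-residue 'buried' scores across methods, then threshold.

    scores_per_method: one list per method, one score in [0, 1] per residue.
    The threshold is a hypothetical choice, not taken from the paper.
    """
    n_residues = len(scores_per_method[0])
    averaged = [mean(m[i] for m in scores_per_method) for i in range(n_residues)]
    labels = ["buried" if s >= threshold else "exposed" for s in averaged]
    return averaged, labels

def most_confident_buried(averaged, cutoff=0.8):
    """Indices of residues whose averaged score clears a (hypothetical) cutoff."""
    return [i for i, s in enumerate(averaged) if s >= cutoff]

# Hypothetical outputs of five methods for a four-residue toy sequence.
toy_scores = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.3, 0.7, 0.2],
    [1.0, 0.1, 0.4, 0.3],
    [0.9, 0.4, 0.6, 0.2],
    [0.9, 0.0, 0.3, 0.2],
]
averaged, labels = ensemble_buried(toy_scores)
candidates = most_confident_buried(averaged)

# Fraction of predicted sites reported as correct in the abstract.
reported_fraction = 104 / 130
```

In this toy case only the first residue is confident enough to be proposed as a candidate site, which mirrors the abstract's choice to use only the most confidently buried residues.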

    Whole genome sequencing to investigate the emergence of clonal complex 23 Neisseria meningitidis serogroup Y disease in the United States

    In the United States, serogroup Y, ST-23 clonal complex Neisseria meningitidis was responsible for an increase in meningococcal disease incidence during the 1990s. This increase was accompanied by antigenic shift of three outer membrane proteins, with a decrease in the population that predominated in the early 1990s as a different population emerged later in that decade. To understand factors that may have been responsible for the emergence of serogroup Y disease, we used whole genome pyrosequencing to investigate genetic differences between isolates from early and late N. meningitidis populations, obtained from meningococcal disease cases in Maryland in the 1990s. The genomes of isolates from the early and late populations were highly similar, with 1231 of 1776 shared genes exhibiting 100% amino acid identity and an average πN = 0.0033 and average πS = 0.0216. However, differences were found in predicted proteins that affect pilin structure and antigen profile and in predicted proteins involved in iron acquisition and uptake. The observed changes are consistent with acquisition of new alleles through horizontal gene transfer. Changes in antigen profile due to the genetic differences found in this study likely allowed the late population to emerge due to escape from population immunity. These findings may predict which antigenic factors are important in the cyclic epidemiology of meningococcal disease

    Knowledge discovery in biological databases : a neural network approach

    Knowledge discovery in databases, also known as data mining, aims to find significant information in a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. coli promoters in DNA sequences. Our proposed methods, based on neural network architectures, combine techniques ranging from Bayesian inference, coding theory, feature selection, and dimensionality reduction to dynamic programming and machine learning algorithms. Empirical studies show that the proposed methods outperform previously published methods and have excellent performance on the latest dataset. We have implemented the proposed algorithms in an infrastructure, called Genome Mining, developed for biosequence classification and recognition

    Protein family classification using multiple-class neural networks.

    The objective of genomic sequence analysis is to retrieve important information from the vast amount of genomic sequence data, such as DNA, RNA and protein sequences. The main tasks include the interpretation of the function of DNA sequence on a genomic scale, the comparison of genomes to gain insight into the universality of biological mechanisms and into the details of gene structure and function, the determination of the structure of all proteins, and protein family classification. With its many features and capabilities for recognition, generalization and classification, artificial neural network technology is well suited for sequence analysis. Many methods have been devised to determine whether a given protein sequence is a member of a given protein superfamily. This is a binary classification problem, and efficient neural network techniques for solving such problems are described in the literature. In this Master's thesis, we consider the problem of classifying given protein sequences into one of at least three protein families using neural networks, and propose two methods for this problem: the Pair-wise Multiple Classification Approach and the Single Network Approach. In the Pair-wise Multiple Classification Approach, several sub-networks are employed to perform the task, whereas a compact network system is used in the Single Network Approach. We performed experiments using the SNNS and UOWNNS neural network simulators on our NNs with different input/output representations, and reported accuracies as high as 95%. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .Z54. Source: Masters Abstracts International, Volume: 43-01, page: 0248. Adviser: Alioune Ngom. Thesis (M.Sc.)--University of Windsor (Canada), 2004
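As a rough illustration of the pairwise idea, the sketch below trains nothing and uses no neural network: each pairwise "sub-network" is replaced by a hypothetical stub that compares two made-up family scores. Combining the pairwise decisions by majority vote is a common choice and is assumed here; the abstract does not say how the thesis combines its sub-networks.

```python
from collections import Counter
from itertools import combinations

def make_stub_classifier(family_a, family_b):
    """Hypothetical stand-in for a trained binary sub-network.

    It picks whichever of the two families has the higher score in `x`,
    where `x` is a dict of made-up per-family scores.
    """
    return lambda x: family_a if x[family_a] >= x[family_b] else family_b

def pairwise_vote(x, families, pair_classifiers):
    """Run every pairwise classifier and return the family with most votes.

    Ties are broken by the order of `families` (an arbitrary choice).
    """
    votes = Counter(pair_classifiers[pair](x) for pair in combinations(families, 2))
    return max(families, key=lambda f: votes[f])

families = ["A", "B", "C"]
pair_classifiers = {
    (a, b): make_stub_classifier(a, b) for a, b in combinations(families, 2)
}

# Made-up scores for one query sequence.
query = {"A": 0.2, "B": 0.7, "C": 0.4}
prediction = pairwise_vote(query, families, pair_classifiers)
```

With three families this needs three pairwise classifiers; in general k families need k(k-1)/2, which is the main cost compared with a single multi-output network.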

    Development of gene-finding algorithms for fungal genomes : dealing with small datasets and leveraging comparative genomics

    Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (leaves 60-62). A computer program called FUNSCAN was developed which identifies protein coding regions in fungal genomes. Gene structural and compositional properties are modeled using a Hidden Markov Model. Separate training and testing sets for FUNSCAN were obtained by aligning cDNAs from an organism to their genomic loci, generating a 'gold standard' set of annotated genes. The performance of FUNSCAN is competitive with other computer programs designed to identify protein coding regions in fungal genomes. A technique called 'Training Set Augmentation' is described which can be used to train FUNSCAN when only a small training set of genes is available. Techniques that combine alignment algorithms with FUNSCAN to identify novel genes are also discussed and explored. By Allan Lazarovici. M.Eng. and S.B.

    Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

    2021 Spring. Includes bibliographical references. Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. This includes efficient feature encoding, introduction of a log-odds ratio for comparison of naive Bayes posterior estimates, description of schema for parallel and distributed naive Bayes architectures, and use of machine learning classifiers to perform outgroup sequence classification. Finally in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation. Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification
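For readers unfamiliar with naive Bayes on sequence data, a minimal sketch follows. It is not WarpNL and makes several assumptions: features are overlapping 3-mers, class priors are uniform, the likelihood is a multinomial with add-one (Laplace) smoothing, and the log-odds value is simply the difference between the two best log scores. The function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Overlapping k-mers of a sequence (the assumed feature encoding)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(labeled_seqs, k=3, alpha=1.0):
    """Count k-mers per label; alpha is the smoothing pseudocount."""
    counts = {}
    for label, seq in labeled_seqs:
        counts.setdefault(label, Counter()).update(kmers(seq, k))
    vocab_size = 4 ** k  # assumes the DNA alphabet A, C, G, T
    return {
        label: (c, sum(c.values()) + alpha * vocab_size)
        for label, c in counts.items()
    }

def log_likelihood(seq, label_model, k=3, alpha=1.0):
    """Smoothed multinomial log-likelihood of the query's k-mers."""
    c, denom = label_model
    return sum(math.log((c[km] + alpha) / denom) for km in kmers(seq, k))

def classify(seq, model, k=3, alpha=1.0):
    """Return the best label and the log-odds against the runner-up.

    With uniform priors the log-likelihood difference equals the
    log posterior odds. Requires at least two labels in the model.
    """
    scores = {lab: log_likelihood(seq, m, k, alpha) for lab, m in model.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], scores[ranked[0]] - scores[ranked[1]]

# Toy reference database with two hypothetical taxa.
reference = [("taxonA", "ACGTACGTACGT"), ("taxonB", "TTTTGGGGTTTTGGGG")]
model = train(reference)
label, log_odds = classify("ACGTAC", model)
```

A log-odds value near zero signals that the top two labels are almost equally supported, which is one way such a ratio can be used to flag low-confidence assignments.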

    A Decision Tree System for Finding Genes in DNA

    MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with..