
9 research outputs found

Prediction of solvent accessibility and sites of deleterious mutations from protein sequence

Author: Chen Huiling
Zhou Huan-Xiang
Publication venue: Oxford University Press
Publication date: 01/01/2005

Residues that form the hydrophobic core of a protein are critical for its stability. A number of approaches have been developed to classify residues as buried or exposed. In order to optimize the classification, we have refined a suite of five methods over a large dataset and proposed a metamethod based on an ensemble average of the individual methods, leading to a two-state classification accuracy of 80%. Many studies have suggested that hydrophobic core residues are likely sites of deleterious mutations, so we wanted to see to what extent these sites can be predicted from the putative buried residues. Residues that were most confidently classified as buried were proposed as sites of deleterious mutations. This proposition was tested on six proteins for which sites of deleterious mutations have previously been identified by stability measurement or functional assay. Of the total of 130 residues predicted as sites of deleterious mutations, 104 (or 80%) were correct
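The ensemble-average idea in this abstract can be illustrated with a minimal sketch. It assumes each individual method emits a per-residue burial score in [0, 1]; the scores, the 0.5 threshold, and the function names are hypothetical and are not taken from the paper, which combines five methods rather than the three shown here.

```python
from statistics import mean

def ensemble_average(method_scores):
    """Average per-residue burial scores across methods.

    method_scores: one list per method, each holding a score in [0, 1]
    for every residue (higher = more likely buried).
    """
    return [mean(column) for column in zip(*method_scores)]

def label_residues(scores, threshold=0.5):
    """Two-state classification from averaged scores (threshold is an assumption)."""
    return ["buried" if s >= threshold else "exposed" for s in scores]

# Hypothetical scores from three methods for four residues.
method_scores = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.7, 0.3],
    [0.7, 0.3, 0.8, 0.2],
]

avg = ensemble_average(method_scores)
labels = label_residues(avg)

# Residues ranked from most to least confidently buried; the top of this
# ranking would be the candidate sites of deleterious mutations.
ranked = sorted(range(len(avg)), key=lambda i: avg[i], reverse=True)
```

Ranking by the averaged score mirrors the abstract's step of proposing the most confidently buried residues as candidate sites.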

CiteSeerX

Crossref

PubMed Central

Whole genome sequencing to investigate the emergence of clonal complex 23 Neisseria meningitidis serogroup Y disease in the United States

In the United States, serogroup Y, ST-23 clonal complex Neisseria meningitidis was responsible for an increase in meningococcal disease incidence during the 1990s. This increase was accompanied by antigenic shift of three outer membrane proteins, with a decrease in the population that predominated in the early 1990s as a different population emerged later in that decade. To understand factors that may have been responsible for the emergence of serogroup Y disease, we used whole genome pyrosequencing to investigate genetic differences between isolates from early and late N. meningitidis populations, obtained from meningococcal disease cases in Maryland in the 1990s. The genomes of isolates from the early and late populations were highly similar, with 1231 of 1776 shared genes exhibiting 100% amino acid identity and an average πN = 0.0033 and average πS = 0.0216. However, differences were found in predicted proteins that affect pilin structure and antigen profile and in predicted proteins involved in iron acquisition and uptake. The observed changes are consistent with acquisition of new alleles through horizontal gene transfer. Changes in antigen profile due to the genetic differences found in this study likely allowed the late population to emerge due to escape from population immunity. These findings may predict which antigenic factors are important in the cyclic epidemiology of meningococcal disease

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

D-Scholarship@Pitt

Knowledge discovery in biological databases : a neural network approach

Author: Ma Qicheng
Publication venue: Digital Commons @ NJIT
Publication date: 31/08/2000

Knowledge discovery in databases, also known as data mining, aims to find significant information in a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. coli promoters in DNA sequences. Our proposed methods, based on neural network architectures, combine techniques ranging from Bayesian inference, coding theory, feature selection, and dimensionality reduction to dynamic programming and machine learning algorithms. Empirical studies show that the proposed methods outperform previously published methods and have excellent performance on the latest dataset. We have implemented the proposed algorithms in an infrastructure, called Genome Mining, developed for biosequence classification and recognition

Digital Commons @ New Jersey Institute of Technology (NJIT)

Protein family classification using multiple-class neural networks.

Author: Zhang Xi
Publication venue: University of Windsor Leddy Library
Publication date: 01/01/2004

The objective of genomic sequence analysis is to retrieve important information from the vast amount of genomic sequence data, such as DNA, RNA and protein sequences. The main tasks include the interpretation of the function of DNA sequences on a genomic scale, the comparison of genomes to gain insight into the universality of biological mechanisms and into the details of gene structure and function, the determination of the structure of all proteins, and protein family classification. With its many features and capabilities for recognition, generalization and classification, artificial neural network technology is well suited for sequence analysis. In the current state of the art, many methods have been devised to determine whether a given protein sequence is a member of a given protein superfamily. This is a binary classification problem, and efficient neural network techniques are mentioned in the literature for solving such problems. In this Master's thesis, we consider the problem of classifying given protein sequences into one of at least three protein families using neural networks, and propose two methods for this problem: the Pair-wise Multiple Classification Approach and the Single Network Approach. In the Pair-wise Multiple Classification Approach, several sub-networks are employed to perform the task, whereas a compact network system is used in the Single Network Approach. We performed experiments using the SNNS and UOWNNS neural network simulators on our networks with different input/output representations, and reported accuracies as high as 95%. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .Z54. Source: Masters Abstracts International, Volume: 43-01, page: 0248. Adviser: Alioune Ngom. Thesis (M.Sc.)--University of Windsor (Canada), 2004
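One common way to build a multi-class decision out of pairwise sub-classifiers is majority voting over all pairs of classes. The sketch below illustrates that general scheme only; whether the thesis combines its sub-networks this way is an assumption, and the toy nearest-center classifiers, family names, and feature values are hypothetical stand-ins for trained neural networks.

```python
from collections import Counter
from itertools import combinations

def make_toy_classifier(a, b, centers):
    """Hypothetical stand-in for a trained binary sub-network: picks the
    family whose center is closer to the scalar feature x."""
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b

def pairwise_vote(x, families, binary_classifiers):
    """Run every pairwise classifier and return the family with the most votes."""
    votes = Counter()
    for a, b in combinations(families, 2):
        votes[binary_classifiers[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]

# Three hypothetical protein families, each summarized by one feature value.
families = ["globin", "kinase", "protease"]
centers = {"globin": 0.0, "kinase": 5.0, "protease": 10.0}
binary_classifiers = {
    (a, b): make_toy_classifier(a, b, centers)
    for a, b in combinations(families, 2)
}

predicted = pairwise_vote(4.0, families, binary_classifiers)
```

With K families this scheme needs K(K-1)/2 sub-classifiers, which is the trade-off against a single compact network with K outputs.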

Scholarship at UWindsor

Development of gene-finding algorithms for fungal genomes : dealing with small datasets and leveraging comparative genomics

Author: Lazarovici Allan, 1979-
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2003

Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (leaves 60-62). A computer program called FUNSCAN was developed which identifies protein coding regions in fungal genomes. Gene structural and compositional properties are modeled using a Hidden Markov Model. Separate training and testing sets for FUNSCAN were obtained by aligning cDNAs from an organism to their genomic loci, generating a 'gold standard' set of annotated genes. The performance of FUNSCAN is competitive with other computer programs designed to identify protein coding regions in fungal genomes. A technique called 'Training Set Augmentation' is described which can be used to train FUNSCAN when only a small training set of genes is available. Techniques that combine alignment algorithms with FUNSCAN to identify novel genes are also discussed and explored. By Allan Lazarovici. M.Eng. and S.B.

DSpace@MIT

Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

Author: Lakin Steven M.
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2021

2021 Spring. Includes bibliographical references. Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. This includes efficient feature encoding, introduction of a log-odds ratio for comparison of naive Bayes posterior estimates, description of schema for parallel and distributed naive Bayes architectures, and use of machine learning classifiers to perform outgroup sequence classification. Finally in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation.
Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification
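As a rough illustration of naive Bayes sequence classification with a log-odds comparison, the sketch below trains a multinomial model over k-mer counts with additive smoothing and reports the log-odds between the two best-scoring labels. It is not WarpNL's implementation: the k-mer size, the smoothing constant, the uniform class priors, the function names, and the toy reference sequences are all assumptions made for the example.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(references, k=3, alpha=1.0):
    """references: dict mapping label -> list of reference sequences.

    Returns label -> (log-probabilities of seen k-mers, log-probability
    of an unseen k-mer), using additive smoothing over the 4**k DNA k-mers.
    """
    vocab_size = 4 ** k
    model = {}
    for label, seqs in references.items():
        counts = Counter(km for s in seqs for km in kmers(s, k))
        total = sum(counts.values()) + alpha * vocab_size
        logp = {km: math.log((c + alpha) / total) for km, c in counts.items()}
        model[label] = (logp, math.log(alpha / total))
    return model

def classify_sequence(model, query, k=3):
    """Return the best label and its log-odds against the runner-up.

    Uniform class priors are assumed, so log-likelihoods are compared directly.
    """
    scores = {
        label: sum(logp.get(km, unseen) for km in kmers(query, k))
        for label, (logp, unseen) in model.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], scores[ranked[0]] - scores[ranked[1]]

# Hypothetical toy reference database with two labels.
references = {
    "taxonA": ["ACGTACGTACGT"],
    "taxonB": ["GGGCCCGGGCCC"],
}
model = train(references)
label, log_odds = classify_sequence(model, "ACGTAC")
```

A log-odds near zero signals that the two top labels are hard to tell apart, which is one reason to report it alongside the predicted label.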

Mountain Scholar (Digital Collections of Colorado and Wyoming)


Methods for cost-sensitive learning

Author: Margineantu Dragos D. (Dragos Dorin)
Publication venue: Oregon State University

Many approaches for achieving intelligent behavior of automated (computer) systems involve components that learn from past experience. This dissertation studies computational methods for learning from examples, for classification and for decision making, when the decisions have different non-zero costs associated with them. Many practical applications of learning algorithms, including transaction monitoring, fraud detection, intrusion detection, and medical diagnosis, have such non-uniform costs, and there is a great need for new methods that can handle them. This dissertation discusses two approaches to cost-sensitive classification: input data weighting and conditional density estimation. The first method assigns a weight to each training example in order to force the learning algorithm (which is otherwise unchanged) to pay more attention to examples with higher misclassification costs. The dissertation discusses several different weighting methods and concludes that a method that gives higher weight to examples from rarer classes works quite well. Another algorithm that gave good results was a wrapper method that applies Powell's gradient-free algorithm to optimize the input weights. The second approach to cost-sensitive classification is conditional density estimation. In this approach, the output of the learning algorithm is a classifier that estimates, for a new data point, the probability that it belongs to each of the classes. These probability estimates can be combined with a cost matrix to make decisions that minimize the expected cost. The dissertation presents a new algorithm, bagged lazy option trees (B-LOTs), that gives better probability estimates than any previous method based on decision trees. In order to evaluate cost-sensitive classification methods, appropriate statistical methods are needed. 
The dissertation presents two new statistical procedures: BCOST provides a confidence interval on the expected cost of a classifier, and BDELTACOST provides a confidence interval on the difference in expected costs of two classifiers. These methods are applied to a large set of experimental studies to evaluate and compare the cost-sensitive methods presented in this dissertation. Finally, the dissertation describes the application of B-LOTs to a problem of predicting the stability of river channels. In this study, B-LOTs were shown to be superior to other methods in cases where the classes have very different frequencies, a situation that arises frequently in cost-sensitive classification problems
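The step of combining class probability estimates with a cost matrix can be sketched as follows. The cost values, the class meanings, and the function name are hypothetical; the sketch only shows the standard minimum-expected-cost rule, not the B-LOTs algorithm that produces the probabilities.

```python
def min_expected_cost_decision(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs[k]   : estimated probability that the true class is k
    cost[i][k] : cost of predicting class i when the true class is k
    Returns (index of the chosen prediction, list of expected costs).
    """
    expected = [
        sum(c_ik * p_k for c_ik, p_k in zip(row, probs))
        for row in cost
    ]
    best = min(range(len(expected)), key=expected.__getitem__)
    return best, expected

# Hypothetical fraud-detection setting: class 0 = legitimate, class 1 = fraud.
cost = [
    [0.0, 10.0],  # predict legitimate: costs 10 if the case is actually fraud
    [1.0, 0.0],   # predict fraud: costs 1 if the case is actually legitimate
]
probs = [0.8, 0.2]  # estimated class probabilities for one new case

decision, expected = min_expected_cost_decision(probs, cost)
```

Here the less probable class is chosen because missing a fraud case is assumed to be ten times costlier than a false alarm, which is why accurate probability estimates matter more than accurate labels in this setting.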

ScholarsArchive@OSU

A Decision Tree System for Finding Genes in DNA

Author: Arthur L. Delcher
John Henderson
Kenneth H. Fasman
Steven Salzberg

MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with..

CiteSeerX