Improving protein secondary structure prediction using a simple k-mer model
Motivation: Some first-order methods for protein sequence analysis inherently treat each position as independent. We develop a general framework for introducing longer-range interactions. We then demonstrate the power of our approach by applying it to secondary structure prediction. Under the independence assumption, predictions from existing methods can contain features that are not protein-like, an extreme example being a helix of length 1. Our goal was to make the predictions from state-of-the-art methods more realistic, without loss of performance by other measures.
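The "helix of length 1" problem mentioned above can be made concrete with a small check. This is only an illustrative sketch, not the k-mer model from the paper: the function name `short_segments` and the minimum-length thresholds are assumptions chosen for the example.

```python
from itertools import groupby


def short_segments(states, min_len):
    """Return (start, length, state) for runs shorter than a minimum length.

    `states` is a per-residue prediction string such as "CCHHHHCCEEEC".
    `min_len` maps a state letter to its shortest plausible run length;
    the thresholds are illustrative assumptions, not values from the paper.
    """
    found = []
    pos = 0
    for state, run in groupby(states):
        length = len(list(run))
        # States without an entry in min_len are never flagged.
        if length < min_len.get(state, 1):
            found.append((pos, length, state))
        pos += length
    return found


# Hypothetical prediction containing an isolated single-residue helix.
prediction = "CCCHCCCEEEECCHHHHHCC"
flags = short_segments(prediction, {"H": 3, "E": 2})
```

A check like this only detects unrealistic segments; making the predictions more realistic without losing accuracy is the harder problem the abstract addresses.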
VIPR: A probabilistic algorithm for analysis of microbial detection microarrays
Background: All infectious disease oriented clinical diagnostic assays in use today focus on detecting the presence of a single, well defined target agent or a set of agents. In recent years, microarray-based diagnostics have been developed that greatly facilitate the highly parallel detection of multiple microbes that may be present in a given clinical specimen. While several algorithms have been described for interpretation of diagnostic microarrays, none of the existing approaches is capable of incorporating training data generated from positive control samples to improve performance. Results: To specifically address this issue we have developed a novel interpretive algorithm, VIPR (Viral Identification using a PRobabilistic algorithm), which uses Bayesian inference to capitalize on empirical training data to optimize detection sensitivity. To illustrate this approach, we have focused on the detection of viruses that cause hemorrhagic fever (HF) using a custom HF-virus microarray. VIPR was used to analyze 110 empirical microarray hybridizations generated from 33 distinct virus species. An accuracy of 94% was achieved as measured by leave-one-out cross validation. Conclusions: VIPR outperformed previously described algorithms for this dataset. The VIPR algorithm has potential to be broadly applicable to clinical diagnostic settings, wherein positive controls are typically readily available for generation of training data.
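The combination of Bayesian classification from training hybridizations and leave-one-out cross validation can be sketched as follows. The abstract does not specify VIPR's likelihood model, so this example swaps in a Gaussian naive Bayes classifier as an assumption; the function names, the variance floor, and the toy probe intensities are all hypothetical.

```python
import math


def fit(samples):
    """Estimate a class prior and per-probe mean/variance for each label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    model = {}
    for label, rows in by_label.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small floor keeps the variance positive (an assumption).
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-3
            stats.append((mean, var))
        model[label] = (len(rows) / len(samples), stats)
    return model


def predict(model, features):
    """Return the label with the highest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for x, (mean, var) in zip(features, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


def leave_one_out_accuracy(samples):
    """Train on all samples but one, test on the held-out one, and repeat."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        model = fit(samples[:i] + samples[i + 1:])
        correct += predict(model, features) == label
    return correct / len(samples)


# Toy data: (probe intensities, virus label). Not real hybridization data.
toy = [
    ([0.9, 0.1], "virus_A"), ([0.8, 0.2], "virus_A"), ([1.0, 0.0], "virus_A"),
    ([0.1, 0.9], "virus_B"), ([0.2, 0.8], "virus_B"), ([0.0, 1.0], "virus_B"),
]
accuracy = leave_one_out_accuracy(toy)
```

Holding out each hybridization in turn means every accuracy estimate comes from a sample the model never saw during training, which is the role leave-one-out cross validation plays in the reported 94% figure.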
Protein secondary structure prediction using conditional random fields and profiles
Protein secondary structure prediction plays a pivotal role in predicting protein folding in three dimensions. Its task is to assign each residue one of three secondary structure classes: helix, strand, or random coil. This is an instance of the problem of sequential supervised learning in machine learning. This thesis describes a new model, TreeCRFpsi, for addressing this problem. TreeCRFpsi combines recent advances in machine learning with new sequence representations developed in molecular biology. The machine learning method, TreeCRF, constructs a conditional random field (CRF) by fitting a set of regression trees via an algorithm known as gradient tree boosting. The new sequence representation is the PSI-BLAST profile introduced by D. Jones, which is based on matching sequences of known protein structure against a much larger set of sequences drawn from the NCBI non-redundant protein sequence database. A new cross-validation methodology was developed and applied to choose the best parameter values for the model. The chosen parameters were a tree size of 10 leaves, a sliding window size of 15 residues, and 3 rounds of PSI-BLAST searching. The mean three-state prediction accuracy reached 77.6% on both our new SD482 and the popular CB513 datasets. This result is the best among all published results. TreeCRFpsi improved especially on helix and strand predictions, by 1-2.3 percentage points over the previous best methods. SOV99 scores were 74.6% and 73.9% for SD482 and CB513, respectively. In addition, no apparent overfitting was observed in our model. Besides achieving higher accuracy, TreeCRFpsi is the first secondary structure prediction method based on a well-defined probabilistic model, which makes it easier to use the output predictions as inputs to subsequent analysis steps.
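Two ideas in this abstract lend themselves to a short sketch: the 15-residue sliding window over a per-residue profile, and three-state per-residue accuracy. The code below is a minimal illustration and not the TreeCRFpsi implementation; the function names, the zero-padding at the sequence ends, and the 20-column toy profile are assumptions.

```python
def window_features(profile, size=15):
    """Concatenate the profile rows in a window centred on each residue.

    Positions past either end of the sequence are zero-padded (an assumption).
    """
    half = size // 2
    width = len(profile[0])
    pad = [0.0] * width
    padded = [pad] * half + list(profile) + [pad] * half
    return [
        [value for row in padded[i:i + size] for value in row]
        for i in range(len(profile))
    ]


def q3_accuracy(predicted, observed):
    """Fraction of residues whose three-state label (H, E, C) is correct."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy profile: 5 residues, 20 columns each (one column per amino acid).
profile = [[float(i)] * 20 for i in range(1, 6)]
features = window_features(profile)
```

Each residue ends up with 15 x 20 = 300 input values, so a classifier sees the local sequence context rather than a single position.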
Bioinformatics for High-throughput Virus Detection and Discovery
Pathogen detection is a challenging problem given that any given specimen may contain one or more of many different microbes. Additionally, a specimen may contain microbes that have yet to be discovered. Traditional diagnostics are ill-equipped to address these challenges because they are focused on the detection of a single agent or panel of agents. I have developed three innovative computational approaches for analyzing high-throughput genomic assays capable of detecting many microbes in a parallel and unbiased fashion. The first is a metagenomic sequence analysis pipeline that was initially applied to 12 pediatric diarrhea specimens in order to give the first-ever look at the diarrhea virome. Metagenomic sequencing and subsequent analysis revealed a spectrum of viruses in these specimens, including known and highly divergent viruses. This metagenomic survey serves as a basis for future investigations about the possible role of these viruses in disease. The second tool I developed is a novel algorithm for diagnostic microarray analysis called VIPR (Viral Identification with a PRobabilistic algorithm). The main advantage of VIPR relative to other published methods for diagnostic microarray analysis is that it relies on a training set of empirical hybridizations of known viruses to guide future predictions. VIPR uses a Bayesian statistical framework in order to accomplish this. A set of hemorrhagic fever viruses and their relatives were hybridized to a total of 110 microarrays in order to test the performance of VIPR. VIPR achieved an accuracy of 94% and outperformed existing approaches for this dataset. The third tool I developed for pathogen detection is called VIPR HMM. VIPR HMM expands upon VIPR's previous implementation by incorporating a hidden Markov model (HMM) in order to detect recombinant viruses. VIPR HMM correctly identified 95% of inter-species breakpoints for a set of recombinant alphaviruses and flaviviruses. Mass sequencing and diagnostic microarrays require robust computational tools in order to make predictions regarding the presence of microbes in specimens of interest. High-throughput diagnostic assays coupled with powerful analysis tools have the potential to increase the efficacy with which we detect pathogens and treat disease as these technologies play more prominent roles in clinical laboratories.
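The idea of using a hidden Markov model to locate recombination breakpoints can be sketched with a generic Viterbi decoder. This is not the VIPR HMM implementation: the hidden states here are hypothetical parental viruses, each observation is assumed to be the best-matching parent for one genome window, and the `stay` and `match` probabilities are illustrative values. The sketch requires at least two states.

```python
import math


def viterbi(observations, states, stay=0.95, match=0.9):
    """Most likely sequence of parental states given per-window best-hit labels."""
    switch = (1.0 - stay) / (len(states) - 1)
    miss = (1.0 - match) / (len(states) - 1)

    def emit(state, obs):
        return math.log(match if state == obs else miss)

    def trans(prev, cur):
        return math.log(stay if prev == cur else switch)

    # Uniform prior over the starting state.
    score = {s: math.log(1.0 / len(states)) + emit(s, observations[0]) for s in states}
    back = []
    for obs in observations[1:]:
        pointers, new_score = {}, {}
        for cur in states:
            best = max(states, key=lambda prev: score[prev] + trans(prev, cur))
            pointers[cur] = best
            new_score[cur] = score[best] + trans(best, cur) + emit(cur, obs)
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    path = [max(states, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def breakpoints(path):
    """Indices where the decoded parental state changes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Toy observations: one noisy window (index 4), then a real switch at index 9.
obs = list("AAAABAAAABBBBBBB")
path = viterbi(obs, ["A", "B"])
```

Because switching states is penalized, the single noisy window is absorbed into the surrounding state while the sustained change is reported as a breakpoint, which is the behaviour that makes an HMM attractive for this task.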