333 research outputs found

Unsupervised and semi-supervised training methods for eukaryotic gene prediction

Author: Ter-Hovhannisyan Vardges
Publication venue: Georgia Institute of Technology
Publication date: 17/11/2008

This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. These training sets often require time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence. The practical applicability of unsupervised training is critical given the ever-growing rate of eukaryotic genome sequencing. Three distinct training procedures are developed for diverse groups of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species that possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns. The results indicate that the developed unsupervised training methods perform well compared to other training methods, as estimated from the set of genes supported by EST-to-genome alignments. Analysis of novel genomes reveals interesting biological findings and shows that several candidates for under-annotated and over-annotated fungal species are present in the current set of annotated fungal genomes.

Ph.D.
Committee Chair: Mark Borodovsky; Committee Member: Jung H. Choi; Committee Member: King Jordan; Committee Member: Leonid Bunimovich; Committee Member: Yury Chernoff

Scholarly Materials And Research @ Georgia Tech

An empirical analysis of training protocols for probabilistic gene finders

Author: Majoros William H
Salzberg Steven L
Publication venue: BioMed Central
Publication date: 01/12/2004

BACKGROUND: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters. RESULTS: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method. CONCLUSIONS: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements

Springer

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

AUGUSTUS: ab initio prediction of alternative transcripts

Author: Gunduz Irfan
Hayes Alec
Keller Oliver
Morgenstern Burkhard
Stanke Mario
Waack Stephan
Publication venue: Oxford University Press
Publication date: 01/01/2006

AUGUSTUS is a software tool for gene prediction in eukaryotes based on a Generalized Hidden Markov Model, a probabilistic model of a sequence and its gene structure. Like most existing gene finders, the first version of AUGUSTUS returned one transcript per predicted gene and ignored the phenomenon of alternative splicing. Herein, we present a WWW server for an extended version of AUGUSTUS that is able to predict multiple splice variants. To our knowledge, this is the first ab initio gene finder that can predict multiple transcripts. In addition, we offer a motif searching facility, where user-defined regular expressions can be searched against putative proteins encoded by the predicted genes. The AUGUSTUS web interface and the downloadable open-source stand-alone program are freely available from
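The motif searching facility described in this abstract, where user-defined regular expressions are matched against putative proteins, can be illustrated with a minimal sketch. This is not the AUGUSTUS server's code; the function name `find_motif_hits`, the protein identifiers, and the sequences are all hypothetical, and the example pattern is just a commonly cited N-glycosylation-style motif chosen for illustration.

```python
import re


def find_motif_hits(proteins, pattern):
    """Return (protein_id, start, matched_text) for every regex match.

    `proteins` maps an identifier to an amino-acid string. Start positions
    are 0-based offsets into that string.
    """
    regex = re.compile(pattern)
    hits = []
    for protein_id, sequence in proteins.items():
        for match in regex.finditer(sequence):
            hits.append((protein_id, match.start(), match.group()))
    return hits


# Hypothetical putative proteins from predicted genes (made-up data).
proteins = {
    "g1.t1": "MKNGTAAC",
    "g2.t1": "MPLLKV",
}

# User-defined motif: N, then any residue except P, then S or T.
hits = find_motif_hits(proteins, r"N[^P][ST]")
print(hits)
```

Note that `finditer` reports non-overlapping matches only; a search that must report overlapping motif occurrences would need a lookahead-based pattern instead.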

CiteSeerX

Crossref

PubMed Central

A Brief Review of Computational Gene Prediction Methods

Author: Chen Yazhu
Li Yixue
Wang Zhuo
Publication venue: Elsevier BV
Publication date: 30/11/2004

With the development of genome sequencing for many organisms, more and more raw sequences need to be annotated. Gene prediction by computational methods for finding the location of protein coding regions is one of the essential issues in bioinformatics. Two classes of methods are generally adopted: similarity based searches and ab initio prediction. Here, we review the development of gene prediction methods, summarize the measures for evaluating predictor quality, highlight open problems in this area, and discuss future research directions
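The abstract does not name the evaluation measures it summarizes. As a hedged illustration, the sketch below assumes the measures conventionally used in gene prediction at the nucleotide level: sensitivity as TP / (TP + FN) and specificity as TP / (TP + FP). The function name `nucleotide_accuracy` and the coordinates are hypothetical, and intervals are taken to be 1-based and inclusive.

```python
def nucleotide_accuracy(predicted, annotated):
    """Compute nucleotide-level (sensitivity, specificity).

    Both arguments are lists of (start, end) coding intervals,
    1-based and inclusive. Specificity here follows the gene-finding
    convention TP / (TP + FP), which is an assumption of this sketch.
    """
    pred_positions = set()
    for start, end in predicted:
        pred_positions.update(range(start, end + 1))

    true_positions = set()
    for start, end in annotated:
        true_positions.update(range(start, end + 1))

    tp = len(pred_positions & true_positions)  # coding, predicted coding
    fp = len(pred_positions - true_positions)  # non-coding, predicted coding
    fn = len(true_positions - pred_positions)  # coding, missed

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Made-up example: one annotated exon and one half-overlapping prediction.
sn, sp = nucleotide_accuracy(predicted=[(51, 150)], annotated=[(1, 100)])
print(sn, sp)
```

Expanding intervals into sets of positions keeps the sketch short but is only practical for small regions; genome-scale evaluation would compare sorted intervals directly.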

Elsevier - Publisher Connector

JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions

Author: Allen Jonathan E
Majoros William H
Pertea Mihaela
Salzberg Steven L
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Predicting complete protein-coding genes in human DNA remains a significant challenge. Though a number of promising approaches have been investigated, an ideal suite of tools has yet to emerge that can provide near perfect levels of sensitivity and specificity at the level of whole genes. As an incremental step in this direction, it is hoped that controlled gene finding experiments in the ENCODE regions will provide a more accurate view of the relative benefits of different strategies for modeling and predicting gene structures. RESULTS: Here we describe our general-purpose eukaryotic gene finding pipeline and its major components, as well as the methodological adaptations that we found necessary in accommodating human DNA in our pipeline, noting that a similar level of effort may be necessary by ourselves and others with similar pipelines whenever a new class of genomes is presented to the community for analysis. We also describe a number of controlled experiments involving the differential inclusion of various types of evidence and feature states into our models and the resulting impact these variations have had on predictive accuracy. CONCLUSION: While in the case of the non-comparative gene finders we found that adding model states to represent specific biological features did little to enhance predictive accuracy, for our evidence-based 'combiner' program the incorporation of additional evidence tracks tended to produce significant gains in accuracy for most evidence types, suggesting that improved modeling efforts at the hidden Markov model level are of relatively little value. We relate these findings to our current plans for future research

Springer - Publisher Connector

PubMed Central

Digital Repository at the University of Maryland

Improved Reference Genome Sequence of Coccidioides immitis Strain WA_211, Isolated in Washington State.

Author: Barker Bridget Marie
Stajich Jason E
Teixeira Marcus de Melo
Publication venue: eScholarship, University of California
Publication date: 01/08/2019

Coccidioides fungi are widely distributed in the American continents, with an expanding western range documented by a recently discovered cryptic population of Coccidioides immitis in Washington State. The assembled and annotated reference genome sequence of the soil-derived C. immitis strain WA_211 will support population and functional genomics studies

eScholarship - University of California

AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome

Author: Morgenstern Burkhard
Stanke Mario
Tzvetkova Ana
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: A large number of gene prediction programs for the human genome exist. These annotation tools use a variety of methods and data sources. In the recent ENCODE genome annotation assessment project (EGASP), some of the most commonly used and recently developed gene-prediction programs were systematically evaluated and compared on test data from the human genome. AUGUSTUS was among the tools that were tested in this project. RESULTS: AUGUSTUS can be used as an ab initio program, that is, as a program that uses only one single genomic sequence as input information. In addition, it is able to combine information from the genomic sequence under study with external hints from various sources of information. For EGASP, we used genomic sequence alignments as well as alignments to expressed sequence tags (ESTs) and protein sequences as additional sources of information. Within the category of ab initio programs AUGUSTUS predicted significantly more genes correctly than any other ab initio program. At the same time it predicted the smallest number of false positive genes and the smallest number of false positive exons among all ab initio programs. The accuracy of AUGUSTUS could be further improved when additional extrinsic data, such as alignments to EST, protein and/or genomic sequences, was taken into account. CONCLUSION: AUGUSTUS turned out to be the most accurate ab initio gene finder among the tested tools. Moreover it is very flexible because it can take information from several sources simultaneously into consideration

CiteSeerX

Springer - Publisher Connector

PubMed Central

A phylogenetic generalized hidden Markov model for predicting alternatively spliced exons

Author: Allen Jonathan E
Salzberg Steven L
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: An important challenge in eukaryotic gene prediction is accurate identification of alternatively spliced exons. Functional transcripts can go undetected in gene expression studies when alternative splicing only occurs under specific biological conditions. Non-expression based computational methods support identification of rarely expressed transcripts. RESULTS: A non-expression based statistical method is presented to annotate alternatively spliced exons using a single genome sequence and evidence from cross-species sequence conservation. The computational method is implemented in the program ExAlt and an analysis of prediction accuracy is given for Drosophila melanogaster. CONCLUSION: ExAlt identifies the structure of most alternatively spliced exons in the test set and cross-species sequence conservation is shown to improve the precision of predictions. The software package is available to run on Drosophila genomes to search for new cases of alternative splicing

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

CodingQuarry: Highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts

Author: Ellwood Simon R
Hane James K
Oliver Richard P
Testa Alison C
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes, or significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. Results: CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. Conclusions: We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available (https://sourceforge.net/projects/codingquarry/), and suitable for incorporation into genome annotation pipelines

Crossref

Springer - Publisher Connector

PubMed Central

espace@Curtin