12 research outputs found

A Brief Review of Computational Gene Prediction Methods

Author: Chen Yazhu
Li Yixue
Wang Zhuo
Publication venue: Elsevier BV
Publication date: 30/11/2004

As genomes are sequenced for more and more organisms, a growing volume of raw sequence needs to be annotated. Computational gene prediction, which locates protein-coding regions, is one of the essential problems in bioinformatics. Two classes of methods are generally adopted: similarity-based searches and ab initio prediction. Here, we review the development of gene prediction methods, summarize the measures for evaluating predictor quality, highlight open problems in this area, and discuss future research directions.

Elsevier - Publisher Connector

DIGAP - a Database of Improved Gene Annotation for Phytopathogens

Author: Chang Ji-Wei
Chen Ling-Ling
Gao Bei
Gao Na
Ji Hong-Fang
Wang Wei
Zhang Hong-Yu
Zhang Lin
Zhang Shi-Cui
Publication venue: BioMed Central
Publication date: 01/01/2010

Abstract. Background: Bacterial plant pathogens are very harmful to their host plants and can cause devastating agricultural losses worldwide. With the development of microbial genome sequencing, many strains of phytopathogens have been sequenced. However, some misannotations exist in these phytopathogen genomes. Our objective is to improve these annotations and store them in a central database, DIGAP. Description: DIGAP includes the following improved information on phytopathogen genomes. (i) All the 'hypothetical proteins' were checked, and non-coding ORFs recognized by the Z curve method were removed. (ii) The translation initiation sites (TISs) of 20% to 25% of all the protein-coding genes have been corrected based on NCBI RefSeq, the ProTISA database and an ab initio program, GS-Finder. (iii) Potential functions of about 10% of the 'hypothetical proteins' have been predicted using sequence alignment tools. (iv) Two theoretical gene expression indices, the codon adaptation index (CAI) and the E(g) index, were calculated to predict gene expression levels. (v) Potential agricultural bactericide targets and their homology-modeled 3D structures are provided in the database, which is of significance for agricultural antibiotic discovery. Conclusion: The results in DIGAP provide useful information for understanding the pathogenetic mechanisms of phytopathogens and for finding agricultural bactericides. DIGAP is freely available at http://ibi.hzau.edu.cn/digap/.
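To make item (iv) concrete, here is a minimal sketch of a codon adaptation index calculation. The abstract only names the index; the formula used below is the commonly cited definition (geometric mean of each codon's relative adaptiveness within its synonymous family), and the reference gene set, the 0.5 pseudocount for unseen codons, the exclusion of Met, Trp and stop codons, and all function names are assumptions for illustration, not DIGAP's actual implementation.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_of(seq):
    """Split a coding sequence into complete in-frame codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Weight of each codon = its count / count of the most used synonym.

    Assumption: reference_genes is a user-supplied set of highly
    expressed genes; codons never seen get a small pseudocount.
    """
    counts = Counter(
        c for gene in reference_genes for c in codons_of(gene) if c in CODON_TABLE
    )
    weights = {}
    # Met and Trp have a single codon; stop codons are not scored.
    for aa in set(AMINO) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        adjusted = {c: counts[c] or pseudocount for c in family}
        top = max(adjusted.values())
        for c in family:
            weights[c] = adjusted[c] / top
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of the gene's scorable codons."""
    logs = [math.log(weights[c]) for c in codons_of(gene) if c in weights]
    if not logs:
        raise ValueError("gene has no scorable codons")
    return math.exp(sum(logs) / len(logs))
```

A gene built only from the most frequent codon of each family scores 1.0; rarer synonymous codons pull the value toward 0.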

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A rebuttal to the comments on the genome order index and the Z-curve

Author: Zhang Ren
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract. Background: Elhaik, Graur and Josic recently commented on the genome order index (S) and the Z-curve (Elhaik et al. Biol Direct 2010, 5: 10). S is a quantity defined as $S = a^2 + c^2 + g^2 + t^2$, where a, c, g and t denote the corresponding base frequencies. The Z-curve is a three-dimensional curve that represents a DNA sequence such that each can be uniquely reconstructed given the other. Elhaik et al. made 4 major claims. 1) In the previous mapping system with the regular tetrahedron, calculation of the radius of the inscribed sphere is "a mathematical error". 2) S follows an exponential distribution and is narrowly distributed with a range of (0.25 - 0.33). 3) Based on Chargaff's second parity rule (PR2), "S is equivalent to H [Shannon entropy]" and they are derivable from each other. 4) The Z-curve "suffers from over dimensionality", because based on the analysis of 235 bacterial genomes, the x and y components contributed less than 1% of the variance and therefore "would be of little use". Results: 1) Elhaik et al. mistakenly neglected the parameter $4/\sqrt{3}$ when calculating the radius of the inscribed sphere. 2) The exponential distribution of S is a restatement of our previous conclusion, and the range of (0.25 - 0.33) only paraphrases the previously suggested S range (0.25 - 1/3). 3) Elhaik et al. incorrectly disregard deviations from PR2 by treating the deviations as 0 altogether, reduce S and H, both having 4 variables, a, c, g and t, into functions of one single variable, a only, and apply this treatment to all DNA sequences as the basis of their "demonstration", which is therefore invalid. 4) Elhaik et al. confuse numerical smallness with biological insignificance, and disregard the distributions of purine/pyrimidine and amino/keto bases (x and y components), whose variations, although they can be smaller than that of GC content, contain rich information that is important and useful, such as in locating replication origins of bacterial and archaeal genomes, and in studies of gene recognition in various species. Conclusion: Elhaik et al. confuse S (a single number) with the Z-curve (a series of 3D coordinates), which are distinct. To use S as a case study of the Z-curve, by itself, is invalid. S and H are neither equivalent nor derivable from each other. The criticisms of Elhaik, Graur and Josic are wrong. Reviewers: This article was reviewed by Erik van Nimwegen.
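The distinction drawn in this abstract between S (a single number) and the Z-curve (a series of 3D coordinates) can be sketched in a few lines. The S formula is the one given in the abstract. The Z-curve step definitions (x: purine vs. pyrimidine, y: amino vs. keto, z: weak vs. strong hydrogen bonding) follow the usual definition in the Z-curve literature; the abstract names only the x and y components. The function names and the choice to skip ambiguous bases are assumptions made for illustration.

```python
from collections import Counter


def genome_order_index(seq):
    """S = a^2 + c^2 + g^2 + t^2, with a, c, g, t the base frequencies."""
    seq = seq.upper()
    counts = Counter(b for b in seq if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return sum((counts[b] / total) ** 2 for b in "ACGT")


def z_curve(seq):
    """Return the cumulative (x, y, z) point after each base.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino bases (A, C) minus keto bases (G, T)
    z: weak H-bond bases (A, T) minus strong H-bond bases (G, C)
    """
    steps = {
        "A": (1, 1, 1),
        "C": (-1, 1, -1),
        "G": (1, -1, -1),
        "T": (-1, -1, 1),
    }
    x = y = z = 0
    points = []
    for base in seq.upper():
        if base not in steps:
            continue  # assumption: ambiguous bases such as N are skipped
        dx, dy, dz = steps[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points
```

S reaches its minimum of 0.25 when the four bases are equally frequent and equals 1 for a sequence made of a single base, whereas the Z-curve keeps one point per position, which is why the sequence can be rebuilt from it.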

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Re-annotation of protein-coding genes in 10 complete genomes of Neisseriaceae family by combining similarity-based and composition-based methods.

Author: Guo F
Lau SKP
Teng LL
Woo PCY
Xiong L
Yuen KY
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2013

HKU Scholars Hub

Recognition of prokaryotic promoters based on a novel variable-window Z-curve method

Author: Kai Song
Publication venue: Oxford University Press

Transcription is the first step in gene expression, and it is the step at which most of the regulation of expression occurs. Although sequenced prokaryotic genomes provide a wealth of information, transcriptional regulatory networks are still poorly understood from the available genomic information, largely because accurate prediction of promoters is difficult. To improve promoter recognition performance, a novel variable-window Z-curve method is developed to extract general features of prokaryotic promoters. The features are then classified by the partial least squares technique. To verify the prediction performance, the proposed method is applied to predict promoter fragments of two representative prokaryotic model organisms (Escherichia coli and Bacillus subtilis). Owing to the feature extraction and selection power of the proposed method, the promoter prediction accuracies are improved markedly over most existing approaches: for E. coli, the accuracies are 96.05% (σ70 promoters, coding negative samples), 90.44% (σ70 promoters, non-coding negative samples), 92.13% (known sigma-factor promoters, coding negative samples) and 92.50% (known sigma-factor promoters, non-coding negative samples); for B. subtilis, the accuracies are 95.83% (known sigma-factor promoters, coding negative samples) and 99.09% (known sigma-factor promoters, non-coding negative samples). Additionally, because the technique is linear, the method is computationally simple and runs in a matter of minutes on ordinary personal computers or even laptops. More importantly, there is no need to optimize parameters, so it is very practical for predicting promoters of other species without any prior knowledge of the statistical properties of the samples.

Crossref

PubMed Central

Classifying Coding DNA with Nucleotide Statistics

Author: Carels Nicolas
Frías Diego
Publication venue: Libertas Academica
Publication date: 01/01/2009

In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method returns coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
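Some of the nucleotide statistics listed in this abstract can be illustrated with a short sketch. For a single reading frame, the code below computes the purine probability at each codon position and their product (item ii), the number of in-frame stop codons (related to item i), and the GC2 and GC3 levels (the inputs to item v). It does not reproduce the UFM score itself, since the abstract does not give the weights or the regression line; the function name and the returned dictionary layout are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_statistics(seq, frame=0):
    """Per-position nucleotide statistics for one reading frame (0, 1 or 2)."""
    seq = seq.upper()
    # Complete codons only, starting at the requested frame offset.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Assumption: codons with ambiguous bases (e.g. N) are ignored.
    codons = [c for c in codons if set(c) <= set("ACGT")]
    if not codons:
        raise ValueError("no complete unambiguous codons in this frame")
    n = len(codons)
    purine = [sum(c[p] in "AG" for c in codons) / n for p in range(3)]
    gc = [sum(c[p] in "GC" for c in codons) / n for p in range(3)]
    return {
        "codons": n,
        "stop_codons": sum(c in STOP_CODONS for c in codons),
        "purine_by_position": purine,
        "purine_product": purine[0] * purine[1] * purine[2],
        "gc2": gc[1],
        "gc3": gc[2],
    }
```

Running the function on all three frames of both strands and comparing the results is one way to see the frame dependence that the abstract relies on.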

CiteSeerX

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Directory of Open Access Journals

PubMed Central

An Integrative Method for Identifying the Over-Annotated Protein-Coding Genes in Microbial Genomes

Author: D.-K. Jiang
J. Guo
J.-F. Yu
J.-H. Wang
K. Xiao
X. Sun
Publication venue: Oxford University Press

Falsely annotated protein-coding genes are considered one of the major causes of annotation errors in public databases. Although many filtering approaches have been designed for over-annotated protein-coding genes, some are questionable because they increase the number of false negatives. Furthermore, there is no webserver or software specifically devised for the problem of over-annotation. In this study, we propose an integrative algorithm for detecting over-annotated protein-coding genes in microorganisms. Overall, an average accuracy of 99.94% is achieved over 61 microbial genomes. This extremely high accuracy indicates that the presented algorithm differentiates protein-coding genes from non-coding open reading frames effectively. Extensive analyses show that the prediction results are reliable and that the integrative algorithm is robust and convenient. Our analysis also indicates that over-annotated protein-coding genes can cause false positives in the detection of horizontal gene transfer. The webserver of the proposed algorithm is freely accessible at www.cbi.seu.edu.cn/RPGM.

Crossref

PubMed Central

Evidence of abundant stop codon readthrough in Drosophila and other Metazoa

Author: Chan Clara Sophia
Jungreis Irwin
Kellis Manolis
Lin Michael F.
Negre Nicolas
Spokony Rebecca
Victorsen Alec
White Kevin P.
Publication venue: Cold Spring Harbor Laboratory
Publication date: 01/12/2010

While translational stop codon readthrough is often used by viral genomes, it has been observed for only a handful of eukaryotic genes. We previously used comparative genomics evidence to recognize protein-coding regions in 12 species of Drosophila and showed that for 149 genes, the open reading frame following the stop codon has a protein-coding conservation signature, hinting that stop codon readthrough might be common in Drosophila. We return to this observation armed with deep RNA sequence data from the modENCODE project, an improved higher-resolution comparative genomics metric for detecting protein-coding regions, comparative sequence information from additional species, and directed experimental evidence. We report an expanded set of 283 readthrough candidates, including 16 double-readthrough candidates; these were manually curated to rule out alternatives such as A-to-I editing, alternative splicing, dicistronic translation, and selenocysteine incorporation. We report experimental evidence of translation using GFP tagging and mass spectrometry for several readthrough regions. We find that the set of readthrough candidates differs from other genes in length, composition, conservation, stop codon context, and in some cases, conserved stem–loops, providing clues about readthrough regulation and potential mechanisms. Lastly, we expand our studies beyond Drosophila and find evidence of abundant readthrough in several other insect species and one crustacean, and several readthrough candidates in nematode and human, suggesting that functionally important translational stop codon readthrough is significantly more prevalent in Metazoa than previously recognized. Funding: National Institutes of Health (U.S.) (U54 HG00455-01); National Science Foundation (U.S.) (CAREER 0644282); Alfred P. Sloan Foundation

DSpace@MIT

Crossref

PubMed Central

Novel bioinformatics programs for taxonomical classification and functional analysis of the whole genome sequencing data of arbuscular mycorrhizal fungi

Author: Kang Jee Eun
Publication date: 01/10/2018

Summary (translated from the French) [TITLE] Taxonomic classification and position-specific functional analysis of genomic sequences of arbuscular mycorrhizal fungi and their associated microorganisms [PROBLEM AND CONCEPTUAL FRAMEWORK] Arbuscular mycorrhizal fungi (AMF) are obligate symbionts of the roots of most vascular plants. AMF belong to the phylum Glomeromycota and are considered a primitive fungal lineage that has retained coenocytic hyphae and the production of multinucleate asexual spores. Numerous studies have shown that many microorganisms are associated with AMF mycelia, both on the surface of the hyphae and spores and inside them. Sequencing the genomes of AMF grown in vivo is a considerable challenge, because the data form a metagenome made up of the genome of the AMF itself and the genomes of its associated microbes. Consequently, identifying the taxonomic origin of each sequence is an extremely arduous task. In my project, I developed two new bioinformatics programs that classify sequences by taxonomic group and identify their functions. I created a database of 444 genomes from species belonging to 54 genera. These bacterial and fungal species were chosen on the basis of their abundance in soils. [METHODOLOGY] The bioinformatics program uses the reference table of microorganisms and statistical methods to classify sequences taxonomically. Synonymous codon tables were then built from secondary structures (SS) in the Protein Data Bank (PDB) for coding sequences (CDS), and from composition motifs for non-coding sequences (non-CDS).
Each table has three levels: amino acid characteristics, usage of the corresponding synonymous amino acids, and usage of the corresponding synonymous codons. Compared with existing methods, which use global average substitution rates regardless of the specific properties of amino acids in different structures, my program provides high-resolution classification of short sequences (150-300 bp), because synonymous codon usage biases were extracted from about 8000 amino acid trimers specific to secondary-structure subunits, with amino acid substitutions taken into account within each specific trimer. For functional analysis, the program dynamically builds comparative data for 54 microbial genera from their synonymous codon usage biases for the matching three-codon DNA 9-mers identified in a query sequence. The program then applies correlation-matrix principal component analysis together with k-means clustering to the comparative data. [OUTCOMES] Correct prediction rates for CDS and non-CDS were 50 to 71% for bacteria and 65 to 73% for fungi, respectively. For AMF, 49% of CDS and 72% of non-CDS were correctly classified. This program allows us to estimate the approximate abundances of the microbial communities associated with AMF. The results of the functional analysis can provide information on important molecular interaction sites involved in sequence diversification and gene evolution. The programs are freely available at www.fungalsesame.org.
Keywords: sesame; sesame PS function; amino acid characteristics; three-codon DNA 9-mers; secondary structure; taxonomic classification; position-specific functional analysis; genetic code; comparative study; mitochondrial genome
Abstract Arbuscular Mycorrhizal Fungi (AMF) are obligate plant-root symbionts belonging to the phylum Glomeromycota. They form coenocytic hyphae and reproduce through large multinucleated asexual spores. Numerous studies have shown that AMF interact closely or loosely with a myriad of microorganisms, particularly bacteria and fungi that live on the surface of or inside their mycelia and spores. Whole genome sequencing (WGS) data of AMF grown in vivo (typically in the roots of a host plant in a pot filled with soil) contain a large number of sequences from microorganisms inhabiting their spores along with their own genome sequences, resulting in a metagenome. The goal of my study was to develop bioinformatics programs for taxonomical classification and for functional analysis of the WGS data of AMF. In metagenomics, there are two main approaches to taxonomical classification: similarity-based (i.e., homology search) and composition-based (i.e., k-mers) methods. The similarity-based method depends solely on bioinformatics sequence databases and homology search programs such as BLAST. It may not be suitable for AMF, an ancient fungal lineage, because bioinformatics databases represent only a small fraction of the diversity of existing microorganisms, and gene prediction programs are highly biased towards intensively studied microorganisms.
Considering that AMF have high inter- and intra-genome variation, in addition to coenocytic and multi-genomic characteristics, probably due to their adaptation via various kinds of symbioses, the composition-based method alone is not an effective solution for AMF, because it relies on base composition biases and focuses on taxonomical classification of prokaryotic organisms. In the first project, I developed a novel bioinformatics program, called SeSaMe (Spore associated Symbiotic Microbes), for taxonomical classification of the WGS data of AMF. I selected microorganisms that were dominant in the soil environment and grouped them into 54 genera, which were used as references. I created a reference sequence database with a variable called the Three codon DNA 9-mer. These were created from a large number of structure files from the Protein Data Bank (PDB): approx. 224,000 Three codon DNA 9-mers encoding subunits of protein secondary structures. Based on the reference sequence database, I created genus-specific usage databases containing codon usage and amino acid usage per genus. The program distinguishes between coding sequence (CDS) and non-CDS, detects an open reading frame, and classifies a query sequence into one of the 54 reference genera. The developed program enables us to estimate relative abundances of taxonomic groups and to assess symbiotic roles of taxonomic groups associated with AMF. The program can be applied to other microorganisms as well as to soil metagenome data, and it has applications in applied environmental microbiology. The developed program is available free of charge at www.fungalsesame.org. In the second project, I developed another bioinformatics program, called SeSaMe PS Function, for position-specific functional analysis of the WGS data of AMF. AMF may contain a large portion of genes with unknown functions for which we may not be able to find homologues in existing sequence databases.
While existing motif annotation programs rely on sequence alignment and have limitations for inferring the functionality of novel genes, the developed program uses exploratory data analysis to identify potentially important interaction sites within a query sequence that are structurally and functionally distinct from other subsequences. The program identifies matching Three codon DNA 9-mers in a query sequence and dynamically creates a comparative dataset of 54 genera, based on codon usage bias information retrieved from the genus-specific usage databases. The program applies correlation Principal Component Analysis in conjunction with the K-means clustering method to the comparative dataset. The program identifies outliers: Three codon DNA 9-mers assigned to a cluster with a single member or with only a few members are often outliers with important structures that may play roles in molecular interaction. In the third project, I developed a novel bioinformatics program called Posts (POsition Specific genetic code Tables) that assigns a codon to an amino acid group according to the codon position. The standard genetic code table may be more readily applicable to genes whose genetic codes comply with the standard biological coding rules obtained from model organisms grown under laboratory conditions. However, it may be insufficient for studying the evolution of genetic codes, which may provide important information about codon properties. The mainstream hypotheses of genetic code origin suggest that codon position played important roles in the evolution of genetic codes. As a case study, we investigated irregular codons in 187 mitochondrial genomes of plants, lichen-forming fungi, endophytic fungi, and AMF. Each column of the Post contains 16 codons, and the amino acids encoded by these are called an amino acid characteristics group (A.A. Char Group). Based on the A.A. Char Group, an irregular codon can be classified as within-column type or trans-column type.
The majority of the identified irregular codons belonged to the within-column type. The Post may offer new perspectives on codon properties and codon assignment. The developed program is freely available at www.codon.kr. Taken together, the developed programs, SeSaMe, SeSaMe PS Function, and the Post, provide important research tools for advancing our knowledge of AMF genomics and for studying their symbiotic relations with associated microorganisms. Keywords: Sesame; Spore associated Symbiotic Microbes; Symbiosis; Sesame PS function; Arbuscular mycorrhizal fungi; Three codon DNA 9-mer; Amino acid characteristics; Secondary structure; Taxonomical classification; Position specific functional analysis; Position specific genetic code tables; Post; Comparative study; Mitochondrial genome

Dépôt Institutionnel Numérique