    A Brief Review of Computational Gene Prediction Methods

    With the development of genome sequencing for many organisms, more and more raw sequences need to be annotated. Gene prediction by computational methods for finding the location of protein-coding regions is one of the essential issues in bioinformatics. Two classes of methods are generally adopted: similarity-based searches and ab initio prediction. Here, we review the development of gene prediction methods, summarize the measures for evaluating predictor quality, highlight open problems in this area, and discuss future research directions.
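
    As an illustration of the kind of quality measure such a review summarizes, the sketch below computes nucleotide-level sensitivity and specificity for a predicted set of coding positions. It assumes the convention common in the gene-prediction literature, where specificity is TP / (TP + FP); the function name and the example coordinates are hypothetical and are not taken from the review.

```python
from typing import Iterable, Tuple


def nucleotide_sn_sp(true_coding: Iterable[int],
                     predicted_coding: Iterable[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed convention (common in gene-prediction evaluation):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    """
    true_set, pred_set = set(true_coding), set(predicted_coding)
    tp = len(true_set & pred_set)   # coding positions predicted as coding
    fn = len(true_set - pred_set)   # coding positions that were missed
    fp = len(pred_set - true_set)   # non-coding positions predicted as coding
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    return sn, sp


# Hypothetical example: the true exon covers 100..199, the prediction 150..249.
sn, sp = nucleotide_sn_sp(range(100, 200), range(150, 250))
print(sn, sp)  # 0.5 0.5
```

    Representing annotations as sets of positions keeps the sketch short; for whole genomes, interval arithmetic would be the more practical choice.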

    Gene identification and analysis: an application of neural network-based information fusion

    Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences

    A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%–70% overall accuracy and greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation), and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.

    An efficient and high-throughput approach for experimental validation of novel human gene predictions

    A highly automated RT-PCR-based approach has been established to validate novel human gene predictions with no prior experimental evidence of mRNA splicing (ab initio predictions). Ab initio gene predictions were selected for high-throughput validation using predicted protein classification, sequence similarity to other genomes, colocalization with an MPSS tag, or microarray expression. Initial microarray prioritization followed by RT-PCR validation was the most efficient combination, resulting in approximately 35% of the ab initio predictions being validated by RT-PCR. Of the 7252 novel genes that were prioritized and processed, 796 constituted real transcripts. In addition, high-throughput RACE successfully extended the 5′ and/or 3′ ends of >60% of RT-PCR-validated genes. Reevaluation of these transcripts produced 574 novel transcripts using RefSeq as a reference. RT-PCR sequencing in combination with RACE on ab initio gene predictions could be used to define the transcriptome across all species.

    Fusion of the human gene for the polyubiquitination coeffector UEV1 with Kua, a newly identified gene

    UEV proteins are enzymatically inactive variants of the E2 ubiquitin-conjugating enzymes that regulate noncanonical elongation of ubiquitin chains. In Saccharomyces cerevisiae, UEV is part of the RAD6-mediated error-free DNA repair pathway. In mammalian cells, UEV proteins can modulate c-FOS transcription and the G2-M transition of the cell cycle. Here we show that the UEV genes from phylogenetically distant organisms present a remarkable conservation in their exon-intron structure. We also show that the human UEV1 gene is fused with the previously unknown gene Kua. In Caenorhabditis elegans and Drosophila melanogaster, Kua and UEV are in separated loci, and are expressed as independent transcripts and proteins. In humans, Kua and UEV1 are adjacent genes, expressed either as separate transcripts encoding independent Kua and UEV1 proteins, or as a hybrid Kua-UEV transcript, encoding a two-domain protein. Kua proteins represent a novel class of conserved proteins with juxtamembrane histidine-rich motifs. Experiments with epitope-tagged proteins show that UEV1A is a nuclear protein, whereas both Kua and Kua-UEV localize to cytoplasmic structures, indicating that the Kua domain determines the cytoplasmic localization of Kua-UEV. Therefore, the addition of a Kua domain to UEV in the fused Kua-UEV protein confers new biological properties to this regulator of variant polyubiquitination.

    Construction of an ~700-kb transcript map around the Familial Mediterranean Fever locus on human chromosome 16p13.3

    We used a combination of cDNA selection, exon amplification, and computational prediction from genomic sequence to isolate transcribed sequences from genomic DNA surrounding the familial Mediterranean fever (FMF) locus. Eighty-seven kb of genomic DNA around D16S3370, a marker showing a high degree of linkage disequilibrium with FMF, was sequenced to completion, and the sequence annotated. A transcript map reflecting the minimal number of genes encoded within the ∼700 kb of genomic DNA surrounding the FMF locus was assembled. This map consists of 27 genes with discrete messages detectable on Northerns, in addition to three olfactory-receptor genes, a cluster of 18 tRNA genes, and two putative transcriptional units that have typical intron–exon splice junctions yet do not detect messages on Northerns. Four of the transcripts are identical to genes described previously, seven have been independently identified by the French FMF Consortium, and the others are novel. Six related zinc-finger genes, a cluster of tRNAs, and three olfactory receptors account for the majority of transcribed sequences isolated from a 315-kb FMF central region (between D16S468/D16S3070 and cosmid 377A12). Interspersed among them are several genes that may be important in inflammation. This transcript map not only has permitted the identification of the FMF gene (MEFV), but also has provided us with an opportunity to probe the structural and functional features of this region of chromosome 16.
    Michael Centola, Xiaoguang Chen, Raman Sood, Zuoming Deng, Ivona Aksentijevich, Trevor Blake, Darrell O. Ricke, Xiang Chen, Geryl Wood, Nurit Zaks, Neil Richards, David Krizman, Elizabeth Mansfield, Sinoula Apostolou, Jingmei Liu, Neta Shafran, Anil Vedula, Melanie Hamon, Andrea Cercek, Tanaz Kahan, Deborah Gumucio, David F. Callen, Robert I. Richards, Robert K. Moyzis, Norman A. Doggett, Francis S. Collins, P. Paul Liu, Nathan Fischel-Ghodsian and Daniel L. Kastne

    Isolation and characterization of EgGST, a glutathione S-transferase protein transcript in oil palm (Elaeis guineensis Jacq.)

    The formation of callus and somatic embryos remains one of the major bottlenecks in oil palm tissue culture. Unlike other crops, oil palm tissue culture is a very slow process. In the present study, EgGST (GenBank accession no. AIC33066.1), an oil palm gene coding for a putative glutathione S-transferase protein, has been characterized molecularly. The full-length cDNA sequence of EgGST isolated from oil palm cultured leaf explants at the 6th week is 1002 bp in length, with an Open Reading Frame (ORF) of 651 bp. The deduced EgGST encodes a 216-amino-acid protein with a predicted molecular mass of 23.68 kD and a pI value of 6.16. Its protein sequence shares 63% identity with the glutathione s-transferase gstf2 from Oryza sativa Indica Group (GenBank accession no. ABR25713.1) and contains a thioredoxin fold and a chloride channel domain. Real-time RT-PCR results showed that the EgGST transcript was differentially expressed across a time series of fortnightly-cultured leaf explants and had higher transcript levels in nodular callus (NC) compared to friable callus (FC) for oil palm ortet of clone 4178. EgGST was also found to be preferentially expressed in all tissue culture derived materials except for oil palm cell suspension culture (CSC), whereas expression was almost negligible in all the non-tissue culture derived materials, except for root. Hence, it can be suggested that the EgGST transcript may be regulated differently at different stages of tissue culture and in various tissues. Interestingly, EgGST also displayed a tissue-specific expression pattern via RNA in situ hybridization. To our knowledge, this is the first reported study on the analysis of the localization of the target mRNA transcript of EgGST in different oil palm tissues. We postulated that EgGST might play significant roles at different stages of oil palm callogenesis, and could potentially be a candidate marker for oil palm callogenesis.
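
    The ORF and protein lengths reported above are mutually consistent if the 651 bp ORF is assumed to include the terminal stop codon: 651 / 3 = 217 codons, one of which is the stop, leaving 216 residues. The short check below makes that arithmetic explicit; the function name is hypothetical, and the stop-codon assumption is an inference here, not something the abstract states.

```python
def protein_length_from_orf(orf_bp: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an ORF of the given length in bp.

    Assumption: when includes_stop is True, the ORF length counts one
    terminal stop codon, which does not contribute a residue.
    """
    if orf_bp <= 0 or orf_bp % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    codons = orf_bp // 3
    return codons - 1 if includes_stop else codons


print(protein_length_from_orf(651))  # 216
```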

    Knowledge discovery and modeling in genomic databases

    This dissertation research is targeted toward developing effective and accurate methods for identifying gene structures in the genomes of higher eukaryotes, such as vertebrate organisms. Several effective hidden Markov models (HMMs) are developed to represent the consensus and degeneracy features of the functional sites including protein-translation start sites, mRNA splicing junction donor and acceptor sites in vertebrate genes. The HMM system based on the developed models is fully trained using an expectation maximization (EM) algorithm and the system performance is evaluated using a 10-way cross-validation method. Experimental results show that the proposed HMM system achieves high sensitivity and specificity in detecting the functional sites. This HMM system is then incorporated into a new gene detection system, called GeneScout. The main hypothesis is that, given a vertebrate genomic DNA sequence S, it is always possible to construct a directed acyclic graph G such that the path for the actual coding region of S is in the set of all paths on G. Thus, the gene detection problem is reduced to the analysis of paths in the graph G. A dynamic programming algorithm is employed by GeneScout to find the optimal path in G. Experimental results on the standard test dataset collected by Burset and Guigo indicate that GeneScout is comparable to existing gene discovery tools and complements the widely used GenScan system.
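
    The reduction of gene detection to a path problem on a directed acyclic graph can be illustrated with a generic sketch. The code below is not GeneScout's actual graph construction or scoring, which the abstract does not describe. It assumes hypothetical candidate segments given as (start, end, score), with end exclusive, draws an implicit edge from one segment to any segment that begins at or after its end, and uses dynamic programming to find the highest-scoring path.

```python
from typing import List, Tuple

Segment = Tuple[int, int, float]  # (start, end_exclusive, score); hypothetical


def best_segment_path(segments: List[Segment]) -> Tuple[float, List[Segment]]:
    """Highest-scoring chain of non-overlapping segments.

    Sorting by (start, end) gives a topological order of the implicit DAG,
    because any valid predecessor must end, and therefore start, before
    its successor starts.
    """
    if not segments:
        return 0.0, []
    segs = sorted(segments, key=lambda s: (s[0], s[1]))
    n = len(segs)
    best = [s[2] for s in segs]   # best[i]: best score of a path ending at i
    prev = [-1] * n               # back-pointers for path reconstruction
    for i in range(n):
        for j in range(i):
            compatible = segs[j][1] <= segs[i][0]
            if compatible and best[j] + segs[i][2] > best[i]:
                best[i] = best[j] + segs[i][2]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    path = []
    k = end
    while k != -1:                # walk the back-pointers to recover the path
        path.append(segs[k])
        k = prev[k]
    path.reverse()
    return best[end], path


# Hypothetical candidate exons with made-up scores.
candidates = [(0, 100, 2.0), (50, 150, 3.5), (120, 200, 1.0), (160, 260, 2.5)]
score, path = best_segment_path(candidates)
print(score, path)  # 6.0 [(50, 150, 3.5), (160, 260, 2.5)]
```

    This quadratic-time version is kept simple for readability; a real gene finder would also have to enforce reading-frame and splice-site compatibility on each edge.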