627 research outputs found
PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text
Current computational methods for exon-intron structure prediction from a
cluster of transcript (EST, mRNA) data do not exhibit the time and space
efficiency necessary to process large clusters of more than 20,000 ESTs and
genes longer than 1 Mb. Guaranteeing both accuracy and efficiency seems to be a
computational goal far from being achieved, since accuracy is strictly related
to exploiting the inherent redundancy of information present in a large
cluster. We propose a fast method for the problem that combines two ideas: a
novel algorithm with provably small time complexity for computing spliced
alignments of a transcript against a genome, and an efficient algorithm that
exploits the inherent redundancy of information in a cluster of transcripts to
select, among all possible factorizations of EST sequences, those that allow
inferring splice site junctions that are highly confirmed by the input data. The
EST alignment procedure is based on the construction of maximal embeddings, which
are sequences obtained from paths of a graph structure, called the Embedding Graph,
whose vertices are the maximal pairings of a genomic sequence T and an EST P.
The procedure runs in time linear in the size of P, T and of the output.
PIntron, the software tool implementing our methodology, is able to process in
a few seconds some critical genes that are not manageable by other gene
structure prediction tools. At the same time, PIntron exhibits high accuracy
(sensitivity and specificity) when compared with ENCODE data. Detailed
experimental data, additional results and PIntron software are available at
http://www.algolab.eu/PIntron
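The notion of a maximal pairing used in the abstract above can be illustrated with a minimal sketch. It assumes a maximal pairing is a triple (p, t, length) such that the substring of P starting at p and the substring of T starting at t are equal for that length and cannot be extended to the left or to the right. The function name `maximal_pairings` and the `min_len` parameter are hypothetical, and the quadratic scan below is a naive stand-in, not the linear-time procedure the abstract describes.

```python
def maximal_pairings(pattern, text, min_len=1):
    """Naively list maximal pairings (p, t, length) of pattern P and text T.

    Assumption: a pairing is maximal when the common substring cannot be
    extended by one character on either side. This is an illustrative
    O(|P| * |T|) scan, not the algorithm implemented in PIntron.
    """
    pairings = []
    n, m = len(pattern), len(text)
    for p in range(n):
        for t in range(m):
            if pattern[p] != text[t]:
                continue
            # Skip starts that can be extended to the left: they belong to
            # a longer pairing reported from an earlier start position.
            if p > 0 and t > 0 and pattern[p - 1] == text[t - 1]:
                continue
            # Extend to the right as far as the characters keep matching.
            length = 0
            while (p + length < n and t + length < m
                   and pattern[p + length] == text[t + length]):
                length += 1
            if length >= min_len:
                pairings.append((p, t, length))
    return pairings


if __name__ == "__main__":
    # Toy EST "ACGTTT" against a toy genomic sequence "ACGCCGTTT".
    print(maximal_pairings("ACGTTT", "ACGCCGTTT", min_len=3))
    # prints [(0, 0, 3), (1, 4, 5)]
```

In this toy case the two pairings overlap on the EST, which is the kind of ambiguity a path through a graph over pairings would have to resolve when choosing splice junctions.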
Complexity of Bidirectional Transcription and Alternative Splicing at Human RCAN3 Locus
Human RCAN3 (regulator of calcineurin 3) belongs to the human RCAN gene family
Computational modeling of gene structure in Arabidopsis thaliana
Computational gene identification by sequence inspection remains a challenging problem. For a typical Arabidopsis thaliana gene with five exons, at least one of the exons is expected to have at least one of its borders predicted incorrectly by ab initio gene finding programs. More detailed analysis for individual genomic loci can often resolve the uncertainty on the basis of EST evidence or similarity to potential protein homologues. Such methods are part of the routine annotation process. However, because the EST and protein databases are constantly growing, in many cases original annotation must be re-evaluated, extended, and corrected on the basis of the latest evidence. The Arabidopsis Genome Initiative is undertaking this task on the whole-genome scale via its participating genome centers. The current Arabidopsis genome annotation provides an excellent starting point for assessing the protein repertoire of a flowering plant. More accurate whole-genome annotation will require the combination of high-throughput and individual gene experimental approaches and computational methods. The purpose of this article is to discuss tools available to an individual researcher to evaluate gene structure prediction for a particular locus
Leveraging EST Evidence to Automatically Predict Alternatively Spliced Genes, Master's Thesis, December 2006
Current methods for high-throughput automatic annotation of newly sequenced genomes are largely limited to tools which predict only one transcript per gene locus. Evidence suggests that 20-50% of genes in higher eukaryotic organisms are alternatively spliced. This leaves the remainder of the transcripts to be annotated by hand, an expensive, time-consuming process. Genomes are being sequenced at a much higher rate than they can be annotated. We present three methods for using the alignments of inexpensive Expressed Sequence Tags in combination with HMM-based gene prediction with N-SCAN EST to recreate the vast majority of hand annotations in the D. melanogaster genome. In our first method, we “piece together” N-SCAN EST predictions with clustered EST alignments to increase the number of transcripts predicted per locus. This is shown to be a sensitive and accurate method, predicting the vast majority of known transcripts in the D. melanogaster genome. We then present an approach that uses these clusters of EST alignments to construct a Multi-Pass gene prediction phase, again piecing it together with clusters of EST alignments. While time consuming, Multi-Pass gene prediction is very accurate and more sensitive than single-pass. Finally, we present a new Hidden Markov Model instance, which augments the current N-SCAN EST HMM, that predicts multiple splice forms in a single pass of prediction. This method is less time consuming, and performs nearly as well as the multi-pass approach
Gene Capture by Helitron Transposons Reshuffles the Transcriptome of Maize
Helitrons are a family of mobile elements that were discovered in 2001 and are now known to exist in the entire eukaryotic kingdom. Helitrons, particularly those of maize, exhibit an intriguing property of capturing gene fragments and placing them into the mobile element. Helitron-captured genes are sometimes transcribed, giving birth to chimeric transcripts that intertwine coding regions of different captured genes. Here, we perused the B73 maize genome for high-quality, putative Helitrons that exhibit plus/minus polymorphisms and contain pieces of more than one captured gene. Selected Helitrons were monitored for expression via in silico EST analysis. Intriguingly, expression validation of selected elements by RT–PCR analysis revealed multiple transcripts not seen in the EST databases. The differing transcripts were generated by alternative selection of splice sites during pre-mRNA processing. Selection of splice sites was not random since different patterns of splicing were observed in the root and shoot tissues. In one case, an exon residing in close proximity but outside of the Helitron was found conjoined with Helitron-derived exons in the mature transcript. Hence, Helitrons have the ability to synthesize new genes not only by placing unrelated exons into common transcripts, but also by transcription readthrough and capture of nearby exons. Thus, Helitrons have a phenomenal ability to “display” new coding regions for possible selection in nature. A highly conservative, minimum estimate of the number of new transcripts expressed by Helitrons is ∼11,000 or ∼25% of the total number of genes in the maize genome
Fusion of the human gene for the polyubiquitination coeffector UEV1 with Kua, a newly identified gene
UEV proteins are enzymatically inactive variants of the E2 ubiquitin-conjugating enzymes that regulate noncanonical elongation of ubiquitin chains. In Saccharomyces cerevisiae, UEV is part of the RAD6-mediated error-free DNA repair pathway. In mammalian cells, UEV proteins can modulate c-FOS transcription and the G2-M transition of the cell cycle. Here we show that the UEV genes from phylogenetically distant organisms present a remarkable conservation in their exon-intron structure. We also show that the human UEV1 gene is fused with the previously unknown gene Kua. In Caenorhabditis elegans and Drosophila melanogaster, Kua and UEV are in separate loci, and are expressed as independent transcripts and proteins. In humans, Kua and UEV1 are adjacent genes, expressed either as separate transcripts encoding independent Kua and UEV1 proteins, or as a hybrid Kua-UEV transcript, encoding a two-domain protein. Kua proteins represent a novel class of conserved proteins with juxtamembrane histidine-rich motifs. Experiments with epitope-tagged proteins show that UEV1A is a nuclear protein, whereas both Kua and Kua-UEV localize to cytoplasmic structures, indicating that the Kua domain determines the cytoplasmic localization of Kua-UEV. Therefore, the addition of a Kua domain to UEV in the fused Kua-UEV protein confers new biological properties to this regulator of variant polyubiquitination
Gene models from ESTs (GeneModelEST): an application on the Solanum lycopersicum genome
Background: The structure annotation of a genome is based either on ab
initio methodologies or on similarity searches versus molecules that
have been already annotated. Ab initio gene predictions in a genome are
based on a priori knowledge of species-specific features of genes. The
training of ab initio gene finders is based on the definition of a
data-set of gene models. To accomplish this task the common approach is
to align species-specific full length cDNA and EST sequences along the
genomic sequences in order to define exon/intron structure of mRNA
coding genes.
Results: GeneModelEST is the software proposed here for defining a
data-set of candidate gene models using exclusively evidence derived
from cDNA/EST sequences.
GeneModelEST requires the genome coordinates of the spliced-alignments
of ESTs and of contigs (tentative consensus sequences) generated by an
EST clustering/assembling procedure to be formatted in a General Feature
Format (GFF) standard file. Moreover, the alignments of the contigs
versus a protein database are required as an NCBI BLAST formatted report
file.
The GeneModelEST analysis aims to i) evaluate each exon as defined from
contig spliced alignments onto the genome sequence; ii) classify the
contigs according to quality levels in order to select candidate gene
models; iii) assign to the candidate gene models preliminary functional
annotations.
We discuss the application of the proposed methodology to build a
data-set of gene models of Solanum lycopersicum, whose genome sequencing
is an ongoing effort by the International Tomato Genome Sequencing
Consortium.
Conclusion: The contig classification procedure used by GeneModelEST
supports the detection of candidate gene models and the identification of
potential alternative transcripts, and it is useful for filtering out
ambiguous information. An automated procedure, such as the one proposed
here, is fundamental to support large scale analysis in order to provide
species-specific gene models, which could be useful as a training
data-set for ab initio gene finders and/or as a reference gene list for
a human curated annotation
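As a rough illustration of the kind of input described above, the following sketch reads exon coordinates from GFF-style lines and derives the introns implied by consecutive exons. It assumes the standard nine tab-separated GFF columns and, purely for illustration, treats the whole ninth (attributes) column as the contig identifier. The function names `exons_by_contig` and `introns` are hypothetical and are not part of GeneModelEST.

```python
from collections import defaultdict


def exons_by_contig(gff_lines, feature_type="exon"):
    """Group (start, end) exon coordinates by contig identifier.

    Assumption: each record has nine tab-separated columns and the ninth
    column holds the contig name; real GFF attribute syntax varies.
    """
    exons = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue  # skip malformed records and other feature types
        exons[cols[8]].append((int(cols[3]), int(cols[4])))
    return {contig: sorted(coords) for contig, coords in exons.items()}


def introns(exon_list):
    """Return the gaps between consecutive exons of a sorted exon list."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exon_list, exon_list[1:])
        if next_start > prev_end + 1
    ]


if __name__ == "__main__":
    # Made-up records for a single contig aligned to a made-up sequence.
    records = [
        "chr1\texample\texon\t301\t400\t.\t+\t.\tcontig1",
        "chr1\texample\texon\t100\t200\t.\t+\t.\tcontig1",
    ]
    grouped = exons_by_contig(records)
    print(grouped["contig1"], introns(grouped["contig1"]))
```

Sorting exons by start coordinate before deriving introns keeps the sketch independent of the record order in the file.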