70 research outputs found
Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes
<p>Abstract</p> <p>Background</p> <p>The accurate detection of genes and the identification of functional regions is still an open issue in the annotation of genomic sequences. This problem affects new genomes but also those of very well studied organisms such as human and mouse where, despite the great efforts, the inventory of genes and regulatory regions is far from complete. Comparative genomics is an effective approach to address this problem. Unfortunately it is limited by the computational requirements needed to perform genome-wide comparisons and by the problem of discriminating between conserved coding and non-coding sequences. This discrimination is often based (thus dependent) on the availability of annotated proteins.</p> <p>Results</p> <p>In this paper we present the results of a comprehensive comparison of human and mouse genomes performed with a new high throughput grid-based system which allows the rapid detection of conserved sequences and accurate assessment of their coding potential. By detecting clusters of coding conserved sequences the system is also suitable to accurately identify potential gene loci.</p> <p>Following this analysis we created a collection of human-mouse conserved sequence tags and carefully compared our results to reliable annotations in order to benchmark the reliability of our classifications. Strikingly we were able to detect several potential gene loci supported by EST sequences but not corresponding to as yet annotated genes.</p> <p>Conclusion</p> <p>Here we present a new system which allows comprehensive comparison of genomes to detect conserved coding and non-coding sequences and the identification of potential gene loci. Our system does not require the availability of any annotated sequence thus is suitable for the analysis of new or poorly annotated genomes.</p
Untranslated regions of mRNAs
Gene expression is finely regulated at the post-transcriptional level. Features of the untranslated regions of mRNAs that control their translation, degradation and localization include stem-loop structures, upstream initiation codons and open reading frames, internal ribosome entry sites and various cis-acting elements that are bound by RNA-binding proteins
UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs
The 5Ⲡand 3Ⲡuntranslated regions of eukaryotic mRNAs play crucial roles in the post-transcriptional regulation of gene expression through the modulation of nucleo-cytoplasmic mRNA transport, translation efficiency, subcellular localization and message stability. UTRdb is a curated database of 5Ⲡand 3Ⲡuntranslated sequences of eukaryotic mRNAs, derived from several sources of primary data. Experimentally validated functional motifs are annotated (and also collated as the UTRsite database) and cross-links to genomic and protein data are provided. The integration of UTRdb with genomic and protein data has allowed the implementation of a powerful retrieval resource for the selection and extraction of UTR subsets based on their genomic coordinates and/or features of the protein encoded by the relevant mRNA (e.g. GO term, PFAM domain, etc.). All internet resources implemented for retrieval and functional analysis of 5Ⲡand 3Ⲡuntranslated regions of eukaryotic mRNAs are accessible at http://www.ba.itb.cnr.it/UTR/
Filtering "genic" open reading frames from genomic DNA samples for advanced annotation
<p>Abstract</p> <p>Background</p> <p>In order to carry out experimental gene annotation, DNA encoding open reading frames (ORFs) derived from real genes (termed "genic") in the correct frame is required. When genes are correctly assigned, isolation of genic DNA for functional annotation can be carried out by PCR. However, not all genes are correctly assigned, and even when correctly assigned, gene products are often incorrectly folded when expressed in heterologous hosts. This is a problem that can sometimes be overcome by the expression of protein fragments encoding domains, rather than full-length proteins. One possible method to isolate DNA encoding such domains would to "filter" complex DNA (cDNA libraries, genomic and metagenomic DNA) for gene fragments that confer a selectable phenotype relying on correct folding, with all such domains present in a complex DNA sample, termed the âdomainomeâ.</p> <p>Results</p> <p>In this paper we discuss the preparation of diverse genic ORF libraries from randomly fragmented genomic DNA using Ă-lactamase to filter out the open reading frames. By cloning DNA fragments between leader sequences and the mature Ă-lactamase gene, colonies can be selected for resistance to ampicillin, conferred by correct folding of the lactamase gene. Our experiments demonstrate that the majority of surviving colonies contain genic open reading frames, suggesting that Ă-lactamase is acting as a selectable folding reporter. Furthermore, different leaders (Sec, TAT and SRP), normally translocating different protein classes, filter different genic fragment subsets, indicating that their use increases the fraction of the âdomainoneâ that is accessible.</p> <p>Conclusions</p> <p>The availability of ORF libraries, obtained with the filtering method described here, combined with screening methods such as phage display and protein-protein interaction studies, or with protein structure determination projects, can lead to the identification and structural determination of functional genic ORFs. ORF libraries represent, moreover, a useful tool to proceed towards high-throughput functional annotation of newly sequenced genomes.</p
STAble: A novel approach to de novo assembly of RNA-seq data and its application in a metabolic model network based metatranscriptomic workflow
Background: De novo assembly of RNA-seq data allows the study of transcriptome in absence of a reference genome either if data is obtained from a single organism or from a mixed sample as in metatranscriptomics studies. Given the high number of sequences obtained from NGS approaches, a critical step in any analysis workflow is the assembly of reads to reconstruct transcripts thus reducing the complexity of the analysis. Despite many available tools show a good sensitivity, there is a high percentage of false positives due to the high number of assemblies considered and it is likely that the high frequency of false positive is underestimated by currently used benchmarks. The reconstruction of not existing transcripts may false the biological interpretation of results as - for example - may overestimate the identification of "novel" transcripts. Moreover, benchmarks performed are usually based on RNA-seq data from annotated genomes and assembled transcripts are compared to annotations and genomes to identify putative good and wrong reconstructions, but these tests alone may lead to accept a particular type of false positive as true, as better described below. Results: Here we present a novel methodology of de novo assembly, implemented in a software named STAble (Short-reads Transcriptome Assembler). The novel concept of this assembler is that the whole reads are used to determine possible alignments instead of using smaller k-mers, with the aim of reducing the number of chimeras produced. Furthermore, we applied a new set of benchmarks based on simulated data to better define the performance of assembly method and carefully identifying true reconstructions. STAble was also used to build a prototype workflow to analyse metatranscriptomics data in connection to a steady state metabolic modelling algorithm. This algorithm was used to produce high quality metabolic interpretations of small gene expression sets obtained from already published RNA-seq data that we assembled with STAble. Conclusions: The presented results, albeit preliminary, clearly suggest that with this approach is possible to identify informative reactions not directly revealed by raw transcriptomic data
Identification of tumor-associated cassette exons in human cancer through EST-based computational prediction and experimental validation
Background: Many evidences report that alternative splicing, the mechanism which produces mRNAs and proteins with different structures and functions from the same gene, is altered in cancer cells. Thus, the identification and characterization of cancer-specific splice variants may give large impulse to the discovery of novel diagnostic and prognostic tumour biomarkers, as well as of new targets for more selective and effective therapies. Results: We present here a genome-wide analysis of the alternative splicing pattern of human genes through a computational analysis of normal and cancer-specific ESTs from seventeen anatomical groups, using data available in AspicDB, a database resource for the analysis of alternative splicing in human. By using a statistical methodology, normal and cancer-specific genes, splice sites and cassette exons were predicted in silico. The condition association of some of the novel normal/tumoral cassette exons was experimentally verified by RT-qPCR assays in the same anatomical system where they were predicted. Remarkably, the presence in vivo of the predicted alternative transcripts, specific for the nervous system, was confirmed in patients affected by glioblastoma. Conclusion: This study presents a novel computational methodology for the identification of tumor-associated transcript variants to be used as cancer molecular biomarkers, provides its experimental validation, and reports specific biomarkers for glioblastoma
DG-CST (Disease Gene Conserved Sequence Tags), a database of humanâmouse conserved elements associated to disease genes
The identification and study of evolutionarily conserved genomic sequences that surround disease-related genes is a valuable tool to gain insight into the functional role of these genes and to better elucidate the pathogenetic mechanisms of disease. We created the DG-CST (Disease Gene Conserved Sequence Tags) database for the identification and detailed annotation of humanâmouse conserved genomic sequences that are localized within or in the vicinity of human disease-related genes. CSTs are defined as sequences that show at least 70% identity between human and mouse over a length of at least 100 bp. The database contains CST data relative to over 1088 genes responsible for monogenetic human genetic diseases or involved in the susceptibility to multifactorial/polygenic diseases. DG-CST is accessible via the internet at http://dgcst.ceinge.unina.it/ and may be searched using both simple and complex queries. A graphic browser allows direct visualization of the CSTs and related annotations within the context of the relative gene and its transcripts
STAble: a novel approach to de novo assembly of RNA-seq data and its application in a metabolic model network based metatranscriptomic workflow.
BACKGROUND: De novo assembly of RNA-seq data allows the study of transcriptome in absence of a reference genome either if data is obtained from a single organism or from a mixed sample as in metatranscriptomics studies. Given the high number of sequences obtained from NGS approaches, a critical step in any analysis workflow is the assembly of reads to reconstruct transcripts thus reducing the complexity of the analysis. Despite many available tools show a good sensitivity, there is a high percentage of false positives due to the high number of assemblies considered and it is likely that the high frequency of false positive is underestimated by currently used benchmarks. The reconstruction of not existing transcripts may false the biological interpretation of results as - for example - may overestimate the identification of "novel" transcripts. Moreover, benchmarks performed are usually based on RNA-seq data from annotated genomes and assembled transcripts are compared to annotations and genomes to identify putative good and wrong reconstructions, but these tests alone may lead to accept a particular type of false positive as true, as better described below. RESULTS: Here we present a novel methodology of de novo assembly, implemented in a software named STAble (Short-reads Transcriptome Assembler). The novel concept of this assembler is that the whole reads are used to determine possible alignments instead of using smaller k-mers, with the aim of reducing the number of chimeras produced. Furthermore, we applied a new set of benchmarks based on simulated data to better define the performance of assembly method and carefully identifying true reconstructions. STAble was also used to build a prototype workflow to analyse metatranscriptomics data in connection to a steady state metabolic modelling algorithm. This algorithm was used to produce high quality metabolic interpretations of small gene expression sets obtained from already published RNA-seq data that we assembled with STAble. CONCLUSIONS: The presented results, albeit preliminary, clearly suggest that with this approach is possible to identify informative reactions not directly revealed by raw transcriptomic data
Identification of novel proteins binding the AU-rich element of \u3b1-prothymosin mRNA through the selection of open reading frames (RIDome)
We describe here a platform for high-throughput protein expression and interaction analysis aimed at identifying the RNA-interacting domainome. This approach combines the selection of a phage library displaying "filtered" open reading frames with next-generation DNA sequencing. The method was validated using an RNA bait corresponding to the AU-rich element of \u3b1-prothymosin, an RNA motif that promotes mRNA stability and translation through its interaction with the RNA-binding protein ELAVL1. With this strategy, we not only confirmed known RNA-binding proteins that specifically interact with the target RNA (such as ELAVL1/HuR and RBM38) but also identified proteins not previously known to be ARE-binding (R3HDM2 and RALY). We propose this technology as a novel approach for studying the RNA-binding proteome
- âŚ