21,106 research outputs found
Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology
Genomic Selective Constraints in Murid Noncoding DNA
Recent work has suggested that there are many more selectively constrained, functional noncoding than coding sites in mammalian genomes. However, little is known about how selective constraint varies amongst different classes of noncoding DNA. We estimated the magnitude of selective constraint on a large dataset of mouse-rat gene orthologs and their surrounding noncoding DNA. Our analysis indicates that there are more than three times as many selectively constrained, nonrepetitive sites within noncoding DNA as in coding DNA in murids. The majority of these constrained noncoding sites appear to be located within intergenic regions, at distances greater than 5 kilobases from known genes. Our study also shows that in murids, intron length and mean intronic selective constraint are negatively correlated with intron ordinal number. Our results therefore suggest that functional intronic sites tend to accumulate toward the 5' end of murid genes. Our analysis also reveals that mean number of selectively constrained noncoding sites varies substantially with the function of the adjacent gene. We find that, among others, developmental and neuronal genes are associated with the greatest numbers of putatively functional noncoding sites compared with genes involved in electron transport and a variety of metabolic processes. Combining our estimates of the total number of constrained coding and noncoding bases we calculate that over twice as many deleterious mutations have occurred in intergenic regions as in known genic sequence and that the total genomic deleterious point mutation rate is 0.91 per diploid genome, per generation. This estimated rate is over twice as large as a previous estimate in murids
Transduplication resulted in the incorporation of two protein-coding sequences into the Turmoil-1 transposable element of C. elegans
Transposable elements may acquire unrelated gene fragments into their
sequences in a process called transduplication. Transduplication of
protein-coding genes is common in plants but is unknown in animals. Here,
we report that the Turmoil-1 transposable element in C. elegans has
incorporated two protein-coding sequences into its inverted terminal repeat
(ITR) sequences. The ITRs of Turmoil-1 contain a conserved RNA recognition
motif (RRM) that originated from the rsp-2 gene and a fragment from the
protein-coding region of the cpg-3 gene. We further report that an open reading
frame specific to C. elegans may have been created as a result of a Turmoil-1
insertion. Mutations at the 5' splice site of this open reading frame may have
reactivated the transduplicated RRM motif.
TranspoGene and microTranspoGene: transposed elements influence on the transcriptome of seven vertebrates and invertebrates
Transposed elements (TEs) are mobile genetic sequences. During the evolution
of eukaryotes TEs were inserted into active protein-coding genes, affecting
gene structure, expression and splicing patterns, and protein sequences.
Genomic insertions of TEs also led to creation and expression of new functional
non-coding RNAs such as microRNAs. We have constructed the TranspoGene
database, which covers TEs located inside protein-coding genes of seven species:
human, mouse, chicken, zebrafish, fruit fly, nematode and sea squirt. TEs were
classified according to location within the gene: proximal promoter TEs,
exonized TEs (insertion within an intron that led to exon creation), exonic TEs
(insertion into an existing exon) or intronic TEs. TranspoGene contains
information regarding specific type and family of the TEs, genomic and mRNA
location, sequence, supporting transcript accession and alignment to the TE
consensus sequence. The database also contains host gene specific data: gene
name, genomic location, Swiss-Prot and RefSeq accessions, diseases associated
with the gene and splicing pattern. In addition, we created microTranspoGene: a
database of human, mouse, zebrafish and nematode TE-derived microRNAs. The
TranspoGene and microTranspoGene databases can be used by researchers
interested in the effect of TE insertion on the eukaryotic transcriptome.
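The location categories described in this abstract can be illustrated with a minimal sketch. This is not the TranspoGene pipeline: the `Gene` record, the `classify_te` function, the 1,000 bp promoter window and the `te_created_exon` flag are all hypothetical, and the sketch assumes forward-strand, half-open coordinates. Whether a TE created an exon or landed in an existing one cannot be read from coordinates alone, so the caller has to supply that information.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), forward strand (assumption)


@dataclass
class Gene:
    start: int             # transcription start site
    end: int               # transcript end
    exons: List[Interval]  # annotated exons


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def classify_te(te: Interval, gene: Gene, te_created_exon: bool = False,
                promoter_len: int = 1000) -> str:
    """Assign a TE to one of the location categories named in the abstract.

    te_created_exon is a hypothetical flag: True means the overlapping exon
    arose from the TE (exonized), False means the exon existed first (exonic).
    promoter_len is an assumed window size, not a value from the source.
    """
    if any(overlaps(te, exon) for exon in gene.exons):
        return "exonized" if te_created_exon else "exonic"
    if overlaps(te, (gene.start, gene.end)):
        return "intronic"
    if overlaps(te, (gene.start - promoter_len, gene.start)):
        return "proximal promoter"
    return "outside gene"
```

For example, with a gene spanning 1000 to 5000 and exons at (1000, 1200) and (3000, 3200), a TE at (2000, 2300) falls between the exons and would be labelled intronic.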
The Transcriptional Landscape of Marek’s Disease Virus in Primary Chicken B Cells Reveals Novel Splice Variants and Genes
Marek’s disease virus (MDV) is an oncogenic alphaherpesvirus that infects chickens and poses a serious threat to poultry health. In infected animals, MDV efficiently replicates in B cells in various lymphoid organs. Despite many years of research, the viral transcriptome in primary target cells of MDV remained unknown. In this study, we uncovered the transcriptional landscape of the very virulent RB1B strain and the attenuated CVI988/Rispens vaccine strain in primary chicken B cells using high-throughput RNA-sequencing. Our data confirmed the expression of known genes, but also identified a novel spliced MDV gene in the unique short region of the genome. Furthermore, de novo transcriptome assembly revealed extensive splicing of viral genes resulting in coding and non-coding RNA transcripts. A novel splicing isoform of MDV UL15 could also be confirmed by mass spectrometry and RT-PCR. In addition, we could demonstrate that the associated transcriptional motifs are highly conserved and closely resembled those of the host transcriptional machinery. Taken together, our data allow a comprehensive re-annotation of the MDV genome with novel genes and splice variants that could be targeted in further research on MDV replication and tumorigenesis
The Alternative Choice of Constitutive Exons throughout Evolution
Alternative cassette exons are known to originate from two processes:
exonization of intronic sequences and exon shuffling. Herein, we suggest an
additional mechanism by which constitutively spliced exons become alternative
cassette exons during evolution. We compiled a dataset of orthologous exons
from human and mouse that are constitutively spliced in one species but
alternatively spliced in the other. Examination of these exons suggests that
the common ancestors were constitutively spliced. We show that relaxation of
the 5' splice site during evolution is one of the molecular mechanisms by which
exons shift from constitutive to alternative splicing. This shift is associated
with the fixation of exonic splicing regulatory sequences (ESRs) that are
essential for exon definition and control the inclusion level only after the
transition to alternative splicing. The effect of each ESR on splicing and the
combinatorial effects between two ESRs are conserved from fish to human. Our
results uncover an evolutionary pathway that increases transcriptome diversity
by shifting exons from constitutive to alternative splicing.
RNA-Seq analysis of splicing in Plasmodium falciparum uncovers new splice junctions, alternative splicing and splicing of antisense transcripts.
Over 50% of genes in Plasmodium falciparum, the deadliest human malaria parasite, contain predicted introns, yet experimental characterization of splicing in this organism remains incomplete. We present here a transcriptome-wide characterization of intraerythrocytic splicing events, as captured by RNA-Seq data from four timepoints of a single highly synchronous culture. Gene model-independent analysis of these data, together with publicly available RNA-Seq data, using HMMSplicer, an in-house splice site detection algorithm, revealed a total of 977 new 5' GU-AG 3' and 5 new 5' GC-AG 3' junctions absent from gene models and ESTs (11% increase to the current annotation). In addition, 310 alternative splicing events were detected in 254 (4.5%) genes, most of which truncate open reading frames. Splicing events antisense to gene models were also detected, revealing complex transcriptional arrangements within the parasite's transcriptome. Interestingly, antisense introns overlap sense introns more than would be expected by chance, perhaps indicating a functional relationship between overlapping transcripts or an inherent organizational property of the transcriptome. Independent experimental validation confirmed over 30 new antisense and alternative junctions. Thus, this largest assemblage of new and alternative splicing events to date in Plasmodium falciparum provides a more precise, dynamic view of the parasite's transcriptome.
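The two junction classes counted in this abstract differ only in the first two bases of the intron. A minimal sketch of that distinction follows. It is not HMMSplicer, and the function name `classify_intron` is hypothetical. The input is assumed to be the intron sequence alone, read 5' to 3' on the sense strand, as either DNA or RNA.

```python
def classify_intron(intron_seq: str) -> str:
    """Classify an intron by its terminal dinucleotides.

    Returns "GU-AG", "GC-AG", or "other". DNA input is accepted and
    converted to RNA letters before comparison.
    """
    seq = intron_seq.upper().replace("T", "U")
    if len(seq) < 4:
        raise ValueError("intron too short to hold donor and acceptor dinucleotides")
    donor, acceptor = seq[:2], seq[-2:]
    if acceptor == "AG" and donor == "GU":
        return "GU-AG"
    if acceptor == "AG" and donor == "GC":
        return "GC-AG"
    return "other"
```

A real junction caller also needs read alignments and scoring; this sketch only labels an intron whose boundaries are already known.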
Characteristics of transposable element exonization within human and mouse
Insertion of transposed elements within mammalian genes is thought to be an
important contributor to mammalian evolution and speciation. Insertion of
transposed elements into introns can lead to their activation as alternatively
spliced cassette exons, an event called exonization. Elucidation of the
evolutionary constraints that have shaped fixation of transposed elements
within human and mouse protein coding genes and subsequent exonization is
important for understanding how the exonization process has affected
transcriptome and proteome complexities. Here we show that exonization of
transposed elements is biased towards the beginning of the coding sequence in
both human and mouse genes. Analysis of single nucleotide polymorphisms (SNPs)
revealed that exonization of transposed elements can be population-specific,
implying that exonizations may enhance divergence and lead to speciation. SNP
density analysis revealed differences between Alu and other transposed
elements. Finally, we identified cases of primate-specific Alu elements that
depend on RNA editing for their exonization. These results shed light on TE
fixation and the exonization process within human and mouse genes.
Comment: 11 pages, 4 figures
Needed for completion of the human genome: hypothesis driven experiments and biologically realistic mathematical models
With the sponsorship of "Fundacio La Caixa" we met in Barcelona, November
21st and 22nd, to analyze the reasons why, after the completion of the human
genome sequence, the identification of all protein-coding genes and their variants
remains a distant goal. Here we report on our discussions and summarize some of
the major challenges that need to be overcome in order to complete the human
gene catalog.
Comment: Report and discussion resulting from the "Fundacio La Caixa" gene
finding meeting held November 21 and 22, 2003 in Barcelona