The reference annotations made for a genome sequence provide the framework
for all subsequent analyses of the genome. Correct annotation is particularly
important when interpreting the results of RNA-seq experiments where short
sequence reads are mapped against the genome and assigned to genes according to
the annotation. Inconsistencies in annotations between the reference and the
experimental system can lead to incorrect interpretation of the effect on RNA
expression of an experimental treatment or mutation in the system under study.
Until recently, the genome-wide annotation of 3-prime untranslated regions
received less attention than coding regions and the delineation of intron/exon
boundaries. In this paper, data produced for samples in human, chicken and A.
thaliana by the novel single-molecule, strand-specific Direct RNA Sequencing
technology from Helicos Biosciences, which locates 3-prime polyadenylation sites
to within +/- 2 nt, were combined with archival EST and RNA-Seq data. Nine
examples are illustrated where this combination of data allowed: (1) gene and
3-prime UTR re-annotation (including extension of one 3-prime UTR by 5.9 kb);
(2) disentangling of gene expression in complex regions; (3) clearer
interpretation of small RNA expression; and (4) identification of novel genes.
While the specific examples displayed here may become obsolete as genome
sequences and their annotations are refined, the principles laid out in this
paper will be of general use both to those annotating genomes and those seeking
to interpret existing publicly available annotations in the context of their
own experimental data.

Comment: 44 pages, 9 figures