HMMSplicer: A Tool for Efficient and Sensitive Discovery of Known and Novel Splice Junctions in RNA-Seq Data

A Ameur; A Mortazavi; B Langmead; BT Wilhelm; C Sidrauski; C Trapnell; C Trapnell; Cynthia Gibas; D Ramsköld; DA Benson; DW Bryant; ET Wang; F De Bona; F Lu; GA Heap; GE Crooks; H Li; H Li; H Nagasaki; H Richard; H Yoshida; JC Dohm; Joseph L. DeRisi; JS Cox; K Sorber; Katherine Sorber; KD Pruitt; KF Au; L Baum; M Deutsch; M Yano; MC Wahl; Michelle T. Dimon; MJ Gardner; PJ Shepard; Q Pan; R Li; R Lister; S Sen; S Stamm; TW Nilsen; U Nagalakshmi; WJ Kent; WJ Kent; Z Wang

Publication date: 1 January 2010
Publisher: Public Library of Science

Abstract

Background: High-throughput sequencing of an organism’s transcriptome, or RNA-Seq, is a valuable and versatile new strategy for capturing snapshots of gene expression. However, transcriptome sequencing creates a new class of alignment problem: mapping short reads that span exon-exon junctions back to the reference genome, especially when the splice junction is previously unknown.

Methodology/Principal Findings: Here we introduce HMMSplicer, an accurate and efficient algorithm for discovering canonical and non-canonical splice junctions in short read datasets. When tested on publicly available A. thaliana, P. falciparum, and H. sapiens datasets, HMMSplicer identifies more splice junctions than currently available algorithms without a reduction in specificity.

Conclusions/Significance: HMMSplicer was found to perform especially well in compact genomes and on genes with low expression levels, alternative splice isoforms, or non-canonical splice junctions. Because HMMSplicer does not rely on prebuilt gene models, the products of inexact splicing are also detected. For H. sapiens, we find that 3.6% of 3′ splice sites and 1.4% of 5′ splice sites are inexact, typically differing by 3 bases in either direction. In addition, HMMSplicer provides a score for every predicted junction, allowing the user to set a threshold that tunes the false positive rate to the needs of the experiment. HMMSplicer is implemented in Python. Code and documentation are freely available a