Identification of errors introduced during high throughput sequencing of the T cell receptor repertoire

Phuong Nguyen; Jing Ma; Deqing Pei; Caroline Obert; Cheng Cheng; Terrence L Geiger; A Casrouge; TP Arstila; MG Rudolph; JJ Moon; KK Wynn; EK Day; E Jouvin-Marche; R Pacholczyk; JR Currier; P Boudinot; K Kedzierska; X Liu; CS Hsieh; HS Robins; HS Robins; JD Freeman; PL Klarenbeek; P Nguyen; O Morozova; RK Thomas; ME Wallace; BJ Manfras; V Venturi; B Ewing; A Blank; JF Sydow; EE Schadt

Publication date: 1 January 2008
Publisher: BioMed Central

Abstract

Background: Recent advances in massively parallel sequencing have increased the depth at which T cell receptor (TCR) repertoires can be probed by >3 log10, allowing for saturation sequencing of immune repertoires. The resolution of this sequencing is dependent on its accuracy, and direct assessments of the errors formed during high throughput repertoire analyses are limited.

Results: We analyzed 3 monoclonal TCR from TCR transgenic, Rag-/- mice using Illumina® sequencing. A total of 27 sequencing reactions were performed for each TCR using a trifurcating design in which samples were divided into 3 at significant processing junctures. More than 20 million complementarity determining region (CDR) 3 sequences were analyzed. Filtering for lower quality sequences diminished but did not eliminate sequence errors, which occurred within 1-6% of sequences. Erroneous sequences were predominantly of correct length and contained single nucleotide substitutions. Rates of specific substitutions varied dramatically in a position-dependent manner. Four substitutions, all purine-pyrimidine transversions, predominated. Solid phase amplification and sequencing rather than liquid sample amplification and preparation appeared to be the primary sources of error. Analysis of polyclonal repertoires demonstrated the impact of error accumulation on data parameters.

Conclusions: Caution is needed in interpreting repertoire data due to potential contamination with mis-sequence reads. However, a high association of errors with phred score, high relatedness of erroneous sequences with the parental sequence, dominance of specific nt substitutions, and skewed ratio of forward to reverse reads among erroneous sequences indicate approaches to filter erroneous sequences from repertoire data sets.
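The kind of error tally described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the input format (pairs of a CDR3 sequence and its per-base phred scores), and the phred cutoff of 30 are all assumptions chosen for illustration. The sketch compares each read from a monoclonal sample with the known parental sequence, separates length errors from substitutions, and flags each substitution as a purine-pyrimidine transversion or not.

```python
from collections import Counter

PURINES = {"A", "G"}


def is_transversion(ref_base, alt_base):
    """True when exactly one of the two bases is a purine."""
    return (ref_base in PURINES) != (alt_base in PURINES)


def substitutions(parental, read):
    """Return (position, ref, alt) mismatches for a read of parental length."""
    return [(i, r, a) for i, (r, a) in enumerate(zip(parental, read)) if r != a]


def summarize_errors(parental, reads, min_phred=30):
    """Tally errors among reads from a monoclonal sample.

    `reads` is an iterable of (sequence, phred_scores) pairs with non-empty
    score lists; `min_phred` is a hypothetical quality cutoff, not a value
    taken from the paper.
    """
    # Quality filter: keep reads whose lowest per-base phred score passes.
    kept = [seq for seq, quals in reads if min(quals) >= min_phred]

    counts = Counter()           # read-level classification
    by_substitution = Counter()  # (position, ref, alt, is_transversion) tallies
    for seq in kept:
        if seq == parental:
            counts["correct"] += 1
        elif len(seq) != len(parental):
            counts["length_error"] += 1
        else:
            subs = substitutions(parental, seq)
            label = "single_substitution" if len(subs) == 1 else "multiple_substitutions"
            counts[label] += 1
            for pos, ref, alt in subs:
                by_substitution[(pos, ref, alt, is_transversion(ref, alt))] += 1

    total = len(kept)
    error_rate = (total - counts["correct"]) / total if total else 0.0
    return counts, by_substitution, error_rate


# Toy data (invented for illustration, not from the study).
parental = "TGTGCC"
reads = [
    ("TGTGCC", [35] * 6),  # correct
    ("TGTGCC", [35] * 6),  # correct
    ("TGTTCC", [35] * 6),  # single G>T substitution at position 3
    ("TGTGC", [35] * 5),   # wrong length
    ("TGAGCC", [10] * 6),  # removed by the quality filter
]
counts, by_substitution, error_rate = summarize_errors(parental, reads)
```

Keying the substitution tally on position as well as base change makes position-dependent rates visible, which matches the abstract's observation that specific substitutions varied by position. A real filter would also need strand information to examine the forward-to-reverse read ratio, which this sketch omits.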