    Comparing variant calling algorithms for target-exon sequencing in a large sample

    Abstract Background Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing. Results Using these calling methods, we detected over 27,500 variants at the targeted exons; >57% were singletons. The singletons identified by individual-based analyses were of the highest quality. However, individual-based analyses generated more missing genotypes (4.72%) than population-based (0.47%) and LD-aware (0.17%) analyses. Moreover, individual-based genotypes were the least concordant with array-based genotypes and replicates. Population-based genotypes were less concordant than genotypes from LD-aware analyses with extended haplotypes. We reanalyzed the same dataset with a second set of callers and showed again that the individual-based caller identified more high-quality singletons than the population-based caller. We also replicated this result in a second dataset of 57 genes sequenced at 127.5x in 3,124 individuals. Conclusions We recommend population-based analyses for high quality variant calls with few missing genotypes. With extended haplotypes, LD-aware methods generate the most accurate and complete genotypes. 
In addition, individual-based analyses should complement the above methods to obtain the most singleton variants. http://deepblue.lib.umich.edu/bitstream/2027.42/110906/1/12859_2015_Article_489.pd
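
The two genotype-quality measures named in this abstract, missing-genotype rate and concordance with array genotypes or replicates, can be illustrated with a minimal sketch. The genotype encoding (0, 1, 2 alternate alleles, `None` for missing), the function names, and the toy data are assumptions for illustration, not the authors' implementation.

```python
from typing import Optional, Sequence

# Assumed encoding: alternate-allele count 0/1/2, None for a missing genotype.
Genotype = Optional[int]


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of genotypes that are missing."""
    if not calls:
        raise ValueError("no genotypes supplied")
    return sum(g is None for g in calls) / len(calls)


def concordance(a: Sequence[Genotype], b: Sequence[Genotype]) -> float:
    """Fraction of sites called in both sets at which the genotypes agree."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must cover the same sites")
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no site is called in both sets")
    return sum(x == y for x, y in pairs) / len(pairs)


# Toy data: sequencing calls versus array genotypes at six sites.
sequenced = [0, 1, None, 2, 0, 1]
array = [0, 1, 1, 2, 1, None]

print(missing_rate(sequenced))        # one of six sequenced genotypes is missing
print(concordance(sequenced, array))  # agreement over the four jointly called sites
```

Concordance here is computed only over sites called in both sets, so a caller with many missing genotypes is not penalised twice; whether the study used the same convention is not stated in the abstract.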

    Target enrichment using parallel nanoliter quantitative PCR amplification

    Background: Next-generation targeted resequencing is rapidly replacing Sanger sequencing in routine genetic diagnosis. There is a strong need for well-validated, high-quality enrichment platforms to complement bench-top next-generation sequencing devices. Results: We used the WaferGen Smartchip platform to perform highly parallelized PCR-based target enrichment for a set of known cancer genes in a well-characterized set of cancer cell lines from the NCI60 panel. Optimization of PCR assay design and cycling conditions resulted in a high enrichment efficiency. We provide proof of a high mutation rediscovery rate and have included technical replicates to enable SNP calling validation, demonstrating the high reproducibility of our enrichment platform. Conclusions: Here we present our custom-developed quantitative PCR-based target enrichment platform. Using highly parallel nanoliter singleplex PCR reactions makes this a flexible and efficient platform. The high mutation validation rate shows this platform’s promise as a targeted resequencing method for multi-gene routine sequencing diagnostics.

    Development and Validation of Targeted Next-Generation Sequencing Panels for Detection of Germline Variants in Inherited Diseases.

    Context.-The number of targeted next-generation sequencing (NGS) panels for genetic diseases offered by clinical laboratories is rapidly increasing. Before an NGS-based test is implemented in a clinical laboratory, appropriate validation studies are needed to determine the performance characteristics of the test. Objective.-To provide examples of assay design and validation of targeted NGS gene panels for the detection of germline variants associated with inherited disorders. Data Sources.-The approaches used by 2 clinical laboratories for the development and validation of targeted NGS gene panels are described. Important design and validation considerations are examined. Conclusions.-Clinical laboratories must validate performance specifications of each test prior to implementation. Test design specifications and validation data are provided, outlining important steps in validation of targeted NGS panels by clinical diagnostic laboratories.

    Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing

    BACKGROUND: To facilitate the clinical implementation of genomic medicine by next-generation sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be. METHODS: We sequenced 15 exomes from four families using commercial kits (Illumina HiSeq 2000 platform and Agilent SureSelect version 2 capture kit), with approximately 120X mean coverage. We analyzed the raw data using near-default parameters with five different alignment and variant-calling pipelines (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMtools). We additionally sequenced a single whole genome using the sequencing and analysis pipeline from Complete Genomics (CG), with 95% of the exome region being covered by 20 or more reads per base. Finally, we validated 919 single-nucleotide variations (SNVs) and 841 insertions and deletions (indels), including similar fractions of GATK-only, SOAP-only, and shared calls, on the MiSeq platform by amplicon sequencing with approximately 5000X mean coverage. RESULTS: SNV concordance between five Illumina pipelines across all 15 exomes was 57.4%, while 0.5 to 5.1% of variants were called as unique to each pipeline. Indel concordance was only 26.8% between three indel-calling pipelines, even after left-normalizing and intervalizing genomic coordinates by 20 base pairs. Of the CG variants falling within the regions targeted in exome sequencing, 11% were not called by any of the Illumina-based exome analysis pipelines. Based on targeted amplicon sequencing on the MiSeq platform, 97.1%, 60.2%, and 99.1% of the GATK-only, SOAP-only and shared SNVs could be validated, but only 54.0%, 44.6%, and 78.1% of the GATK-only, SOAP-only and shared indels could be validated. 
Additionally, our analysis of two families (one with four individuals and the other with seven) demonstrated additional accuracy gained in variant discovery by having access to genetic data from a multi-generational family. CONCLUSIONS: Our results suggest that more caution should be exercised in genomic medicine settings when analyzing individual genomes, including interpreting positive and negative findings with scrutiny, especially for indels. We advocate for renewed collection and sequencing of multi-generational families to increase the overall accuracy of whole genomes.
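
The concordance figures reported in this abstract can be pictured as set operations over the call sets. The sketch below is a hedged illustration: it assumes concordance means "called by every pipeline, as a fraction of the union of all calls", which the abstract does not spell out, and the pipeline labels and variants are invented toy data.

```python
from typing import Dict, Set, Tuple

# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def concordance_summary(calls: Dict[str, Set[Variant]]) -> Tuple[float, Dict[str, float]]:
    """Return (shared fraction, per-pipeline unique fraction) relative to the union.

    Assumed definitions: "shared" = called by every pipeline;
    "unique" = called by exactly one pipeline.
    """
    if not calls:
        raise ValueError("no pipelines supplied")
    union = set().union(*calls.values())
    if not union:
        raise ValueError("no variants called by any pipeline")
    shared = set.intersection(*calls.values())
    unique = {}
    for name, variants in calls.items():
        others = set().union(*(v for n, v in calls.items() if n != name))
        unique[name] = len(variants - others) / len(union)
    return len(shared) / len(union), unique


# Toy call sets for three hypothetical pipelines.
v1, v2 = ("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")
v3, v4 = ("chr2", 300, "G", "A"), ("chr2", 410, "T", "C")
toy_calls = {
    "pipeline_a": {v1, v2, v3},
    "pipeline_b": {v1, v2, v4},
    "pipeline_c": {v1, v2},
}

shared_fraction, unique_fractions = concordance_summary(toy_calls)
print(shared_fraction)   # two of the four distinct variants are called by all three
print(unique_fractions)  # fraction of the union seen by only one pipeline
```

Exact tuple matching suits SNVs; indels would need coordinate normalisation and a tolerance window first, as the abstract notes for its 20-base-pair intervalizing step.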

    Genome analysis software for second-generation sequencers (Genomin analyysiohjelmisto toisen sukupolven sekvensaattoreille)

    The next-generation sequencing (NGS) platforms produce a large amount of sequence in a short amount of time compared with first-generation sequencers. An overview of the NGS platforms is provided, with a more in-depth look at the Illumina Genome Analyzer II, which was used to create the data for this thesis. The thesis had two main aims: first, to create a pipeline that can be used to analyse genomic sequencing data; second, to use the pipeline to compare whole human exome capture methods from two manufacturers, Roche Nimblegen and Agilent. The pipeline is described in detail in the materials and methods section. All inputs for the pipeline are described and examples are shown. In the pipeline, the given sequences are first aligned against the reference genome. Several separate analyses are then performed to retrieve variants and the coverage of the sequencing. Supplementary results include paired-end anomalies, larger insertion and deletion polymorphisms, and an assembly of non-aligned sequences. The two capture methods are also described, and changes to the manufacturers' recommended protocols are listed. Finally, the section lists the options and inputs used in the pipeline runs of the exome data. The result of the pipeline is a basic level of analysis of the sequencing, as well as various graphs showing the quality of the run. All output files intended for the user are described. Using the results of the pipeline, the user can carry out more in-depth analysis as required by the project. When the two exome capture methods were compared, the Nimblegen capture was shown to be more efficient in capturing the CCDS exome. While the Agilent capture kit provided better one-fold coverage over the exome, higher-fold coverage (over 10-fold), which is required for reliable variant calling in next-generation sequencing, was better reached using the Nimblegen capture kit. 
Also, significantly fewer false positive paired-end anomalies were observed in the library created by using the Nimblegen capture.
Second-generation sequencing instruments produce a remarkably large amount of sequence in a short time compared with first-generation instruments. The background section gives an overview of the operating principles of the various second-generation sequencers. The Illumina Genome Analyzer II, which produced the sequences for this thesis, is examined in more detail. This thesis has two aims. The first is to build analysis software for genomic sequencing. The second is to use this software to compare methods for sequencing the exons of all human genes from two manufacturers, Roche Nimblegen and Agilent. The materials and methods section describes how the software works in more detail. Every input file given to the software is described and illustrated with an example. The software aligns the short sequences produced by the sequencer against the reference genome, searches the alignment for variant sites, and reports how well the sequences cover the planned genomic regions. In addition, the output files include paired-end anomalies and changes caused by larger insertions or deletions, and an attempt is made to assemble unaligned sequences into larger pieces. The sequencing kits from the two manufacturers are also presented, and the changes made to the manufacturers' recommended protocols are listed. The final part goes through the files given to the software runs used in this work and other related changes. The software produces a basic-level analysis of the sequencing and its quality. All output files are explained to the user. Based on the results, the user can then carry out deeper analysis as the project requires. In the exome comparison, the Nimblegen method appears to be better at sequencing the target region, under both its own and an independent region definition. 
The Agilent method produced broader one-fold sequence coverage of the exons of the human genome, which is nevertheless too low for reliable variant identification. The Nimblegen method, by contrast, covered more of the targeted sequence regions when coverage sufficient for variant identification (at least 10 sequences) was required. The Nimblegen method also produced fewer erroneous sequence anomalies.
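
The contrast drawn in this abstract between one-fold coverage and at-least-10-fold coverage amounts to measuring coverage breadth at two depth thresholds. The following is a minimal sketch under stated assumptions: the per-base depth lists are invented toy data, and `breadth_at` is a hypothetical helper, not part of the thesis pipeline.

```python
from typing import Sequence


def breadth_at(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of target bases covered by at least `min_depth` reads."""
    if not depths:
        raise ValueError("no target bases supplied")
    return sum(d >= min_depth for d in depths) / len(depths)


# Toy per-base depths over the same ten target bases for two hypothetical kits.
kit_wide_shallow = [1, 2, 3, 1, 4, 2, 1, 12, 0, 1]
kit_narrow_deep = [15, 20, 0, 0, 11, 30, 12, 0, 10, 18]

for name, depths in [("wide_shallow", kit_wide_shallow),
                     ("narrow_deep", kit_narrow_deep)]:
    print(name, breadth_at(depths, 1), breadth_at(depths, 10))
```

In this toy data the first kit touches more bases at least once, while the second covers far more bases at the depth needed for variant calling, which mirrors the kind of trade-off the abstract describes.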

    SynthEx: a synthetic-normal-based DNA sequencing tool for copy number alteration detection and tumor heterogeneity profiling

    TCGA head and neck squamous cell carcinoma clinical information of tumors used in comparisons (n = 100). (XLSX 55 kb)

    Detailed molecular characterisation of acute myeloid leukaemia with a normal karyotype using targeted DNA capture

    This work is licensed under a Creative Commons Attribution 3.0 Unported License. Advances in sequencing technologies are giving unprecedented insights into the spectrum of somatic mutations underlying acute myeloid leukaemia with a normal karyotype (AML-NK). It is clear that the prognosis of individual patients is strongly influenced by the combination of mutations in their leukaemia and that many leukaemias are composed of multiple subclones, with differential susceptibilities to treatment. Here, we describe a method, employing targeted capture coupled with next-generation sequencing and tailored bioinformatic analysis, for the simultaneous study of 24 genes recurrently mutated in AML-NK. Mutational analysis was performed using open source software and an in-house script (Mutation Identification and Analysis Software), which identified dominant clone mutations with 100% specificity. In each of seven cases of AML-NK studied, we identified and verified mutations in 2-4 genes in the main leukaemic clone. Additionally, high sequencing depth enabled us to identify putative subclonal mutations and detect leukaemia-specific mutations in DNA from remission marrow. Finally, we used normalised read depths to detect copy number changes and identified and subsequently verified a tandem duplication of exons 2-9 of MLL and at least one deletion involving PTEN. This methodology reliably detects sequence and copy number mutations, and can thus greatly facilitate the classification, clinical research, diagnosis and management of AML-NK. We acknowledge the use of the National Institute of Health Research (NIHR) Biomedical Research Centre, University of Cambridge. We thank Drs J Craig and C Crawley of Cambridge University NHS Hospitals trust for allowing us to approach their patients for samples. GV is funded by a Wellcome Trust Senior Fellowship in Clinical Science. 
Work in GV’s laboratory is also funded by Leukaemia Lymphoma Research and the Kay Kendall Leukaemia Fund. Peer Reviewed
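
The use of normalised read depths to detect copy number changes, mentioned in the abstract above, can be sketched as a log2 ratio of depth shares between a test sample and a control. This is a generic illustration under stated assumptions, not the authors' script: the region names, the depths, and the choice of a single control sample are all hypothetical.

```python
import math
from typing import Dict


def normalise(depths: Dict[str, float]) -> Dict[str, float]:
    """Scale per-region depths so that they sum to 1 within a sample."""
    total = sum(depths.values())
    if total <= 0:
        raise ValueError("sample has no reads")
    return {region: d / total for region, d in depths.items()}


def log2_ratios(sample: Dict[str, float], control: Dict[str, float]) -> Dict[str, float]:
    """log2(sample share / control share) for regions with reads in both."""
    s, c = normalise(sample), normalise(control)
    return {
        region: math.log2(s[region] / c[region])
        for region in s
        if s[region] > 0 and c.get(region, 0) > 0
    }


# Toy mean depths per region; "exon2" has doubled depth in the test sample.
control_depths = {"exon1": 100, "exon2": 100, "exon3": 100, "exon4": 100}
sample_depths = {"exon1": 100, "exon2": 200, "exon3": 100, "exon4": 100}

ratios = log2_ratios(sample_depths, control_depths)
for region, value in sorted(ratios.items()):
    print(region, round(value, 2))
```

Because each sample is scaled to its own total, a gain in one region slightly lowers the ratios of the others; with many regions per panel this dilution becomes small, and candidate changes would still need independent verification, as the abstract reports.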
