
Springer - Publisher Connector

Halvade: scalable sequence analysis with MapReduce

Author: Costanza Pascal
Decap Dries
Fostier Jan
Herzeel Charlotte
Reumers Joke
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2015

Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading. Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
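
The MapReduce pattern named in the Halvade abstract can be illustrated with a small Python sketch. This is not Halvade's code (Halvade is written in Java on Hadoop), and the abstract does not say how work is partitioned. Keying reads by a fixed-size genomic region, the `CHUNK_SIZE` value, and the read-counting reducer are all assumptions made for illustration only.

```python
from collections import defaultdict

CHUNK_SIZE = 1_000_000  # hypothetical region size in base pairs (assumption)


def map_read(read):
    """Map step: emit a (region key, read) pair for one aligned read."""
    chrom, pos, _seq = read
    return (chrom, pos // CHUNK_SIZE), read


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_region(key, reads):
    """Reduce step: stand-in for per-region analysis; here it only counts reads."""
    return key, len(reads)


def run(reads):
    """Run the toy pipeline sequentially; each region could run on its own worker."""
    groups = shuffle(map_read(r) for r in reads)
    return dict(reduce_region(k, v) for k, v in sorted(groups.items()))
```

Because each region key is reduced independently, the reduce calls have no shared state, which is the property that lets a framework spread them over many nodes or cores.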

Springer - Publisher Connector

Archivsystem Ask23

QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles

Author: Aerssens Jeroen
Clement Lieven
Reumers Joke
Thys Kim
van der Borght Koen
van Vlijmen Herman
Verbist Bie
Wetzels Yves
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores, to distinguish true single nucleotide variants from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. Results: For default application of QQ-SNV, variants were called using a SNV probability cutoff of 0.5 (QQ-SNVD). To improve the sensitivity we used a SNV probability cutoff of 0.0001 (QQ-SNVHS). To also increase specificity, SNVs called were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNVHS-P80). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNVD performed similarly to the existing approaches. QQ-SNVHS was more sensitive on all test sets but with more false positives. QQ-SNVHS-P80 was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with lowest spiked-in true frequency of 0.5 %, QQ-SNVHS-P80 revealed a sensitivity of 100 % (vs. 40-60 % for the existing methods) and a specificity of 100 % (vs. 98.0-99.7 % for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. 
Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5 % were consistently detected by QQ-SNVHS-P80 from different generations of Illumina sequencers. Conclusions: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.
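
The three calling modes described in the QQ-SNV abstract (probability cutoff 0.5, probability cutoff 0.0001, and the additional 80th-percentile frequency filter) can be sketched as follows. This is a minimal illustration, not the QQ-SNV implementation: the logistic regression model itself is not reproduced, the input layout (`pos`, `prob`, `freq`) is hypothetical, the list of error frequencies is assumed to be supplied by the caller, and treating a value exactly at a cutoff as a call is also an assumption.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def call_snvs(candidates, prob_cutoff, error_freqs=None, q=80):
    """Return positions called as SNVs.

    candidates: list of dicts with hypothetical keys 'pos', 'prob', 'freq'.
    prob_cutoff: SNV probability cutoff (0.5 or 0.0001 in the abstract).
    error_freqs: optional error frequencies; when given, calls whose frequency
                 lies below their q-th percentile are overruled.
    """
    called = [c for c in candidates if c["prob"] >= prob_cutoff]
    if error_freqs:
        threshold = percentile(error_freqs, q)
        called = [c for c in called if c["freq"] >= threshold]
    return [c["pos"] for c in called]
```

With this layout, `call_snvs(cands, 0.5)` corresponds to the default mode, `call_snvs(cands, 0.0001)` to the high-sensitivity mode, and `call_snvs(cands, 0.0001, error_freqs)` to the high-sensitivity mode with the percentile filter.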

The Francis Crick Institute

SNPeffect: a database mapping molecular phenotypic effects of human non-synonymous coding SNPs

Author: Ferkinghoff-Borg Jesper
Reumers Joke
Rousseau Frederic
Schymkowitz Joost
Serrano Luis
Stricher Francois
Publication venue: Oxford University Press
Publication date: 17/12/2004

Single nucleotide polymorphisms (SNPs) are an increasingly important tool for genetic and biomedical research. However, the accumulated sequence information on allelic variation is not matched by an understanding of the effect of SNPs on the functional attributes or ‘molecular phenotype’ of a protein. Towards this aim we developed SNPeffect, an online resource of human non-synonymous coding SNPs (nsSNPs) mapping phenotypic effects of allelic variation in human genes. SNPeffect contains 31 659 nsSNPs from 12 480 human proteins. The current release of SNPeffect incorporates data on protein stability, integrity of functional sites, protein phosphorylation and glycosylation, subcellular localization, protein turnover rates, protein aggregation, amyloidosis and chaperone interaction. The SNP entries are accessible through both a search and browse interface and are linked to most major biological databases. The data can be displayed as detailed descriptions of individual SNPs or as an overview of all SNPs for a given protein. SNPeffect will be regularly updated and can be accessed at http://snpeffect.vib.be/

CiteSeerX

A random effects model for the identification of differential splicing (REIDS) using exon and HTA arrays

Author: Göhlmann Hinrick W.H.
Kasim Adetayo
Reumers Joke
Shkedy Ziv
Talloen Willem
Van Moerbeke Marijke
Publication venue: BioMed Central
Publication date: 01/05/2017

Background: Alternative gene splicing is a common phenomenon in which a single gene gives rise to multiple transcript isoforms. The process is strictly guided and involves a multitude of proteins and regulatory complexes. Unfortunately, aberrant splicing events do occur, and these have been linked to genetic disorders, such as several types of cancer and neurodegenerative diseases (Fan et al., Theor Biol Med Model 3:19, 2006). Therefore, understanding the mechanism of alternative splicing and identifying the difference in splicing events between diseased and healthy tissue is crucial in biomedical research, with potential applications in personalized medicine as well as in drug development. Results: We propose a linear mixed model, Random Effects for the Identification of Differential Splicing (REIDS), for the identification of alternative splicing events. Based on a set of scores, an exon score and an array score, a decision regarding alternative splicing can be made. The model makes it possible to distinguish a differentially expressed gene from a differentially spliced exon. The proposed model was applied to three case studies concerning both exon and HTA arrays. Conclusion: The REIDS model provides a workflow for the identification of alternative splicing events relying on the established linear mixed model. The model can be applied to different types of arrays.

Durham Research Online

Directory of Open Access Journals

Joint annotation of coding and non-coding single nucleotide polymorphisms and mutations in the SNPeffect and PupaSuite databases

Author: Conde Lucia
Dopazo Joaquin
Maurer-Stroh Sebastian
Medina Ignacio
Reumers Joke
Rousseau Frederic
Schymkowitz Joost
Van Durme Joost
Publication venue: Oxford University Press
Publication date: 01/01/2008

Single nucleotide polymorphisms (SNPs) are, together with copy number variation, the primary source of variation in the human genome. SNPs are associated with altered response to drug treatment, susceptibility to disease and other phenotypic variation. Furthermore, during genetic screens for disease-associated mutations in groups of patients and control individuals, the distinction between disease causing mutation and polymorphism is often unclear. Annotation of the functional and structural implications of single nucleotide changes thus provides valuable information to interpret and guide experiments. The SNPeffect and PupaSuite databases are now synchronized to deliver annotations for both non-coding and coding SNP, as well as annotations for the SwissProt set of human disease mutations. In addition, SNPeffect now contains predictions of Tango2: an improved aggregation detector, and Waltz: a novel predictor of amyloid-forming sequences, as well as improved predictors for regions that are recognized by the Hsp70 family of chaperones. The new PupaSuite version incorporates predictions for SNPs in silencers and miRNAs including their targets, as well as additional methods for predicting SNPs in TFBSs and splice sites. Also predictions for mouse and rat genomes have been added. In addition, a PupaSuite web service has been developed to enable data access, programmatically. The combined database holds annotations for 4 965 073 regulatory as well as 133 505 coding human SNPs and 14 935 disease mutations, and phenotypic descriptions of 43 797 human proteins and is accessible via http://snpeffect.vib.be and http://pupasuite.bioinfo.cipf.es/

UCL Discovery

PupaSuite: finding functional single nucleotide polymorphisms for large-scale genotyping purposes

Author: Arbiza Leonardo
Conde Lucía
Dopazo Hernán
Dopazo Joaquín
Reumers Joke
Rousseau Frederic
Schymkowitz Joost
Vaquerizas Juan M.
Publication venue: Oxford University Press
Publication date: 01/07/2006

We have developed a web tool, PupaSuite, for the selection of single nucleotide polymorphisms (SNPs) with potential phenotypic effect, specifically oriented to help in the design of large-scale genotyping projects. PupaSuite uses a collection of data on SNPs from heterogeneous sources and a large number of pre-calculated predictions to offer a flexible and intuitive interface for selecting an optimal set of SNPs. It improves the functionality of PupaSNP and PupasView programs and implements new facilities such as the analysis of user's data to derive haplotypes with functional information. A new estimator of putative effect of polymorphisms has been included that uses evolutionary information. Also SNPeffect database predictions have been included. The PupaSuite web interface is accessible through and through

Lirias

UCL Discovery

ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering

Author: Aerssens Jeroen
Bijnens Luc
Clement Lieven
Meys Joris
Reumers Joke
Talloen Willem
Thas Olivier
Thys Kim
Vapirev Alexander
Verbist Bie
Wetzels Yves
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

Background: Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses. Results: Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step. Conclusions: ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. 
They provided a slight increase in sensitivity, which, however, does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores, enabling the model-based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection.
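
Two ingredients mentioned in the ViVaMBC abstract, error probabilities derived from quality scores and calling at the codon level, can be illustrated with a short sketch. It assumes Phred-scaled quality scores (error probability 10^(-Q/10)), reads that are already in frame, and independent base errors within a codon. It does not implement the model-based clustering itself, and the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def split_codons(seq, quals):
    """Yield (codon, probability that all three bases are correct).

    Assumes the read starts in frame; trailing bases that do not fill a
    complete codon are ignored. Base errors are treated as independent.
    """
    usable = len(seq) - len(seq) % 3
    for i in range(0, usable, 3):
        p_ok = 1.0
        for q in quals[i:i + 3]:
            p_ok *= 1.0 - phred_to_error_prob(q)
        yield seq[i:i + 3], p_ok
```

A codon whose three bases each have quality 30 gets a probability of about 0.997 of being error-free under these assumptions, which shows how per-base quality can be carried up to the triplet level.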

Research Online

Genome dynamics of the human embryonic kidney 293 lineage in response to cell biology manipulations

Author: Boone Morgane
Callewaert Nico
Chen Jason
Drmanac Radoje
Lambrechts Diether
Lemmens Irma
Lin Yao-Cheng
Meuris Leander
Moisse Matthieu
Plaisance Stéphane
Reumers Joke
Soete Arne
Speleman Franki
Tavernier Jan
Van de Peer Yves
Van Roy Nadine
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

The HEK293 human cell lineage is widely used in cell biology and biotechnology. Here we use whole-genome resequencing of six 293 cell lines to study the dynamics of this aneuploid genome in response to the manipulations used to generate common 293 cell derivatives, such as transformation and stable clone generation (293T); suspension growth adaptation (293S); and cytotoxic lectin selection (293SG). Remarkably, we observe that copy number alteration detection could identify the genomic region that enabled cell survival under selective conditions (i.c. ricin selection). Furthermore, we present methods to detect human/vector genome breakpoints and a user-friendly visualization tool for the 293 genome data. We also establish that the genome structure composition is in steady state for most of these cell lines when standard cell culturing conditions are used. This resource enables novel and more informed studies with 293 cells, and we will distribute the sequenced cell lines to this effect.

UPSpace at the University of Pretoria

Archivsystem Ask23

Whole-genome sequencing reveals a coding non-pathogenic variant tagging a non-coding pathogenic hexanucleotide repeat expansion in C9orf72 as cause of amyotrophic lateral sclerosis

Author: Goris An
Herdewyn Sarah
Kusters Benno
Lambrechts Diether
Matthijs Gert
Moisse Matthieu
Race Valérie
Reumers Joke
Robberecht Wim
Schelhaas Helenius J.
Van Damme Philip
van den Berg Leonard H.
Zhao Hui
Publication venue: Oxford University Press
Publication date: 01/01/2012

Motor neuron degeneration in amyotrophic lateral sclerosis (ALS) has a familial cause in 10% of patients. Despite significant advances in the genetics of the disease, many families remain unexplained. We performed whole-genome sequencing in five family members from a pedigree with autosomal-dominant classical ALS. A family-based elimination approach was used to identify novel coding variants segregating with the disease. This list of variants was effectively shortened by genotyping these variants in 2 additional unaffected family members and 1500 unrelated population-specific controls. A novel rare coding variant in SPAG8 on chromosome 9p13.3 segregated with the disease and was not observed in controls. Mutations in SPAG8 were not encountered in 34 other unexplained ALS pedigrees, including 1 with linkage to chromosome 9p13.2–23.3. The shared haplotype containing the SPAG8 variant in this small pedigree was 22.7 Mb and overlapped with the core 9p21 linkage locus for ALS and frontotemporal dementia. Based on differences in coverage depth of known variable tandem repeat regions between affected and non-affected family members, the shared haplotype was found to contain an expanded hexanucleotide (GGGGCC)n repeat in C9orf72 in the affected members. Our results demonstrate that rare coding variants identified by whole-genome sequencing can tag a shared haplotype containing a non-coding pathogenic mutation and that changes in coverage depth can be used to reveal tandem repeat expansions. They also confirm (GGGGCC)n repeat expansions in C9orf72 as a cause of familial ALS.
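
The idea of comparing coverage depth in tandem repeat regions between affected and non-affected family members can be sketched as below. This is an illustration only and not the procedure used in the study: the input layout (region name mapped to a list of per-sample mean depths), the pseudocount, and the log2 threshold are hypothetical, and the sketch flags differences in either direction because the abstract does not state which way the depth changed.

```python
import math
from statistics import mean


def flag_regions(affected, unaffected, min_abs_log2=1.0, pseudocount=1.0):
    """Flag regions whose mean depth differs between the two groups.

    affected / unaffected: dict mapping region name -> list of per-sample
    mean coverage depths (hypothetical layout, already normalized).
    Returns a dict of region -> log2(affected / unaffected) for regions
    whose absolute log2 ratio reaches min_abs_log2.
    """
    flagged = {}
    for region in affected.keys() & unaffected.keys():
        a = mean(affected[region]) + pseudocount
        u = mean(unaffected[region]) + pseudocount
        ratio = math.log2(a / u)
        if abs(ratio) >= min_abs_log2:
            flagged[region] = ratio
    return flagged
```

Regions flagged this way would only be candidates; in practice such a signal needs confirmation by a dedicated repeat assay.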