
    Halvade: scalable sequence analysis with MapReduce

    Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50x coverage) in <3 h with very high parallel efficiency. Even on a single multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading. Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of Hadoop distributions, including Cloudera and Amazon EMR.
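
    As a rough, hypothetical illustration of the approach the abstract describes (not Halvade's actual code), the Java sketch below wraps a two-stage pipeline in the Hadoop MapReduce 2.0 API: the mapper stands in for aligning a chunk of reads and keys its output by genomic region, and the reducer stands in for running variant calling on all reads of one region. The class names, the placeholder region key, and the placeholder outputs are assumptions made for this sketch.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    public class SequencingPipelineJob {

        // Stand-in for the alignment stage: maps a chunk of reads to (region, read) pairs.
        public static class AlignmentMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            protected void map(Object key, Text readChunk, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical: an aligner would be invoked on the chunk here; each
                // aligned read is emitted keyed by the genomic region it maps to.
                String region = "chr1:0-1000000"; // placeholder region key
                context.write(new Text(region), readChunk);
            }
        }

        // Stand-in for the variant-calling stage: reduces all reads of one region.
        public static class RegionReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text region, Iterable<Text> reads, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical: the reads would be written to a per-region BAM and a
                // variant caller invoked; here only a placeholder record is emitted.
                context.write(region, new Text("variants.vcf"));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sequencing pipeline");
            job.setJarByClass(SequencingPipelineJob.class);
            job.setMapperClass(AlignmentMapper.class);
            job.setReducerClass(RegionReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }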

    elPrep 4: a multithreaded framework for sequence analysis

    We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment/map files in the Go programming language. elPrep 4 includes multiple new features that allow us to process all of the preparation steps defined by the GATK Best Practices pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework, vastly improving runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data and up to 7.4x faster on WGS data compared to running the same pipeline with GATK 4, while utilizing fewer compute resources.

    Multithreaded variant calling in elPrep 5

    We present elPrep 5, which updates the elPrep framework for processing sequence alignment/map files with variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces BAM and VCF output identical to that of GATK4 while significantly reducing the runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the runtime of the variant calling pipeline by a factor of 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK4. This makes elPrep 5 a suitable drop-in replacement for GATK4 when faster execution times are needed.

    elPrep: high-performance preparation of sequence alignment/map files for variant calling

    elPrep is a high-performance tool for preparing sequence alignment/map files for variant calling in sequencing pipelines. It can be used as a replacement for SAMtools and Picard for preparation steps such as filtering, sorting, marking duplicates, reordering contigs, and so on, while producing identical results. What sets elPrep apart is its software architecture, which allows executing preparation pipelines by making only a single pass through the data, no matter how many preparation steps are used in the pipeline. elPrep is designed as a multithreaded application that runs entirely in memory, avoids repeated file I/O, and merges the computation of several preparation steps to significantly speed up the execution time. For example, for a preparation pipeline of five steps on a whole-exome BAM file (NA12878), we reduce the execution time from about 1 hour and 40 minutes, when using a combination of SAMtools and Picard, to about 15 minutes when using elPrep, while utilising the same server resources, here 48 threads and 23 GB of RAM. For the same pipeline on whole-genome data (NA12878), elPrep reduces the runtime from 24 hours to less than 5 hours. As a typical clinical study may contain sequencing data for hundreds of patients, elPrep can remove several hundred hours of computing time, and thus substantially reduce analysis time and cost.
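
    A minimal, hypothetical sketch of the single-pass idea described above: several per-record preparation steps are merged into one composite operation, so the alignment records are traversed only once regardless of how many steps the pipeline contains. This is not elPrep's actual API or implementation; the SamRecord type and the example filters below are invented for illustration, and Java is used only to keep the sketch self-contained.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    public class SinglePassPipeline {

        // Hypothetical stand-in for one alignment record from a SAM/BAM file.
        record SamRecord(String readName, int mappingQuality, boolean unmapped) {}

        public static void main(String[] args) {
            // Each preparation step is expressed as a per-record predicate.
            List<Predicate<SamRecord>> steps = List.of(
                    r -> !r.unmapped(),           // drop unmapped reads
                    r -> r.mappingQuality() >= 20 // drop low-quality alignments
            );

            // Merge all steps into one composite predicate, so the input is scanned
            // once no matter how many steps the pipeline contains.
            Predicate<SamRecord> merged = steps.stream().reduce(r -> true, Predicate::and);

            List<SamRecord> input = List.of(
                    new SamRecord("read1", 60, false),
                    new SamRecord("read2", 10, false),
                    new SamRecord("read3", 60, true));

            List<SamRecord> kept = new ArrayList<>();
            for (SamRecord r : input) { // a single pass over the data
                if (merged.test(r)) {
                    kept.add(r);
                }
            }
            System.out.println(kept.size() + " of " + input.size() + " records kept");
        }
    }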

    A highly efficient multi-core algorithm for clustering extremely large datasets

    Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

    Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms, based on the design principles of transactional memory, for clustering gene expression microarray-type data and categorical SNP data. Our new shared-memory parallel algorithms are shown to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy, compared to single-core implementations and a recently published network-based parallelization.

    Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer.
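
    As a simplified, hypothetical illustration of the shared-memory parallelization described above (not the paper's transactional-memory design), the Java sketch below parallelizes the nearest-centroid assignment step of k-means across the cores of one machine; each data point is handled independently, so no locking on shared state is required. The data, dimensions, and class name are invented for this sketch.

    import java.util.Arrays;
    import java.util.Random;
    import java.util.stream.IntStream;

    public class ParallelKMeansAssignment {

        static double squaredDistance(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return sum;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            int n = 100_000, dim = 20, k = 8;

            // Synthetic data standing in for expression profiles or SNP vectors.
            double[][] points = new double[n][dim];
            double[][] centroids = new double[k][dim];
            for (double[] p : points) Arrays.setAll(p, i -> rng.nextDouble());
            for (double[] c : centroids) Arrays.setAll(c, i -> rng.nextDouble());

            // Assignment step: each point is handled independently, so the work can
            // be spread over all cores of one machine without locking shared state.
            int[] assignment = new int[n];
            IntStream.range(0, n).parallel().forEach(p -> {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[p], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                assignment[p] = best;
            });

            System.out.println("first assignments: " +
                    Arrays.toString(Arrays.copyOf(assignment, 10)));
        }
    }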