
    Fast genotyping of known SNPs through approximate k-mer matching

    Motivation: As the volume of next-generation sequencing (NGS) data increases, faster algorithms become necessary. Although speeding up individual components of a sequence analysis pipeline (e.g. read mapping) can reduce the computational cost of analysis, such approaches do not take full advantage of the particulars of a given problem. One problem of great interest, genotyping a known set of variants (e.g. dbSNP or Affymetrix SNPs), is important for characterization of known genetic traits and causative disease variants within an individual, as well as for the initial stage of many ancestral and population genomic pipelines (e.g. GWAS). Results: We introduce lightweight assignment of variant alleles (LAVA), an NGS-based genotyping algorithm for a given set of SNP loci, which takes advantage of the fact that approximate matching of mid-size k-mers (with k = 32) can typically uniquely identify loci in the human genome without full read alignment. LAVA accurately calls the vast majority of SNPs in dbSNP and Affymetrix's Genome-Wide Human SNP Array 6.0 up to about an order of magnitude faster than standard NGS genotyping pipelines. For Affymetrix SNPs, LAVA has significantly higher SNP calling accuracy than existing pipelines while using as little as ~5 GB of RAM. As such, LAVA represents a scalable computational method for population-level genotyping studies as well as a flexible NGS-based replacement for SNP arrays. Availability and Implementation: LAVA software is available at http://lava.csail.mit.edu
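
    The core idea is simple enough to sketch. Below is a minimal Python illustration (an assumption-laden toy, not LAVA's actual implementation, which uses approximate matching and a statistical genotyping model): index the reference and alternate 32-mers spanning each known SNP, stream reads, tally exact k-mer hits per allele, and call genotypes from allele fractions. All helper names and thresholds are illustrative.

        from collections import defaultdict

        K = 32  # mid-size k-mers, as in the abstract

        def build_snp_index(snps, reference):
            """snps: iterable of (pos, ref_base, alt_base) on a reference string.
            Maps the k-mer centred on each SNP, for both alleles, to (snp_id, allele)."""
            index = {}
            for snp_id, (pos, ref_base, alt_base) in enumerate(snps):
                start = pos - K // 2
                window = reference[start:start + K]
                if start < 0 or len(window) < K or window[K // 2] != ref_base:
                    continue  # skip loci at contig edges or mismatching the reference
                index[window] = (snp_id, "ref")
                index[window[:K // 2] + alt_base + window[K // 2 + 1:]] = (snp_id, "alt")
            return index

        def genotype(reads, index, het_band=(0.2, 0.8)):
            """Tally k-mer hits per allele, then call genotypes from allele fractions."""
            counts = defaultdict(lambda: [0, 0])  # snp_id -> [ref hits, alt hits]
            for read in reads:
                for i in range(len(read) - K + 1):
                    hit = index.get(read[i:i + K])
                    if hit:
                        snp_id, allele = hit
                        counts[snp_id][allele == "alt"] += 1
            calls = {}
            for snp_id, (ref_n, alt_n) in counts.items():
                frac = alt_n / (ref_n + alt_n)
                calls[snp_id] = ("0/0" if frac < het_band[0]
                                 else "1/1" if frac > het_band[1] else "0/1")
            return calls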

    Big tranSMART for clinical decision making

    Patient stratification based on molecular profiling data plays a key role in clinical decision making, such as identification of disease subgroups and prediction of treatment responses of individual subjects. Many existing knowledge management systems, such as tranSMART, enable scientists to perform such analyses. But in the big data era, molecular profiling data sizes are increasing sharply due to new biological techniques such as next-generation sequencing. None of the existing storage systems work well when the three "V" features of big data (Volume, Variety, and Velocity) are considered. New key-value data stores such as Apache HBase and Google Bigtable can provide high-speed queries by key. These databases can be modeled as a Distributed Ordered Table (DOT), which horizontally partitions a table into regions and distributes regions to region servers by key. However, none of the existing data models work well for DOT. A Collaborative Genomic Data Model (CGDM) has been designed to solve these issues. CGDM creates three collaborative global clustering index tables to improve data query velocity. The microarray implementation of CGDM on HBase performed up to 246, 7 and 20 times faster than the relational data model on HBase, MySQL Cluster and MongoDB, respectively. The single nucleotide polymorphism implementation of CGDM on HBase outperformed the relational model on HBase and MySQL Cluster by up to 351 and 9 times, respectively. The raw sequence implementation of CGDM on HBase gains up to 440-fold and 22-fold speedups compared to the sequence alignment map format implemented in HBase and a binary alignment map server. The integration into tranSMART shows up to a 7-fold speedup in the data export function. In addition, a popular hierarchical clustering algorithm in tranSMART has been used as an application to show how CGDM can influence the velocity of the algorithm: the optimized method using CGDM performs more than 7 times faster than the same method using the relational model implemented in MySQL Cluster.
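
    The benefit of a clustering index table is easiest to see in miniature. The sketch below (hypothetical names, an in-memory Python stand-in rather than CGDM's actual HBase schema) shows why a second table with a reordered composite key helps on a key-ordered store: a query on a non-leading attribute becomes a single contiguous range scan instead of a full scan of the primary table.

        import bisect

        class OrderedTable:
            """In-memory stand-in for a key-ordered store (DOT) with range scans."""
            def __init__(self):
                self.keys, self.values = [], []

            def put(self, key, value):
                i = bisect.bisect_left(self.keys, key)
                if i < len(self.keys) and self.keys[i] == key:
                    self.values[i] = value
                else:
                    self.keys.insert(i, key)
                    self.values.insert(i, value)

            def get(self, key):
                i = bisect.bisect_left(self.keys, key)
                if i < len(self.keys) and self.keys[i] == key:
                    return self.values[i]
                return None

            def scan(self, prefix):
                i = bisect.bisect_left(self.keys, prefix)
                while i < len(self.keys) and self.keys[i].startswith(prefix):
                    yield self.keys[i], self.values[i]
                    i += 1

        expr = OrderedTable()       # primary table: gene-major key layout
        by_sample = OrderedTable()  # clustering index: sample-major key layout

        def insert_expression(gene, sample, value):
            expr.put(f"{gene}|{sample}", value)
            by_sample.put(f"{sample}|{gene}", f"{gene}|{sample}")  # back-reference

        insert_expression("TP53", "patient42", 7.1)
        insert_expression("BRCA1", "patient42", 3.4)

        # Per-sample query: one contiguous range scan on the index table.
        for _, primary_key in by_sample.scan("patient42|"):
            print(primary_key, expr.get(primary_key))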

    Nephele: genotyping via complete composition vectors and MapReduce

    Background: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets being generated by next-generation DNA sequencing machines, because their computational complexity increases quickly with the number of sequences. Results: Nephele is a suite of tools that uses the complete composition vector algorithm to represent each sequence in the dataset as a vector derived from its constituent k-mers, bypassing the need for multiple sequence alignment, and affinity propagation clustering to group the sequences into genotypes based on a distance measure over the vectors. Our methods produce results that correlate well with expert-defined clades or genotypes, at a fraction of the computational cost of traditional phylogenetic methods run on traditional hardware. Nephele can use the open-source Hadoop implementation of MapReduce to parallelize execution using multiple compute nodes. We were able to generate a neighbour-joined tree of over 10,000 16S samples in less than 2 hours. Conclusions: Using Nephele can substantially decrease the processing time required for generating genotype trees of tens to hundreds of organisms at genome-scale sequence coverage.
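
    As a rough illustration of the alignment-free pipeline (a simplified Python sketch, not Nephele itself): represent each sequence as a normalized k-mer frequency vector and cluster the vectors with affinity propagation. The full complete composition vector method also subtracts a Markov-model expectation from each k-mer count, which is omitted here; k = 4 and the toy sequences are arbitrary choices.

        from itertools import product

        import numpy as np
        from sklearn.cluster import AffinityPropagation

        K = 4
        KMER_IDX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

        def composition_vector(seq):
            """Unit-normalized k-mer count vector of a DNA sequence."""
            v = np.zeros(len(KMER_IDX))
            for i in range(len(seq) - K + 1):
                idx = KMER_IDX.get(seq[i:i + K])
                if idx is not None:  # skip k-mers containing ambiguous bases
                    v[idx] += 1
            norm = np.linalg.norm(v)
            return v / norm if norm else v

        def cluster_genotypes(sequences):
            X = np.array([composition_vector(s) for s in sequences])
            similarity = X @ X.T  # cosine similarity of unit vectors
            ap = AffinityPropagation(affinity="precomputed", random_state=0)
            return ap.fit(similarity).labels_

        labels = cluster_genotypes(
            ["ACGTACGTAGCTAGCT", "ACGTACGTAGCTAGCA", "TTTTGGGGCCCCAAAA"])
        print(labels)  # the two similar sequences are expected to share a label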

    Experiences with workflows for automating data-intensive bioinformatics

    High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out data management and analysis tasks at large scale. Workflow systems can simplify the construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault tolerance. However, workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are often still built without them. We present our experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. The participating organizations work on similar problems but have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.

    SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud

    Massive whole-genome genotype reference panels now enable accurate and fast genotyping by imputation for high-resolution genome-wide association (GWA) studies. Imputation-assisted genotyping can increase the genomic coverage of genotypes and thus satisfy the resolution required in comprehensive GWA studies in a cost-effective manner. However, the imputation of missing genotypes from large reference panels is a compute-intensive process that requires high-performance computing (HPC). Although HPC relies on highly distributed and parallel computing, current imputation tools and existing algorithms have not been developed to fully exploit the power of distributed computing. To this end, we have developed SparkBeagle, a scalable, fast and accurate distributed genotype imputation tool based on the popular Beagle software. SparkBeagle is designed for HPC and cloud computing environments and is implemented on top of the Apache Spark distributed computing framework. We have carried out scalability experiments by imputing 64,976,316 variants of 2,504 samples from the 1000 Genomes reference panel in the cloud. SparkBeagle shows near-linear scalability as the number of computing nodes increases. A speedup of 30x was achieved with 40 nodes; the imputation time for the whole data set decreased from 565 minutes to 18 minutes compared to a single-node parallel execution. Near-identical imputation accuracy was measured in a concordance analysis between the original Beagle and the distributed SparkBeagle tool.
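
    The distribution pattern described is, at its core, region-parallel imputation. Below is a hedged PySpark sketch (not SparkBeagle's actual code; impute_window is a hypothetical stand-in for running Beagle on one chunk against the matching reference panel shard): split the genome into overlapping windows so each chunk has flanking context, impute the windows in parallel, and collect the results.

        from pyspark.sql import SparkSession

        def make_windows(chrom_len, size=1_000_000, overlap=50_000):
            """Overlapping windows give each chunk flanking context at its edges."""
            start = 0
            while start < chrom_len:
                yield (max(0, start - overlap), min(chrom_len, start + size + overlap))
                start += size

        def impute_window(window):
            begin, end = window
            # Hypothetical: run Beagle on variants in [begin, end) against the
            # reference panel shard for this region and return imputed genotypes.
            return f"imputed:{begin}-{end}"

        spark = SparkSession.builder.appName("imputation-sketch").getOrCreate()
        windows = list(make_windows(chrom_len=248_000_000))  # roughly chr1-sized
        results = (spark.sparkContext
                   .parallelize(windows, numSlices=len(windows))
                   .map(impute_window)
                   .collect())
        spark.stop()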

    Inference of Ancestral Recombination Graphs through Topological Data Analysis

    The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer are examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in reconstructing potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than a few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands. The accompanying software, instructions and example files used in the manuscript can be obtained from https://github.com/RabadanLab/TARGet
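
    The topological signal this line of work builds on can be demonstrated in a few lines of Python. The sketch below is not TARGet; it assumes the third-party ripser package and shows that purely clonal (tree-like) evolution yields trivial first homology, whereas reciprocal mosaics of two haplotypes create a loop in the pairwise-distance complex that appears as an H1 bar in the persistence diagram.

        import numpy as np
        from ripser import ripser

        def hamming_matrix(seqs):
            """Pairwise Hamming distances between equal-length haplotype strings."""
            n = len(seqs)
            D = np.zeros((n, n))
            for i in range(n):
                for j in range(i + 1, n):
                    d = sum(a != b for a, b in zip(seqs[i], seqs[j]))
                    D[i, j] = D[j, i] = d
            return D

        # Two parental haplotypes plus their two reciprocal mosaics (recombinants):
        # the four points form a cycle in Hamming space that H1 detects.
        seqs = ["0000011111", "1111100000", "0000000000", "1111111111"]
        dgms = ripser(hamming_matrix(seqs), distance_matrix=True, maxdim=1)["dgms"]
        print(f"H1 bars (candidate recombination loops): {len(dgms[1])}")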

    Optimization strategies for fast detection of positive selection on phylogenetic trees

    Motivation: The detection of positive selection is widely used to study gene and genome evolution, but its application remains limited by the high computational cost of existing implementations. We present a series of computational optimizations for more efficient estimation of the likelihood function on large-scale phylogenetic problems. We illustrate our approach using the branch-site model of codon evolution. Results: We introduce novel optimization techniques that substantially outperform both CodeML from the PAML package and our previously optimized sequential version, SlimCodeML. These techniques can also be applied to other likelihood-based phylogeny software. Our implementation, FastCodeML, scales well for large numbers of codons and/or species and can therefore analyse substantially larger datasets than CodeML. We evaluated FastCodeML on different platforms and measured average sequential speedups of FastCodeML (single-threaded) versus CodeML of up to 5.8, average speedups of FastCodeML (multi-threaded) versus CodeML on a single node (shared memory) of up to 36.9 for 12 CPU cores, and average speedups of the distributed FastCodeML versus CodeML of up to 170.9 on eight nodes (96 CPU cores in total). Availability and implementation: ftp://ftp.vital-it.ch/tools/FastCodeML/. Contact: [email protected] or [email protected]
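
    The kernel such tools spend their time in is the phylogenetic likelihood itself. As a point of reference, here is a minimal Python sketch of Felsenstein's pruning recursion under the Jukes-Cantor nucleotide model (an illustrative simplification: FastCodeML and CodeML work with 61-state codon matrices and heavily vectorized linear algebra, but the recursion being optimized has this shape).

        import numpy as np

        STATES = "ACGT"

        def jc_transition(t):
            """Jukes-Cantor P(t): probability of base i -> j along a branch of length t."""
            same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
            diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
            return np.where(np.eye(4, dtype=bool), same, diff)

        def partial_likelihood(node, site):
            """node: ('leaf', sequence) or ('internal', (child, branch_len), (child, branch_len))."""
            if node[0] == "leaf":
                L = np.zeros(4)
                L[STATES.index(node[1][site])] = 1.0
                return L
            _, (left, t_left), (right, t_right) = node
            return ((jc_transition(t_left) @ partial_likelihood(left, site)) *
                    (jc_transition(t_right) @ partial_likelihood(right, site)))

        # ((A:0.1, C:0.1):0.2, A:0.3); one alignment column, uniform root frequencies
        tree = ("internal",
                (("internal", (("leaf", "A"), 0.1), (("leaf", "C"), 0.1)), 0.2),
                (("leaf", "A"), 0.3))
        site_lik = 0.25 * partial_likelihood(tree, 0).sum()
        print(f"site log-likelihood: {np.log(site_lik):.4f}")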
