Search CORE

31 research outputs found

SMusket: Spark-based DNA error correction on distributed-memory systems

Author: Expósito Roberto R.
González-Domínguez Jorge
Touriño Juan
Publication venue: Elsevier B.V.
Publication date: 01/01/2020

©2020 Elsevier B.V. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article has been accepted for publication in Future Generation Computer Systems. The Version of Record is available online at https://doi.org/10.1016/j.future.2019.10.038. This is the accepted version of: R. R. Expósito, J. González-Domínguez, and J. Touriño, "SMusket: Spark-based DNA error correction on distributed-memory systems", Future Generation Computer Systems, vol. 111, pp. 698-713, 2020, https://doi.org/10.1016/j.future.2019.10.038

[Abstract] Next-Generation Sequencing (NGS) technologies have revolutionized genomics research over the last decade, bringing new opportunities for scientists to perform groundbreaking biological studies. Error correction in NGS datasets is considered an important preprocessing step in many workflows, as sequencing errors can severely affect the quality of downstream analysis. Although current error correction approaches provide reasonably high accuracies, their computational cost can still be unacceptable when processing large datasets. In this paper we propose SparkMusket (SMusket), a Big Data tool built upon the open-source Apache Spark cluster computing framework to boost the performance of Musket, one of the most widely adopted and top-performing multithreaded correctors. Our tool efficiently exploits Spark features to implement a scalable error correction algorithm intended for distributed-memory systems built using commodity hardware. The experimental evaluation on a 16-node cluster using four publicly available datasets has shown that SMusket is up to 15.3 times faster than previous state-of-the-art MPI-based tools, also providing a maximum speedup of 29.8 over its multithreaded counterpart. SMusket is publicly available under an open-source license at https://github.com/rreye/smusket

This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and FEDER, Spain funds of the European Union (project TIN2016-75845-P, AEI/FEDER/EU); and by Xunta de Galicia, Spain (projects ED431G/01 and ED431C 2017/04).

Xunta de Galicia; ED431G/01
Xunta de Galicia; ED431C 2017/0

Repositorio da Universidade da Coruña

Crossref

MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud

Author: Expósito Roberto R.
González-Domínguez Jorge
Touriño Juan
Veiga Jorge
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2017

This is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record Roberto R. Expósito, Jorge Veiga, Jorge González-Domínguez, Juan Touriño; MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud, Bioinformatics, Volume 33, Issue 17, 1 September 2017, Pages 2762–2764 is available online at: https://doi.org/10.1093/bioinformatics/btx307

[Abstract] This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing. On a 16-node cluster deployed on the Amazon EC2 cloud platform, MarDRe is up to 8.52 times faster than a representative state-of-the-art tool.

Ministerio de Economía y Competitividad; TIN2016-75845-P
Ministerio de Educación; FPU014/0280

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

The Servet 3.0 benchmark suite: characterization of network performance degradation

Author: Expósito Roberto R.
González-Domínguez Jorge
López Taboada Guillermo
Martín María J.
Touriño Juan
Publication venue: Elsevier BV
Publication date: 01/01/2013

This is a post-peer-review, pre-copyedit version of an article published in Computers & Electrical Engineering. The final authenticated version is available online at: https://doi.org/10.1016/j.compeleceng.2013.08.012.

[Abstract] Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Up to now the tool considered network bandwidth as constant and independent of the communication pattern. Nevertheless, the inter-node communication bandwidth decreases on modern large supercomputers depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer and some examples of how real parallel codes can be optimized by using the information about network degradation.

Ministerio de Ciencia e Innovación; TIN2010-16735
Ministerio de Educación; AP2008-01578
Ministerio de Educación; AP2010-4348
European Commission; HPC-Europa2 Programme; 22839

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Analysis of I/O Performance on an Amazon EC2 Cluster Compute and High I/O Platform

Author: Doallo Ramón
Expósito Roberto R.
González-Domínguez Jorge
López Taboada Guillermo
Ramos Garea Sabela
Touriño Juan
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

This is a post-peer-review, pre-copyedit version of an article published in Journal of Grid Computing. The final authenticated version is available online at: https://doi.org/10.1007/s10723-013-9250-y

[Abstract] Cloud computing is currently being explored by the scientific community to assess its suitability for High Performance Computing (HPC) environments. In this novel paradigm, compute and storage resources, as well as applications, can be dynamically provisioned on a pay-per-use basis. This paper presents a thorough evaluation of the I/O storage subsystem using the Amazon EC2 Cluster Compute platform and the recent High I/O instance type, to determine its suitability for I/O-intensive applications. The evaluation has been carried out at different layers using representative benchmarks in order to evaluate the low-level cloud storage devices available in Amazon EC2, ephemeral disks and Elastic Block Store (EBS) volumes, both on local and distributed file systems. In addition, several I/O interfaces (POSIX, MPI-IO and HDF5) commonly used by scientific workloads have also been assessed. Furthermore, the scalability of a representative parallel I/O code has also been analyzed at the application level, taking into account both performance and cost metrics. The analysis of the experimental results has shown that available cloud storage devices can have different performance characteristics and usage constraints. Our comprehensive evaluation can help scientists to significantly increase (up to several times) the performance of I/O-intensive applications in the Amazon EC2 cloud. An example of an optimal configuration that can maximize I/O performance in this cloud is the use of a RAID 0 of 2 ephemeral disks, TCP with 9,000 bytes MTU, NFS async and MPI-IO on the High I/O instance type, which provides ephemeral disks backed by Solid State Drive (SSD) technology.

Ministerio de Ciencia e Innovación; TIN2010-16735
Ministerio de Educación; AP2010-4348
Galicia. Consellería de Cultura, Educación e Ordenación Universitaria; ref. 2010/

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Intervención familiar en la esquizofrenia. Su diseminación en un área de salud.

Author: Baena Ruiz E
Fernández Fernández J
Inglott Domínguez Rafael
Touriño González R
Publication venue: Asociación Española de Neuropsiquiatría
Publication date: 01/01/2004

This paper presents the process of disseminating a family intervention program for schizophrenia across a health area, integrated into the routine care provided by the mental health network. The reasons that made its implementation possible are discussed. Keywords: family intervention, psychoeducation, schizophrenia

Revista de la Asociación Española de Neuropsiquiatría

Intervención familiar en la esquizofrenia. Su diseminación en un área de salud.

Author: Baena Ruiz E
Fernández Fernández J
Inglott Domínguez Rafael
Touriño González R
Publication venue: Asociación Española de Neuropsiquiatría
Publication date: 01/01/2004

Repositorio de la Asociación Española de Neuropsiquiatría

Directory of Open Access Journals

Revista de la Asociación Española de Neuropsiquiatría

A 2D algorithm with asymmetric workload for the UPC conjugate gradient method

Author: Jorge González-Domínguez
Juan Touriño
María J. Martín
Osni A. Marques
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-014-1300-0

[Abstract] This paper examines four different strategies, each one with its own data distribution, for implementing the parallel conjugate gradient (CG) method and how they impact communication and overall performance. Firstly, typical 1D and 2D distributions of the matrix involved in CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload, based on leaving some threads idle during part of the computation to reduce communication, is proposed. The four strategies are independent of sparse storage schemes and are implemented using Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. The strategies are evaluated on two different platforms through a set of matrices that exhibit distinct sparse patterns, demonstrating that our asymmetric proposal outperforms the others except for one matrix on one platform.

Ministerio de Economía y Competitividad; TIN2013-42148-P
Xunta de Galicia; GRC2013/055
United States. Department of Energy; DEAC03-76SF0009

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

HSRA: Hadoop-based spliced read aligner for RNA sequencing data

Author: Jorge González-Domínguez
Juan Touriño
Roberto R. Expósito
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2018

[Abstract] Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is considered a crucial step that remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user’s guide are publicly available for download at http://hsra.dec.udc.es.

Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P
Xunta de Galicia; ED431G/0

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Directory of Open Access Journals

Recommended from our members

Assessment of the anthelmintic activity of medicinal plant extracts and purified condensed tannins against free-living and parasitic stages of Oesophagostomum dentatum

Author: Andrew R Williams
Christos Fryganas
Honorata M Ropiak
Irene Mueller-Harvey
Olivier Desrues
Stig M Thamsborg
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

Background: Plant-derived condensed tannins (CT) show promise as a complementary option to treat gastrointestinal helminth infections, thus reducing reliance on synthetic anthelmintic drugs. Most studies on the anthelmintic effects of CT have been conducted on parasites of ruminant livestock. Oesophagostomum dentatum is an economically important parasite of pigs, as well as serving as a useful laboratory model of helminth parasites due to the ability to culture it in vitro for long periods through several life-cycle stages. Here, we investigated the anthelmintic effects of CT on multiple life-cycle stages of O. dentatum.

Methods: Extracts and purified fractions were prepared from five plants containing CT and analysed by HPLC-MS. Anthelmintic activity was assessed at five different stages of the O. dentatum life cycle: the development of eggs to infective third-stage larvae (L3), the parasitic L3 stage, the moult from L3 to fourth-stage larvae (L4), the L4 stage and the adult stage.

Results: Free-living larvae of O. dentatum were highly susceptible to all five plant extracts. In contrast, only two of the five extracts had activity against L3, as evidenced by migration inhibition assays, whilst three of the five extracts inhibited the moulting of L3 to L4. All five extracts reduced the motility of L4, and the motility of adult worms exposed to a CT-rich extract derived from hazelnut skins was strongly inhibited, with electron microscopy demonstrating direct damage to the worm cuticle and hypodermis. Purified CT fractions retained anthelmintic activity, and depletion of CT from extracts by pre-incubation in polyvinylpolypyrrolidone removed anthelmintic effects, strongly suggesting CT as the active molecules.

Conclusions: These results suggest that CT may have promise as an alternative parasite control option for O. dentatum in pigs, particularly against adult stages. Moreover, our results demonstrate a varied susceptibility of different life-cycle stages of the same parasite to CT, which may offer an insight into the anthelmintic mechanisms of these commonly found plant compounds.

Central Archive at the University of Reading

Crossref

Springer - Publisher Connector

Copenhagen University Research Information System

PubMed Central