SeQual: Big Data Tool to Perform Quality Control and Data Preprocessing of Large NGS Datasets

Abstract

This paper presents SeQual, a scalable tool for efficient quality control of large genomic datasets. The tool currently supports more than 30 operations (e.g., filtering, trimming, formatting) that can be applied to DNA/RNA reads in FASTQ/FASTA formats to improve downstream analyses, and it provides a simple, user-friendly graphical interface for non-expert users. SeQual relies on the open-source Apache Spark cluster computing framework to process massive datasets on distributed-memory systems such as clusters. On an 8-node multi-core cluster, the Spark-based implementation reduces the runtime from more than three hours to less than 20 minutes when processing a paired-end dataset with 251 million reads per input file.

Funding: Ministry of Science and Innovation of Spain (10.13039/501100004837; Grant Numbers TIN2016-75845-P and PID2019-104184RB-I00); AEI/FEDER/EU (10.13039/501100004837; Grant Number 10.13039/501100011033); Xunta de Galicia and FEDER funds (10.13039/501100010801; Centro de Investigación de Galicia accreditation 2019–2022 and the Consolidation Program of Competitive Reference Groups; Grant Numbers ED431G 2019/01 and ED431C 2017/04).
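To make the kind of operation described above concrete, the following is a minimal sketch of one trimming step and one filtering step applied to FASTQ reads. It is not SeQual's code or API: the names `FastqRead`, `parse_fastq`, `trim_trailing`, `passes_filter` and `quality_control`, and the thresholds, are hypothetical and chosen only for illustration, and Phred+33 quality encoding is assumed. The sketch uses plain Python so that it runs without a cluster; in a Spark setting, per-read functions of this shape could in principle be applied through `map` and `filter` transformations on a distributed collection of reads.

```python
from typing import Iterable, Iterator, List, NamedTuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoded quality strings


class FastqRead(NamedTuple):
    name: str
    sequence: str
    quality: str  # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRead]:
    """Yield reads from 4-line FASTQ records (header, sequence, '+', quality)."""
    non_empty = (line.rstrip("\n") for line in lines if line.strip())
    # Consuming the same iterator four times per step groups lines into records.
    for header, sequence, _separator, quality in zip(*[non_empty] * 4):
        yield FastqRead(header[1:], sequence, quality)


def mean_quality(read: FastqRead) -> float:
    """Average Phred score of the read (0.0 for an empty read)."""
    if not read.quality:
        return 0.0
    return sum(ord(c) - PHRED_OFFSET for c in read.quality) / len(read.quality)


def trim_trailing(read: FastqRead, min_base_quality: int) -> FastqRead:
    """Trimming: drop trailing bases whose quality is below the threshold."""
    end = len(read.quality)
    while end > 0 and ord(read.quality[end - 1]) - PHRED_OFFSET < min_base_quality:
        end -= 1
    return FastqRead(read.name, read.sequence[:end], read.quality[:end])


def passes_filter(read: FastqRead, min_length: int, min_mean_quality: float) -> bool:
    """Filtering: keep reads that are long enough and of sufficient mean quality."""
    return len(read.sequence) >= min_length and mean_quality(read) >= min_mean_quality


def quality_control(lines: Iterable[str]) -> List[FastqRead]:
    """Trim each read, then keep only the reads that pass the filter."""
    trimmed = (trim_trailing(read, min_base_quality=20) for read in parse_fastq(lines))
    return [r for r in trimmed if passes_filter(r, min_length=4, min_mean_quality=30.0)]


if __name__ == "__main__":
    sample = [
        "@r1", "ACGTACGT", "+", "IIIIIIII",
        "@r2", "ACGTAC", "+", "II##!!",
        "@r3", "ACGTACGT", "+", "IIIIII#!",
    ]
    for read in quality_control(sample):
        print(read.name, read.sequence)
```

In this toy input, read `r2` is reduced to two bases by trimming and is then discarded by the length filter, while `r3` loses only its two low-quality trailing bases and is kept.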
