MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud
This is a pre-copyedited, author-produced version of an article accepted for publication in Bioinformatics following peer review. The version of record (Roberto R. Expósito, Jorge Veiga, Jorge González-Domínguez, Juan Touriño; MarDRe: efficient MapReduce-based removal of duplicate DNA reads in the cloud, Bioinformatics, Volume 33, Issue 17, 1 September 2017, Pages 2762–2764) is available online at: https://doi.org/10.1093/bioinformatics/btx307
[Abstract] This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing. On a 16-node cluster deployed on the Amazon EC2 cloud platform, MarDRe is up to 8.52 times faster than a representative state-of-the-art tool.
Ministerio de Economía y Competitividad; TIN2016-75845-P
Ministerio de Educación; FPU014/0280
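The core MapReduce idea behind such deduplication can be sketched as follows: map each read to a key derived from its sequence so that duplicates collide in the same reduce group, then emit one representative per group. The sketch below is illustrative only, not MarDRe's actual implementation; the class names, the tab-separated single-line read layout, and keeping the first read of each group are assumptions (MarDRe also handles paired-end reads, near-duplicates and quality-aware selection, all omitted here, and a job driver would still be needed to run it):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Hypothetical mapper: keys each read record by its nucleotide sequence. */
class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text record, Context ctx)
      throws IOException, InterruptedException {
    // Assumed input layout: one read per line, "id<TAB>sequence<TAB>quality".
    String[] fields = record.toString().split("\t");
    ctx.write(new Text(fields[1]), record); // duplicates share the same key
  }
}

/** Hypothetical reducer: keeps a single representative per duplicate group. */
class DedupReducer extends Reducer<Text, Text, NullWritable, Text> {
  @Override
  protected void reduce(Text sequence, Iterable<Text> reads, Context ctx)
      throws IOException, InterruptedException {
    for (Text read : reads) {        // emit only the first read of the group;
      ctx.write(NullWritable.get(), read);
      break;                         // a real tool would pick the best-quality one
    }
  }
}
```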
Why democratize bioinformatics?
Network bioinformatics and web-based data collection instruments have the capacity to improve
the efficiency of the UK’s appropriately high levels of investment into cardiovascular research.
A very large proportion of scientific data falls into the long-tail of the cardiovascular research
distribution curve, with numerous small independent research efforts yielding a rich variety of
specialty data sets. The merging of such myriad datasets and the eradication of data silos, plus
linkage with outcomes, could be greatly facilitated through the provision of a national set of
standardised data collection instruments—a shared-cardioinformatics library of tools designed
by and for clinical academics active in the long-tail of biomedical research. Across the
cardiovascular research domain, like the rest of medicine, the national aggregation and
democratization of diverse long-tail data is the best way to convert numerous small but
expensive cohort data sources into big data, expanding our knowledge-base, breaking down
translational barriers, improving research efficiency and, with time, improving patient outcomes.
PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark
Large-scale data processing techniques, commonly known as Big Data, are used to manage the huge amounts of data generated by sequencers. Although these techniques have significant advantages, few biological applications have adopted them. In bioinformatics, Multiple Sequence Alignment (MSA) tools are widely applied for evolutionary and phylogenetic analysis, homology and domain structure prediction. Highly rated MSA tools, such as MAFFT, ProbCons and T-Coffee (TC), use probabilistic consistency as a step prior to the progressive alignment stage in order to improve the final accuracy. In this paper, a novel approach named PPCAS (Probabilistic Pairwise model for Consistency-based multiple alignment in Apache Spark) is presented. PPCAS is based on the MapReduce processing paradigm, enabling large datasets to be processed and improving the performance and scalability of the original algorithm.
This work was supported by MEyC-Spain [contract TIN2014-53234-C2-2-R]
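To make the parallelization concrete, here is a minimal sketch of how pairwise computations over all sequence pairs can be distributed with Spark's Java API. It is not PPCAS's actual code: the one-sequence-per-line input format and the pairScore placeholder (a simple identity fraction standing in for the probabilistic-consistency posteriors PPCAS computes) are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

public class PairwiseSketch {
  /** Placeholder score: fraction of identical positions (stands in for the
      pairwise posterior probabilities used by consistency-based aligners). */
  static double pairScore(String a, String b) {
    int n = Math.min(a.length(), b.length()), match = 0;
    for (int i = 0; i < n; i++) if (a.charAt(i) == b.charAt(i)) match++;
    return n == 0 ? 0.0 : (double) match / n;
  }

  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("PairwiseConsistencySketch");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Assumed input: one sequence per line, small enough to broadcast.
      List<String> seqs = sc.textFile(args[0]).collect();
      Broadcast<List<String>> shared = sc.broadcast(seqs);

      // Enumerate all unordered pairs (i, j); this O(n^2) pair space is
      // what makes distributing the consistency stage worthwhile.
      List<Tuple2<Integer, Integer>> pairs = new ArrayList<>();
      for (int i = 0; i < seqs.size(); i++)
        for (int j = i + 1; j < seqs.size(); j++)
          pairs.add(new Tuple2<>(i, j));

      // Score every pair in parallel across the cluster.
      JavaRDD<String> scores = sc.parallelize(pairs).map(p -> {
        List<String> s = shared.value();
        return p._1() + "," + p._2() + ","
            + pairScore(s.get(p._1()), s.get(p._2()));
      });
      scores.saveAsTextFile(args[1]);
    }
  }
}
```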
MapReduce Algorithms for Inferring Gene Regulatory Networks from Time-Series Microarray Data Using an Information-Theoretic Approach
Gene regulation is a series of processes that control gene expression and its
extent. The connections among genes and their regulatory molecules, usually
transcription factors, and a descriptive model of such connections, are known
as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand
the inner workings of the cell and the complexity of gene interactions. To
date, numerous algorithms have been developed to infer gene regulatory
networks. However, as the number of identified genes increases and the
complexity of their interactions is uncovered, networks and their regulatory
mechanisms become cumbersome to test. Furthermore, sifting through
experimental results requires an enormous amount of computation, resulting in
slow data processing. Therefore, new approaches are needed to expeditiously
analyze copious amounts of experimental data resulting from cellular GRNs. To
meet this need, cloud computing has shown promise, as reported in the literature.
Here we propose new MapReduce algorithms for inferring gene regulatory networks
on a Hadoop cluster in a cloud environment. These algorithms employ an
information-theoretic approach to infer GRNs using time-series microarray data.
Experimental results show that our MapReduce program is much faster than an
existing tool while achieving slightly better prediction accuracy.
Comment: 19 pages, 5 figures
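The information-theoretic core of such methods is an estimate of mutual information between gene expression profiles. The abstract does not give the exact estimator, so the following is a minimal sketch assuming equal-width binning of the time-series values; in the MapReduce setting, each map task would apply a function like this to a batch of gene pairs, and high-scoring pairs would be reported as candidate regulatory edges:

```java
import java.util.Arrays;

/** Minimal binned mutual-information estimator for two expression profiles
    of equal length (an assumption; the paper's estimator may differ). */
public class MutualInfo {
  /** Map each value to an equal-width bin index in [0, bins). */
  static int[] discretize(double[] v, int bins) {
    double min = Arrays.stream(v).min().getAsDouble();
    double max = Arrays.stream(v).max().getAsDouble();
    double width = (max - min) / bins;
    int[] out = new int[v.length];
    for (int i = 0; i < v.length; i++)
      out[i] = (width == 0) ? 0 : Math.min(bins - 1, (int) ((v[i] - min) / width));
    return out;
  }

  /** MI in nats: sum over bins of p(x,y) * log(p(x,y) / (p(x) p(y))). */
  static double mutualInformation(double[] x, double[] y, int bins) {
    int n = x.length;
    int[] bx = discretize(x, bins), by = discretize(y, bins);
    double[][] pxy = new double[bins][bins];
    double[] px = new double[bins], py = new double[bins];
    for (int i = 0; i < n; i++) {
      pxy[bx[i]][by[i]] += 1.0 / n;
      px[bx[i]] += 1.0 / n;
      py[by[i]] += 1.0 / n;
    }
    double mi = 0.0;
    for (int a = 0; a < bins; a++)
      for (int b = 0; b < bins; b++)
        if (pxy[a][b] > 0)
          mi += pxy[a][b] * Math.log(pxy[a][b] / (px[a] * py[b]));
    return mi;
  }
}
```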
Data analytics with MapReduce in Apache Spark and Hadoop systems
MapReduce comes from a traditional problem-solving method: splitting a big problem into small parts and solving each of them. With the goal of computing larger datasets more efficiently and cheaply, this idea has been implemented as a programming model for dealing with massive quantities of data. Users write a map function that abstracts the dataset into logical key/value pairs, and then a reduce function that groups all values sharing the same key (the canonical word-count example is sketched below). With this model, a job can be automatically spread across clusters built from many ordinary computers. MapReduce programs can be implemented easily and achieve much better efficiency than traditional computing programs. In this paper, several sample programs and one GRN detection algorithm program are studied.
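The canonical illustration of this key/value model is word count, shown below using Hadoop's Java API (the standard textbook example, not code from the paper; the job driver is omitted): the map function emits a (word, 1) pair per token, and the reduce function sums the values grouped under each key.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map phase: abstract each input line into (word, 1) key/value pairs. */
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(line.toString());
    while (tok.hasMoreTokens())
      ctx.write(new Text(tok.nextToken()), ONE);
  }
}

/** Reduce phase: all values sharing a key arrive together; sum them. */
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));
  }
}
```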
Detecting gene regulatory networks (GRNs), the connections between genes and their regulatory molecules, is one of the main subjects in understanding gene biology. Although algorithms have been developed for this task, the growing number of genes and the complexity of their interactions make processing ever harder and slower. The MapReduce model, with its parallel computation, is one way to overcome these problems. In this paper, a well-defined framework for parallelizing a mutual information algorithm is presented. The experimental results show the performance improvement gained by the parallel MapReduce model.
GenoMetric Query Language: A novel approach to large-scale genomic data management
Motivation: Improvements in sequencing technologies and data processing pipelines are rapidly providing sequencing data, with associated high-level features, for many individual genomes in multiple biological and clinical conditions. These data allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art ‘big data’ computing strategies, with abstraction levels beyond available tool capabilities.
Results: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic ‘big data’ analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata and interoperability between many data formats. Based on the Hadoop framework and the Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets.
Availability and implementation: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
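GMQL's data model couples region-level data with sample metadata, and its genometric operations reason about coordinates, overlap and distance between regions. The following Java sketch of the region part of that model is purely illustrative; the class, field and method names are assumptions, not GMQL's actual implementation:

```java
/** Illustrative sketch of a genomic region as in GMQL's data model:
    coordinates plus strand; sample metadata would be attached separately. */
final class Region {
  final String chr;
  final long start, stop;   // assumed 0-based, half-open [start, stop)
  final char strand;        // '+', '-' or '*' (either strand)

  Region(String chr, long start, long stop, char strand) {
    this.chr = chr; this.start = start; this.stop = stop; this.strand = strand;
  }

  /** Overlap test of the kind a genometric JOIN predicate evaluates. */
  boolean overlaps(Region other) {
    return chr.equals(other.chr) && start < other.stop && other.start < stop;
  }

  /** Illustrative genometric distance: 0 for overlapping regions, else the
      gap between the closest ends; regions on different chromosomes are
      treated as incomparable. */
  long distance(Region other) {
    if (!chr.equals(other.chr)) return Long.MAX_VALUE;
    if (overlaps(other)) return 0;
    return Math.max(start, other.start) - Math.min(stop, other.stop);
  }
}
```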
Portability of Scientific Workflows in NGS Data Analysis: A Case Study
The analysis of next-generation sequencing (NGS) data requires complex
computational workflows consisting of dozens of autonomously developed yet
interdependent processing steps. Whenever large amounts of data need to be
processed, these workflows must be executed on parallel and/or distributed
systems to ensure reasonable runtimes. Porting a workflow developed for a
particular system on a particular hardware infrastructure to another system or
to another infrastructure is non-trivial, which poses a major impediment to the
scientific necessities of workflow reproducibility and workflow reusability. In
this work, we describe our efforts to port a state-of-the-art workflow for the
detection of specific variants in whole-exome sequencing of mice. The workflow
was originally developed in the scientific workflow system Snakemake for
execution on a high-performance cluster controlled by Sun Grid Engine. In the
project, we ported it to the scientific workflow system SaasFee, which can
execute workflows on (multi-core) stand-alone servers or on clusters of
arbitrary size using Hadoop. The purpose of this port was to enable owners
of low-cost hardware infrastructures, for which Hadoop was designed, to use
the workflow as well. Although both the source and the target system are
called scientific workflow systems, they differ in numerous aspects, ranging
from the workflow languages to the scheduling mechanisms and the file access
interfaces. These differences resulted in various problems, some expected and
others unexpected, that had to be resolved before the workflow could run with
equal semantics. As a side-effect, we also report cost/runtime ratios for a
state-of-the-art NGS workflow on very different hardware platforms: a
comparatively cheap stand-alone server (80 threads), a mid-cost, mid-sized cluster
(552 threads), and a high-end HPC system (3784 threads).