Distributed gene clinical decision support system based on cloud computing
Background: A clinical decision support system can effectively overcome the limitations of an individual doctor's knowledge and reduce the possibility of misdiagnosis, thereby enhancing health care. Traditional genetic data storage and analysis methods based on stand-alone environments have limited scalability and struggle to meet the computational requirements imposed by rapid genetic data growth.
Methods: In this paper, we propose a distributed gene clinical decision support system named GCDSS and implement a prototype based on cloud computing technology. We also present CloudBWA, a novel distributed read mapping algorithm that leverages a batch processing strategy to map reads on Apache Spark.
Results: Experiments show that GCDSS and the distributed read mapping algorithm CloudBWA deliver outstanding performance and excellent scalability. Compared with state-of-the-art distributed algorithms, CloudBWA achieves up to 2.63 times speedup over SparkBWA. Compared with stand-alone algorithms, CloudBWA with 16 cores achieves up to 11.59 times speedup over BWA-MEM with 1 core.
Conclusions: GCDSS is a distributed gene clinical decision support system based on cloud computing techniques. In particular, GCDSS incorporates a distributed genetic data analysis pipeline framework. To accelerate data processing, we propose CloudBWA, a novel distributed read mapping algorithm that leverages a batch processing technique in the mapping stage on the Apache Spark platform.
Keywords: Clinical decision support system, Cloud computing, Spark, Alluxio, Genetic data analysis, Read mapping
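The batch processing strategy behind CloudBWA, grouping reads so that each mapper invocation handles many reads at once rather than paying per-read overhead, can be sketched as follows. This is an illustrative sketch only, not CloudBWA's actual code: `batch_reads` and `map_batch` are hypothetical stand-ins for Spark partitions feeding a BWA-style aligner.

```python
from typing import Iterable, Iterator, List, Tuple

def batch_reads(reads: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group reads into fixed-size batches so one mapper call handles many reads."""
    batch = []
    for read in reads:
        batch.append(read)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

def map_batch(batch: List[str]) -> List[Tuple[str, int]]:
    # Hypothetical stand-in for invoking an aligner (e.g., BWA-MEM) on a
    # whole batch; here each read just gets a dummy alignment position.
    return [(read, len(read)) for read in batch]

reads = [f"READ{i}" for i in range(10)]
alignments = [aln for b in batch_reads(reads, 4) for aln in map_batch(b)]
```

In a Spark setting, the same idea would typically be expressed with `mapPartitions`, so each partition is aligned in one external-process call.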
BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments
Advances in sequencing techniques have led to exponential growth in
biological data, demanding the development of large-scale bioinformatics
experiments. Because these experiments are computation- and data-intensive,
they require high-performance computing (HPC) techniques and can benefit from
specialized technologies such as Scientific Workflow Management Systems (SWfMS)
and databases. In this work, we present BioWorkbench, a framework for managing
and analyzing bioinformatics experiments. This framework automatically collects
provenance data, including both performance data from workflow execution and
data from the scientific domain of the workflow application. Provenance data
can be analyzed through a web application that abstracts a set of queries to
the provenance database, simplifying access to provenance information. We
evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree
assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a
RASopathy analysis workflow. We analyze each workflow from both computational
and scientific domain perspectives, by using queries to a provenance and
annotation database. Some of these queries are available as a pre-built feature
of the BioWorkbench web application. Through the provenance data, we show that
the framework is scalable and achieves high performance, reducing the case
studies' execution time by up to 98%. We also show how the application of
machine learning techniques can enrich the analysis process.
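The kind of provenance query the web application abstracts can be illustrated with a minimal relational sketch. The schema, table name, and sample rows below are hypothetical assumptions for illustration; BioWorkbench's actual provenance database is not shown here.

```python
import sqlite3

# Hypothetical provenance schema: one row per task execution.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE task_provenance (
    workflow TEXT, task TEXT, start_s REAL, end_s REAL)""")
rows = [
    ("SwiftPhylo", "align",   0.0, 120.5),
    ("SwiftPhylo", "tree",  120.5, 300.0),
    ("SwiftGECKO", "compare", 0.0,  80.0),
]
conn.executemany("INSERT INTO task_provenance VALUES (?, ?, ?, ?)", rows)

# Aggregate query of the kind a provenance dashboard would expose:
# total runtime per workflow.
totals = dict(conn.execute(
    "SELECT workflow, SUM(end_s - start_s) FROM task_provenance GROUP BY workflow"
))
```

Pre-built queries like this one are what let users inspect performance without writing SQL against the provenance store directly.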
Simulation of the performance of complex data-intensive workflows
PhD Thesis
Recently, cloud computing has been used for analytical and data-intensive processes
as it offers many attractive features, including resource pooling, on-demand capability
and rapid elasticity. Scientific workflows use these features to tackle the problems of
complex data-intensive applications. Data-intensive workflows are composed of many
tasks that may involve large input data sets and produce large amounts of data as
output, and they typically run in highly dynamic environments. Resources should
therefore be allocated dynamically as the workflow's demand changes, since
over-provisioning increases cost and under-provisioning causes Service Level
Agreement (SLA) violations and poor Quality of Service (QoS). Performance prediction
of complex workflows is a necessary step prior to their deployment.
Performance analysis of complex data-intensive workflows is a challenging task due
to the complexity of their structure, the diversity of big data, and data dependencies,
as well as the need to examine the performance of, and the challenges associated
with, running these workflows in a real cloud.
In this thesis, a solution is explored to address these challenges, using a Next Generation
Sequencing (NGS) workflow pipeline as a case study, which may require hundreds or
thousands of CPU hours to process a terabyte of data. We propose a methodology to
model, simulate and predict runtime and the number of resources used by the complex
data-intensive workflows. One contribution of our simulation methodology is that it
provides the ability to extract the simulation parameters (e.g., MIPS and BW values)
required for constructing a training set, and it yields fairly accurate run-time
predictions for cluster sizes much larger than those used to train the prediction
model. The proposed methodology permits the derivation of run-time predictions
based on historical data from the provenance files. We present run-time predictions
of the complex workflow under different scenarios of running it in the cloud, such
as execution failure and library deployment time. In the case of failure, the
framework can apply the prediction partially, considering only the successful parts
of the pipeline; in the latter case, the framework can predict with or without
considering the time to deploy libraries. To further improve the accuracy of
prediction, we propose a simulation model that handles I/O contention.
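Deriving a run-time prediction from historical provenance data can be sketched minimally as a least-squares fit, assuming runtime scales roughly with per-core input size. The history tuples and the single-feature linear model below are illustrative assumptions, not the thesis's actual simulation model.

```python
from typing import List, Tuple

def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical historical provenance: (input GB, cores, observed runtime hours).
history = [(100, 16, 6.5), (200, 16, 12.8), (200, 32, 6.6), (400, 32, 13.0)]
xs = [gb / cores for gb, cores, _ in history]  # single feature: GB per core
ys = [t for *_, t in history]
slope, intercept = fit_line(xs, ys)

def predict(gb: float, cores: int) -> float:
    """Predict runtime for a cluster size possibly larger than any in training."""
    return slope * (gb / cores) + intercept
```

The point of the sketch is the extrapolation step: the model is trained on small clusters but queried for larger ones, which is where simulation-derived parameters help keep predictions accurate.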
HSRA: Hadoop-based spliced read aligner for RNA sequencing data
[Abstract] Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is considered a crucial step that remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user's guide are publicly available for download at http://hsra.dec.udc.es.
Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P
Xunta de Galicia; ED431G/0
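The MapReduce decomposition HSRA relies on can be illustrated with a small, self-contained sketch: FASTQ records are partitioned across map tasks, each of which would invoke the underlying aligner on its split. The round-robin partitioning and the dummy SAM output below are hypothetical stand-ins for Hadoop input splits and HISAT2 invocations, not HSRA's actual implementation.

```python
from typing import Iterator, List

def fastq_records(lines: List[str]) -> Iterator[List[str]]:
    """Yield 4-line FASTQ records (header, sequence, '+', quality)."""
    for i in range(0, len(lines), 4):
        yield lines[i:i + 4]

def partition(records: List[List[str]], n_splits: int) -> List[List[List[str]]]:
    """Round-robin partition of records across map tasks."""
    splits: List[List[List[str]]] = [[] for _ in range(n_splits)]
    for i, rec in enumerate(records):
        splits[i % n_splits].append(rec)
    return splits

def map_task(records: List[List[str]]) -> List[str]:
    # Stand-in for running the underlying aligner on one split;
    # emits one dummy SAM-like line per read (name, flag, reference).
    return [f"{rec[0][1:]}\t0\tchr1" for rec in records]

lines: List[str] = []
for i in range(6):
    lines += [f"@read{i}", "ACGT", "+", "IIII"]
splits = partition(list(fastq_records(lines)), 3)
sam = [line for s in splits for line in map_task(s)]
```

Keeping each 4-line record intact within a split is the essential invariant; a naive byte-offset split of a FASTQ file would tear records apart, which is one of the input-handling problems distributed mappers must solve.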