Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools
Scalable and efficient processing of genome sequence data, e.g. for variant discovery, is key to the mainstream adoption of high-throughput technology for disease prevention and for clinical use. Achieving scalability, however, requires a significant effort to enable the parallel execution of the analysis tools that make up the pipelines. This is facilitated by the new Spark versions of the well-known GATK toolkit, which offer a black-box approach by transparently exploiting the underlying MapReduce architecture. In this paper we report on our experience implementing a standard variant discovery pipeline using GATK 4.0 with Docker-based deployment over a cluster. We provide a preliminary performance analysis, comparing the processing times and cost to those of the new Microsoft Genomics Services.
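As a rough illustration of how the Spark versions of the GATK tools are driven, the sketch below launches one GATK 4 Spark tool against an existing Spark cluster from Python. The reference, input and output paths, the tool choice and the Spark master URL are placeholders rather than the configuration used in the paper.

    import subprocess

    # Minimal sketch: run a single GATK 4 Spark tool on an existing Spark cluster.
    # Paths, the chosen tool and the master URL are placeholders, not the paper's setup.
    cmd = [
        "gatk", "HaplotypeCallerSpark",
        "-R", "reference.fasta",
        "-I", "sample.bam",
        "-O", "sample.g.vcf",
        "--",                                    # arguments after "--" are passed to Spark itself
        "--spark-runner", "SPARK",
        "--spark-master", "spark://spark-master:7077",
    ]
    subprocess.run(cmd, check=True)              # raises CalledProcessError if the tool fails

The same pattern applies to the other Spark-enabled tools in such a pipeline: GATK turns the run into Spark jobs on the cluster, which is what the abstract describes as the black-box use of the underlying MapReduce-style execution.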
Portability of Scientific Workflows in NGS Data Analysis: A Case Study
The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtimes. Porting a workflow developed for a particular system on a particular hardware infrastructure to another system or another infrastructure is non-trivial, which poses a major impediment to the scientific necessities of workflow reproducibility and workflow reusability. In this work, we describe our efforts to port a state-of-the-art workflow for the detection of specific variants in whole-exome sequencing of mice. The workflow was originally developed in the scientific workflow system Snakemake for execution on a high-performance cluster controlled by Sun Grid Engine. In the project, we ported it to the scientific workflow system SaasFee, which can execute workflows on (multi-core) stand-alone servers or on clusters of arbitrary size using Hadoop. The purpose of this port was to enable owners of the low-cost hardware infrastructures for which Hadoop was designed to use the workflow as well. Although both the source and the target system are called scientific workflow systems, they differ in numerous aspects, ranging from the workflow languages to the scheduling mechanisms and the file access interfaces. These differences resulted in various problems, some expected and many unexpected, that had to be resolved before the workflow could be run with equal semantics. As a side effect, we also report cost/runtime ratios for a state-of-the-art NGS workflow on very different hardware platforms: a comparatively cheap stand-alone server (80 threads), a mid-cost, mid-sized cluster (552 threads), and a high-end HPC system (3784 threads).
Programming Languages for Data-Intensive HPC Applications: a Systematic Mapping Study
A major challenge in modelling and simulation is the need to combine expertise in both software technologies and a given scientific domain. When High-Performance Computing (HPC) is required to solve a scientific problem, software development becomes a problematic issue. Considering the complexity of software for HPC, it is useful to identify programming languages that can be used to alleviate this issue. Because the existing literature on the topic of HPC is very dispersed, we performed a Systematic Mapping Study (SMS) in the context of the European COST Action cHiPSet. This literature study maps characteristics of various programming languages for data-intensive HPC applications, including category, typical user profiles, effectiveness, and type of articles. We organised the SMS in two phases. In the first phase, relevant articles were identified through an automated keyword-based search in eight digital libraries. This led to an initial sample of 420 papers, which was then narrowed down in a second phase by human inspection of article abstracts, titles and keywords to 152 relevant articles published in the period 2006–2018. The analysis of these articles enabled us to identify 26 programming languages referred to in 33 of the relevant articles. We compared the outcome of the mapping study with the results of our questionnaire-based survey that involved 57 HPC experts. The mapping study and the survey revealed that the desired features of programming languages for data-intensive HPC applications are portability, performance and usability. Furthermore, we observed that the majority of the programming languages used in the context of data-intensive HPC applications are text-based general-purpose programming languages. Typically these have a steep learning curve, which makes them difficult to adopt. We believe that the outcome of this study will inspire future research and development in programming languages for data-intensive HPC applications.
Additional co-authors: Sabri Pllana, Ana Respício, José Simão, Luís Veiga, Ari Vis
Scientific Workflows: Moving Across Paradigms
Modern scientific collaborations have opened up the opportunity to solve complex problems that require both multidisciplinary expertise and large-scale computational experiments. These experiments typically consist of a sequence of processing steps that need to be executed on selected computing platforms. Execution poses a challenge, however, due to (1) the complexity and diversity of applications, (2) the diversity of analysis goals, (3) the heterogeneity of computing platforms, and (4) the volume and distribution of data. A common strategy to make these in silico experiments more manageable is to model them as workflows and to use a workflow management system to organize their execution. This article looks at the overall challenge posed by a new order of scientific experiments and the systems they need to be run on, and examines how this challenge can be addressed by workflows and workflow management systems. It proposes a taxonomy of workflow management system (WMS) characteristics, including aspects previously overlooked. This frames a review of prevalent WMSs used by the scientific community, elucidates their evolution to handle the challenges arising with the emergence of the “fourth paradigm,” and identifies the research needed to maintain progress in this area.
Scalable and efficient whole-exome data processing using workflows on the cloud
Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications, as an attractive alternative to low-level scripting. At the same time, workflow management systems (WFMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, common for instance in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming. In this paper we present our experience in porting a genomics data processing pipeline from an existing scripted implementation deployed on a closed HPC cluster to a workflow-based design deployed on the Microsoft Azure public cloud. We draw two contrasting and general conclusions from this project. On the positive side, we show that our solution, based on the e-Science Central WFMS and deployed in the cloud, clearly outperforms the original HPC-based implementation, achieving up to 2.3× speed-up. However, we also show that delivering such performance requires optimising the workflow deployment model to best suit the characteristics of the cloud computing infrastructure. The main reason for the performance gains was the availability of fast, node-local SSD disks delivered by D-series Azure VMs, combined with the implicit use of local disk resources by the e-Science Central workflow engines. These conclusions suggest that, on parallel Big Data problems, it is important to couple an understanding of the cloud computing architecture and its software stack with simplicity of design, and that further efforts in automating the parallelisation of complex pipelines are required.
Automatic deployment and reproducibility of workflow on the Cloud using container virtualization
PhD ThesisCloud computing is a service-oriented approach to distributed computing that has
many attractive features, including on-demand access to large compute resources. One
type of cloud applications are scientific work
ows, which are playing an increasingly
important role in building applications from heterogeneous components. Work
ows are
increasingly used in science as a means to capture, share, and publish computational
analysis. Clouds can offer a number of benefits to work
ow systems, including the
dynamic provisioning of the resources needed for computation and storage, which has
the potential to dramatically increase the ability to quickly extract new results from
the huge amounts of data now being collected.
However, there are increasing number of Cloud computing platforms, each with different
functionality and interfaces. It therefore becomes increasingly challenging to
de ne work
ows in a portable way so that they can be run reliably on different clouds.
As a consequence, work
ow developers face the problem of deciding which Cloud to
select and - more importantly for the long-term - how to avoid vendor lock-in.
A further issue that has arisen with work
ows is that it is common for them to stop
being executable a relatively short time after they were created. This can be due to
the external resources required to execute a work
ow - such as data and services -
becoming unavailable. It can also be caused by changes in the execution environment
on which the work
ow depends, such as changes to a library causing an error when a
work
ow service is executed. This "work
ow decay" issue is recognised as an impediment
to the reuse of work
ows and the reproducibility of their results. It is becoming
a major problem, as the reproducibility of science is increasingly dependent on the
reproducibility of scientific work
ows.
In this thesis we presented new solutions to address these challenges. We propose a new
approach to work
ow modelling that offers a portable and re-usable description of the
work
ow using the TOSCA specification language. Our approach addresses portability
by allowing work
ow components to be systematically specifed and automatically
- v -
deployed on a range of clouds, or in local computing environments, using container
virtualisation techniques.
To address the issues of reproducibility and work
ow decay, our modelling and deployment
approach has also been integrated with source control and container management
techniques to create a new framework that e ciently supports dynamic work
ow deployment,
(re-)execution and reproducibility.
To improve deployment performance, we extend the framework with number of new
optimisation techniques, and evaluate their effect on a range of real and synthetic
work
ows.Ministry of Higher Education and
Scientific Research in Iraq and Mosul Universit
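To make the container-based deployment idea from the preceding abstract concrete, the sketch below runs a single workflow step inside a Docker container using the Docker SDK for Python. It is a minimal illustration of the general technique only, not the thesis's actual TOSCA-based framework; the image name, command and host paths are hypothetical placeholders.

    import docker

    # Minimal sketch: execute one workflow step inside a container.
    # Image, command and host paths are hypothetical placeholders.
    client = docker.from_env()

    container = client.containers.run(
        image="biocontainers/bwa:v0.7.17_cv1",          # hypothetical tool image, pinned to a tag
        command="bwa mem /data/ref.fa /data/reads.fq",  # hypothetical step command
        volumes={"/scratch/run42": {"bind": "/data", "mode": "rw"}},
        detach=True,
    )

    result = container.wait()            # block until the step finishes
    print(container.logs().decode())     # keep the step's output for provenance
    container.remove()
    assert result["StatusCode"] == 0, "workflow step failed"

Pinning each step to a specific container image tag is what makes the step repeatable on any cloud or local machine that can run containers, which is the property the thesis pursues at the level of whole workflows.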
Linguagens para a Computação de Alto Desempenho, utilizadas no processamento de Big Data: Um Estudo de Mapeamento Sistemático (Languages for High-Performance Computing Used in Big Data Processing: A Systematic Mapping Study)
Big Data refers to information assets of high Volume, Velocity and/or Variety that demand innovative and cost-effective forms of processing, enabling better insight, decision making and process automation.
Since 2002, the rate of performance improvement of single processors has dropped sharply. To increase processor power, multiple cores were placed on a single chip and run in parallel. To benefit from this kind of architecture, sequential programs have to be rewritten. The goal of High-Performance Computing (HPC) is to study the methodologies and techniques that allow these architectures to be exploited. The challenge is the need to combine HPC software development with the management and analysis of Big Data. When parallel and distributed computing is mandatory, writing the code becomes harder, so it is necessary to know which languages can be used to ease that task.
Because the existing literature on the topic of HPC is very dispersed, a Systematic Mapping Study (SMS) was conducted, aggregating characteristics of the different languages found for Big Data processing (category; nature; typical user profiles; effectiveness; types of articles published in the area), to help students, researchers and other professionals who need an introduction or an overview of this topic.
Articles were retrieved by an automated keyword-based search of the databases of 8 selected digital libraries. This process yielded an initial sample of 420 articles, which was narrowed down to 152 articles published between January 2006 and March 2018. Manual analysis of these articles allowed us to identify 26 languages across the 33 included publications. The information was summarised and compared with the opinions of practitioners. The results indicated that most of these languages are General-Purpose Languages (GPLs) rather than Domain-Specific Languages (DSLs), which leads us to conclude that there is an opportunity for applied research into languages that make coding easier for domain experts.