Evaluation and optimization of Big Data Processing on High Performance Computing Systems
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01

[Abstract]
Nowadays, Big Data technologies are used by many organizations to extract valuable information from large-scale datasets. As the size of these datasets increases, meeting the huge performance requirements of data processing applications becomes more challenging. This Thesis focuses on evaluating and optimizing these applications by proposing two new tools, namely BDEv and Flame-MR. On the one hand, BDEv allows users to thoroughly assess the behavior of widespread Big Data processing frameworks such as Hadoop, Spark and Flink. It manages the configuration and deployment of the frameworks, generating the input datasets and launching the workloads specified by the user. During each workload, it automatically extracts several evaluation metrics that include performance, resource utilization, energy efficiency and microarchitectural behavior. On the other hand, Flame-MR optimizes the performance of existing Hadoop MapReduce applications. Its overall design is based on an event-driven architecture that improves the efficiency of the system resources by pipelining data movements and computation. Moreover, it avoids redundant memory copies present in Hadoop, while also using efficient sort and merge algorithms for data processing. Flame-MR replaces the underlying MapReduce data processing engine in a transparent way, so the source code of existing applications does not need to be modified. The performance benefits provided by Flame-MR have been thoroughly evaluated on cluster and cloud systems by using both standard benchmarks and real-world applications, showing reductions in execution time that range from 40% to 90%. This Thesis provides Big Data users with powerful tools to analyze and understand the behavior of data processing frameworks and reduce the execution time of their applications without requiring expert knowledge.
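The key optimization described above, overlapping data movement with computation, follows a general producer/consumer pattern. The sketch below is a minimal, hypothetical Python illustration of that pattern only, not Flame-MR's actual code; the chunk data, function names, and buffer size are made up for the example.

```python
import queue
import threading

def fetch_chunks(chunk_ids, out_q):
    """Producer: stands in for fetching map outputs over the network."""
    for cid in chunk_ids:
        data = [cid * 10 + i for i in range(5)]  # stand-in for remote I/O
        out_q.put(data)
    out_q.put(None)  # sentinel: no more chunks

def process_chunks(in_q, results):
    """Consumer: sorts each chunk while the next one is still in flight."""
    while True:
        chunk = in_q.get()
        if chunk is None:
            break
        results.extend(sorted(chunk))  # computation overlaps the fetch

q = queue.Queue(maxsize=4)  # bounded buffer applies backpressure
results = []
producer = threading.Thread(target=fetch_chunks, args=(range(8), q))
consumer = threading.Thread(target=process_chunks, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(len(results), "records merged")
```

Because the queue is bounded, the producer blocks once the consumer falls behind, which keeps memory use flat while still hiding communication latency behind computation.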
An evaluation of the Galaxy and Ruffus scripting workflow systems for DNA-seq analysis
Magister Scientiae - MSc

Functional genomics determines the biological functions of genes on a global scale by using large volumes of data obtained through techniques including next-generation sequencing (NGS). The application of NGS in biomedical research is gaining momentum, and with its adoption becoming more widespread, there is an increasing need for access to customizable computational workflows that can simplify, and offer access to, compute-intensive analyses of genomic data. In this study, analysis pipelines were designed and implemented in the Galaxy and Ruffus frameworks with a view to addressing the challenges faced in biomedical research. Galaxy, a graphical web-based framework, allows researchers to build a graphical NGS data analysis pipeline for accessible, reproducible, and collaborative data sharing. Ruffus, a UNIX command-line framework used by bioinformaticians as a Python library for writing scripts in an object-oriented style, allows a workflow to be built in terms of task dependencies and execution logic. A dual data analysis technique was explored, focusing on a comparative evaluation of the Galaxy and Ruffus frameworks as used for composing analysis pipelines. To this end, we developed an analysis pipeline in both Galaxy and Ruffus for the analysis of Mycobacterium tuberculosis sequence data. Preliminary analysis revealed that the pipeline in Galaxy displayed a higher percentage of load and store instructions, whereas the Ruffus pipelines tended to be CPU bound and memory intensive. CPU usage, memory utilization, and execution time are represented graphically in this study. Our evaluation suggests that workflow frameworks differ distinctly in features ranging from ease of use, flexibility, and portability to architectural design.
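To make the task-dependency style concrete, here is a minimal Ruffus pipeline sketch: each decorated function declares its inputs and outputs, and Ruffus derives the execution order from the dependency graph. The file names and task bodies are placeholders for illustration, not the study's actual M. tuberculosis pipeline.

```python
from ruffus import originate, transform, suffix, pipeline_run

@originate(["sample1.fastq", "sample2.fastq"])
def make_reads(output_file):
    # Placeholder input data; a real study would start from sequencer output.
    with open(output_file, "w") as f:
        f.write("@read1\nACGTACGT\n+\nIIIIIIII\n")

@transform(make_reads, suffix(".fastq"), ".bam")
def align(input_file, output_file):
    # Placeholder for the real alignment step (e.g. invoking bwa/samtools).
    with open(output_file, "w") as f:
        f.write("aligned: " + input_file)

# Ruffus resolves task dependencies and reruns only out-of-date tasks.
pipeline_run([align])
```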
Framework for Supporting Genomic Operations
Next Generation Sequencing (NGS) is a family of technologies for reading DNA or RNA, capable of producing whole genome sequences at impressive speed, which is revolutionizing both biological research and medical practice. In this exciting scenario, while a huge number of specialized bioinformatics programs extract information from sequences, there is an increasing need for a new generation of systems and frameworks capable of integrating such information, providing holistic answers to the needs of biologists and clinicians. To respond to this need, we developed GMQL, a new query language for genomic data management that operates on heterogeneous genomic datasets. In this paper, we focus on three domain-specific GMQL operations on genomic regions and describe their efficient implementation. The paper develops a theory of binning strategies as a generic approach to parallel execution of genomic operations, and then describes how binning is embedded into two efficient implementations of the operations using Flink and Spark, two emerging frameworks for data management on the cloud.
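The binning idea can be sketched in a few lines: each region is assigned to every fixed-width bin it touches, so an overlap join can be evaluated bin by bin (and hence in parallel), with duplicate pairs removed when two regions share more than one bin. The Python below is a generic, hypothetical illustration; the bin width, tuple layout, and deduplication approach are assumptions, not GMQL's Flink/Spark implementation.

```python
from collections import defaultdict

BIN_WIDTH = 1000  # illustrative fixed bin width

def bins_for(start, stop):
    """All bins that a half-open region [start, stop) touches."""
    return range(start // BIN_WIDTH, (stop - 1) // BIN_WIDTH + 1)

def binned_overlap_join(ref, exp):
    """Yield overlapping (ref, exp) region pairs, computed per bin."""
    index = defaultdict(list)
    for r in ref:
        for b in bins_for(r[0], r[1]):
            index[b].append(r)
    seen = set()
    for e in exp:
        for b in bins_for(e[0], e[1]):
            for r in index[b]:
                if r[0] < e[1] and e[0] < r[1] and (r, e) not in seen:
                    seen.add((r, e))  # dedupe pairs spanning several bins
                    yield r, e

ref = [(0, 1500), (2000, 2500)]
exp = [(900, 2100)]
print(list(binned_overlap_join(ref, exp)))
```

In a distributed setting each bin becomes an independent unit of work, which is what makes the strategy a natural fit for engines like Flink and Spark.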
Conceptual models and databases for searching the genome
Genomics is an extremely complex domain, in terms of concepts, their relations, and their representations in data. This tutorial introduces the use of ER models in the context of genomic systems: conceptual models are of great help for simplifying this domain and making it actionable. We carry out a review of successful models presented in the literature for representing biologically relevant entities and grounding them in databases. We draw a distinction between conceptual models that aim to explain the domain and conceptual models that aim to support database design and heterogeneous data integration. Genomic experiments and/or sequences are described by metadata, specifying information on the sampled organism, the technology used, and the organizational process behind the experiment. By contrast, we use the term data for the actual regions of the genome that have been read by sequencing technologies and encoded into a machine-readable representation. First, we show how data and metadata can be modeled; then we exploit the proposed models for designing search systems, visualizers, and analysis environments. Both the human genomics and viral genomics domains are addressed, surveying several use cases and applications of broader public interest. The tutorial is relevant to the EDBT community because it demonstrates the usefulness of conceptual modeling principles within very current domains; in addition, it offers a concrete example of conceptual models in use, setting the premises for interdisciplinary collaboration with a wider public (possibly including life science researchers).
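The metadata/data distinction drawn above can be made concrete with a toy schema. The Python sketch below is purely illustrative; the class and field names are assumptions and are not taken from any model surveyed in the tutorial.

```python
from dataclasses import dataclass

@dataclass
class ExperimentMetadata:
    """Describes how the data was produced (the 'metadata')."""
    sample_organism: str      # e.g. "Homo sapiens"
    technology: str           # e.g. "ChIP-seq"
    processing_pipeline: str  # organizational/processing provenance

@dataclass
class GenomicRegion:
    """One region actually read from the genome (the 'data')."""
    chromosome: str
    start: int
    stop: int
    strand: str

# Hypothetical usage: one metadata record describes many region records.
experiment = ExperimentMetadata("Homo sapiens", "ChIP-seq", "uniform pipeline")
region = GenomicRegion("chr1", 10_000, 10_468, "+")
```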