
    Mario. A system for iterative and interactive processing of biological data

    This thesis addresses challenges in metagenomic data processing on clusters of computers, in particular the need for interactive response times during development, debugging, and tuning of data processing pipelines. Typical metagenomics pipelines batch-process data, with execution times ranging from hours to months, which makes configuration and tuning time-consuming and impractical. We analyzed the data usage of metagenomic pipelines, including a visualization frontend, to develop an approach that uses an online, data-parallel processing model in which changes to the pipeline configuration are quickly reflected in updated pipeline output available to the user. We describe the design and implementation of the Mario system that realizes this approach. Mario is a distributed system built on top of the HBase storage system that provides data processing using commonly used bioinformatics applications, interactive tuning, automatic parallelization, and data provenance support. We evaluate Mario and its underlying storage system, HBase, using a benchmark developed to simulate I/O loads representative of biological data processing. The results show that Mario adds less than 100 milliseconds to the end-to-end latency of processing one item of data. This low latency, combined with Mario's storage of all intermediate data generated by the processing, enables easy parameter tuning. In addition to improved interactivity, Mario offers integrated data provenance by storing detailed pipeline configurations associated with the data. The evaluation demonstrates that Mario can be used to achieve more interactivity in the configuration of pipelines for processing biological data. We believe that biology researchers can take advantage of this interactivity to perform better parameter tuning, which may lead to more accurate analyses and, ultimately, to new scientific discoveries.
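    As a concrete illustration of the design the abstract describes, the hedged sketch below caches each stage's output in HBase under a row key derived from the input item and the stage's configuration, so that changing a parameter recomputes only the affected stages while earlier intermediate results are reused. The table name, column family, and key scheme are invented for illustration and are not Mario's actual schema; only standard HBase client calls are used.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of Mario-style intermediate-result caching.
 * Each stage output is stored under a row key derived from the input
 * item and the stage configuration, so a configuration change only
 * recomputes the stages it affects. Table and column names are
 * invented; this is not Mario's actual schema.
 */
public class StageCache {

    private static final byte[] FAMILY = Bytes.toBytes("d");
    private static final byte[] OUTPUT = Bytes.toBytes("out");
    private static final byte[] CONFIG = Bytes.toBytes("config");

    private final Table table;

    public StageCache(Connection connection) throws IOException {
        this.table = connection.getTable(TableName.valueOf("stage_cache"));
    }

    /** Row key = item id plus a hash of the stage's configuration string. */
    private byte[] rowKey(String itemId, String stageConfig) {
        return Bytes.toBytes(itemId + ":" + Integer.toHexString(stageConfig.hashCode()));
    }

    /** Returns the cached output for this (item, config) pair, or runs the stage. */
    public byte[] getOrCompute(String itemId, String stageConfig,
                               Supplier<byte[]> stage) throws IOException {
        byte[] key = rowKey(itemId, stageConfig);
        Result cached = table.get(new Get(key));
        byte[] value = cached.getValue(FAMILY, OUTPUT);
        if (value != null) {
            return value;            // configuration unchanged: reuse intermediate data
        }
        value = stage.get();         // configuration changed: recompute this stage only
        Put put = new Put(key);
        put.addColumn(FAMILY, OUTPUT, value);
        // Keeping the full configuration beside the output records provenance.
        put.addColumn(FAMILY, CONFIG, Bytes.toBytes(stageConfig));
        table.put(put);
        return value;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            StageCache cache = new StageCache(conn);
            byte[] out = cache.getOrCompute("read42", "trim --min-quality 30",
                    () -> Bytes.toBytes("trimmed-read"));
            System.out.println(Bytes.toString(out));
        }
    }
}
```

    Storing the generating configuration next to each output also gives a simple form of the provenance the abstract mentions: any stored item can be traced back to the exact pipeline settings that produced it.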

    Improving Pipelining Tools for Pre-processing Data

    The last several years have seen the emergence of data mining and its transformation into a powerful tool that adds value to business and research. Data mining makes it possible to explore and find unseen connections between variables and facts observed in different domains, helping us to better understand reality. The programming methods and frameworks used to analyse data have evolved over time. Currently, pipelining schemes are the most reliable way of analysing data, and for this reason several major companies offer services of this kind. Moreover, several frameworks compatible with different programming languages are available for the development of computational pipelines, and many research studies have addressed the optimization of data processing speed. However, as this study shows, early error detection techniques and developer support mechanisms are very limited in these frameworks. In this context, this study introduces several improvements: the design of different types of constraints for the early detection of errors, functions to facilitate the debugging of specific tasks included in a pipeline, the invalidation of erroneous instances, and the introduction of a burst-processing scheme. Adding these functionalities, we developed Big Data Pipelining for Java (BDP4J, https://github.com/sing-group/bdp4j), a fully functional new pipelining framework that shows the potential of these features.
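    The sketch below illustrates, in plain Java and deliberately not reproducing BDP4J's actual API, two of the features described above: declared input/output types that let the framework reject an ill-formed pipeline when it is assembled, before any data is processed, and an invalidation flag that lets later tasks skip erroneous instances.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not BDP4J's actual API) of two features from the
 * abstract: declared input/output types allow an ill-formed pipeline to
 * be rejected at assembly time, and erroneous instances can be
 * invalidated so the remaining tasks skip them.
 */
class Instance {
    Object data;
    boolean valid = true;                 // invalidation flag for erroneous data
    Instance(Object data) { this.data = data; }
}

abstract class Task {
    abstract Class<?> inputType();        // declared constraint on incoming data
    abstract Class<?> outputType();       // declared constraint on produced data
    abstract Instance process(Instance i);
}

class Pipeline {
    private final List<Task> tasks = new ArrayList<>();

    /** Early error detection: a type mismatch fails when the pipeline is built. */
    Pipeline add(Task t) {
        if (!tasks.isEmpty()) {
            Task prev = tasks.get(tasks.size() - 1);
            if (!prev.outputType().equals(t.inputType())) {
                throw new IllegalStateException(t.getClass().getSimpleName()
                        + " expects " + t.inputType().getSimpleName()
                        + " but receives " + prev.outputType().getSimpleName());
            }
        }
        tasks.add(t);
        return this;
    }

    /** Invalidated instances short-circuit past the remaining tasks. */
    Instance run(Instance i) {
        for (Task t : tasks) {
            if (!i.valid) break;
            i = t.process(i);
        }
        return i;
    }
}
```

    Checking the type chain at assembly time is what makes the error detection "early": a mis-ordered pipeline fails when it is built rather than hours into a run.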
