
    Mining Large Data Sets on Grids: Issues and Prospects

    When data mining and knowledge discovery techniques must be used to analyze large amounts of data, high-performance parallel and distributed computers can help to provide better computational performance and, as a consequence, deeper and more meaningful results. Recently, grids composed of large-scale, geographically distributed platforms working together have emerged as effective architectures for high-performance decentralized computation. It is natural to consider grids as tools for distributed data-intensive applications such as data mining, but the underlying patterns of computation and data movement in such applications are different from those of more conventional high-performance computation. These differences require a different kind of grid, or at least a grid with significantly different emphases. This paper discusses the main issues, requirements, and design approaches for the implementation of grid-based knowledge discovery systems. Furthermore, some prospects and promising research directions in data-centric and knowledge-discovery-oriented grids are outlined.

    Workflow Systems for Science: Concepts and Tools

    The wide availability of high-performance computing systems, Grids, and Clouds has allowed scientists and engineers to implement increasingly complex applications that access and process large data repositories and run scientific experiments in silico on distributed computing platforms. Most of these applications are designed as workflows that include data analysis, scientific computation methods, and complex simulation techniques. Scientific applications require tools and high-level mechanisms for designing and executing complex workflows. For this reason, in the past years, many efforts have been devoted to the development of distributed workflow management systems for scientific applications. This paper discusses basic concepts of scientific workflows and presents workflow system tools and frameworks used today for the implementation of applications in science and engineering on high-performance computers and distributed systems. In particular, the paper reports on a selection of workflow systems largely used for solving scientific problems and discusses some open issues and research challenges in the area.
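
    As a minimal illustration of the workflow concept discussed above (a generic Python sketch, not tied to any specific system surveyed in the paper; all task names are hypothetical), a small scientific workflow can be modeled as a DAG of dependent tasks and executed in dependency order:

        from graphlib import TopologicalSorter

        # Each task maps to the set of tasks it depends on.
        workflow = {
            "fetch_data": set(),
            "clean_data": {"fetch_data"},
            "run_simulation": {"clean_data"},
            "visualize": {"run_simulation"},
        }

        # Execute tasks respecting the dependency order of the DAG.
        for task in TopologicalSorter(workflow).static_order():
            print(f"executing {task}")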

    A Workflow-oriented Language for Scalable Data Analytics

    Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014), Porto (Portugal), August 27-28, 2014. Data in digital repositories are becoming ever more massive and distributed, so analyzing them requires efficient data analysis techniques and scalable storage and computing platforms. Cloud computing infrastructures offer effective support for addressing both the computational and data storage needs of big data mining and parallel knowledge discovery applications. In fact, complex data mining tasks involve data- and compute-intensive algorithms that require large and efficient storage facilities together with high-performance processors to get results in acceptable times. In this paper we describe the Data Mining Cloud Framework (DMCF), designed for developing and executing distributed data analytics applications as workflows of services. We also describe a workflow-oriented language, called JS4Cloud, to support the design and execution of script-based data analysis workflows on DMCF. We finally present a data analysis application developed with JS4Cloud and the scalability achieved executing it on DMCF. The work presented in this paper has been partially supported by the EU under COST programme Action IC1305, 'Network for Sustainable Ultrascale Computing (NESUS)'.
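
    To give a flavor of the script-based workflow idea (illustrative Python, not actual JS4Cloud syntax; the function and file names are hypothetical), a data-parallel analysis step over a set of input partitions could look as follows; an engine such as DMCF would detect the independent invocations and run them in parallel on the cloud:

        from concurrent.futures import ThreadPoolExecutor

        partitions = [f"part-{i}.csv" for i in range(4)]

        def classify(partition):
            # Stand-in for a data mining service invoked by the workflow engine.
            return f"model-for-{partition}"

        # Independent invocations over the partitions run concurrently,
        # mimicking the data parallelism a workflow engine would exploit.
        with ThreadPoolExecutor() as pool:
            models = list(pool.map(classify, partitions))
        print(models)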

    Peer-to-Peer Metadata Management for Knowledge Discovery Applications in Grids

    Computational Grids are powerful platforms gathering computational power and storage space from thousands of geographically distributed resources. The applications running on such platforms need to efficiently and reliably access the various and heterogeneous distributed resources they offer. This can be achieved by using metadata information describing all available resources. It is therefore crucial to provide efficient metadata management architectures and frameworks. In this paper we describe the design of a Grid metadata management service. We focus on a particular use case: the Knowledge Grid architecture, which provides high-level Grid services for distributed knowledge discovery applications. Taking advantage of an existing Grid data-sharing service, namely JuxMem, the proposed solution lies at the border between peer-to-peer systems and Web services.
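
    As a generic sketch of the metadata management concept (a hypothetical API, not the JuxMem or Knowledge Grid interface), resources can be described by metadata records that are published to and queried from a shared registry, which in the paper's design would be backed by a peer-to-peer data-sharing layer:

        # Stand-in for a distributed, peer-to-peer metadata store.
        registry = {}

        def publish(resource_id, metadata):
            registry[resource_id] = metadata

        def lookup(**criteria):
            # Return the ids of all resources whose metadata match the criteria.
            return [rid for rid, md in registry.items()
                    if all(md.get(k) == v for k, v in criteria.items())]

        publish("node42/dataset1", {"type": "dataset", "format": "arff", "size_mb": 300})
        publish("node7/classifier", {"type": "service", "task": "classification"})
        print(lookup(type="dataset"))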

    HeteroPar 2014, APCIE 2014, and TASUS 2014 Special Issue

    This is the editorial of the special issue of the HeteroPar 2014, APCIE 2014, and TASUS 2014 workshops.

    Using social media for sub-event detection during disasters

    Social media platforms have become fundamental tools for sharing information during natural disasters or catastrophic events. This paper presents SEDOM-DD (Sub-Events Detection on sOcial Media During Disasters), a new method that analyzes user posts to discover sub-events that occurred after a disaster (e.g., collapsed buildings, broken gas pipes, floods). SEDOM-DD has been evaluated with datasets of different sizes that contain real posts from social media related to different natural disasters (e.g., earthquakes, floods, and hurricanes). Starting from such data, we generated synthetic datasets with different features, such as different percentages of relevant posts and/or geotagged posts. Experiments performed on both real and synthetic datasets showed that SEDOM-DD is able to identify sub-events with high accuracy. For example, with 80% relevant posts and 15% geotagged posts, our method detects the sub-events and their areas with an accuracy of 85%, confirming the effectiveness of the proposed approach.
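
    A minimal sketch of the general idea (not SEDOM-DD itself; the fields, posts, and thresholds are illustrative): filter the posts considered relevant, then spatially cluster the geotagged ones, so that each dense cluster points to a candidate sub-event area.

        import numpy as np
        from sklearn.cluster import DBSCAN

        posts = [
            {"text": "building collapsed on Main St", "relevant": True, "lat": 40.71, "lon": -74.00},
            {"text": "strong smell of gas downtown", "relevant": True, "lat": 40.72, "lon": -74.01},
            {"text": "thoughts and prayers", "relevant": False, "lat": None, "lon": None},
        ]

        # Keep relevant, geotagged posts and cluster them by location.
        geo = np.array([(p["lat"], p["lon"]) for p in posts
                        if p["relevant"] and p["lat"] is not None])
        labels = DBSCAN(eps=0.02, min_samples=2).fit(geo).labels_
        print(labels)  # posts sharing a label form one candidate sub-event area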

    Evaluating data caching techniques in DMCF workflows using Hercules

    The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows on cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for all I/O operations, which means that the I/O performance of DMCF is limited by the performance of that storage. In this work we propose using the Hercules system within DMCF as an ad-hoc storage system for temporary data produced inside workflow-based applications. Hercules is a highly scalable, easy-to-deploy distributed in-memory storage system. The proposed solution takes advantage of the scalability of Hercules to avoid the bandwidth limits of the default storage. Early experimental results presented in this paper show promising performance, particularly for write operations, compared to the default storage services. This work is partially supported by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS), and by grant TIN2013-41350-P, Scalable Data Management Techniques for High-End Computing Systems, from the Spanish Ministry of Economy and Competitiveness.
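
    The design idea can be sketched as follows (hypothetical classes, not the actual DMCF or Hercules API): temporary, intermediate workflow data is routed to a fast in-memory store, while final results still go to the provider's default storage.

        class InMemoryStore:
            """Stand-in for an ad-hoc, in-memory system such as Hercules."""
            def __init__(self):
                self._data = {}
            def put(self, key, value):
                self._data[key] = value
            def get(self, key):
                return self._data[key]

        class CloudStore:
            """Stand-in for the cloud provider's default storage service."""
            def put(self, key, value):
                print(f"uploading {key} to default cloud storage")

        scratch, cloud = InMemoryStore(), CloudStore()

        def store_for(key):
            # Temporary data stays in memory; everything else is persisted.
            return scratch if key.startswith("tmp/") else cloud

        store_for("tmp/intermediate-0").put("tmp/intermediate-0", b"...")
        store_for("results/final.csv").put("results/final.csv", b"...")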

    Block size estimation for data partitioning in HPC applications using machine learning techniques

    The extensive use of HPC infrastructures and frameworks for running data-intensive applications has led to a growing interest in data partitioning techniques and strategies. In fact, finding an effective partitioning, i.e. a suitable size for data blocks, is a key strategy to speed up parallel data-intensive applications and increase scalability. This paper describes a methodology for data block size estimation in HPC applications that relies on supervised machine learning techniques. The implementation of the proposed methodology was evaluated using dislib as a testbed, a distributed computing library highly focused on machine learning algorithms and built on top of the PyCOMPSs framework. We assessed the effectiveness of our solution through an extensive experimental evaluation considering different algorithms, datasets, and infrastructures, including the MareNostrum 4 supercomputer. The results show that the methodology is able to efficiently determine a suitable way to split a given dataset, thus enabling the efficient execution of data-parallel applications in high-performance environments.
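
    A hedged sketch of the general approach (synthetic features and training data, not the paper's exact model): learn a regressor that maps execution features, such as dataset size and available cores, to a block size that performed well in past runs.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        # Features: [dataset size (GB), number of cores, algorithm id].
        X = np.array([[10, 48, 0], [100, 96, 0], [50, 48, 1], [200, 192, 1]])
        # Target: block size (MB) that gave the best execution time.
        y = np.array([128, 512, 256, 1024])

        model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        print(model.predict([[80, 96, 1]]))  # estimated block size for a new run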

    A Data-Aware Scheduling Strategy for Executing Large-Scale Distributed Workflows

    Task scheduling is a key component for the efficient execution of data-intensive applications in distributed environments, where many machines must be coordinated to reduce execution times and bandwidth consumption. This paper presents ADAGE, a data-aware scheduler designed to efficiently execute data-intensive workflows on large-scale computers. The proposed scheduler is based on three key features: (i) critical path analysis, for discovering the critical tasks of a workflow and reducing data transfers between nodes; (ii) work giving, a new dynamic planning strategy for migrating tasks from overloaded to unloaded nodes; and (iii) task replication, which executes task replicas on different nodes to improve both execution time and fault tolerance. Experiments performed on a distributed computing environment composed of up to 1,024 processing nodes show that ADAGE achieves better performance than existing scheduling systems, obtaining an average reduction of up to 66% in execution time.
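
    As an illustration of the first feature, critical path analysis (a toy Python example, not ADAGE's implementation; costs and task names are made up), the longest cumulative-cost path through the task DAG bounds the workflow makespan, so the tasks on it deserve scheduling priority:

        from functools import lru_cache

        cost = {"a": 3, "b": 2, "c": 5, "d": 1}                    # task execution costs
        succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}  # task successors

        @lru_cache(maxsize=None)
        def longest(task):
            # Cost of the most expensive path starting at `task`.
            return cost[task] + max((longest(s) for s in succ[task]), default=0)

        entry = max(succ, key=longest)
        print(entry, longest(entry))  # entry task of the critical path and its length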

