10 research outputs found

Leveraging Semantics to Improve Reproducibility in Scientific Workflows

Author: Corcho Oscar
Deelman Ewa
Ferreira da Silva Rafael
Pérez-Hernández María S
Rynge Mats
Santana-Perez Idafen
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2014

Reproducibility of published results is a cornerstone of scientific publishing and progress. The scientific community has therefore been encouraging authors and editors to publish their contributions in a verifiable and understandable way. Efforts such as the Reproducibility Initiative [1] and the Reproducibility Projects in the Biology [2] and Psychology [3] domains have been defining standards and patterns for assessing whether an experimental result is reproducible.

Archivo Digital UPM

A Semantic-Based Approach to Attain Reproducibility of Computational Environments in Scientific Workflows: A Case Study

Author: Corcho Oscar
Deelman E
Ferreira da Silva R
Pérez-Hernández M
Rynge M
Santana-Perez I
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2014

Reproducible research in scientific workflows is often addressed by tracking the provenance of the produced results. While this approach allows inspecting intermediate and final results, improves understanding, and permits replaying a workflow execution, it does not ensure that the computational environment is available for subsequent executions to reproduce the experiment. In this work, we propose describing the resources involved in the execution of an experiment using a set of semantic vocabularies, so as to conserve the computational environment. We define a process for documenting the workflow application, management system, and their dependencies based on 4 domain ontologies. We then conduct an experimental evaluation using a real workflow application on an academic and a public Cloud platform. Results show that our approach can reproduce an equivalent execution environment of a predefined virtual machine image on both computing platforms.

Archivo Digital UPM

Distributed storage optimization using multi-agent systems in Hadoop

Author: Abouchabaka Jaafar
Mahdaoui Rabie
Rafalia Najat
Sais Manar
Publication venue: EDP Sciences
Publication date: 01/01/2023

Understanding data and extracting information from it are the main objectives of data science, especially when it comes to big data. To achieve these goals, it is necessary to collect and process massive data sets that arrive at the system in different formats and at great velocity. The Big Data era has brought new challenges in data storage and management, and existing state-of-the-art data storage and processing tools are poised to meet these challenges while posing new ones for the next generation of data. Big Data storage optimization is essential for improving the overall efficiency of Big Data systems by maximizing the use of storage resources. It also reduces the energy consumption of Big Data systems, resulting in financial savings, environmental protection, and improved system performance. Hadoop provides a solution for storing and analysing large quantities of data. However, Hadoop can encounter storage management problems due to its distributed nature and the large volumes of data it manages. To meet future challenges, Hadoop needs to manage its storage intelligently. The use of a multi-agent system is a promising approach for efficiently managing hot and cold data in HDFS. Such systems offer a flexible, distributed solution for solving complex problems. This work proposes an approach based on a multi-agent system capable of gathering information on data access activity in the HDFS cluster. Using this information, it classifies data according to its temperature (hot or cold) and makes decisions about data replication based on that classification. In addition, it compresses unused data to manage resources efficiently and reduce storage space usage.

Directory of Open Access Journals

A Semantic-Based Approach to Attain Reproducibility of Computational Environments in Scientific Workflows: A Case Study

Author: Ewa Deelman
Idafen Santana-Perez
María S Pérez-Hernández
Oscar Corcho
Rafael Ferreira Da Silva
Publication date: 03/04/2020

Reproducible research in scientific workflows is often addressed by tracking the provenance of the produced results. While this approach allows inspecting intermediate and final results, improves understanding, and permits replaying a workflow execution, it does not ensure that the computational environment is available for subsequent executions to reproduce the experiment. In this work, we propose describing the resources involved in the execution of an experiment using a set of semantic vocabularies, so as to conserve the computational environment. We define a process for documenting the workflow application, management system, and their dependencies based on 4 domain ontologies. We then conduct an experimental evaluation using a real workflow application on an academic and a public Cloud platform. Results show that our approach can reproduce an equivalent execution environment of a predefined virtual machine image on both computing platforms.

CiteSeerX

Multiobjective Reliable Cloud Storage with Its Particle Swarm Optimization Algorithm

Author: Lei Fan
Liming Wang
Sha Meng
Xiyang Liu
Publication venue: Hindawi Limited
Publication date: 01/01/2016

Information abounds in all fields of real life; it is often recorded as digital data in computer systems and treated as an increasingly important resource. Its growing volume causes great difficulties in both storage and analysis. Massive data storage in cloud environments has a significant impact on the quality of service (QoS) of the systems, which is becoming an increasingly challenging problem. In this paper, we propose a multiobjective optimization model for reliable data storage in clouds that considers both the cost and the reliability of the storage service simultaneously. In the proposed model, the total cost is composed of storage space occupation cost, data migration cost, and communication cost. Based on an analysis of the storage process, transmission reliability, equipment stability, and software reliability are taken into account in the storage reliability evaluation. To solve the proposed multiobjective model, a Constrained Multiobjective Particle Swarm Optimization (CMPSO) algorithm is designed. Finally, experiments are designed to validate the proposed model and its PSO-based solution algorithm. In the experiments, the proposed model is tested in combination with 3 storage strategies. Experimental results show that the proposed model is effective. The experimental results also demonstrate that the proposed model can perform much better when combined with proper file splitting methods.

Crossref

Directory of Open Access Journals

Survey on Deduplication Techniques in Flash-Based Storage

Author: bowling
du
freudenbrger
ha
kilvansky
kim
kim
lee
li
li
li
mao
mao
meyer
peng
seagate
shiming
wei
xia
yim
zhang
zhang
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/05/2018

The importance of data deduplication is growing with the growth of data volumes. The domain of data deduplication is in active development. Recently it has been influenced by the appearance of the Solid State Drive. This new type of disk differs significantly from random access memory and hard disk drives and is now widely used. In this paper we propose a novel taxonomy which reflects the main issues related to deduplication in Solid State Drives. We present a survey of deduplication techniques focusing on flash-based storage. We also describe several open-source tools implementing data deduplication and briefly describe open research problems related to data deduplication in flash-based storage systems.

Crossref

Directory of Open Access Journals

Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

Author: Bo Mao
Clements A. T.
Debnath B.
Dong W.
El-Shimi A.
Guerra J.
Guo F.
Gupta D.
Himelstein M.
Hong Jiang
Koller R.
Kruus E.
Lei Tian
Lillibridge M.
Lillibridge M.
Meister D.
Meyer D. T.
Nath P.
Polte M.
Quinlan S.
Rhea S.
Srinivasan K.
Suzhen Wu
Ungureanu C.
Xia W.
Yang T.
Yinjin Fu
Zhu B.
Publication venue: Association for Computing Machinery (ACM)

Crossref

Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

Author: Fu Yinjin
Jiang Hong
Mao Bo
Tian Lei
Wu Suzhen
吴素贞
毛波
Publication venue: Association for Computing Machinery (ACM)

China National Science Foundation [61100033]; US NSF [NSF-CNS-1116606, NSF-CNS-1016609, NSF-IIS-0916859]; Scientific Research Foundation for the Returned Overseas Chinese Scholars; State Education Ministry; Huawei Innovation Research Program

Data deduplication has been demonstrated to be an effective technique in reducing the total data transferred over the network and the storage space in cloud backup, archiving, and primary storage systems, such as VM (virtual machine) platforms. However, the performance of restore operations from a deduplicated backup can be significantly lower than that without deduplication. The main reason lies in the fact that a file or block is split into multiple small data chunks that are often located in different disks after deduplication, which can cause a subsequent read operation to invoke many disk IOs involving multiple disks and thus degrade the read performance significantly. While this problem has been by and large ignored in the literature thus far, we argue that the time is ripe for us to pay significant attention to it in light of the emerging cloud storage applications and the increasing popularity of the VM platform in the cloud. This is because, in a cloud storage or VM environment, a simple read request on the client side may translate into a restore operation if the data to be read or a VM suspended by the user was previously deduplicated when written to the cloud or the VM storage server, a likely scenario considering the network bandwidth and storage capacity concerns in such an environment. To address this problem, in this article, we propose SAR, an SSD (solid-state drive)-Assisted Read scheme, that effectively exploits the high random-read performance properties of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs the unique data chunks with high reference count, small size, and nonsequential characteristics. In this way, many read requests to HDDs are replaced by read requests to SSDs, thus significantly improving the read performance of the deduplication-based storage systems in the cloud. The extensive trace-driven and VM restore evaluations on the prototype implementation of SAR show that SAR outperforms the traditional deduplication-based and flash-based cache schemes significantly, in terms of the average response times.

Xiamen University Institutional Repository