
    A Workflow-oriented Language for Scalable Data Analytics

    Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014), Porto (Portugal), August 27-28, 2014. Data in digital repositories are becoming increasingly massive and distributed; analyzing them therefore requires efficient data analysis techniques together with scalable storage and computing platforms. Cloud computing infrastructures offer effective support for addressing both the computational and the data storage needs of big data mining and parallel knowledge discovery applications. In fact, complex data mining tasks involve data- and compute-intensive algorithms that require large and efficient storage facilities together with high-performance processors to obtain results in acceptable times. In this paper we describe the Data Mining Cloud Framework (DMCF), designed for developing and executing distributed data analytics applications as workflows of services. We also describe a workflow-oriented language, called JS4Cloud, that supports the design and execution of script-based data analysis workflows on DMCF. Finally, we present a data analysis application developed with JS4Cloud and the scalability achieved by executing it on DMCF. The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, ’Network for Sustainable Ultrascale Computing (NESUS)’.
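
    As an illustration of the workflow-of-services idea described above, the following is a minimal Python sketch of a script-based data analysis workflow in which independent tasks run in parallel. It is purely illustrative: JS4Cloud itself is JavaScript-based, and the function names (partition, classify, vote, run_workflow) are hypothetical, not part of the DMCF or JS4Cloud API.

```python
# Minimal, hypothetical sketch of a service-based data analysis workflow
# (illustrative only; JS4Cloud is a JavaScript-based language and the
# names below are NOT the DMCF/JS4Cloud API).
from concurrent.futures import ThreadPoolExecutor

def partition(dataset):
    """Split the input dataset into chunks that can be mined in parallel."""
    return [f"{dataset}.part{i}" for i in range(4)]

def classify(chunk):
    """Placeholder for a data mining service applied to one chunk."""
    return f"model({chunk})"

def vote(models):
    """Placeholder for a service that combines partial models."""
    return f"ensemble({len(models)} models)"

def run_workflow(dataset):
    chunks = partition(dataset)                 # first task
    with ThreadPoolExecutor() as pool:          # data-parallel tasks
        models = list(pool.map(classify, chunks))
    return vote(models)                         # final aggregation task

print(run_workflow("census.csv"))
```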

    Peer-to-Peer Metadata Management for Knowledge Discovery Applications in Grids

    Computational Grids are powerful platforms gathering computational power and storage space from thousands of geographically distributed resources. The applications running on such platforms need to efficiently and reliably access the various and heterogeneous distributed resources they offer. This can be achieved by using metadata information describing all available resources. It is therefore crucial to provide efficient metadata management architectures and frameworks. In this paper we describe the design of a Grid metadata management service. We focus on a particular use case: the Knowledge Grid architecture, which provides high-level Grid services for distributed knowledge discovery applications. Taking advantage of an existing Grid data-sharing service, namely JuxMem, the proposed solution lies at the border between peer-to-peer systems and Web services.
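
    As a rough illustration of the metadata-driven resource access discussed above, the sketch below models resource metadata records and an attribute-based lookup in plain Python. The names (ResourceMetadata, MetadataRegistry) are hypothetical stand-ins and do not correspond to the JuxMem or Knowledge Grid APIs.

```python
# Hypothetical sketch of metadata records describing Grid resources and a
# simple attribute lookup; NOT the JuxMem or Knowledge Grid API.
from dataclasses import dataclass, field

@dataclass
class ResourceMetadata:
    resource_id: str
    kind: str                      # e.g. "dataset", "algorithm", "node"
    location: str                  # URI of the hosting peer
    attributes: dict = field(default_factory=dict)

class MetadataRegistry:
    """In-memory stand-in for a distributed metadata service."""
    def __init__(self):
        self._records = {}

    def publish(self, meta: ResourceMetadata):
        self._records[meta.resource_id] = meta

    def find(self, **criteria):
        """Return records whose attributes match all given criteria."""
        return [m for m in self._records.values()
                if all(m.attributes.get(k) == v for k, v in criteria.items())]

registry = MetadataRegistry()
registry.publish(ResourceMetadata("ds-42", "dataset", "peer://node7/ds-42",
                                  {"format": "arff", "domain": "biology"}))
print(registry.find(domain="biology"))
```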

    Using social media for sub-event detection during disasters

    Social media platforms have become fundamental tools for sharing information during natural disasters or catastrophic events. This paper presents SEDOM-DD (Sub-Events Detection on sOcial Media During Disasters), a new method that analyzes user posts to discover sub-events that occurred after a disaster (e.g., collapsed buildings, broken gas pipes, floods). SEDOM-DD has been evaluated with datasets of different sizes that contain real posts from social media related to different natural disasters (e.g., earthquakes, floods and hurricanes). Starting from such data, we generated synthetic datasets with different features, such as different percentages of relevant and/or geotagged posts. Experiments performed on both real and synthetic datasets showed that SEDOM-DD is able to identify sub-events with high accuracy. For example, with 80% relevant posts and 15% geotagged posts, our method detects the sub-events and their areas with an accuracy of 85%, demonstrating the effectiveness of the proposed approach.
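
    One plausible way to realize the sub-event detection step is sketched below under simplifying assumptions: posts classified as relevant and carrying coordinates are clustered spatially (here with scikit-learn's DBSCAN), and each cluster is treated as a candidate sub-event area. The sample posts and parameters are invented; this is not the SEDOM-DD implementation.

```python
# Sketch of a possible sub-event detection step: cluster geotagged posts
# that were classified as relevant into spatial groups (candidate sub-events).
# Illustrative only; not the SEDOM-DD implementation.
import numpy as np
from sklearn.cluster import DBSCAN

posts = [
    {"text": "building collapsed near the station", "lat": 40.01, "lon": 15.02, "relevant": True},
    {"text": "gas smell on main street",            "lat": 40.02, "lon": 15.01, "relevant": True},
    {"text": "thoughts and prayers",                "lat": 41.50, "lon": 16.00, "relevant": False},
    {"text": "street flooded downtown",             "lat": 40.30, "lon": 15.40, "relevant": True},
]

coords = np.array([[p["lat"], p["lon"]] for p in posts if p["relevant"]])
labels = DBSCAN(eps=0.05, min_samples=1).fit_predict(coords)  # small eps ~ a few km

for cluster_id in set(labels):
    members = coords[labels == cluster_id]
    print(f"sub-event {cluster_id}: centroid {members.mean(axis=0)}, {len(members)} posts")
```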

    Evaluating data caching techniques in DMCF workflows using Hercules

    The Data Mining Cloud Framework (DMCF) is an environment for designing and executing data analysis workflows on cloud platforms. Currently, DMCF relies on the default storage of the public cloud provider for all I/O operations, so the I/O performance of DMCF is limited by the performance of that storage. In this work we propose using the Hercules system within DMCF as an ad-hoc storage system for temporary data produced inside workflow-based applications. Hercules is a distributed in-memory storage system that is highly scalable and easy to deploy. The proposed solution takes advantage of the scalability of Hercules to avoid the bandwidth limits of the default storage. Early experimental results are presented in this paper; they show promising performance, particularly for write operations, compared to the default storage services. This work is partially supported by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS), and by grant TIN2013-41350-P, Scalable Data Management Techniques for High-End Computing Systems, from the Spanish Ministry of Economy and Competitiveness.
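
    The idea of an ad-hoc store for temporary data can be sketched as follows: intermediate results produced between workflow tasks are kept in a fast in-memory store, and only final outputs are written to the provider's default storage. The classes and methods below (InMemoryStore, DefaultCloudStorage, run_task) are hypothetical stand-ins, not the DMCF or Hercules API.

```python
# Illustrative sketch (not the DMCF/Hercules API): temporary data produced
# between workflow tasks goes to a fast in-memory store, while only final
# results are written to the provider's default (persistent) storage.
class InMemoryStore:
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

class DefaultCloudStorage:
    def put(self, key, value):
        print(f"persisting {key} to cloud storage")  # stand-in for a real upload

def run_task(task, inputs, scratch):
    output = f"{task}({', '.join(inputs)})"
    scratch.put(task, output)          # intermediate result: in-memory only
    return output

scratch, persistent = InMemoryStore(), DefaultCloudStorage()
run_task("filter", ["raw.csv"], scratch)
run_task("train", [scratch.get("filter")], scratch)
persistent.put("model.bin", scratch.get("train"))  # only the final output hits cloud storage
```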

    A Data-Aware Scheduling Strategy for Executing Large-Scale Distributed Workflows

    Task scheduling is a crucial component for the efficient execution of data-intensive applications on distributed environments, in which many machines must be coordinated to reduce execution times and bandwidth consumption. This paper presents ADAGE, a data-aware scheduler designed to efficiently execute data-intensive workflows in large-scale distributed environments. The proposed scheduler is based on three key features: i) critical path analysis, for discovering the critical tasks of a workflow and reducing data transfers between nodes; ii) work giving, a new dynamic planning strategy for migrating tasks from overloaded to unloaded nodes; and iii) task replication, which executes task replicas on different nodes to improve both execution time and fault tolerance. Experiments performed on a distributed computing environment composed of up to 1,024 processing nodes show that ADAGE achieves better performance than existing scheduling systems, obtaining an average reduction of up to 66% in execution time.
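
    To make the critical path analysis concrete, here is a minimal sketch that computes the longest dependency chain of a small workflow DAG from estimated task durations. Task names and durations are invented, and the code is not the ADAGE implementation.

```python
# Minimal sketch of critical path analysis on a workflow DAG (longest path by
# estimated task duration); illustrative only, not the ADAGE implementation.
from functools import lru_cache

duration = {"read": 2, "clean": 3, "train_a": 7, "train_b": 4, "merge": 1}
deps = {"read": [], "clean": ["read"], "train_a": ["clean"],
        "train_b": ["clean"], "merge": ["train_a", "train_b"]}

@lru_cache(maxsize=None)
def earliest_finish(task):
    """Length of the longest dependency chain ending at `task`."""
    start = max((earliest_finish(d) for d in deps[task]), default=0)
    return start + duration[task]

def critical_path(sink):
    """Walk back from the sink, always following the slowest predecessor."""
    path = [sink]
    while deps[path[-1]]:
        path.append(max(deps[path[-1]], key=earliest_finish))
    return list(reversed(path))

print(critical_path("merge"), "finishes at", earliest_finish("merge"))
```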

    Block size estimation for data partitioning in HPC applications using machine learning techniques

    The extensive use of HPC infrastructures and frameworks for running data-intensive applications has led to a growing interest in data partitioning techniques and strategies. In fact, finding an effective partitioning, i.e. a suitable size for data blocks, is a key strategy for speeding up parallel data-intensive applications and increasing scalability. This paper describes a methodology for data block size estimation in HPC applications, which relies on supervised machine learning techniques. The implementation of the proposed methodology was evaluated using dislib, a distributed computing library focused on machine learning algorithms and built on top of the PyCOMPSs framework, as a testbed. We assessed the effectiveness of our solution through an extensive experimental evaluation considering different algorithms, datasets, and infrastructures, including the MareNostrum 4 supercomputer. The results show that the methodology is able to efficiently determine a suitable way to split a given dataset, thus enabling the efficient execution of data-parallel applications in high-performance environments.
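
    The core of such a methodology, learning a mapping from dataset and execution features to a suitable block size, can be sketched with a generic scikit-learn regressor as shown below. The feature set, the synthetic training data, and the model choice are illustrative assumptions, not those used in the paper.

```python
# Sketch of the general idea: learn a mapping from dataset and execution
# features to a suitable block size with a supervised regressor. Features,
# model, and training data are purely illustrative (synthetic).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# features: [dataset size (GB), n_features, n_workers, cores per worker]
X = rng.uniform([1, 10, 2, 4], [500, 1000, 64, 48], size=(200, 4))
# synthetic "good" block size (MB): grows with data size, shrinks with parallelism
y = 64 + X[:, 0] * 2 / np.sqrt(X[:, 2] * X[:, 3])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("suggested block size (MB):", model.predict([[120, 300, 16, 24]])[0])
```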

    Determinants of SARS-CoV-2 Contagiousness in Household Contacts of Symptomatic Adult Index Cases

    BACKGROUND: Identifying determinants of transmission of the novel severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) in settings of contagion is fundamental to inform containment strategies. We assessed the SARS-CoV-2 cycle threshold value (Ct) from the first diagnostic nasopharyngeal swab of symptomatic index cases and which demographic or clinical characteristics of cases and contacts are associated with transmission risk within households. METHODS: This is a retrospective prevalence study on secondary SARS-CoV-2 cases (SC) among the household contacts of symptomatic adult index cases, randomly sampled from all the SARS-CoV-2-positive diagnostic nasopharyngeal swabs analyzed at our regional referral hospital (Amedeo di Savoia Hospital, Turin, Italy) in March 2020. Index cases underwent a telephone survey to collect their demographic and clinical data and those of all their household contacts. The Ct value of the RdRp gene from the first diagnostic swab of index cases was recorded, and index cases were grouped according to Ct tertiles (A < first tertile, first ≀ B ≀ second tertile, C ≄ second tertile). Post hoc analysis was performed on SC as well as on contacts that did not undergo SARS-CoV-2 testing but developed compatible signs and symptoms. Non-parametric tests and generalized linear models were run. RESULTS: Index (n = 72) and contact (n = 164) median age was 54 (48–63) and 32 (20–56) years, respectively. A total of 60, 50, and 54 subjects were contacts of group A, B, and C index cases, respectively; 35.9% of contacts were SC. Twenty-four further subjects (14.6%) met the criteria for symptom-based likely positive SC. The secondary attack rate was 36.0% (28.6–43.4), assuming a mean incubation period of 5 days and a maximum infectious period of 20 days. SC prevalence differed between Ct groups (53.3% A, 32.0% B, 20.4% C; p < 0.001). No difference in SC was found according to sex, presence of signs/symptoms, or COVID-19 severity of index cases, or according to contacts’ sex and number per household. The age of both index cases [aOR 4.52 (1.2–17.0) for 60 vs. ≀45 years old] and contacts [aOR 3.66 (1.3–10.6) for 60 vs. ≀45 years old], as well as the Ct of the index [aOR 0.17 (0.07–0.4) for Ct ≄ 31.8 vs. Ct < 24.4], were independently associated with SC risk. A sensitivity analysis including symptom-based likely positive SC supported all the previous results. CONCLUSION: In confined transmission settings such as households, PCR Ct values may inform on the contagiousness of infected subjects, and age may modulate transmission/contagion risk.
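
    For readers unfamiliar with the grouping used above, the following sketch shows how index cases could be assigned to Ct tertile groups and how a crude secondary attack rate per group could be computed. All numbers are synthetic and do not reproduce the study data or its statistical models.

```python
# Small illustrative sketch (synthetic numbers, not the study data): group
# index cases by Ct tertiles and compute a crude secondary attack rate among
# their household contacts.
import numpy as np

ct_values = np.array([18.5, 22.0, 25.3, 27.1, 30.2, 33.8, 35.0, 24.0, 29.5])
t1, t2 = np.quantile(ct_values, [1 / 3, 2 / 3])      # tertile cut points

def ct_group(ct):
    """A = below first tertile, C = at or above second tertile, B = in between."""
    return "A" if ct < t1 else ("B" if ct < t2 else "C")

contacts = [  # (index Ct, n. household contacts, n. secondary cases)
    (18.5, 3, 2), (22.0, 2, 1), (27.1, 4, 1), (33.8, 3, 0), (35.0, 2, 0),
]

for group in "ABC":
    rows = [(n, sc) for ct, n, sc in contacts if ct_group(ct) == group]
    n_contacts = sum(n for n, _ in rows)
    n_secondary = sum(sc for _, sc in rows)
    rate = n_secondary / n_contacts if n_contacts else float("nan")
    print(f"group {group}: secondary attack rate = {rate:.0%} ({n_secondary}/{n_contacts})")
```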

    A Data-Aware Scheduling Strategy for DMCF workflows over Hercules

    Proceedings of: Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016), Sofia (Bulgaria), October 6-7, 2016. As data-intensive scientific applications become increasingly prevalent, there is a need to simplify the development, deployment, and execution of complex data analysis applications. The Data Mining Cloud Framework (DMCF) is a service-oriented system that allows users to design and execute data analysis applications, defined as workflows, on cloud platforms, relying on cloud-provided storage services for I/O operations. Hercules is an in-memory I/O solution that can be deployed as an alternative to cloud storage services, providing additional performance and flexibility features. This work extends the DMCF-Hercules cooperation by applying novel data placement and task scheduling techniques to expose and exploit data locality in data-intensive workflows. This work is partially supported by the EU under the COST Program Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).
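
    The data-locality principle behind such scheduling can be sketched with a simple placement rule: assign each task to the node that already holds the largest share of its input data, breaking ties in favor of the least loaded node. The code below is an illustrative assumption, not the scheduler used with DMCF and Hercules.

```python
# Minimal sketch of a data-locality-aware placement rule (illustrative, not
# the DMCF/Hercules scheduler): pick the node that already caches the most
# input data, then prefer the least loaded node on ties.
node_data = {            # bytes of each input currently cached per node
    "node1": {"a.csv": 500, "b.csv": 0},
    "node2": {"a.csv": 100, "b.csv": 800},
}
node_load = {"node1": 2, "node2": 1}   # tasks already queued per node

def place(task_inputs):
    def locality(node):
        return sum(node_data[node].get(f, 0) for f in task_inputs)
    best = max(node_data, key=lambda n: (locality(n), -node_load[n]))
    node_load[best] += 1
    return best

print(place(["a.csv"]))   # -> node1 (most of a.csv is already there)
print(place(["b.csv"]))   # -> node2
```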